August 9, 2025

UTF Unknown

Detect the character set of files, streams, and other byte sequences.

Detection of character sets with a simple and redesigned interface.

This package is based on Ude and, since version 2, also on uchardet, both of which are ports of the Mozilla Universal Charset Detector.

The interface and other classes have been redesigned so that the library is easier to use and follows a better object-oriented design (OOD). Unit tests and CI have been added.

Features:

  • New API
  • Moved to .NET Standard
  • Added more unit tests
  • Builds on CI (GitHub Actions)
  • Strong named
  • Documentation added
  • Multiple bugs from Ude fixed

Supported Platforms

  • .NET 6 (Will be dropped in the future)
  • .NET 8
  • .NET Standard 2.0

Remarks: You can still register your own EncodingProvider so that the Encoding.GetEncoding(...) method looks in it first when resolving an encoding.
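
As an illustration of the remark above, the following is a minimal sketch of registering a provider with the standard .NET API. It uses CodePagesEncodingProvider as the example provider; the class name EncodingSetup is hypothetical, and whether your application needs this step depends on the target platform.

```csharp
using System.Text;

public static class EncodingSetup
{
    // Call once at startup, before any Encoding.GetEncoding(...) lookup.
    public static void RegisterCodePages()
    {
        // CodePagesEncodingProvider is part of the shared framework on .NET 6 and .NET 8.
        // On .NET Standard 2.0 it comes from the System.Text.Encoding.CodePages package.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Example lookup for a code page that is not built in on .NET (Core).
    public static Encoding GetCyrillic()
    {
        RegisterCodePages();
        return Encoding.GetEncoding("windows-1251");
    }
}
```

Registering the same provider more than once is harmless, so the call can safely sit in a shared initialization path.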

Usage

Use the static DetectFromX methods of CharsetDetector.

Synchronous Methods

using System.Collections.Generic;
using System.Text;
using UtfUnknown;

// Detect from File
DetectionResult result = CharsetDetector.DetectFromFile("path/to/file.txt"); // or pass FileInfo

// Detect from Stream
result = CharsetDetector.DetectFromStream(stream);

// Detect from bytes
result = CharsetDetector.DetectFromBytes(byteArray);

// Get the best Detection
DetectionDetail resultDetected = result.Detected;

// Get the alias of the found encoding
string encodingName = resultDetected.EncodingName;

// Get the System.Text.Encoding of the found encoding (can be null if not available)
Encoding encoding = resultDetected.Encoding;

// Get the confidence of the found encoding (between 0 and 1)
float confidence = resultDetected.Confidence;

// Get all the details of the result
IList<DetectionDetail> allDetails = result.Details;
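
Because the detected Encoding can be null and the confidence can be low, callers often want a fallback. The sketch below shows one way to do that with a plain helper; DetectionFallback.Choose and the threshold value are hypothetical and not part of the library, and the commented usage line assumes that Detected may itself be null when nothing was recognized.

```csharp
using System.Text;

public static class DetectionFallback
{
    // 'detected' and 'confidence' are meant to come from
    // DetectionDetail.Encoding and DetectionDetail.Confidence.
    public static Encoding Choose(Encoding detected, float confidence,
                                  float minConfidence, Encoding fallback)
    {
        // Fall back when no Encoding is available or the detector is not sure enough.
        if (detected == null || confidence < minConfidence)
        {
            return fallback;
        }
        return detected;
    }
}

// Possible usage with a DetectionResult named 'result' (assumption: Detected may be null):
// Encoding enc = DetectionFallback.Choose(
//     result.Detected?.Encoding, result.Detected?.Confidence ?? 0f, 0.5f, Encoding.UTF8);
```

The 0.5 threshold is only a placeholder; a suitable value depends on the input data.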

Asynchronous Methods

// Detect from File asynchronously
DetectionResult result = await CharsetDetector.DetectFromFileAsync("path/to/file.txt", cancellationToken); // or pass FileInfo

// Detect from Stream asynchronously
result = await CharsetDetector.DetectFromStreamAsync(stream, cancellationToken);
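
Since the asynchronous methods accept a CancellationToken, a time limit can be applied from the calling side. The following sketch wraps any token-accepting operation in a timeout; DetectionTimeout.RunAsync is a hypothetical helper, not part of the library, and how promptly detection stops depends on how the library observes the token.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class DetectionTimeout
{
    // Runs 'detect' with a token that is cancelled automatically after 'timeout'.
    public static async Task<T> RunAsync<T>(
        Func<CancellationToken, Task<T>> detect, TimeSpan timeout)
    {
        using (var cts = new CancellationTokenSource(timeout))
        {
            // Awaiting inside the using block keeps the token source alive
            // until the operation has finished.
            return await detect(cts.Token).ConfigureAwait(false);
        }
    }
}

// Possible usage:
// DetectionResult result = await DetectionTimeout.RunAsync(
//     ct => CharsetDetector.DetectFromFileAsync("path/to/file.txt", ct),
//     TimeSpan.FromSeconds(5));
```

If the operation honors the token, an expired timeout typically surfaces as an OperationCanceledException that the caller can catch.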

Docs

The article "A composite approach to language/encoding detection" describes the charset detection algorithms implemented by the library.

The following charsets are supported:

Encodings with BOM: utf-7, utf-8, utf-16be/utf-16le, utf-32be/utf-32le, X-ISO-10646-UCS-4-34121/X-ISO-10646-UCS-4-21431, gb18030.
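
To make the idea of BOM-based detection concrete, here is a simplified sketch that recognizes a subset of the byte order marks listed above. It is not the library's implementation: BomSniffer is a hypothetical name, and utf-7 and the two X-ISO-10646-UCS-4 byte orders are deliberately left out.

```csharp
public static class BomSniffer
{
    // Returns the charset name implied by a leading BOM, or null if none matches.
    public static string FromBom(byte[] bytes)
    {
        if (bytes == null) return null;

        // Check 4-byte marks before 2-byte marks:
        // FF FE 00 00 (utf-32le) also starts with FF FE (utf-16le).
        if (StartsWith(bytes, 0x00, 0x00, 0xFE, 0xFF)) return "utf-32be";
        if (StartsWith(bytes, 0xFF, 0xFE, 0x00, 0x00)) return "utf-32le";
        if (StartsWith(bytes, 0x84, 0x31, 0x95, 0x33)) return "gb18030";
        if (StartsWith(bytes, 0xEF, 0xBB, 0xBF)) return "utf-8";
        if (StartsWith(bytes, 0xFE, 0xFF)) return "utf-16be";
        if (StartsWith(bytes, 0xFF, 0xFE)) return "utf-16le";
        return null;
    }

    private static bool StartsWith(byte[] data, params byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

The ordering of the checks matters more than the checks themselves: testing the 2-byte marks first would misreport every utf-32le file as utf-16le.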

Encodings without BOM are presented in the table, separated by languages:

| Language | Encodings |
| --- | --- |
| International (Unicode) | utf-8 |
| Arabic | iso-8859-6, windows-1256 |
| Bulgarian | iso-8859-5, windows-1251 |
| Chinese | iso-2022-cn, big5, euc-tw, gb18030, hz-gb-2312 |
| Croatian | iso-8859-2, iso-8859-13, iso-8859-16, windows-1250, ibm852, x-mac-ce |
| Czech | windows-1250, iso-8859-2, ibm852, x-mac-ce |
| Danish | iso-8859-1, iso-8859-15, windows-1252 |
| English | ascii |
| Esperanto | iso-8859-3 |
| Estonian | iso-8859-4, iso-8859-13, windows-1252, windows-1257 |
| Finnish | iso-8859-1, iso-8859-4, iso-8859-9, iso-8859-13, iso-8859-15, windows-1252 |
| French | iso-8859-1, iso-8859-15, windows-1252 |
| German | iso-8859-1, windows-1252, CP 850/IBM 00850 |
| Greek | iso-8859-7, windows-1253 |
| Hebrew | iso-8859-8, windows-1255 |
| Hungarian | iso-8859-2, windows-1250 |
| Irish Gaelic | iso-8859-1, iso-8859-9, iso-8859-15, windows-1252 |
| Italian | iso-8859-1, iso-8859-3, iso-8859-9, iso-8859-15, windows-1252 |
| Japanese | iso-2022-jp, shift-jis, euc-jp |
| Korean | iso-2022-kr, euc-kr/uhc, cp949 |
| Lithuanian | iso-8859-4, iso-8859-10, iso-8859-13 |
| Latvian | iso-8859-4, iso-8859-10, iso-8859-13 |
| Maltese | iso-8859-3 |
| Polish | iso-8859-2, iso-8859-13, iso-8859-16, windows-1250, ibm852, x-mac-ce |
| Portuguese | iso-8859-1, iso-8859-9, iso-8859-15, windows-1252 |
| Romanian | iso-8859-2, iso-8859-16, windows-1250, ibm852 |
| Russian | iso-8859-5, koi8-r, windows-1251, x-mac-cyrillic, ibm855, ibm866 |
| Slovak | windows-1250, iso-8859-2, ibm852, x-mac-ce |
| Slovene | iso-8859-2, iso-8859-16, windows-1250, ibm852, x-mac-ce |
| Spanish | iso-8859-1, iso-8859-15, windows-1252 |
| Swedish | iso-8859-1, iso-8859-4, iso-8859-9, iso-8859-15, windows-1252 |
| Thai | tis-620, iso-8859-11 |
| Turkish | iso-8859-3, iso-8859-9 |
| Vietnamese | viscii, windows-1258 |
| Others | windows-1252 |

Remarks: Some encoding aliases are not available as a System.Text.Encoding: cp949, iso-2022-cn, euc-tw, iso-8859-10, iso-8859-16, viscii, X-ISO-10646-UCS-4-34121/X-ISO-10646-UCS-4-21431. For some of them, DetectionDetail.Encoding returns a suitable replacement:

  • cp949: use ks_c_5601-1987
  • iso-2022-cn: use x-cp50227
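
When working with the reported alias directly, the same substitutions can be applied before calling Encoding.GetEncoding. The sketch below shows that idea under stated assumptions: AliasResolver is a hypothetical helper, the mapping contains only the two replacements listed above, and whether the replacement names resolve depends on the encoding providers registered on your platform.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class AliasResolver
{
    // Only the two replacements named in the remarks above.
    private static readonly Dictionary<string, string> Replacements =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "cp949", "ks_c_5601-1987" },
            { "iso-2022-cn", "x-cp50227" },
        };

    // Maps an alias to its replacement name, or returns it unchanged.
    public static string Canonical(string alias)
    {
        string replacement;
        return Replacements.TryGetValue(alias, out replacement) ? replacement : alias;
    }

    // Returns the Encoding for the alias, or null if the platform does not know it.
    public static Encoding TryResolve(string alias)
    {
        try
        {
            return Encoding.GetEncoding(Canonical(alias));
        }
        catch (ArgumentException)
        {
            return null; // unknown encoding name
        }
        catch (NotSupportedException)
        {
            return null; // known name, but not supported on this platform
        }
    }
}
```

Returning null instead of throwing mirrors the way DetectionDetail.Encoding is described as possibly null when no Encoding is available.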

License

The library is subject to the Mozilla Public License Version 1.1 (the "License"). Alternatively, it may be used under the terms of either the GNU General Public License Version 2 or later (the "GPL"), or the GNU Lesser General Public License Version 2.1 or later (the "LGPL").

Test data has been extracted from Wikipedia and Project Gutenberg books and is subject to their licenses.