Detect the encoding of a file

October 8, 2021

Well this is a doozy, and bound to require constant upkeep.

I always start off reading files like this...

using (var sr = new StreamReader(fileName))

Then someone complains that their non-ASCII files weren't read correctly; I ask for example files, perform some tests, and end up with this:

using (var sr = new StreamReader(fileName, System.Text.Encoding.UTF8))

It works for a while, then I receive more complaints and more test files. I see that it doesn't work for the new test files. Hmm.

I studiously avoid thinking about "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

I can open the file in Notepad++ and see, under the conveniently named "Encoding" menu, exactly which encoding Notepad++ decides on for a given file.

I've learned that some files have byte order marks that give pretty important clues about the encoding.

Here's my re-creation of the table from Wikipedia: Byte order marks by encoding

| Encoding | Representation (hexadecimal) | Representation (decimal) | Might look like... |
| --- | --- | --- | --- |
| UTF-8 | EF BB BF | 239 187 191 | |
| UTF-16 (BE) | FE FF | 254 255 | þÿ |
| UTF-16 (LE) | FF FE | 255 254 | ÿþ |
| UTF-32 (BE) | 00 00 FE FF | 0 0 254 255 | NUL NUL þÿ (where NUL means the NULL character) |
| UTF-32 (LE) | FF FE 00 00 | 255 254 0 0 | ÿþ NUL NUL |
| UTF-7 | 2B 2F 76 | 43 47 118 | +/v |
| UTF-1 | F7 64 4C | 247 100 76 | ÷dL |
| UTF-EBCDIC | DD 73 66 73 | 221 115 102 115 | Ýsfs |
| SCSU | 0E FE FF | 14 254 255 | ^Nþÿ (where ^N is the shift out character) |
| BOCU-1 | FB EE 28 | 251 238 40 | ûî( |
| GB-18030 | 84 31 95 33 | 132 49 149 51 | „1•3 |
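To see one of these signatures for yourself, here's a small sketch (the temp-file setup is purely for illustration) that writes a UTF-8 file with a BOM and dumps its first few bytes in both hex and decimal, matching the table's two representation columns:

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

class BomDump
{
	static void Main()
	{
		// Illustrative setup: write a small UTF-8 file that includes a BOM.
		var path = Path.GetTempFileName();
		File.WriteAllText(path, "hello", new UTF8Encoding(encoderShouldEmitUTF8Identifier: true));

		// Read the first four bytes, enough to hold most BOMs.
		var bytes = new byte[4];
		int read;
		using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
		{
			read = fs.Read(bytes, 0, 4);
		}

		// The first three bytes should be the UTF-8 BOM: EF BB BF (239 187 191).
		Console.WriteLine(string.Join(" ", bytes.Take(read).Select(b => b.ToString("X2"))));
		Console.WriteLine(string.Join(" ", bytes.Take(read).Select(b => b.ToString())));
	}
}
```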

Wikipedia points out that for UTF-8, UTF-7, UTF-1, UTF-EBCDIC, SCSU, BOCU-1 and GB-18030, these leading bytes are not literally a "byte order mark": those encodings use single-byte code units, so there is no byte order to record. Instead, in those cases the bytes simply act as a signature identifying the encoding.
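.NET exposes these signatures directly: every `Encoding` has a `GetPreamble()` method that returns the bytes it writes at the start of a file. A quick sanity check against the table above (note that `Encoding.UTF32` is the little-endian variant):

```csharp
using System;
using System.Linq;
using System.Text;

class PreambleCheck
{
	static void Main()
	{
		// GetPreamble() returns the BOM each encoding writes; compare with the table rows.
		Console.WriteLine(string.Join(" ", Encoding.UTF8.GetPreamble().Select(b => b.ToString("X2"))));             // EF BB BF
		Console.WriteLine(string.Join(" ", Encoding.BigEndianUnicode.GetPreamble().Select(b => b.ToString("X2")))); // FE FF
		Console.WriteLine(string.Join(" ", Encoding.Unicode.GetPreamble().Select(b => b.ToString("X2"))));          // FF FE
		Console.WriteLine(string.Join(" ", Encoding.UTF32.GetPreamble().Select(b => b.ToString("X2"))));            // FF FE 00 00 (little-endian)
	}
}
```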

(But this doesn't help if no byte order mark is present.)

On Stack Overflow I found an answer, with a lot of upvotes, which almost works for me. It relies on the presence of a BOM as listed above and, failing that, returns a default.

The only change I had to make is highlighted below:

/// <summary>
/// Determines a text file's encoding by analyzing its byte order mark (BOM).
/// Falls back to the system default encoding when no BOM is found.
/// </summary>
/// <param name="filename">The text file to analyze.</param>
/// <returns>The detected encoding.</returns>
private static Encoding GetEncoding(string filename)
{
	// Read the first four bytes, enough to hold any of the BOMs checked below
	var bom = new byte[4];
	int read;
	using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
	{
		read = file.Read(bom, 0, 4);
	}

	// Analyze the BOM. UTF-32 LE must be tested before UTF-16 LE,
	// because its BOM (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
	if (read >= 3 && bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76) return Encoding.UTF7;
	if (read >= 3 && bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf) return Encoding.UTF8;
	if (read >= 4 && bom[0] == 0xff && bom[1] == 0xfe && bom[2] == 0 && bom[3] == 0) return Encoding.UTF32; // UTF-32LE (Encoding.UTF32 is little-endian)
	if (read >= 2 && bom[0] == 0xff && bom[1] == 0xfe) return Encoding.Unicode; // UTF-16LE
	if (read >= 2 && bom[0] == 0xfe && bom[1] == 0xff) return Encoding.BigEndianUnicode; // UTF-16BE
	if (read >= 4 && bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff) return new UTF32Encoding(bigEndian: true, byteOrderMark: true); // UTF-32BE
	return Encoding.Default; // **Changed this line**
}

And use it thus:

var encoding = GetEncoding(fileName);
using (var sr = new StreamReader(fileName, encoding)) // System.Text.Encoding.UTF8))
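It's worth knowing that `StreamReader` can do BOM detection itself: its constructors take a `detectEncodingFromByteOrderMarks` flag (true by default), and the encoding you pass in acts only as the fallback when no BOM is found. A sketch, assuming a `fileName` variable as above:

```csharp
using (var sr = new StreamReader(fileName, Encoding.Default, detectEncodingFromByteOrderMarks: true))
{
	var text = sr.ReadToEnd();
	// CurrentEncoding is only meaningful after the first read,
	// once the reader has consumed (or failed to find) a BOM.
	Console.WriteLine(sr.CurrentEncoding.EncodingName);
}
```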

I am certain this will require further changes in future.

Bonus West Wind Version

Note that Rick Strahl has blogged a version of this here: Detecting Text Encoding for StreamReader

His work is always battle-tested.
