
Signature
- Nicholas Paldino [.NET/C# MVP]
- mvp@spam.guard.caspershouse.com
>I am at an absolute loss on what is going on here. I have a text file with
>some Spanish writing. Some of the characters have accents. I have not
[quoted text clipped - 11 lines]
> What am I missing?
> Amy
> Amy,
>
> Well, it's possible that you are reading the file correctly from
> UTF-8, but the font for the console doesn't support those characters.
> What is the font that you are using and does it support those characters?
In testing I decided to print each char to the screen along with its
byte value. The code is merely a (int)c where c is a char.
When using StreamReader with Encoding.UTF8 the ñ gets displayed as a ?
with a code of 65535
When using StreamReader with Encoding.Default the ñ gets displayed as a
ñ with a code of 241
When using FileStream with no encoding (don't believe you can set it)
and then printing the characters of the bytes, ñ gets displayed as a ñ
with a code of 241.
When attempting to convert the byte array returned from the FileStream
to a String in UTF8 via the code below, the string does not convert
properly (I get the ? for the accented characters).
UTF8Encoding temp = new UTF8Encoding( true );
Console.WriteLine( temp.GetString( b ) );
However, if I do
Console.WriteLine( System.Text.Encoding.Default.GetString( b ) );
It prints correctly.
I have read that using "Encoding.Default" is not good - however it seems
to be the only thing that works. I know the characters are for the most
part being read in correctly especially with FileStream. It just seems
like I am lost on what to do about the encoding of them.
Thoughts?
Darrell
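
[Editorial sketch] The two GetString calls above can be reproduced without the original file. This is a minimal sketch under assumptions: the byte array is made up to hold "año" as a single-byte ANSI file would, and ISO-8859-1 stands in for whatever Encoding.Default was on the poster's machine. The class and method names are hypothetical. It also shows that the bool passed to the UTF8Encoding constructor only controls whether a byte order mark is emitted when writing, so it has no effect on decoding.

```csharp
using System;
using System.Text;

public static class DecodeComparison
{
    // Assumption: ISO-8859-1 stands in for the poster's Encoding.Default.
    public static readonly Encoding Latin1 = Encoding.GetEncoding("iso-8859-1");

    // Made-up sample: the bytes a Latin-1 file would hold for "año" (61 F1 6F).
    public static readonly byte[] Sample = { 0x61, 0xF1, 0x6F };

    // Decodes as UTF-8. The constructor flag only affects GetPreamble()
    // (the BOM written on output); it does not change how bytes are decoded.
    public static string DecodeUtf8(byte[] bytes, bool emitIdentifier)
    {
        return new UTF8Encoding(emitIdentifier).GetString(bytes);
    }

    // Decodes as Latin-1, where every byte maps to the code point of the same value.
    public static string DecodeLatin1(byte[] bytes)
    {
        return Latin1.GetString(bytes);
    }

    public static void Main()
    {
        // Print code points rather than glyphs, so the console font does not matter.
        foreach (char c in DecodeLatin1(Sample))
            Console.WriteLine("Latin-1: U+{0:X4}", (int)c);
        foreach (char c in DecodeUtf8(Sample, true))
            Console.WriteLine("UTF-8:   U+{0:X4}", (int)c);
    }
}
```

Printing the numeric code of each char, as the post already does, is the more reliable diagnostic because it takes the console font out of the picture.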
Jon Skeet [C# MVP] - 12 Oct 2007 07:53 GMT
<snip>
> However, if I do
> Console.WriteLine( System.Text.Encoding.Default.GetString( b ) );
[quoted text clipped - 5 lines]
> part being read in correctly especially with FileStream. It just seems
> like I am lost on what to do about the encoding of them.
*Characters* are not read at all by a FileStream. Bytes are read by a
FileStream. An Encoding is the way of converting between bytes and
characters.
If your file is effectively encoded using Encoding.Default, that's what
you should use. It would be generally better if you were able to start
with a UTF-8 file, but if you can't control whatever produces the file,
then you need to follow its lead.
Picking an encoding is a bit like picking an image format - you might
prefer PNG to BMP, but if someone gives you a BMP file and you try to
read it as if it were a PNG, you won't get the right picture.
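
[Editorial sketch] The bytes-versus-characters point can be illustrated as follows. This is only a sketch under assumptions: the byte arrays are made-up samples, a MemoryStream stands in for the FileStream, ISO-8859-1 stands in for Encoding.Default, and the names ReaderDemo, ReadText and IsWellFormedUtf8 are hypothetical. A UTF8Encoding constructed with throwOnInvalidBytes set to true reports a mismatch as an exception instead of silently substituting a character.

```csharp
using System;
using System.IO;
using System.Text;

public static class ReaderDemo
{
    // The stream only supplies bytes; the Encoding passed to StreamReader
    // decides which characters those bytes become.
    public static string ReadText(Stream stream, Encoding encoding)
    {
        using (var reader = new StreamReader(stream, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    // Returns true if the bytes are well-formed UTF-8. The strict encoding
    // throws DecoderFallbackException rather than substituting a character.
    public static bool IsWellFormedUtf8(byte[] bytes)
    {
        var strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        byte[] latin1Bytes = { 0x61, 0xF1, 0x6F };       // "año" in ISO-8859-1
        byte[] utf8Bytes = { 0x61, 0xC3, 0xB1, 0x6F };   // "año" in UTF-8
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        Console.WriteLine("Latin-1 bytes are valid UTF-8: " + IsWellFormedUtf8(latin1Bytes));
        Console.WriteLine("UTF-8 bytes are valid UTF-8:   " + IsWellFormedUtf8(utf8Bytes));

        // Different bytes, same text, once each is read with the encoding it was written in.
        string fromLatin1 = ReadText(new MemoryStream(latin1Bytes), latin1);
        string fromUtf8 = ReadText(new MemoryStream(utf8Bytes), Encoding.UTF8);
        Console.WriteLine("Same text after decoding: " + (fromLatin1 == fromUtf8));
    }
}
```

A strict check like this can confirm that a file is not UTF-8, but it cannot say which single-byte code page was used; that still has to come from whoever produced the file.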

Signature
Jon Skeet - <skeet@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Christof Nordiek - 12 Oct 2007 09:04 GMT
>> Amy,
>>
[quoted text clipped - 7 lines]
> When using StreamReader with Encoding.UTF8 the ñ gets displayed as a ?
> with a code of 65535
This is a non-character in Unicode. So the file seems not to be UTF-8
encoded.
> When using StreamReader with Encoding.Default the ñ gets displayed as a ñ
> with a code of 241
This is hexadecimal 00F1.
This is the right Unicode for ñ (LATIN SMALL LETTER N WITH TILDE).
So this seems to be the right encoding.
> When using FileStream with no encoding (don't believe you can set it) and
> then printing the characters of the bytes ñ gets displayed as a ñ with a
> code of 241.
So this must be the byte stored in the file.
The ANSI encoding of your system seems to map Unicode 00F1 to byte F1.
In UTF-8 this would be the lead byte of a 4-byte character. Very
probably the next byte can't be a continuation byte of a UTF-8 character.
So apparently the decoder uses FFFF as a substitution character for
incorrect encoding.
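
[Editorial sketch] The claim that F1 would start a 4-byte character follows from the UTF-8 bit patterns, which can be checked with a few masks. This is a minimal sketch with hypothetical names (Utf8Bytes, SequenceLength, IsContinuation). It only inspects bit patterns and is not a full validator: some bytes that match a lead pattern, such as C0, C1 and F5 to F7, never occur in well-formed UTF-8.

```csharp
using System;

public static class Utf8Bytes
{
    // Returns the sequence length announced by a lead byte (1 to 4),
    // or 0 if the byte cannot start a sequence.
    public static int SequenceLength(byte b)
    {
        if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx: ASCII
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
        return 0;                         // 10xxxxxx (continuation) or 11111xxx
    }

    // Continuation bytes always have the form 10xxxxxx.
    public static bool IsContinuation(byte b)
    {
        return (b & 0xC0) == 0x80;
    }

    public static void Main()
    {
        // F1 announces a 4-byte sequence, so three continuation bytes must follow.
        Console.WriteLine("F1 announces length: " + SequenceLength(0xF1));
        // 6F is the letter 'o', a likely next byte in Spanish text such as "año".
        Console.WriteLine("6F is a continuation byte: " + IsContinuation(0x6F));
    }
}
```

Because an ordinary letter after F1 is not a continuation byte, a UTF-8 decoder has to treat the sequence as malformed, which matches the behaviour reported in the thread.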
> I have read that using "Encoding.Default" is not good - however it seems
> to be the only thing that works.
As Jon said, if this is the way the file was encoded, this is the right
encoding to use to read the file.
Christof
Ben Mc - 12 Oct 2007 10:41 GMT
Hi Amy,
Just a quick bit of info off the top of my head (I haven't read in
detail about your problem in the above discussion), but what first
comes to mind is why you are trying to use UTF-8 and NOT UTF-16.
The 8 stands for 8 bits, which can hold 0-255 decimal values (à la
ASCII character set). UTF-16 was introduced to handle international
character sets, as it is 16-bit, hence a capacity to hold 65536
different characters, from 0 to 65535 (64k).
Hope this helps.
Cheers,
Ben
Jon Skeet [C# MVP] - 12 Oct 2007 10:51 GMT
> Just a quick bit of info off the top of my head (I haven't read in
> detail about your problem in the above discussion), but what first
[quoted text clipped - 4 lines]
> character-sets, as it is 16bit, hence a capacity to hold 65536
> different characters - from 0 - 65535 (64k)
No, you've completely misunderstood UTF-8, as well as claiming that
ASCII has 256 values (it doesn't - it's only 7 bit).
UTF-8 is perfectly capable of encoding all Unicode characters. A
Unicode character is encoded in 1-4 bytes by UTF-8. UTF-8 is a
pleasantly compact format because it encodes ASCII characters (which
make up the majority of most documents in the Western world) as single
bytes, and is ASCII-compatible in that any valid ASCII document is
also a valid UTF-8 document with the same meaning.
See http://www.cl.cam.ac.uk/~mgk25/unicode.html for more information
(and ignore the fact that it says it's about Linux/Unix).
Jon
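
[Editorial sketch] The 1-4 byte range and the ASCII compatibility described above can be checked directly with Encoding.UTF8. This is a small sketch; the sample characters and the names Utf8Sizes, Utf8Length and SameAsAscii are chosen here for illustration. Characters are written as escapes so the result does not depend on how the source file itself is saved.

```csharp
using System;
using System.Text;

public static class Utf8Sizes
{
    // Number of bytes UTF-8 needs to encode the given text.
    public static int Utf8Length(string text)
    {
        return Encoding.UTF8.GetByteCount(text);
    }

    // True if the ASCII and UTF-8 encodings of the text are byte-for-byte identical.
    public static bool SameAsAscii(string asciiText)
    {
        byte[] ascii = Encoding.ASCII.GetBytes(asciiText);
        byte[] utf8 = Encoding.UTF8.GetBytes(asciiText);
        if (ascii.Length != utf8.Length) return false;
        for (int i = 0; i < ascii.Length; i++)
        {
            if (ascii[i] != utf8[i]) return false;
        }
        return true;
    }

    public static void Main()
    {
        Console.WriteLine("U+0041 (A):            " + Utf8Length("A"));
        Console.WriteLine("U+00F1 (n with tilde): " + Utf8Length("\u00F1"));
        Console.WriteLine("U+20AC (euro sign):    " + Utf8Length("\u20AC"));
        Console.WriteLine("U+1F600 (emoji):       " + Utf8Length("\U0001F600"));
        Console.WriteLine("ASCII text unchanged:  " + SameAsAscii("plain ASCII text"));
    }
}
```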