
.NET Forum / Languages / C# / October 2007

Reading a text file with spanish accents

Amy L. - 12 Oct 2007 02:46 GMT
I am at an absolute loss as to what is going on here.  I have a text file
with some Spanish writing.  Some of the characters have accents.  I have
not found any way to read this text file and echo the output to the
console showing the accents.

I have tried using UTF-8 but it does not like the accent characters.

It basically converts
Añoro esta situación

to
A?oro esta situaci?n

What am I missing?
Amy
Nicholas Paldino [.NET/C# MVP] - 12 Oct 2007 03:19 GMT
Amy,

   Well, it's possible that you are reading the file correctly from UTF-8,
but the font for the console doesn't support those characters.  What is the
font that you are using and does it support those characters?
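
A minimal sketch of a related check on the console side, assuming the text has already been decoded correctly: besides the font, the console's output encoding can also turn accented characters into "?". The class name ConsoleEncodingDemo is hypothetical; Console.OutputEncoding is the standard .NET property. Whether the glyphs then render still depends on the console font.

```csharp
using System;
using System.Text;

public static class ConsoleEncodingDemo
{
    // Switches the console to UTF-8 output and returns the resulting code page.
    public static int UseUtf8Output()
    {
        Console.OutputEncoding = Encoding.UTF8;
        return Console.OutputEncoding.CodePage;
    }

    public static void Main()
    {
        Console.WriteLine("Output code page before: " + Console.OutputEncoding.CodePage);
        Console.WriteLine("Output code page after:  " + UseUtf8Output());

        // Escapes are used so the result does not depend on how this source file is saved.
        Console.WriteLine("A\u00F1oro esta situaci\u00F3n");
    }
}
```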

         - Nicholas Paldino [.NET/C# MVP]
         - mvp@spam.guard.caspershouse.com

>I am at an absolute loss on what is going on here.  I have a text file with
>some Spanish writing.  Some of the characters have accents.  I have not
[quoted text clipped - 11 lines]
> What am I missing?
> Amy
Amy L. - 12 Oct 2007 06:03 GMT
> Amy,
>
>    Well, it's possible that you are reading the file correctly from
> UTF-8, but the font for the console doesn't support those characters.  
> What is the font that you are using and does it support those characters?

In testing I decided to print each char to the screen along with its
byte value.  The code is merely a (int)c where c is a char.

When using StreamReader with Encoding.UTF8 the ñ gets displayed as a ?
with a code of 65535

When using StreamReader with Encoding.Default the ñ gets displayed as a
ñ with a code of 241

When using FileStream with no encoding (don't believe you can set it)
and then printing the characters of the bytes ñ gets displayed as a ñ
with a code of 241.

When attempting to convert the byte array returned from the FileStream
to a String in UTF8 as shown below, the string does not convert properly
(I get the ? for the accented characters).

// b is the byte[] previously read from the FileStream
UTF8Encoding temp = new UTF8Encoding( true );
Console.WriteLine( temp.GetString( b ) );

However, if I do
Console.WriteLine( System.Text.Encoding.Default.GetString( b ) );

It prints correctly.

I have read that using "Encoding.Default" is not good - however it seems
to be the only thing that works.  I know the characters are for the most
part being read in correctly especially with FileStream.  It just seems
like I am lost on what to do about the encoding of them.

Thoughts?
Darrell
Jon Skeet [C# MVP] - 12 Oct 2007 07:53 GMT
<snip>

> However, if I do
> Console.WriteLine( System.Text.Encoding.Default.GetString( b ) );
[quoted text clipped - 5 lines]
> part being read in correctly especially with FileStream.  It just seems
> like I am lost on what to do about the encoding of them.

*Characters* are not read at all by a FileStream. Bytes are read by a
FileStream. An Encoding is the way of converting between bytes and
characters.

If your file is effectively encoded using Encoding.Default, that's what
you should use. It would be generally better if you were able to start
with a UTF-8 file, but if you can't control whatever produces the file,
then you need to follow its lead.

Picking an encoding is a bit like picking an image format - you might
prefer PNG to BMP, but if someone gives you a BMP file and you try to
read it as if it were a PNG, you won't get the right picture.
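
A minimal sketch of "following the file's lead", assuming the file was saved as ISO-8859-1 (Latin-1), where ñ is the single byte F1 as reported earlier in the thread. The class and method names are hypothetical; the temporary file stands in for the real input. Passing the encoding to StreamReader explicitly avoids depending on Encoding.Default, which varies between machines and .NET versions.

```csharp
using System;
using System.IO;
using System.Text;

public static class ReadWithEncoding
{
    // Reads the whole file, decoding its bytes with the encoding the file was written in.
    public static string ReadAll(string path, Encoding encoding)
    {
        using (StreamReader reader = new StreamReader(path, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        string path = Path.GetTempFileName();
        try
        {
            // Create a sample file the way a Latin-1 editor would have saved it.
            File.WriteAllText(path, "A\u00F1oro esta situaci\u00F3n", latin1);

            // Matching encoding: the accents survive.
            Console.WriteLine(ReadAll(path, latin1));

            // Mismatched encoding: the accented bytes are not valid UTF-8 and get replaced.
            Console.WriteLine(ReadAll(path, Encoding.UTF8));
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```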

Jon Skeet - <skeet@pobox.com>
http://www.pobox.com/~skeet   Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too

Christof Nordiek - 12 Oct 2007 09:04 GMT
>> Amy,
>>
[quoted text clipped - 7 lines]
> When using StreamReader with Encoding.UTF8 the ñ gets displayed as a ?
> with a code of 65535

This is a non-character in Unicode, so the file does not seem to be
UTF-8 encoded.
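
One way to test that conclusion directly, as a sketch: a UTF8Encoding constructed with throwOnInvalidBytes set to true raises an exception instead of silently substituting a character. The helper name IsValidUtf8 is hypothetical, and the sample bytes are assumed from the values reported in the thread.

```csharp
using System;
using System.Text;

public static class Utf8Check
{
    // Returns true only if the bytes form well-formed UTF-8.
    public static bool IsValidUtf8(byte[] bytes)
    {
        // First argument: do not emit a BOM. Second: throw on invalid byte sequences.
        UTF8Encoding strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // "Año" as stored in a single-byte ANSI/Latin-1 file: ñ is the lone byte F1.
        Console.WriteLine(IsValidUtf8(new byte[] { 0x41, 0xF1, 0x6F }));

        // "Año" as stored in a UTF-8 file: ñ is the two bytes C3 B1.
        Console.WriteLine(IsValidUtf8(new byte[] { 0x41, 0xC3, 0xB1, 0x6F }));
    }
}
```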

> When using StreamReader with Encoding.Default the ñ gets displayed as a ñ
> with a code of 241

This is hexadecimal 00F1.
This is the correct Unicode code point for ñ (LATIN SMALL LETTER N WITH
TILDE), so this seems to be the right encoding.

> When using FileStream with no encoding (don't believe you can set it) and
> then printing the characters of the bytes ñ gets displayed as a ñ with a
> code of 241.

So this must be the byte stored in the file.
The ANSI encoding of your system seems to map Unicode 00F1 to byte F1.
In UTF-8 this would be the leading byte of a 4-byte character. Very
probably the next byte is not a valid continuation byte of a UTF-8
character, so the decoder apparently uses FFFF as a substitution character
for the incorrect encoding.
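
A small sketch of that byte-level reasoning, with hypothetical class and method names. It shows how UTF-8 actually stores ñ, and what the default decoder does with a lone F1 followed by an ASCII byte. Note that, as far as I know, the default .NET replacement character is U+FFFD (65533) rather than FFFF, so the value printed here may differ from the 65535 reported above.

```csharp
using System;
using System.Text;

public static class LeadByteDemo
{
    // Formats bytes as hyphen-separated hex, e.g. "C3-B1".
    public static string Hex(byte[] bytes)
    {
        return BitConverter.ToString(bytes);
    }

    // Decodes a lone F1 lead byte followed by ASCII 'o' with the default (lenient) UTF-8 decoder.
    public static string DecodeLoneF1()
    {
        return Encoding.UTF8.GetString(new byte[] { 0xF1, 0x6F });
    }

    public static void Main()
    {
        // UTF-8 stores U+00F1 as two bytes, not as the single byte F1.
        Console.WriteLine(Hex(Encoding.UTF8.GetBytes("\u00F1")));

        // F1 announces a 4-byte sequence, but 6F is not a continuation byte,
        // so the decoder substitutes a replacement character for F1.
        foreach (char c in DecodeLoneF1())
        {
            Console.WriteLine((int)c);
        }
    }
}
```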

> I have read that using "Encoding.Default" is not good - however it seems
> to be the only thing that works.

As Jon said, if this is the way the file was encoded, then this is the
right encoding to use when reading the file.

Christof
Ben Mc - 12 Oct 2007 10:41 GMT
Hi Amy,

Just a quick bit of info off the top of my head (I haven't read the
details of your problem in the above discussion), but what first
comes to mind is why you are trying to use UTF-8 and NOT UTF-16.

The 8 stands for 8 bits, which can hold 0-255 decimal values (a la
ASCII character set). UTF-16 was introduced to handle international
character-sets, as it is 16bit, hence a capacity to hold 65536
different characters - from 0 - 65535 (64k)

Hope this helps.

Cheers,
Ben
Jon Skeet [C# MVP] - 12 Oct 2007 10:51 GMT
> Just a quick bit of info off the top of my head (I haven't read the
> details of your problem in the above discussion), but what first
[quoted text clipped - 4 lines]
> character-sets, as it is 16bit, hence a capacity to hold 65536
> different characters - from 0 - 65535 (64k)

No, you've completely misunderstood UTF-8, as well as claiming that
ASCII has 256 values (it doesn't - it's only 7 bit).

UTF-8 is perfectly capable of encoding all Unicode characters. A
Unicode character is encoded in 1-4 bytes by UTF-8. UTF-8 is a
pleasantly compact format because it encodes ASCII characters (which
make up the majority of most documents in the Western world) as single
bytes, and is ASCII-compatible in that any valid ASCII document is
also a valid UTF-8 document with the same meaning.
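
A short sketch illustrating both claims, using hypothetical names: the number of bytes UTF-8 needs for characters from different ranges, and the fact that pure ASCII text produces identical bytes under ASCII and UTF-8. The sample characters are my own choice, not taken from the thread.

```csharp
using System;
using System.Text;

public static class Utf8Widths
{
    // Number of bytes UTF-8 needs to store the given text.
    public static int Utf8Length(string text)
    {
        return Encoding.UTF8.GetByteCount(text);
    }

    // True if ASCII and UTF-8 produce exactly the same bytes for the text.
    public static bool SameBytesAsAscii(string text)
    {
        return BitConverter.ToString(Encoding.ASCII.GetBytes(text))
            == BitConverter.ToString(Encoding.UTF8.GetBytes(text));
    }

    public static void Main()
    {
        Console.WriteLine(Utf8Length("A"));          // U+0041, ASCII letter
        Console.WriteLine(Utf8Length("\u00F1"));     // U+00F1, n with tilde
        Console.WriteLine(Utf8Length("\u20AC"));     // U+20AC, euro sign
        Console.WriteLine(Utf8Length("\U0001F600")); // U+1F600, outside the 16-bit range

        Console.WriteLine(SameBytesAsAscii("plain ASCII text"));
    }
}
```

The last character also needs two 16-bit code units in UTF-16, so 16 bits per character is not enough to cover all of Unicode either.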

See http://www.cl.cam.ac.uk/~mgk25/unicode.html for more information
(and ignore the fact that it says it's about Linux/Unix).

Jon
Cor Ligthert[MVP] - 12 Oct 2007 05:15 GMT
Amy,

The Spanish characters are in the 1252 character set. In my opinion it is
a good idea to check that in the Country settings. The way to handle this
seems to differ in almost every Windows OS version, so I cannot tell you
exactly how. I have had plenty of problems with this, where the characters
were not shown correctly in every application, although that was when
combining the 1250 and 1252 sets.

http://msdn2.microsoft.com/en-us/library/aa912040.aspx
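
A sketch of asking for code page 1252 by number instead of relying on the machine's regional settings. This assumes a modern .NET runtime (.NET Core / .NET 5 or later), where the Windows code pages must first be registered through CodePagesEncodingProvider; on the classic .NET Framework, Encoding.GetEncoding(1252) works without that step. The class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class CodePage1252Demo
{
    // Returns the Windows-1252 encoding, registering the code page provider first.
    public static Encoding GetWindows1252()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1252);
    }

    public static void Main()
    {
        Encoding cp1252 = GetWindows1252();

        // Byte F1 is ñ (U+00F1) in code page 1252.
        Console.WriteLine((int)cp1252.GetString(new byte[] { 0xF1 })[0]);

        // Byte 80 is the euro sign (U+20AC) in 1252; ISO-8859-1 has a control character there.
        Console.WriteLine((int)cp1252.GetString(new byte[] { 0x80 })[0]);
    }
}
```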

Cor
