Possible bug in UnicodeEncoding

KrippZ@gmail.com - 12 Sep 2006 17:31 GMT
Hello!

We here at the office have discovered something odd. Can somebody
please verify this potential bug for us?

This code generates a byte buffer and fills it with 256 bytes ranging from
0 to 255, and the bug appears when the Unicode encoder gets the bytes
from a string that another Unicode encoder produced from the byte buffer.

The byte buffers should not differ, but in .NET 2.0 they do.
We have run the test code in VS 2003 and VS 2005, and the results in
2003 don't differ.

Bytes 216, 217 and 222, 223 seem to go missing?!?

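    using System.IO;
    using System.Text;

    class Program
    {
        static void Main(string[] args)
        {
            // Fill a buffer with every byte value from 0 to 255.
            byte[] bytearrBuffer = new byte[256];
            for (int i = 0; i < 256; i++)
            {
                bytearrBuffer[i] = (byte)i;
            }
            WriteBuffer(bytearrBuffer, "Buffer.txt");

            // Decode the raw bytes as UTF-16, then encode the string again.
            string decoded = new UnicodeEncoding().GetString(bytearrBuffer);
            byte[] roundTripped = new UnicodeEncoding().GetBytes(decoded);
            WriteBuffer(roundTripped, "Buffer2.txt");
        }

        public static void WriteBuffer(byte[] arrbyteBuffer, string filename)
        {
            try
            {
                string sLogFileName = Path.Combine("c:\\", filename);

                FileStream fs = new FileStream(sLogFileName, FileMode.Create,
                    FileAccess.Write, FileShare.Write);
                BinaryWriter bw = new BinaryWriter(fs);

                // Each byte is written as its decimal text; BinaryWriter
                // prefixes every string with its length.
                for (int i = 0; i < arrbyteBuffer.Length; i++)
                {
                    bw.Write(arrbyteBuffer[i].ToString());
                }

                bw.Flush();
                bw.Close();
            }
            catch
            {
                // I/O errors are swallowed in this test code.
            }
        }
    }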

Cheers
//KrippZ
Mattias Sjögren - 13 Sep 2006 21:37 GMT
>We here at the office have discovered something odd. Can somebody
>please verify this potential bug for us?

I wouldn't call it a bug. There's no guarantee that a random byte
array will come back the same after an
Encoding.GetString/Encoding.GetBytes roundtrip. Some byte values may
have special meaning or may be invalid according to that encoding. So
you can't take an arbitrary blob and decode it to a string like that.
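
If the bytes really do have to travel as a string, one common option is Base64, which maps any byte sequence to plain ASCII text and back without loss. The following is a minimal sketch, not something from the original code; the class name BinaryAsText and its methods are hypothetical, while Convert.ToBase64String and Convert.FromBase64String are the framework calls doing the work.

```csharp
using System;

public static class BinaryAsText
{
    // Hypothetical helper: turn arbitrary bytes into text that is safe to store as a string.
    public static string ToText(byte[] data)
    {
        return Convert.ToBase64String(data);
    }

    // Hypothetical helper: recover the original bytes from the Base64 text.
    public static byte[] FromText(string text)
    {
        return Convert.FromBase64String(text);
    }

    // Returns true when the bytes survive a ToText/FromText round trip unchanged.
    public static bool RoundTrips(byte[] data)
    {
        byte[] back = FromText(ToText(data));
        if (back.Length != data.Length)
        {
            return false;
        }
        for (int i = 0; i < data.Length; i++)
        {
            if (back[i] != data[i])
            {
                return false;
            }
        }
        return true;
    }
}
```

Calling BinaryAsText.RoundTrips on the 0 to 255 buffer from the original post should return true, because Base64 places no restrictions on the input bytes.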

Mattias

Mattias Sjögren [C# MVP]  mattias @ mvps.org
http://www.msjogren.net/dotnet/ | http://www.dotnetinterop.com
Please reply only to the newsgroup.

Jon Skeet [C# MVP] - 13 Sep 2006 22:12 GMT
> We here at the office have discovered something odd. Can somebody
> please verify this potential bug for us?

Not a bug, or at least not the bug you think it is.

> This code generates a byte buffer and fills it with 256 bytes ranging from
> 0 to 255, and the bug appears when the Unicode encoder gets the bytes
[quoted text clipped - 5 lines]
>
> Bytes 216, 217 and 222, 223 seem to go missing?!?

Your byte array isn't a valid Unicode-encoded string. 0xd800 to 0xdfff
are reserved for surrogate pairs - you need to have a value in
[0xd800-0xdbff] followed by [0xdc00-0xdfff]. So, 216/217 isn't valid,
and neither is 222/223.

In fact, 218/219 and 220/221 shouldn't be valid either, which just goes
to show: garbage in, garbage out.

The moral of the story is that you shouldn't treat arbitrary binary
data as text.
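
One way to catch this early is to decode with a UnicodeEncoding that throws on invalid input instead of quietly altering it. This is a sketch under assumptions: the StrictUtf16 class and TryDecode method are hypothetical names, and the three-argument UnicodeEncoding constructor (bigEndian, byteOrderMark, throwOnInvalidBytes) is used to request strict behaviour.

```csharp
using System;
using System.Text;

public static class StrictUtf16
{
    // Hypothetical helper: decode little-endian UTF-16 and report failure
    // instead of silently changing invalid input.
    public static bool TryDecode(byte[] data, out string text)
    {
        // Arguments: bigEndian = false, byteOrderMark = false, throwOnInvalidBytes = true.
        UnicodeEncoding strict = new UnicodeEncoding(false, false, true);
        try
        {
            text = strict.GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            // The bytes contain a sequence that is not valid UTF-16.
            text = null;
            return false;
        }
    }
}
```

With the 0 to 255 buffer from the original post, TryDecode should return false, since the buffer contains unpaired surrogates. Bytes produced by encoding a real string decode normally.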

Jon Skeet - <skeet@pobox.com>
http://www.pobox.com/~skeet   Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too

Jon Skeet [C# MVP] - 13 Sep 2006 23:10 GMT
> > The byte buffers should not differ, but in .NET 2.0 they do.
> > We have run the test code in VS 2003 and VS 2005, and the results in
[quoted text clipped - 9 lines]
> In fact, 218/219 and 220/221 shouldn't be valid either, which just goes
> to show: garbage in, garbage out.

Sorry, I've realised what I'd done wrong in the above analysis. My
general principle was right (as was the conclusion that the byte array
didn't represent a valid Unicode string) but the logic was off.

This bit is right:
> Your byte array isn't a valid Unicode-encoded string. 0xd800 to 0xdfff
> are reserved for surrogate pairs - you need to have a value in
> [0xd800-0xdbff] followed by [0xdc00-0xdfff].

and the bytes 216-225 end up being 16-bit values of:

0xd9d8 0xdbda 0xdddc 0xdfde 0xe1e0

Now, the Encoding looks at the first of those (0xd9d8), a high
surrogate, and expects a low surrogate to follow. It doesn't, so it
presumably ignores the character. It moves on to 0xdbda, which is
"correctly" followed by 0xdddc, so those end up forming a surrogate
pair. The 0xdfde is a low surrogate that should have been preceded by a
high surrogate, so it ignores it and moves on to the rest - which are
valid in themselves.
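
The walk-through above can be reproduced by hand. The sketch below is not the framework's decoder; it only combines the bytes into little-endian 16-bit values and labels each one using char.IsHighSurrogate and char.IsLowSurrogate. The SurrogateWalk class and its method names are hypothetical.

```csharp
using System;

public static class SurrogateWalk
{
    // Combine consecutive byte pairs into little-endian 16-bit code units.
    public static ushort[] ToCodeUnits(byte[] data)
    {
        ushort[] units = new ushort[data.Length / 2];
        for (int i = 0; i < units.Length; i++)
        {
            units[i] = (ushort)(data[2 * i] | (data[2 * i + 1] << 8));
        }
        return units;
    }

    // Label each code unit as ordinary, part of a valid pair, or an unpaired surrogate.
    public static string[] Classify(ushort[] units)
    {
        string[] labels = new string[units.Length];
        for (int i = 0; i < units.Length; i++)
        {
            char c = (char)units[i];
            if (char.IsHighSurrogate(c))
            {
                bool paired = i + 1 < units.Length
                    && char.IsLowSurrogate((char)units[i + 1]);
                if (paired)
                {
                    labels[i] = "high surrogate (paired)";
                    labels[i + 1] = "low surrogate (paired)";
                    i++; // the low half has been consumed
                }
                else
                {
                    labels[i] = "unpaired high surrogate";
                }
            }
            else if (char.IsLowSurrogate(c))
            {
                labels[i] = "unpaired low surrogate";
            }
            else
            {
                labels[i] = "ordinary";
            }
        }
        return labels;
    }
}
```

Feeding bytes 216 to 225 through ToCodeUnits and Classify labels 0xd9d8 as an unpaired high surrogate, 0xdbda and 0xdddc as a pair, 0xdfde as an unpaired low surrogate, and 0xe1e0 as ordinary. The two unpaired values are the byte pairs 216/217 and 222/223 from the original post.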

Jon Skeet - <skeet@pobox.com>
http://www.pobox.com/~skeet   Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too

