>We here at the office have discovered something odd. Can somebody
>please verify this potential bug for us?
I wouldn't call it a bug. There's no guarantee that a random byte
array will come back the same after an
Encoding.GetString/Encoding.GetBytes roundtrip. Some byte values may
have special meaning or may be invalid according to that encoding. So
you can't take an arbitrary blob and decode it to a string like that.
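To make the point concrete, here is a rough sketch of the same kind of roundtrip. It is written in Python rather than against the .NET Encoding classes, so it only illustrates the principle; the helper name roundtrip and the choice of the "replace" error handler are my assumptions, not what .NET does internally.

```python
# Sketch only: Python's UTF-16 codec stands in for Encoding.Unicode here.
# The exact bytes that change may differ from what .NET produces.

def roundtrip(blob: bytes) -> bytes:
    """Decode a blob as little-endian UTF-16 and encode it straight back."""
    # errors="replace" substitutes U+FFFD for input that is not valid UTF-16
    # instead of raising UnicodeDecodeError.
    text = blob.decode("utf-16-le", errors="replace")
    return text.encode("utf-16-le")

data = bytes(range(256))   # the 0..255 buffer from the original question
result = roundtrip(data)

# Report whether the blob survived, and where it changed if it did not.
changed = [i for i, (a, b) in enumerate(zip(data, result)) if a != b]
print("identical:", result == data)
print("changed byte positions:", changed)
```

Bytes that happen to form valid text survive the trip; an arbitrary blob is not guaranteed to.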
Mattias

Mattias Sjögren [C# MVP] mattias @ mvps.org
http://www.msjogren.net/dotnet/ | http://www.dotnetinterop.com
Please reply only to the newsgroup.
> We here at the office have discovered something odd. Can somebody
> please verify this potential bug for us?
Not a bug, or at least not the bug you think it is.
> This code generates a byte buffer, fills it with 256 bytes ranging from
> 0 to 255, and the bug appears when the Unicode Encoder gets the bytes
[quoted text clipped - 5 lines]
>
> bytes 216,217 and 222, 223 seem to go missing?!?
Your byte array isn't a valid Unicode-encoded string. 0xd800 to 0xdfff
are reserved for surrogate pairs - you need to have a value in
[0xd800-0xdbff] followed by [0xdc00-0xdfff]. So, 216/217 isn't valid,
and neither is 222/223.
In fact, 218/219 and 220/221 shouldn't be valid either, which just goes
to show: garbage in, garbage out.
The moral of the story is that you shouldn't treat arbitrary binary
data as text.
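If the binary data really does have to travel as a string, one common option is a binary-to-text encoding such as Base64. A minimal sketch in Python follows; the function names are made up for illustration. In .NET the corresponding calls would be Convert.ToBase64String and Convert.FromBase64String.

```python
import base64

def blob_to_text(blob: bytes) -> str:
    """Represent arbitrary bytes as an ASCII-only string."""
    return base64.b64encode(blob).decode("ascii")

def text_to_blob(text: str) -> bytes:
    """Recover the original bytes from the Base64 string."""
    return base64.b64decode(text)

data = bytes(range(256))
as_text = blob_to_text(data)

# Unlike a decode/encode through a text encoding, this is lossless
# for every possible byte value.
print(text_to_blob(as_text) == data)
```

The string is about a third longer than the raw data, which is the price of staying within characters that every text encoding handles.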

Jon Skeet - <skeet@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Jon Skeet [C# MVP] - 13 Sep 2006 23:10 GMT
> > The bytebuffers should not differ but in Net 2.0 they do.
> > We have run the testcode in VS 2003 and VS 2005 and the results of
[quoted text clipped - 9 lines]
> In fact, 218/219 and 220/221 shouldn't be valid either, which just goes
> to show: garbage in, garbage out.
Sorry, I've realised what I'd done wrong in the above analysis. My
general principle was right (as was the conclusion that the byte array
didn't represent a valid Unicode string) but the logic was off.
This bit is right:
> Your byte array isn't a valid Unicode-encoded string. 0xd800 to 0xdfff
> are reserved for surrogate pairs - you need to have a value in
> [0xd800-0xdbff] followed by [0xdc00-0xdfff].
and the bytes 216-225 end up being (little-endian) 16-bit values of:
0xd9d8 0xdbda 0xdddc 0xdfde 0xe1e0
Now, the Encoding looks at the first of those (0xd9d8), which is a high
surrogate, and expects a low surrogate to follow. The next value
(0xdbda) is another high surrogate instead, so it presumably ignores
the first one. It moves on to 0xdbda, which is "correctly" followed by
the low surrogate 0xdddc, so those end up forming a surrogate pair. The
0xdfde is a low surrogate that should have been preceded by a high
surrogate, so it ignores it and moves on to the rest - which are valid
in themselves.
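The walk through those five values can be sketched like this. It is Python, not the actual .NET decoder, and the names code_units and classify are purely illustrative; it only labels each 16-bit value according to the surrogate ranges quoted above and says nothing about what a particular decoder does with the unpaired ones.

```python
import struct

HIGH = range(0xD800, 0xDC00)   # high surrogates: 0xd800-0xdbff
LOW = range(0xDC00, 0xE000)    # low surrogates:  0xdc00-0xdfff

def code_units(blob: bytes) -> list:
    """Split a blob into little-endian unsigned 16-bit values."""
    return [unit for (unit,) in struct.iter_unpack("<H", blob)]

def classify(units: list) -> list:
    """Label each value as part of a pair, an unpaired surrogate, or ordinary."""
    labels = []
    i = 0
    while i < len(units):
        unit = units[i]
        if unit in HIGH and i + 1 < len(units) and units[i + 1] in LOW:
            # A high surrogate immediately followed by a low one is a valid pair.
            labels.append((units[i:i + 2], "surrogate pair"))
            i += 2
        elif unit in HIGH or unit in LOW:
            labels.append(([unit], "unpaired surrogate"))
            i += 1
        else:
            labels.append(([unit], "ordinary character"))
            i += 1
    return labels

units = code_units(bytes(range(216, 226)))
for group, kind in classify(units):
    print(" ".join(f"0x{u:04x}" for u in group), "->", kind)
```

For bytes 216-225 this marks 0xd9d8 and 0xdfde as unpaired, 0xdbda with 0xdddc as a pair, and 0xe1e0 as ordinary, which lines up with bytes 216/217 and 222/223 being the ones affected.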
