I am using IE and IIS.
I am facing some problems with non-English characters, so I ran a test with a simple FORM with ENCTYPE='application/x-www-form-urlencoded' and a single <INPUT NAME='name'> box, and then wrote out the data that was received. My results are:
a. If I input just Chinese characters: 毛泽东, it was received and written out as: ëÔó¶«. I captured the HTTP POST stream and the data passed was: name=%C3%AB%D4%F3%B6%AB.
b. If I put another character, the copyright symbol (Unicode U+00A9), before my Chinese characters: ©毛泽东, then the data is received correctly, i.e. exactly as I entered it. The data transmitted was: name=%A9%26%2327611%3B%26%2327901%3B%26%2319996%3B. I can see that this is the same as: ©毛泽东.
I am at a loss as to what is going on. Why would the browser encode it differently in the two cases? How can I force it to stick to one method (the second one)?
I am not totally familiar with Unicode. In Character Map, why do some characters have two codes? E.g. for 毛, it is shown as U+6BDB (0xC3AB). What is C3AB? It is the one causing the problem in my first case above.
Thanks in advance.
js
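For reference, the two captured POST bodies can be decoded offline. The following is a minimal Python sketch, not part of the original test. It assumes that case a was sent as GB2312 bytes (which would match the 0xC3AB code quoted from Character Map) and that case b was sent as windows-1252 bytes carrying HTML numeric character references. Neither encoding is named in the post, and the variable and helper names are illustrative.

```python
from urllib.parse import unquote, unquote_to_bytes
import html

# Request bodies exactly as captured in the two test cases.
body_a = "%C3%AB%D4%F3%B6%AB"
body_b = "%A9%26%2327611%3B%26%2327901%3B%26%2319996%3B"

def code_points(text):
    """List the Unicode code points of a string in U+XXXX form."""
    return [f"U+{ord(ch):04X}" for ch in text]

# Case a: percent-decode to raw bytes, then interpret them as GB2312
# (assumed encoding; the post does not name it).
bytes_a = unquote_to_bytes(body_a)
text_a = bytes_a.decode("gb2312")

# Case b: percent-decode as windows-1252 (assumed). This yields the
# copyright sign followed by numeric character references such as
# &#27611;, which html.unescape then resolves.
escaped_b = unquote(body_b, encoding="cp1252")
text_b = html.unescape(escaped_b)

print(bytes_a.hex())        # c3abd4f3b6ab
print(code_points(text_a))  # ['U+6BDB', 'U+6CFD', 'U+4E1C']
print(code_points(text_b))  # ['U+00A9', 'U+6BDB', 'U+6CFD', 'U+4E1C']
```

Under these assumptions both bodies carry the same three ideographs; only the wire representation differs.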
U+6bdb is a CJK ideograph.
U+c3ab is a Hangul syllable -- but also if you look at the bytes in UTF-8
form then two of those bytes will indeed be 0xc3 and 0xab.
Have you properly set the Response encoding? The bytes are right, but it is
how they are being interpreted that is causing you problems.

MichKa [Microsoft]
NLS Collation/Locale/Keyboard Technical Lead
Globalization Infrastructure, Fonts, and Tools
Blog: http://blogs.msdn.com/michkap
This posting is provided "AS IS" with
no warranties, and confers no rights.
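
The different codes quoted for U+6BDB can be checked directly. A small Python sketch follows. It assumes that the 0xC3AB value shown by Character Map is the character's GB2312 code rather than a Unicode code point; the thread does not state this.

```python
import unicodedata

mao = "\u6bdb"  # the CJK ideograph U+6BDB

gb_bytes = mao.encode("gb2312")   # legacy Simplified Chinese encoding (assumed)
utf8_bytes = mao.encode("utf-8")  # UTF-8 form of the same character

print(unicodedata.name(mao))  # CJK UNIFIED IDEOGRAPH-6BDB
print(gb_bytes.hex())         # c3ab
print(utf8_bytes.hex())       # e6af9b

# Read as a Unicode code point instead, C3AB names a different character.
print(unicodedata.name("\uc3ab"))

# The byte pair C3 AB is also the UTF-8 form of U+00EB.
print(b"\xc3\xab".decode("utf-8") == "\u00eb")  # True
```

So the same two bytes mean different things depending on which encoding the receiver assumes, which is why the declared charset matters.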
Chris Y - 20 Apr 2006 07:12 GMT
Thanks, I did a few more experiments and found the cause.
I didn't put a charset statement in my web page. It appears that if no
charset is explicitly indicated, IE will try to find the most appropriate
charset to use. If the input is just European characters, it will use the
Western European charset. If it's only Asian characters, it will use UTF-8.
If there is a combination of both, it will use method b, which I described
earlier.
If I force charset=UTF-8, then it always comes out correct. However, I
send my web data to an external application, and I have a tough time
converting a string that is UTF-8 encoded. Is first converting the string
(assuming it contains only one-byte characters) to bytes and then calling
UTF8Encoding.GetString() the way to go?
I don't know what charset setting can give the result in method b. The data
is sort of HTML encoded. I actually prefer this, as I can retrieve the data
in one go with HttpUtility.HtmlDecode().
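
The two recovery paths mentioned above can be sketched as follows. This is a Python illustration rather than the .NET code asked about, and the function names are hypothetical. It assumes the external application hands back a string in which every character holds exactly one UTF-8 byte, i.e. the bytes were decoded as ISO-8859-1 by mistake.

```python
import html

def recover_utf8(one_byte_per_char: str) -> str:
    """Undo a mistaken ISO-8859-1 decode of UTF-8 data."""
    # Map each character back to the single byte it came from.
    # Raises UnicodeEncodeError if any character is above U+00FF.
    raw = one_byte_per_char.encode("latin-1")
    return raw.decode("utf-8")

def recover_entities(text: str) -> str:
    """Resolve numeric character references such as &#27611;."""
    return html.unescape(text)

# Simulate the damage: UTF-8 bytes read back as one-byte characters.
original = "\u00a9\u6bdb\u6cfd\u4e1c"
damaged = original.encode("utf-8").decode("latin-1")

print(recover_utf8(damaged) == original)  # True
print(recover_entities("&#27611;&#27901;&#19996;") == original[1:])  # True
```

In .NET the analogous steps would likely be Encoding.GetEncoding("iso-8859-1").GetBytes() followed by Encoding.UTF8.GetString(), but this only round-trips if the string really was produced by a one-byte decode of the original UTF-8 bytes.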