How does the browser send data in a form for non-English characters?

Chris Y - 20 Apr 2006 01:16 GMT
I am using IE and IIS.

I am facing some problems with non-English characters, so I did a test with a simple FORM with ENCTYPE='application/x-www-form-urlencoded' and a single <INPUT NAME='name'> box, and then wrote out the data that was received.  My results are:

a.  If I input just Chinese characters: 毛泽东, it was received and written out as: ëÔó¶«.  I captured the HTTP POST stream and the data passed was: name=%C3%AB%D4%F3%B6%AB.

b.  If I put another character, the copyright symbol (Unicode U+00A9), before my Chinese characters: ©毛泽东, then the data is received correctly, i.e. the same as I entered it.  The data transmitted was: name=%A9%26%2327611%3B%26%2327901%3B%26%2319996%3B.  I understand that this is the same as: ©&#27611;&#27901;&#19996;.

I am at a loss as to what is going on.  Why would the browser encode it differently in the two cases?  How can I force it to stick to one method (the second one)?

I am not totally familiar with Unicode.  In Character Map, why do some characters have two codes, e.g. 毛 is shown as U+6BDB (0xC3AB)?  What is C3AB?  It is the one giving problems in my first case above.
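
A small Python sketch (not part of the original thread) may help with the Character Map question. It assumes the second code shown by Character Map is the character's byte value in the currently selected code page; that assumption fits here, because U+6BDB encodes to the bytes C3 AB in GB2312. The helper name `encodings_of` is made up for illustration.

```python
import unicodedata


def encodings_of(ch: str) -> dict:
    """Return the hex bytes of one character under a few encodings."""
    return {
        name: ch.encode(name).hex().upper()
        for name in ("gb2312", "utf-8", "utf-16-be")
    }


# The character from the question, written by its Unicode code point.
ch = "\u6bdb"

# One code point, several byte representations depending on the encoding.
for name, hex_bytes in encodings_of(ch).items():
    print(f"{name:10} {hex_bytes}")

# U+C3AB is a different character altogether, as its Unicode name shows.
print(unicodedata.name("\uc3ab"))
```

The point of the sketch is that "U+6BDB" names the character, while "0xC3AB" is only one way of writing it as bytes; the UTF-8 bytes for the same character are different again.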

Thanks in advance.

js
Michael (michka) Kaplan [MS] - 20 Apr 2006 04:25 GMT
U+6bdb is a CJK ideograph.

U+c3ab is a Hangul syllable -- but also if you look at the bytes in UTF-8
form then two of those bytes will indeed be 0xc3 and 0xab.

Have you properly set the Response encoding? The bytes are right but it is
how they are being interpreted that is causing you problems....

Signature

MichKa [Microsoft]
NLS Collation/Locale/Keyboard Technical Lead
Globalization Infrastructure, Fonts, and Tools
Blog: http://blogs.msdn.com/michkap

This posting is provided "AS IS" with
no warranties, and confers no rights.

Chris Y - 20 Apr 2006 07:12 GMT
Thanks, I did a few more experiments and found the cause.

I didn't put a charset statement in my web page.  It appears that if no
charset is explicitly indicated, IE will try to find the most appropriate
charset to use.  If it is just European characters, it will try and use the
Western European charset.  If it's only Asian characters, it will use UTF-8.
If there is a combination of both, it will use method b, which I described
earlier.
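
The two captured requests can be reproduced with a short Python sketch, offered here as an illustration rather than a description of IE's actual code. It assumes the browser encodes the form value in whatever charset it picked for the page and writes any character that charset cannot represent as a decimal `&#NNNN;` reference before percent-encoding. The function name `encode_form_value` is hypothetical. Note that the bytes captured in case a match GB2312 rather than UTF-8.

```python
from urllib.parse import quote


def encode_form_value(value: str, page_charset: str) -> str:
    """Percent-encode a form value in the given charset.

    Characters the charset cannot represent become &#NNNN; references,
    which is what the captured case-b request looks like.
    """
    raw = value.encode(page_charset, errors="xmlcharrefreplace")
    return quote(raw, safe="")


chinese = "\u6bdb\u6cfd\u4e1c"  # the three characters from the question

# Case a: a Simplified Chinese charset can represent every character.
print(encode_form_value(chinese, "gb2312"))
# -> %C3%AB%D4%F3%B6%AB

# Case b: Western European (cp1252) has the copyright sign but no Chinese.
print(encode_form_value("\u00a9" + chinese, "cp1252"))
# -> %A9%26%2327611%3B%26%2327901%3B%26%2319996%3B

# With an explicit UTF-8 charset nothing needs a character reference.
print(encode_form_value("\u00a9" + chinese, "utf-8"))
```

Declaring the charset on the page removes the guesswork, because the browser then has a single encoding to apply to every submission.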

If I force charset=UTF-8, then it will always come out correct.  However, I
sent my web data to an external application and I have a tough time
converting a string that is UTF-8 encoded.  Is first converting the string
(assuming it contains only one-byte characters) to bytes and then
UTF8Encoding.GetString() the way to go?
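
As a rough illustration of the approach asked about, here is a Python sketch. It assumes the string really does hold one character per UTF-8 byte, each in the range U+0000 to U+00FF, so that Latin-1 maps the characters back to the original bytes unchanged. The function name `reinterpret_as_utf8` is made up. In .NET the analogous steps would be `Encoding.GetEncoding("iso-8859-1").GetBytes(s)` followed by `Encoding.UTF8.GetString(bytes)`.

```python
def reinterpret_as_utf8(garbled: str) -> str:
    """Turn a string holding one character per UTF-8 byte into real text."""
    # Latin-1 maps U+0000..U+00FF to the bytes 00..FF one to one, so this
    # recovers the original byte sequence without changing any value.
    raw = garbled.encode("latin-1")
    # Now decode those bytes as what they actually are.
    return raw.decode("utf-8")


# The UTF-8 bytes E6 AF 9B for U+6BDB, held as three separate characters.
garbled = "\u00e6\u00af\u009b"
print(reinterpret_as_utf8(garbled) == "\u6bdb")
```

If the string was produced by decoding the bytes as Windows-1252 instead of Latin-1, the first step would need `cp1252`, and bytes that Windows-1252 leaves undefined may already have been lost.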

I don't know what charset setting can give the result in method b.  The data
is sort of HTML encoded.  I actually prefer this as I can retrieve the data
in one go with HttpUtility.HtmlDecode().
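
For comparison, decoding the method b request body can be sketched in Python, with `html.unescape` standing in for `HttpUtility.HtmlDecode()`. The sketch assumes the body was encoded in Windows-1252, which is consistent with the `%A9` byte for the copyright sign.

```python
import html
from urllib.parse import parse_qs

# The request body captured in case b of the question.
body = "name=%A9%26%2327611%3B%26%2327901%3B%26%2319996%3B"

# Step 1: percent-decode, reading the bytes as Windows-1252 (an assumption).
fields = parse_qs(body, encoding="cp1252")
raw = fields["name"][0]
print(raw)  # still contains the &#NNNN; references

# Step 2: resolve the numeric character references.
decoded = html.unescape(raw)
print(decoded == "\u00a9\u6bdb\u6cfd\u4e1c")
```

One caveat with relying on this form: a user who literally types text such as `&#27611;` into the box produces the same bytes, so the server cannot tell typed references from browser-generated ones.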

