.NET Forum / .NET Framework / Internationalization / September 2004

UTF-16 and UTF-8 - ASP.NET Clarification

Madhanmohan S - 26 Aug 2004 13:29 GMT
Hi All,
       The .NET Framework uses UTF-16 internally for all strings. But
what we write using Response.Write, i.e. the response from the ASP.NET
application, is UTF-8. Does a conversion happen in between? Please clarify.
Please let me know in case you need further information.

Thanks and Regards
Madhanmohan S
Joerg Jooss - 26 Aug 2004 17:31 GMT
> Hi All,
>        The .NET Framework uses UTF-16 internally for all strings.
> But what we write using Response.Write, i.e. the response from the
> ASP.NET application, is UTF-8.

Note that this could be any other supported encoding as well. UTF-8 is
simply the default encoding.

> Does a conversion happen in between?
> Please clarify. Please let me know in case you need further
> information.

Yes, the Unicode code points are translated into byte sequences of the
configured character encoding. This means you will lose any characters
that cannot be represented by the target character encoding. For example,
you cannot encode my first name "Jörg" in US-ASCII, since 'ö' isn't an ASCII
character. Thus, you'll get "Jrg" (please don't ;->).
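
A minimal sketch of that lossy conversion, written in Python purely for illustration (this thread is about ASP.NET, whose encoders may behave differently, for instance by substituting '?' instead of dropping the character). The variable names are made up for the example.

```python
# "J\u00f6rg" is "Jörg"; the escape keeps this source file pure ASCII.
name = "J\u00f6rg"

# UTF-8 can represent every Unicode code point, so nothing is lost.
utf8_bytes = name.encode("utf-8")

# US-ASCII has no 'ö'. What you get depends on the error policy you pick.
dropped = name.encode("ascii", errors="ignore")    # silently drops the character
replaced = name.encode("ascii", errors="replace")  # substitutes '?'

print(utf8_bytes)  # b'J\xc3\xb6rg'
print(dropped)     # b'Jrg'
print(replaced)    # b'J?rg'
```

Either way the original character cannot be recovered from the ASCII bytes.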

Cheers,

Joerg Jooss
joerg.jooss@gmx.net

Madhanmohan S - 27 Aug 2004 06:11 GMT
Hi Joerg,
       Thank you very much for your help.
   I have some more questions:
1. Can all UTF-16 characters be converted to equivalent UTF-8
character(s)?
2. To my knowledge, data transfer on the web for multilingual sites happens
using the UTF-8 character set. Am I correct? If yes, will there be a
conversion on the server side for the request data?
3. I tried using UTF-16 in my application for request and response encoding
via the web.config file. It kept looping the request and response
continuously. Is this a problem or limitation of IE, or a limitation of the
protocol?

       Basically, we are converting all the XML files in our application to
UTF-16, so we want to make sure that things work smoothly. I got more
confused after reading two articles, one from Microsoft and another from
Unicode:
       1. http://www.unicode.org/notes/tn12/ - Supports UTF-16 for all
data
       2. http://www.microsoft.com/globaldev/getWR/nwr/nwrpartVI.mspx -
Supports UTF-8.
   After going through some more articles, I ended up with all these questions.
Please help me.

Thanks and Regards
Madhanmohan S

> > Hi All,
> >        The .NET Framework uses UTF-16 internally for all strings.
[quoted text clipped - 15 lines]
>
> Cheers,
Joerg Jooss - 27 Aug 2004 08:48 GMT
> Hi Joerg,
>        Thank you very much for your help.
>    I have some more questions:
> 1. Can all UTF-16 characters be converted to equivalent UTF-8
> character(s)?

UTF means Unicode Transformation Format -- it is an algorithm to translate a
Unicode code point to a sequence of bytes. UTF-8 and UTF-16 are simply
different algorithms, but they can both represent any (16 bit) Unicode
character. Thus, you cannot lose characters when converting between the two.
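
To make the "no loss" claim concrete, here is a small round-trip sketch in Python (used only as a neutral illustration, not the .NET API). The sample string is an assumption chosen for the example; it deliberately includes one code point outside the 16-bit range.

```python
# Sample text: Latin with an umlaut, Japanese, the Euro sign, and
# U+1D11E (MUSICAL SYMBOL G CLEF), which lies outside the 16-bit range.
text = "J\u00f6rg \u65e5\u672c\u8a9e \u20ac \U0001D11E"

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")  # little-endian, no byte order mark

# Convert UTF-16 to UTF-8 by decoding to code points and re-encoding.
converted = utf16_bytes.decode("utf-16-le").encode("utf-8")

# Both routes yield the same bytes, and decoding restores the original text.
assert converted == utf8_bytes
assert utf8_bytes.decode("utf-8") == text

# The G clef needs 4 bytes in either encoding (a surrogate pair in UTF-16).
clef = "\U0001D11E"
print(len(clef.encode("utf-8")), len(clef.encode("utf-16-le")))  # 4 4
```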

> 2. To my knowledge, data transfer on the web for multilingual sites
> happens using the UTF-8 character set. Am I correct?

Generally speaking, no. It's completely up to the web site's authors,
operators, or developers to choose which character encoding is used when
delivering web content. A well-behaved web site announces its character
encoding in the HTTP "Content-Type" header, or at least using an HTML META
tag.

Last year, a colleague and I argued about the use of UTF-8 on the web. He
said that in our project we should stick to ISO-8859-1, or avoid natively
encoded characters altogether. I argued that he was taking too much vacation
on the moon ;-)

Anyway, he was willing to bet that from a list of 20 big web sites, not a
single one would be using UTF-8. So we agreed on a list (Microsoft, IBM,
Intel, CNN, Yahoo, ...). Some time later he returned with a somewhat sour
expression on his face -- on our list, there was actually *one* (and only
one) site using UTF-8. Thanks, Dell ;-)

But even though I won the bet, the result really really surprised me.

> If yes, will there be a
> conversion on the server side for the request data?

Request data is of course subject to character conversion as well. You could
even use different encodings for HTTP requests and responses, but other than
some weird legacy support scenarios, I'd stay away from that.

> 3. I tried using UTF-16 in my application for request and response
> encoding via the web.config file. It kept looping the request and
> response continuously. Is this a problem or limitation of IE, or a
> limitation of the protocol?

I haven't tried this yet, but it should work (both ASP.NET and IE). Did you
try other browsers as well?

>        Basically, we are converting all the XML files in our
> application to UTF-16, so we want to make sure that things work
[quoted text clipped - 6 lines]
>    After going through some more articles, I ended up with all these questions.
> Please help me.

Actually, these documents are not contradictory. The first simply says that
*within* a system, UTF-16 is the best choice for representing Unicode
characters. But this is not the scenario when talking about web
applications. In this case, characters are exchanged *between* systems, and
here UTF-8 is preferable. Just think of an English language web site using
UTF-16. Every character delivered over the web would take two bytes. When
using UTF-8, almost all characters would take only one byte, save for
special cases like the Euro symbol. UTF-16 can become quite expensive ;-)

So for web apps, stick with UTF-8 unless your web application spits out
mostly characters that take more bytes to encode in UTF-8 than in UTF-16.
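
The size trade-off is easy to measure. The following Python sketch (an illustration only; the helper name `encoded_sizes` and the sample strings are made up) counts encoded bytes without a byte order mark.

```python
def encoded_sizes(text):
    """Return (utf8_byte_count, utf16_byte_count) for text, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

english = "Hello, world"            # 12 ASCII characters
euro = "\u20ac"                     # the Euro sign
japanese = "\u65e5\u672c\u8a9e"     # three Japanese characters

for label, sample in [("English", english), ("Euro sign", euro), ("Japanese", japanese)]:
    utf8_len, utf16_len = encoded_sizes(sample)
    print(f"{label}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")

# English: UTF-8 = 12 bytes, UTF-16 = 24 bytes
# Euro sign: UTF-8 = 3 bytes, UTF-16 = 2 bytes
# Japanese: UTF-8 = 9 bytes, UTF-16 = 6 bytes
```

So English text doubles in size under UTF-16, while text made up mostly of Japanese characters is the kind of case where UTF-16 comes out smaller.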

Cheers,

Joerg Jooss
joerg.jooss@gmx.net

Madhanmohan S - 27 Aug 2004 10:10 GMT
> > Hi Joerg,
> >        Thank you very much for your help.
[quoted text clipped - 6 lines]
> different algorithms, but they can both represent any (16 bit) Unicode
> character. Thus, you cannot lose characters when converting between the two.

       I got your point.

> > 2. To my knowledge, data transfer on the web for multilingual sites
> > happens using the UTF-8 character set. Am I correct?
[quoted text clipped - 17 lines]
>
> But even though I won the bet, the result really really surprised me.

     OK, good point. We have to take care of this in our application.
       If we do not specify any character set, I noticed that IE
assumes UTF-8. Am I correct?

> > If yes, will there be a
> > conversion on the server side for the request data?
>
> Request data is of course subject to character conversion as well. You could
> even use different encodings for HTTP requests and responses, but other than
> some weird legacy support scenarios, I'd stay away from that.

   I got your point.

> > 3. I tried using UTF-16 in my application for request and response
> > encoding via the web.config file. It kept looping the request and
[quoted text clipped - 3 lines]
> I haven't tried this yet, but it should work (both ASP.NET and IE). Did you
> try other browsers as well?

       I tried a new web application using UTF-16. It is working fine.
I think there is some problem with my application.
I will check it.
       I have only IE on my machine, so I didn't try any other browser.
One more reason is that my application will be deployed in
a Windows environment only.

> >        Basically, we are converting all the XML files in our
> > application to UTF-16, so we want to make sure that things work
[quoted text clipped - 20 lines]
>
> Cheers,

   I got your point.

   Thank you very much for your help. Happy Coding.
Thanks and Regards
Madhanmohan S
Joerg Jooss - 27 Aug 2004 10:31 GMT
[...]

>> Last year, a colleague and I argued about the use of UTF-8 on the
>> web.
[...]
>> Anyway, he was willing to bet that from a list of 20 big web sites,
>> not a single one would be using UTF-8. So we agreed on a list
[quoted text clipped - 9 lines]
> noticed that IE
> assumes UTF-8. Am I correct?

No, I think IE uses the OS default code page (e.g. Windows-1252) in this
case -- at least here on my machine (XP Pro SP2 using IE 6.0 SP2).

[...]
>        I tried a new web application using UTF-16. It is working
> fine. I think there is some problem with my application.
> I will check it.
>        I have only IE on my machine, so I didn't try any other
> browser. One more reason is that my application will be deployed in
> a Windows environment only.

So it's an intranet web app? That of course makes things a lot easier.

Cheers,

Joerg Jooss
joerg.jooss@gmx.net

Madhanmohan S - 27 Aug 2004 10:50 GMT
> [...]
> >>
[quoted text clipped - 17 lines]
> No, I think IE uses the OS default code page (e.g. Windows-1252) in this
> case -- at least here on my machine (XP Pro SP2 using IE 6.0 SP2).

       You are correct, Joerg. I checked an English-language page of my
application. By default it is Windows-1252. But when I have Japanese
characters in the page and no charset is declared, it
automatically changes to UTF-8!?

> [...]
> >        I tried a new web application using UTF-16. It is working
[quoted text clipped - 5 lines]
>
> So it's an intranet web app? That of course makes things a lot easier.

       Yes, Joerg. It allows us to focus on IE and Microsoft environment
alone.

Thanks and Regards
Madhanmohan S
Mihai N. - 28 Aug 2004 07:56 GMT
Some notes:

>> Unicode code point to a sequence of bytes. UTF-8 and UTF-16 are simply
>> different algorithms, but they can both represent any (16 bit) Unicode
>> character. Thus, you cannot lose characters when converting between the
Even more: they can also represent the supplementary characters beyond the
16-bit range (which UTF-16 encodes as surrogate pairs). The correct term is
"Unicode code point".

>         If we do not specify any character set, I noticed that IE
> assumes UTF-8. Am I correct?
If it does, then it is incorrect. According to the HTTP protocol (RFC 2616),
the default character encoding when the "charset" parameter is missing
is ISO-8859-1. I guess IE tries to be smart rather than standards compliant (as
usual).
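
As a rough sketch of what a standards-following client would do, the Python snippet below reads the charset out of a Content-Type header value and falls back to ISO-8859-1 when none is given, as RFC 2616 describes. The function name `charset_from_content_type` is hypothetical, and reusing the standard library's e-mail header parser for an HTTP header is a shortcut for illustration, not how any browser actually does it.

```python
from email.message import Message

def charset_from_content_type(header_value, default="iso-8859-1"):
    """Return the declared charset in lower case, or the default if absent."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns failobj when there is no charset parameter.
    return msg.get_content_charset(failobj=default)

print(charset_from_content_type("text/html; charset=UTF-8"))  # utf-8
print(charset_from_content_type("text/html"))                 # iso-8859-1
```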

>> > 3. I tried using UTF-16 in my application for request and response
>> > encoding via the web.config file. It kept looping the request and
[quoted text clipped - 4 lines]
> you
>> try other browsers as well?
If you want to also support other browsers, go with utf8. Costs you nothing.

>         I have only IE on my machine, so I didn't try any other browser.
> One more reason is that my application will be deployed in
> a Windows environment only.
There are other browsers for Windows. I did use Opera and now Firefox.
And if a web store does not work with my browser, I just go somewhere else.

>> >        Basically, we are converting all the XML files in our
>> > application to UTF-16.
These are 100% equivalent. I see no reason to spend the time doing it.

Mihai
-------------------------
Replace _year_ with _ to get the real email

Madhanmohan S - 29 Aug 2004 10:49 GMT
Thanks, Mihai

Regards
Madhanmohan S

> Some notes:
>
[quoted text clipped - 30 lines]
> >> > application to UTF-16.
> These are 100% equivalent. I see no reason to spend the time doing it.
Shawn Steele [MS] - 09 Sep 2004 17:58 GMT
> From: "Joerg Jooss" <joerg.jooss@gmx.net>
> Anyway, he was willing to bet that from a list of 20 big web sites, not a
> single one would be using UTF-8. So we agreed on a list (Microsoft, IBM,
> Intel, CNN, Yahoo, ...). Some time later he returned with a somewhat sour
> expression on his face -- on our list, there was actually *one* (and only
> one) site using UTF-8. Thanks, Dell ;-)

Umm, our sites use UTF-8 :-)

For web apps I usually recommend UTF-8.  As Joerg pointed out, it is just
another way of representing the UTF-16 code points so there is no loss in
the conversion.  

> From: "Madhanmohan S" <ermadhan@hotmail.com>
>
>> No, I think IE uses the OS default code page (e.g. Windows-1252) in this
>> case -- at least here on my machine (XP Pro SP2 using IE 6.0 SP2).

>        You are correct, Joerg. I checked an English-language page of my
> application. By default it is Windows-1252. But when I have Japanese
> characters in the page and no charset is declared, it
> automatically changes to UTF-8!?

IE tries to figure out what the encoding is based on the page's content.
Hopefully the charset is specified in the header; however, if the encoding
hasn't been declared, then IE makes its best guess. Many code pages are
difficult to distinguish, however, so sometimes it guesses wrong. If the
page has Japanese characters, the result depends on the encoding of the page.
UTF-8 (recommended), Shift-JIS and ISO-2022-JP are some of the possible choices.

It is best if the web site author explicitly uses UTF-8 because it is the
least likely to cause confusion, and if the HTTP headers are
set to match the encoding of the page (hopefully UTF-8, but if not,
make sure it's being declared correctly). It's also a really good
idea to make sure the pages are sending the UTF-8 byte order mark at the
beginning of the page as that pretty much guarantees to the browser that
the correct encoding is being used.

We recommend UTF-8 for all web pages, even for US web sites. It's supported
by just about everything, there's no ambiguity in the meaning (many
different versions/implementations exist for ISO-2022-JP and other code
pages), and it's really easy to detect with a proper byte order mark, even
if the rest of the header stuff gets mangled.
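
To show why a byte order mark is so easy to detect, here is a short Python sketch (for illustration only; it says nothing about how IE itself sniffs encodings). The function name `sniff_bom` and the sample page are assumptions made up for the example.

```python
import codecs

def sniff_bom(data):
    """Return the encoding implied by a leading byte order mark, or None."""
    # The 4-byte UTF-32 marks are checked first because the UTF-32 LE mark
    # begins with the same two bytes as the UTF-16 LE mark.
    candidates = [
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]
    for bom, name in candidates:
        if data.startswith(bom):
            return name
    return None

# A page that starts with the three-byte UTF-8 mark (EF BB BF).
page = codecs.BOM_UTF8 + "<html>...</html>".encode("utf-8")

print(sniff_bom(page))       # utf-8
print(sniff_bom(b"<html>"))  # None
```

Without a mark the function can only say "unknown", which is where a declared charset or content-based guessing has to take over.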

- Shawn

Shawn Steele
Software Design Engineer
Windows International
.Net Framework CLR

This posting is provided "AS IS" with no warranties, and confers no rights.
Use of included script samples are subject to the terms specified at
http://www.microsoft.com/info/cpyright.htm

Note:  For the benefit of the community-at-large, all responses to this
message are best directed to the newsgroup/thread from which they
originated.  
--------------------

Joerg Jooss - 09 Sep 2004 19:31 GMT
>> From: "Joerg Jooss" <joerg.jooss@gmx.net>
>> Anyway, he was willing to bet that from a list of 20 big web sites,
[quoted text clipped - 5 lines]
>
> Umm, our sites use UTF-8 :-)

I'm pretty sure MS was on our list, but all that happened last December --  
the other guy is from the Linux camp and would have really hated losing
because of MS ;-)

Cheers,

Joerg Jooss
joerg.jooss@gmx.net

Mihai N. - 10 Sep 2004 09:41 GMT
> It's also a really good
> idea to make sure the pages are sending the UTF-8 byte order mark at the
> beginning of the page as that pretty much guarantees to the browser that
> the correct encoding is being used.
I agree 100% with everything you recommend, except for this one.
Please see http://www.unicode.org/faq/utf_bom.html#BOM
And more precisely: "Where the precise type of the data stream is known (e.g.
Unicode big-endian or Unicode little-endian), the BOM should not be used"

And since there is a standard meta tag to specify the encoding of an
html document, there is really no need to put a BOM there.
And I don't think "put it just in case" is a good answer :-)

Mihai
-------------------------
Replace _year_ with _ to get the real email

Shawn Steele [MS] - 10 Sep 2004 18:12 GMT
> I agree 100% with everything you recommend, except for this one.
> Please see http://www.unicode.org/faq/utf_bom.html#BOM
> And more precisely: "Where the precise type of the data stream is known (e.g.
> Unicode big-endian or Unicode little-endian), the BOM should not be used"

In theory I think that's reasonable.  In practice if you don't have a BOM
and then open the file in a text editor, or if your content type was set
with real HTTP headers instead of http-equiv embedded in the file, then the
data stream type can be lost and lead to corruption.  I find it easier to
maintain a server if the pages have a BOM.  The downside is that you can't
just blindly concatenate files with a BOM.

Quickly looking at a few sites, it seems that some don't have a BOM (Google,
Yahoo) and some do (Microsoft, Dell).  (Why does CNN use Latin-1? And ZDNet
doesn't even declare its encoding?)

Anyway the important part is to declare what you're using.

- Shawn

Shawn Steele
Software Design Engineer
Windows International
.Net Framework CLR
Mihai N. - 11 Sep 2004 07:26 GMT
> In practice if you don't have a BOM
> and then open the file in a text editor
> ...
> then the data stream type can be lost and lead to corruption.
This is true only if you use notepad. A decent editor should ask me
what the code page is and, in any case, not add the BOM without asking,
if the original file didn't have one.
Notepad is nice and handy for utf8, but not really an html editing tool.

> Anyway the important part is to declare what you're using.
Should I understand that respecting standards is not?

In general, convenient or not, if there is a standard, it should be followed.

Mihai
-------------------------
Replace _year_ with _ to get the real email

Michael (michka) Kaplan [MS] - 12 Sep 2004 17:39 GMT
"Mihai N." <nmihai_year_2000@yahoo.com> wrote...

> Notepad is nice and handy for utf8, but not really an html editing tool.

It's still my #1 HTML-editing tool, and I know I am not alone.... :-)

> Should I understand that respecting standards is not?

The FAQ pointer you give is not a normative part of any standard, it is a
suggested best practice, given that there are text editors that trip on that
BOM. But given the time it takes to check a whole file, it is a handy
shortcut to tagging text (and I doubt apps like Notepad would ever be
enhanced to start reading charset tags!).

> In general, convenient or not, if there is a standard, it should be followed.

Yes -- when there is a standard. But we are not dealing with standards here,
we are dealing mostly with people liking notepad so much that they use it in
places beyond simple text files....

MichKa [MS]
NLS Collation/Locale/Keyboard Development
Globalization Infrastructure and Font Technologies
Windows International Division

This posting is provided "AS IS" with
no warranties, and confers no rights.

Shawn Steele [MS] - 13 Sep 2004 20:47 GMT
> In general, convenient or not, if there is a standard, it should be followed.

Certainly, although Michael points out that this isn't strictly required by
the standard.  In fact the standard says "Its (the BOM) usage at the
beginning of a UTF-8 data stream is neither required nor recommended by the
Unicode Standard, but its presence does not affect conformance to the UTF-8
encoding scheme.... (The UTF-8 BOM) can be taken as near-certain indication
that the data stream is using the UTF-8 encoding scheme."  (3.10 D39
Unicode 4.0)

The wording of that FAQ is a little bit odd.  Interpreted literally I could
suppose that an actual HTTP header would declare the "precise type" of the
data stream, however having an HTTP-EQUIV in the middle of a data stream
would be less "precise".  Also I don't see any language in the actual
standard that says "MUST NOT" be used, rather it seems the other way
around.  Additionally Unicode 4.0 has deprecated use of the BOM for
anything other than use as a BOM.

My point is that the information may not always be available and, in fact,
with the HTTP-EQUIV tag, the information cannot be known until after at
least some of the file has been read (<HTML><HEAD>). Even that might not
be available (although I'd recommend including it) if, for example, you
have a dozen authors working on a site, some independent or outsourced,
who might not all be using the same tools or processes.

It's great to require that one's authors/tools use the appropriate processes
and tag(s) so that this information is always known; however, other shops
may not, and therefore they might need to use the BOM.  I've seen enough
problems with code page misinterpretation that I'd rather be cautious and
include the BOM.

- Shawn

Shawn Steele
Software Design Engineer
Windows International
.Net Framework CLR
Mihai N. - 14 Sep 2004 10:38 GMT
> The wording of that FAQ is a little bit odd.  Interpreted literally I could
> suppose that an actual HTTP header would declare the "precise type" of the
[quoted text clipped - 3 lines]
> around.  Additionally Unicode 4.0 has deprecated use of the BOM for
> anything other than use as a BOM.
I do agree the FAQ is not clear enough.
And indeed it does not say "must not", but it does say "should not"

> It's great to require that one's authors/tools use the appropriate processes
> and tag(s) so that this information is always known; however, other shops
> may not, and therefore they might need to use the BOM.  I've seen enough
> problems with code page misinterpretation that I'd rather be cautious and
> include the BOM.
Here is a contradiction.
If you have no control over the authors/tools and can't ask them to add the
http-equiv, then you don't have enough control to ask them to add the BOM.
You can't even ask them to produce utf8 html files.

Main issue here (but maybe this is the wrong newsgroup to raise it):
is a BOM at the beginning of the file OK for all browsers on all platforms?
I know it is OK for IE and Windows, and this may be enough for some,
but not for all of us.

Mihai
-------------------------
Replace _year_ with _ to get the real email

Shawn Steele [MS] - 16 Sep 2004 07:12 GMT
> Main issue here (but maybe this is the wrong newsgroup to raise it):
> is a BOM at the beginning of the file OK for all browsers on all platforms?
> I know it is OK for IE and Windows, and this may be enough for some,
> but not for all of us.

Good (& important!) question!

I haven't had problems with a BOM at the beginning of the file, however
most of my experience is with IE, although I've written to other browsers
in the past.  The Unicode spec is pretty clear that a BOM at the beginning
of a file is to be treated as such, so I would hope that other vendors
handled it properly.

I have had difficulty when trying to concatenate files because a dumb
lumping of the data together results in a BOM becoming a zero width
no-break space in the middle of a file, which is annoying.  If you have
server code that grabs headers, footers or other content from other pages
you might run into this issue.  In my case it was some simple scripting I
was doing to generate site wide header files.  In my experience that error
is readily discoverable :-)
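
The concatenation problem can be reproduced in a few lines. This is a Python sketch for illustration only, with made-up fragments and a hypothetical helper `concat_utf8`; it is not the scripting described above.

```python
import codecs

# Two UTF-8 fragments, each saved with its own byte order mark.
header = codecs.BOM_UTF8 + b"<div>header</div>"
body = codecs.BOM_UTF8 + b"<p>body</p>"

# Blindly joining the bytes leaves the second mark in the middle of the text.
# The "utf-8-sig" codec strips only a mark at the very start of the data.
naive = (header + body).decode("utf-8-sig")
print("\ufeff" in naive)  # True: a stray U+FEFF sits between the fragments

def concat_utf8(parts):
    """Decode each part separately so every leading mark is dropped."""
    return "".join(part.decode("utf-8-sig") for part in parts)

clean = concat_utf8([header, body])
print(clean)  # <div>header</div><p>body</p>
```

If the combined page should itself start with a mark, it can be written back out once with the "utf-8-sig" codec.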

- Shawn

Shawn Steele
Software Design Engineer
Windows International
.Net Framework CLR
