.NET Forum / .NET Framework / New Users / February 2007
Cyrillic characters in VS2005
Laurent Bugnion [MVP] - 04 Feb 2007 13:28 GMT Hi,
Not totally on topic for this group, but...
A colleague of mine wants to have the HTML editor in VS2005 display Cyrillic characters. While setting the encoding using a META tag works fine (and the encoding is also displayed accordingly in the document properties in VS2005), the display itself still doesn't show the correct characters.
We reviewed the editor's many options together, but were unable to find anything about encoding. Is that even possible? If yes, how?
Thanks and greetings, Laurent
 Signature Laurent Bugnion [MVP ASP.NET] Software engineering: http://www.galasoft-LB.ch PhotoAlbum: http://www.galasoft-LB.ch/pictures Support children in Calcutta: http://www.calcutta-espoir.ch
Mihai N. - 04 Feb 2007 21:14 GMT
> A colleague of mine wants to have the HTML editor in VS2005 display
> cyrillic characters. While setting the encoding using a META tag works
[quoted text clipped - 4 lines]
> We reviewed together the editor's many options, but were unable to see
> something about encoding. Is that even possible? If yes, how?
There are only two options to support Cyrillic in the Visual Studio editor:
- UTF-8: "File" -> "Advanced Save Options...", then select "Unicode (UTF-8 with signature) - Codepage 65001". You will have to match the encoding in the meta tag.
- Cyrillic - Windows (1251): set the default system locale to Russian and reboot (http://www.mihai-nita.net/20050611a.shtml)
Problems:
1. Although the "Advanced Save Options..." dialog allows many other encodings, the next time the file is opened it is interpreted as being in the ANSI code page (determined by the default system locale). So you can work on a US system and save as "Cyrillic (KOI8-R) - Codepage 20866", but to open the file you will have to go to "File" -> "Open", select the file, click the down-arrow on the "Open" button, click "Open With...", then select "Source Code (Text) Editor With Encoding" and in that list select "Cyrillic (KOI8-R) - Codepage 20866" again. Quite a pain! From what I know there is no way to associate a certain encoding with a certain file, so that you don't have to do this encoding selection every single time. A possible workaround is to set the ANSI code page to what you want (by changing the default system locale), but this means that some encodings cannot be used (for instance Cyrillic 1251 can be the ANSI code page, but KOI8-R cannot).
2. UTF-8 with BOM can be recognized by VS, but my opinion (based on the W3C specs) is that this is not standard. Many browsers will deal with it, but this does not make it right. UTF-8 without BOM might be recognized by VS by "guessing", which means it can fail for some files.
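Both problems can be reproduced outside any editor. What follows is a minimal Python sketch, not a description of how Visual Studio works internally; the sample word and the helper name `sniff_utf8` are made up for illustration. The first part decodes the same KOI8-R bytes under three code pages, the second part shows the difference between recognizing a UTF-8 signature and merely guessing.

```python
import codecs

# Problem 1: the same bytes mean different text under different code pages.
text = "Привет"                          # sample word, chosen for illustration
koi8_bytes = text.encode("koi8_r")       # what "save as KOI8-R" puts on disk

as_koi8 = koi8_bytes.decode("koi8_r")    # the intended interpretation
as_cp1252 = koi8_bytes.decode("cp1252")  # US ANSI code page: Latin junk
as_cp1251 = koi8_bytes.decode("cp1251")  # Russian ANSI code page: wrong Cyrillic letters

# Only the first interpretation round-trips to the original text.
print(as_koi8 == text, as_cp1252 == text, as_cp1251 == text)


# Problem 2: a signature (BOM) can be recognized; anything else is a guess.
def sniff_utf8(data: bytes) -> str:
    """Hypothetical helper: classify bytes as 'utf-8-sig', 'utf-8' or 'unknown'."""
    if data.startswith(codecs.BOM_UTF8):  # the bytes EF BB BF
        return "utf-8-sig"
    try:
        data.decode("utf-8")              # strict decoding raises on invalid sequences
    except UnicodeDecodeError:
        return "unknown"
    return "utf-8"                        # valid UTF-8, but this is only a guess


print(sniff_utf8(codecs.BOM_UTF8 + text.encode("utf-8")))
print(sniff_utf8(text.encode("utf-8")))
print(sniff_utf8(koi8_bytes))
# These two bytes are valid UTF-8 and also two valid cp1251 characters,
# so a guess of "utf-8" here may or may not be what the author meant.
print(sniff_utf8(b"\xd0\x9f"))
```

The last call is the interesting one: short or unlucky byte sequences can be valid in more than one encoding, which is why guessing can fail for some files.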
Long story short:
- VS is not a good editor for stuff in code pages other than ANSI
- VS is not a good editor for HTML
- UTF-8 (no BOM) is a good encoding for HTML pages, no matter the editor
Ok, this is my opinion, I am open for flaming :-)
 Signature Mihai Nita [Microsoft MVP, Windows - SDK] http://www.mihai-nita.net ------------------------------------------ Replace _year_ with _ to get the real email
Laurent Bugnion [MVP] - 04 Feb 2007 21:57 GMT Mihai,
Thanks a lot for your very comprehensive post. Do you have another editor to recommend for HTML with Cyrillic characters (or generally non-ANSI characters)?
Laurent
> There are only two options to support Cyrillic in the VS Studio editor:
> - UTF-8: "File" -> "Advanced Save Options..." then select "Unicode (UTF-8
[quoted text clipped - 33 lines]
>
> Ok, this is my opinion, I am open for flaming :-)
 Signature Laurent Bugnion [MVP ASP.NET] Software engineering, Blog: http://www.galasoft-LB.ch PhotoAlbum: http://www.galasoft-LB.ch/pictures Support children in Calcutta: http://www.calcutta-espoir.ch
Mihai N. - 05 Feb 2007 02:40 GMT
> Thanks a lot for your very comprehensive post. Do you have another
> editor to recommend for HTML with cyrillic characters (or generally non
> ANSI characters)?
Unfortunately, not really :-(
I don't have one preferred editor. I move between Homesite, Dreamweaver, Notepad, Word, and a Notepad clone that I have written (and which supports whatever encoding I want).
Homesite 5.5 = sucks for all but Latin 1
Dreamweaver 6 (MX) = does kind of OK for popular encodings (Latin 1, Japanese, Chinese, Russian), but is bad for others (Hindi, Arabic, and for some reason Korean).
Notepad = OK for UTF-8, since I have a Perl script that automatically removes the BOM before uploading.
Word = OK for everything. I use it as a "smarter Notepad", since in the end I run a macro to convert Word styles to HTML tags and save as Encoded Text (I don't like the HTML produced by Word).
Now, Homesite 5.5 and Dreamweaver 6 are old, so I cannot tell you how the newer versions behave.
I cannot talk about the new MS HTML editors (Expression family), because I have not used them (and FrontPage was a long, long time ago, and I hated it at the time :-) I am not a WYSIWYG type of guy, and I don't write very fancy stuff :-)
Also, I gave up on legacy encodings a while ago (using Unicode only :-), so this saves some of the grief.
In the end, my 2 cents:
First option:
a. Go with whatever editor you like for the features/price, with the only conditions that it should support UTF-8 (VS is also OK for this) and that you can type using the script you need.
b. Write a small script to remove the BOM if the editor wants one, and also to do code page conversion if for some reason UTF-8 is not acceptable (although I don't see any reason to do that).
Second option:
If the preferred editor does not support UTF-8, then try setting the default system locale to the preferred language, or use AppLocale (http://www.microsoft.com/globaldev/tools/apploc.mspx). This is an option only for code pages that can be ANSI code pages (so, for Russian, Windows 1251 is OK, but KOI8-R or MacCyrillic are not).
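Step (b) of the first option could look roughly like the sketch below. It is written in Python rather than the Perl mentioned earlier in the thread, and the function names `strip_utf8_bom` and `convert_code_page` are hypothetical; it operates on bytes already read from a file, so reading and writing the files is left to the caller.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 signature (EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def convert_code_page(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from one code page to another.

    Raises UnicodeEncodeError if a character does not exist in the target.
    """
    return data.decode(source).encode(target)


# What an editor that insists on a signature would save:
page = codecs.BOM_UTF8 + "<p>Привет</p>".encode("utf-8")

clean = strip_utf8_bom(page)                           # ready to upload as UTF-8
as_1251 = convert_code_page(clean, "utf-8", "cp1251")  # only if UTF-8 is not acceptable

print(len(page), len(clean), len(as_1251))
```

Failing loudly on characters that do not fit the target code page is a deliberate choice here: silently replacing them would produce the kind of broken content the conversion is supposed to avoid.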
Laurent Bugnion [MVP] - 05 Feb 2007 06:32 GMT Hi,
> Unfortunately, not really :-(
> I don't have one preferred editor, I am moving between Homesite, Dreamweaver,
> Notepad, Word, and a Notepad clone that I have written (and supports whatever
> encoding I want).
<snip>
Great, thanks, Laurent
Alexey Smirnov - 05 Feb 2007 08:24 GMT
The problem is either in document encoding or in system settings. That is not the problem of the editor, I think.
> Hi,
> [quoted text clipped - 7 lines]
> Great, thanks,
> Laurent
Mihai N. - 06 Feb 2007 06:27 GMT
> The problem is either in document encoding or in system settings.
> That is not the problem of the editor, I think.
I am not sure what you mean. It depends what one expects from an editor.
Encodings used to be a problem some 7 years ago, and the limitations were from the system (although NT was Unicode, Win 2000 was the first version that was really useful for multilingual work).
At this time (2007 :-) I do expect an editor to support any encoding and script I want, if it is running on a post-Win 2000 OS and the support is installed. If it does not, then it is the problem of the editor, in my book.
Joerg Jooss - 06 Feb 2007 22:50 GMT Thus wrote Mihai N.,
>> A colleague of mine wants to have the HTML editor in VS2005 display
>> cyrillic characters. While setting the encoding using a META tag
[quoted text clipped - 19 lines]
> encodings, when the file is opened next time it is interpreted as
> being in the ANSI code page (determined by default system locale).
Which isn't surprising unless VS would store the last applied character encoding for each file -- which wouldn't hurt ;-)
[...]
> 2. UTF-8 with BOM can be recognized by VS, but in my opinion (based on
> the
[quoted text clipped - 4 lines]
> means it
> can fail for some files.
There's nothing non-standard about UTF-8 with a BOM. The only thing you might want to avoid is saving a pure HTML file with a BOM, because some (older) browsers happily include the BOM in the rendered text :-/
But for a compilation unit such as .aspx or .cs, there should be no problem using a BOM. In this case, the build tool (page translator, compiler, etc.) deals with it. The buildtime encoding doesn't need to match the runtime encoding anyway[*], and using UTF-8 as runtime encoding shouldn't produce a BOM in the output stream, unless you really ask for it.
That also means that UTF-16 or UTF-32 work as universal buildtime encodings as well, because these encodings always include a BOM.
[*] That's the technical point of view. I would never ever use a buildtime encoding that can represent more characters than my runtime encoding, because this is an excellent way to introduce broken content...
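The difference between an encoding that writes a signature only when asked and one that writes it by default can be seen with Python's standard codecs. This is an analogy in another runtime, offered as a sketch; it says nothing about the .NET classes or the Visual Studio build tools discussed here.

```python
import codecs

sample = "abc"

plain_utf8 = sample.encode("utf-8")       # no signature
signed_utf8 = sample.encode("utf-8-sig")  # signature requested explicitly
utf16 = sample.encode("utf-16")           # the generic UTF-16 codec prepends a BOM
utf32 = sample.encode("utf-32")           # the generic UTF-32 codec prepends a BOM

print(plain_utf8.startswith(codecs.BOM_UTF8))
print(signed_utf8.startswith(codecs.BOM_UTF8))
print(utf16.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)))
print(utf32.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)))

# A reader that knows about the signature drops it again...
aware = signed_utf8.decode("utf-8-sig")
# ...while a reader that does not keeps U+FEFF as the first character,
# which is how a BOM can end up in rendered text.
unaware = signed_utf8.decode("utf-8")
print(aware == sample, unaware.startswith("\ufeff"))
```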
Cheers,
 Signature Joerg Jooss news-reply@joergjooss.de
Mihai N. - 07 Feb 2007 01:17 GMT
> Which isn't surprising unless VS would store the last applied character
> encoding for each file -- which wouldn't hurt ;-)
The project file could be a great place to store that kind of info. But the fact still remains: VS is not an HTML authoring tool. I like it, it is very strong and can do a decent job for a lot of file formats, but I would not push it too much :-) And in fact, MS does not do it either, this is why they have dedicated HTML tools :-)
> There's nothing non-standard about UTF-8 with a BOM. The only thing you
> might
> want to avoid is saving a pure HTML file with a BOM, because some (older)
> browsers happily include the BOM in the rendered text :-/
When in doubt, I go to the standard :-)
I have no opinion on UTF-8 + BOM in general, but I do have opinions on UTF-8 + BOM in the context of various file formats. For some formats it is good, for some it is not only bad, but non-standard.
One of the good documents is this: http://unicode.org/unicode/faq/utf_bom.html#BOM
<<Where the precise type of the data stream is known, the BOM should not be used.>>
In general, Unicode consistently tries to leave decisions to higher-level protocols. There are clear standard methods to identify the encoding of an HTML page, both as a stand-alone file and as served over HTTP. There is no need for another one. And both the HTML and XML (implying XHTML) standards have clear ways to determine the encoding. The fact that some browsers handle it properly does not mean it is standard.
Quick, off the top of your head, who is the winner here:
- the HTTP header from the server (determined by the server's config) says Content-Type: text/html; charset=ISO-8859-1
- the HTML file has a Content-Type in the head section: <META http-equiv="Content-Type" content="text/html; charset=EUC-JP">
- the HTML file has a UTF-8 BOM at the very beginning (EF BB BF)
- the content itself is in fact Shift-JIS
If you got the winner right, then why (according to what standard)? :-)
> But for a compilation unit such as .aspx or .cs, there should be no problem
> using a BOM. In this case, the build tool (page translator, compiler, etc.)
> deals with it.
> The buildtime encoding doesn't need to match the runtime encoding
> anyway[*], and using UTF-8 as runtime encoding shouldn't produce a BOM in
> the output stream, unless you really ask for it.
Agree. Although these days mixing encodings is just a way to ask for trouble, or to show off (look how cool I am, I can master such a mess :-) There is no reason to use anything other than Unicode. Ten years ago, yes. Now there are still some exceptions, but fewer and fewer.
Joerg Jooss - 08 Feb 2007 20:57 GMT Thus wrote Mihai N.,
>> Which isn't surprising unless VS would store the last applied
>> character encoding for each file -- which wouldn't hurt ;-)
>>
> The project file can be a great place to store that kind of info.
> But, the fact still remains: VS is not a creation HTML tools.
Sure. Go Expression Web :-)
> I like it, is very strong, can do a decent job for a lot of file
> formats,
> but I would not push it too much :-) And in fact, MS does not do it
> either,
> this is why they have dedicated html tools :-)
I really don't think character encoding is an HTML phenomenon, so having such a feature in VS would be useful for everybody ;-)
>> There's nothing non-standard about UTF-8 with a BOM. The only thing
>> you
[quoted text clipped - 11 lines]
> not be
> used.>>
In our case (VS), the precise type of data stream is often unknown.
> In general, Unicode consistently tries to leave decisions to
> higher-levels
[quoted text clipped - 9 lines]
> The fact that some browsers handle it properly does not mean is
> standard.
I don't know how this relates to our topic... I was not talking about standards? VS usually doesn't load source files via HTTP, nor is every source file XML or META tagged. These standards aren't applicable as a whole to a design time environment.
Cheers,
Mihai N. - 09 Feb 2007 08:36 GMT
> I really don't think character encoding is an HTML phenomenon, so having
> such a feature in VS would be useful for everybody ;-)
Nothing against that :-)
> In our case (VS), the precise type of data stream is often unknown.
This is VS's fault (because it is unaware of the HTML ways of specifying encoding).
> VS usually doesn't load source files via HTTP, nor is every source file XML
> or META tagged. These standards aren't applicable as a whole to a design
> time environment.
Well, some of the standards are applicable. And, happily enough, VS respects them. It is a nice surprise! VS respects the meta in the head section!
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Try this:
1. Save the first page from www.yahoo.co.jp. It is encoded using euc-jp.
2. Try opening it in Notepad; you will see junk.
3. Open it in VS; you will see Japanese (if you have Japanese support installed). Change the meta from <meta http-equiv="Content-Type" content="text/html; charset=euc-jp"> to <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> and save.
4. Open it in Notepad again; you will see Japanese. The encoding is utf-8.
So VS will save the HTML according to the meta. Standards-compliant, no need to add a BOM, store the encoding somewhere, or ask every single time. Nice and correct!
So, a better answer for the original question! I have tried setting the proper encoding in the meta and Russian works fine. Tested it with KOI8-R, windows-1251 and utf-8
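The idea of choosing the encoding from the meta tag can be sketched in a few lines of Python. This is a deliberately simplified illustration, not what VS or any browser actually does: the regular expression, the helper names, and the `cp1252` fallback (standing in for "the ANSI code page") are all assumptions, and a charset name that Python does not know would raise `LookupError`.

```python
import re

# Simplified pattern: finds charset=... inside a <meta> tag.
# Real HTML parsers follow a much more detailed algorithm.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def declared_charset(html: bytes, default: str = "cp1252") -> str:
    """Return the charset named in the first <meta> declaration, else the default."""
    match = META_CHARSET.search(html[:4096])  # the declaration sits near the top
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()


def decode_by_meta(html: bytes) -> str:
    """Decode a page using whatever encoding its meta tag declares."""
    return html.decode(declared_charset(html))


template = (
    '<html><head><meta http-equiv="Content-Type" '
    'content="text/html; charset={0}"></head><body>Привет</body></html>'
)

# The three encodings tested in the post above.
for name in ("koi8-r", "windows-1251", "utf-8"):
    raw = template.format(name).encode(name)
    print(name, declared_charset(raw), "Привет" in decode_by_meta(raw))
```

The pattern works on raw bytes because the declaration itself is plain ASCII in all three encodings, so it can be read before the real encoding is known.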
Joerg Jooss - 10 Feb 2007 10:59 GMT Thus wrote Mihai N.,
>> In our case (VS), the precise type of data stream is often unknown.
>>
> This is VS's fault (because it is unaware of the HTML ways of
> specifying encoding).
My point was that there are source files to which HTML specific rules don't apply -- such as C#, VB, JavaScript.
>> VS usually doesn't load source files via HTTP, nor is every source
>> file XML or META tagged. These standards aren't applicable as a whole
[quoted text clipped - 4 lines]
> VS respects the meta in the head section!
> <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
I know, although in some quick tests VS didn't keep file encoding and META tag synchronized all the time. I don't create a lot of plain HTML content, so I don't have any insight how reliable that feature is.
> Try this:
> 1. save the first page from www.yahoo.co.jp It is encoded using
[quoted text clipped - 13 lines]
> somewhere,
> or ask every single time. Nice and correct!
Mihai, fire up your favorite hex editor and check the first three bytes of the file: EF BB BF here. ;-)
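For anyone without a hex editor at hand, the same check takes a few lines of Python. This is a small sketch; the helper names `leading_bytes_hex` and `file_has_utf8_signature` are made up for illustration, and the path passed to the second one would be whatever file was saved from VS.

```python
import codecs


def leading_bytes_hex(data: bytes, count: int = 3) -> str:
    """Format the first few bytes the way a hex editor would show them."""
    return " ".join(f"{byte:02X}" for byte in data[:count])


def file_has_utf8_signature(path: str) -> bool:
    """Return True if the file starts with the UTF-8 signature EF BB BF."""
    with open(path, "rb") as handle:      # binary mode: no decoding involved
        return handle.read(3) == codecs.BOM_UTF8


with_signature = codecs.BOM_UTF8 + b"<html>"
without_signature = b"<html>"

print(leading_bytes_hex(with_signature))     # EF BB BF
print(leading_bytes_hex(without_signature))  # 3C 68 74
```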
Cheers,
Mihai N. - 10 Feb 2007 20:41 GMT
> My point was that there are source files to which HTML specific rules don't
> apply -- such as C#, VB, JavaScript.
Ah! I have no problem with this. As stated somewhere, "I do have opinions on UTF-8 + BOM in the context of various file formats." So my opinion on the BOM depends on the file format. It is *no* for html/xml, but might be yes for C#, VB, JavaScript. And, in fact, C# & VB are MS formats, so if they decide to use 3 BOMs at the end to identify the encoding, I have no problem with it (but I will say "WTF?" :-)
Ok, joking aside, I think BOM in C# and VB files is a good thing. I need to think a bit more about JavaScript.
> Mihai, fire up your favorite hex editor and check the first three bytes of
> the file: EF BB BF here. ;-)
Damn! Me no like it :-)