
.NET Forum / .NET Framework / New Users / February 2007


Cyrillic characters in VS2005

Laurent Bugnion [MVP] - 04 Feb 2007 13:28 GMT
Hi,

Not totally on topic for this group, but...

A colleague of mine wants to have the HTML editor in VS2005 display
cyrillic characters. While setting the encoding using a META tag works
fine (and the encoding is also displayed accordingly in the document
properties in VS2005), the display itself still doesn't show correct
characters.

We reviewed the editor's many options together, but were unable to find
anything about encoding. Is that even possible? If yes, how?

Thanks and greetings,
Laurent
Signature

Laurent Bugnion [MVP ASP.NET]
Software engineering: http://www.galasoft-LB.ch
PhotoAlbum: http://www.galasoft-LB.ch/pictures
Support children in Calcutta: http://www.calcutta-espoir.ch

Mihai N. - 04 Feb 2007 21:14 GMT
> A colleague of mine wants to have the HTML editor in VS2005 display
> cyrillic characters. While setting the encoding using a META tag works
[quoted text clipped - 4 lines]
> We reviewed together the editor's many options, but were unable to see
> something about encoding. Is that even possible? If yes, how?
There are only two options to support Cyrillic in the Visual Studio editor:
- UTF-8: "File" -> "Advanced Save Options..." then select "Unicode (UTF-8
with signature) - Codepage 65001".
You will also have to make the META tag match.
- Cyrillic - Windows (1251): set the default system locale to Russian
and reboot (http://www.mihai-nita.net/20050611a.shtml)

Problems:

1. Although the "Advanced Save Options..." option allows for many other
encodings, when the file is opened next time it is interpreted as
being in the ANSI code page (determined by default system locale).
So you can work on a US system, save as "Cyrillic (KOI8-R) - Codepage 20866",
but when you open the file you will have to say "File" -> "Open", select the
file, click the down-arrow on the "Open" button, click "Open With...", then
select "Source Code (Text) Editor With Encoding" and in that list select
again "Cyrillic (KOI8-R) - Codepage 20866". Quite a pain!
As far as I know, there is no way to associate a certain encoding with a
certain file, so that you don't have to make this encoding selection every
single time.
A possible workaround is to set the ANSI code page to what you want
(by changing the default system locale), but this means that some encodings
cannot be used (for instance Cyrillic 1251 can be the ANSI code page, but
KOI8-R cannot).

2. UTF-8 with BOM can be recognized by VS, but my opinion (based on the
W3C specs) is that this is not standard. Many browsers will deal with it,
but this does not make it right.
UTF-8 without BOM might be recognized by VS by "guessing", which means it
can fail for some files.
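
To make the two problems above concrete, here is a minimal Python sketch. It is not a description of how VS actually detects encodings: the function name guess_decode and the three-step fallback are assumptions made for illustration. It checks for a UTF-8 BOM, then tries a strict UTF-8 decode as the "guess", and finally falls back to the ANSI code page, which is where KOI8-R bytes turn into junk.

```python
import codecs
from typing import Tuple


def guess_decode(data: bytes, ansi_codepage: str = "cp1252") -> Tuple[str, str]:
    """Return (text, how_it_was_decoded): BOM first, UTF-8 guess, then ANSI."""
    if data.startswith(codecs.BOM_UTF8):
        # The signature EF BB BF identifies the file unambiguously.
        return data[len(codecs.BOM_UTF8):].decode("utf-8"), "utf-8 (BOM)"
    try:
        # No signature: "guess" by checking whether the bytes are valid UTF-8.
        return data.decode("utf-8"), "utf-8 (guessed)"
    except UnicodeDecodeError:
        # Otherwise assume the ANSI code page of the system locale.
        return data.decode(ansi_codepage, errors="replace"), ansi_codepage


word = "Привет"
koi8_bytes = word.encode("koi8_r")

# On a system whose ANSI code page is Windows-1251, the KOI8-R file is
# misread: the bytes are valid cp1251, but they spell different letters.
text, used = guess_decode(koi8_bytes, ansi_codepage="cp1251")
```

The guess only helps when the bytes happen to be invalid UTF-8. A pure-ASCII file, for instance, passes the UTF-8 check no matter which legacy code page it was saved in.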

Long story short:
- VS is not a good editor for stuff in code pages other than ANSI
- VS is not a good editor for HTML
- UTF-8 (no BOM) is a good encoding for HTML pages, no matter the editor

Ok, this is my opinion, I am open for flaming :-)

Signature

Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

Laurent Bugnion [MVP] - 04 Feb 2007 21:57 GMT
Mihai,

Thanks a lot for your very comprehensive post. Do you have another
editor to recommend for HTML with cyrillic characters (or generally non
ANSI characters)?

Laurent

> There are only two options to support Cyrillic in the Visual Studio editor:
>  - UTF-8: "File" -> "Advanced Save Options..." then select "Unicode (UTF-8
[quoted text clipped - 33 lines]
>
> Ok, this is my opinion, I am open for flaming :-)

Signature

Laurent Bugnion [MVP ASP.NET]
Software engineering, Blog: http://www.galasoft-LB.ch
PhotoAlbum: http://www.galasoft-LB.ch/pictures
Support children in Calcutta: http://www.calcutta-espoir.ch

Mihai N. - 05 Feb 2007 02:40 GMT
> Thanks a lot for your very comprehensive post. Do you have another
> editor to recommend for HTML with cyrillic characters (or generally non
> ANSI characters)?

Unfortunately, not really :-(
I don't have one preferred editor, I am moving between Homesite, Dreamweaver,
Notepad, Word, and a Notepad clone that I have written (and that supports
whatever encoding I want).

Homesite 5.5 = sucks for all but Latin 1
Dreamweaver 6 (MX) = does kind of ok for popular encodings (Latin 1,
Japanese, Chinese, Russian), but bad for others (Hindi, Arabic, for some
reason Korean).
Notepad = ok for UTF-8, since I have a Perl script automatically removing the  
BOM before uploading.
Word = ok for everything. I am using it as a "smarter Notepad", since in the
end I run a macro to convert Word styles to html tags and save as Encoded
text (I don't like the HTML produced by Word)

Now, Homesite 5.5 and Dreamweaver 6 are old, so I cannot tell you how the
newer versions behave.

I cannot talk about the new MS HTML editors (Expression family), because I
did not use them (and FrontPage was a long, long time ago, and I hated it at
the time :-)
I am not a WYSIWYG type of guy, and I don't write very fancy stuff :-)

Also, I have given up encodings a while ago (using Unicode only :-), so this
saves some of the grief.

In the end, my 2 cents:

First option:
a. Go with whatever editor you like for the features/price, with the
only condition that it should support UTF-8 (VS is also ok for this)
and that you can type using the script you need
b. Write a small script to remove the BOM if the editor wants one, and
also to do code page conversion, if for some reason UTF-8 is not acceptable
(although I don't see any reason to do that)
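
As a rough idea of what the small script in step b could look like, here is a sketch in Python (the post itself mentions a Perl script; this is a stand-in, and the function names are made up for the example). It strips a leading UTF-8 BOM and, if needed, re-encodes the bytes into another code page.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def convert_codepage(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from one code page to another.

    Raises UnicodeEncodeError if the target cannot represent a character,
    so lossy conversions fail loudly instead of silently writing '?'.
    """
    return data.decode(source).encode(target)


# In-memory stand-in for a file saved as "UTF-8 with signature".
sample = codecs.BOM_UTF8 + "<p>Привет</p>".encode("utf-8")

clean = strip_utf8_bom(sample)                         # ready to upload as UTF-8
as_1251 = convert_codepage(clean, "utf-8", "cp1251")   # only if UTF-8 is not acceptable
```

For real files, read and write in binary mode (open(path, "rb") and open(path, "wb")) so that Python does not apply any encoding of its own along the way.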

Second option:
If the preferred editor does not support UTF-8, then try setting the default
system locale to the preferred language, or use AppLocale
(http://www.microsoft.com/globaldev/tools/apploc.mspx)
This is an option only for code pages that can be ANSI code pages
(so, for Russian, Windows 1251 is ok, but KOI8-R or MacCyrillic are not)

Signature

Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

Laurent Bugnion [MVP] - 05 Feb 2007 06:32 GMT
Hi,

> Unfortunately, not really :-(
> I don't have one preferred editor, I am moving between Homesite, Dreamweaver,
> Notepad, Word, and a Notepad clone that I have written (and supports whatever
> encoding I want).

<snip>

Great, thanks,
Laurent
Signature

Laurent Bugnion [MVP ASP.NET]
Software engineering, Blog: http://www.galasoft-LB.ch
PhotoAlbum: http://www.galasoft-LB.ch/pictures
Support children in Calcutta: http://www.calcutta-espoir.ch

Alexey Smirnov - 05 Feb 2007 08:24 GMT
The problem is either in document encoding or in system settings.
That is not the problem of the editor, I think.

> Hi,
>
[quoted text clipped - 7 lines]
> Great, thanks,
> Laurent
Mihai N. - 06 Feb 2007 06:27 GMT
> The problem is either in document encoding or in system settings.
> That is not the problem of the editor, I think.

I am not sure what you mean.
It depends what one expects from an editor.

Encodings used to be a problem some 7 years ago, and the limitations were
from the system (although NT was Unicode, Win 2000 was the first version that
was really useful for multilingual work).

At this time (2007 :-) I do expect an editor to support any encoding
and script I want, if it is running on a post-Win 2000 OS and the support
is installed.
If it does not, then it is the problem of the editor, in my book.

Signature

Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

Joerg Jooss - 06 Feb 2007 22:50 GMT
Thus wrote Mihai N.,

>> A colleague of mine wants to have the HTML editor in VS2005 display
>> cyrillic characters. While setting the encoding using a META tag
[quoted text clipped - 19 lines]
> encodings, when the file is opened next time it is interpreted as
> being in the ANSI code page (determined by default system locale).

Which isn't surprising, unless VS stored the last applied character encoding
for each file -- which wouldn't hurt ;-)

[...]
> 2. UTF-8 with BOM can be recognized by VS, but in my opinion (based on
> the
[quoted text clipped - 4 lines]
> means it
> can fail for some files.

There's nothing non-standard about UTF-8 with a BOM. The only thing you might
want to avoid is saving a pure HTML file with a BOM, because some (older)
browsers happily include the BOM in the rendered text :-/

But for a compilation unit such as .aspx or .cs, there should be no problem
using a BOM. In this case, the build tool (page translator, compiler, etc.)
deals with it. The buildtime encoding doesn't need to match the runtime encoding
anyway[*], and using UTF-8 as runtime encoding shouldn't produce a BOM in
the output stream, unless you really ask for it.

That also means that UTF-16 or UTF-32 work as universal buildtime encodings
as well, because these encodings always include a BOM.
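
The "unless you really ask for it" behaviour can be seen in a few lines of Python. This uses Python's built-in codecs as an analogy only; it says nothing about how ASP.NET or the C# compiler are implemented.

```python
import codecs

text = "abc"

utf8_plain = text.encode("utf-8")      # plain UTF-8: no BOM is written
utf8_sig = text.encode("utf-8-sig")    # BOM only because we asked for it
utf16 = text.encode("utf-16")          # generic UTF-16 codec writes a BOM
utf32 = text.encode("utf-32")          # generic UTF-32 codec writes a BOM

# The endian-specific variants skip the BOM, because the byte order is
# already fixed by the codec name.
utf16_le = text.encode("utf-16-le")
```

In this model the BOM is a property of the codec you pick, not of UTF-8 itself, which matches the point that a UTF-8 output stream does not have to start with EF BB BF.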

[*] That's the technical point of view. I would never ever use a buildtime
encoding that can represent more characters than my runtime encoding, because
this is an excellent way to introduce broken content...  
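
A minimal Python sketch of the footnote's warning, assuming a source authored in Unicode and a runtime encoding of ISO-8859-1 (both choices are illustrative):

```python
source_text = "Привет, мир"   # content authored in a Unicode buildtime encoding

# Strict conversion: the runtime encoding has no Cyrillic, so this fails.
try:
    source_text.encode("iso-8859-1")
    lossless = True
except UnicodeEncodeError:
    lossless = False

# Lenient conversion: nothing fails, but every Cyrillic letter becomes "?".
degraded = source_text.encode("iso-8859-1", errors="replace")
```

The lenient path is the dangerous one, because the page still renders and the damage only shows up when someone reads the content.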

Cheers,
Signature

Joerg Jooss
news-reply@joergjooss.de

Mihai N. - 07 Feb 2007 01:17 GMT
> Which isn't surprising unless VS would store the last applied character
> encoding for each file -- which wouldn't hurt ;-)
The project file would be a great place to store that kind of info.
But the fact still remains: VS is not an HTML creation tool.
I like it, it is very strong, and it can do a decent job for a lot of file
formats, but I would not push it too much :-) And in fact, MS does not do it
either, this is why they have dedicated HTML tools :-)

> There's nothing non-standard about UTF-8 with a BOM. The only thing you
> might
> want to avoid is saving a pure HTML file with a BOM, because some (older)
> browsers happily include the BOM in the rendered text :-/
When in doubt, I go to the standard :-)
I have no opinion on UTF-8 + BOM in general, but I do have opinions on
UTF-8 + BOM in the context of various file formats.
For some formats it is good, for some it is not only bad, but non-standard.
One of the good documents is this:
http://unicode.org/unicode/faq/utf_bom.html#BOM
<<Where the precise type of the data stream is known, the BOM should not be
used.>>
In general, Unicode consistently tries to leave decisions to higher-level
protocols.
There are clear standard methods to identify the encoding of an HTML page,
both as stand-alone file, and as served over HTTP. There is no need for
another one.
And both the HTML and XML (implying XHTML) standards have clear ways to
determine the encoding.
The fact that some browsers handle it properly does not mean it is standard.

Quick, off the top of your head, who is the winner here:
- the http header from the server (determined by the server's config) says
    Content-Type: text/html; charset=ISO-8859-1
- the html file has a Content-Type in the head section
    <META http-equiv="Content-Type" content="text/html; charset=EUC-JP">
- the html file has a UTF-8 BOM at the very beginning (EF BB BF)
- the content itself is in fact Shift-JIS

If you got the winner right, then why (according to what standard)? :-)
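
For reference, HTML 4.01 (section 5.2.2) lists the priorities a user agent must observe when character-encoding declarations conflict: the charset parameter of the HTTP Content-Type header first, then the META declaration, then a charset attribute on an element that points at an external resource. A BOM is not on that list. The Python sketch below encodes only the first two rules. The function name is hypothetical, and this illustrates that reading of the spec, not any browser's actual algorithm.

```python
from typing import Optional


def declared_encoding(http_charset: Optional[str],
                      meta_charset: Optional[str]) -> Optional[str]:
    """Pick the declared encoding using the HTML 4.01 priority order."""
    for candidate in (http_charset, meta_charset):
        if candidate:
            return candidate.strip().lower()
    return None  # HTML 4.01 tells user agents not to assume a default


# The quiz scenario: header says ISO-8859-1, META says EUC-JP,
# and the bytes are really Shift-JIS.
winner = declared_encoding("ISO-8859-1", "EUC-JP")

actual_bytes = "日本語".encode("shift_jis")
rendered = actual_bytes.decode(winner)   # decodes without error, but as junk
```

Whichever declaration wins, the page in the quiz still renders as junk, because none of the three declarations matches the real encoding of the content.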

> But for a compilation unit such as .aspx or .cs, there should be no problem
> using a BOM. In this case, the build tool (page translator, compiler, etc.)
> deals with it.
> The buildtime encoding doesn't need to match the runtime encoding
> anyway[*], and using UTF-8 as runtime encoding shouldn't produce a BOM in
> the output stream, unless you really ask for it.
Agree. Although these days mixing encodings is just a way to ask for trouble,
or show off (look how cool I am, I can master such a mess :-)
There is no reason to be anything other than Unicode. Ten years ago, yes.
Now there are still some exceptions, but fewer and fewer.

Signature

Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

Joerg Jooss - 08 Feb 2007 20:57 GMT
Thus wrote Mihai N.,

>> Which isn't surprising unless VS would store the last applied
>> character encoding for each file -- which wouldn't hurt ;-)
>>
> The project file can be a great place to store that kind of info.
> But, the fact still remains: VS is not an HTML creation tool.

Sure. Go Expression Web :-)

> I like it, it is very strong, can do a decent job for a lot of file
> formats,
> but I would not push it too much :-) And in fact, MS does not do it
> either,
> this is why they have dedicated html tools :-)

I really don't think character encoding is an HTML phenomenon, so having
such a feature in VS would be useful for everybody ;-)

>> There's nothing non-standard about UTF-8 with a BOM. The only thing
>> you
[quoted text clipped - 11 lines]
> not be
> used.>>

In our case (VS), the precise type of data stream is often unknown.

> In general, Unicode consistently tries to leave decisions to
> higher-levels
[quoted text clipped - 9 lines]
> The fact that some browsers handle it properly does not mean it is
> standard.

I don't know how this relates to our topic... I was not talking about standards?
VS usually doesn't load source files via HTTP, nor is every source file XML
or META tagged. These standards aren't applicable as a whole to a design
time environment.

Cheers,
Signature

Joerg Jooss
news-reply@joergjooss.de

Mihai N. - 09 Feb 2007 08:36 GMT
> I really don't think character encoding is an HTML phenomenon, so having
> such a feature in VS would be useful for everybody ;-)
Nothing against :-)

> In our case (VS), the precise type of data stream is often unknown.
This is VS's fault (because it is unaware of the HTML ways of specifying
encoding).

> VS usually doesn't load source files via HTTP, nor is every source file XML
> or META tagged. These standards aren't applicable as a whole to a design
> time environment.
Well, some of the standards are applicable.
And, happily enough VS respects them. It is a nice surprise!
VS respects the meta in the head section!
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Try this:
1. Save the first page from www.yahoo.co.jp. It is encoded using euc-jp.
2. Try opening it in Notepad, you will see junk.
3. Open it in VS, you see Japanese (if you have Japanese support installed).
Change the meta from <meta http-equiv="Content-Type" content="text/html;
charset=euc-jp"> to <meta http-equiv="Content-Type" content="text/html;
charset=utf-8"> and save.
4. Open it in Notepad again, you will see Japanese. The encoding is utf-8.

So VS will save the HTML according to the meta.
Standards-compliant, no need to add a BOM, store the encoding somewhere,
or ask every single time. Nice and correct!
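
The idea of letting the meta drive the encoding can be sketched in Python as follows. This is an assumption-laden illustration, not VS's implementation: the regular expression is deliberately simplistic, the function names are invented, and only the first 1024 bytes are searched.

```python
import re
from typing import Optional

# Very rough pattern: a <meta ...> tag that contains charset=NAME.
META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def meta_charset(raw: bytes) -> Optional[str]:
    """Return the charset named in the page's META tag, or None."""
    match = META_CHARSET.search(raw[:1024])
    return match.group(1).decode("ascii").lower() if match else None


def decode_by_meta(raw: bytes, fallback: str = "cp1252") -> str:
    """Decode a page using its own META declaration, else a fallback."""
    return raw.decode(meta_charset(raw) or fallback)


template = ('<html><head><meta http-equiv="Content-Type" '
            'content="text/html; charset={0}"></head>'
            '<body>Привет</body></html>')

# The same page saved in two of the encodings tested above.
koi8_page = template.format("koi8-r").encode("koi8-r")
cp1251_page = template.format("windows-1251").encode("windows-1251")
```

This works because the tag itself is plain ASCII in all three encodings mentioned, so it can be read before the encoding of the rest of the file is known.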

So, a better answer for the original question!
I have tried setting the proper encoding in the meta and Russian works fine.
Tested it with KOI8-R, windows-1251 and utf-8

Signature

Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

Joerg Jooss - 10 Feb 2007 10:59 GMT
Thus wrote Mihai N.,

>> In our case (VS), the precise type of data stream is often unknown.
>>
> This is VS's fault (because it is unaware of the HTML ways of
> specifying encoding).

My point was that there are source files to which HTML specific rules don't
apply -- such as C#, VB, JavaScript.  

>> VS usually doesn't load source files via HTTP, nor is every source
>> file XML or META tagged. These standards aren't applicable as a whole
[quoted text clipped - 4 lines]
> VS respects the meta in the head section!
> <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

I know, although in some quick tests VS didn't keep file encoding and META
tag synchronized all the time. I don't create a lot of plain HTML content,
so I don't have any insight into how reliable that feature is.

> Try this:
> 1. save the first page from www.yahoo.co.jp It is encoded using
[quoted text clipped - 13 lines]
> somewhere,
> or ask every single time. Nice and correct!

Mihai, fire up your favorite hex editor and check the first three bytes of
the file: EF BB BF here.  ;-)
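
Without a hex editor at hand, the same check can be done with a few lines of Python. The helper names are made up for this example, and the sample bytes stand in for a real file's contents.

```python
import codecs


def first_bytes_hex(data: bytes, count: int = 3) -> str:
    """Format the first `count` bytes as upper-case hex pairs."""
    return " ".join(f"{byte:02X}" for byte in data[:count])


def has_utf8_bom(data: bytes) -> bool:
    """True if the data starts with the UTF-8 signature EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)


# For a real file, read it in binary mode first, for example:
#   data = open("page.html", "rb").read()   # page.html is a hypothetical name
saved_with_signature = codecs.BOM_UTF8 + b"<html></html>"
saved_without = b"<html></html>"

with_sig_hex = first_bytes_hex(saved_with_signature)
without_hex = first_bytes_hex(saved_without)
```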

Cheers,
Signature

Joerg Jooss
news-reply@joergjooss.de

Mihai N. - 10 Feb 2007 20:41 GMT
> My point was that there are source files to which HTML specific rules don't
> apply -- such as C#, VB, JavaScript.
Ah!
I have no problem with this.
As stated somewhere "I do have opinions on UTF-8 + BOM in the context of
various file formats."
So my opinion on the BOM depends on the file format. This is *no* for
html/xml, but might be yes for C#, VB, JavaScript.
And, in fact, C# & VB are MS formats, so if they decide to use 3 BOMs
at the end to identify the encoding, I have no problem with it
(but I will say "WTF?" :-)

Ok, joking aside, I think BOM in C# and VB files is a good thing.
I need to think a bit more about JavaScript.

> Mihai, fire up your favorite hex editor and check the first three bytes of
> the file: EF BB BF here.  ;-)
Damn! Me no like it :-)

Signature

Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

