.NET Forum / Languages / Managed C++ / May 2007
wide character (unicode) and multi-byte character
George - 03 May 2007 04:08 GMT Hello everyone,
Wide character and multi-byte character are two popular encoding schemes on Windows, and wide character uses the Unicode encoding scheme. But I get confused every time another term, code page, comes up at the same time.
I am even more confused because sometimes a code page parameter is needed for a wide character conversion and sometimes it is not. Here are two examples.
A code page is used in WideCharToMultiByte when dealing with Unicode characters:
int WideCharToMultiByte ( UINT CodePage, DWORD dwFlags, LPCWSTR lpWideCharStr, int cchWideChar, LPSTR lpMultiByteStr, int cbMultiByte, LPCSTR lpDefaultChar, LPBOOL lpUsedDefaultChar );
A code page is not used in wcstombs when dealing with Unicode characters:
size_t wcstombs ( char* mbstr, const wchar_t* wcstr, size_t count );
My question is, what is codepage (seems my current understanding is not correct)? Does codepage have anything to do with multi-byte character or only have relationship with wide character? Could anyone explain the meaning and relationship between codepage, wide character and multi-byte character?
thanks in advance, George
Mihai N. - 03 May 2007 06:14 GMT About code page: http://www.mihai-nita.net/article.php?artID=20060806a
> code page is not used in wcstombs when dealing with Unicode character
wcstombs is a "dumbed-down" version of WideCharToMultiByte. It uses the default system code page (or ANSI code page), and the user has less control over the various conversion options (dwFlags).
In fact, wcstombs is implemented in terms of WideCharToMultiByte.
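As a rough illustration of the missing code page parameter, here is a minimal C++ sketch using only the standard library. The helper name narrow_with_current_locale is made up for this example. It assumes the target encoding comes from the current C locale (LC_CTYPE); how a given C runtime maps that locale to a Windows code page is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>

// Convert a wide string to a narrow string with wcstombs.
// There is no code page argument: the target encoding is whatever
// the current C locale says it is.
// Returns an empty string if some character cannot be converted.
std::string narrow_with_current_locale(const std::wstring& wide) {
    // MB_CUR_MAX is the largest number of bytes one character can need in
    // the current locale, so this buffer is always big enough.
    std::string out(wide.size() * MB_CUR_MAX + 1, '\0');
    std::size_t written = std::wcstombs(&out[0], wide.c_str(), out.size());
    if (written == static_cast<std::size_t>(-1)) {
        return std::string();  // a character had no representation
    }
    assert(written < out.size());
    out.resize(written);
    return out;
}
```

With the default "C" locale, narrow_with_current_locale(L"abc") gives "abc". Calling std::setlocale(LC_CTYPE, "") beforehand would select the user's default locale instead, which is where the implicit code page comes from.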
 Signature Mihai Nita [Microsoft MVP, Windows - SDK] http://www.mihai-nita.net ------------------------------------------ Replace _year_ with _ to get the real email
George - 03 May 2007 13:38 GMT Thanks Mihai!
It is a very good article and I read through it twice. It answers and clarifies most of my questions. I would still like your help to confirm two points:
1. Unicode, ANSI UTF-8 and UTF-16 is character set or code page, number mapping between character and number, is that correct?
2. What is the encoding approach? Does it have a name? I only see B or Q in the samples in the article to represent encoding approach.
I think encoding approach is another level of mapping between code page character number and storage bytes. Is my understanding correct?
regards, George
> About code page:
> http://www.mihai-nita.net/article.php?artID=20060806a
[quoted text clipped - 6 lines]
> > In fact, wcstombs is implemented in terms of WideCharToMultiByte.
Mihai N. - 04 May 2007 11:01 GMT
> 1. Unicode, ANSI UTF-8 and UTF-16 is character set or code page, number
> mapping between character and number, is that correct?
Unicode = code page
UTF-8, UTF-16, UTF-32 = Character Encoding Forms http://www.unicode.org/glossary/#character_encoding_form
ANSI = in the Windows lingo ANSI is a misnomer, meaning “the default system code page.” See http://www.mihai-nita.net/article.php?artID=glossary
The Unicode lingo is a bit more complicated (you also have a "Character Encoding Scheme", etc.), but you probably don't need the whole enchilada to get a grasp of the basics.
> 2. What is the encoding approach? Does it have a name? I only see B or Q in
> the samples in the article to represent encoding approach.
B = BASE64, Q = Quoted-Printable
http://www.faqs.org/rfcs/rfc2047.html
> I think encoding approach is another level of mapping between code page
> character number and storage bytes. Is my understanding correct?
Yes. It is also called "byte serialization". For normal text in a computer (let's say in code page 1252, Western European) the mapping from code value to byte is 1:1, direct storage, so the encoding part is not obvious. The code for 'a' is 0x61 and it is stored as the byte 61. This is why many programmers don't "grok" this extra level.
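To make that extra level visible, here is a small C++ sketch (not from the article; encode_utf8 is a name invented for this example) that serializes a code point as UTF-8. For 'a' the stored byte equals the code value, just as in code page 1252, but for U+00E9 the code value E9 becomes the two bytes C3 A9. It assumes a valid code point and does not reject surrogate values.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Serialize one Unicode code point as a sequence of UTF-8 bytes.
std::string encode_utf8(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    std::string out;
    if (cp < 0x80) {
        // One byte: the byte value is the code value itself.
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // Two bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

So the same number can have different byte serializations: encode_utf8(0x61) is the single byte 61, while encode_utf8(0xE9) is C3 A9.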
George - 04 May 2007 13:42 GMT Thanks Mihai!
Your reply is very great! I have read some more articles and have two more questions.
1. Previously I think wide character representation in computer is a specific encoding (codepage) approach -- like UTF-16, and multi-byte character representation in computer is another specific encoding (codepage) approach.
But now from your help, I think my previously understanding is wrong. Wide character and multi-byte character are just general terms used on Windows to represent mapping between a character and multiple (more than one) bytes. Is that correct?
Differences between multi-byte and wide character? I think they are both characters which are represented by more than one bytes. Why on Windows they are distinguished?
2. I am wondering where can I find the mapping table of each codepage (or encoding)? (how a number is mapped to a character)
regards, George
> > 1. Unicode, ANSI UTF-8 and UTF-16 is character set or code page, number
> > mapping between character and number, is that correct?
[quoted text clipped - 24 lines]
> as the byte 61.
> This is why many programmers don't "grok" this extra level.
Mihai N. - 05 May 2007 08:56 GMT
> 1. Previously I think wide character representation in computer is a
> specific encoding (codepage) approach -- like UTF-16, and multi-byte
> character representation in computer is another specific encoding
> (codepage) approach.
You are right, they are slightly different approaches.
> But now from your help, I think my previously understanding is wrong. Wide
> character and multi-byte character are just general terms used on Windows
> to represent mapping between a character and multiple (more than one)
> bytes. Is that correct?
Not quite. It is a bit more complicated. First, MultiByteToWideChar is not quite a correct name. It converts from all kinds of code pages, including single-byte ones (like 1252), so naming it MultiByte... is not quite accurate. But hey, it is just an API name.
The main difference between a multi-byte character (let's say in Shift-JIS for Japanese) and a wide character is that the multi-byte one was really thought of as multi-byte. "Ni" in "Nihon" was "93 FA" in Shift-JIS. They are really two bytes, and nobody considered it to be a number, 0x93FA or 0xFA93.
It is a bit like numbering systems. In base 10, you think about 12 as being represented by two digits, 1 and 2. If you switch to base 16, 12 is a single digit (represented as the digit 'C').
It is a difference in perception, some might say just philosophical. But sometimes just looking at a problem differently helps to solve it.
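The difference can be sketched in C++. The two helper names are hypothetical, and U+65E5 is taken here as the Unicode code point of the same "Ni" character. A wide character is one 16-bit number, so its bytes in memory depend on byte order, while the Shift-JIS form is the same byte sequence 93 FA everywhere.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Lay out one 16-bit code unit in memory, least significant byte first.
std::array<std::uint8_t, 2> store_little_endian(std::uint16_t unit) {
    return {{static_cast<std::uint8_t>(unit & 0xFF),
             static_cast<std::uint8_t>(unit >> 8)}};
}

// Lay out one 16-bit code unit in memory, most significant byte first.
std::array<std::uint8_t, 2> store_big_endian(std::uint16_t unit) {
    return {{static_cast<std::uint8_t>(unit >> 8),
             static_cast<std::uint8_t>(unit & 0xFF)}};
}
```

store_little_endian(0x65E5) yields E5 65 and store_big_endian(0x65E5) yields 65 E5; the number is the same in both cases. A multi-byte sequence has no such choice to make, because it was never a number in the first place.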
> Differences between multi-byte and wide character? I think they are both
> characters which are represented by more than one bytes. Why on Windows
> they are distinguished?
They are distinguished on all platforms, not only in Windows.
> 2. I am wondering where can I find the mapping table of each codepage (or
> encoding)? (how a number is mapped to a character)
First, it is better to use the standard API. But if you want to check some code pages, you can take a look here:
ftp://ftp.unicode.org/Public/MAPPINGS/
Or you can download ICU (International Components for Unicode)
http://www.icu-project.org/download/
and take a look; they have lots of tables.
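For a single-byte code page such a mapping table is small enough to show whole. The following C++ sketch hard-codes the 0x80-0x9F rows of Windows-1252 (transcribed here by hand, so verify them against the published table before relying on them). The function name and the choice of U+FFFD for unassigned bytes are assumptions of this example, not what the Windows API does.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Code points for bytes 0x80..0x9F of Windows-1252; 0 marks an unassigned byte.
static const std::uint16_t kCp1252High[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178,
};

// Convert Windows-1252 bytes to UTF-16 code units by table lookup.
std::u16string cp1252_to_utf16(const std::string& bytes) {
    std::u16string out;
    for (unsigned char b : bytes) {
        // Bytes 0x00-0x7F and 0xA0-0xFF map to the code point of the same value.
        std::uint16_t cp = b;
        if (b >= 0x80 && b <= 0x9F) {
            cp = kCp1252High[b - 0x80];
            if (cp == 0) {
                cp = 0xFFFD;  // this sketch's choice for unassigned bytes
            }
        }
        out += static_cast<char16_t>(cp);
    }
    return out;
}
```

Every input byte produces exactly one output code unit, which is what makes this a single-byte code page.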
George - 05 May 2007 13:27 GMT Thanks Mihai!
I am wondering what kinds of codepage (encoding) method could be called as multibyte character codepage (encoding), and what kinds of codepage (encoding) method could be called as wide character codepage (encoding)?
For example, if we are given a codepage (encoding) name like UTF-7, how could we make a conclusion whether it is wide character or multibyte character?
I think UTF-8 is multibyte and UTF-16 is wide character -- in my current limited knowledge level. But I am not sure about others, could you help to list some others (like popular ANSI code page?) or identify what is the rule used to distinguish whether a codepage (encoding) method is multibyte character or wide character?
I am still confused that why we distinguish multibyte character and wide character -- because I think wide character is also multibyte character, since wide character is of 2 bytes -- multiple bytes. :-)
I have performed some self-study, I think on Windows only UTF-16 is wide character codepage (encoding). Is that correct?
regards, George
David Wilkinson - 05 May 2007 15:35 GMT
> Thanks Mihai!
>
[quoted text clipped - 18 lines]
> I have performed some self-study, I think on Windows only UTF-16 is wide
> character codepage (encoding). Is that correct?
George:
On Windows, wide-character means an encoding using 16-bit characters (unsigned short, or wchar_t). There is only one wide character encoding in Windows, UTF-16. Most Unicode code points in UTF-16 are just one 16-bit character, but some languages use code points that require two 16-bit characters.
All other encodings used in Windows use 8-bit characters (unsigned char, or char).
UTF-8 can represent all Unicode code points, using up to four 8-bit characters.
The "ANSI" code pages in Windows are 8-bit encodings in which (I think) at most two 8-bit characters are used for each code point. Each code page can only represent a subset of the Unicode code points, so different languages require different code pages.
Both UTF-8 and ANSI code pages are MBCS.
 Signature David Wilkinson Visual C++ MVP
Mihai N. - 06 May 2007 07:50 GMT
> Most Unicode code points in UTF-16 are just one
> 16-bit character, but some languages use code points that require two
> 16-bit characters.
There are no two 16-bit characters. The 16-bit thing is not a character, but a code unit. For characters above FFFF (which cannot be represented in 16 bits) you use two code units (in the surrogates area). So the technically correct statement is "some languages use characters that require two 16-bit code points" (exactly the other way around :-)
> UTF-8 can represent all Unicode code points, using up to four 8-bit
> characters.
"up to four 8-bit code units"
> Both UTF-8 and ANSI code pages are MBCS.
Not quite.
UTF-8 is not a character set; it is a character encoding scheme for Unicode. So it is not an MBCS (Multi Byte Character Set).
Also, ANSI code pages that have only 256 values (like 1250, 1251, 1252, etc.) are SBCS (Single Byte Character Sets).
The only ANSI code pages that are true MBCS are 932 (Japanese), 936 (Simplified Chinese), 949 (Korean) and 950 (Traditional Chinese). All of them use at most 2 bytes, so they are also called DBCS (Double Byte Character Set). An example of an MBCS that is not DBCS is GB 18030 (which cannot be an ANSI code page).
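Walking a DBCS string means deciding, byte by byte, whether a byte starts a two-byte character. A minimal C++ sketch for code page 932 follows; the lead-byte ranges 0x81-0x9F and 0xE0-0xFC are an assumption of this example (on Windows, IsDBCSLeadByteEx answers the question for a real code page), and the function names are made up.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assumed lead-byte ranges for code page 932 (Shift-JIS).
bool is_lead_byte_932(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Count characters (not bytes) in a code page 932 byte string.
std::size_t count_chars_932(const std::string& bytes) {
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char b = static_cast<unsigned char>(bytes[i]);
        // A lead byte consumes itself plus one trail byte, if one is present.
        bool two_bytes = is_lead_byte_932(b) && i + 1 < bytes.size();
        i += two_bytes ? 2 : 1;
        ++count;
    }
    return count;
}
```

For the four bytes 93 FA 96 7B the function reports two characters, even though the last byte, 7B, is also the ASCII code of '{'. That overlap is why a DBCS string cannot be scanned safely one byte at a time.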
David Wilkinson - 07 May 2007 11:22 GMT
>> Most Unicode code points in UTF-16 are just one
>> 16-bit character, but some languages use code points that require two
[quoted text clipped - 23 lines]
> Character Set). An example of MBCS that is not DBCS is GB 18030 (which cannot
> be ANSI code page).
Hi Mihai:
I just knew that I would mess this up, and that someone would have to correct me. And I just knew it would be you :).
Yes, the word "character" is overloaded. I was using it in the sense of what you call code unit (char or wchar_t in C/C++). I won't do that any more.
But what exactly is the other meaning of "character"? Is it the same as "glyph"? From an abstract point of view, I think there are just two concepts: "(Unicode) code points" and "code units". Each encoding uses one or more code units to represent some subset (possibly all) of the code points. It's just what the code points represent that I'm not quite sure of.
But anyway, don't you think the correct statement is
"Some languages use Unicode code points that require two 16-bit code units." ?
Because any 8-bit encoding can be used in WideCharToMultiByte() and MultiByteToWideChar(), I was thinking that any 8-bit encoding could be regarded as MBCS. SBCS is a special case where the selected code points can be represented by a single code unit. DBCS is a special case where the selected code points can be represented with two code units. UTF-8 is a special case where all the code points can be represented, using up to four code units. It's the old "Is a square a rectangle" thing.
Mihai N. - 08 May 2007 08:49 GMT
> But what exactly is the other meaning of "character". Is it the same as
> "glyph"? From an abstract point of view, I think there are just two
> concepts: "(Unicode) code points" and "code units". Each encoding uses
> one or more code units to represent some subset (possibly all) of the
> code points. It's just what the code points represent that I'm not quite
> sure of.
I would say: go to http://www.unicode.org/glossary/
A character is what the user perceives as a character. A user means: the real guy on the street who has no clue about languages. This is a cultural thing. For some countries the ae ligature (U+00E6) is one character, for others it is two.
A glyph is a form (think vectors in a TTF file). The 'a' in Times New Roman and the 'a' in Arial are two different glyphs. You can have a glyph representing more than one character (ligatures) or part of a character (combining marks).
The code units depend on the character encoding scheme. They are 8 bits for UTF-8, 16 bits for UTF-16 and 32 bits for UTF-32.
The Unicode code points are in the range 0-10FFFF, and one code point does not necessarily map to one character. They are the real value. Think numbers: the concept of "eighteen" is one and the same, although you can represent it as 12h, or 0x12, or 18 or 11000 (bin) or 030 (oct).
In the same way, U+020D is the "Latin small letter o with double grave". You might represent it in UTF-8 or UTF-16, you can even use a Java escape (\u020D) or HTML &#525; or &#x20D;. It is the same thing.
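The escape notations mentioned above are just different spellings of the same number. A short C++ sketch, with helper names invented for this example: a Java escape is \u followed by four hex digits of a UTF-16 code unit, and HTML numeric references take the code point in decimal or in hex.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Java-style escape for one UTF-16 code unit: \u plus exactly four hex digits.
std::string java_escape(std::uint16_t unit) {
    char buf[8];
    std::snprintf(buf, sizeof buf, "\\u%04X", static_cast<unsigned>(unit));
    return buf;
}

// HTML numeric character reference, decimal form.
std::string html_decimal(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    char buf[16];
    std::snprintf(buf, sizeof buf, "&#%lu;", static_cast<unsigned long>(cp));
    return buf;
}

// HTML numeric character reference, hexadecimal form.
std::string html_hex(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    char buf[16];
    std::snprintf(buf, sizeof buf, "&#x%lX;", static_cast<unsigned long>(cp));
    return buf;
}
```

For the code point 0x20D the three helpers give \u020D, &#525; and &#x20D;, which all name the same character.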
> But anyway, don't you think the correct statement is
>
> "Some languages use Unicode code points that require two 16-bit code
> units." ?
Right.
> I was thinking that any 8-bit encoding could be
> regarded as MBCS. SBCS is a special case where the selected code points
> can be represented by a single code unit. DBCS is a special case where
> the selected code points can be represented with two code units.
You can put it this way, if you want (especially for MultiByteToWideChar).
For me you are not a multi-millionaire if you only have one million :-)
So in my book you are not MBCS if you are only SBCS :-)
> UTF-8
> is a special case where all the code points can be represented, using up
> to four code units. It's the old "Is a square a rectangle" thing.
But UTF-8 is not a code page or a character set. It is more like BASE64 or "quoted printable". But it was easier to squeeze it through MultiByteToWideChar than to add another API. Imagine the questions: "So if I convert from a Japanese code page, I use MultiByteToWideChar, but if I convert from the UTF-8 code page I use UTF8ToWideChar? Why? Crap design! What do you mean UTF-8 is not a code page?"
:-D
Norman Diamond - 09 May 2007 01:55 GMT
> Think numbers: the concept of "eighteen" is one and the same, although you
> can represent it as 12h, or 0x12, or 18 or 11000 (bin) or 030 (oct)
Some of your coding conversions changed your numbers when you weren't expecting it. Yeah, numbers have a lot in common with characters ^_^
Mihai N. - 09 May 2007 09:23 GMT
>> Think numbers: the concept of "eighteen" is one and the same, although you
>> can represent it as 12h, or 0x12, or 18 or 11000 (bin) or 030 (oct)
>
> Some of your coding conversions changed your numbers when you weren't
> expecting it. Yeah, numbers have a lot in common with characters ^_^
Right :-) Got the 18 (decimal), considered it hex, and converted that to bin and oct. Another try: 12h, or 0x12, or 18 or 10010 (bin) or 022 (oct)
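The arithmetic is easy to check with a few lines of C++ (to_base is a helper written for this example). It also shows where the first attempt went wrong: 11000 and 030 are the binary and octal forms of 0x18, not of 18.

```cpp
#include <cassert>
#include <string>

// Render a non-negative value as a digit string in the given base (2..16).
std::string to_base(unsigned value, unsigned base) {
    assert(base >= 2 && base <= 16);
    const char digits[] = "0123456789ABCDEF";
    std::string out;
    do {
        // Prepend the least significant digit, then drop it from the value.
        out.insert(out.begin(), digits[value % base]);
        value /= base;
    } while (value != 0);
    return out;
}
```

Called with the value eighteen, to_base returns 12 for base 16, 10010 for base 2 and 22 for base 8. Called with 0x18 instead, it returns 11000 and 30, which is the original slip.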
:-D
George - 07 May 2007 08:53 GMT Thanks David! Your reply is very clear. I am still confused about just one concept you are using.
8-bit character or 8-bit encoding, I think you mean in the encoding approach, the basic unit is 8-bit, and in the encoding approach, more than one 8-bit basic units could be used. Is that correct?
About 16-bit character or 16-bit encoding, I think you mean the basic unit is 16-bit.
Is my understanding correct?
regards, George
> > Thanks Mihai!
> >
[quoted text clipped - 39 lines]
> > Both UTF-8 and ANSI code pages are MBCS.
David Wilkinson - 07 May 2007 11:22 GMT
> Thanks David! Your reply is very clear. I am still confused about just one
> concept you are using.
[quoted text clipped - 7 lines]
>
> Is my understanding correct?
George:
Yes. What I was calling "character" is perhaps better called "code unit."
Mihai N. - 06 May 2007 08:07 GMT
> I am wondering what kinds of codepage (encoding) method could be called as
> multibyte character codepage (encoding), and what kinds of codepage
> (encoding) method could be called as wide character codepage (encoding)?
To make it simple, in the Windows world wide characters are WCHAR/wchar_t. That would be UTF-16 (16-bit code units). So when you call MultiByteToWideChar you convert anything to UTF-16, and WideCharToMultiByte converts UTF-16 to whatever. That whatever is not always technically a "multi byte character set," but you should not care.
> For example, if we are given a codepage (encoding) name like UTF-7, how
> could we make a conclusion whether it is wide character or multibyte
> character?
Although UTF-7 (or UTF-8) is not a code page or an encoding, technically you will use it on the "multibyte side" of MultiByteToWideChar.
> But I am not sure about others, could you help to
> list some others (like popular ANSI code page?) or identify what is
> the rule used to distinguish whether a codepage (encoding) method
> is multibyte character or wide character?
- UTF-16 is wide
- UTF-32 would be wide, but is not supported. And if it is ever supported, it will probably go on the MultiByte part of the conversion API :-)
- UTF-7 and UTF-8 are not code pages, but work as MBCS for the Windows conversion API
- all the rest are SBCS, DBCS, MBCS
  - SBCS (Single Byte Character Set): needs only 1 byte to represent all the characters (max 256 characters). Most of the Windows code pages.
  - DBCS (Double Byte Character Set): needs 1 or 2 bytes to represent all the characters (more than 256 characters). These are used for CCJK: Chinese Simplified (936), Chinese Traditional (950), Japanese (932), Korean (949).
  - MBCS (Multi Byte Character Set): needs 1 or more bytes to represent all the characters.
Now, SBCS and DBCS are particular cases of MBCS (because of the "1 or more" part). A code page that is MBCS and not SBCS/DBCS is GB 18030.
> I am still confused that why we distinguish multibyte character and wide
> character -- because I think wide character is also multibyte character,
> since wide character is of 2 bytes -- multiple bytes. :-)
It is about the design of the thing, not about how many bytes represent it. It is a bit tricky to "grok," but the good news is that you don't need to grok it in order to use it.
The basic rule: in the Windows world the only wide encoding is UTF-16, all the rest is MBCS. This is not technically correct, but it works as a general rule. Unless you care about the philosophical aspects, you should not mind :-)
> I have performed some self-study, I think on Windows only UTF-16 is wide
> character codepage (encoding). Is that correct?
Yes.
George - 07 May 2007 09:03 GMT Thanks Mihai! You are so knowledgeable about codepage! So great to meet with you here!
One more simple question about your reply,
the coding unit, I think you mean the basic units to represent the number (or storage, hex) form of a character in computer.
For example, about wide character, the coding unit is 16-bit, so each character is in a form of multiple 16-bits, for example, 16 bits, 32 bits or 64 bits.
For another example, about multibyte character, the coding unit is 8-bit, so each character is in a form of multiple 8-bits, for example, 8bits, 16 bits or 24 bits. -- But on Windows, only UTF-16 is used, means each character is represented by 16 bits.
I have also read through the resource (link) you recommended before, and I learned that ANSI codepage is a very special codepage name, which has different meaning on different locales, for example,
1252 (English) 932 (Japanese), 936 (Simplified Chinese), 949 (Korean) and 950 (Traditional Chinese).
Is my understanding correct?
regards, George
> > I am wondering what kinds of codepage (encoding) method could be called as
> > multibyte character codepage (encoding), and what kinds of codepage
[quoted text clipped - 47 lines]
> > character codepage (encoding). Is that correct?
> Yes.
Mihai N. - 08 May 2007 08:21 GMT
> the coding unit, I think you mean the basic units to represent the number
> (or storage, hex) form of a character in computer.
No, the code unit is the smallest piece of data that you can manipulate. It is not a character. A character can take several code units (up to 4 in UTF-8, 1 or 2 in UTF-16).
> For example, about wide character, the coding unit is 16-bit, so each
> character is in a form of multiple 16-bits, for example, 16 bits,
> 32 bits or 64 bits.
Nope :-) For 16 bits the encoding is UTF-16. You can have one code unit (representing code points up to FFFF) or two code units for the range 10000-10FFFF. The mechanism is called "surrogates," if you want to search for it and read more.
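The surrogate arithmetic is short enough to sketch in C++. The function name is hypothetical and the code assumes a valid code point; it does not reject values in the surrogate range D800-DFFF.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode one Unicode code point as one or two UTF-16 code units.
std::vector<std::uint16_t> encode_utf16(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x10000) {
        // Fits in 16 bits: a single code unit with the same value.
        return {static_cast<std::uint16_t>(cp)};
    }
    // Above FFFF: subtract 0x10000 to get a 20-bit value, then split it
    // into a high surrogate (top 10 bits) and a low surrogate (low 10 bits).
    cp -= 0x10000;
    return {static_cast<std::uint16_t>(0xD800 | (cp >> 10)),
            static_cast<std::uint16_t>(0xDC00 | (cp & 0x3FF))};
}
```

U+20AC stays a single code unit, 20AC, while U+10000, the first code point that no longer fits, becomes the pair D800 DC00, and U+10FFFF, the last one, becomes DBFF DFFF.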
> For another example, about multibyte character, the coding unit is 8-bit, so
> each character is in a form of multiple 8-bits, for example, 8bits, 16 bits
> or 24 bits. -- But on Windows, only UTF-16 is used, means each character is
> represented by 16 bits.
Everything in the PC world is a multiple of 8 bits. But it is not accurate to say this about encodings. In MBCS, characters are not 16 bits. They are 2 bytes. It is about how things are seen: as one 16-bit unit vs. a succession of 8-bit units.
Think 123 vs 1 2 3. The first one means one hundred twenty-three. The second one is a succession of digits: one, two, three. They both take 3 digits, but you think about them differently.
> I have also read through the resource (link) you recommended before, and I
> learned that ANSI codepage is a very special codepage name, which has
[quoted text clipped - 5 lines]
>
> Is my understanding correct?
Yes. In the Windows world ANSI code page == default system code page. It is the same as the code page used by localized non-Unicode versions of Windows (from Win 3.0 to Win Me). This is why in the XP UI it is described as "Language for non-Unicode programs". Except for legacy reasons, Unicode applications should not care about it.
Anyway, I think you got enough dry theory. I would say start working, and you will grok more once you start hitting problems. After a while you will hopefully have that "aha, now I get it!" moment :-)