.NET Forum / Languages / C# / March 2008
Checking character - problem in non-English languages?
Jon - 07 Mar 2008 15:04 GMT
I'm checking a character in a string for whitespace (space or tab) or a comment character (I use # for this, signifying that anything following it is a comment and should be ignored). My C# code is:
    string str = ????;
    int ch = ????;   // character position in string
    if (str[ch] == ' ' || str[ch] == '\t' || str[ch] == '#')
    {
        // delimiter found
        ????
    }
It's what I've been using for many years (mainly in C - I've recently converted it to C#). It's occurred to me that there might be problems with the above for non-English languages.
For instance, I know that in MS Word there are different types of space (e.g. non-breaking space) and also different widths of space (e.g. em space, en space). I wondered if these also exist in Unicode, and if I should be checking for them. The same goes for tab and #.
If there are problems, how can I fix the above code?
Jon Skeet [C# MVP] - 07 Mar 2008 15:09 GMT
> I'm checking a character in a string for whitespace (spacebar or tab) or comment character (I use # for this, signifying that anything following it is a comment so should be ignored). My C# code is:
> [quoted text clipped - 14 lines]
> If there are problems, how can I fix the above code?
Well, what's the actual context here? If it's a plain text file, it's likely to just contain a normal space. I wouldn't worry about that.
On the other hand, there's always Char.IsWhiteSpace.
 Signature Jon Skeet - <skeet@pobox.com> http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet World class .NET training in the UK: http://iterativetraining.co.uk
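
For readers following along, here is a minimal sketch of how the original check might look once it uses Char.IsWhiteSpace, as suggested above. The class name `DelimiterCheck` and the method `IsDelimiter` are made up for illustration; they are not from the thread. Note the assumption that any Unicode white space should count as a delimiter, which also includes line breaks.

```csharp
using System;

public static class DelimiterCheck
{
    // '#' starts a comment in the original poster's file format.
    private const char CommentChar = '#';

    // Returns true if the character at position ch is a delimiter:
    // any Unicode white space (space, tab, no-break space, em space,
    // line breaks, ...) or the comment character.
    public static bool IsDelimiter(string str, int ch)
    {
        char c = str[ch];
        return char.IsWhiteSpace(c) || c == CommentChar;
    }
}
```

Called as `DelimiterCheck.IsDelimiter(line, position)`, this accepts a no-break space (U+00A0) where the hand-written comparison would not. If only space and tab should ever separate tokens, the original comparison is still the stricter choice.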
Jon - 07 Mar 2008 15:24 GMT
Thanks for your reply Jon,
It is a normal text file - a sort of a configuration file, although I can't guarantee how the end user will generate it (eg Notepad or maybe a different editor).
Are you implying that normal text files use 1-byte characters rather than Unicode characters (which I assume are 2-byte)?
Thanks for the tip on using Char.IsWhiteSpace.
Jon
Jon Skeet [C# MVP] - 07 Mar 2008 15:29 GMT
> Thanks for your reply Jon,
> [quoted text clipped - 4 lines]
> Are you implying that normal text files use 1-byte characters rather than unicode characters (which I assume are 2-byte)?
No - I'm implying that tools like Notepad usually won't generate "fancy" whitespace, regardless of which encoding they save the file in.
If you're confronted with a Word document, that might have different kinds of spaces in - but in a plaintext document you're *likely* to just have normal spaces.
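
If you would rather confirm that than assume it, a small diagnostic can report any white space other than the ordinary space and tab. This is only a sketch: `WhitespaceAudit` and `FindUnusualWhitespace` are hypothetical names, and treating space and tab as the only "normal" white space is an assumption based on the original question.

```csharp
using System;
using System.Collections.Generic;

public static class WhitespaceAudit
{
    // Returns one entry per white-space character in the line that is
    // neither the ordinary space (U+0020) nor the tab (U+0009).
    public static List<string> FindUnusualWhitespace(string line)
    {
        var findings = new List<string>();
        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (char.IsWhiteSpace(c) && c != ' ' && c != '\t')
            {
                // Report the position and the code point in U+XXXX form.
                findings.Add(string.Format("index {0}: U+{1:X4}", i, (int)c));
            }
        }
        return findings;
    }
}
```

Running it over each line of a user-supplied file would show whether an editor has slipped in something like a no-break space pasted from a word processor.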
Jon - 07 Mar 2008 15:45 GMT
OK, I understand - thanks.
Now the following is the bit that concerns me, probably because my knowledge of Unicode isn't that great.
I assume that Unicode is effectively lots of code pages, most of which are 256 bytes in length. So, 0-255 is US English, 256-511 is some other language, etc. Some of these won't even be Latin-based. Some (eg Chinese) will use more than 256 characters.
If so, then presumably there are lots of space characters in Unicode. Perhaps even lots of # characters in Unicode. Let's say that French is 256-511 (I've no idea where it really is). Is there a # in US English, and one also in French? Or is there only one # in the whole of Unicode?
If someone in France, with their PC set up with French locale, writes a normal text file, then gives it to me in the UK which I use with my PC set to UK English locale, will the French person's # character be the same character as the one that I'm checking for?
Jon
Jon Skeet [C# MVP] - 07 Mar 2008 15:51 GMT
> OK, I understand - thanks.
> [quoted text clipped - 3 lines]
> I assume that Unicode is effectively lots of code pages, most of which are 256 bytes in length. So, 0-255 is US English, 256-511 is some other language, etc.
Not really...
> If so, then presumably there are lots of space characters in Unicode. Perhaps even lots of # characters in Unicode. Let's say that French is 256-511 (I've no idea where it really is). Is there a # in US English, and one also in French? Or is there only one # in the entire Unicode.
No, that's not the way it works.
See http://pobox.com/~skeet/csharp/unicode.html for an overview of Unicode, and the difference between an *encoding* and a character set.
Basically whatever encoding the file is in, you'll end up with Unicode strings in memory when you read the file in - it's up to you to specify the encoding.
When you've loaded the file, only " " means space. There are various different kinds of spaces (non-breaking, wide etc) but only one "normal" one, U+0020 (decimal 32).
Jon
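
As a rough illustration of "it's up to you to specify the encoding", the sketch below reads a file with an explicitly chosen encoding and strips # comments. `ConfigReader` and `ReadWithoutComments` are invented names, and the rule that # always starts a comment (with no escaping or quoting) is an assumption taken from the original question.

```csharp
using System;
using System.IO;
using System.Text;

public static class ConfigReader
{
    // Reads a text file using the given encoding and returns each line
    // with any '#' comment removed and surrounding white space trimmed.
    public static string[] ReadWithoutComments(string path, Encoding encoding)
    {
        // Once decoded, every line is an ordinary .NET (UTF-16) string,
        // whatever encoding the file used on disk.
        string[] lines = File.ReadAllLines(path, encoding);
        for (int i = 0; i < lines.Length; i++)
        {
            int hash = lines[i].IndexOf('#');
            string content = hash >= 0 ? lines[i].Substring(0, hash) : lines[i];
            lines[i] = content.Trim();
        }
        return lines;
    }
}
```

A call such as `ConfigReader.ReadWithoutComments(path, Encoding.UTF8)` behaves the same way for a file saved as UTF-16 when `Encoding.Unicode` is passed instead; only the bytes on disk differ.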
Jon - 10 Mar 2008 10:28 GMT
Thanks Jon, that has cleared up my misconceptions of Unicode. Sorry for the late reply (just moved house, no internet yet so can't reply weekends).
I started to look through your link on Friday and will continue today.
Jon
Martin Bonner - 07 Mar 2008 17:31 GMT
> OK, I understand - thanks.
> [quoted text clipped - 3 lines]
> I assume that Unicode is effectively lots of code pages, most of which are 256 bytes in length.
No. It is *one* code page, with room for about 1.1 million characters (code points U+0000 to U+10FFFF). Those are divided into different regions (Latin letters, Greek, Hangul, etc).
> So, 0-255 is US English, 256-511 is some other language, etc. Some of these won't even be Latin-based. Some (eg Chinese) will use more than 256 characters.
>
> If so, then presumably there are lots of space characters in Unicode.
Absolutely not. There is space, and then a few other white space characters (non-breaking, thin, etc).
> Perhaps even lots of # characters in Unicode. Let's say that French is 256-511 (I've no idea where it really is). Is there a # in US English, and one also in French? Or is there only one # in the entire Unicode.
Just the one in the entire Unicode. French is written with a combination of Basic Latin (aka ASCII) and Latin-1 Supplement.
> If someone in France, with their PC set up with French locale, writes a normal text file, then gives it to me in the UK which I use with my PC set to UK English locale, will the French person's # character be the same character as the one that I'm checking for?
Yes.
Jon - 10 Mar 2008 10:30 GMT
Thanks Martin, it's now clear to me.
Jon
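
To make Martin's "Yes" concrete, here is a small sketch that "saves" the same text in three encodings and "loads" it back. The names `EncodingDemo`, `SaveAndLoad` and `SameAfterRoundTrip` are hypothetical, and the three encodings (ISO-8859-1, UTF-8, UTF-16) are just examples of what a French or UK machine might use; the thread does not say which ones are actually involved.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // Encodes the text to bytes and decodes it with the same encoding,
    // mimicking "saved on one PC, loaded on another".
    public static string SaveAndLoad(string text, Encoding encoding)
    {
        byte[] onDisk = encoding.GetBytes(text);
        return encoding.GetString(onDisk);
    }

    // Returns true if the text comes back identical from all three
    // encodings, i.e. the in-memory characters do not depend on the
    // encoding used on disk.
    public static bool SameAfterRoundTrip(string text)
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        string a = SaveAndLoad(text, latin1);
        string b = SaveAndLoad(text, Encoding.UTF8);
        string c = SaveAndLoad(text, Encoding.Unicode);
        return a == b && b == c;
    }
}
```

For a line such as `"caf\u00E9 # commentaire"` the bytes on disk differ between encodings (the é takes one byte in ISO-8859-1 and two in UTF-8), but once decoded the # is the single character U+0023 in every case, so a comparison against `'#'` works regardless of locale. This only holds when the file is decoded with the encoding it was written in.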