.NET Forum / Languages / C# / July 2007

string.Compare / OrdinalIgnoreCase

Rene - 05 Jul 2007 23:23 GMT
Hi,

It was my understanding that when comparing strings using
"OrdinalIgnoreCase" as the method to compare the strings, .NET compares
the strings by first capitalizing all of the characters in the string and
then making an ordinal comparison (Unicode code point comparison).

I guess I was wrong (or I am not getting something) because my experiments
prove otherwise.

In the code below, compare1 returns zero, proving that if I manually
capitalize the strings and then compare them, .NET says they are the
same.

However, compare2 does not return zero, which means that .NET is doing
something different from what I assumed.

Could someone please tell me why compare1 and compare2 are returning
different values?

Thank you.

---------------------------------------------------------

// LATIN CAPITAL LETTER I (U+0049)
string capitalLetterI = "I";

// LATIN SMALL LETTER DOTLESS I (U+0131)
string smallLetterDotlessI = "\u0131";

string upper1 = smallLetterDotlessI.ToUpper();
string upper2 = capitalLetterI.ToUpper();
int compare1 = string.Compare(upper1, upper2, StringComparison.Ordinal);

int compare2 = string.Compare(smallLetterDotlessI, capitalLetterI,
    StringComparison.OrdinalIgnoreCase);

Jon Skeet [C# MVP] - 05 Jul 2007 23:41 GMT
> It was my understanding that when comparing strings using
> "OrdinalIgnoreCase" as the method to compare the strings, .NET compares
> the strings by first capitalizing all of the characters in the string and
> then making an ordinal comparison (Unicode code point comparison).

The process of capitalization itself is culture-sensitive, which is
what's tripping you up. Your call to ToUpper is returning plain "I" in
both cases, because it's using the thread's current culture - if you
specify CultureInfo.InvariantCulture as the culture to use when upper
casing, you'll get the same results for both comparisons.

In this case, I believe that from a culture-neutral point of view,
they're different letters rather than just differently capitalised
letters. It's all a bit tricksy though, to be honest.

Hope this at least explains a bit of what's going on...
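
A minimal sketch of the difference described above, for anyone who wants to try it. It assumes C# 9 or later (top-level statements), and CodePoints is a made-up helper name for illustration, not part of .NET. What the culture-sensitive ToUpper() call prints depends on the thread's current culture and on the runtime, so no expected output is shown for it.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: formats each UTF-16 code unit of a string as U+XXXX,
// so look-alike characters can be told apart in the output.
static string CodePoints(string s)
{
    var parts = new string[s.Length];
    for (int i = 0; i < s.Length; i++)
    {
        parts[i] = "U+" + ((int)s[i]).ToString("X4");
    }
    return string.Join(" ", parts);
}

string smallLetterDotlessI = "\u0131"; // LATIN SMALL LETTER DOTLESS I
string capitalLetterI = "I";           // LATIN CAPITAL LETTER I (U+0049)

// Culture-sensitive uppercasing: uses the thread's current culture.
Console.WriteLine("Current culture:    " + CultureInfo.CurrentCulture.Name);
Console.WriteLine("ToUpper():          " + CodePoints(smallLetterDotlessI.ToUpper()));

// Culture-independent uppercasing: uses the invariant culture's casing rules.
Console.WriteLine("ToUpperInvariant(): " + CodePoints(smallLetterDotlessI.ToUpperInvariant()));

// Same shape as compare1, but with the invariant culture stated explicitly.
int compare1Invariant = string.Compare(
    smallLetterDotlessI.ToUpperInvariant(),
    capitalLetterI.ToUpperInvariant(),
    StringComparison.Ordinal);

// Same call as compare2 in the original post.
int compare2 = string.Compare(
    smallLetterDotlessI, capitalLetterI, StringComparison.OrdinalIgnoreCase);

Console.WriteLine("compare1 (invariant upper, ordinal): " + compare1Invariant);
Console.WriteLine("compare2 (OrdinalIgnoreCase):        " + compare2);
```

Printing code points instead of the characters themselves makes it obvious when ToUpper() has turned U+0131 into U+0049 and when it has left it alone.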

Jon Skeet - <skeet@pobox.com>
http://www.pobox.com/~skeet   Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too

Rene - 06 Jul 2007 02:06 GMT
Sure enough, I added

System.Threading.Thread.CurrentThread.CurrentCulture =
System.Globalization.CultureInfo.InvariantCulture;

before doing any comparing and voilà, I got the same answers this time (both
show as not being equal).

Looks like I have to do some more reading on "string.Compare". I didn't
think that learning about string/Unicode/culture/etc. would take me as long as
it has. The more I research, the more new stuff I keep bumping into...
damn it!

Thanks.

Rene - 06 Jul 2007 17:11 GMT
OK, I did some more digging around. According to the following site:

http://www.fileformat.info/info/unicode/char/0131/index.htm

The *Unicode* uppercase equivalent for 'LATIN SMALL LETTER DOTLESS I'
(U+0131) is 'LATIN CAPITAL LETTER I' (U+0049).

Having said that, I was under the impression that the OrdinalIgnoreCase flag
would use the *Unicode conversion tables* (no culture involved) to convert
the characters in the string to uppercase. This means that the uppercase
conversion should always be the same no matter what culture is being used.

If the above is true, the result for "compare2" should be zero because:

The "smallLetterDotlessI" variable, capitalized using the Unicode tables,
should return (U+0049).
The "capitalLetterI" variable is already a capital character, so
capitalizing it using the Unicode tables should also return (U+0049).

So you may think that the line of code below should return zero:

int compare2 = string.Compare(smallLetterDotlessI, capitalLetterI,
    StringComparison.OrdinalIgnoreCase);

But it does not. So what's going on? What logic is .NET using when
comparing with the OrdinalIgnoreCase flag? Is it not uppercasing all
characters using the Unicode conversion tables?

Thanks.

Rene - 06 Jul 2007 17:25 GMT
Well, I think I found the answer here:

http://blogs.msdn.com/michkap/archive/2005/03/10/391564.aspx

Basically the page says:

"Windows and the .NET Framework mainly support simple, reversible casing --  
which is to say single code point casing that have ToUpper() and ToLower()
as inverse operations that can "undo" each other."

So in my example, 'LATIN SMALL LETTER DOTLESS I' (U+0131) would uppercase to
'LATIN CAPITAL LETTER I' (U+0049), but 'LATIN CAPITAL LETTER I' (U+0049)
does not lowercase back to 'LATIN SMALL LETTER DOTLESS I' (U+0131); it
lowercases to 'LATIN SMALL LETTER I' (U+0069). Since this conversion is not
reversible, OrdinalIgnoreCase does not uppercase U+0131 to U+0049, and that
is why "compare2" does not return zero.

At least that's what I think is going on.
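
The round-trip idea above can be sketched as a small check. This only illustrates "reversible casing" as described in the quoted blog post; it is not how .NET implements OrdinalIgnoreCase internally. IsReversibleCasePair is a made-up helper name, and the code assumes C# 9 or later (top-level statements).

```csharp
using System;
using System.Globalization;

// Hypothetical helper: true only when uppercasing 'lower' gives 'upper' AND
// lowercasing 'upper' gives 'lower' back, using the invariant culture.
static bool IsReversibleCasePair(char lower, char upper)
{
    CultureInfo invariant = CultureInfo.InvariantCulture;
    return char.ToUpper(lower, invariant) == upper
        && char.ToLower(upper, invariant) == lower;
}

// 'i' (U+0069) and 'I' (U+0049) undo each other.
Console.WriteLine(IsReversibleCasePair('i', 'I'));      // True

// Dotless i (U+0131) and 'I' (U+0049) do not: 'I' lowercases to 'i' (U+0069),
// so the round trip can never get back to U+0131.
Console.WriteLine(IsReversibleCasePair('\u0131', 'I')); // False
```

The second call is false whatever the invariant culture does when uppercasing U+0131, because the lowercase half of the round trip already fails.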
