.NET Forum / .NET Framework / Internationalization / November 2004

First letter index

François - 28 Oct 2004 15:37 GMT
I want to build an index for a list of words, based on the first letter of
each word, for 4 languages (French, English, Czech, Spanish).

In English there is no problem. There are no accents and every letter is a
single character.

In French there are accents, and as far as I know every letter is a single
character. Capitalizing a word doesn't change its meaning.

In other languages there may be accents, a letter may be composed of more
than one character, and capitalization can change the meaning of a word.

-----
Here is an example in French of what I am looking for.

For the words:
    Abandon
    École
    ennui
    fuite

I would like to obtain the following entries:
    A ( -> Abandon)
    E ( -> École, ennui)
    F ( -> fuite )

-----

For French I can build a lookup table myself and so have a solution, French
being my native language.

But I cannot do that for Czech, etc.

-----

My questions are:

1) Even if my aim is meaningful for French and English, is it meaningful
for Czech and Spanish?
2) If the answer to 1) is yes, can the .NET Framework help me?

I had a look at the CompareInfo and SortKey classes.

Thank you.
Michael (michka) Kaplan [MS] - 31 Oct 2004 18:23 GMT
Better to decompose the text (normalization form D) than to use a lookup --
then you can pick the first character off.

However, even better than that is to be sure it is the right thing to do. In
English that letter is "E with a funny line on it" but in some languages
those are considered entirely different letters; attempting to strip them
will lead to an unhappy user community....
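
A minimal sketch of the decompose-then-take-the-first-character idea, written in Python rather than .NET so that it is easy to run. The function names `first_base_letter` and `build_index` are made up for this illustration. In .NET 2.0 and later the comparable call is `String.Normalize(NormalizationForm.FormD)`.

```python
import unicodedata
from collections import defaultdict


def first_base_letter(word):
    """Return the upper-cased first character of the NFD form of word."""
    # NFD splits a precomposed letter such as "É" into the base letter "E"
    # followed by a combining accent, so index 0 is the base letter.
    decomposed = unicodedata.normalize("NFD", word)
    return decomposed[0].upper()


def build_index(words):
    """Group words under their first base letter, keeping input order."""
    index = defaultdict(list)
    for word in words:
        index[first_base_letter(word)].append(word)
    return dict(index)


index = build_index(["Abandon", "École", "ennui", "fuite"])
for letter in sorted(index):
    print(letter, "->", ", ".join(index[letter]))
# prints:
# A -> Abandon
# E -> École, ennui
# F -> fuite
```

The same function files the Czech word "Částice" under C, which is exactly the kind of stripping the caveat above warns about for languages where the accented letter is a letter in its own right.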

Signature

MichKa [MS]
NLS Collation/Locale/Keyboard Technical Lead
Globalization Infrastructure and Font Technologies
Windows International Division

This posting is provided "AS IS" with
no warranties, and confers no rights.

François - 01 Nov 2004 21:08 GMT
Thank you.

Here is another example, this time in Czech (a language I know nothing
about).

The following words are sorted (ignoring case) with the locale cs-CZ.

Cyklopentan
Částice
Dusík
Ethanol
Fluor
Glutaraldehyd
Hydroxid
Chlor

Using the first two bytes of KeyData
[ ...CompareInfo.GetSortKey(s, CompareOptions.IgnoreCase).KeyData ]
to detect where the first letter changes from one word to the next,
I obtain the following index.

C Č D E F G H C

Using the first two bytes of KeyData gives satisfying results for English
and almost satisfying results for French (I can't, for example, map É to E).

"C Č D E F G H C" looks weird (it should be "C Č D E F G H Ch").

The KeyData for Chlor in (cs-CZ, CompareOptions.IgnoreCase) reads "14 46
14 72 14 124 14 138 1 1 1 1 0". There are only 4 byte pairs instead of 5, even
though "Chlor" has 5 characters.
So the base API knows that "Ch" is a single letter.
It would be interesting to have a reverse map from (14,46) to Ch.

I had a look at "NFD" on the Unicode web site.
That seems interesting. It would let me map É to E in French.
Will it also help me find out that "Ch" is a letter in Czech? And what about
NFD support in the framework?

Thanks again.
Michael (michka) Kaplan [MS] - 01 Nov 2004 22:34 GMT
Normalization is not doing a MAPPING -- it turns E Grave into E + Combining
Grave, so it's two characters, and the first character is an E.
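
To make the decomposition concrete, here is a small Python sketch (Python is used only for illustration, and `describe` is a hypothetical helper) that lists the code points of the NFD form. It also shows that NFD leaves "Ch" as two ordinary characters: normalization is locale-independent and carries no knowledge of Czech digraphs.

```python
import unicodedata


def describe(text):
    """List the code points of the NFD form of text with their Unicode names."""
    decomposed = unicodedata.normalize("NFD", text)
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in decomposed]


# "\u00c8" is LATIN CAPITAL LETTER E WITH GRAVE as a single precomposed character.
for line in describe("\u00c8"):
    print(line)
# prints:
# U+0045 LATIN CAPITAL LETTER E
# U+0300 COMBINING GRAVE ACCENT

# A digraph is not touched by normalization: still "C" followed by "h".
print(describe("Ch"))
```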

I would *never* recommend trying to unpack the sort key to get information
as the values are not promised to remain constant.

François - 02 Nov 2004 16:12 GMT
Thanks.

I had a look at UnicodeData.txt and the ICU website.

I will build a simple lookup table for each of the languages I intend to
support, using the ICU collation charts and ICU collation customization rules.

These tables will help me determine the index entry for a word (e.g. École ->
E in French; Chlor -> CH in Czech).
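
One possible shape for such tables, sketched in Python. Everything here is an assumption made for illustration: the `RULES` layout, the `index_entry` name, and the table contents, which are a guess at a few well-known cases (Czech "ch", Č, Ř, Š, Ž and Spanish Ñ as separate letters) and not data taken from the ICU charts.

```python
import unicodedata

# Hypothetical per-language rules, for illustration only. Real tables would be
# derived from collation data such as the ICU charts.
#   digraphs: multi-character letters that get their own index entry
#   distinct: accented letters that must NOT be folded to their base letter
RULES = {
    "fr": {"digraphs": [], "distinct": ""},
    "cs": {"digraphs": ["CH"], "distinct": "ČŘŠŽ"},
    "es": {"digraphs": [], "distinct": "Ñ"},
}


def index_entry(word, lang):
    """Return the index heading for a non-empty word under the given rules."""
    rules = RULES[lang]
    # Compose first so that each accented letter is one character.
    upper = unicodedata.normalize("NFC", word).upper()
    for digraph in rules["digraphs"]:
        if upper.startswith(digraph):
            return digraph
    first = upper[0]
    if first in unicodedata.normalize("NFC", rules["distinct"]):
        return first
    # Otherwise fold the accent away: decompose and keep the base letter.
    return unicodedata.normalize("NFD", first)[0]


print(index_entry("École", "fr"))    # E
print(index_entry("Chlor", "cs"))    # CH
print(index_entry("Částice", "cs"))  # Č
```

Keeping the rules as data means a language can be added or corrected without touching the function, but the sketch still assumes that every supported language can be described by "digraphs plus protected accents".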

I am a novice in Unicode.
Your help was precious.
Michael (michka) Kaplan [MS] - 02 Nov 2004 16:18 GMT
Well, you may want to reconsider this plan -- there are plenty of languages
for which this is an invalid model.

But beyond that if you are using Whidbey then Unicode normalization is built
in.
