Better to decompose the text (normalization form D) than to use a lookup --
then you can pick the first character off.
However, even better than that is to be sure it is the right thing to do. In
English that letter is "E with a funny line on it" but in some languages
those are considered entirely different letters; attempting to strip them
will lead to an unhappy user community....

Signature
MichKa [MS]
NLS Collation/Locale/Keyboard Technical Lead
Globalization Infrastructure and Font Technologies
Windows International Division
This posting is provided "AS IS" with
no warranties, and confers no rights.
> I want to build an index to a list of words based on the first letter of
> these words for 4 languages ( french english czech spanish ).
[quoted text clipped - 40 lines]
>
> Thank you.
François - 01 Nov 2004 21:08 GMT
Thank you.
Here's another example, this time in Czech (a language I know nothing about).
The following words are sorted (ignoreCase) with the locale cs-CZ.
Cyklopentan
Částice
Dusík
Ethanol
Fluor
Glutaraldehyd
Hydroxid
Chlor
Using the first two bytes of KeyData
[ ...CompareInfo.GetSortKey(s, CompareOptions.IgnoreCase).KeyData ]
as an indicator that a break occurs on the first letter of the words,
I obtain the following index.
C Č D E F G H C
Using the first two bytes of KeyData gives satisfying results for English
and almost satisfying results for French (I can't, for example, map É to E).
"C Č D E F G H C" looks weird (it should be "C Č D E F G H Ch").
The KeyData for Chlor in (cs-CZ, CompareOptions.IgnoreCase) reads "14 46
14 72 14 124 14 138 1 1 1 1 0". There are only 4 byte pairs instead of 5, even
though "Chlor" has 5 characters.
So the base API knows that "Ch" is only one letter.
It would be interesting to have a reverse map from (14,46) to Ch.
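To make the byte-pair count concrete, here is a small Python sketch that only parses the bytes quoted above. It assumes the first 1 byte ends the primary weights, which fits this one sample but is not a documented layout; the function name primary_pairs is made up for the illustration. It is not a recommendation to read sort keys in real code.

```python
# Sort key bytes for "Chlor" exactly as quoted in the post (cs-CZ, IgnoreCase).
KEY_DATA = "14 46 14 72 14 124 14 138 1 1 1 1 0"


def primary_pairs(key_data):
    """Split the bytes before the first 1 byte into two-byte pairs.

    Assumption: the first 1 byte terminates the primary weights. This matches
    the sample above but is not a stable, documented format.
    """
    values = [int(v) for v in key_data.split()]
    end = values.index(1) if 1 in values else len(values)
    primary = values[:end]
    return [(primary[i], primary[i + 1]) for i in range(0, len(primary) - 1, 2)]


pairs = primary_pairs(KEY_DATA)
print(len(pairs), pairs)  # 4 pairs for the 5 characters of "Chlor"
```

The first pair, (14, 46), would then be the weight that covers both "C" and "h".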
I had a look at the Unicode web site on "NFD".
That seems interesting. It would let me map É to E in French.
Will it help me know that "Ch" is a letter in Czech? What about the
framework and NFD?
Thanks again.
Michael (michka) Kaplan [MS] - 01 Nov 2004 22:34 GMT
Normalization is not doing a MAPPING -- it turns E Grave into E + Combining
Grave, so it's two characters, and the first character is an E.
I would *never* recommend trying to unpack the sort key to get information
as the values are not promised to remain constant.
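The decomposition can be tried outside .NET as well. The following is a minimal Python sketch using the standard unicodedata module, not the framework API discussed in this thread; first_base_letter is a hypothetical helper name.

```python
import unicodedata


def first_base_letter(word):
    """Return the first code point of the NFD form of word, uppercased."""
    decomposed = unicodedata.normalize("NFD", word)
    return decomposed[0].upper() if decomposed else ""


# E Grave (U+00C8) becomes E (U+0045) followed by Combining Grave (U+0300).
print([hex(ord(c)) for c in unicodedata.normalize("NFD", "\u00c8")])  # ['0x45', '0x300']
print(first_base_letter("École"))    # E
print(first_base_letter("Částice"))  # C
```

Note the last line: the same step that gives E for École also turns Č into C, which is wrong for a Czech index where Č is a separate letter. Decomposition says nothing about whether stripping the accent is the right thing to do for a given language.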

François - 02 Nov 2004 16:12 GMT
Thanks.
I had a look at UnicodeData.txt and the ICU website.
I will build a simple lookup table for each of the languages I intend to
support, using the ICU collation charts and ICU collation customization rules.
These tables will help me determine the index entry for a word (e.g. École ->
E in French; Chlor -> CH in Czech).
I am a novice in Unicode.
Your help was precious.
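As a rough illustration of the lookup-table plan, here is a Python sketch. The INDEX_RULES table is hypothetical and hand-filled with only the cases mentioned in this thread; a real table would have to come from the collation charts for each language, and the names index_entry, "keep" and "digraphs" are invented for the example.

```python
import unicodedata

# Hypothetical, incomplete per-language rules (illustration only).
# "keep": accented letters that are index entries of their own.
# "digraphs": character sequences that count as a single letter.
INDEX_RULES = {
    "fr": {"keep": set(), "digraphs": []},
    "cs": {"keep": {"\u010c"}, "digraphs": ["CH"]},  # U+010C is Č
}


def index_entry(word, lang):
    """Return the index entry for word under the rules for lang."""
    rules = INDEX_RULES[lang]
    upper = unicodedata.normalize("NFC", word).upper()
    if not upper:
        return ""
    for digraph in rules["digraphs"]:
        if upper.startswith(digraph):
            return digraph
    first = upper[0]
    if first in rules["keep"]:
        return first
    # Otherwise fall back to the base letter of the decomposed form.
    return unicodedata.normalize("NFD", first)[0]


print(index_entry("École", "fr"))        # E
print(index_entry("Chlor", "cs"))        # CH
print(index_entry("Částice", "cs"))      # Č
print(index_entry("Cyklopentan", "cs"))  # C
```

The table is the whole design: anything not listed falls back to stripping the accent, so a missing entry silently produces a wrong index letter.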
Michael (michka) Kaplan [MS] - 02 Nov 2004 16:18 GMT
Well, you may want to reconsider this plan -- there are plenty of languages
for which this is an invalid model.
But beyond that, if you are using Whidbey then Unicode normalization is built
in.
