.NET Forum / .NET Framework / Internationalization / September 2007
Determining whether the text is RTL
Jan Kucera - 11 Sep 2007 11:29 GMT
Hello, I ran into a small problem with automatic text alignment in WPF, described at http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=2123352, and it seems I will have to implement the workaround myself. This group seems a more appropriate place to look for the answer.
The application gets some text (from XML) and is supposed to display it. However, this XML contains data from several cultures, and some of it comes from RTL ones (e.g. the text is Hebrew). Now I need to find out whether I should align the text to the left or to the right. Is there any function, either in .NET or in Win32, that would determine this for me? I could take the first character and test whether it is Arabic, Hebrew and so on, but I would likely miss some case (or a future one), so I'm looking for a more general way of doing that.
Thank you for any hints, Jan
Mihai N. - 12 Sep 2007 06:25 GMT
> The application gets some text (from XML) and is supposed to display it.
> However, this XML contains data from several cultures and some comes from
[quoted text clipped - 4 lines]
> miss some case (or future one), so I'm looking for more general way of
> doing that.
This is how you determine if some culture needs RTL rendering: http://blogs.msdn.com/michkap/archive/2006/07/12/663013.aspx
But you need to have a way in the XML itself to tag data with a culture.
There is no 100% safe way to determine if the text is RTL based on the text content only. Imagine you have a mixture like this: "XXXXX YYYYY" with XXXXX some English text, and YYYYY some Arabic text. Is that English with an Arabic inset, or Arabic with an English inset?
 Signature Mihai Nita [Microsoft MVP, Windows - SDK] http://www.mihai-nita.net ------------------------------------------ Replace _year_ with _ to get the real email
Jan Kucera - 12 Sep 2007 06:36 GMT
Hi Mihai, thank you for the answer. However, Michael's post assumes that you already have a CultureInfo. In that case, since I am targeting a newer .NET Framework, I could simply use CultureInfo.TextInfo.IsRightToLeft.
Okay, I know your sample would be a problem. So how would I check this for a single character? Is there any way to test for all RTL cases?
Actually, I think I do have an ISO-639-2 tag for the text, but I'm not sure whether it is worth building separate text-flow information from those tags.
Jan
>> The application gets some text (from XML) and is supposed to display
>> it.
[quoted text clipped - 20 lines]
> with XXXXX some English text, and YYYYY some Arabic text.
> Is that English with an Arabic inset, or Arabic with an English inset?

Mihai N. - 13 Sep 2007 04:45 GMT
> thank you for answer. However, the Michael's post is expecting to have a
> CultureInfo. That way, because targeting newer .NET Framework, I could use
> the CultureInfo.TextInfo.IsRightToLeft.
> Okay, I know your sample would be a problem. So, how to check it for a
> single character? Is there any way to test for all RTL cases?
Without a CultureInfo you can try calling the native GetStringTypeEx. It takes a locale ID, but you can pass whatever you want: the strong attributes in CT_CTYPE2 (C2_RIGHTTOLEFT/C2_LEFTTORIGHT) are not affected by locale.
But there is still no reliable way to test for all RTL cases. Sometimes not even a human can do it.
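Editor's sketch of the single-character check discussed above. It is not a call to GetStringTypeEx; as a portable stand-in it assumes that Python's `unicodedata.bidirectional` (the Unicode bidi class) carries the same "strong" left-to-right / right-to-left information. The function name `first_strong_direction` and the `default` parameter are hypothetical, introduced only for illustration.

```python
import unicodedata

# Strong bidi classes from the Unicode Character Database.
# "R" covers Hebrew-like scripts, "AL" covers Arabic, Syriac, Thaana, etc.
RTL_CLASSES = {"R", "AL"}
LTR_CLASSES = {"L"}


def first_strong_direction(text, default="ltr"):
    """Return "rtl" or "ltr" based on the first strongly directional character.

    Digits, spaces and punctuation are neutral or weak, so they are skipped.
    If the text has no strong character at all, `default` is returned.
    """
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in RTL_CLASSES:
            return "rtl"
        if bidi in LTR_CLASSES:
            return "ltr"
    return default
```

Because the check relies on character properties rather than a list of script names, it does not need updating for each RTL script, but it still cannot resolve the mixed "XXXXX YYYYY" case: it simply reports whichever strong character comes first.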
> Actually I think I do have ISO-639-2 tag for the text, but I'm not sure
> whether it is worth to create separate info about textflow with them.
I think most of the time text content is in a single language. A document is mostly in language A, with small chunks of other languages. But those areas have to be tagged. Designing a document where all the languages are mixed, without properly tagging them, is not very useful. Think MS Word, where you can mark text sections with a different language for spell-checking.
If possible it would be a good idea to tag the documents (if not paragraphs, or records, or whatever) with a full locale ID, RFC 4646 style.
There are quite a few things that cannot be done properly without locale info. For example, sorting and case conversion are culture-sensitive. So is font selection (you cannot use a Chinese Traditional font for Chinese Simplified text, even when the text is identical). In fact, unless all you do is move text around (no processing, no display), it is best to know the locale of that text.
Jan Kucera - 13 Sep 2007 11:21 GMT
> Without a CultureInfo you can try calling (the native) GetStringTypeEx.
> It takes a locale ID, but you can use whatever you want,
[quoted text clipped - 3 lines]
> But there is still no reliable way to test for all RTL cases.
> Sometimes not even a human can do it.
I will give it a try. I just want to avoid writing something like (not to mention that I did not find any way to do such a check in .NET)
if (char is Arabic || char is Hebrew || char is Urdu || char is Persian || char is Syriac)
and then forgetting the Divehi case, or any new culture that comes along. I thought that the approach "if whatever version of Windows (or .NET) I am running on thinks it is RTL, I should think so as well" would do the trick.
> I think most of the time text content is in a single language.
> A document is mostly in language A, with small chunks of other languages.
[quoted text clipped - 3 lines]
> Think MS Word, where you can mark text sections with a different language
> for spell-checking.
Yes, I agree; I wanted to mention it in connection with your example too. I know the text I'm displaying will always be entirely in one language (rarely with the exception of a word or two). So I can afford to just check the first character of a title, for example.
> If possible it would be a good idea to tag the documents
> (if not paragraphs, or records, or whatever) with a full locale ID,
[quoted text clipped - 7 lines]
> display),
> it is best to know what is the locale of that text.
Well, fortunately I define the schema here, so I could make some changes or improvements. I have a set of data coming from different cultures and, as Michael has written in his blog and also suggested to me, the user most likely expects behaviour based on his own culture. So I sort this data and do case-insensitive searching in the context of the user's culture. All I do with the data itself is display it. For that reason, and because of WPF, I need to know whether I should mark the document as RTL. The only other reason for knowing the CultureInfo that I could come up with is the ToTitleCase method, but I expect the titles of documents to be properly cased already.
The problem here is that I have data in languages which do not match any existing culture, such as Latin, Old or Middle English and so on, and artificial languages are not ruled out either. Filtering the data to show only items in Middle English (enm) is far more important to my application than having a CultureInfo for the language, since I only need to display it. This is the reason I chose the ISO-639-2 table instead of the cultures supported by .NET.
If there were a table mapping ISO-639-2 or -3 languages to appropriate CultureInfo classes, even an inaccurate one, my problems would be solved. The document could keep the ISO tags and the application would get the corresponding CultureInfo for displaying it properly. Until then, I think GetStringTypeEx will do the job.
Thank you for your hints and thoughts. Jan
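Editor's sketch of the kind of table Jan describes, reduced to the one property he needs (text direction) rather than a full CultureInfo. The table contents, the names `RTL_LANGUAGES`, `KNOWN_LTR_LANGUAGES` and `direction_for`, and the fallback policy are all assumptions made for illustration; the code lists are deliberately incomplete and would need to be extended for real data.

```python
import unicodedata

# Hypothetical, incomplete table of ISO 639-2 codes for languages that are
# normally written right to left. "per" and "fas" are the two codes for Persian.
RTL_LANGUAGES = frozenset({"ara", "heb", "urd", "per", "fas", "syr", "div", "yid"})

# A few codes mentioned in the thread that are written left to right.
KNOWN_LTR_LANGUAGES = frozenset({"eng", "enm", "ang", "lat"})


def direction_for(lang_code, text=""):
    """Pick "rtl" or "ltr" from an ISO 639-2 tag, falling back to the text.

    The tag wins when it is in one of the tables. For unknown or missing
    tags, the first strongly directional character of `text` decides, and
    "ltr" is the final default.
    """
    code = (lang_code or "").strip().lower()
    if code in RTL_LANGUAGES:
        return "rtl"
    if code in KNOWN_LTR_LANGUAGES:
        return "ltr"
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):
            return "rtl"
        if bidi == "L":
            return "ltr"
    return "ltr"
```

With this shape the document keeps its ISO tags for filtering, and the content-based check is only consulted for codes the table does not cover.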
Mihai N. - 14 Sep 2007 05:24 GMT
> ... will always be whole (or rarely except a word or two) within
> the same language. So I can afford to just check the first character in a
> title for example.
If you don't notice any performance hit, try going beyond the first character, exactly for the rare "word or two," or digits, or other characters. Maybe calculate a percentage (72% rtl, 12% ltr, 6% others), establish a threshold, and go from there.
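Editor's sketch of the percentage idea above, again using Python's `unicodedata.bidirectional` as a stand-in for the native character-type call. The function name `direction_by_majority` and the 0.5 threshold are hypothetical choices; unlike the figures in the post, the share is computed over strongly directional characters only, so digits, spaces and punctuation do not dilute it.

```python
import unicodedata


def direction_by_majority(text, rtl_threshold=0.5, default="ltr"):
    """Return "rtl" when the share of strong RTL characters exceeds the threshold.

    Only strongly directional characters are counted: bidi classes "R" and
    "AL" as right-to-left, "L" as left-to-right. Everything else is ignored.
    """
    rtl = 0
    ltr = 0
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):
            rtl += 1
        elif bidi == "L":
            ltr += 1
    strong = rtl + ltr
    if strong == 0:
        # Nothing directional to go on (empty text, digits only, ...).
        return default
    return "rtl" if rtl / strong > rtl_threshold else "ltr"
```

A title that starts with a Latin product name followed by Hebrew words would be classified as RTL here, whereas a first-character check would report LTR.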
> The problem here is, that I have data in languages which do not match with
> any existing culture. Like Latin, Old or Middle English and so on,
> artificial languages not foreclosed either.
Yes, I understand how this can be a problem :-)
If you can control the environment (and it is Vista) you can create your own custom locales.
See:
http://blogs.msdn.com/shawnste/archive/2005/11/23/496440.aspx
http://msdn.microsoft.com/msdnmag/issues/06/12/LocaleHero/
http://msdn.microsoft.com/msdnmag/issues/06/06/CLRInsideOut/
http://windowsvistablog.com/blogs/windowsvista/archive/2006/07/19/442572.aspx
And the tools:
- Microsoft Locale Builder (Beta 2) http://www.microsoft.com/downloads/details.aspx?FamilyID=e4588c5e-8f21-45cc-b862-38df8d9bd528&DisplayLang=en
- Microsoft Keyboard Layout Creator http://www.microsoft.com/globaldev/tools/msklc.mspx
Jan Kucera - 14 Sep 2007 08:28 GMT
> Maybe calculate a percentage (72% rtl, 12% ltr, 6% others), establish
> a threshold, and go from there.
Yes, I have thought about this already. It should not cost much performance, since I am checking only the title of the document. But I think I'll try to keep it simple for the moment (GetStringTypeEx works as expected, thanks!) until I find some problematic data or solve the problem another way.
> If you can control the environment (and it is Vista) you can create your
> own custom locales.
Thanks for the links. Regardless of whether I could afford to support only Vista... well... there are 500 items in ISO-639-2 and 7500 in ISO-639-3... Uh.. :-)) I have never heard of most of them, let alone know the culture/language deeply enough to create a corresponding CultureInfo.
Jan
Michael S. Kaplan [MSFT] - 14 Sep 2007 08:13 GMT
Jan,
You can use code like in this post:
http://blogs.msdn.com/michkap/archive/2007/01/06/1421178.aspx
or use GetStringTypeW to get the info back.
 Signature MichKa [Microsoft] Fundamentals Technical Lead Windows International Blog: http://blogs.msdn.com/michkap
This posting is provided "AS IS" with no warranties, and confers no rights.
>> Without a CultureInfo you can try calling (the native) GetStringTypeEx.
>> It takes a locale ID, but you can use whatever you want,
[quoted text clipped - 65 lines]
> Thank you for your hints and thoughts.
> Jan

Jan Kucera - 14 Sep 2007 08:39 GMT
> Jan,
>
> You can use code like in this post:
> http://blogs.msdn.com/michkap/archive/2007/01/06/1421178.aspx
> or use GetStringTypeW to get the info back.
Hmmm... thanks for the managed way, Michael! Although I'd have to find a very good reason to leave PInvoke and move to Reflection... ;-)
Any improvements in .NET 3.0 or 3.5? Jan
Michael S. Kaplan [MSFT] - 14 Sep 2007 14:20 GMT
Unfortunately, no -- red bits/green bits rules, you see. :-(
>> Jan,
>>
[quoted text clipped - 8 lines]
> Any improvements in .NET 3.0 or 3.5?
> Jan