I am faced with the following issue.
We have Unicode text data in a SQL Server 2000 database. We are in the
process of adding Cyrillic, Greek, Hebrew and Japanese data.
Our downstream systems can only accept a subset of Latin characters. The
.NET process that creates output for our downstream systems now needs to
filter out characters to ensure our downstream systems can handle the data.
I need to look at each character returned from the database and determine
whether the character can be sent downstream, but I don't know how to
determine whether a character is, say, Kanji or Latin-1.
I have looked around MSDN for examples, but I haven't found anything that
helps.
Can someone point me in a direction?
Any code or pseudo code is appreciated.
Bill
Hi Bill!
> Our downstream systems can only accept a subset of Latin characters. The
> .NET process that creates output for our downstream systems needs to now
> filter out characters to ensure our downstream systems can handle the data.
Why not just convert the Unicode string to the appropriate encoding? All
"unknown" characters will be removed or replaced.
See:
WinAPI:
WideCharToMultiByte
.NET:
Encoding-Class
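
A minimal sketch of the "convert and let the encoder replace" idea. It is written in Python only to keep it short and self-contained. The function name `to_latin1` and the choice of ISO-8859-1 as the target are assumptions for illustration; in .NET the comparable route would be `Encoding.GetEncoding` with an `EncoderReplacementFallback`.

```python
def to_latin1(text: str, drop: bool = False) -> str:
    """Round-trip text through ISO-8859-1 (hypothetical target encoding).

    Characters outside the encoding become '?' by default,
    or are removed entirely when drop=True.
    """
    errors = "ignore" if drop else "replace"
    return text.encode("latin-1", errors=errors).decode("latin-1")


if __name__ == "__main__":
    print(to_latin1("Caf\u00e9 \u041c\u043e\u0441\u043a\u0432\u0430"))
    print(to_latin1("Caf\u00e9 \u041c\u043e\u0441\u043a\u0432\u0430", drop=True))
```

This only helps if the downstream system accepts every character of some known encoding. It does not tell you why a record was altered.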

Greetings
Jochen
My blog about Win32 and .NET
http://blog.kalmbachnet.de/
Bill Musgrave - 10 Jan 2007 13:47 GMT
Jochen, I did try something like this earlier.
The issue I ran into was that the downstream systems would let through some
Latin-2 characters but not all. The systems I am dealing with are not well
documented regarding anything dealing with encodings or Unicode. I am going
to need to experiment letter by letter to determine whether the systems will
accept a letter 'e' with an accent, but not with a caron or umlaut.
I suspect, given my business sponsors, that we'll probably get into substituting
characters, so if the letter 'e' with an umlaut is not acceptable to the
downstream systems, we'll just pass the letter 'e' by itself.
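
One way the substitution could work, sketched in Python under stated assumptions: the allowlist `ALLOWED` is hypothetical (printable ASCII plus 'é' and 'É') and would really come from the letter-by-letter experiments. The function `substitute` is an illustrative name, not an existing API. A character that is not allowed is decomposed (Unicode form NFD) and its combining marks are dropped, so 'ë' falls back to 'e'.

```python
import unicodedata

# Hypothetical allowlist: printable ASCII plus e-acute in both cases.
ALLOWED = set(map(chr, range(0x20, 0x7F))) | set("\u00e9\u00c9")


def substitute(text):
    """Return text with unacceptable accented letters reduced to their
    base letter, or None if some character has no acceptable fallback."""
    out = []
    for ch in text:
        if ch in ALLOWED:
            out.append(ch)
            continue
        # Decompose (e.g. e-umlaut -> 'e' + U+0308) and drop combining marks.
        base = "".join(
            c for c in unicodedata.normalize("NFD", ch)
            if not unicodedata.combining(c)
        )
        if base and all(c in ALLOWED for c in base):
            out.append(base)
        else:
            return None  # no fallback; caller can skip and audit the record
    return "".join(out)
```

In .NET the same decomposition is available through `String.Normalize(NormalizationForm.FormD)` together with `CharUnicodeInfo.GetUnicodeCategory`.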
We also would like to know why a given record was not sent downstream. If I
can somehow determine that the first several characters of a string are in
Russian or Japanese, I can note it in our audit table, and skip to the next
record.
Thanks
Bill
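You could probably do something with regex and character blocks, e.g.
\p{IsBasicLatin} or \p{IsCyrillic} (.NET uses specific block names such as
IsBasicLatin and IsLatin-1Supplement rather than a generic IsLatin).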
See regex help on character classes and
http://unicode.org/reports/tr18/tr18-5.1.html#Character%20Blocks
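
A rough sketch of block-based detection for the audit case, again in Python. Python's built-in `re` module has no `\p{...}` support, so explicit code point ranges stand in for named blocks like `\p{IsCyrillic}`. The names `BLOCKS` and `detect_block` and the sample size of 5 are assumptions for illustration, and the "Japanese" entry also matches Chinese text because Kanji share the CJK Unified Ideographs block.

```python
import re

# A few Unicode block ranges (not exhaustive).
BLOCKS = {
    "Cyrillic": re.compile(r"[\u0400-\u04FF]"),
    "Greek": re.compile(r"[\u0370-\u03FF]"),
    "Hebrew": re.compile(r"[\u0590-\u05FF]"),
    # Hiragana + Katakana + CJK Unified Ideographs
    "Japanese": re.compile(r"[\u3040-\u30FF\u4E00-\u9FFF]"),
}


def detect_block(text, sample=5):
    """Return the first matching block name among the first `sample`
    characters, or None if none of them falls in a listed block."""
    for ch in text[:sample]:
        for name, pattern in BLOCKS.items():
            if pattern.match(ch):
                return name
    return None
```

The returned name could be written to the audit table as the reason a record was skipped.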

Clive Dixon
Digita Ltd. (www.digita.com)
Bill Musgrave - 10 Jan 2007 14:06 GMT
Clive, thanks for the heads-up on the regex. I just Googled around and found
that Michael Kaplan had an interesting blog article (below). Maybe I can use this
info: http://blogs.msdn.com/michkap/archive/2005/09/13/464416.aspx
which led me to this:
http://msdn2.microsoft.com/en-us/library/20bw873z.aspx
I might be able to make this work...
Thanks
Bill