UTF8 to UTF16 ?

PL - 24 Mar 2006 13:54 GMT
I'm somewhat confused about Unicode, but until recently I hadn't really seen
many issues with using it. We recently started using an
SMS gateway that requires a Unicode message to be sent as a hexadecimal
string in which each byte has been replaced with its two-digit hexadecimal value,
for example: 043104AF0442044D044...

According to their documentation, this string must be in UTF-16 before
conversion to the hexadecimal form. We, however, use UTF-8 on our
website, and all the text is entered as UTF-8.

When I try to send a Unicode message using content from our
website, some characters are shown correctly but not all of them. I cannot see
another reason for this than the fact that we are using UTF-8 and they
require UTF-16.

Now to the questions:

1. How do I convert between UTF-8 and UTF-16? I was looking at the Decoder
and Encoder classes, but as far as I could see they don't provide a direct
way to convert between encodings.

2. Since all strings in .NET are actually UTF-16, does this mean that the
conversion has already been made, or is it actually storing
UTF-8 encoded bytes in a UTF-16 string?

Thank you
PL.
Morten Wennevik - 24 Mar 2006 14:57 GMT
Hi PL,

You can use the System.Text.Encoding class to convert one string to a byte  
array and then back to string in another encoding.

byte[] data = System.Text.Encoding.UTF8.GetBytes(utf8string);
string unicodestring = System.Text.Encoding.Unicode.GetString(data);

Beware that UTF16 can be big endian, in which case use BigEndianUnicode to  
get the string.

As for the second question: yes, all strings are Unicode, but the content
of the string does not have to be Unicode encoded. I believe a string can
hold UTF8 encoded data without loss, but if you plan on doing string
manipulation I would convert it to Unicode first.


Happy Coding!
Morten Wennevik [C# MVP]

PL - 24 Mar 2006 16:30 GMT
Thank you, I was looking at the Encoding class without seeing that simple
solution :-/

PL.

Nick Hounsome - 25 Mar 2006 13:25 GMT
> Hi PL,
>
[quoted text clipped - 3 lines]
> byte[] data = System.Text.Encoding.UTF8.GetBytes(utf8string);
> string unicodestring = System.Text.Encoding.Unicode.GetString(data);

This is just wrong.

Strings are sequences of characters, not sequences of encoded characters,
so it is meaningless to have a variable of type System.String
called utf8string.

Consider the simpler situation with Int32:
The integer 10 is not the sequence of characters "10" in decimal, nor is
it "1010" in binary, nor is it the bytes 0x00,0x00,0x00,0x0a. These are
all encodings. The two quoted lines above are the equivalent of writing
something like:

int hexInt = 0x42;                   // the integer 66
string data = hexInt.ToString("X");  // "42", the hexadecimal encoding of 66
int decimalInt = int.Parse(data);    // 42: the hex digits misread as decimal
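
To make the problem with the two quoted lines concrete, here is a minimal sketch that runs them on a plain ASCII string. The class and method names (MisdecodeDemo, RoundTripWrongly) are made up for illustration. The UTF-8 bytes are read back as if they were UTF-16 little-endian code units, so each pair of bytes collapses into one unrelated character.

```csharp
using System;
using System.Text;

public static class MisdecodeDemo
{
    // Reproduces the two quoted lines: encode as UTF-8, then decode
    // the same bytes as if they were UTF-16 (little-endian).
    public static string RoundTripWrongly(string text)
    {
        byte[] data = Encoding.UTF8.GetBytes(text);
        return Encoding.Unicode.GetString(data);
    }

    public static void Main()
    {
        string original = "abcd";                     // UTF-8 bytes: 61 62 63 64
        string mangled = RoundTripWrongly(original);  // code units: 0x6261, 0x6463

        Console.WriteLine(original.Length);            // 4
        Console.WriteLine(mangled.Length);             // 2
        Console.WriteLine(mangled == "\u6261\u6463");  // True
    }
}
```

The result is not a "UTF-16 version" of the input; it is a different string altogether.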

> Beware that UTF16 can be big endian, in which case use BigEndianUnicode to
> get the string.

This brings up the issue of byte order marks (BOMs).
If you use a BOM, the encoding can be inferred from the first few bytes.
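
As a small illustration, the sketch below prints the BOM (called the preamble in .NET) for the three encodings discussed in this thread. PreambleHex is a hypothetical helper name. Note that GetBytes does not prepend the preamble; it is only written when something explicitly asks for it, such as a StreamWriter at the start of a stream.

```csharp
using System;
using System.Text;

public static class BomDemo
{
    // Returns the encoding's byte order mark as an uppercase hex string.
    public static string PreambleHex(Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetPreamble()).Replace("-", "");
    }

    public static void Main()
    {
        Console.WriteLine(PreambleHex(Encoding.Unicode));          // FFFE
        Console.WriteLine(PreambleHex(Encoding.BigEndianUnicode)); // FEFF
        Console.WriteLine(PreambleHex(Encoding.UTF8));             // EFBBBF
    }
}
```

Whether the SMS gateway in the question expects or tolerates a BOM is not stated in the thread, so check its documentation before sending one.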

> As for the second question. Yes all strings are unicode, but the content
> of the string does not have to be unicode encoded.  I believe a string can
> hold UTF8 encoded data without loss,

A string is not encoded, so it is meaningless to say that it holds
UTF8 encoded data.

> but if you plan on doing string  manipulation I would convert it to
> unicode first.

There is no other type of string in .NET, so all string manipulation
is inherently Unicode.
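
A rough sketch of the distinction, using made-up helper names (CodeUnitDemo, Utf8ByteCount, Utf16ByteCount): the string itself only has a length in characters, and a byte count only appears once you pick an Encoding to serialize it with.

```csharp
using System;
using System.Text;

public static class CodeUnitDemo
{
    // Number of bytes the string occupies when serialized as UTF-8.
    public static int Utf8ByteCount(string s)
    {
        return Encoding.UTF8.GetByteCount(s);
    }

    // Number of bytes the string occupies when serialized as UTF-16.
    public static int Utf16ByteCount(string s)
    {
        return Encoding.Unicode.GetByteCount(s);
    }

    public static void Main()
    {
        string s = "a\u0431"; // Latin 'a' followed by Cyrillic U+0431

        Console.WriteLine(s.Length);          // 2
        Console.WriteLine(Utf8ByteCount(s));  // 3
        Console.WriteLine(Utf16ByteCount(s)); // 4
    }
}
```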

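To understand what you need to do, you need to specify how your data comes
into and goes out of your app. If it comes in as byte arrays, then what you
have is this:

byte[] utf8Input = .....;
// Decode the UTF-8 bytes into a string.
string inputString = System.Text.Encoding.UTF8.GetString(utf8Input);
// Encode the string as UTF-16. Encoding.Unicode is little-endian; use
// BigEndianUnicode if the receiver expects big-endian byte order.
byte[] utf16Output = System.Text.Encoding.Unicode.GetBytes(inputString);
OutputHex(utf16Output);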
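
For completeness, here is one way the whole pipeline might look, including a stand-in for OutputHex. This is a sketch under assumptions, not the gateway's documented procedure: the names SmsHex and Utf8BytesToUtf16Hex are invented, and big-endian byte order is assumed because the sample in the question (0431 04AF 0442 ...) reads as Cyrillic code points when taken as big-endian UTF-16 code units.

```csharp
using System;
using System.Text;

public static class SmsHex
{
    // Hypothetical helper: decode UTF-8 input bytes into a string,
    // re-encode it as big-endian UTF-16, and write each byte as two hex digits.
    public static string Utf8BytesToUtf16Hex(byte[] utf8Input)
    {
        string text = Encoding.UTF8.GetString(utf8Input);
        byte[] utf16 = Encoding.BigEndianUnicode.GetBytes(text);

        StringBuilder hex = new StringBuilder(utf16.Length * 2);
        foreach (byte b in utf16)
        {
            hex.Append(b.ToString("X2"));
        }
        return hex.ToString();
    }

    public static void Main()
    {
        // UTF-8 encoding of U+0431 U+04AF U+0442 U+044D.
        byte[] utf8Input = { 0xD0, 0xB1, 0xD2, 0xAF, 0xD1, 0x82, 0xD1, 0x8D };
        Console.WriteLine(Utf8BytesToUtf16Hex(utf8Input)); // 043104AF0442044D
    }
}
```

If the text already arrives as a System.String (for example from an ASP.NET form field), the first decoding step is unnecessary; only the GetBytes call and the hex formatting are needed.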

