.NET Forum / .NET Framework / New Users / December 2004
Object serialization and NetworkStream - extraneous characters in output
jwallison - 08 Dec 2004 20:32 GMT
TcpClient client = new TcpClient(AddressFamily.InterNetwork);
client.SendTimeout = mSvcConfig.Data.SvcTimeout; // 1000
client.Connect(mSvcConfig.Data.SvcAddress, mSvcConfig.Data.SvcPort); // "localhost", 7024
NetworkStream stream = client.GetStream();

XmlSerializer outserializer = new XmlSerializer(typeof(LinkMessage)); // my data object, all string/int data
XmlTextWriter tw = new XmlTextWriter(stream, Encoding.UTF8);

outserializer.Serialize(tw, mMsg); // ref to my LinkMessage data instance

stream.Flush();
client.Close();
Produces the following output when written via the TcpClient stream (note extraneous "o;?" at beginning of message):
o;?<?xml version="1.0" encoding="utf-8"?><LinkMessage xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><MessageType>Anchor</MessageType><InnerText>Client Side ImageMap</InnerText><Href>http://www.he.net/~seidel/Map/clientmap.html</Href><ImageSrc /></LinkMessage>
but produces the same output, sans garbage, when the same code writes to an XmlTextWriter based on a disk file (i.e., changing only the stream type to a NetworkStream seems to result in the spurious "added" output).
If the encoding is changed to Encoding.Unicode, different garbage (?~) appears before the actual message. With Encoding.ASCII there is no garbage, but the emitted XML then has the wrong encoding.
What can I do to eliminate this leading junk at the beginning of my messages? The Java app that is a target for this socket communication can't handle it...
TIA
Regards,
Jim Allison jwallison.1@bellsouth.net (de-mung by removing '.1') JimAllison@newsgroup.nospam
Steven Cheng[MSFT] - 13 Dec 2004 04:16 GMT Hi Jim ,
Thanks for your posting. From your description, you're using .NET's XmlSerializer to serialize a class instance out to a NetworkStream, and on the other side, when you retrieve the stream and read the XML content out, you find an additional header "o;?" at the beginning of the XML stream, yes?
As for the problem you mentioned, I think it is likely an encoding issue. A Unicode text stream starts with a header that indicates the stream's encoding type. "o;?" is the one for UTF-8, and with other encodings such as UTF-16 you will get other values (an ASCII stream has no such header). To verify this, you can open a Unicode (UTF-8) text file in an editor such as UltraEdit and view it in hex format. You'll find the header, which is composed of the three bytes 239, 187, 191 (0xEF 0xBB 0xBF), and they display as "o;?" if you print them as an ASCII string. For example:
byte[] bytes = {239, 187, 191};
MessageBox.Show(System.Text.Encoding.ASCII.GetString(bytes));
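The header bytes for each encoding can also be listed directly. The following is a minimal sketch, not part of the original post; the class and method names (PreambleDemo, DescribePreamble) are made up for illustration. It only relies on Encoding.GetPreamble() and BitConverter.ToString().

```csharp
using System;
using System.Text;

public static class PreambleDemo
{
    // Formats an encoding's preamble (byte order mark) as hex,
    // or "(none)" when the encoding emits no preamble.
    public static string DescribePreamble(Encoding encoding)
    {
        byte[] preamble = encoding.GetPreamble();
        return preamble.Length == 0 ? "(none)" : BitConverter.ToString(preamble);
    }

    public static void Main()
    {
        Console.WriteLine("UTF8:    " + DescribePreamble(Encoding.UTF8));    // EF-BB-BF
        Console.WriteLine("Unicode: " + DescribePreamble(Encoding.Unicode)); // FF-FE
        Console.WriteLine("ASCII:   " + DescribePreamble(Encoding.ASCII));   // (none)
    }
}
```

This matches the three cases reported in the question: a three-byte prefix for UTF-8, a different prefix for Encoding.Unicode, and nothing for ASCII.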
So, when you use XmlSerializer to serialize an object into a stream with a Unicode encoding (UTF-8, for instance), the header is added as the first three bytes. However, if you read the XML back from the stream via UTF-8 encoding, you won't get these three bytes: the UTF-8 decoding will automatically remove the header and return the bytes that follow it. Here is a simple code snippet to show this:
===============================
byte[] buffer = null;

XmlSerializer serializer = new XmlSerializer(typeof(userInfo));
userInfo ui = new userInfo();
ui.userName = "steven cheng";
ui.age = 20;
ui.email = "steven@microsoft.com";

MemoryStream ms = new MemoryStream();
StreamWriter sw = new StreamWriter(ms, System.Text.Encoding.UTF8);

serializer.Serialize(sw, ui);
sw.Flush(); // make sure everything has reached the MemoryStream

// ToArray() returns only the bytes written; GetBuffer() would also include unused capacity
buffer = ms.ToArray();

// will return the xml with "o;?" because we use ASCII to decode the bytes, which is incorrect
MessageBox.Show(System.Text.Encoding.ASCII.GetString(buffer));

// won't display the "o;?" since UTF-8 (the correct encoding) will bypass it
MessageBox.Show(System.Text.Encoding.UTF8.GetString(buffer));
==================================
So, if the problem occurs in the Java client that receives this stream, I suggest you check the Java code to see whether it reads the stream and converts the bytes to a string using the correct encoding (UTF-8). I suspect that it is using the default ASCII encoding to read the bytes, so that the "o;?" comes out.
Please have a look at the above, and if anything is unclear, please feel free to post here. HTH.
Regards,
Steven Cheng Microsoft Online Support
Get Secure! www.microsoft.com/security (This posting is provided "AS IS", with no warranties, and confers no rights.)
jwallison - 16 Dec 2004 15:04 GMT My .Net socket test client WAS erroneously using Encoding.ASCII (ah, the joys of midnight testing!), changing that to UTF8 produces the same result that the Java developer is reporting - a "?" is received at the beginning of every deserialized message on the socket.
So, the "o;" is the encoding information on the packet, but the "?" is extraneous.
What is the source of the extraneous character, and can it/should it be eliminated? I seem to recall something like this from the days of DOS - is it just an artifact of Socket communications in general?
Steven Cheng[MSFT] - 17 Dec 2004 09:14 GMT Hi Jwallison,
Thanks for your followup. I've just done some tests between .NET and Java: writing a file in .NET and reading it in Java, and writing in Java and reading in .NET (via UTF-8). First, I reproduced the problem you mentioned: when reading the .NET output into a Java stream via UTF-8, an additional "?" occurs. But if I use Java to write out a UTF-8 encoded XML file, I can load it correctly in .NET.
I think the file output differs somehow between .NET and Java. I'll do some research and further testing to check this, and I'll update you when I have more information. Also, if you find anything in the meantime, please feel free to post here. Thanks.
Regards,
Steven Cheng Microsoft Online Support
jwallison - 17 Dec 2004 19:32 GMT It doesn't just happen in Java, it ALSO happens with .Net -
private const int portNum = 7024;

public static int Main(String[] args)
{
    bool done = false;

    IPAddress localAddr = IPAddress.Parse("127.0.0.1");
    TcpListener listener = new TcpListener(localAddr, portNum);
    listener.Start();

    while (!done)
    {
        Console.Write("\nWaiting for connection...");
        TcpClient client = listener.AcceptTcpClient();

        Console.WriteLine("Connection accepted.");
        NetworkStream ns = client.GetStream();

        try
        {
            byte[] bytes = new byte[2048];
            int bytesRead;

            // decode each chunk of raw bytes as UTF-8 and print it
            while ((bytesRead = ns.Read(bytes, 0, bytes.Length)) > 0)
                Console.WriteLine(Encoding.UTF8.GetString(bytes, 0, bytesRead));

            ns.Close();
            client.Close();
        }
        catch (Exception e)
        {
            Console.WriteLine(e.ToString());
        }
    }

    listener.Stop();
    return 0;
}
returns output identical to the Java client (with a leading "?").
Steven Cheng[MSFT] - 20 Dec 2004 06:54 GMT Hi Jwallison,
Thanks for your followup. OK, I've just run some tests in a console application and I see the "?" you mentioned, which I didn't see in a Windows application (via MessageBox) when using Encoding.UTF8.GetString().
However, as I mentioned in the first message, this is still caused by the byte order mark (BOM), which is inserted at the head of a stream that contains Unicode text. The "?" is just the BOM of UTF-8, the three-byte header {239,187,191}.
We can also verify this with the following:
// get the UTF-8 BOM
byte[] bom = System.Text.Encoding.UTF8.GetPreamble();
Console.WriteLine(System.Text.Encoding.UTF8.GetString(bom));
and you can see the "?" you mentioned.
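To see what that "?" actually is, the decoded character can be inspected instead of printed. This is a small sketch with made-up names (BomCharDemo, DecodePreamble). It shows that Encoding.UTF8.GetString() does not drop the BOM but turns the three bytes into the single character U+FEFF. That the console prints it as "?" because its code page cannot represent that character is an assumption, not something the thread confirms.

```csharp
using System;
using System.Text;

public static class BomCharDemo
{
    // Decodes the UTF-8 preamble bytes as UTF-8 text.
    public static string DecodePreamble()
    {
        byte[] bom = Encoding.UTF8.GetPreamble();
        return Encoding.UTF8.GetString(bom);
    }

    public static void Main()
    {
        string decoded = DecodePreamble();
        Console.WriteLine("Length: " + decoded.Length);                          // Length: 1
        Console.WriteLine("Code point: U+" + ((int)decoded[0]).ToString("X4")); // Code point: U+FEFF
    }
}
```

A MessageBox renders U+FEFF as nothing visible, which would explain why the character went unnoticed in the Windows application test.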
In .NET, a StreamWriter created with a Unicode encoding such as Encoding.UTF8 adds this BOM to the output byte stream, but when we use a StreamReader (rather than the raw stream, such as a FileStream or NetworkStream) to read it back, we won't be affected by the BOM: the StreamReader automatically detects it and processes it for us. So when we want to retrieve Unicode text from a stream, we need to wrap the raw stream in a StreamReader (with the correct encoding type), for example:
================================
while (!done)
{
    Console.Write("\nWaiting for connection...");
    TcpClient client = listener.AcceptTcpClient();

    Console.WriteLine("Connection accepted.");
    NetworkStream ns = client.GetStream();

    // wrap the raw stream so that the BOM is detected and consumed
    StreamReader sr = new StreamReader(ns, System.Text.Encoding.UTF8);
    try
    {
        Console.WriteLine(sr.ReadToEnd());

        ns.Close();
        client.Close();
    }
    catch (Exception e)
    {
        Console.WriteLine(e.ToString());
    }
}
================================
This ensures that the raw stream is read in via the appropriate encoding.
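The difference between decoding the raw bytes and reading through a StreamReader can be checked without a socket. The sketch below is an illustration under assumptions: it uses a MemoryStream in place of the NetworkStream, and the names ReaderDemo, WithBom and ReadWithStreamReader are hypothetical helpers.

```csharp
using System;
using System.IO;
using System.Text;

public static class ReaderDemo
{
    // Builds a byte array consisting of the UTF-8 BOM followed by the encoded text.
    public static byte[] WithBom(string text)
    {
        byte[] bom = Encoding.UTF8.GetPreamble();
        byte[] body = Encoding.UTF8.GetBytes(text);
        byte[] all = new byte[bom.Length + body.Length];
        Buffer.BlockCopy(bom, 0, all, 0, bom.Length);
        Buffer.BlockCopy(body, 0, all, bom.Length, body.Length);
        return all;
    }

    // Reads the bytes through a StreamReader, which consumes a leading BOM.
    public static string ReadWithStreamReader(byte[] data)
    {
        using (MemoryStream ms = new MemoryStream(data))
        using (StreamReader reader = new StreamReader(ms, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        byte[] data = WithBom("<a/>");

        string raw = Encoding.UTF8.GetString(data);   // keeps U+FEFF as the first char
        string viaReader = ReadWithStreamReader(data); // BOM already consumed

        Console.WriteLine(raw.Length);       // 5
        Console.WriteLine(viaReader.Length); // 4
    }
}
```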
In addition, based on my tests of Java IO, a Java Reader (with a specific encoding) won't detect such a BOM, and its Writer class won't output a BOM either. So the problem will still occur when you use a Java IO Reader to read the Unicode text stream output by .NET. From my search, I see that some people manually inspect the first 4 bytes to check for a BOM when reading a Unicode text stream in Java, but I haven't found any built-in means like the StreamReader in .NET.
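An alternative that the thread does not mention is to stop the .NET sender from writing the BOM at all, by passing a UTF8Encoding constructed with encoderShouldEmitUTF8Identifier set to false instead of Encoding.UTF8. The following is a hedged sketch only: it writes to a MemoryStream rather than a NetworkStream, and the LinkMessage class here is a hypothetical two-field stand-in for the poster's real type.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical stand-in for the poster's LinkMessage class.
public class LinkMessage
{
    public string MessageType;
    public string InnerText;
}

public static class NoBomDemo
{
    // Serializes the message through an XmlTextWriter using the given encoding
    // and returns the exact bytes that were written to the stream.
    public static byte[] Serialize(LinkMessage msg, Encoding encoding)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(LinkMessage));
        using (MemoryStream stream = new MemoryStream())
        {
            XmlTextWriter writer = new XmlTextWriter(stream, encoding);
            serializer.Serialize(writer, msg);
            writer.Flush();
            return stream.ToArray();
        }
    }

    public static void Main()
    {
        LinkMessage msg = new LinkMessage();
        msg.MessageType = "Anchor";
        msg.InnerText = "Client Side ImageMap";

        byte[] withBom = Serialize(msg, Encoding.UTF8);
        byte[] withoutBom = Serialize(msg, new UTF8Encoding(false)); // false: no BOM

        Console.WriteLine(BitConverter.ToString(withBom, 0, 3)); // EF-BB-BF
        Console.WriteLine((char)withoutBom[0]);                  // <
    }
}
```

The XML declaration still says encoding="utf-8" in both cases; only the three leading bytes differ, so a Java Reader configured for UTF-8 should see the document start at "<".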
Please have a look at the above, and if anything is unclear, please feel free to post here. Thanks.
Regards,
Steven Cheng Microsoft Online Support