.NET Forum / Windows Forms / WinForm General / August 2005
Character encoding problems reading Html from Clipboard
Tim_Mac - 26 Aug 2005 19:41 GMT hi, i am accessing some html (originating from MS Word) in the clipboard in my winforms app. i catch it before the paste, clean up the html, set the clipboard with the cleaned Html, and then paste.
here is my (simplified) code: string html = Clipboard.GetDataObject().GetData(DataFormats.Html).ToString();
the problem is that the html string gets lots of substituted strange characters, for example: a dash - character from the word document gets converted into â€“, a line break gets converted into Â, and an apostrophe gets converted into â€˜
this doesn't happen when i just paste as normal into my html editor. the characters import normally.
is there a way to read from the clipboard without screwing up the characters? i tried Encoding.ASCII.GetString() but it needs a byte[], which i don't know how to get from the DataObject.
many thanks for any help. tim
Michael Phillips, Jr. - 26 Aug 2005 21:09 GMT You need to use the UTF8Encoding class.
string html = Clipboard.GetDataObject().GetData(DataFormats.Html).ToString();
// Create a UTF-8 encoding.
UTF8Encoding utf8 = new UTF8Encoding();
// Get the encoded html string.
byte[] encodedBytes = utf8.GetBytes(html);
// Decode bytes back to string.
String decodedString = utf8.GetString(encodedBytes);
Console.WriteLine();
Console.WriteLine("Decoded bytes:");
Console.WriteLine(decodedString);
Jon Skeet [C# MVP] - 26 Aug 2005 21:54 GMT
> You need to use the UTF8Encoding class.
> [quoted text clipped - 12 lines]
> Console.WriteLine("Decoded bytes:");
> Console.WriteLine(decodedString);
I can't see how that would help - it's just encoding and decoding with the same encoding. As UTF-8 can encode any string, I can't envisage any situation where html wouldn't be equal to decodedString - could you give an example of such a situation?
 Signature Jon Skeet - <skeet@pobox.com> http://www.pobox.com/~skeet If replying to the group, please do not mail me too
Michael Phillips, Jr. - 26 Aug 2005 23:04 GMT You are correct. I thought erroneously that there was some type of problem with the character encoding.
My code snippet certainly doesn't solve anything.
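Jon's objection can be checked directly. The following is a minimal sketch, not code from the thread: `RoundTripCheck` and `SurvivesUtf8RoundTrip` are hypothetical names, and the check assumes the input is a well-formed string (no unpaired surrogates).

```csharp
using System.Text;

// Hypothetical helper (not from the thread) illustrating why encoding a string
// to UTF-8 and decoding it again with the same encoding cannot repair anything.
public static class RoundTripCheck
{
    public static bool SurvivesUtf8RoundTrip(string text)
    {
        UTF8Encoding utf8 = new UTF8Encoding();

        // Encode the string to UTF-8 bytes...
        byte[] encoded = utf8.GetBytes(text);

        // ...and decode those same bytes back with the same encoding.
        string decoded = utf8.GetString(encoded);

        // For well-formed input the result is identical to the input, so
        // characters that were already garbled stay garbled.
        return decoded == text;
    }
}
```

Calling `RoundTripCheck.SurvivesUtf8RoundTrip(html)` on the garbled clipboard string should return true, which is why the snippet above leaves the text unchanged.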
"Jeffrey Tan[MSFT]" - 29 Aug 2005 12:00 GMT Hi Tim_Mac,
Thanks for your post.
Yes, I can reproduce your issue on my side. It seems that this issue only occurs for localized characters, not for standard English characters.
Also, this issue only occurs with DataFormats.Html, but not for DataFormats.Text etc..
Then, after doing some research, I found that this issue is documented in our internal database as a known issue. This is not a WinForms-side problem. When asked for the HTML format, GetData returns an ANSI string, which obviously does not have enough information to render Chinese script. Currently, I cannot think of a better workaround for this issue.
Hope this helps.
Best regards, Jeffrey Tan Microsoft Online Partner Support
 Signature Get Secure! - www.microsoft.com/security This posting is provided "as is" with no warranties and confers no rights.
Tim_Mac - 29 Aug 2005 12:50 GMT hi Jeffrey, many thanks for the reply. the word document i've been testing with doesn't have localised characters to my knowledge. to reproduce my situation, create a blank MS word document and insert 2 apostrophes and 2 double-quotes. you'll see Word changes them to the open/close versions of the characters. i also added the ` character and the ellipsis ... character (auto-corrected by word 2003 when you type 3 period characters followed by a space). i then copy this and paste it into my application. when i debug, this is the fragment i get via clip.GetData(DataFormats.Html).ToString();
<!--StartFragment-->\r\n\r\n<p class=MsoNormal>âââ â </p>\r\n\r\n<p class=MsoNormal>`⦠</p>\r\n\r\n<!--EndFragment-->
as you can see, there are garbage characters in the middle corresponding to the characters in the word doc.
interestingly, when i paste the content into WordPad, it preserves the open/close quote characters etc., but when i then copy and paste from WordPad, the html string is read correctly in my application. the open/close apostrophes get demoted back to the normal apostrophe character, and the ellipsis character gets demoted back to 3 period characters.
what's a little bit annoying is that this problem only arose when i attempted to intercept the html in the clipboard before it is pasted. i'm using the Comzept HtmlEditor control for win-forms (a wrapper for MSHTML), and it has its own Paste() method, which does not produce the character problems i am experiencing. i presume it just calls the MSHTML Paste() method.
looking forward to your reply tim
Tim_Mac - 29 Aug 2005 17:05 GMT p.s. you can download my word doc here http://tim.mackey.ie/stuff/html_char_encoding.doc
"Jeffrey Tan[MSFT]" - 30 Aug 2005 08:39 GMT Hi Tim_Mac,
Thanks for your feedback.
Yes, I just tested '-' in English, which has no problem. However, with the '"' character, I can reproduce this problem on my side.
After doing some further research, I found that this issue only occurs with the Word application. If we copy '"' characters from IE, the WinForms application gets the characters without any problem. Even with Excel, it retrieves them correctly. So it seems that this issue is on the Word application side.
Because the WinForms Clipboard class is just a wrapper around the underlying Windows clipboard operations, it seems there is little that can be done on the WinForms side.
Best regards, Jeffrey Tan Microsoft Online Partner Support
Tim_Mac - 30 Aug 2005 11:06 GMT hi Jeffrey, thanks again for the reply. i wonder if Microsoft would consider posting a list of the affected characters and their distorted equivalents in the clipboard after being converted into ANSI? this would allow other applications to work around the problem and map the characters back to their proper equivalents.
so far i can identify the following mappings:
â€˜ open single quote
â€™ close single quote
â€œ open double quote
â€ close double quote
â€¦ ellipsis
Â two space characters (as used by some formatting conventions) after a period
thanks tim
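Garbling of this kind, where curly quotes and ellipses turn into short sequences beginning with â, is what results when UTF-8 bytes are decoded with the Windows-1252 ANSI code page. Assuming that is what happens here (the thread does not state which ANSI code page is involved), a mapping table may not be needed: the damage can often be reversed by re-encoding the string with the same code page and decoding the bytes as UTF-8. The sketch below is written against modern .NET (.NET Core 3.0 or later), where code page 1252 has to be registered first; on the .NET Framework of the thread's era the `RegisterProvider` line would be dropped. `ClipboardHtmlRepair` and its methods are hypothetical names.

```csharp
using System.Text;

// Hypothetical helper (not from the thread). Assumes the garbled text was
// produced by decoding UTF-8 bytes with Windows code page 1252.
public static class ClipboardHtmlRepair
{
    private static readonly Encoding Ansi1252 = CreateAnsi();

    private static Encoding CreateAnsi()
    {
        // Needed on .NET Core / .NET 5+ only; .NET Framework ships code page 1252.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1252);
    }

    // Reproduces the garbling: UTF-8 bytes misread as ANSI text.
    public static string SimulateAnsiDecoding(string original)
    {
        return Ansi1252.GetString(Encoding.UTF8.GetBytes(original));
    }

    // Reverses it: recover the raw bytes, then decode them as UTF-8.
    public static string UndoAnsiDecoding(string garbled)
    {
        return Encoding.UTF8.GetString(Ansi1252.GetBytes(garbled));
    }
}
```

For example, `ClipboardHtmlRepair.UndoAnsiDecoding("\u00E2\u20AC\u201C")` (the three characters â, €, and a left curly double quote) yields an en dash. This only makes sense for text that really was garbled this way; applying it to a correctly decoded string containing non-ASCII characters would corrupt it.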
"Jeffrey Tan[MSFT]" - 31 Aug 2005 02:42 GMT Hi Tim,
Thanks for your post.
Yes, after doing some more research into this issue, I found that it does seem to be a WinForms problem. I created a Win32 application that uses the Win32 API to get the clipboard CF_HTML format, and it gets the data without garbled text. I then converted this Win32 code into managed code with P/Invoke:
// Requires: using System.Runtime.InteropServices;

[DllImport("user32.dll", SetLastError=true)]
static extern IntPtr GetClipboardData(uint uFormat);
[DllImport("user32.dll", SetLastError=true)]
static extern bool OpenClipboard(IntPtr hWndNewOwner);
[DllImport("user32.dll", SetLastError=true)]
static extern bool CloseClipboard();
[DllImport("user32.dll", SetLastError=true)]
static extern uint RegisterClipboardFormatA(string lpszFormat);
[DllImport("user32.dll", SetLastError=true)]
static extern bool IsClipboardFormatAvailable(uint format);
[DllImport("kernel32.dll", SetLastError=true)]
static extern IntPtr GlobalLock(IntPtr hMem);
// GlobalSize returns SIZE_T, which is pointer-sized.
[DllImport("kernel32.dll", SetLastError=true)]
static extern UIntPtr GlobalSize(IntPtr hMem);
// GlobalUnlock returns BOOL.
[DllImport("kernel32.dll", SetLastError=true)]
static extern bool GlobalUnlock(IntPtr hMem);

private void button1_Click(object sender, System.EventArgs e)
{
    uint CF_HTML = RegisterClipboardFormatA("HTML Format");
    if (!IsClipboardFormatAvailable(CF_HTML)) return;
    if (!OpenClipboard(this.Handle)) return;
    try
    {
        IntPtr hGMem = GetClipboardData(CF_HTML);
        if (hGMem == IntPtr.Zero) return;
        IntPtr pMFP = GlobalLock(hGMem);
        if (pMFP == IntPtr.Zero) return;
        try
        {
            // Copy the raw clipboard bytes into a managed array.
            int len = (int)GlobalSize(hGMem).ToUInt32();
            byte[] bytes = new byte[len];
            Marshal.Copy(pMFP, bytes, 0, len);

            // Decode the bytes as UTF-8 and drop any trailing null padding.
            string strMFP = System.Text.Encoding.UTF8.GetString(bytes).TrimEnd('\0');
            this.textBox1.Text = strMFP;
        }
        finally
        {
            GlobalUnlock(hGMem);
        }
    }
    finally
    {
        // Always release the clipboard, even if reading failed.
        CloseClipboard();
    }
}
This works well on my side. Hope this helps.
Thank you for your patience and cooperation. If you have any questions or concerns, please feel free to post them in the group. I am standing by to be of assistance.
Best regards, Jeffrey Tan Microsoft Online Partner Support
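The bytes read this way are the whole CF_HTML payload, which starts with a short text header (Version, StartHTML, EndHTML, StartFragment, EndFragment) whose values are byte offsets into the data, not character offsets. For cleaning up only the pasted fragment, a sketch along the following lines might be useful. It is not part of Jeffrey's code: `CfHtmlFragment` and `Extract` are hypothetical names, and the sketch assumes the header is present and its offsets are valid.

```csharp
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper (not from the thread): pulls the fragment out of a raw
// CF_HTML payload using the StartFragment/EndFragment byte offsets in its header.
public static class CfHtmlFragment
{
    public static string Extract(byte[] cfHtml)
    {
        // The header is plain ASCII, so decoding the payload as UTF-8 is
        // sufficient for locating the offset fields.
        string text = Encoding.UTF8.GetString(cfHtml);

        Match start = Regex.Match(text, @"StartFragment:(\d+)");
        Match end = Regex.Match(text, @"EndFragment:(\d+)");
        if (!start.Success || !end.Success) return null;

        int startOffset, endOffset;
        if (!int.TryParse(start.Groups[1].Value, out startOffset)) return null;
        if (!int.TryParse(end.Groups[1].Value, out endOffset)) return null;
        if (startOffset < 0 || endOffset < startOffset || endOffset > cfHtml.Length) return null;

        // The offsets count bytes, so slice the byte array rather than the
        // decoded string; multi-byte characters would shift string indexes.
        return Encoding.UTF8.GetString(cfHtml, startOffset, endOffset - startOffset);
    }
}
```

With the `bytes` array from the button handler, `CfHtmlFragment.Extract(bytes)` would return the markup between the fragment markers, or null when the header cannot be parsed.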
Tim_Mac - 31 Aug 2005 09:29 GMT hi Jeffrey, that's excellent, it works well so far on my side also. not being a COM expert, i'm a little bit wary of relying on the user32 or kernel dlls. will this work on any flavours of windows 2000 and XP with all the different service packs, IE versions etc? many thanks for this solution. tim
"Jeffrey Tan[MSFT]" - 31 Aug 2005 10:24 GMT Hi Tim,
I am glad my reply makes sense to you.
Yes, I think it will not break on any Win32 version of the OS. Because we are just using the Win32 API, which is guaranteed to behave consistently on all Win32 operating systems, our solution should be safe.
Thanks
Best regards, Jeffrey Tan Microsoft Online Partner Support
Jon Skeet [C# MVP] - 29 Aug 2005 19:30 GMT
> many thanks for the reply. the word document i've been testing it off
> doesn't have localised characters to my knowledge.
But it *does* have characters which aren't in the ANSI code page. That's what Jeffrey meant by "not for standard English characters", I believe.