.NET Forum / .NET Framework / XML / June 2007

XmlTextWriter Encodes HTML Entities?

clintonG - 28 May 2007 21:28 GMT
Can anybody make sense of these crazy and inconsistent results?

// IE7 Feed Reading View disabled displays this raw XML
<?xml version="1.0" encoding="utf-8" ?>
<!-- AT&T HTML entities & XML <elements> are displayed -->
<rss version="2.0">
<channel>
<title>AT&T HTML entities & XML <elements> are displayed</title>
...
<description>
<![CDATA[ AT&T HTML entities & XML <elements> using CDATA  ]]>
</description>
...

The XML comment data comes directly from the TextBox on the Form
as text. The XmlTextWriter writer.WriteElementString("title", title)
generates the <title> element and writer.WriteCData(description) generates
the <description> element.

// Drag the testRSS.xml file into NotePad displays
<?xml version="1.0" encoding="utf-8"?>
<!--AT&T HTML entities & XML <elements> are displayed-->
<rss version="2.0">
<channel>
<title>AT&amp;T HTML entities &amp; XML &lt;elements&gt; are
displayed</title>
<description><![CDATA[AT&T HTML entities & XML <elements> using
CDATA]]></description>

// Enable IE7 Feed Reading View and observe that IE7
// either violates XML by encoding HTML entities and XML elements
// or encodes unencoded XML data for display of RSS
AT&T HTML entities & XML <elements> are displayed

It's bad enough that IE7 is likely still a sloppy parser and will violate XML
validity rules by encoding unencoded feed data, which really makes life all
FUBAR for an application developer. But worse yet, what is encoding the HTML
entities and the XML element in the <title> element when the testRSS.xml file
is dragged into NotePad?

Does the XmlTextWriter encode HTML and XML? How does the data in the
<title> element in the file end up encoded?

<%= Clinton Gallagher
        NET csgallagher AT metromilwaukee.com
        URL http://clintongallagher.metromilwaukee.com/
Martin Honnen - 29 May 2007 12:48 GMT
> Does the XmlTextWriter encode HTML and XML? How does the data in the
> <title> element in the file end up encoded?

With XmlWriter or XmlTextWriter you can ensure that your XML
markup is well-formed, as methods like WriteElementString make sure that
'&' is escaped as '&amp;' and '<' is escaped as '&lt;', so for example
  xmlWriter.WriteElementString("title",
    "AT & T, <element>content</element>");
yields
<title>AT &amp; T, &lt;element&gt;content&lt;/element&gt;</title>

That has nothing to do with HTML or HTML entities, rather XML defines
entities like amp or gt or lt itself.

If you wanted that 'title' element to have a child 'element' then you
need to use
  xmlWriter.WriteStartElement("title");
  xmlWriter.WriteString("AT & T");
  xmlWriter.WriteElementString("element", "content");
  xmlWriter.WriteEndElement();
which yields

<title>AT &amp; T<element>content</element></title>
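
A minimal sketch of the three cases from the original post (comment, element text, CDATA section), assuming an XmlTextWriter over a StringWriter. The class name EscapingSketch and the method WriteSample are made up for illustration and are not part of System.Xml.

```csharp
using System;
using System.IO;
using System.Xml;

public static class EscapingSketch
{
    // Writes the same sample text three ways so the results can be compared.
    public static string WriteSample(string text)
    {
        StringWriter buffer = new StringWriter();
        XmlTextWriter writer = new XmlTextWriter(buffer);

        writer.WriteStartElement("channel");
        writer.WriteComment(text);                // comment: text is written as-is
        writer.WriteElementString("title", text); // text node: & < > are escaped
        writer.WriteStartElement("description");
        writer.WriteCData(text);                  // CDATA section: text is written as-is
        writer.WriteEndElement();                 // closes <description>
        writer.WriteEndElement();                 // closes <channel>
        writer.Flush();

        return buffer.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(WriteSample("AT&T & XML <elements>"));
    }
}
```

Only the <title> text node should come out with &amp;, &lt; and &gt;, which matches what NotePad shows for the saved file.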

Signature

    Martin Honnen --- MVP XML
    http://JavaScript.FAQTs.com/

clintonG - 29 May 2007 17:27 GMT
Thanks for confirming that the XmlTextWriter methods escape and encode
specific text characters as HTML character entities. The HTML character
entity naming conventions you attempt to clarify are defined by W3C (24.4.1
The list of characters, Special characters for HTML [1]). My question should
have asked whether the methods escape and encode "text characters" as HTML
entities. Nitpicker ;-)

Anyhow, I didn't see the MSDN documentation make note of this inherent
feature of the class; the escaping and encoding behavior is not explicitly
documented on any page I have read so far. There is a pithy comment within
the narrative of the "Writing XML with the XmlWriter" document [2] but the
narrative is poorly written and easily misunderstood.

<%= Clinton Gallagher

[1] http://www.w3.org/TR/html401/sgml/entities.html
[2] http://msdn2.microsoft.com/en-us/library/4d1k42hb(VS.80).aspx

>> Does the XmlTextWriter encode HTML and XML? How does the data in the
>> <title> element in the file end up encoded?
[quoted text clipped - 19 lines]
>
> <title>AT &amp; T<element>content</element></title>
Martin Honnen - 29 May 2007 17:36 GMT
> Thanks for confirming that the XmlTextWriter methods escape and encode
> specific text characters as HTML character entities. The HTML character
> entity naming conventions you attempt to clarify are defined by W3C (24.4.1
> The list of characters, Special characters for HTML [1]). My question should
> have asked whether the methods escape and encode "text characters" as HTML
> entities. Nitpicker ;-)

XML defines its own entities and what XmlWriter does is based on the XML
specification and _not_ on the HTML specification.
See <http://www.w3.org/TR/REC-xml/#sec-predefined-ent>.

Signature

    Martin Honnen --- MVP XML
    http://JavaScript.FAQTs.com/

clintonG - 29 May 2007 19:43 GMT
I kept following links and finally found the arcane documentation: the
XmlWriterSettings.CheckCharacters property [1]. So it seems to me ASP.NET
developers don't have to fool around with regular expressions to validate
and replace text characters that would be illegal when the document is saved
as XML, e.g. RSS feeds.
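
A hedged illustration of what that setting appears to cover: as far as I can tell, CheckCharacters concerns characters outside the legal XML range (most control characters, for example), which is a separate matter from the escaping of & and < discussed earlier in the thread. The sketch assumes a writer created with XmlWriter.Create; the names CheckCharactersSketch and IsRejected are invented for the example.

```csharp
using System;
using System.Text;
using System.Xml;

public static class CheckCharactersSketch
{
    // Returns true if the writer refuses to write the given text.
    public static bool IsRejected(string text, bool checkCharacters)
    {
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.CheckCharacters = checkCharacters;
        settings.OmitXmlDeclaration = true;

        StringBuilder output = new StringBuilder();
        try
        {
            using (XmlWriter writer = XmlWriter.Create(output, settings))
            {
                writer.WriteElementString("title", text);
            }
            return false;
        }
        catch (ArgumentException)
        {
            // Raised for characters that are not legal in XML 1.0.
            return true;
        }
    }

    public static void Main()
    {
        Console.WriteLine(IsRejected("AT&T", true));          // '&' is legal, just escaped
        Console.WriteLine(IsRejected("bad\u0001char", true)); // U+0001 is not a legal XML character
    }
}
```

So the setting guards against characters that can never appear in an XML document, while '&' and '<' in text are handled by the writer's normal escaping either way.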

I understand what W3C documents say but XML and HTML derive from SGML and
there are some semantic ambiguities in this context in the W3C documents.
Most of us and most documentation including W3C documentation define &amp;
as an HTML character entity. When we get to the W3C page(s) for XML they
drop the verbiage "HTML" when describing character entities.

As I'm sure you'll have to agree reading the EBNF, the DTDs indicate we're
talking about the same thing using context specific nomenclature.
So we really don't need to quibble about semantics. All I want to do is
write code that will generate valid XML RSS feeds that will be parsed by the
greatest number of aggregators which in itself requires a personal
relationship with all the blessings of Heaven because everybody has been so
FUBAR in their respective implementations.

<%= Clinton Gallagher

[1] http://msdn2.microsoft.com/en-us/library/system.xml.xmlwritersettings.checkcharacters(VS.80).aspx


>> Thanks for confirming that the XmlTextWriter methods escape and encode
>> specific text characters as HTML character entities. The HTML character
[quoted text clipped - 6 lines]
> specification and _not_ on the HTML specification.
> See <http://www.w3.org/TR/REC-xml/#sec-predefined-ent>.
Bjoern Hoehrmann - 29 May 2007 20:18 GMT
* clintonG wrote in microsoft.public.dotnet.xml:
>I understand what W3C documents say but XML and HTML derive from SGML and
>there are some semantic ambiguities in this context in the W3C documents.
>Most of us and most documentation including W3C documentation define &amp;
>as an HTML character entity. When we get to the W3C page(s) for XML they
>drop the verbiage "HTML" when describing character entities.

It would be very confusing otherwise. As an example, &apos; is valid in
XML but not part of HTML, while &ouml; is part of HTML but not of XML;
so if you speak about the pre-defined entities in XML you refer to five,
if you speak about those in HTML you refer to hundreds of them.
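
The difference can be checked with any XML parser. A minimal sketch, assuming XmlDocument.LoadXml and a document with no DTD; the class PredefinedEntitySketch and its TryParse helper are hypothetical names used only here.

```csharp
using System;
using System.Xml;

public static class PredefinedEntitySketch
{
    // Returns the parsed text of <t>...</t>, or null if the parser rejects it.
    public static string TryParse(string content)
    {
        XmlDocument doc = new XmlDocument();
        try
        {
            doc.LoadXml("<t>" + content + "</t>");
            return doc.DocumentElement.InnerText;
        }
        catch (XmlException)
        {
            return null;
        }
    }

    public static void Main()
    {
        // The five entities XML predefines resolve without any DTD.
        Console.WriteLine(TryParse("&lt;&gt;&amp;&apos;&quot;") ?? "(rejected)");
        // &ouml; is declared by the HTML DTDs, not by XML itself.
        Console.WriteLine(TryParse("&ouml;") ?? "(rejected)");
    }
}
```

The first call should yield the five literal characters, while the second is rejected because nothing in the document declares an entity named ouml.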
Signature

Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

clintonG - 01 Jun 2007 17:19 GMT
>* clintonG wrote in microsoft.public.dotnet.xml:
>>I understand what W3C documents say but XML and HTML derive from SGML and
[quoted text clipped - 7 lines]
> so if you speak about the pre-defined entities in XML you refer to five,
> if you speak about those in HTML you refer to hundreds of them.

Nobody argues that point Björn except to say the correct use of the English
language used in a formal document requires the use of "narrative" and
"expository" use of the grammar which we native speakers of English are
taught in grade school.

I value consistency in technical documentation which is considered a formal
use of the language. Consistency should not be compromised for the sake of
brevity which in this context results in the obfuscation of terminology. I
mean what are we talking about being needed here? A single paragraph of
narrative supported by a single expository table of five rows to resolve an
apparent contradiction which is not a contradiction at all?

Sometimes the people on the W3C working groups do not make the best
decisions and are not necessarily known for their mastery of the English
language, which is said to be the most difficult language to master. That
said, having observed over the years how software developers will quibble
with one another for weeks or perhaps months about a single term and its
meaning, I'm genuinely surprised this discrepancy has been overlooked.

<%= Clinton
