Read XHTML into XML

Jose Antonio Reyes - 28 Jun 2007 09:04 GMT
Hi all,

I need to read/parse XHTML aspx pages and look for certain tokens and
content. How can I use an XmlTextReader for this? If that is not possible,
are there any other ideas?

Thanks in advance,

JA Reyes.
Martin Honnen - 28 Jun 2007 13:02 GMT
> I need to read/parse XHTML aspx pages and look for certain tokens and
> content. How can I use an XmlTextReader for this? If not, any other ideas?

If the pages are well-formed XHTML then it is possible to use XmlReader
(in .NET 2.0/3.0) or XmlTextReader (in .NET 1.x) to parse the XHTML
documents. You can also use the other XML APIs .NET provides:
XPathNavigator and/or XmlDocument might be more comfortable to work with
than XmlReader.

Here is an example using XmlReader that prints out all heading elements
(h1 .. h6 elements) assuming they have no child elements:

    using System;
    using System.Xml;

    static public void PrintHeadings (string path) {
      XmlReaderSettings settings = new XmlReaderSettings();
      // Allow the DOCTYPE so that entities declared in the XHTML DTD resolve.
      // (In .NET 4.0 and later: settings.DtdProcessing = DtdProcessing.Parse.)
      settings.ProhibitDtd = false;
      using (XmlReader xmlReader = XmlReader.Create(path, settings)) {
        while (xmlReader.Read()) {
          // Only look at elements in the XHTML namespace.
          if (xmlReader.NodeType == XmlNodeType.Element &&
              xmlReader.NamespaceURI == "http://www.w3.org/1999/xhtml") {
            switch (xmlReader.LocalName) {
              case "h1":
              case "h2":
              case "h3":
              case "h4":
              case "h5":
              case "h6":
                Console.Out.WriteLine(
                  "{0} heading has InnerText: \"{1}\".",
                  xmlReader.LocalName, xmlReader.ReadString());
                break;
            }
          }
        }
      }
    }

    // Example call, made from outside the method:
    PrintHeadings("doc.xhtml");
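
For comparison, the XmlDocument route mentioned above could look roughly like the sketch below. It is only an illustration: the class name HeadingFinder, the method GetHeadings, and the sample markup are made up for this example, and the input is passed as a string without a DOCTYPE so that nothing has to be fetched. Unlike the reader loop, InnerText also copes with headings that contain child elements.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper, not part of the original post.
static class HeadingFinder {
  const string XhtmlNs = "http://www.w3.org/1999/xhtml";

  // Returns "name: text" for every h1..h6 element, in document order.
  public static List<string> GetHeadings(string xhtml) {
    XmlDocument doc = new XmlDocument();
    doc.LoadXml(xhtml);

    // XPath needs a prefix bound to the XHTML namespace,
    // even though the document uses it as the default namespace.
    XmlNamespaceManager ns = new XmlNamespaceManager(doc.NameTable);
    ns.AddNamespace("x", XhtmlNs);

    List<string> result = new List<string>();
    foreach (XmlNode node in doc.SelectNodes(
        "//x:h1 | //x:h2 | //x:h3 | //x:h4 | //x:h5 | //x:h6", ns)) {
      // InnerText concatenates the text of all descendants.
      result.Add(node.LocalName + ": " + node.InnerText);
    }
    return result;
  }

  static void Main() {
    string xhtml =
      "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>" +
      "<h1>Title</h1><p>Text</p><h2>Sub <em>section</em></h2>" +
      "</body></html>";
    foreach (string line in GetHeadings(xhtml)) {
      Console.WriteLine(line);
    }
  }
}
```

The trade-off is that XmlDocument loads the whole page into memory, whereas XmlReader streams it, so the reader remains the better fit for very large files.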

    Martin Honnen --- MVP XML
    http://JavaScript.FAQTs.com/

Jose Antonio Reyes - 28 Jun 2007 20:18 GMT
Thanks Martin,

but how can I load the aspx page's DTD? I need to deal with special symbols
like &nbsp; and so on...

For example:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

Thanks in advance,

Jose Antonio Reyes.

Martin Honnen - 29 Jun 2007 12:55 GMT
> but how can I load the aspx page's DTD? I need to deal with special symbols
> like &nbsp; and so on...
>
> For example:
>
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

That is an SGML DTD; don't expect an XML parser to consume it.
If the document is an XHTML document (not an HTML 4.01 document) then you
can parse it with XmlReader. I have already included the settings for that:

    static public void PrintHeadings (string path) {
      XmlReaderSettings settings = new XmlReaderSettings();
      settings.ProhibitDtd = false;
      using (XmlReader xmlReader = XmlReader.Create(path, settings)) {
        // ... rest of the method as in my earlier post

Jose Antonio Reyes - 29 Jun 2007 13:52 GMT
Unfortunately, I could find some &nbsp; entities or JavaScript in the aspx page.

Would it be a good solution to process the aspx page first and wrap that
content in CDATA sections?

Thanks.

Martin Honnen - 29 Jun 2007 14:27 GMT
> Unfortunately, I could find some &nbsp; entities or JavaScript in the aspx page.
>
> Would it be a good solution to process the aspx page first and wrap that
> content in CDATA sections?

If the document is an XHTML document and the entity nbsp is defined in
the DTD then the XML parser can parse it.
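
To illustrate the point, here is a minimal sketch in which the nbsp entity is declared in an internal DTD subset, so the example runs without fetching the external XHTML DTD. The class name NbspDemo, the method ReadFirstParagraph, and the sample markup are assumptions made for this example. It uses settings.DtdProcessing = DtdProcessing.Parse, which is the .NET 4.0 and later replacement for settings.ProhibitDtd = false.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical demo class, not part of the original thread.
static class NbspDemo {
  // Returns the text content of the first p element, or null if there is none.
  public static string ReadFirstParagraph(string xhtml) {
    XmlReaderSettings settings = new XmlReaderSettings();
    // Allow the DOCTYPE so the entity declaration is read.
    settings.DtdProcessing = DtdProcessing.Parse;

    using (XmlReader reader =
        XmlReader.Create(new StringReader(xhtml), settings)) {
      while (reader.Read()) {
        if (reader.NodeType == XmlNodeType.Element &&
            reader.LocalName == "p") {
          // Entity references are expanded before the text is returned.
          return reader.ReadElementContentAsString();
        }
      }
    }
    return null;
  }

  static void Main() {
    string xhtml =
      "<!DOCTYPE html [<!ENTITY nbsp \"&#160;\">]>" +
      "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>" +
      "<p>a&nbsp;b</p></body></html>";
    string text = ReadFirstParagraph(xhtml);
    // The entity should have become a single U+00A0 character.
    Console.WriteLine(text == "a\u00A0b");
  }
}
```

If the DOCTYPE is removed from the sample string, the reader should instead fail on the undeclared nbsp reference, which is the situation an HTML 4.01 page puts an XML parser in.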
