

How to perform XPath queries on HTML?

Siegfried Heintze - 02 Dec 2007 19:24 GMT
JTidy is a java library that will populate an XML DOM from an HTML string.
The XML DOM has XPATH. Is there a similar library for C# and VB.NET
programmers that will allow me to perform XPATH queries on HTML?

Also, what is the name of the HTTP client that will allow me to fetch the
HTML from a web site?

thanks,
Siegfried
Scott M. - 02 Dec 2007 22:45 GMT
You can only perform XPath operations on XML, so, by definition, it can't be
used with HTML. But, if you have XHTML, then you can simply load up an
XmlDocument with this XHTML and use XPath on it.
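
A minimal sketch of the XHTML route in C#, assuming the markup is already well-formed XML. The XHTML string and the example.com URL are made up for illustration. Note that XHTML elements sit in the XHTML namespace, so the XPath expression needs a prefix bound to it (the prefix "x" here is an arbitrary choice):

```csharp
using System;
using System.Xml;

class XhtmlXPathDemo
{
    static void Main()
    {
        // Hypothetical, well-formed XHTML fragment (no DOCTYPE, so no DTD lookup).
        string xhtml =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\">" +
            "<body><p><a href=\"http://example.com/\">Example</a></p></body></html>";

        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the markup is not well-formed

        // Bind a prefix to the XHTML namespace; an unprefixed "//a" would match nothing.
        XmlNamespaceManager ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("x", "http://www.w3.org/1999/xhtml");

        foreach (XmlNode link in doc.SelectNodes("//x:a[@href]", ns))
        {
            Console.WriteLine(link.Attributes["href"].Value + " -> " + link.InnerText);
        }
    }
}
```

If the input is ordinary tag-soup HTML rather than XHTML, LoadXml will most likely throw, which is the limitation described above.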

-Scott

> JTidy is a java library that will populate an XML DOM from an HTML string.
> The XML DOM has XPATH. Is there a similar library for C# and VB.NET
[quoted text clipped - 5 lines]
> thanks,
> Siegfried
Barry Kelly - 03 Dec 2007 17:16 GMT
> You can only perform XPath operations on XML, so, by definition, it can't be
> used with HTML.

A handy thing about the XPathNavigator class in .NET is that if you can
implement it (i.e. derive and implement its abstract methods) for your
arbitrary tree-shaped data structure, then you can query it using XPath.

-- Barry

Signature

http://barrkel.blogspot.com/

Scott M. - 04 Dec 2007 00:44 GMT
But, since HTML may not be a well-formed tree structure, wouldn't you have
problems querying it?

>> You can only perform XPath operations on XML, so, by definition, it can't
>> be
[quoted text clipped - 5 lines]
>
> -- Barry
Barry Kelly - 05 Dec 2007 15:26 GMT
> But, since HTML may not be a well-formed tree structure, wouldn't you have
> problems querying it?

Like I said earlier, HtmlAgilityPack uses a very lenient but
deterministic HTML parser. It can make a tree out of just about any
source HTML; as long as the XPath query works on one instance of the
server side's generated HTML (assuming it's generated; otherwise, why
automate the querying?), then it should work on subsequent instances.

In other words, even if the HTML is malformed and results in a
non-compliant tree, the formation of the tree itself is deterministic
and so it ought to be consistently queryable.
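
One rough way to check this claim, assuming HtmlAgilityPack is referenced by the project: parse the same malformed string twice and compare what an XPath query returns. The input string and the helper name Query are invented for this sketch, and no particular tree shape is assumed, only that both parses agree:

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party library, not part of the .NET Framework

class DeterministicParseDemo
{
    // Hypothetical helper: parse the HTML and return the OuterHtml of every match.
    static List<string> Query(string html, string xpath)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        List<string> results = new List<string>();
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes != null) // SelectNodes returns null when nothing matches
        {
            foreach (HtmlNode node in nodes)
            {
                results.Add(node.OuterHtml);
            }
        }
        return results;
    }

    static void Main()
    {
        // Deliberately malformed: overlapping tags and unclosed elements.
        string malformed = "<div><b>bold<i>both</b>italic</i><td>stray cell";

        List<string> first = Query(malformed, "//i");
        List<string> second = Query(malformed, "//i");

        bool same = first.Count == second.Count;
        for (int i = 0; same && i < first.Count; i++)
        {
            same = first[i] == second[i];
        }
        Console.WriteLine("Same result on both parses: " + same);
    }
}
```

This only shows repeatability for identical input; a query can still break if the site changes the markup it generates.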

-- Barry

Signature

http://barrkel.blogspot.com/

Barry Kelly - 03 Dec 2007 17:11 GMT
> JTidy is a java library that will populate an XML DOM from an HTML string.
> The XML DOM has XPATH. Is there a similar library for C# and VB.NET
> programmers that will allow me to perform XPATH queries on HTML?

Use HtmlAgilityPack. It has a basic, lenient HTML parser and implements
IXPathNavigable and a basic DOM, so it can be searched using XPath.

http://www.codeplex.com/htmlagilitypack
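
A short sketch of what that looks like in C#, assuming the HtmlAgilityPack assembly is referenced. The HTML string is a made-up example with missing closing tags:

```csharp
using System;
using HtmlAgilityPack; // third-party library, not part of the .NET Framework

class HtmlXPathDemo
{
    static void Main()
    {
        // Hypothetical input: the <p>, <body> and <html> elements are never closed.
        string html =
            "<html><body><p>Links: <a href=\"/one\">One</a><br>" +
            "<a href=\"/two\">Two</a>";

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html); // lenient parse; does not require well-formed markup

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null)
        {
            Console.WriteLine("No links found.");
            return;
        }

        foreach (HtmlNode link in links)
        {
            Console.WriteLine(link.GetAttributeValue("href", "") + " -> " + link.InnerText);
        }
    }
}
```

Because HtmlDocument implements IXPathNavigable, doc.CreateNavigator() should also work if you prefer the System.Xml.XPath API over the HtmlNode methods.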

> Also, what is the name of the HTTP client that will allow me to fetch the
> HTML from a web site?

WebRequest & WebResponse should be able to do this for you, no? Do you
have more specific questions about WebRequest.Create / etc?
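
For reference, a minimal sketch of fetching a page this way. The URL and the helper name FetchHtml are placeholders, and the code assumes the response is text that StreamReader can decode with its default UTF-8 handling:

```csharp
using System;
using System.IO;
using System.Net;

class FetchHtmlDemo
{
    // Hypothetical helper: download the body of a URL as a string.
    static string FetchHtml(string url)
    {
        WebRequest request = WebRequest.Create(url);

        // Dispose the response, stream and reader as soon as the body is read.
        using (WebResponse response = request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(stream))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        string html = FetchHtml("http://example.com/"); // placeholder URL
        Console.WriteLine(html.Length + " characters fetched");
    }
}
```

The resulting string can be passed straight to an HTML parser. On current .NET versions WebRequest is marked obsolete in favour of HttpClient, so treat this as matching the APIs named in this thread rather than a recommendation for new code.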

-- Barry

Signature

http://barrkel.blogspot.com/


