
Parsing HTML

Mohammad-Reza - 23 Feb 2007 16:10 GMT
Hi
I want to parse a web page (in a web service) and retrieve some of its
information. I searched MSDN and found a walkthrough (How to: Create Web
Services That Parse the Contents of a Web Page), but the walkthrough is a
little complex and the writer did not completely describe all aspects of
the solution.
Could anyone elaborate on this walkthrough, or direct me to another (or
better) way to deal with such a problem?

Thanks in advance.
Scott M. - 23 Feb 2007 23:37 GMT
How about using the W3C Document Object Model, which was designed to do just
what you are trying to do?

Mohammad-Reza - 24 Feb 2007 05:51 GMT
I want to write a web service that extracts some information from a web page
and use that web service in a Windows application. I think the usual approach
to parsing (getting the HTML code and finding the keys using loops) is a
little slow and costs too much. Is there any way in .NET to simply extract
that information (for example, a method that returns every HTML tag of the
web page with its value)?
The processing time of the web service is very important to me.

Thanks in advance.

Scott M. - 24 Feb 2007 17:54 GMT
I don't know where you have gotten your information, but this is exactly
what the DOM is for.

John Saunders - 25 Feb 2007 14:12 GMT
>I don't know where you have gotten your information, but this is exactly
>what the DOM is for.

Scott,

I used this approach with a Windows Forms application back in 2001, with
.NET 1.0. It worked, but was a bit clumsy, and it was time-consuming. I used
the ActiveX Internet Browser control to load the page I was interested in,
and once the page was loaded, I could access the DOM from C# code. Did you
have a different technique in mind when you talk about the DOM?

Perhaps a faster technique would be to use regular expressions to parse the
HTML and find what you're looking for.

John
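
A minimal sketch of the regular-expression idea above, assuming the goal is to pull the page title and the link targets out of an HTML string. The sample markup, the patterns, and the module name are illustrative assumptions, not part of the MSDN walkthrough. Patterns like these are fast but brittle, and they tend to break on nested or unusual markup.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module RegexScrapeSketch
    Sub Main()
        ' Hypothetical sample markup; in practice this would be the downloaded page.
        Dim html As String = "<html><head><title>Sample Page</title></head>" & _
            "<body><a href=""/first"">First</a> <A HREF='/second'>Second</A></body></html>"

        ' Capture the text between <title> and </title>, ignoring case.
        Dim titleMatch As Match = Regex.Match(html, "<title[^>]*>(.*?)</title>", _
            RegexOptions.IgnoreCase Or RegexOptions.Singleline)
        If titleMatch.Success Then
            Console.WriteLine("Title: " & titleMatch.Groups(1).Value)
        End If

        ' Capture the href value of every anchor tag, quoted with either " or '.
        Dim linkPattern As String = "<a\s[^>]*href\s*=\s*[""']([^""']*)[""']"
        For Each m As Match In Regex.Matches(html, linkPattern, RegexOptions.IgnoreCase)
            Console.WriteLine("Link: " & m.Groups(1).Value)
        Next
    End Sub
End Module
```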
Scott M. - 25 Feb 2007 15:38 GMT
What I had in mind was that, if the HTML in question is well-formed (XHTML), you
could just load it (from a string) into an XmlDocument object and use the
XML DOM to parse from there.

Mohammad-Reza - 26 Feb 2007 08:22 GMT
Can you give some sample code for loading XHTML into an XmlDocument?

Scott M. - 26 Feb 2007 14:51 GMT
Well, XHTML is XML, so you'd really be loading XML into an XmlDocument, but
once it's loaded, you can parse out whatever you like using the DOM.

Dim xmlDoc As New System.Xml.XmlDocument()
'You can load the XML in one of two ways...

'docPath represents a path to a file containing the XML
xmlDoc.Load(docPath)

'or
'Here you can load a string directly (xhtmlString holds the markup)
xmlDoc.LoadXml(xhtmlString)

'Example of getting all the paragraph tags and then the text of the first
'one using the DOM. XML names are case-sensitive, and XHTML tags are lowercase.
Dim theParagraphs As System.Xml.XmlNodeList = xmlDoc.GetElementsByTagName("p")
Dim firstParagraphText As String = theParagraphs(0).InnerText

-Scott
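
For reference, here is a self-contained sketch of the XmlDocument approach that also lists every element with its text, which is roughly what the original question asked for. The sample markup and the module name are assumptions made for illustration. The markup has no xmlns declaration; real XHTML usually declares the XHTML namespace, in which case the XPath queries would need an XmlNamespaceManager.

```vbnet
Imports System
Imports System.Xml

Module XhtmlDomSketch
    Sub Main()
        ' Hypothetical well-formed XHTML standing in for the real page.
        Dim xhtml As String = "<html><body>" & _
            "<p>First paragraph</p><p>Second <b>paragraph</b></p>" & _
            "</body></html>"

        Dim xmlDoc As New XmlDocument()
        xmlDoc.LoadXml(xhtml)

        ' "//*" selects every element in the document, in document order.
        For Each element As XmlNode In xmlDoc.SelectNodes("//*")
            ' InnerText concatenates the text of the node and all its descendants.
            Console.WriteLine(element.Name & ": " & element.InnerText)
        Next

        ' An XPath query can also target a single node directly.
        Dim bold As XmlNode = xmlDoc.SelectSingleNode("/html/body/p/b")
        If bold IsNot Nothing Then
            Console.WriteLine("Bold text: " & bold.InnerText)
        End If
    End Sub
End Module
```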

John Saunders - 26 Feb 2007 17:27 GMT
> What I had in mind was, if the HTML in question was well-formed (XHTML),
> you could just load it into an XMLDocument (from a string) object and use
> the XML DOM to parse from there.

That works well for XHTML. The problem is that most web sites are still
using HTML, which is not well-formed XML.

John
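
To illustrate the point about well-formedness, here is a small sketch showing that XmlDocument.LoadXml throws an XmlException on ordinary HTML habits such as an unquoted attribute value or an unclosed br tag. The helper name TryLoadAsXml and both sample strings are hypothetical.

```vbnet
Imports System
Imports System.Xml

Module WellFormedCheckSketch
    ' Returns True when the markup parses as XML, False otherwise.
    Function TryLoadAsXml(ByVal markup As String, ByVal doc As XmlDocument) As Boolean
        Try
            doc.LoadXml(markup)
            Return True
        Catch ex As XmlException
            Console.WriteLine("Not well-formed: " & ex.Message)
            Return False
        End Try
    End Function

    Sub Main()
        ' Typical HTML 4 habits: an unquoted attribute value and an unclosed <br>.
        Dim html As String = "<html><body><p align=center>Line one<br>Line two</p></body></html>"
        ' The same content written as well-formed XHTML.
        Dim xhtml As String = "<html><body><p align=""center"">Line one<br />Line two</p></body></html>"

        Console.WriteLine("HTML loaded: " & TryLoadAsXml(html, New XmlDocument()))
        Console.WriteLine("XHTML loaded: " & TryLoadAsXml(xhtml, New XmlDocument()))
    End Sub
End Module
```

A check like this could serve as a guard: use the XML DOM when the page loads cleanly, and fall back to another parser when it does not.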
Scott M. - 26 Feb 2007 19:06 GMT
But we're not talking about most web pages. We are talking about a
particular page that is being used with a web service. In other words, it's
part of the OP's application, which he should have some control over.

John Saunders - 26 Feb 2007 22:49 GMT
> But we're not talking about most web pages. We are talking about a
> particular page that is being used with a web service. In other words,
> it's part of the OP's application, which he should have some control over.

Sorry, I didn't recall that he said it was his application. I assumed he was
scraping from somebody else's application.

Even though it's his, there may be reasons why he can't guarantee that the
page he needs will be XHTML and will be guaranteed to remain XHTML.

John
Stane Bozic - 03 Nov 2007 01:04 GMT
An answer is HtmlAgilityPack (www.codeplex.com/htmlagilitypack).
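
A minimal sketch of how the Html Agility Pack might be used, assuming the project references the HtmlAgilityPack assembly (a third-party library, not part of the .NET Framework). The sample markup and module name are assumptions; the markup is deliberately not well-formed XML, which is the case this library is meant to handle.

```vbnet
Imports System
Imports HtmlAgilityPack

Module AgilityPackSketch
    Sub Main()
        ' Hypothetical sample; note the unquoted attribute and the unclosed <br>.
        Dim html As String = "<html><body><p align=center>Line one<br>Line two</p>" & _
            "<p>Another paragraph</p></body></html>"

        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' SelectNodes takes an XPath expression and returns Nothing when nothing matches.
        Dim paragraphs As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//p")
        If paragraphs IsNot Nothing Then
            For Each p As HtmlNode In paragraphs
                Console.WriteLine(p.InnerText)
            Next
        End If
    End Sub
End Module
```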


©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.