Hi folks,
I am retrieving a web page using HttpWebRequest. What I want to
do with the retrieved page is list all the hyperlinks in it. If I
do a simple regex search for <a then I also get links that are commented
out in the markup, and I don't want those. I only want links that are
actually active. This is for a reciprocal link check.
Can someone please point me in the right direction?
Thanks.

Alexey Smirnov - 14 Aug 2007 09:17 GMT
> Hi folks,
>
[quoted text clipped - 3 lines]
> code and I don't want that. I want links that are actually active. This is
> to do with reciprocal link check.
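Hi, I think you can try stripping the comments from the text before you
extract the links.
For example:
html_code = Regex.Replace(html_code, "<!--.*?-->", "", RegexOptions.Singleline);
This replaces every HTML comment with an empty string (Singleline lets
"." match newlines, so multi-line comments are covered), and then you
can get the links from what is left.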
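To make the two steps concrete, here is a minimal sketch of how they might fit together. The class name, method name and the href pattern are my own assumptions, not anything from a library. The pattern only handles quoted href values and would also pick up attributes whose names merely end in "href", so treat it as a starting point rather than a robust parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
static class LinkExtractor
{
    // Strips HTML comments, then collects href values from the remaining <a> tags.
    public static List<string> GetActiveLinks(string html)
    {
        // Singleline makes '.' match newlines, so multi-line comments are removed too.
        string cleaned = Regex.Replace(html, "<!--.*?-->", "", RegexOptions.Singleline);

        var links = new List<string>();

        // Assumed pattern: an <a ...> tag with a single- or double-quoted href.
        var anchor = new Regex(
            "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']*)[\"']",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        foreach (Match m in anchor.Matches(cleaned))
        {
            // Group 1 holds the text between the quotes.
            links.Add(m.Groups[1].Value);
        }
        return links;
    }
}
```

You would call it with the page text you already downloaded, e.g. LinkExtractor.GetActiveLinks(html_code), and then check whether the reciprocal URL is in the returned list.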
Jesse Houwing - 14 Aug 2007 11:56 GMT
Hello Enigma,
> Hi folks,
>
[quoted text clipped - 7 lines]
>
> Thanks.
Have a look at the HTML Agility Pack. It allows you to treat the HTML as
if it were XML, so you can select the <a> elements with XPath instead of
a regex.
http://www.codeplex.com/Wiki/View.aspx?ProjectName=htmlagilitypack
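A rough sketch of what that could look like, assuming the HtmlAgilityPack assembly is referenced in your project. The class and method names here are made up for the example. Since the parser keeps comments as separate comment nodes, links inside <!-- --> should not be matched by the XPath query.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper for illustration only.
static class AgilityLinkExtractor
{
    // Returns the href value of every <a> element that has one.
    public static List<string> GetActiveLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return links;
        }

        foreach (HtmlNode a in anchors)
        {
            // The second argument is the default used if the attribute is missing.
            links.Add(a.GetAttributeValue("href", ""));
        }
        return links;
    }
}
```

The advantage over the regex route is that you do not have to worry about attribute order, quoting style or line breaks inside the tag.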
--
Jesse Houwing
jesse.houwing at sogeti.n