I have been trying to parse a webpage in my own free time and I have
come to yet another regex I can't quite seem to get. I wanted to get
the data inside the <table> tags, however the HTML and other data
inside it span multiple lines and nothing I use seems to work.
My first attempt was: <table .*>(?<Info>.*?)</table>
This worked on <table><td>this is some random sentence</td></table>
but not on:
<table width="100%" border="0" cellspacing="0" cellpadding="0"
class="niceTableBorder">
test
</table>
I tried playing with the whitespace \s escape but nothing seemed to
work.
Is there a highly recommended book on C# regexes that I can pick up
to learn this instead of relying on the usenet group here?
Any help appreciated.
-Sean
Martin Honnen - 17 Mar 2008 17:34 GMT
> I have been trying to parse a webpage in my own free time and I have
> come to yet another regex I can't quite seem to get.
Why don't you use an HTML parser like the SgmlReader
<URL:http://wiki.opengarden.org/Community/SgmlReader_1.7.2> or the HTML
agility pack <URL:http://www.codeplex.com/htmlagilitypack>? Forget about
regular expressions to parse the markup.

Martin Honnen --- MVP XML
http://JavaScript.FAQTs.com/
Sean - 17 Mar 2008 18:39 GMT
I was trying to stay away from third-party items. It's as much a
learning experience as a fun hobby. I figured that for something as trivial
as this I wouldn't need all the extra functionality. I
understand that if I were doing something more in-depth I wouldn't want to
reinvent the wheel, but I don't think this is that deep.
Jarlaxle - 17 Mar 2008 19:04 GMT
I would use the XmlDocument class or the XmlTextReader class. They are
simple to use and built into .NET.
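A minimal sketch of the XmlDocument route, assuming the markup happens to be well-formed XML (real-world HTML often is not, and LoadXml throws an XmlException in that case). The class and method names are made up for illustration.

```csharp
using System.Xml;

public static class XmlTableSketch
{
    // Returns the text inside the first <table> element, or null if there is none.
    // Assumption: 'markup' is well-formed XML; LoadXml throws XmlException otherwise.
    public static string FirstTableText(string markup)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(markup);

        // XPath "//table" selects the first table element anywhere in the document.
        XmlNode table = doc.SelectSingleNode("//table");
        return table == null ? null : table.InnerText;
    }
}
```

Calling XmlTableSketch.FirstTableText on the multi-line sample from the original post should give back the text "test" surrounded by line breaks, so a Trim() is probably wanted.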
Ben Voigt [C++ MVP] - 17 Mar 2008 19:17 GMT
To solve your immediate problem, use string.Replace("\r", " ").Replace("\n",
" ")
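As a rough sketch, the line-break replacement could be combined with the pattern from the original post like this. The class and method names are hypothetical, and the pattern is kept exactly as posted.

```csharp
using System.Text.RegularExpressions;

public static class FlattenSketch
{
    // Replaces line breaks with spaces, then applies the original pattern.
    // Returns the captured "Info" group, or null when nothing matches.
    public static string ExtractAfterFlattening(string html)
    {
        // With the line breaks gone, '.' can run across the whole opening tag.
        string flat = html.Replace("\r", " ").Replace("\n", " ");

        Match m = Regex.Match(flat, @"<table .*>(?<Info>.*?)</table>");
        return m.Success ? m.Groups["Info"].Value : null;
    }
}
```

On the multi-line sample this should capture " test " (with the spaces that replaced the line breaks). With two sibling tables, the greedy .* in the opening tag runs on to the last opening tag, so only the last table's content is captured.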
However, you'll still be in trouble with multiple tables: the closing tag may
match the wrong opening tag. (Whether nested tables or sibling tables give you
trouble depends on whether you use greedy or lazy matching; neither
minimal nor maximal matching will be correct in every case.)
I think it's proven that you can't match nested delimiters in the general
case using pure regexes.
If you want something more lightweight than a full parser, you could try to
regex match individual tags and use a stack. Note that angle brackets
inside quoted strings could still cause you some grief.
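The tag-plus-stack idea might look roughly like the sketch below. It is only an illustration under some assumptions: the names are invented, only table tags are tracked, and a '>' inside a quoted attribute value will still confuse the tag pattern.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TableStackSketch
{
    // Matches one opening or closing table tag; group "close" holds "/" for a closing tag.
    private static readonly Regex TableTag =
        new Regex(@"<(?<close>/?)table\b[^>]*>", RegexOptions.IgnoreCase);

    // Returns the inner markup of every properly closed table,
    // in the order in which the closing tags appear.
    public static List<string> TableContents(string html)
    {
        List<string> results = new List<string>();
        Stack<int> contentStarts = new Stack<int>();

        foreach (Match tag in TableTag.Matches(html))
        {
            if (tag.Groups["close"].Value.Length == 0)
            {
                // Opening tag: remember where its content begins.
                contentStarts.Push(tag.Index + tag.Length);
            }
            else if (contentStarts.Count > 0)
            {
                // Closing tag: pair it with the most recent unclosed opening tag.
                int start = contentStarts.Pop();
                results.Add(html.Substring(start, tag.Index - start));
            }
            // A closing tag with no open table is simply ignored.
        }

        return results;
    }
}
```

Because each closing tag is paired with the nearest unclosed opening tag, a nested table comes out before the table that contains it.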
Sean - 17 Mar 2008 19:20 GMT
Thanks Ben.
Yeah, I think if I take out all of the whitespace as you suggest it'll
create fewer problems.
I also see what you mean: grab the tags and implement a stack object
to push and pop all of the opening and closing tag elements. I think
that'll be the best approach I can find.
Roger Frost - 17 Mar 2008 20:09 GMT
I too am interested in a really good book about Regular Expressions
(specific to C# or .NET would be excellent).
In the meantime, Sean, I have been referencing
http://www.regular-expressions.info/ but since this site has the top two
Google matches, you are probably already aware of it. :)
In your example below, have you tried passing RegexOptions.Singleline? It
causes "." to match all characters including "\n". I think it is required
in order to span multiple lines unless you match the newline character
explicitly in your Regex statement.
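A small sketch of the difference the option makes, using the pattern from the original post unchanged. The class and method names are invented for the example.

```csharp
using System.Text.RegularExpressions;

public static class SinglelineSketch
{
    // Applies the original pattern with the given options.
    // Returns the captured "Info" group, or null when nothing matches.
    public static string ExtractInfo(string html, RegexOptions options)
    {
        Match m = Regex.Match(html, @"<table .*>(?<Info>.*?)</table>", options);
        return m.Success ? m.Groups["Info"].Value : null;
    }
}
```

With RegexOptions.None the multi-line sample should not match at all, because '.' stops at the first line break inside the opening tag. With RegexOptions.Singleline it should match and capture "test" plus the surrounding line breaks.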
Hope this helps. Regular expressions are the bane of my existence.

Roger Frost
"Logic Is Syntax Independent"
Moe Sisko - 18 Mar 2008 00:10 GMT
> In the meantime, Sean, I have been referencing
> http://www.regular-expressions.info/ but since this site has the top two
> Google matches, you are probably already aware of it. :)
To add to Roger's post :
That helpful site also has a sample demo app (written in C#) which you can
download from : http://www.regular-expressions.info/dotnet.html
I find experimenting with regexes using the demo app really helpful.
Jesse Houwing - 18 Mar 2008 02:33 GMT
Hello Roger,
I can really recommend Regular Expressions with .NET by Dan Appleman (http://www.amazon.com/Regular-Expressions-NET-Dan-Appleman/dp/B0000632ZU),
or Mastering Regular Expressions by Jeffrey Friedl (http://www.amazon.com/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124/ref=pd_bbs_sr_1?ie=UTF8&s=books&qid=1205803889&sr=1-1).
The first one is a real .NET reference with C# code examples, and it explains
the specifics of the .NET regular expression syntax.
The second is the overall bible on regular expressions. It covers the different
ways to implement a regex engine and builds on from there. It's a great read
if you want a more scientific background on the inner workings of a regex
engine, and if you want to learn to spot performance issues and other
harder parts of the regex language.
Jesse
--
Jesse Houwing
jesse.houwing at sogeti.nl
Roger Frost - 18 Mar 2008 11:11 GMT
I will look into Dan Appleman's book for sure.
Once I understand how to use regular expressions correctly maybe I can get
ambitious. :)
Thanks a bunch Jesse!

Roger Frost
"Logic Is Syntax Independent"
Sean - 18 Mar 2008 12:46 GMT
Thanks Jesse for the book references, I'm checking them out now!
Jesse Houwing - 18 Mar 2008 02:13 GMT
Hello Sean,
As someone else already pointed out, '.' matches everything except the
newline character. There is a special option to change this behaviour, but
it is rarely needed. In this case it would probably cause more trouble
than it's worth.
As someone else already suggested, an SGML or HTML reader is probably your
best option. I'd try the HtmlAgilityPack on Codeplex.com, but this
can also be solved with regex:
<table[^>]*>(?<info>((?!</table).)*)
will work as long as you activate RegexOptions.Singleline
<table[^>]*>(?<info>((?!</table)[\s\S])*)
will work even without specifying RegexOptions.Singleline
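Both patterns can be tried side by side with something like the sketch below. The patterns are the two given above; the class, constant, and method names are only for illustration.

```csharp
using System.Text.RegularExpressions;

public static class TemperedDotSketch
{
    // Pattern 1: '.' guarded by a negative lookahead; needs Singleline to cross line breaks.
    private const string DotPattern = @"<table[^>]*>(?<info>((?!</table).)*)";

    // Pattern 2: [\s\S] matches any character, so no option is needed.
    private const string AnyCharPattern = @"<table[^>]*>(?<info>((?!</table)[\s\S])*)";

    // Content of the first table using pattern 1, or null when nothing matches.
    public static string WithSingleline(string html)
    {
        Match m = Regex.Match(html, DotPattern, RegexOptions.Singleline);
        return m.Success ? m.Groups["info"].Value : null;
    }

    // Content of the first table using pattern 2, or null when nothing matches.
    public static string WithoutOption(string html)
    {
        Match m = Regex.Match(html, AnyCharPattern);
        return m.Success ? m.Groups["info"].Value : null;
    }
}
```

The lookahead stops the capture right before the first "</table", so with sibling tables each match stays inside its own table. Nested tables are still cut off at the inner closing tag.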
The biggest problem with Singleline on is that there is a good chance
that a '.*' somewhere will consume the whole contents of the file and start
backtracking from there, just like the .* in your table pattern. These
are real performance killers.
--
Jesse Houwing
jesse.houwing at sogeti.nl