Hi Jesse,
Thanks again for the assistance.
I have taken what you have posted and put it into a console app but it
doesn't seem to pick up anything using the expression. If I uncomment the
shortpat and the first foreach, I get data. Am I missing something?
static void Main(string[] args)
{
//string shortpat = @">([\w\s]+)<";
string pat =
@"<tr[^>]*><td[^>]*>(?<team>((?!</td).)*)</td[^>]*>\s*<td[^>]*>(?<quarter>((?!</td).)*)</td[^>]*>\s*<td[^>]*>(?<score>((?!</td).)*)</td[^>]*>\s*</tr[^>]>";
string html = @"<table border=""0"" width=""100%""><tr><td
width=""40%"">Team</td><td width=""10%"" align=""center"">1</td>
<td
width=""10%"" align=""center"">2</td><td width=""10%""
align=""center"">3</td> <td width=""10%""
align=""center"">4</td><td
width=""20%"" align=""center"">Score</td></tr><tr><td
width=""40%""><A
href=""/default.asp?c=sportsnetwork&page=nfl/teams/078.htm"">New
Orleans</A></td><td width=""10%"" align=""center"" >0</td><td
width=""10%"" align=""center"" >10</td><td width=""10%""
align=""center"" >0</td><td width=""10%"" align=""center""
>0</td><td
width=""20%"" align=""center"" >10</td></tr><tr><td
width=""40%""><A
href=""/default.asp?c=sportsnetwork&page=nfl/teams/
071.htm"">Indianapolis</A>-</td><td width=""10%""
align=""center"" >7</
td><td width=""10%"" align=""center"" >3</td><td width=""10%""
align=""center"" >14</td><td width=""10%"" align=""center"" >17</
td><td width=""20%"" align=""center"" >41</td></tr></table>";
Regex r = new Regex(pat);
//foreach (Match m in r.Matches(html))
//{
// if (m.Groups[1].Value.Trim() == "")
// // ignore these
// continue;
// else
// Console.WriteLine(m.Groups[1].Value);
// Console.ReadLine();
//}
foreach (Match m in r.Matches(html))
{
string team = m.Groups["team"].Value;
string quarter = m.Groups["quarter"].Value;
string score = m.Groups["score"].Value;
Console.WriteLine(team);
Console.WriteLine(quarter);
Console.WriteLine(score);
}
Console.ReadLine();
}
> Hello JP,
>
[quoted text clipped - 98 lines]
> Jesse Houwing
> jesse.houwing at sogeti.nl
Arnshea - 18 Sep 2007 15:04 GMT
> Hi Jesse,
>
[quoted text clipped - 163 lines]
To capture st. louis try replacing shortpat with:
string shortpat = @">([^<]+)<";
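A small sketch of why the character class matters, assuming a made-up one-row fragment (the helper name InnerText and the sample markup are only for illustration and are not from the original code). The original class [\w\s] has no '.', so a name such as "St. Louis" is skipped, while [^<] accepts anything up to the next tag:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ShortPatDemo
{
    // Hypothetical helper: returns the trimmed, non-empty text that sits
    // between a '>' and the following '<', using the given pattern.
    public static List<string> InnerText(string html, string pattern)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(html, pattern))
        {
            string text = m.Groups[1].Value.Trim();
            if (text.Length > 0)
                result.Add(text);   // skip whitespace-only matches
        }
        return result;
    }
}
```

Called on <td><a href="x.htm">St. Louis</a></td><td>7</td>, the pattern >([^<]+)< yields "St. Louis" and "7", whereas >([\w\s]+)< yields only "7".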
Jesse Houwing - 18 Sep 2007 15:31 GMT
Hello JP,
I updated the expression and tested it. I'm not sure what to capture where.
Maybe if you can describe the table row to me I could do a better expression.
Right now it captures as follows: team, quarter and the score (4x).
<tr[^>]*>\s*<td[^>]*>\s*<a[^>]*>(?<team>((?!</a).)*)</a[^>]*>\s*</td[^>]*>\s*<td[^>]*>(?<quarter>((?!</td).)*)</td[^>]*>(\s*<td[^>]*>(?<score>((?!</td).)*)</td[^>]*>)+\s*</tr[^>]*>
It results in a match that is built up as follows:
Match
+- Groups
+- Team.Value "New Orleans"
+- Quarter.Value "0"
+- Score.Captures[0].Value = 10
+- Score.Captures[1].Value = 0
+- Score.Captures[2].Value = 0
+- Score.Captures[3].Value = 10
My question now is... did I interpret it right?
Is the row built up of a team name, quarter number and 4 scores?
To explain the expression:
<tr[^>]*> Find the start of a row
\s* allow whitespace between <tr> tag and the first <td>
<td[^>]*> Find the first table cell.
\s* allow whitespace between <td> and <a>
<a[^>]*> Find the opening a href tag
(?<team>((?!</a).)*) capture everything you find from there to the </a tag
in a group named team.
</a[^>]*> Find the closing a tag
\s* allow whitespace between </a> and </td>
</td[^>]*> Find the closing td tag
\s* allow whitespace between </td> and <td>
<td[^>]*> Find opening <td>
(?<quarter>((?!</td).)*) capture everything from there to the </td tag into
a group named quarter
</td[^>]*> Find the closing td tag
( start repeating group
\s* allow whitespace between </td> and <td>
<td[^>]*> Find opening <td>
(?<score>((?!</td).)*) capture everything from there to the </td tag into
a group named score
</td[^>]*> Find the closing td tag
)+ end repeating group. If the named group inside this repeating group captures
multiple times, the values can be found in the Group.Captures collection.
\s*
</tr[^>]*> Finally, make sure we've captured a complete row by finding its
end tag.
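To illustrate the repeating-group step above, here is a minimal sketch using a deliberately simplified pattern (no attributes, digits only) rather than the full expression. The class and method names are made up for the example. It shows that a named group inside a repeated group keeps every capture in Group.Captures, while Group.Value only holds the last one:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CapturesDemo
{
    // Simplified stand-in for the repeating score group in the expression.
    private const string Pattern = @"(\s*<td>(?<score>\d+)</td>)+";

    // Returns every value the named group captured, in document order.
    public static List<string> AllScores(string cells)
    {
        var scores = new List<string>();
        Match m = Regex.Match(cells, Pattern);
        if (m.Success)
        {
            foreach (Capture c in m.Groups["score"].Captures)
                scores.Add(c.Value);
        }
        return scores;
    }

    // Returns Group.Value, which is only the final capture of the group.
    public static string LastScore(string cells)
    {
        return Regex.Match(cells, Pattern).Groups["score"].Value;
    }
}
```

For the input <td>10</td> <td>0</td><td>0</td> <td>10</td>, AllScores returns the four values 10, 0, 0, 10 and LastScore returns "10".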
Jesse
> Hi Jesse,
>
[quoted text clipped - 34 lines]
> 071.htm"">Indianapolis</A>-</td><td width=""10%""
> align=""center"" >7</
> td><td width=""10%"" align=""center"" >3</td><td width=""10%""
> align=""center"" >14</td><td width=""10%""
> align=""center"" >17</
> td><td width=""20%"" align=""center"" >41</td></tr></table>";
> Regex r = new Regex(pat);
>
[quoted text clipped - 128 lines]
>> Jesse Houwing
>> jesse.houwing at sogeti.nl
--
Jesse Houwing
jesse.houwing at sogeti.nl
JP - 18 Sep 2007 22:28 GMT
Jesse,
Cut and paste this text into notepad and then save as an html file. You
will see how the data relates. The teams run vertically along with the
quarters and final score. I guess that is what I am having a hard time
understanding: how you can group the different data points given the way
this table is structured.
I did write a little routine to pull out different data points based on
looping through the data but it is a bit kludgy. Was hoping to do something
more robust like your solution.
<html>
<table border="0" width="100%"><tr><td width="40%">Team</td><td width="10%"
align="center">1</td> <td width="10%" align="center">2</td><td width="10%"
align="center">3</td> <td width="10%" align="center">4</td><td width="20%"
align="center">Score</td></tr><tr><td width="40%"><A
href="/default.asp?c=sportsnetwork&page=nfl/teams/078.htm">New
Orleans</A></td><td width="10%" align="center" >0</td><td width="10%"
align="center" >10</td><td width="10%" align="center" >0</td><td width="10%"
align="center" >0</td><td width="20%" align="center" >10</td></tr><tr><td
width="40%"><A
href="/default.asp?c=sportsnetwork&page=nfl/teams/071.htm">Indianapolis</A></td><td
width="10%" align="center" >7</td><td width="10%" align="center" >3</td><td
width="10%" align="center" >14</td><td width="10%" align="center" >17</td><td
width="20%" align="center" >41</td></tr></table>
</html>
> Hello JP,
>
[quoted text clipped - 230 lines]
> Jesse Houwing
> jesse.houwing at sogeti.nl
Jesse Houwing - 19 Sep 2007 23:39 GMT
Hello JP,
> Jesse,
>
[quoted text clipped - 7 lines]
> on looping through the data but it is a bit kludgy. Was hoping to do
> something more robust like your solution.
Given that the last column is the final score (excuse my lack of insight
into sports with more than two halves; I'm a European soccer watcher), it should
be quite easy to alter the expression I gave before:
<tr[^>]*>\s*<td[^>]*>\s*<a[^>]*>(?<team>((?!</a).)*)</a[^>]*>\s*</td[^>]*>\s*(\s*<td[^>]*>(?<score>((?!</td).)*)</td[^>]*>){4}\s*<td[^>]*>(?<finalscore>((?!</td).)*)</td[^>]*></tr[^>]*>
It uses the position of each cell within the row to work out which
value is which.
1st td, team
2nd td .. 5th td, score for each quarter (score x4)
6th column final score
Now if you match against it, make sure you specify the option RegexOptions.Singleline,
so that the . will match a newline. I might have forgotten to mention that
before.
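A quick sketch of the effect of that option, using a cut-down cell pattern and a made-up helper name (MatchesCell) rather than the full expression. Without RegexOptions.Singleline the . stops at a line break, so a cell whose text wraps onto a second line is not matched:

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // Hypothetical helper: true if the input contains one complete <td> cell
    // according to the simplified pattern below.
    public static bool MatchesCell(string html, RegexOptions options)
    {
        return Regex.IsMatch(html, @"<td>((?!</td).)*</td>", options);
    }
}
```

With the input "<td>New\nOrleans</td>", MatchesCell returns false for RegexOptions.None and true for RegexOptions.Singleline.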
The code should work out like this:
// ".." stands for the expression given above
private static Regex scoreRegex = new Regex ("..", RegexOptions.Singleline
| RegexOptions.Compiled | RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase);
public void ExtractData(string htmlInput)
{
Match m = scoreRegex.Match(htmlInput);
while (m != null && m.Success)
{
string team = m.Groups["team"].Value;
string quarter1 = m.Groups["score"].Captures[0].Value;
string quarter2 = m.Groups["score"].Captures[1].Value;
string quarter3 = m.Groups["score"].Captures[2].Value;
string quarter4 = m.Groups["score"].Captures[3].Value;
string final = m.Groups["finalscore"].Value;
// Do your thing with the data before moving on to the next match
m = m.NextMatch();
}
}
If you look at the differences between this regex and the last one, the most
important is that the score part is now repeated exactly 4 times
({4}) instead of one or more times (+).
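Putting the pieces together, the following is a sketch of how the expression could be run over the whole table. The pattern is the one given above, only split across several string literals. The class name, the Summaries method, the output format and the whitespace clean-up of the team name are assumptions added for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ScoreTable
{
    // Expression from the post, split over three literals for readability.
    private static readonly Regex RowRegex = new Regex(
        @"<tr[^>]*>\s*<td[^>]*>\s*<a[^>]*>(?<team>((?!</a).)*)</a[^>]*>\s*</td[^>]*>\s*" +
        @"(\s*<td[^>]*>(?<score>((?!</td).)*)</td[^>]*>){4}\s*" +
        @"<td[^>]*>(?<finalscore>((?!</td).)*)</td[^>]*></tr[^>]*>",
        RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase);

    // Returns one line per team row, e.g. "Team: q1-q2-q3-q4 = final".
    // The header row has no <a> tag in its first cell, so it never matches.
    public static List<string> Summaries(string html)
    {
        var lines = new List<string>();
        foreach (Match m in RowRegex.Matches(html))
        {
            // Team names may wrap across lines in the source; collapse whitespace.
            string team = Regex.Replace(m.Groups["team"].Value, @"\s+", " ").Trim();

            var quarters = new List<string>();
            foreach (Capture c in m.Groups["score"].Captures)
                quarters.Add(c.Value.Trim());

            string final = m.Groups["finalscore"].Value.Trim();
            lines.Add(team + ": " + string.Join("-", quarters.ToArray()) + " = " + final);
        }
        return lines;
    }
}
```

For the sample table in this thread, this should give "New Orleans: 0-10-0-0 = 10" and "Indianapolis: 7-3-14-17 = 41". Note that the expression does not allow whitespace between the last </td> and </tr>, so a row laid out that way would be skipped.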
I hope this works out.
Jesse
> <html>
> <table border="0" width="100%"><tr><td width="40%">Team</td><td
[quoted text clipped - 272 lines]
>> Jesse Houwing
>> jesse.houwing at sogeti.nl
--
Jesse Houwing
jesse.houwing at sogeti.nl
JP - 20 Sep 2007 22:14 GMT
Thanks Jesse.
> Hello JP,
>
[quoted text clipped - 291 lines]
> >>>>>>
> >>>>>>> \s*</tr[^>]>
Arnshea - 18 Sep 2007 18:07 GMT
> Hi Jesse,
>
[quoted text clipped - 163 lines]
Ok, this should (will?) grab everything you're looking for and print
it out in order:
static void Main(string[] args)
{
string html = @"<table border=""0"" width=""100%""><tr><td
width=""40%"">Team</td><td width=""10%"" align=""center"">1</td> <td
width=""10%"" align=""center"">2</td><td width=""10%""
align=""center"">3</td> <td width=""10%"" align=""center"">4</td><td
width=""20%"" align=""center"">Score</td></tr><tr><td width=""40%""><A
href=""/default.asp?c=sportsnetwork&page=nfl/teams/078.htm"">New
Orleans</A></td><td width=""10%"" align=""center"" >0</td><td
width=""10%"" align=""center"" >10</td><td width=""10%""
align=""center"" >0</td><td width=""10%"" align=""center"" >0</td><td
width=""20%"" align=""center"" >10</td></tr><tr><td width=""40%""><A
href=""/default.asp?c=sportsnetwork&page=nfl/teams/
071.htm"">Indianapolis</A>?</td><td width=""10%"" align=""center"" >7</
td><td width=""10%"" align=""center"" >3</td><td width=""10%""
align=""center"" >14</td><td width=""10%"" align=""center"" >17</
td><td width=""20%"" align=""center"" >41</td></tr></table>";
string contents = null;
string tdPat = "<td[^>]*>(.+?)</td>"; // grabs everything between <td>...</td>
string innerTextPat = ">([^>]+)<"; // grabs innermost non-html
Regex tdRegex = new Regex(tdPat);
Regex innerTextRegex = new Regex(innerTextPat);
foreach (Match m in tdRegex.Matches(html))
{
contents = m.Groups[1].Value;
Console.WriteLine(contents); // will include <a href="...">TEAM NAME</a>
// the following will print out the team name w/o the hyperlink
foreach (Match m2 in innerTextRegex.Matches(contents))
Console.WriteLine(m2.Groups[1].Value);
}
}