.NET Forum / Languages / C# / September 2007

Regular Expression Help

JP - 17 Sep 2007 20:50 GMT
I am creating a screen scraping app that will extract data from a website.  
The screen scraping is pretty straightforward using .NET 2.0, but stripping
out all extraneous characters is proving to be more difficult.   I am
basically trying to extract the team, quarter, score for the quarter, and
score for the entire game from this html.   (This html is a subset of the
entire page)

<table border="0" width="100%"><tr><td width="40%">Team</td><td width="10%"
align="center">1</td> <td width="10%" align="center">2</td><td width="10%"
align="center">3</td> <td width="10%" align="center">4</td><td width="20%"
align="center">Score</td></tr><tr><td width="40%"><A
href="/default.asp?c=sportsnetwork&page=nfl/teams/078.htm">New
Orleans</A></td><td width="10%" align="center" >0</td><td width="10%"
align="center" >10</td><td width="10%" align="center" >0</td><td width="10%"
align="center" >0</td><td width="20%" align="center" >10</td></tr><tr><td
width="40%"><A
href="/default.asp?c=sportsnetwork&page=nfl/teams/071.htm">Indianapolis</A></td><td
width="10%" align="center" >7</td><td width="10%" align="center" >3</td><td
width="10%" align="center" >14</td><td width="10%" align="center" >17</td><td
width="20%" align="center" >41</td></tr></table>

In essence I want to be able to put the names and scores into an array so I
can add to a database.   From what I read regular expressions should be able
to do this but I am a complete beginner using regex.   Could someone assist
in getting me started?   Many thanks.
Arnshea - 17 Sep 2007 21:34 GMT
> I am creating a screen scraping app that will extract data from a website.  
> The screen scraping is pretty straightforward using .NET 2.0, but stripping
[quoted text clipped - 21 lines]
> to do this but I am a complete beginner using regex.   Could someone assist
> in getting me started?   Many thanks.

One way to do this is to use the pattern:

>([\w\s]+)<

then ignore whitespace-only matches.  The code below should print out what
you're looking for.  Because \s also matches line breaks, the pattern works
whether you've got your input as one big multi-line string or on a single
line; RegexOptions.Multiline only changes how ^ and $ behave, so you don't
need it here.

        static void Main(string[] args)
        {
            string pat = @">([\w\s]+)<";
            string html = @"<table border=""0"" width=""100%""><tr><td
width=""40%"">Team</td><td width=""10%"" align=""center"">1</td> <td
width=""10%"" align=""center"">2</td><td width=""10%""
align=""center"">3</td> <td width=""10%"" align=""center"">4</td><td
width=""20%"" align=""center"">Score</td></tr><tr><td width=""40%""><A
href=""/default.asp?c=sportsnetwork&page=nfl/teams/078.htm"">New
Orleans</A></td><td width=""10%"" align=""center"" >0</td><td
width=""10%"" align=""center"" >10</td><td width=""10%""
align=""center"" >0</td><td width=""10%"" align=""center"" >0</td><td
width=""20%"" align=""center"" >10</td></tr><tr><td width=""40%""><A
href=""/default.asp?c=sportsnetwork&page=nfl/teams/071.htm"">Indianapolis</A></td><td
width=""10%"" align=""center"" >7</td><td width=""10%"" align=""center"" >3</td><td
width=""10%"" align=""center"" >14</td><td width=""10%"" align=""center"" >17</td><td
width=""20%"" align=""center"" >41</td></tr></table>";

            Regex r = new Regex(pat);

            foreach (Match m in r.Matches(html))
            {
                if ( m.Groups[1].Value.Trim() == "" )
                    // ignore these
                    continue;
                else
                    // do whatever it is you want to do here
                    Console.WriteLine(m.Groups[1].Value);
            }
        }
JP - 17 Sep 2007 22:04 GMT
Thanks!  I think that almost did it.   I ran into a problem testing "St.
Louis" though.  Possibly due to the "."?   When writing the results to the
debug window, St. Louis was omitted.

Arnshea - 17 Sep 2007 22:21 GMT
> Thanks!  I think that almost did it.   I ran into a problem testing "St.
> Louis" though.  Possibly due to the "."?   When writing the results to the
[quoted text clipped - 69 lines]

Yeah, the pattern will only match letters, digits, underscores and
whitespace.  Now that I think about it though, try changing the
pattern to:

>([^<]+)<

that should match any innermost text of the table cells.
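
To see the difference between the two patterns side by side, here is a minimal sketch. The class name CellTextDemo and the shortened sample row are made up for illustration; only the two patterns come from the posts above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CellTextDemo
{
    // Returns every non-blank piece of text found between a '>' and the next '<'.
    public static List<string> Extract(string html, string pattern)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(html, pattern))
        {
            string text = m.Groups[1].Value.Trim();
            if (text.Length > 0)
                results.Add(text);
        }
        return results;
    }

    public static void Main()
    {
        // Hypothetical, shortened row containing a team name with a period in it.
        string html = "<tr><td><A href=\"x.htm\">St. Louis</A></td><td>7</td></tr>";

        // [\w\s]+ stops at the '.', so the team name is skipped.
        Console.WriteLine(string.Join(", ", Extract(html, @">([\w\s]+)<")));   // prints: 7

        // [^<]+ accepts anything up to the next tag.
        Console.WriteLine(string.Join(", ", Extract(html, @">([^<]+)<")));     // prints: St. Louis, 7
    }
}
```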
Jesse Houwing - 17 Sep 2007 22:25 GMT
Hello JP,

> Thanks!  I think that almost did it.   I ran into a problem testing
> "St. Louis" though.  Possibly due to the "."?   When writing the
> results to the debug window, St. Louis was omitted.

This is caused by the [\w\s]+. If you replace it with [^>]+ it should work.

Though keep in mind that this isn't very strict and will easily break if
the page layout changes (as in: it may return results that aren't what
you expected, instead of returning no result at all).

Jesse

--
Jesse Houwing
jesse.houwing at sogeti.nl
Jesse Houwing - 17 Sep 2007 22:22 GMT
Hello JP,

> I am creating a screen scraping app that will extract data from a
> website.  The screen scraping is pretty straightforward using .NET
[quoted text clipped - 23 lines]
> should be able to do this but I am a complete beginner using regex.
> Could someone assist in getting me started?   Many thanks.

I posted a regex a while back that did almost this.

<tr[^>]*><td[^>]*>(?<team>((?!</td).)*)</td[^>]*>\s*<td[^>]*>(?<quarter>((?!</td).)*)</td[^>]*>\s*<td[^>]*>(?<score>((?!</td).)*)</td[^>]*>\s*</tr[^>]*>

this will extract all the rows with the info you need.

It will store the respective values in a named group, so they're easily extracted:

foreach (Match m in regex.Matches(input))
{
   string team = m.Groups["team"].Value;
   string quarter = m.Groups["quarter"].Value;
   string score = m.Groups["score"].Value;
}

You can even chain this expression so you can get all the results in one
pass:

(<tr[^>]*><td[^>]*>(?<team>((?!</td).)*)</td[^>]*>\s*<td[^>]*>(?<quarter>((?!</td).)*)</td[^>]*>\s*<td[^>]*>(?<score>((?!</td).)*)</td[^>]*>\s*</tr[^>]*>\s*)+

Match m = regex.Match(input);
for (int i = 0; i < m.Groups["team"].Captures.Count; i++)
{
   string team = m.Groups["team"].Captures[i].Value;
   string quarter = m.Groups["quarter"].Captures[i].Value;
   string score = m.Groups["score"].Captures[i].Value;
}

As an alternative you might want to have a look at the HTML Agility pack.
It allows you to do XPath queries over HTML. A very powerful way to extract
data from HTML files.

http://www.codeplex.com/htmlagilitypack

--
Jesse Houwing
jesse.houwing at sogeti.nl
JP - 18 Sep 2007 02:42 GMT
Jesse,

That's exactly what I am trying to do with the data within the HTML.   The
expressions and code you listed don't apply to the HTML I posted correct?  

I am not sure how I could group the teams and quarters if they are not
labeled.  I probably don't understand how regex works...

Jesse Houwing - 18 Sep 2007 12:04 GMT
Hello JP,

> Jesse,
>
> That's exactly what I am trying to do with the data within the HTML.
> The expressions and code you listed don't apply to the HTML I posted
> correct?

They do. By writing <td[^>]*> you say: find a <td tag, then match everything
up to the closing >, and then match the closing > itself. It ignores all the
border, width, height and other stuff in there.

For now I ignored the <a href> around the team name, but other than that,
the expression should work.

> I am not sure how I could group the teams and quarters if they are not
> labeled.  I probably don't understand how regex works...

Well if they're always in the xth cell as in your example, you can use their
position (as I've done in the expression). The (?<name>...) construct in
the expression then labels them.
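
As a rough illustration of labelling cells by position, here is a small sketch. The class name NamedGroupDemo, the group names first and second, and the two-cell sample row are hypothetical; the real table has six cells per row.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // Each (?<name>...) labels a cell purely by where it sits in the row.
    // <td[^>]*> skips over whatever attributes the tag carries.
    private static readonly Regex RowRegex = new Regex(
        @"<tr[^>]*>\s*<td[^>]*>(?<first>[^<]*)</td>\s*<td[^>]*>(?<second>[^<]*)</td>\s*</tr>",
        RegexOptions.IgnoreCase);

    // Returns the two cell values, or null when the row does not match.
    public static string[] Parse(string row)
    {
        Match m = RowRegex.Match(row);
        if (!m.Success)
            return null;
        return new[] { m.Groups["first"].Value, m.Groups["second"].Value };
    }

    public static void Main()
    {
        string row = "<tr><td width=\"40%\">Indianapolis</td><td align=\"center\" >41</td></tr>";
        string[] cells = Parse(row);
        Console.WriteLine(cells[0] + " scored " + cells[1]);   // prints: Indianapolis scored 41
    }
}
```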

Jesse

--
Jesse Houwing
jesse.houwing at sogeti.nl
JP - 18 Sep 2007 14:54 GMT
Hi Jesse,

Thanks again for the assistance.

I have taken what you have posted and put it into a console app but it
doesn't seem to pick up anything using the expression.   If I uncomment the
shortpat and the first foreach, I get data.   Am I missing something?

       static void Main(string[] args)
       {
           //string shortpat = @">([\w\s]+)<";
           string pat =
@"<tr[^>]*><td[^>]*>(?<team>((?!</td).)*)</td[^>]*>\s*<td[^>]*>(?<quarter>((?!</td).)*)</td[^>]*>\s*<td[^>]*>(?<score>((?!</td).)*)</td[^>]*>\s*</tr[^>]>";

           string html = @"<table border=""0"" width=""100%""><tr><td
           width=""40%"">Team</td><td width=""10%"" align=""center"">1</td>
<td
           width=""10%"" align=""center"">2</td><td width=""10%""
           align=""center"">3</td> <td width=""10%""
align=""center"">4</td><td
           width=""20%"" align=""center"">Score</td></tr><tr><td
width=""40%""><A
           href=""/default.asp?c=sportsnetwork&page=nfl/teams/078.htm"">New
           Orleans</A></td><td width=""10%"" align=""center"" >0</td><td
           width=""10%"" align=""center"" >10</td><td width=""10%""
           align=""center"" >0</td><td width=""10%"" align=""center""
>0</td><td
           width=""20%"" align=""center"" >10</td></tr><tr><td
width=""40%""><A
           href=""/default.asp?c=sportsnetwork&page=nfl/teams/071.htm"">Indianapolis</A></td><td
           width=""10%"" align=""center"" >7</td><td width=""10%"" align=""center"" >3</td><td
           width=""10%"" align=""center"" >14</td><td width=""10%"" align=""center"" >17</td><td
           width=""20%"" align=""center"" >41</td></tr></table>";

           Regex r = new Regex(pat);

           //foreach (Match m in r.Matches(html))
           //{
           //    if (m.Groups[1].Value.Trim() == "")
           //        // ignore these
           //        continue;
           //    else
           //        Console.WriteLine(m.Groups[1].Value);
           //        Console.ReadLine();
           //}

           foreach (Match m in r.Matches(html))
           {
               string team = m.Groups["team"].Value;
               string quarter = m.Groups["quarter"].Value;
               string score = m.Groups["score"].Value;

               Console.WriteLine(team);
               Console.WriteLine(quarter);
               Console.WriteLine(score);
           }

           Console.ReadLine();

       }

Arnshea - 18 Sep 2007 15:04 GMT
> Hi Jesse,
>
[quoted text clipped - 163 lines]

To capture st. louis try replacing shortpat with:
string shortpat = @">([^<]+)<";
Jesse Houwing - 18 Sep 2007 15:31 GMT
Hello JP,

I updated the expression and tested it. I'm not sure what to capture where.
Maybe if you can describe the table row to me I could do a better expression.

Right now it captures as follows: team, quarter and the score (4x).

<tr[^>]*>\s*<td[^>]*>\s*<a[^>]*>(?<team>((?!</a).)*)</a[^>]*>\s*</td[^>]*>\s*<td[^>]*>(?<quarter>((?!</td).)*)</td[^>]*>(\s*<td[^>]*>(?<score>((?!</td).)*)</td[^>]*>)+\s*</tr[^>]*>

it results in a match that is built up as follows:

Match
+- Groups
   +- Team.Value "New Orleans"
   +- Quarter.Value "0"
   +- Score.Captures[0].Value = 10
   +- Score.Captures[1].Value = 0
   +- Score.Captures[2].Value = 0
   +- Score.Captures[3].Value = 10

My question now is... did I interpret it right?

Is the row built up of a team name, quarter number and 4 scores?

To explain the expression:

<tr[^>]*> Find the start of a row
\s* allow whitespace between the <tr> tag and the first <td>
<td[^>]*> Find the first table cell.
\s* allow whitespace between <td> and <a>
<a[^>]*> Find the opening a href tag
(?<team>((?!</a).)*) capture everything you find from there to the </a tag
in a group named team.
</a[^>]*> Find the closing a tag
\s* allow whitespace between </a> and </td>
</td[^>]*> Find the closing td tag
\s* allow whitespace between </td> and <td>
<td[^>]*> Find opening <td>
(?<quarter>((?!</td).)*) capture everything from there to the </td tag into
a group named quarter
</td[^>]*> Find the closing td tag
( start repeating group
\s* allow whitespace between </td> and <td>
<td[^>]*> Find opening <td>
(?<score>((?!</td).)*) capture everything from there to the </td tag into
a group named score
</td[^>]*> Find the closing td tag
)+ end repeating group. If the named group in this repeating group captured
multiple times, the values can be found in the Group.Captures collection.
\s* allow whitespace between the last </td> and </tr>
</tr[^>]*> Finally make sure we've captured a complete row by finding its
end tag.
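
The point about Group.Captures can be tried in isolation with a much smaller pattern. This is only a sketch: the class name CapturesDemo, the simplified pattern and the bare sample row are assumptions, not the expression from this post.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CapturesDemo
{
    // A named group inside a repeated group keeps every value it captured
    // in Group.Captures; Group.Value only holds the last one.
    private const string Pattern = @"<tr>(\s*<td>(?<score>\d+)</td>)+\s*</tr>";

    public static List<string> Scores(string row)
    {
        Match m = Regex.Match(row, Pattern);
        var scores = new List<string>();
        foreach (Capture c in m.Groups["score"].Captures)
            scores.Add(c.Value);
        return scores;
    }

    public static void Main()
    {
        string row = "<tr><td>7</td><td>3</td><td>14</td><td>17</td></tr>";
        Console.WriteLine(string.Join(", ", Scores(row)));                       // prints: 7, 3, 14, 17
        Console.WriteLine(Regex.Match(row, Pattern).Groups["score"].Value);      // prints: 17
    }
}
```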

Jesse

--
Jesse Houwing
jesse.houwing at sogeti.nl
JP - 18 Sep 2007 22:28 GMT
Jesse,

Cut and paste this text into notepad and then save as an html file.  You
will see how the data relates.   The teams run vertically along with the
quarters and final score.  I guess that is what I am having a hard time
understanding.   How you can group the different data points with the way
this table is structured.

I did write a little routine to pull out different data points based on
looping through the data but it is a bit kludgy.  Was hoping to do something
more robust like your solution.

<html>
<table border="0" width="100%"><tr><td width="40%">Team</td><td width="10%"
align="center">1</td> <td width="10%" align="center">2</td><td width="10%"
align="center">3</td> <td width="10%" align="center">4</td><td width="20%"
align="center">Score</td></tr><tr><td width="40%"><A
href="/default.asp?c=sportsnetwork&page=nfl/teams/078.htm">New
Orleans</A></td><td width="10%" align="center" >0</td><td width="10%"
align="center" >10</td><td width="10%" align="center" >0</td><td width="10%"
align="center" >0</td><td width="20%" align="center" >10</td></tr><tr><td
width="40%"><A
href="/default.asp?c=sportsnetwork&page=nfl/teams/071.htm">Indianapolis</A></td><td
width="10%" align="center" >7</td><td width="10%" align="center" >3</td><td
width="10%" align="center" >14</td><td width="10%" align="center" >17</td><td
width="20%" align="center" >41</td></tr></table>
</html>

Jesse Houwing - 19 Sep 2007 23:39 GMT
Hello JP,

> Jesse,
>
[quoted text clipped - 7 lines]
> on looping through the data but it is a bit kludgy.  Was hoping to do
> something more robust like your solution.

Given that the last column is the final score (excuse my lack of insight
into sports with more than 2 halves, I'm a European soccer watcher), it should
be quite easy to alter the expression I gave before:

<tr[^>]*>\s*<td[^>]*>\s*<a[^>]*>(?<team>((?!</a).)*)</a[^>]*>\s*</td[^>]*>\s*(\s*<td[^>]*>(?<score>((?!</td).)*)</td[^>]*>){4}\s*<td[^>]*>(?<finalscore>((?!</td).)*)</td[^>]*></tr[^>]*>

It uses the position in the table relative to the row as the way to find
out which is which.

1st td, team
2nd td .. 5th td, score for each quarter (score x4)
6th column final score

Now if you match against it, make sure you specify the option RegexOptions.Singleline,
so that the . will match a newline. I might have forgotten to mention that
before.
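
A quick way to check what each option does to the dot is the sketch below; the class name SinglelineDemo and the one-cell sample string are made up for the demonstration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // True when the whole cell, including any line break inside it, can be matched.
    public static bool CellMatches(string input, RegexOptions options)
    {
        return Regex.IsMatch(input, @"<td>.*</td>", options);
    }

    public static void Main()
    {
        string cell = "<td>New\nOrleans</td>";

        // By default . does not match \n, so the cell is not found.
        Console.WriteLine(CellMatches(cell, RegexOptions.None));        // prints: False

        // Singleline makes . match every character, including \n.
        Console.WriteLine(CellMatches(cell, RegexOptions.Singleline));  // prints: True

        // Multiline only changes ^ and $, so it does not help here.
        Console.WriteLine(CellMatches(cell, RegexOptions.Multiline));   // prints: False
    }
}
```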

The code should work out like this:

private static Regex scoreRegex = new Regex ("..", RegexOptions.Singleline
| RegexOptions.Compiled | RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase);

public void ExtractData(string htmlInput)
{
  Match m = scoreRegex.Match(htmlInput);
 
  while (m != null && m.Success)
  {
      string team = m.Groups["team"].Value;
      string quarter1 = m.Groups["score"].Captures[0].Value;
      string quarter2 = m.Groups["score"].Captures[1].Value;
      string quarter3 = m.Groups["score"].Captures[2].Value;
      string quarter4 = m.Groups["score"].Captures[3].Value;
      string final = m.Groups["finalscore"].Value;

      //  Do your thing with the data before moving on to the next match

      m = m.NextMatch();
  }
}

If you compare this regex with the last one, the most important difference
is that the score part is repeated exactly 4 times: {4}.
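
For anyone who wants to try it end to end, here is a self-contained sketch that plugs the expression above into the ".." placeholder and runs it against the rows from the original post. The class name ScoreExtractor, the string[] result shape and the whitespace clean-up of the team name are assumptions added for the example, and the header row is abbreviated.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ScoreExtractor
{
    // The row expression from this post, split over three lines for readability.
    private static readonly Regex ScoreRegex = new Regex(
        @"<tr[^>]*>\s*<td[^>]*>\s*<a[^>]*>(?<team>((?!</a).)*)</a[^>]*>\s*</td[^>]*>\s*" +
        @"(\s*<td[^>]*>(?<score>((?!</td).)*)</td[^>]*>){4}\s*" +
        @"<td[^>]*>(?<finalscore>((?!</td).)*)</td[^>]*></tr[^>]*>",
        RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase);

    // Returns one array per team row: name, four quarter scores, final score.
    public static List<string[]> Extract(string html)
    {
        var rows = new List<string[]>();
        for (Match m = ScoreRegex.Match(html); m.Success; m = m.NextMatch())
        {
            CaptureCollection q = m.Groups["score"].Captures;
            rows.Add(new[]
            {
                // Added for the example: collapse line breaks inside the team name.
                Regex.Replace(m.Groups["team"].Value, @"\s+", " ").Trim(),
                q[0].Value, q[1].Value, q[2].Value, q[3].Value,
                m.Groups["finalscore"].Value
            });
        }
        return rows;
    }

    public static void Main()
    {
        // The header row has no <a> tag, so the expression skips it.
        string html =
            "<table border=\"0\" width=\"100%\"><tr><td>Team</td><td>1</td> <td>2</td>" +
            "<td>3</td> <td>4</td><td>Score</td></tr>" +
            "<tr><td width=\"40%\"><A\nhref=\"/default.asp?c=sportsnetwork&page=nfl/teams/078.htm\">New\nOrleans</A></td>" +
            "<td width=\"10%\" align=\"center\" >0</td><td width=\"10%\" align=\"center\" >10</td>" +
            "<td width=\"10%\" align=\"center\" >0</td><td width=\"10%\" align=\"center\" >0</td>" +
            "<td width=\"20%\" align=\"center\" >10</td></tr>" +
            "<tr><td width=\"40%\"><A\nhref=\"/default.asp?c=sportsnetwork&page=nfl/teams/071.htm\">Indianapolis</A></td>" +
            "<td width=\"10%\" align=\"center\" >7</td><td width=\"10%\" align=\"center\" >3</td>" +
            "<td width=\"10%\" align=\"center\" >14</td><td width=\"10%\" align=\"center\" >17</td>" +
            "<td width=\"20%\" align=\"center\" >41</td></tr></table>";

        foreach (string[] row in Extract(html))
            Console.WriteLine(string.Join(" | ", row));
        // prints: New Orleans | 0 | 10 | 0 | 0 | 10
        // prints: Indianapolis | 7 | 3 | 14 | 17 | 41
    }
}
```

RegexOptions.Compiled is left out here only to keep the sketch short; it affects speed, not which rows match.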

I hope this works out.

Jesse

--
Jesse Houwing
jesse.houwing at sogeti.nl
JP - 20 Sep 2007 22:14 GMT
Thanks Jesse.

Arnshea - 18 Sep 2007 18:07 GMT
> Hi Jesse,
>
[quoted text clipped - 163 lines]

Ok, this should (will?) grab everything you're looking for and print
it out in order:

 static void Main(string[] args)
 {
   string html = @"<table border=""0"" width=""100%""><tr><td
width=""40%"">Team</td><td width=""10%"" align=""center"">1</td> <td
width=""10%"" align=""center"">2</td><td width=""10%""
align=""center"">3</td> <td width=""10%"" align=""center"">4</td><td
width=""20%"" align=""center"">Score</td></tr><tr><td width=""40%""><A
href=""/default.asp?c=sportsnetwork&page=nfl/teams/078.htm"">New
Orleans</A></td><td width=""10%"" align=""center"" >0</td><td
width=""10%"" align=""center"" >10</td><td width=""10%""
align=""center"" >0</td><td width=""10%"" align=""center"" >0</td><td
width=""20%"" align=""center"" >10</td></tr><tr><td width=""40%""><A
href=""/default.asp?c=sportsnetwork&page=nfl/teams/071.htm"">Indianapolis</A></td><td
width=""10%"" align=""center"" >7</td><td width=""10%"" align=""center"" >3</td><td
width=""10%"" align=""center"" >14</td><td width=""10%"" align=""center"" >17</td><td
width=""20%"" align=""center"" >41</td></tr></table>";
   string contents = null;
   // grabs everything between <td>...</td>
   string tdPat = "<td[^>]*>(.+?)</td>";
   // grabs innermost non-html
   string innerTextPat = ">([^>]+)<";

   // Singleline lets . match the line breaks inside a cell
   Regex tdRegex = new Regex(tdPat, RegexOptions.Singleline);
   Regex innerTextRegex = new Regex(innerTextPat);

   foreach (Match m in tdRegex.Matches(html))
   {
     contents = m.Groups[1].Value;

     // will include <a href="...">TEAM NAME</a>
     Console.WriteLine(contents);

     // the following will print out the team name w/o the hyperlink
     foreach (Match m2 in innerTextRegex.Matches(contents))
       Console.WriteLine(m2.Groups[1].Value);
   }
 }
alex_f_il@hotmail.com - 30 Sep 2007 18:20 GMT
> I am creating a screen scraping app that will extract data from a website.  
> The screen scraping is pretty straightforward using .NET 2.0, but stripping
[quoted text clipped - 21 lines]
> to do this but I am a complete beginner using regex.   Could someone assist
> in getting me started?   Many thanks.

You can also try SWExplorerAutomation (SWEA), http://webius.net. SWEA's
Visual Data Extractors (XPathDataExtractor and TableDataExtractor)
save time when developing web scraping solutions.
