Regex search: advanced search range settings?

BeanDog - 21 Jul 2007 22:20 GMT
I'm using .NET 2.0.

I need (for performance reasons) to restrict Regex searches to a certain
portion of a large string.  The Regex.Match functions allow me to input the
beginning and ending position of the search.  However, what I need is to find
whether there is a Regex match that begins no later than a certain character
position.

For a trivial example, consider the string:
abcde

My regular expression is "cd", and my search range is characters 0-2.  The
Regex.Match functions will fail on this search ("cd" is not in "abc"), but I
need it to find any matches that *begin* within the range, and "cd" does
begin on or before character 2.

I can't simply lengthen the allowed range (in this case searching 0-3
instead of 0-2), since my actual regular expressions match strings of
arbitrary length.

Any suggestions?
Peter Duniho - 21 Jul 2007 22:44 GMT
> [...]
> I can't simply lengthen the allowed range (in this case searching 0-3
> instead of 0-2), since my actual regular expressions match strings of
> arbitrary length.

I don't understand this statement.  Not only can you lengthen the allowed  
range, you must.  I don't see any way for Regex to find characters that  
you hide from it.

Would this work?

    string strExpression;
    string strSearch;
    int ichStart, cchLength;

    // initialize above variables

    Regex regex = new Regex(strExpression);

    // Call Match on the instance: the static Regex.Match has no
    // (string, int, int) overload.
    return regex.Match(strSearch, ichStart,
        Math.Min(strSearch.Length - ichStart,
        cchLength + strExpression.Length - 1));

Essentially, extend the search length by one less than the number of
characters in your expression, then constrain it so that the length passed
to Match() doesn't exceed the remainder of the string to be searched.

In your example, the variables are:

    strExpression: "cd"
    strSearch: "abcde"
    ichStart: 0
    cchLength: 3

This results in a call to Regex.Match("abcde", 0, 4), which will find the  
string you're looking for.
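
A minimal sketch of the behaviour being discussed, assuming the literal pattern "cd" from the original post. The class and method names (RangeDemo, MatchesWithin) are made up for illustration. It shows that the three-argument instance overload only reports matches that lie entirely inside the given range, which is why the range has to be widened. Note that widening by the expression's length only makes sense when the pattern is a literal, so that every match is exactly that long.

```csharp
using System.Text.RegularExpressions;

public static class RangeDemo
{
    // True if "cd" matches entirely inside input[start .. start + length).
    public static bool MatchesWithin(string input, int start, int length)
    {
        Regex regex = new Regex("cd");

        // Instance overload Match(string, int, int): the engine is confined
        // to the given substring, so a match cannot extend past its end.
        return regex.Match(input, start, length).Success;
    }
}
```

With this sketch, MatchesWithin("abcde", 0, 3) is false because only "abc" is searched, while MatchesWithin("abcde", 0, 4) is true because "abcd" contains "cd".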

Pete
BeanDog - 22 Jul 2007 14:40 GMT
OK, my example was too trivial to illustrate my point.  Consider the
following string:

<html><head></head><script type="text/javascript"></script>(2MB of HTML text
here)</html>

My regular expression might be something like this:
<\s*script\s*(type=['"]text/javascript['"])?\s*>

For performance reasons (this and hundreds of similar regexes need to be run
in a few milliseconds), I can't search all 2MB of text for this regular
expression.  Based on other information available to me from my algorithm, I
am completely uninterested in script tags that *begin* after character 20.  
But I can't just restrict my search to characters 0-20, since the Regex class
only matches strings that lie completely within the given range.

However, because of my strict performance requirements, I can't lengthen the
Regex's search domain to the entire 2MB string.  Since my regular expression
could match a string that is 10 characters long or 1000 characters long or
100,000 characters long, it's impossible for me to determine the amount to
lengthen the Regex's search range.

This was not an issue when I was using boost::regex, as that library allows
you to search for matches that extend past the end of a given range.  I've
ported most of my code to C#, and I had to remove this very important
optimization due to what I see as a limitation in .NET's Regex class.

So my question is, how can I instruct the Regex class to search within a
given range, but allow the match to extend beyond the end of the given range
if necessary?

Or do I get to write .NET bindings for boost::regex? :-p
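
One possible way to approximate this with the stock Regex class, sketched under some assumptions: wrap the pattern in \G(?:...) so that it can only match at the position where a search starts, then try each allowed start position with the Match(string, int) overload, which fixes where the search begins but not where the match may end. BoundedStartSearch, Anchor and FindStartingBy are hypothetical names, and the sketch assumes a left-to-right pattern that does not end in an unterminated comment.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoundedStartSearch
{
    // Wraps a pattern so that it can only match at the position handed to
    // Match(input, startat). \G is the "start of this search" anchor.
    public static Regex Anchor(string pattern)
    {
        return new Regex(@"\G(?:" + pattern + ")");
    }

    // Returns the first match that begins at an index in [0, maxStart].
    // The match itself may end anywhere in the input.
    public static Match FindStartingBy(Regex anchored, string input, int maxStart)
    {
        int last = Math.Min(maxStart, input.Length);
        for (int start = 0; start <= last; start++)
        {
            // Match(string, int) sets only where the search begins; the
            // engine can still read to the end of the input.
            Match m = anchored.Match(input, start);
            if (m.Success)
            {
                return m;
            }
        }
        return Match.Empty;
    }
}
```

For the script-tag example, build the regex once with Anchor(@"<\s*script\s*(type=['""]text/javascript['""])?\s*>") and call FindStartingBy(regex, html, 20). The number of attempts is bounded by the range, not by the size of the document. Calling Match(html) on the unanchored pattern and checking m.Index afterwards would give the same answer, but may scan the whole string when no early match exists.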

Kevin Spencer - 23 Jul 2007 13:02 GMT
What you need to do is identify the characters that will form a sequence
that uniquely identifies the beginning and end of the pattern you're looking
for. Then you can use String.IndexOf to find whether the "script" begins
before character 20. If it does, use String.IndexOf to find the point where
the sequence ends, and use your Regular Expression on the substring
identified.
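
A rough sketch of that idea, assuming the delimiting characters are '<' and '>' and that no '>' appears inside an attribute value. PrefilterSearch and FindTag are illustrative names only. String.IndexOf locates candidate tags that open within the allowed range, and the regular expression is then run only on each candidate substring.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PrefilterSearch
{
    // Tries the regex on each "<...>" span whose '<' lies in [0, maxStart].
    public static Match FindTag(Regex regex, string input, int maxStart)
    {
        int limit = Math.Min(maxStart, input.Length - 1);
        int open = 0;
        while (open <= limit)
        {
            // Search for the opening delimiter only inside the allowed range.
            open = input.IndexOf('<', open, limit - open + 1);
            if (open < 0)
            {
                break;
            }

            // The closing delimiter may lie anywhere after it.
            int close = input.IndexOf('>', open);
            if (close < 0)
            {
                break;
            }

            // Run the regex on just this candidate span.
            Match m = regex.Match(input, open, close - open + 1);
            if (m.Success)
            {
                return m;
            }
            open++;
        }
        return Match.Empty;
    }
}
```

The regular expression never sees more than one tag at a time, so its cost does not depend on the size of the document. The price is that the delimiters must be chosen correctly for each pattern.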

Another option, depending upon your actual requirements, would be to use
unsafe C# code to create a pointer to the beginning of the string and rather
than using a managed Regular Expression, simply iterate the characters in
the string. Unsafe pointers in managed code are still a lot faster than
anything you can do with managed code alone.

HTH,

Kevin Spencer
Microsoft MVP


