.NET Forum / .NET Framework / New Users / June 2006
Efficient regular expression pattern ?
|
|
Thread rating:  |
Steve B. - 14 Jun 2006 13:24 GMT Hi,
I'm building an application that analyse a flow of url in order to detect some pages. I've a very huge list of regular expressions (up to several thousands) that I have to check on all urls.
Each regular expression will be evaluated agains all urls.
How can I write the code in order to be the most efficient ?
My fisrt idea is to create a List<Regex> that I populate once, and each time an url is incoming, I do a foreach(Regex r in Regexs).
But I'm afraid this process is quite slow since urls can input up to 10/seconds.
I'd apreciate any comments or suggestions.
Thanks, Steve
Barry Kelly - 14 Jun 2006 14:18 GMT > I'm building an application that analyse a flow of url in order to detect > some pages. > I've a very huge list of regular expressions (up to several thousands) that > I have to check on all urls. Can you show us some of your regular expressions? Are many of them simple text matches, or do they all involve wildcards? Are they using () which end up capturing when they don't need to (consider the ExplicitCapture option)? Do you use the Compiled option?
It can be quite easy to write regular expressions with exponential matching behaviour. For example, an RE which is supposed to match the whole string and looks like ".*/foo/" will be much slower than "^.*/foo/". Also, the current .NET regular expression matcher doesn't perform some optimizations that many other matchers do.
> Each regular expression will be evaluated agains all urls. > > How can I write the code in order to be the most efficient ? Profile it to find out what bits are slow. You might consider trying each regular expression on a test set of URL data, and dumping out which ones are matching slowest. That'll tell you which ones you need to optimize.
> But I'm afraid this process is quite slow since urls can input up to > 10/seconds. How slow is it currently?
I imagine you should be able to check many tens of thousands of regular expressions per second.
-- Barry
 Signature http://barrkel.blogspot.com/
Steve B. - 14 Jun 2006 14:31 GMT The goal is to detect some pages loaded on specific "malicious" pages. According that, most of the regexp will looks like this :
^http://www.spywarerulez.com/
This is supposed to detect any ressource for this host.
However most of this sh.tting companies use several subdomain names, and regexp also look like this :
^http://[A-Za-z0-9_\.].spywarerulez.com/
I'm quite new in using Regex, so I'm not aware of the behaviour of all options.
For the moment I've not tested the actual speed of this process, since I do not have right now all urls. The app is self learning and the url will grow when it will go into production and when the administrators will add new addresses.
Thanks, Steve
>> I'm building an application that analyse a flow of url in order to detect >> some pages. [quoted text clipped - 31 lines] > > -- Barry Barry Kelly - 14 Jun 2006 15:36 GMT > For the moment I've not tested the actual speed of this process, since I do > not have right now all urls. You should try some tests with urls and regexs you make up yourself, to find out (in ballpark figures) what you can expect.
> The app is self learning and the url will grow > when it will go into production and when the administrators will add new > addresses. You should probably include some self-monitoring too then, and periodically log a dump of the "top ten" slowest REs, if the scanning process does in fact turn out to be a bottleneck.
-- Barry
 Signature http://barrkel.blogspot.com/
Steve B. - 14 Jun 2006 15:41 GMT Thanks for your advises.
I'll take a deeper look and try to measure the executing time for each match test.
Steve
>> For the moment I've not tested the actual speed of this process, since I >> do [quoted text clipped - 12 lines] > > -- Barry Kevin Spencer - 14 Jun 2006 15:39 GMT Hi Steve,
Forgive me if I repeat anything you already know.
A regular expression is an expression of a set of rules that define a "pattern" (or a pattern that defines a set of rules!).
It looks to me like you're trying to create SPAM detection rules, a most admirable quest, and one which I am chipping away at during my own spare time.
SPAM detection rules act as a filter, and if I'm not mistaken, you want to filter out suspected SPAM emails.
For the greatest efficiency, you want to use the filter that removes the most from the list in order to eliminate those from the remaining items that need to be filtered. The fewer the items to filter, the faster the process.
Some filters will remove more than others. For example, let's say that you're filtering on domain names. In your list, you identify the following:
^http://www.spywarerulez.com ^http://a.spywarerulez.com ^http://b.spywarerulez.com ^http://spywarerulez.com ^http://spywarerulez.net ^http://spywarerulez.org
Now, you could filter on all of them individually using several filters, but if, for example, you know that any domain that ends with "spywarerulez." plus a root domain, you could filter them all out with:
\b(?:http://)*\w*\.?spywarerulez.\w+\b
(Note that this assumes that the URL begins and ends at a word boundary. You would probably want to change that part)
So, what I would do is to identify which regular expressions filter the most items out first. Apply these filters first, and then create a succession of more and more specific filters.
One good technique to prevent multiple filters is to use an OR expression, as in:
\b(?:http://)*\w*\.?(spywarerulez|myspyware|spammersdelight|getrichquick|suckeraminute).\w+\b
This would match any domain with any of the domain names that are ORed together.
 Signature HTH,
Kevin Spencer Microsoft MVP Professional Chicken Salad Alchemist
A lifetime is made up of Lots of short moments.
> The goal is to detect some pages loaded on specific "malicious" pages. > According that, most of the regexp will looks like this : [quoted text clipped - 55 lines] >> >> -- Barry Steve B. - 14 Jun 2006 15:51 GMT The suspects url will be stored and be used by multiples systems. The first "client app" will be a proxy system that will kick this http request and replace by a warning message (so even if the user try to display a picutre in outlook, he won't actually send a request to the server and if a "link" is clicked from an email, or IM, or anywhere else, the page won't load if is forbidden).
Email filter is not planned now, but it should be a great idea to capitalize malicious url.
Thanks,
Since
> Hi Steve, > [quoted text clipped - 106 lines] >>> >>> -- Barry Nick Malik [Microsoft] - 15 Jun 2006 09:44 GMT You probably don't need regular expressions to do what you are doing.
Regular expressions are a generic system of pattern matching that uses a fairly efficient matching algorithm to execute a compiled pattern. However, for the fact that it is generic and the fact that patterns are compiled, you pay a price. It is better to use regular expressions when you have a small number of patterns and a single long target string or many target strings. It may not be a good idea to use regular expressions when you have thousands of patterns to match against thousands of target strings. Use regular expressions... just not in this way.
URLS are very mathematical. As such, you can use a single regular expression to extract the domain name from each of the incoming urls. Then you can use simple parsing and matching to determine if your domain name matches. If it does, you can look for subdomains.
So, first create a hit list containing the domain names you are looking for, followed by a list of subdomain rules you want to apply.
step 1: take http://s1234563.fubar.com/cx/url=http://www.ebay.com and apply a regex to get "s1234563" and "fubar.com" step 2: match directly against "fubar.com" if no hit in your hit list, you are done. step 3: if you hit against a domain, then look to the list of subdomains to see if one is specifically targeted. If not, all are 'bad'.
Hope this helps,
 Signature --- Nick Malik [Microsoft] MCSD, CFPS, Certified Scrummaster http://blogs.msdn.com/nickmalik
Disclaimer: Opinions expressed in this forum are my own, and not representative of my employer. I do not answer questions on behalf of my employer. I'm just a programmer helping programmers. --
> Hi, > [quoted text clipped - 17 lines] > Thanks, > Steve Steve B. - 15 Jun 2006 10:11 GMT Your solution seems to be quite effective, but there a scenario where this cannot apply :
The solution will probably works well for domain filtering, but if I want to exlude a specific page or sub pages, it won't works.
Let's imagine for example a web site hosting site : http://www.anygoodprovider.com/asiteofmalicioususer/spywareeldoradodownloads.htm
In this case, the domain is ok, but not the page.
I'm thinkg about a "IUrlFilter" :
interface IUrlFilter { bool TestUrl(string urlToTest); }
This interface will be implemented twice :
class HostFilter : IUrlFilter
class PageFilter : IUrlFilter
I firstly test agains hostname (which will probably be 95% of the tests), and if host is ok, I also test against page (with a very lower number of test to execute) using the regex method.
I suppose this approach would also help for extensibility of the system...
Do you think I'm on the right way ?
Thanks, Steve
> You probably don't need regular expressions to do what you are doing. > [quoted text clipped - 45 lines] >> Thanks, >> Steve Nick Malik [Microsoft] - 21 Jun 2006 15:25 GMT > Your solution seems to be quite effective, but there a scenario where this > cannot apply : [quoted text clipped - 6 lines] > > In this case, the domain is ok, but not the page. yes, but you get a hit on the domain, so you proceed to the page. If you make the domain matching efficient, you make 95% of your matching efficient.
> I'm thinkg about a "IUrlFilter" : > [quoted text clipped - 7 lines] > > class PageFilter : IUrlFilter I disagree. I wouldn't use your mechanism for either method.
> I firstly test agains hostname (which will probably be 95% of the tests), > and if host is ok, I also test against page (with a very lower number of > test to execute) using the regex method. Regex can be used not only to match, but also to parse. Use it to parse, but then match using a data structure.
Use regex to return a list of terms seperated by slashes. Create a tree structure of all of the offending pages by URL. I'm willing to bet that 80% of the hits against an offending page will be consumed by less than 5% of the root elements and less than 30% of the first level elements. In other words, virus sites will have many virus pages, and hosting providers that don't care about virus hosting will have many many virus sites.
allow for aliases by making your tree so that you can have references from multiple roots point to a single tree. This covers the fact that DNS systems routinely allow for nearly infinite aliases.
hostingprovider.com subdomain: abc, jbx, sand, zx* dir: foo dir: bar page: viruspage1.html page: viruspage2.html page: viruspage3.html dir: fudge page: viruspage4.html hostalias.com -- reference to hostingprovider.com 127.198.41.77 -- reference to hostingprovider.com
One regex can parse your entire URL into parts. Domain, subdomain, a list of directories, and a page reference, followed by a list of parameters. So you can match once against every URL, take the parts list and run it through the tree you create. Simple matching will catch every hit. And it scales well.
Let's say that you have 10,000 URLs in your list. Let's say that 80% of them have the same 100 domains. That leaves 20% with a smattering of other domains. I'll swag that down to about 1000 domains. Let's say that you create a tree structure like above, but the root is a sorted list of these 1000 domains. Let's say you use my method and a binary search against the domain name. You can find the domain hit in 10 hits or less... you'll average five hits. Then, if your directory and page lists follow the same pattern, you are likely to find your match or free the URL in less than 10 more matches. That's 20 matches. Add another 10,000 URLs to the list. The number of matches goes up... to 21.
With your method, the first scenario requires 10,000 matches. The second requires 20,000 matches. It scales EXCEEDINGLY badly.
Please, Steve, use a data structure for this mechanism. The fundamental concept is a Suffix tree, which is a specialized form of a Trie. Note that you could use a pure Trie structure, especially a Patricia Trie structure, if you are clever and want REALLY fast performance, but that may require some programming chops. You can find a good writeup on Trie structures on wikipedia. http://en.wikipedia.org/wiki/Trie You could also use a Directed Acyclic Word Graph, which will give you some very fast performance as well. To be honest, I may have described a compact DAWG above.
Some other links: http://www.nist.gov/dads/HTML/suffixtree.html http://www.nist.gov/dads/HTML/directedAcyclicWordGraph.html
 Signature --- Nick Malik [Microsoft] MCSD, CFPS, Certified Scrummaster http://blogs.msdn.com/nickmalik
Disclaimer: Opinions expressed in this forum are my own, and not representative of my employer. I do not answer questions on behalf of my employer. I'm just a programmer helping programmers. --
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|