.NET Forum / Languages / C# / May 2008
Checking if a list of names appears in a body of text.
|
|
Thread rating:  |
Brent - 03 May 2008 01:23 GMT I have a list of company names (say, IBM, Corning, General Motors, and another 5,000 of them).
If I take a body of text, a news article, for instance, and I want to see which company names appear in that text, is there an efficient way to do this?
I thought about looping through the array of names, and doing an IndexOf or Regex match, but this method is slow. Then I thought about an array intersection, but this is problematic for two-word company names (you can't just create the second array based on a split on spaces).
Any hints would be much appreciated!
--Brent
Peter Duniho - 03 May 2008 01:40 GMT > I have a list of company names (say, IBM, Corning, General Motors, and > another 5,000 of them). > > If I take a body of text, a news article, for instance, and I want to > see which company names appear in that text, is there an efficient way > to do this? This comes up here surprisingly often. :)
For the general case, so far I've yet to see someone suggest a better solution that the one I've proposed in the past: using a state graph. This approach assumes that you've got a static list that you can use to initialize your state graph once, and then use the same graph to process multiple input over and over. The cost of creating the graph would be prohibitive if you had to create it for each iteration of input.
In an earlier thread, I posted some code that provided a generic implementation of a state graph. I'm not claiming it's the best implementation, but it does work. :) You can find that message here: http://groups.google.com/group/microsoft.public.dotnet.languages.csharp/msg/0f06 f696d4500b77
VERY IMPORTANT! That code had a performance bug in it that severely limited its usefulness. The original person I'd posted the code for never bothered to comment on it, so I never even noticed until much later, when I recommended the same code to someone else and they complained that it wasn't as fast as I'd said it should be. You can find my follow-up post in which I included the bug-fix for the earlier code here: http://groups.google.com/group/microsoft.public.dotnet.languages.csharp/msg/ac50 505b568a75fd
If you care about performance (and obviously you do :) ), do NOT use the code I originally posted without including the later bug-fix.
You may want to review both threads for comments from other people. A number of other folks contributed to the threads, and they had insightful and useful comments, especially pertaining to special cases where you can get good performance by knowing something particular about the input. The state graph performs well as a general-purpose solution, but sometimes you can get equal or better performance with a different solution that takes advantage of some particular known characteristic of the input.
Pete
Peter Bromberg [C# MVP] - 03 May 2008 03:11 GMT Very interesting concept, Peter. I'll be playing with this over the weekend. Peter
>> I have a list of company names (say, IBM, Corning, General Motors, and >> another 5,000 of them). [quoted text clipped - 37 lines] > > Pete Peter Bromberg [C# MVP] - 12 May 2008 00:28 GMT Tomas Petricek has a very nice article with sample code along these lines here: http://tomasp.net/articles/ahocorasick.aspx Peter
>> I have a list of company names (say, IBM, Corning, General Motors, and >> another 5,000 of them). [quoted text clipped - 37 lines] > > Pete Arne Vajhøj - 03 May 2008 03:18 GMT > I have a list of company names (say, IBM, Corning, General Motors, and > another 5,000 of them). [quoted text clipped - 8 lines] > names (you can't just create the second array based on a split on > spaces). Try something like:
public static string[] Find2(string[] lst, string txt) { HashSet<string> hs = new HashSet<string>(txt.Split(' ', ',', '.')); return Array.FindAll<string>(lst, (string s) => hs.Contains(s)); }
assuming you are on 3.5 !
Arne
colin - 03 May 2008 10:14 GMT have you considered a decision tree built from your list of company names?
Colin =^.^=
>I have a list of company names (say, IBM, Corning, General Motors, and > another 5,000 of them). [quoted text clipped - 12 lines] > > --Brent Mr. Arnold - 03 May 2008 12:31 GMT >I have a list of company names (say, IBM, Corning, General Motors, and > another 5,000 of them). [quoted text clipped - 10 lines] > > Any hints would be much appreciated! Finding the phrase How do we use the MakePattern method to find our phrase? Let's suppose that we aren't interested in where the phrase occurs, or whether it occurs several times, but just whether or not it appears at all. So our approach will be to take the original phrase, turn it into a pattern, match the pattern, and return true if the pattern has been matched:
public Boolean PhraseFound(String argPhrase, String argText)
{
String strPattern = MakePattern(argPhrase);
Match match = Regex.Match(argText, strPattern);
return match.Success;
}
I used the Regex.Match to find the occurrence of a word in a text field or variable. You can also use the features of Regex that will find the positions of the words so that can use something like a RichTextBox and position to the word or words in the textbox and highlight all the words.
rossum - 03 May 2008 14:13 GMT >I have a list of company names (say, IBM, Corning, General Motors, and >another 5,000 of them). [quoted text clipped - 12 lines] > >--Brent As an alternative have a look at the Rabin-Karp string searching algorithm. http://en.wikipedia.org/wiki/Rabin-Karp
It performs well when searching a text for multiple targets, as in your case.
rossum
Peter Duniho - 03 May 2008 18:03 GMT > As an alternative have a look at the Rabin-Karp string searching > algorithm. http://en.wikipedia.org/wiki/Rabin-Karp > > It performs well when searching a text for multiple targets, as in > your case. However, it still has to visit each character in the input string many times (for interior characters -- i.e. those not at the very beginning or end of the input, M times where M is the length of the longest match string). Also, to deal with match strings of varying length, it needs to compute a hash value for every substring of the input string of every possible match length.
If you're implementing your own hash function and table, this isn't so bad, since you'd easily be able to use the intermediate results as you work your way toward the longest possible match length. However, in the context of .NET the natural approach would be to use the built-in string hashing, requiring a new hash computation for each possible substring. That could be prohibitive.
In case it wasn't clear (and as I noted in one of the previous threads I referenced, I admit that the code I posted isn't necessarily the most clear), one of the state graph functions (RgobjFromCollectionParallel()) will return all possible matches for multiple match strings, even when there might be overlapping results.
I agree it's not a clear win over the Rabin-Karp approach, since the "parallel" traversal winds up having to update multiple graph positions per input string character, even though it only visits each input string character once. But even the worst case would be no worse than Rabin-Karp, since in that algorithm you're assured of nearly N*M operations (where N is the input string string and M is the maximum match string length), whereas with the state graph, you only wind up with that many if by some miracle you have to traverse the entire graph starting with each input character (that would be pretty weird input data :) ).
One reason I like the state graph is that it's conceptually fairly easy to understand. (Well, _I_ think it is anyway :) ). Also, if you know that your input string has no overlapping matches, and you know that none of your match strings contain any other match string, you can even process the input in a single pass (rather than running parallel traversals, even as efficient though that can be).
I think it's entirely possible that a fully-optimized, completely custom Rabin-Karp implementation could well out-perform the generic state graph code I posted. But this is .NET. Who wants to rewrite all that hash table stuff and squeeze the last bit of performance out of the implementation when a nice, .NET-friendly algorithm works fine. :)
Pete
rossum - 03 May 2008 21:02 GMT >> As an alternative have a look at the Rabin-Karp string searching >> algorithm. http://en.wikipedia.org/wiki/Rabin-Karp [quoted text clipped - 8 lines] >compute a hash value for every substring of the input string of every >possible match length. Yes, you do need multiple hashes, but you do not need to visit characters more than once. Since you know in advance the lengths you will need, you can retain just enough characters to calculate the required number of rolling hashes, say 3 chars, 4 chars and 5 chars from the starting character. That just needs three character variables.
>If you're implementing your own hash function and table, this isn't so >bad, since you'd easily be able to use the intermediate results as you >work your way toward the longest possible match length. That is essential in Rabin-Karp. You use a rolling hash so it is easy to calculate a hash without having to recalculate the entire hash, you just calculate the difference between the old hash and the new hash.
>However, in the >context of .NET the natural approach would be to use the built-in string >hashing, requiring a new hash computation for each possible substring. >That could be prohibitive. It would be, which is why the R-K algorithm avoids it. You can use any reasonable hash function that you want, and the rolling hash is a resonable hash function.
[snip]
>I think it's entirely possible that a fully-optimized, completely custom >Rabin-Karp implementation could well out-perform the generic state graph >code I posted. But this is .NET. Who wants to rewrite all that hash >table stuff and squeeze the last bit of performance out of the >implementation when a nice, .NET-friendly algorithm works fine. :) Why would you need to rewrite Hashtable? Just calculate your own integer rolling hash value and pass it to a Hashtable as the key value.
rossum
>Pete Peter Duniho - 04 May 2008 01:06 GMT > Yes, you do need multiple hashes, but you do not need to visit > characters more than once. Since you know in advance the lengths you > will need, you can retain just enough characters to calculate the > required number of rolling hashes, say 3 chars, 4 chars and 5 chars > from the starting character. That just needs three character > variables. I see...I missed that part of the description the first time I read the Wikipedia article.
That said, it still looks to me as though you need to keep a list of hashes, one for each possible length of match string (each of which still needs to be updated for each new character in the input string). I remain unconvinced that this is going to be significantly more performant than a simple state graph, and it seems more complicated to understand. And in either case, it seems like we've simply traded visiting characters in the input string multiple times for visiting individual rolling hash codes multiple times (to update each one once per input character).
But that's just me. With the rolling hashes, I can see how it has a different implementation (one that doesn't require a custom hash table), if not less cost, than I previously assumed, and if it seems appropriate to the OP, far be it from me to argue. :)
Thanks for the clarification.
Pete
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|