Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsFree MagazinesWhite PapersSubmit Content
Discussion GroupsASP.NETWindows FormsLanguages.NET FrameworkVisual Studio.NET
Articles.NET FrameworkASP.NETToolsWindows Forms
.NET DirectoryOpen Source ProjectsUser GroupsWeb Resources
Related Topics
Visual Basic 6SQL ServerMS AccessOther DB ProductsMS Server ProductsMore Topics ...

.NET Forum / Languages / C# / May 2008

Tip: Looking for answers? Try searching our database.

Checking if a list of names appears in a body of text.

Thread view: 
Enable EMail Alerts  Start New Thread
Thread rating: 
Brent - 03 May 2008 01:23 GMT
I have a list of company names (say, IBM, Corning, General Motors, and
another 5,000 of them).

If I take a body of text, a news article, for instance, and I want to
see which company names appear in that text, is there an efficient way
to do this?

I thought about looping through the array of names, and doing an
IndexOf or Regex match, but this method is slow. Then I thought about
an array intersection, but this is problematic for two-word company
names (you can't just create the second array based on a split on
spaces).

Any hints would be much appreciated!

--Brent
Peter Duniho - 03 May 2008 01:40 GMT
> I have a list of company names (say, IBM, Corning, General Motors, and
> another 5,000 of them).
>
> If I take a body of text, a news article, for instance, and I want to
> see which company names appear in that text, is there an efficient way
> to do this?

This comes up here surprisingly often.  :)

For the general case, so far I've yet to see someone suggest a better  
solution that the one I've proposed in the past: using a state graph.  
This approach assumes that you've got a static list that you can use to  
initialize your state graph once, and then use the same graph to process  
multiple input over and over.  The cost of creating the graph would be  
prohibitive if you had to create it for each iteration of input.

In an earlier thread, I posted some code that provided a generic  
implementation of a state graph.  I'm not claiming it's the best  
implementation, but it does work.  :)  You can find that message here:
http://groups.google.com/group/microsoft.public.dotnet.languages.csharp/msg/0f06
f696d4500b77


VERY IMPORTANT!  That code had a performance bug in it that severely  
limited its usefulness.  The original person I'd posted the code for never  
bothered to comment on it, so I never even noticed until much later, when  
I recommended the same code to someone else and they complained that it  
wasn't as fast as I'd said it should be.  You can find my follow-up post  
in which I included the bug-fix for the earlier code here:
http://groups.google.com/group/microsoft.public.dotnet.languages.csharp/msg/ac50
505b568a75fd


If you care about performance (and obviously you do :) ), do NOT use the  
code I originally posted without including the later bug-fix.

You may want to review both threads for comments from other people.  A  
number of other folks contributed to the threads, and they had insightful  
and useful comments, especially pertaining to special cases where you can  
get good performance by knowing something particular about the input.  The  
state graph performs well as a general-purpose solution, but sometimes you  
can get equal or better performance with a different solution that takes  
advantage of some particular known characteristic of the input.

Pete
Peter  Bromberg [C# MVP] - 03 May 2008 03:11 GMT
Very interesting concept, Peter. I'll be playing with this over the weekend.
Peter

>> I have a list of company names (say, IBM, Corning, General Motors, and
>> another 5,000 of them).
[quoted text clipped - 37 lines]
>
> Pete
Peter Bromberg [C# MVP] - 12 May 2008 00:28 GMT
Tomas Petricek has a very nice article with sample code along these lines
here:
http://tomasp.net/articles/ahocorasick.aspx
Peter

>> I have a list of company names (say, IBM, Corning, General Motors, and
>> another 5,000 of them).
[quoted text clipped - 37 lines]
>
> Pete
Arne Vajhøj - 03 May 2008 03:18 GMT
> I have a list of company names (say, IBM, Corning, General Motors, and
> another 5,000 of them).
[quoted text clipped - 8 lines]
> names (you can't just create the second array based on a split on
> spaces).

Try something like:

public static string[] Find2(string[] lst, string txt)
{
    HashSet<string> hs = new HashSet<string>(txt.Split(' ', ',', '.'));
    return Array.FindAll<string>(lst, (string s) => hs.Contains(s));
}

assuming you are on 3.5 !

Arne
colin - 03 May 2008 10:14 GMT
have you considered a decision tree built from your list of company names?

Colin =^.^=

>I have a list of company names (say, IBM, Corning, General Motors, and
> another 5,000 of them).
[quoted text clipped - 12 lines]
>
> --Brent
Mr. Arnold - 03 May 2008 12:31 GMT
>I have a list of company names (say, IBM, Corning, General Motors, and
> another 5,000 of them).
[quoted text clipped - 10 lines]
>
> Any hints would be much appreciated!

Finding the phrase
How do we use the MakePattern method to find our phrase?  Let's suppose that
we aren't interested in where the phrase occurs, or whether it occurs
several times, but just whether or not it appears at all.   So our approach
will be to take the original phrase, turn it into a pattern, match the
pattern, and return true if the pattern has been matched:

   public Boolean PhraseFound(String argPhrase, String argText)

   {

     String strPattern = MakePattern(argPhrase);

     Match match = Regex.Match(argText, strPattern);

     return match.Success;

   }

I used the Regex.Match to find the occurrence of a word in a text field or
variable. You can also use the features of Regex that will find the
positions of the words so that can use something like a RichTextBox and
position to the word or words in the textbox and highlight all the words.
rossum - 03 May 2008 14:13 GMT
>I have a list of company names (say, IBM, Corning, General Motors, and
>another 5,000 of them).
[quoted text clipped - 12 lines]
>
>--Brent
As an alternative have a look at the Rabin-Karp string searching
algorithm.  http://en.wikipedia.org/wiki/Rabin-Karp

It performs well when searching a text for multiple targets, as in
your case.

rossum
Peter Duniho - 03 May 2008 18:03 GMT
> As an alternative have a look at the Rabin-Karp string searching
> algorithm.  http://en.wikipedia.org/wiki/Rabin-Karp
>
> It performs well when searching a text for multiple targets, as in
> your case.

However, it still has to visit each character in the input string many  
times (for interior characters -- i.e. those not at the very beginning or  
end of the input, M times where M is the length of the longest match  
string).  Also, to deal with match strings of varying length, it needs to  
compute a hash value for every substring of the input string of every  
possible match length.

If you're implementing your own hash function and table, this isn't so  
bad, since you'd easily be able to use the intermediate results as you  
work your way toward the longest possible match length.  However, in the  
context of .NET the natural approach would be to use the built-in string  
hashing, requiring a new hash computation for each possible substring.  
That could be prohibitive.

In case it wasn't clear (and as I noted in one of the previous threads I  
referenced, I admit that the code I posted isn't necessarily the most  
clear), one of the state graph functions (RgobjFromCollectionParallel())  
will return all possible matches for multiple match strings, even when  
there might be overlapping results.

I agree it's not a clear win over the Rabin-Karp approach, since the  
"parallel" traversal winds up having to update multiple graph positions  
per input string character, even though it only visits each input string  
character once.  But even the worst case would be no worse than  
Rabin-Karp, since in that algorithm you're assured of nearly N*M  
operations (where N is the input string string and M is the maximum match  
string length), whereas with the state graph, you only wind up with that  
many if by some miracle you have to traverse the entire graph starting  
with each input character (that would be pretty weird input data :) ).

One reason I like the state graph is that it's conceptually fairly easy to  
understand.  (Well, _I_ think it is anyway :) ).  Also, if you know that  
your input string has no overlapping matches, and you know that none of  
your match strings contain any other match string, you can even process  
the input in a single pass (rather than running parallel traversals, even  
as efficient though that can be).

I think it's entirely possible that a fully-optimized, completely custom  
Rabin-Karp implementation could well out-perform the generic state graph  
code I posted.  But this is .NET.  Who wants to rewrite all that hash  
table stuff and squeeze the last bit of performance out of the  
implementation when a nice, .NET-friendly algorithm works fine.  :)

Pete
rossum - 03 May 2008 21:02 GMT
>> As an alternative have a look at the Rabin-Karp string searching
>> algorithm.  http://en.wikipedia.org/wiki/Rabin-Karp
[quoted text clipped - 8 lines]
>compute a hash value for every substring of the input string of every  
>possible match length.
Yes, you do need multiple hashes, but you do not need to visit
characters more than once.  Since you know in advance the lengths you
will need, you can retain just enough characters to calculate the
required number of rolling hashes, say 3 chars, 4 chars and 5 chars
from the starting character.  That just needs three character
variables.

>If you're implementing your own hash function and table, this isn't so  
>bad, since you'd easily be able to use the intermediate results as you  
>work your way toward the longest possible match length.  
That is essential in Rabin-Karp.  You use a rolling hash so it is easy
to calculate a hash without having to recalculate the entire hash, you
just calculate the difference between the old hash and the new hash.

>However, in the  
>context of .NET the natural approach would be to use the built-in string  
>hashing, requiring a new hash computation for each possible substring.  
>That could be prohibitive.
It would be, which is why the R-K algorithm avoids it.  You can use
any reasonable hash function that you want, and the rolling hash is a
resonable hash function.

[snip]

>I think it's entirely possible that a fully-optimized, completely custom  
>Rabin-Karp implementation could well out-perform the generic state graph  
>code I posted.  But this is .NET.  Who wants to rewrite all that hash  
>table stuff and squeeze the last bit of performance out of the  
>implementation when a nice, .NET-friendly algorithm works fine.  :)
Why would you need to rewrite Hashtable?  Just calculate your own
integer rolling hash value and pass it to a Hashtable as the key
value.

rossum

>Pete
Peter Duniho - 04 May 2008 01:06 GMT
> Yes, you do need multiple hashes, but you do not need to visit
> characters more than once.  Since you know in advance the lengths you
> will need, you can retain just enough characters to calculate the
> required number of rolling hashes, say 3 chars, 4 chars and 5 chars
> from the starting character.  That just needs three character
> variables.

I see...I missed that part of the description the first time I read the  
Wikipedia article.

That said, it still looks to me as though you need to keep a list of  
hashes, one for each possible length of match string (each of which still  
needs to be updated for each new character in the input string).  I remain  
unconvinced that this is going to be significantly more performant than a  
simple state graph, and it seems more complicated to understand.  And in  
either case, it seems like we've simply traded visiting characters in the  
input string multiple times for visiting individual rolling hash codes  
multiple times (to update each one once per input character).

But that's just me.  With the rolling hashes, I can see how it has a  
different implementation (one that doesn't require a custom hash table),  
if not less cost, than I previously assumed, and if it seems appropriate  
to the OP, far be it from me to argue.  :)

Thanks for the clarification.

Pete

Rate this thread:







Free Magazines

Get these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...

Oracle MagazineNetwork ComputingComputer WorldBio-IT WorldeWeekInformation WeekInfosecurity
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.