Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsFree MagazinesWhite PapersSubmit Content
Discussion GroupsASP.NETWindows FormsLanguages.NET FrameworkVisual Studio.NET
Articles.NET FrameworkASP.NETToolsWindows Forms
.NET DirectoryOpen Source ProjectsUser GroupsWeb Resources
Related Topics
Visual Basic 6SQL ServerMS AccessOther DB ProductsMS Server ProductsMore Topics ...

.NET Forum / Languages / C# / April 2008

Tip: Looking for answers? Try searching our database.

String array intersections

Thread view: 
Enable EMail Alerts  Start New Thread
Thread rating: 
kevinmr@optonline.net - 17 Apr 2008 20:51 GMT
To all -

I have two arrays of strings one of length M and the other of length
N; N > M. Ther arrays will be on the order of 1 million for N array
and ~50000 for the M array. Can someone describe to me a fast
algorithm to find the intersection of the arrays? Currently I am using
ArrayList in VC.NET.

Thanks in advance
Gilles Kohl [MVP] - 17 Apr 2008 21:23 GMT
>To all -
>
[quoted text clipped - 3 lines]
>algorithm to find the intersection of the arrays? Currently I am using
>ArrayList in VC.NET.

Hmm, are your arrays sorted?

  Regards,
  Gilles.
Peter Duniho - 17 Apr 2008 21:55 GMT
> To all -
>
[quoted text clipped - 3 lines]
> algorithm to find the intersection of the arrays? Currently I am using
> ArrayList in VC.NET.

As Gilles alludes to, if your arrays are already sorted, then you can do a  
sort of "merge sort" algorithm in which rather than creating a new output  
array, you're simply identifying matches.  That should be very fast,  
basically O(n).

If the arrays aren't sorted, then I think that putting all the elements of  
one into a Dictionary<> instance and then looking up all the other array's  
elements in that Dictionary<> instance would work well.  This would be  
somewhat slower than if the data were sorted to start with, but usually  
faster than actually sorting the data just for this one purpose.

Mind you, depending on your definition of "fast", processing 1 million of  
anything isn't necessarily ever going to actually be fast.  :)

But I think a dictionary-based solution would perform reasonably well for  
unsorted data.

Pete
Gilles Kohl [MVP] - 17 Apr 2008 23:47 GMT
>> To all -
>>
[quoted text clipped - 8 lines]
>array, you're simply identifying matches.  That should be very fast,  
>basically O(n).

That was the plan :-)

>If the arrays aren't sorted, then I think that putting all the elements of  
>one into a Dictionary<> instance and then looking up all the other array's  
[quoted text clipped - 7 lines]
>But I think a dictionary-based solution would perform reasonably well for  
>unsorted data.

Ditto - the new HashSet<T> could come in handy here to play the role
of a "lightweight Dictionary". Basically, with smallArray and
largeArray being string arrays:

  HashSet<string> set = new HashSet<string>(smallArray);
  var intersection = largeArray.Where(name => set.Contains(name));

(although the OPs mention of ArrayList might be indicative of target
Framework 1.1 rather than 3.5, ahem)

  Regards,
  Gilles.
Jon Skeet [C# MVP] - 18 Apr 2008 00:04 GMT
On Apr 17, 1:55 pm, "Peter Duniho" <NpOeStPe...@nnowslpianmk.com>
wrote:

<snip>

> Mind you, depending on your definition of "fast", processing 1 million of
> anything isn't necessarily ever going to actually be fast.  :)

Ooh, I don't know about that. A little test app I wrote using
HashSet<string> takes about 100ms on my laptop with N=1000000 and
M=50000.

Not the sort of thing you want to be doing *really* regularly (like on
every web request to a server) but still pretty fast :)

Jon
Peter Duniho - 18 Apr 2008 00:15 GMT
>> Mind you, depending on your definition of "fast", processing 1 million  
>> of
[quoted text clipped - 3 lines]
> HashSet<string> takes about 100ms on my laptop with N=1000000 and
> M=50000.

Like I said, it depends on your definition of "fast".  In some scenarios,  
100ms is incredibly slow.

That's of course ignoring the question as to whether your test was even a  
valid performance measurement.  Depending on the hash implementation,  
there could be a significant difference between the best-case (no  
collisions) and worst-case scenarios (many collisions, no strings actually  
in the intersection).  But for the sake of this discussion, I'm willing to  
grant you that.  :)

Pete

Free Magazines

Get these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...

Oracle MagazineNetwork ComputingComputer WorldBio-IT WorldeWeekInformation WeekInfosecurity
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.