Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsFree MagazinesWhite PapersSubmit Content
Discussion GroupsASP.NETWindows FormsLanguages.NET FrameworkVisual Studio.NET
Articles.NET FrameworkASP.NETToolsWindows Forms
.NET DirectoryOpen Source ProjectsUser GroupsWeb Resources
Related Topics
Visual Basic 6SQL ServerMS AccessOther DB ProductsMS Server ProductsMore Topics ...

.NET Forum / Languages / Managed C++ / May 2004

Tip: Looking for answers? Try searching our database.

Data Structure question

Thread view: 
Enable EMail Alerts  Start New Thread
Thread rating: 
Tommy Vercetti - 04 May 2004 17:19 GMT
I have an application that loads a 150 MB text file containing 6.6
million strings. I want to be able to quickly determine run lookups
against this and determine yes or no; is a given string in this collection.

This obviously requires a binary tree (or possibly a hash table). My
first solution was to use STL with std::set<std::string>. However, this
solution took up 600MB and was unacceptable. I revised this by reading
the entire 150MB text file into a single contiguous block, and building
an std::set based on pointers into this block. I used a custom
comparison operator that compares until the end of a line. This worked
and dramatically reduced load time and RAM use. RAM use is now at 350MB.

However, I'd like to reduce this RAM footprint further. Anyone have any
ideas for this kind of task? I'm thinking of implementing a special case
binary tree that is optimized for this specific task. Obviously, I only
need to balance the tree once and I don't need to worry about additions
or removals once everything is built.

Any ideas/suggestions?

thanks!
Carl Daniel [VC++ MVP] - 04 May 2004 21:35 GMT
> I have an application that loads a 150 MB text file containing 6.6
> million strings. I want to be able to quickly determine run lookups
[quoted text clipped - 18 lines]
> worry about additions
> or removals once everything is built.

Store the data in a contiguous buffer (with embedded nulls to demarcate the
strings) and store pointers in a sorted std::vector instead of a std::set.
That should get you down to about 174Mb.

A bit more fancy, you could turn the entire dictionary into a large ternary
search tree, or perhaps build a large finite automata that can recognize the
strings.  Obviously either of those would require a significant
pre-processing of the data, but I'm guessing that wouldn't necessarily be a
problem.  Without detailed knowledge of the distribution of characters in
the strings it's difficult to predict how large the associated data
structure for either of those would be, but it's not unlikely that it would
be considerably smaller and also quite possibly faster.

-cd
Tommy Vercetti - 05 May 2004 01:52 GMT
Ah, and since the collection is sorted you can still get logarithmic
search times.

Wow, those are actually really good ideas. I'm surprised. I will
investigate those solutions. I have just googled ternary search trees
and it looks promising. Thanks!

> Store the data in a contiguous buffer (with embedded nulls to demarcate the
> strings) and store pointers in a sorted std::vector instead of a std::set.
[quoted text clipped - 10 lines]
>
> -cd
leo - 08 May 2004 00:07 GMT
> Ah, and since the collection is sorted you can still get logarithmic
> search times.
>
> Wow, those are actually really good ideas. I'm surprised. I will
> investigate those solutions. I have just googled ternary search trees
> and it looks promising. Thanks!

Using Ternary Search Tree could still be a problem in memory
consumption. If you don't need partial search functionality, probably
hashtable is the best choice. On the other hand, if the ternary tree
you built is not balanced, then the performance of it will be worse
than hashtable.

Leo

Rate this thread:







Free Magazines

Get these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...

Oracle MagazineNetwork ComputingComputer WorldBio-IT WorldeWeekInformation WeekInfosecurity
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.