> I have an application that loads a 150 MB text file containing 6.6
> million strings. I want to be able to quickly determine run lookups
[quoted text clipped - 18 lines]
> worry about additions
> or removals once everything is built.
Store the data in a contiguous buffer (with embedded nulls to demarcate the
strings) and store pointers in a sorted std::vector instead of a std::set.
That should get you down to about 174Mb.
A bit more fancy, you could turn the entire dictionary into a large ternary
search tree, or perhaps build a large finite automata that can recognize the
strings. Obviously either of those would require a significant
pre-processing of the data, but I'm guessing that wouldn't necessarily be a
problem. Without detailed knowledge of the distribution of characters in
the strings it's difficult to predict how large the associated data
structure for either of those would be, but it's not unlikely that it would
be considerably smaller and also quite possibly faster.
-cd
Tommy Vercetti - 05 May 2004 01:52 GMT
Ah, and since the collection is sorted you can still get logarithmic
search times.
Wow, those are actually really good ideas. I'm surprised. I will
investigate those solutions. I have just googled ternary search trees
and it looks promising. Thanks!
> Store the data in a contiguous buffer (with embedded nulls to demarcate the
> strings) and store pointers in a sorted std::vector instead of a std::set.
[quoted text clipped - 10 lines]
>
> -cd
leo - 08 May 2004 00:07 GMT
> Ah, and since the collection is sorted you can still get logarithmic
> search times.
>
> Wow, those are actually really good ideas. I'm surprised. I will
> investigate those solutions. I have just googled ternary search trees
> and it looks promising. Thanks!
Using Ternary Search Tree could still be a problem in memory
consumption. If you don't need partial search functionality, probably
hashtable is the best choice. On the other hand, if the ternary tree
you built is not balanced, then the performance of it will be worse
than hashtable.
Leo