Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsFree MagazinesWhite PapersSubmit Content
Discussion GroupsASP.NETWindows FormsLanguages.NET FrameworkVisual Studio.NET
Articles.NET FrameworkASP.NETToolsWindows Forms
.NET DirectoryOpen Source ProjectsUser GroupsWeb Resources
Related Topics
Visual Basic 6SQL ServerMS AccessOther DB ProductsMS Server ProductsMore Topics ...

.NET Forum / Languages / C# / January 2008

Tip: Looking for answers? Try searching our database.

Hash Code Of A String

Thread view: 
Enable EMail Alerts  Start New Thread
Thread rating: 
j1mb0jay - 11 Jan 2008 22:03 GMT
I am currently working on a dictionary populating program. I currently
have a socket connection my local news server and am trawling through
all of the articles looking for new words. I am currently using Java to
do this but would like to move the source to C#. Java's String class has
a method that hashes strings. I was wondering if C# has a method which
does the same?

In my Java version of the program I am using the Multiply Add and Divide
(MAD) method for the compression of the hash code, does C# have any
built in functions(methods) that will do this for me, or does anyone
know of a more efficient way?

j1mb0jay
KH - 11 Jan 2008 23:15 GMT
All .NET types have the GetHashCode() method, at minimum inherited from
System.Object.

string str = "fubar";

int hc = str.GetHashCode(); // 418978654

> I am currently working on a dictionary populating program. I currently
> have a socket connection my local news server and am trawling through
[quoted text clipped - 9 lines]
>
> j1mb0jay
j1mb0jay - 11 Jan 2008 23:24 GMT
> All .NET types have the GetHashCode() method, at minimum inherited from
> System.Object.
[quoted text clipped - 16 lines]
>>
>> j1mb0jay

Much the same as Java the GetHashCode() methodology. How would you
convert the 418978654 to a number that I could use as an index into a
fixed size array?

j1mb0jay
KH - 11 Jan 2008 23:48 GMT
That's really a whole different question; a couple options:

1) GetHashCode() returns a 32-bit interger, so if you created an array of
size UInt32.MaxValue you could use the hashcode directly

2) You could % the hashcode by the size of your array

Neither of those are actually viable options of course. What are you
actually trying to accomplish?

> > All .NET types have the GetHashCode() method, at minimum inherited from
> > System.Object.
[quoted text clipped - 22 lines]
>
> j1mb0jay
j1mb0jay - 12 Jan 2008 00:15 GMT
How much memory would option one take on a 32bit machine !!, I only have
access to about 8GB physical and 18GB virtual.

I am trying to populate a dictionary that has standard English words but
is extended by the slang and dialog found on new groups. This dictionary
will then aid with my coursework which is to crack a few simple
encryption types such as polyalphabetic.

I also want to order the dictionary by the popularity of the words found
on the news server. I have successfully completed this by loading the
dictionary file into a sorted binary tree.

I just wanted to improve the insert method of the hash table because
after inserting all words I have to check a maximum 4 words rather than
1 to see if the word already exists in the hash table which is
2.5million words long.

> That's really a whole different question; a couple options:
>
[quoted text clipped - 32 lines]
>>
>> j1mb0jay
Peter Duniho - 12 Jan 2008 00:37 GMT
> [...]
> I just wanted to improve the insert method of the hash table because  
> after inserting all words I have to check a maximum 4 words rather than  
> 1 to see if the word already exists in the hash table which is  
> 2.5million words long.

There's no way for you to guarantee that you don't have to check multiple  
words, even if you have as many entries in the hash table as you have data  
elements.  Collisions are always a possibility.

Besides that, you seem to be pursuing a pointless optimization.  Even if  
your hash table required you to check dozens of words, it would still be  
ridiculously faster than a linear search of 2.5 million words.  That's the  
point of a hash table.  You may be able to improve performance slightly by  
implementing your own hash table and/or hash function, but IMHO once  
you've got to the point of use _some_ kind of hash table, you're likely to  
find the algorithm isn't going to get a lot faster by messing around with  
the hash table.  Your efforts will be better spent looking for other  
low-hanging fruit elsewhere in whatever the algorithm you're doing is.  
And if you can't get acceptable performance that way while still using a  
hash table, you may wind up needing an algorithm that uses a data  
structure that's even faster than the hash table.

Pete
j1mb0jay - 12 Jan 2008 00:42 GMT
>> [...]
>> I just wanted to improve the insert method of the hash table because
[quoted text clipped - 20 lines]
>
> Pete

I am trying to do this for a data structure module for my degree,
writing our own data structures is important for best marks. More marks
are to be gained for good performance. Up to now from the data
structures i have studied this seems to be this best for this task. What
you do recommend for a better ?

j1mb0jay
Peter Duniho - 12 Jan 2008 01:15 GMT
> I am trying to do this for a data structure module for my degree,  
> writing our own data structures is important for best marks. More marks  
> are to be gained for good performance. Up to now from the data  
> structures i have studied this seems to be this best for this task. What  
> you do recommend for a better ?

Caveat: I don't know the specific grading standard, how "performance" is  
defined (does memory usage matter?), and frankly I don't care.  I already  
have mixed feelings about assisting with what's essentially homework.

That said, if memory consumption and set-up time is no object, a state  
graph _could_ be better (I think some people reading this may be starting  
to think, "what's up with that guy and his state graphs?"...I know, they  
aren't really the end-all be-all, but I like them anyway :) ).

With a hash table, you are guaranteed to have to scan the string at least  
once, because even if there's only one string associated with a specific  
hash value, you need to check to make sure it's the one you're actually  
looking for (I'm assuming here that you are not guaranteed ahead of time  
that the string being looked up is actually in your dictionary).

Using a state graph, the worst-case scenario is having to scan the entire  
string, and if you fail to match, or you match a string that is unique  
earlier than the last character (a common enough scenario), then searching  
for a string requires only a single partial scan of the string.

Why did I write "could"?  Well, in reality the implementation of the state  
graph depends.

For each node in the state graph, you need a way of mapping a new  
character to the next node.  If you're guaranteed only regular ASCII  
characters, then this requires only a 128-entry lookup table (less,  
actually, if you exclude control characters and maybe others too).  This  
requirement could come out of the basic dictionary contents, or you could  
simplify the dictionary by removing accents, etc. and only supporting  
languages using the Roman alphabet.  But if you want to be more flexible,  
then you're looking at some other data structure for the node lookup, and  
that could start adding in cost again (for example, one easy  
implementation might use a dictionary, which is essentially another hash  
table :) ).

If you can afford to burn 256K per state graph node, then the next-node  
lookup is a single constant-order array retrieval for Unicode.  But  
otherwise, you need something that's going to be slower.

In addition, traversal of the state graph could wind up visiting memory  
locations that are not proximal to each other, causing less efficient use  
of the CPU cache than might occur with a regular hash table.

Of course, the hash table might not be able to be implemented in a  
perfectly optimal way either.  You could have a lot of collisions, either  
because the hash function just isn't capable of reducing collisions, or  
because you don't really have 2 million+ entries in the hash table.

Each implementation has its pros and cons and it's not really possible to  
say in advance which would come out ahead.

Finally, these are sort of the standard fare for data structures classes.  
There are a number of other possible data structures that are much more  
advanced (i.e. complicated) and they may perform better in specific  
scenarios (they would take advantage of constraints you are able to apply  
to the problem, assuming there are any).  People have been writing word  
dictionaries for decades, and there are countless implementations, each  
with varying performance and appropriateness for a given data set.

Your biggest problem with the approach you've described is that I just  
don't know of a practical way to come up with a hash function that  
guarantees no collisions.  That's the only way you can do what you're  
asking.  I haven't looked closely at this myself, but I do believe it's  
mathematically _possible_ to write a hash function, generated based on the  
known input data, that will guarantee no collisions, but I suspect that  
hash function is not practical to implement (it'd probably wind up be some  
incredibly-large-order polynomial or something, with magic constants  
derived from the input data).

Pete
Peter Duniho - 12 Jan 2008 00:31 GMT
> Much the same as Java the GetHashCode() methodology. How would you  
> convert the 418978654 to a number that I could use as an index into a  
> fixed size array?

And what would the point of that be?

.NET already has collection classes that use the hash code directly (e.g.  
Dictionary, Hashtable, etc.).

You can always just use the % operator to take an arbitrary number and map  
it to some arbitrarily smaller range.  But I don't see the point in this  
case.

Pete
jehugaleahsa@gmail.com - 13 Jan 2008 22:40 GMT
> KH wrote:
> > All .NET types have the GetHashCode() method, at minimum inherited from
[quoted text clipped - 27 lines]
>
> - Show quoted text -

const int FIXED_ARRAY_SIZE = 40; // for example
string s = "foobar";
int hc = s.GetHashCode(); // hc can be negative - a common mistake for
newbs
hc %= FIXED_ARRAY_SIZE;
if (hc < 0)
{
   hs += FIXED_ARRAY_SIZE;
}

This will get you within the array. However, I can't see why this is
useful unless you are implementing a simple hash table. In which case,
just use a Dictionary; it will be faster.

Best of luck,
Travis

Free Magazines

Get these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...

Oracle MagazineNetwork ComputingComputer WorldBio-IT WorldeWeekInformation WeekInfosecurity
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.