.NET Forum / .NET Framework / New Users / November 2006
GetHashCode for Objects that Compare Based on Value (Not reference equality)
|
|
Thread rating:  |
Matt - 13 Nov 2006 22:00 GMT Hello,
In a few classes I have overridden Equals, ==, and != and would like to generate an optimal hash code but I am a little confused about the best approach to take. All of the classes that I am developing use a value comparison, not reference, to determine equality. The classes were all originally generated from an XSD, so there is a hierarchical organization of classes. Because I am doing a value comparison, the actual contents of the objects are relevant, but this gets tricky because the majority of these classes contain arrays of additional objects, which in turn contain more objects.
For example, one boardgame is only equivalent to another board game if it contains the same number of game pieces, and those pieces are on the same relative spots on the board, and the pieces are the same color, and the pieces have the same stickers on them, and the stickers have the same adhesive, etc.. You can see how it gets laborious to generate a hash code based on the value of one's contents if it is composed of objects which are composed of other objects and so on. The boardgame is not the same if the adhesive on the game pieces is different, so the hash codes should be different in that case, right? If the hash code were based simply on the number of pieces, their location, and their color then I could expect collisions and that would introduce all sorts of confusion.
Would you just use Object.GetHashCode and forget about it... or would you generate a hash code from an exhaustive dig into an object's contents? I was thinking of generating a hashcode from the hash codes of the member variables? What is the best approach for this issue?
Thanks
ssamuel - 13 Nov 2006 22:35 GMT Matt,
Your best bet is to XOR the GetHashCode() of all your significant non-computed member vairables:
public class myClass {
private int intID = 0; private string strName = null; private decimal decPrice = 0; private decimal decQuantity = 0;
private decimal decTotalValue = 0;
public void Calculate() { this.decTotalValue = this.decPrice * this.decQuantity; }
public override int GetHashCode() { int intHash = this.intID.GetHashCode() ^ this.decPrice.GetHashCode() ^ this.decQuantity.GetHashCode(); return this.strName == null ? intHash : intHash ^ this.strName.GetHashCode(); }
}
If everyone's done their homework and implemented honest GetHashCode() overrides in all the classes of your member variables -- everything in the framework has -- your GetHashCode() should satisfy the requirements. Note that I'm not including decTotalValue, because it's nothing more than decPrice times decQuantity, and now's a good time to ask yourself, "do I really need to cache that value?" Just beware of the normal pitfalls, like null member variables (a good reason to use structs and enums when they'll suffice instead of full-blown classes).
Stephan
> Hello, > [quoted text clipped - 27 lines] > > Thanks Matt - 14 Nov 2006 00:46 GMT Thanks for the quick reply, would you still XOR everything if the member variables were not primitive - so the GetHashCode calls would call other GetHashCode calls internally? My main concern is the performance hit, if I have a lot of objects in a hierarchical fashion and I call GetHashCode on the topmost object which in turn generates a hash based on its own member variables, each of which in turn generates their own hash code from its member variables, and so on and so forth.. Generating a hash code at the top level could result in a huge number of operations.
For a simple ADT with primitive types, your approach makes total sense to me. When collections and nested types are brought into the picture, the performance hit seems like it could be tremendous if I am using my objects extensively in collections that query the hash code frequently. Since hash codes in .NET are not guaranteed to be constant, and since it should change if the contents of a value based equality object change, then I would imagine that the included .NET collections would make potentially frequent calls to GetHashCode while performing common operations. Am I correctly understanding this?
Thanks!
> Matt, > [quoted text clipped - 68 lines] > > > > Thanks ssamuel - 15 Nov 2006 20:23 GMT > For a simple ADT with primitive types, your approach makes total sense > to me. When collections and nested types are brought into the picture, [quoted text clipped - 5 lines] > make potentially frequent calls to GetHashCode while performing common > operations. Am I correctly understanding this? Hashtable shouldn't make too many calls to GetHashCode(). I can't speak for the actual implementation of System.Collections.Hashtable, but it's entirely possible that GetHashCode() would only need to be called once per object, on insert. If you don't move or resort stuff, the original hash should stand. I don't know all that much about specific performance enhancements that would require re-hashing, but whoever used one of these would have to account for hash generation time when deciding if implementation would save or lose time.
As for increasing times in cascaded hashes, your statements are dead-on. GetHashCode() should return "quickly," but that's a relative thing. If you have sub objects, they too should return quickly, so overall hash computation should be fast. Keeping things to simple integer math speeds a lot up, as opposed to any string operation, which is may take close to an order of magnitude longer. If you're designing a data structure, you may want to build the hash while you build the data structure to prevent having to do it later. A good DBA type may be able to help you with this: it's like building an table index after adding data instead of before.
This is a necessary evil. In order to have a competent hash, you have to have some representation of all the essential component parts of an object. While it's true that we want to reduce the number of cycles our programs suck up, the whole point of the CPU is to process things that need to be processed.
Stephan
Jon Skeet [C# MVP] - 15 Nov 2006 23:05 GMT <snip>
> As for increasing times in cascaded hashes, your statements are > dead-on. GetHashCode() should return "quickly," but that's a relative [quoted text clipped - 6 lines] > able to help you with this: it's like building an table index after > adding data instead of before. One option is to determine the hash lazily. Either have one "magic value" which indicates "I haven't worked out the hash yet" (and make sure your hash algorithm never ends up with that as the value) or have a separate flag to say whether or not the hash has been calculated.
This is the approach Sun's Java implementation takes for strings, although there's nothing there to stop the "magic number" (0) being the result of a real hash, and thus being recalculated every time. Indeed, it's guaranteed to be the case for empty strings - although at that point it doesn't take long to recalculate :)
 Signature Jon Skeet - <skeet@pobox.com> http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet If replying to the group, please do not mail me too
Dale - 14 Nov 2006 14:26 GMT I usually combine the ToString of each significant property and return the hashcode of the combined string. I also override ToString on most of my classes so that the ToString method returns a string that uniquely identifies the instance. That takes care of the values for reference types.
 Signature Dale Preston MCAD C# MCSE, MCDBA
> Matt, > [quoted text clipped - 68 lines] > > > > Thanks Jon Skeet [C# MVP] - 14 Nov 2006 20:11 GMT > Your best bet is to XOR the GetHashCode() of all your significant > non-computed member vairables: XOR can easily lead to collisions - in your example, for instance, the hashcodes of the two decimals are XORed - that means that an item with a price and quantity of 2 will (everything else being equal) be the same as an item with a price and quantity of 1.
I use the trick from Josh Bloch's Java book:
int hash = 37; hash = 17*hash + hashOfFirstMember; hash = 17*hash + hashOfSecondMember; hash = 17*hash + hashOfThirdMember; etc
37 and 17 aren't magical, just coprime. It's not guaranteed to be a *great* hashcode, but it's a little better than XOR :)
 Signature Jon Skeet - <skeet@pobox.com> http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet If replying to the group, please do not mail me too
ssamuel - 14 Nov 2006 20:55 GMT > XOR can easily lead to collisions - in your example, for instance, the > hashcodes of the two decimals are XORed - that means that an item with > a price and quantity of 2 will (everything else being equal) be the > same as an item with a price and quantity of 1. Jon,
This is true. It's also irrelevant.
According to the framework documentation (ask if you need a link), GetHashCode() isn't meant for uniqueness or object identification. It's meant to provide a good (i.e. -- random) distribution across a hashtable. Aside from the fact that the documentation recommends XOR implementations, it's fairly clear from a mathematical basis that XOR functions provide better orthogonality when correlating multiple distributions, which is what we're trying to accomplish.
I believe the example you gave, while neat, simply adds a constant offset to your distribution. All this serves to do is skew your random variable: it provides no benefit to randomness of the resulting distribution, all else being equal. The probability distribution is the same.
There are certainly times when a purely random distribution of hashes is undesireable. However, you'd have to know quite a bit about the nature of your data to determine what a better hashing algorithm would be. I'll admit that I didn't ask if this was the case, largely because I don't want to go through the analysis for someone else's data, and also because that's an optimization step that shouldn't come into play until it's clear that a poor hashing algorithm is killing hash performance. I believe the question was directed towards, "hey, the compiler's making me do it, so how do I make it shut up simply?"
Stephan
Jon Skeet [C# MVP] - 14 Nov 2006 21:18 GMT > > XOR can easily lead to collisions - in your example, for instance, the > > hashcodes of the two decimals are XORed - that means that an item with > > a price and quantity of 2 will (everything else being equal) be the > > same as an item with a price and quantity of 1. > > This is true. It's also irrelevant. Not unless you think that collisions are irrelevant. They're irrelevant (unless your code is broken) in terms of *correctness* but they're important in terms of performance. Let's face it, "return 0;" is a *valid* implementation of GetHashCode(), but not a terribly useful one.
> According to the framework documentation (ask if you need a link), > GetHashCode() isn't meant for uniqueness or object identification. It's > meant to provide a good (i.e. -- random) distribution across a > hashtable. Yes - but that suggests that collisions should be avoided where possible. Using XOR increases the chance of collision.
> Aside from the fact that the documentation recommends XOR > implementations There are lots of bad recommendations in the docs :)
> it's fairly clear from a mathematical basis that XOR > functions provide better orthogonality when correlating multiple > distributions, which is what we're trying to accomplish. Not really. For example, using XOR with lots of numbers which all give small hashcodes will only give you a range with
> I believe the example you gave, while neat, simply adds a constant > offset to your distribution. All this serves to do is skew your random > variable: it provides no benefit to randomness of the resulting > distribution, all else being equal. The probability distribution is the > same. I believe that in real life it's more likely that you'll see hash collisions using XOR than using the method I've shown.
For instance, if you have two string properties, it's more likely in the real world that you'll have two instances which either both use the same value for both properties (although that value may be different between the instances) or that use the same values in a different order.
> There are certainly times when a purely random distribution of hashes > is undesireable. However, you'd have to know quite a bit about the [quoted text clipped - 5 lines] > performance. I believe the question was directed towards, "hey, the > compiler's making me do it, so how do I make it shut up simply?" My point was trying to get to a hashing algorithm which was *more* random than yours.
Here's an example program - it mimics a class containing 3 ints, all of which will be in the range -10 to 10. In real life, this is far from uncommon - even if the theoretical range of the data is large, the actual range is often very small. We calculate every possible hashcode using both methods, and then see how many *different* ones there are in each case.
Using XOR, you only get 32 possible hashcodes. With multiplication (even completely "untuned") you get 6141. With a *tiny* bit of knowledge/guesswork about the data, you can improve this to the full 9261 possible values, just by changing the multiplication factor.
Why use a hashcode algorithm which is more likely to result in duplicate hashcodes (which won't prevent anything from working, but will slow things down) when it's easy to give a wider range?
using System; using System.Collections.Generic;
class Test { static void Main() { // Use maps as sets Dictionary<int,int> xorMap = new Dictionary<int,int>(); Dictionary<int,int> multMap = new Dictionary<int,int>(); for (int x=-10; x <= 10; x++) { for (int y=-10; y <= 10; y++) { for (int z=-10; z <=10; z++) { int xorHash = x^y^z; int multHash = 37; multHash = 17*multHash + x; multHash = 17*multHash + y; multHash = 17*multHash + z; xorMap[xorHash] = xorHash; multMap[multHash] = multHash; } } } Console.WriteLine ("Number of different hashcodes using XOR: "+ xorMap.Count); Console.WriteLine ("Number of different hashcodes using */+: "+ multMap.Count); } }
 Signature Jon Skeet - <skeet@pobox.com> http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet If replying to the group, please do not mail me too
ssamuel - 14 Nov 2006 22:35 GMT Jon,
I highly respect you for taking the time to write sample code, but I think you're chasing a proof by example. Given your assumptions, you are completely correct. The fact that you have to justify your assumptions as "in real life," which is a highly subjective assumption, your view being one with which I cannot agree, proves my point: if you know your data well, you can always come up with a better hashing algorithm.
My proof is more abstract than yours:
1. It is too hard to collect all the data used as member variables in every class in every industry through the history, present and future of all C# applications.
2. Therefore, one cannot characterize 'all data' or even 'real life data.'
(For the record, your posit that an array of evenly distributed integers between any x and -x is common would need lots more empirical data before I begin to believe it; in the last three industries I've worked in, such data is uncommon. I claim that your other posit, that a well-designed class should contain two string members that contain the same value, is very poor.)
3. Uncharacterizable data should be considered randomly distributed, by the definition of random distribution, and thus should data in member variables of a given class.
4. Given two random distributions, A and B, A XOR B is random.
5. It follows that given a class with random member variable values, XOR-ing member variable values will give a random hash.
QED, unless you have specific characterization of your data, XOR is efficient. Seeing that it's the method recommended in the docs -- I won't even touch your comment about their incorrectness -- I'll follow Ockham's Razor and stick with it.
Mind you, I'm not trying to say your algorithm is incorrect. I'm simply stating that one should implement the simplest efficient algorithm until optimization, when one can gain insight into the true distribution of values in ones real data set.
Stephan
Jon Skeet [C# MVP] - 14 Nov 2006 23:56 GMT > I highly respect you for taking the time to write sample code, but I > think you're chasing a proof by example. Given your assumptions, you [quoted text clipped - 5 lines] > > My proof is more abstract than yours: It's not really more abstract. You're effectively saying that a good default position is to treat all data as if it were randomly and evenly distributed. I'm saying that's not a good default position because so much of the time in the real world we see repeated values, or values which tend to lie within relatively small ranges.
> 1. It is too hard to collect all the data used as member variables in > every class in every industry through the history, present and future > of all C# applications. > > 2. Therefore, one cannot characterize 'all data' or even 'real life > data.' That's false logic - or at least, it's effectively claiming that because we can't know *everything* it's not worth using what we *do* know as a reasonable representative sample.
Here's a counterexample: I can't collect all the data used in every English language text ever written, but if I were to look at all the English language books I happen to have in my book cases, I can reasonably quickly come to the conclusion that 'e' and 's' are used more often than 'x' and 'z'. Similarly I don't think it's unreasonable to suggest that the number 8 comes up more often as a value for real- life data stored in ints than 8327646.
> (For the record, your posit that an array of evenly distributed > integers between any x and -x is common would need lots more empirical > data before I begin to believe it; in the last three industries I've > worked in, such data is uncommon. a) There's no need for the range to be -x to x. b) There's no need for the distribution to be even
In fact, there's no need for the values even to be in the same range - try the code I gave with different ranges for the three variables: you'll never get out of the number of bits which can be different. With my algorithm you quickly get a broader range.
Do you really think it's particularly uncommon for real-world data to not end up using the full range of the datatype's possible values?
> I claim that your other posit, that a > well-designed class should contain two string members that contain the > same value, is very poor.) I didn't say that well-designed classes "should" contain two string members that contain the same value - I claimed it often happened. Think "shipping address" and "invoice address" - as references to Address objects, if you will (it really doesn't matter whether they're strings or not - the point is that repeating a value effectively cancels out the first value's hashcode when using XOR).
> 3. Uncharacterizable data should be considered randomly distributed, by > the definition of random distribution, and thus should data in member > variables of a given class. I think that's as much a guess as any other distribution - but I'd consider it far worse than my guess that numeric values, particularly integers, are often relatively small, usually positive. This is reasonable, as many real-world values have the same attributes. Consider:
1) Age 2) Day of month 3) Month of year 4) Year 5) House number 6) Price (in cents) 7) Length of DVD or CD in minutes 8) Dimension (eg in centimeters) 9) Number of messages in my inbox 10) Number of replies to a blog post
Many of these will end up being represented using an int in simple classes. Do you really believe they have a uniform random distribution over the full range of ints?
> 4. Given two random distributions, A and B, A XOR B is random. > > 5. It follows that given a class with random member variable values, > XOR-ing member variable values will give a random hash. If life really consisted of uniformly distributed random values, XOR would be fine. However, I contest that the real world isn't like that.
> QED, unless you have specific characterization of your data, XOR is > efficient. Seeing that it's the method recommended in the docs -- I > won't even touch your comment about their incorrectness -- I'll follow > Ockham's Razor and stick with it. Why won't you touch my comment about the docs not being correct? I've submitted many corrections to MSDN before, and there are *huge* numbers of bad code examples. I'm sure I could easily find some if you doubt me...
Now, as to it being good solely by being recommended in documentation - "my" algorithm is recommended in a book which is widely recognised as a great text. Okay, so it's about Java, but I don't believe that there's any significant difference between .NET and Java that would affect the advice about hashcodes.
> Mind you, I'm not trying to say your algorithm is incorrect. I'm simply > stating that one should implement the simplest efficient algorithm > until optimization, when one can gain insight into the true > distribution of values in ones real data set. I don't think there's any significant difference in simplicity, to be honest. I *do* believe you'll get more collisions in the real world using XOR than using multiply-and-add.
 Signature Jon Skeet - <skeet@pobox.com> http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet If replying to the group, please do not mail me too
ssamuel - 15 Nov 2006 00:47 GMT Jon,
You've proven my point. Every time you list an example, you're demonstrating that by knowing your data, you can come up with a better hash than a generic implementation.
The likelihood of producing collisions using my method is sufficiently low in classes that contain the following data sets:
1. stock ticker, price 2. patient name, age, ssn, collection of descriptive flags 3. date/time, event description, event data
Further, for those sets of data, an XOR hash produces a sufficiently random distribution for hashes to work just fine. Your algorithm works just as well in these cases, but mine does well also. If you disagree with that, we need to start to develop a metric for how "good" a hash has to be and what your benchmark is. Then we're departing from the realm of developing a good generic algorithm; thus specific examples haven't gotten us far.
Saying that XOR produces collisions is not very useful. XOR works on the basis of collisions, so the task is not to prove that you could have collisions, but that the collisions inside two data structures that contain this data would lead to low orthogonality if the data was highly orthogonal.
Your argument that you can generalize about the English language doesn't apply because your assumptions have no bearing on the assumptions that make up this case. It's about as logical as me saying, "I'm right because the Cardinals won the World Series in 2006," then going on to prove that the Cardinals did in fact win.
You've made statements about how you have some ability to characterize all "real world" data. All you've done to prove it is list single data types that aren't themselves randomly distributed, although some that are used as commonly as those which you didn't mention are. Correlating two or more of those changes the orthogonality of your random variables, be they of Gaussian, random, lognormal or any other distribution. A few cases, or two dozen cases for that matter, don't prove a point.
Your hash works well in some scenarios and my hash works well in general scenarios. That's not to say that yours doesn't work well in general scenarios -- it appears as if it would -- or that there are some in which mine doesn't work well -- we know there are. You have, however, failed to prove that mine doesn't work well in the "real world."
I'm not going to get into a petty argument over whose book is better. You're certainly not the only person ever to notice an error somewhere, and you're even more certainly not an authority on whether MSDN has more errors per unit volume than any particular book in existence, except perhaps any that you may have written. Either prove that the specific passage I quoted is incorrect or drop the "MSDN is the root of all evil" argument.
Stephan
Jon wrote:
> > I highly respect you for taking the time to write sample code, but I > > think you're chasing a proof by example. Given your assumptions, you [quoted text clipped - 120 lines] > http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet > If replying to the group, please do not mail me too
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|