Hash code: why are string and numeric keys not good for memory-address hashing? - data-structures

In the data structures lecture (still in progress as I write this), our lecturer explained that memory addresses are useful as hash codes.
That made sense, but then he added "except for numeric and string keys" – why?
I thought the reason was that we could then no longer apply a hash function, but according to him that is not true,
since we can either implement a different hash function for strings or use the integer representation of the memory address.
He claimed the reason is that strings are arrays (and numeric keys can be array types as well), and that applying the hash function would only map part of the characters to the 'bucket array'.
The thing is, our lecturer didn't write the lecture notes (he reuses the previous lecturer's notes from last year), and I don't think what he said today is correct. Can someone enlighten me on this, please?

The lecture notes you refer to come directly from Goodrich & Tamassia (both the Algorithm Design book and the Data Structures one). They discuss a variety of hash code functions, such as using the memory address of the object, an integer cast, a component sum, or polynomial accumulation, and note that using the memory address of an object is “good in general, except for numeric and string keys”.
There are times when a hash code that maps an object to an integer based on its memory address is sufficient, even if that object is a string. However, two objects with equal value (a = ‘hello’, b = ‘hello’) would not have the same hash code under this method, since they have different memory addresses. The same applies to other keys such as numeric ones (a = 10 and b = 10 are equal in value but not in memory address).
Consider a simple system that stores a user's password as a hash code. When the user enters a password, the string they typed is hashed with the same hash function and compared against the stored value. The two passwords have the same value (the one the user first created and the one they use to log in), so they must produce the same hash for the login to succeed. We therefore would not want to use the memory address to map the string to an integer in this scenario.
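To make the distinction concrete, here is a small Java illustration (System.identityHashCode is identity-based rather than literally the memory address on modern JVMs, but it makes the same point):

```java
public class HashDemo {
    public static void main(String[] args) {
        String a = new String("hello");
        String b = new String("hello");

        // Identity-based hash: derived from the object itself, not its contents,
        // so two equal strings almost never share it.
        System.out.println(System.identityHashCode(a) == System.identityHashCode(b)); // almost always false

        // Value-based hash: String.hashCode() is computed from the characters,
        // so equal strings always hash the same.
        System.out.println(a.hashCode() == b.hashCode()); // true
    }
}
```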

Related

Any reference to definition or use of the data structuring technique "hash linking"?

I would like more information about a data structure - or perhaps it is better described as a data structuring technique - that was called hash linking when I read about it in an IBM Research Report a long time ago - in the 70s or early 80s. (The RR may have been from the 60s.)
The idea was to store a table (array, vector) of values more compactly when most values fit in a relatively small, compact range but a few values fall (or may fall) well outside that range. Instead of making every element of the table wide enough to hold the entire range, you would store in the table only the values that fit in the small compact range, and put all the entries that didn't fit into a hash table.
One use case I remember being mentioned was bank accounts - you might determine that 98% of the accounts in your bank had balances under $10,000.00, so they would fit nicely in a 6-digit (decimal) field. To handle the very few accounts with balances of $10,000.00 or over, you would hash-link them.
There were two ways to arrange it. Both involved a table (array, vector, whatever) where each entry had enough space to fit the 95-99% case of your data values, plus a hash table where you would put the ones that didn't fit, as key-value pairs (the key being the index into the table, the value being the item's value), where the value field really could fit the entire range of values.
The first way: you pick a sentinel value, depending on your data type. Might be 0, might be the largest representable value. If the value you were trying to store didn't fit in the table, you'd stick the sentinel in there and put the (index, actual value) pair into the hash table. To retrieve, you'd get the value by its index, check whether it was the sentinel, and if it was, look it up in the hash table.
The second way: you have no reasonable sentinel value. No problem. You just store the exceptional values in your hash table, and on retrieval you always look in the hash table first. If the index you're trying to fetch isn't there, you're fine: just get it out of the table itself.
The benefit was said to be saving a lot of storage while only increasing access time by a small constant factor in either case (due to the properties of a hash table).
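A minimal sketch of the sentinel variant, roughly as I am using it now (the field widths, names, and types here are my own choices, not from the report):

```java
import java.util.HashMap;
import java.util.Map;

// Hash-linked table, sentinel variant: common-case values live in a compact
// array of shorts; out-of-range values go into an overflow hash table keyed
// by index.
final class HashLinkedTable {
    private static final short SENTINEL = Short.MIN_VALUE;
    private final short[] compact;                               // fits the 95-99% case
    private final Map<Integer, Long> overflow = new HashMap<>(); // exceptional values

    HashLinkedTable(int size) {
        compact = new short[size];
    }

    void set(int index, long value) {
        if (value > Short.MIN_VALUE && value <= Short.MAX_VALUE) {
            compact[index] = (short) value;   // fits: store it directly
            overflow.remove(index);
        } else {
            compact[index] = SENTINEL;        // doesn't fit: sentinel + hash link
            overflow.put(index, value);
        }
    }

    long get(int index) {
        short v = compact[index];
        return v == SENTINEL ? overflow.get(index) : v;
    }
}
```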
(A related technique is to work it the other way around if most of your values were a single value and only a few were not that value: Keep a fast searchable table of index-value pairs of the ones that were not the special value and a set of the indexes of the ones that were the very-much-most-common-value. Advantage would be that the set would use less storage: it wouldn't actually have to store the value, only the indexes. But I don't remember if that was described in this report or I read about that elsewhere.)
The answer I'm looking for is a pointer to the original IBM report (though my search on the IBM research site turned up nothing), or to any other information describing this technique or using this technique to do anything. Or maybe it is a known technique under a different name, that would be good to know!
Reason I'm asking: I'm using the technique now and I'd like to credit it properly.
N.B.: This is not a question about:
anything related to hash tables as hash tables, especially not linking entries or buckets in hash tables via pointer chains (which is why I specifically did not add the tag hashtable),
an "anchor hash link" - using a # in a URL to point to an anchor tag - which is what "hash link" gets you when you search for it on the intertubes,
hash consing which is a different way to save space, for much different use cases.
Full disclosure: There's a chance it wasn't in fact an IBM report where I read it. During the 70s and 80s I was reading a lot of TRs from IBM and other corporate labs, and from MIT, CMU, Stanford and other university departments. It was definitely in a TR (not a journal or ACM SIG publication) and I'm nearly 100% sure it was IBM (I've got this image in my head ...) but maybe, just maybe, it wasn't ...

hashing vs hash function, don't know the difference

For example, "Consistent hashing" and "Perfect hash function", in wikipedia, I click "hashing" and the link direct to "hash function", so it seems that they have the same meaning, but why does another exist? And is there any difference when using "hashing" or "hash function"? And is it ok to call "consistent hashing" as "consistent hash function"? Thanks!
A hash function takes some input data (typically a bunch of binary bytes, but it could be anything - whatever you make it accept) and calculates a hash value, which is typically an integer (but, again, can be anything). The process of doing this is called hashing.
The hash value is always the same size, no matter what the input looks like. (I suppose you could make a hash function with a variable-size output, but I haven't seen one in the wild yet, and it wouldn't be very practical.) Because there are many more possible input data combinations than there are possible hash values, hashing is by its very nature usually a one-way calculation: you can't normally get the original data back from the hash value.
The main advantages are:
The hash value is always the same size
The same input will always generate the same output.
If it's a good hash function, different inputs will usually generate different outputs, but it's still possible that two different inputs generate the same output (this is called a hash collision).
If you have a cryptographic hash function, you also get one more advantage:
Given only the hash value, it is infeasible to come up with input data that would hash to this value. Never mind recovering the original input data; finding any input data that hashes to the given output value in a useful timeframe is infeasible.
The results of a hash function can be used in various ways. As mentioned in other answers, hash tables are one common use case. Verifying data integrity is another: for example, you download a file, hash it, then check the hash value against the value specified on the page you downloaded the file from. If they don't match, the file was not downloaded correctly. If you combine hash values with public-key cryptography, you get digital signatures. And I'm sure there are other uses to which the principle can be put.
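For instance, a download check along those lines might look like this in Java 17+ (the file name and the expected value are placeholders):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HexFormat;

public class VerifyDownload {
    public static void main(String[] args) throws Exception {
        // Hash the downloaded file with SHA-256.
        byte[] data = Files.readAllBytes(Path.of("download.zip"));
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
        String actualHex = HexFormat.of().formatHex(digest);

        // Compare against the value published on the download page.
        String expectedHex = "<value from the download page>";
        System.out.println(actualHex.equals(expectedHex) ? "OK" : "corrupted download");
    }
}
```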
You can write a hash function, and what it does is hash keys to bins.
In other words, the hash function does the hashing.
I hope that clarifies it.
A HashTable is a data structure in which values are stored under keys for faster access to the elements. The process of populating (and looking up) this data structure is known as hashing.
To do the hashing, you need a function that provides the logic for mapping a key to a location in the table. That function is the hash function.
I hope this clarifies your doubt.
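A tiny sketch of that split in Java (taking the hash modulo the bin count is just one common way a table turns a hash into a bin index):

```java
// The hash function turns a key into an integer (the hashing);
// the table then maps that integer to one of its bins.
static int binFor(String key, int numBins) {
    int hash = key.hashCode();            // hash function: key -> integer
    return Math.floorMod(hash, numBins);  // table: integer -> bin index
}
```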

Basics in Universal Hashing, how to ensure accessibility

To my current understanding, universal hashing is a method whereby the hash function is chosen randomly at runtime in order to guarantee reasonable performance for any kind of input.
I understand we may do this in order to prevent manipulation by somebody deliberately choosing malicious input (a possibility if the hash function is deterministic and known).
My question is the following: is it not true that we still need to guarantee that a key is mapped to the same address every time we hash it? For instance, if we want to retrieve information but the hash function is chosen at random, how do we guarantee we can get back to our data?
A universal family of hash functions is a set of hash functions with the property that, for any two distinct keys, the probability that they collide under a hash function chosen at random from the family is low. Typically, the implementation picks one random hash function from the family when the table is created. Once this hash function is chosen, the hash table works as usual - you use this hash function to compute a hash code for an object, then put the object into the appropriate location. The hash table has to remember which hash function it chose and use it consistently throughout the program, since otherwise (as you've noted) it would forget where it mapped each element.
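As a sketch, a table might draw its function once, at construction, from the classic ((a*x + b) mod p) mod m family and reuse it for every operation (the prime and the key range below are my own assumptions):

```java
import java.util.concurrent.ThreadLocalRandom;

// One hash function drawn at random from the family h(x) = ((a*x + b) mod p) mod m.
// The table keeps a and b for its whole lifetime, so the same key always lands
// in the same bucket. Assumes non-negative int keys below 2^31 - 1.
final class UniversalBucketHash {
    private static final long P = 2_147_483_647L; // the prime 2^31 - 1
    private final long a = ThreadLocalRandom.current().nextLong(1, P); // a in [1, P-1]
    private final long b = ThreadLocalRandom.current().nextLong(0, P); // b in [0, P-1]
    private final int numBuckets;

    UniversalBucketHash(int numBuckets) {
        this.numBuckets = numBuckets;
    }

    int bucket(int key) {
        return (int) (((a * key + b) % P) % numBuckets);
    }
}
```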
Hope this helps!

How to match against a large collection efficiently

I have a large collection of objects of type foo. Each object of type foo has say 100 properties (all strings) plus an id. An object of type bar also has these 100 properties.
I want to find the matching object of type foo from the collection where all these properties match with that of bar.
Aside from the brute-force method, is there an elegant algorithm where we calculate a signature for the foo objects once, do the same for each bar object, and match more efficiently?
The foos are in the thousands and the bars are in the millions.
Darth Vader has a point there... and I never thought that I'd be siding with the dark side!
I'll go over what I think are the best tools for the trade:
Embedded database: Google's LevelDB - it's faster than most database solutions out there.
Hashing function: Google's CityHash - it's fast and it offers excellent hashing!
JSON Serialization
The Embedded Database
The goal of using an embedded database is that you will get performance that will beat most database solutions that you're likely to encounter. We can talk about just how fast LevelDB is, but plenty of other people have already talked about it quite a bit so I won't waste time. The embedded database allows you to store key/value pairs and quickly find them in your database.
The Hashing Function
A good hashing function will be fast and will provide a good distribution with few collisions. CityHash is very fast and it has very good distribution, but again: I won't waste time since a lot of other people have already talked about the performance of CityHash. You would use the hashing function to hash your objects and then use the resulting key to look them up in the database.
JSON Serialization
JSON Serialization is the antithesis of what I've shown above: it's very slow and it will diminish any performance gain you achieved with CityHash, but it gives you a very simple way to hash an entire object. You serialize the object to a JSON string, then you hash the string using CityHash. Despite the fact that you've lost the performance gains of CityHash because you spent so much time serializing the object to JSON, you will still reap the benefits of having a really good hashing function.
The Conclusion
You can store billions of records in LevelDB and you will be able to quickly retrieve the exact value you're looking for just by providing the hash for it.
In order to generate a key, you can use JSON serialization and CityHash to hash the JSON string.
Use the key to find the matching object!
Enjoy!
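A rough sketch of that pipeline using standard-library stand-ins (a SHA-256 digest instead of CityHash, an in-memory HashMap instead of LevelDB, and a hand-rolled canonical string instead of JSON), just to show the shape of the approach:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

public class SignatureIndex {
    // Serialize the properties in a fixed order so equal objects always
    // produce the same string (stand-in for JSON serialization).
    static String canonical(String[] properties) {
        return String.join("\u0000", properties);
    }

    // Digest the canonical string (stand-in for CityHash).
    static String signature(String[] properties) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(canonical(properties).getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        Map<String, Integer> fooIdBySignature = new HashMap<>(); // stand-in for LevelDB
        fooIdBySignature.put(signature(new String[] {"p1", "p2"}), 42);             // index a foo once
        Integer match = fooIdBySignature.get(signature(new String[] {"p1", "p2"})); // look up a bar
        System.out.println(match); // 42
    }
}
```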
If ALL of the properties match, that means they are effectively the same objects. Is that correct?
In any case, you want to use a Map/Dictionary/Table with a good hashing algorithm to find matching objects.
Whichever language you are using, you should override the gethashcode and equals methods to implement it.
If you have a good hashing algorithm, your access time will be O(1); otherwise it can be up to O(n).
Given your memory limitation, you want to store the foos in the map; storing the bars might require lots of space, which you might not have.
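In Java terms, that override might look roughly like this (Foo and the property array are placeholders for the real classes; only the properties take part in equality and hashing, not the id):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class Foo {
    final int id;
    final String[] properties; // the 100 strings

    Foo(int id, String[] properties) {
        this.id = id;
        this.properties = properties;
    }

    // Only the properties participate in hashing and equality, not the id.
    @Override public int hashCode() { return Arrays.hashCode(properties); }

    @Override public boolean equals(Object o) {
        return o instanceof Foo && Arrays.equals(properties, ((Foo) o).properties);
    }
}

class Matcher {
    public static void main(String[] args) {
        // Index the thousands of foos once; each bar lookup is then O(1) on average.
        Map<Foo, Integer> index = new HashMap<>();
        index.put(new Foo(7, new String[] {"a", "b"}), 7);

        String[] barProperties = {"a", "b"};
        Integer matchingFooId = index.get(new Foo(-1, barProperties)); // the id is ignored by equals
        System.out.println(matchingFooId); // 7
    }
}
```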
A hash is very nice and simple to implement, but I want to suggest this algorithm:
Map your 100 string properties to one big string (for example, concatenate them with a fixed length per property); that string should be a unique id for the object. So we have thousands of strings in the first set and millions of strings in the second.
The problem reduces to determining, for each string in the second set, whether the first set contains it.
Build a trie data structure on the first set.
The complexity of checking whether a string S is in the trie is O(|S|), where |S| is the length of S.
So the complexity of the algorithm is O(Sum(|Ai|) + Sum(|Bi|)) = O(max(Sum(|Ai|), Sum(|Bi|))) = O(Sum(|Bi|)) for your problem, where Ai is the unique string id in the first set and Bi the unique string id in the second set.
UPDATE:
The trie takes O(Sum(|Ai|) * |Alphabet|) space in the worst case.
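A minimal trie along those lines might look like this in Java (children kept in a map for brevity; a fixed alphabet-sized array per node would match the space bound above):

```java
import java.util.HashMap;
import java.util.Map;

final class Trie {
    private final Map<Character, Trie> children = new HashMap<>();
    private boolean terminal; // true if a whole id string ends at this node

    // O(|s|) insertion of one concatenated id string from the first set.
    void insert(String s) {
        Trie node = this;
        for (char c : s.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Trie());
        }
        node.terminal = true;
    }

    // O(|s|) membership test for an id string from the second set.
    boolean contains(String s) {
        Trie node = this;
        for (char c : s.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return false;
        }
        return node.terminal;
    }
}
```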

How to test a hash function?

Is there a way to test the quality of a hash function? I want to have a good spread when it is used in a hash table, and it would be great if this were verifiable in a unit test.
EDIT: For clarification, my problem was that I have used long values in Java in such a way that the first 32 bits encoded one ID and the last 32 bits encoded another ID. Unfortunately, Java's hash of long values just XORs the first 32 bits with the last 32 bits, which in my case led to very poor performance when used in a HashMap. So I need a different hash, and would like to have a unit test so that this problem cannot creep in any more.
You have to test your hash function using data drawn from the same (or similar) distribution that you expect it to work on. When looking at hash functions on 64-bit longs, the default Java hash function is excellent if the input values are drawn uniformly from all possible long values.
However, you've mentioned that your application uses the long to store essentially two independent 32-bit values. Try to generate a sample of values similar to the ones you expect to actually use, and then test with that.
For the test itself, take your sample input values, hash each one and put the results into a set. Count the size of the resulting set and compare it to the size of the input set, and this will tell you the number of collisions your hash function is generating.
For your particular application, instead of simply XORing them together, try combining the two 32-bit values the way a typical good hash function would combine two independent ints: multiply by a prime and add.
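As a sketch, a spread check plus a multiply-and-add combiner might look like this (the sample generator is a placeholder; in a real test you would generate IDs shaped like your actual data and assert that the collision count stays under a threshold of your choosing):

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ThreadLocalRandom;

public class HashSpreadCheck {
    // Combine the two 32-bit IDs the way a typical good hashCode() would:
    // multiply one half by a prime and add the other, instead of XORing.
    static int betterHash(long packed) {
        int hi = (int) (packed >>> 32);
        int lo = (int) packed;
        return 31 * hi + lo;
    }

    public static void main(String[] args) {
        Set<Integer> hashes = new HashSet<>();
        int samples = 100_000;
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        for (int i = 0; i < samples; i++) {
            // Placeholder sample: two independent 32-bit IDs packed into one long.
            long packed = ((long) rnd.nextInt() << 32) | (rnd.nextInt() & 0xFFFF_FFFFL);
            hashes.add(betterHash(packed));
        }
        int collisions = samples - hashes.size();
        System.out.println("collisions: " + collisions);
    }
}
```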
First I think you have to define what you mean by a good spread to yourself. Do you mean a good spread for all possible input, or just a good spread for likely input?
For example, if you're hashing strings that represent proper full (first+last) names, you're not going to likely care about how things with the numerical ASCII characters hash.
As for testing, your best bet is probably to get a huge or random set of the input data you expect, push it through the hash function, and see how the spread ends up. There's not likely to be a magic program that can say "Yes, this is a good hash function for your use case." However, if you can programmatically generate the input data, you should easily be able to create a unit test that generates a significant amount of it and then verifies that the spread is within your definition of good.
Edit: In your case with a 64 bit long, is there even really a reason to use a hash map? Why not just use a balanced tree directly, and use the long as the key directly rather than rehashing it? You pay a little penalty in overall node size (2x the size for the key value), but may end up saving it in performance.
If you're using a chaining hash table, what you really care about is the number of collisions. This would be trivial to implement as a simple counter on your hash table: every time an item is inserted and the table has to chain, increment a chain counter. A better hashing algorithm will result in a lower number of collisions. A good general-purpose table hashing function to check out is djb2.
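For reference, djb2 adapted to Java (the classic version is written for C strings; here the running value is simply kept in an int):

```java
// djb2: start at 5381 and compute hash = hash * 33 + c for every character.
static int djb2(String s) {
    int hash = 5381;
    for (int i = 0; i < s.length(); i++) {
        hash = (hash << 5) + hash + s.charAt(i); // hash * 33 + c
    }
    return hash;
}
```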
Based on your clarification:
I have used long values in Java in such a way that the first 32 bits encoded one ID and the last 32 bits encoded another ID. Unfortunately, Java's hash of long values just XORs the first 32 bits with the last 32 bits, which in my case led to very poor performance when used in a HashMap.
it appears you have some unhappy "resonances" between the way you assign the two ID values and the sizes of your HashMap instances.
Are you explicitly sizing your maps, or using the defaults? A QAD check seems to indicate that a HashMap<Long,String> starts with a 16-bucket structure and doubles on overflow. That would mean that only the low-order bits of the ID values are actually participating in the hash bucket selection. You could try using one of the constructors that takes an initial-size parameter and create your maps with a prime initial size.
Alternately, Dave L's suggestion of defining your own hashing of long keys would allow you to avoid the low-bit-dependency problem.
Another way to look at this is that you're using a primitive type (long) as a way to avoid defining a real class. I'd suggest looking at the benefits you could achieve by defining the business classes and then implementing hash-coding, equality, and other methods as appropriate on your own classes to manage this issue.
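For example, a small key class in place of the packed long might look like this (the class and field names are illustrative):

```java
import java.util.Objects;

// A value class replacing the packed long: two IDs with value-based
// equals/hashCode, usable directly as a HashMap key.
final class IdPair {
    final int firstId;
    final int secondId;

    IdPair(int firstId, int secondId) {
        this.firstId = firstId;
        this.secondId = secondId;
    }

    @Override public int hashCode() { return Objects.hash(firstId, secondId); }

    @Override public boolean equals(Object o) {
        return o instanceof IdPair
            && firstId == ((IdPair) o).firstId
            && secondId == ((IdPair) o).secondId;
    }
}
```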
