I have a very basic question. Believe me, I have read many books and watched videos, but I have not been able to find an answer.
Suppose we have a HashMap.
I have three values (a, b, c) that map to the same hash; a and b are equal, but c is different.
If I add only a and b to the HashMap, how does the HashMap know it is NOT a collision?
Suppose we have a HashMap. Now I call put(obj1, "Test") and then put(obj2, "Test"), where obj1 and obj2 hash to the same value. Can you tell me what the HashMap is going to store for these two calls?
Will it store the actual objects?
If not, how will it decide on the second call that it is not a collision, given that obj1 and obj2 are the same?
Thanks
Most hashtables require two bits of support from your storage objects - GetHashCode() and Equals().
If two objects return the same GetHashCode() and their comparison with Equals() returns true, they represent the same data, and thus it is not a collision, just a duplicate entry.
If two objects return the same GetHashCode() and their comparison with Equals() returns false, then the hash table knows that the objects represent different things, and thus treats it as a collision.
This is why in many OOP languages, like C#, you have to override/implement GetHashCode() and Equals() in your storage objects. If you ever implement those methods such that two objects compare equal with Equals() but return different values from GetHashCode(), then you have a bug.
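Here is a minimal sketch of that contract in Java (where the equivalents are hashCode() and equals()); the Point class is purely hypothetical, used only to show how HashMap.put() tells a duplicate key apart from a collision:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Objects;

    class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Point)) return false;
            Point p = (Point) o;
            return x == p.x && y == p.y;
        }

        @Override
        public int hashCode() {
            return Objects.hash(x, y);   // equal objects must return equal hash codes
        }

        public static void main(String[] args) {
            Map<Point, String> map = new HashMap<>();
            Point obj1 = new Point(1, 2);
            Point obj2 = new Point(1, 2);      // same hash code, and equals() returns true
            map.put(obj1, "Test");
            map.put(obj2, "Test");             // treated as the same key: the entry is overwritten
            System.out.println(map.size());    // prints 1, not 2
        }
    }

If equals() had returned false for obj1 and obj2, the table would instead keep both entries in the same bucket, i.e. handle it as a collision.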
According to Wikipedia:
In computer science, a hash table or hash map is a data structure that
uses a hash function to map identifying values, known as keys (e.g., a
person's name), to their associated values (e.g., their telephone
number). Thus, a hash table implements an associative array.
Wikipedia also describes an associative array as an "abstract data type composed of a collection of (key, value) pairs".
So yes, hash tables do know their "tenants".
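For illustration only, a tiny Java example of the name-to-phone-number mapping described in that quote:

    import java.util.HashMap;
    import java.util.Map;

    public class PhoneBook {
        public static void main(String[] args) {
            // Keys are the identifying values (names); values are the associated data (numbers).
            Map<String, String> phoneBook = new HashMap<>();
            phoneBook.put("Alice", "555-0100");
            phoneBook.put("Bob", "555-0199");
            System.out.println(phoneBook.get("Alice")); // 555-0100
        }
    }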
For resolving hashing collision in the Hash Table data structure, we have one very popular strategy called Separate Chaining.
I'm aware that in the Separate Chaining strategy, keys that collide into the same index of the backing array (because they hash to the same value) are stored in linked lists.
I wonder whether the backing array has type LinkedList<E>[] from the moment the hash table is created (when implementing the separate chaining strategy), or whether it is int[] and gets converted to a LinkedList<E>[] array after the first collision?
Having a linked list as every element of the backing array does not seem like the most optimal solution: it means that each of those linked lists holds elements which, in turn, are entries/buckets of key-value pairs, and all of this consumes a lot of memory and resources, I reckon.
I did quite a bit of research in different books and academic articles, yet I still can't really get a clear answer on this.
Yes, separate chaining will cost more memory than probing or re-hashing. But the benefit is that you get more items in the hash table before performance begins to suffer. At some point you still have to re-index: typically when you realize that some bucket is over-represented or when the total number of occupied buckets exceeds some threshold.
Note that the backing array itself isn't a linked list. The backing array for a hash table that uses probing or re-hashing will probably be a dynamically-sized array of entries. Your entry would be something like:
    class Entry {
        String key;
        SomeObject value;
    }
If you're using separate chaining, the Entry object gets an additional field: a reference to the next item that hashed to the same bucket:
    class Entry {
        String key;
        SomeObject value;
        Entry next;
    }
The memory difference for the first item really isn't enough to worry about.
It's possible to write the code so that if a bucket has but a single item, it will contain just the key and value, and the bucket is converted to a linked list only on first collision. There is perhaps a small memory win there, and an even smaller performance gain. But the code is more complex and the gains aren't huge unless you know that the majority of your buckets won't have any collisions. Not worth the trouble of implementing, testing, and maintaining two different code paths.
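To make the above concrete, here is a minimal sketch of a separate-chaining table; the class and method names are placeholders of mine, not from any particular library, and real implementations also resize the backing array as it fills up:

    class ChainedHashTable {
        // Each entry carries the key, the value, and a link to the next entry in the same bucket.
        static class Entry {
            final String key;
            Object value;
            Entry next;
            Entry(String key, Object value, Entry next) {
                this.key = key;
                this.value = value;
                this.next = next;
            }
        }

        private final Entry[] buckets = new Entry[16]; // fixed capacity to keep the sketch short

        private int indexFor(String key) {
            return Math.floorMod(key.hashCode(), buckets.length);
        }

        public void put(String key, Object value) {
            int i = indexFor(key);
            for (Entry e = buckets[i]; e != null; e = e.next) {
                if (e.key.equals(key)) {   // same key: overwrite, this is a duplicate, not a collision
                    e.value = value;
                    return;
                }
            }
            // different key hashed to the same bucket: chain a new entry at the head of the list
            buckets[i] = new Entry(key, value, buckets[i]);
        }

        public Object get(String key) {
            for (Entry e = buckets[indexFor(key)]; e != null; e = e.next) {
                if (e.key.equals(key)) {
                    return e.value;
                }
            }
            return null;
        }
    }

Note that the backing array here is Entry[] from the start; a chain simply stays one element long until a collision actually happens.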
As far as I know, a hash table uses a hash key to store items, whereas a dictionary uses simple key-value pairs to store items. Does that mean a dictionary is a lot faster than a hash table? (That is what I think; please correct me if I am wrong.)
Does this mean I should never use hash table?
The answer is "it depends".
A Dictionary is merely a way to map a key to a value. You can either use a library or implement one yourself.
A hash table is a specific way to implement a dictionary, where the key's location is based upon a hash function. This function is usually based on modulo arithmetic, which means that two distinct values may end up with the same hash key, and therefore there will be a collision between the keys. It is then up to you (or whoever implements the hash table) to determine how to resolve the collision. You could chain the values at the same key, re-hash and use a sub-hash table, or you may even want to start over with a new hash function (which would be expensive).
The underlying implementation of the dictionary (for example, a hash table) will affect your lookup performance.
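As a small illustration of the modulo arithmetic and the resulting collision (the bucket count of 8 is just an assumption for the example):

    public class ModuloDemo {
        public static void main(String[] args) {
            int buckets = 8;                       // hypothetical table size
            String a = "Aa";
            String b = "BB";                       // "Aa" and "BB" famously share a Java hashCode
            System.out.println(a.hashCode());      // 2112
            System.out.println(b.hashCode());      // 2112
            // Both keys land in the same bucket, so the table must resolve the collision.
            System.out.println(Math.floorMod(a.hashCode(), buckets)); // 0
            System.out.println(Math.floorMod(b.hashCode(), buckets)); // 0
        }
    }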
I am taking a course on data structures on Coursera, and I recently read about universal families of hash functions. If I choose a hash function randomly from a universal family, how exactly do I re-apply it to look up a value? If I have to remember which function was chosen for each key, then I would have to maintain a list of them, and finding the correct hash function for a key would itself take linear time, violating the constant-time lookup of hash tables. How should I proceed with implementing it?
When making one hash map, you use one function from the family. When you rehash the entire map (typically because of lack of capacity or too many collisions) or create a separate map, you can then choose a different hashing function from the family. You wouldn't use two different functions to attempt to create the same hash map.
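A minimal sketch of what "one function from the family" might look like, using the classic h_{a,b}(x) = ((a*x + b) mod p) mod m construction; the names, the prime, and the integer-key assumption are mine, not from the answer:

    import java.util.Random;

    class UniversalHasher {
        private static final long P = 2_147_483_647L;   // a prime larger than any 32-bit key (2^31 - 1)
        private final long a, b;                         // chosen once, then fixed for this table
        private final int m;                             // number of buckets

        UniversalHasher(int m, Random rnd) {
            this.m = m;
            this.a = 1 + rnd.nextInt((int) (P - 1));     // a in [1, p-1]
            this.b = rnd.nextInt((int) P);               // b in [0, p-1]
        }

        int bucketFor(int keyHash) {
            long x = keyHash & 0xffffffffL;              // treat the hash code as an unsigned 32-bit key
            return (int) (((a * x + b) % P) % m);
        }
    }

The table stores one UniversalHasher instance and uses it for every put and get. When it rehashes (for example after growing), it constructs a fresh instance with new random a and b and reinserts all entries using that new function; there is never more than one active function per table, so lookups stay O(1) on average.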
To my current understanding, universal hashing is a method whereby the hash function is chosen randomly at runtime in order to guarantee reasonable performance for any kind of input.
I understand we may do this in order to prevent manipulation by somebody deliberately choosing malicious input (a possibility when a deterministic hash function is known).
My question is the following: is it not true that we still need to guarantee that a key will be mapped to the same address every time we hash it? For instance, if we want to retrieve information but the hash function is chosen at random, how do we guarantee we can get back to our data?
A universal hash family is a collection of different hash functions with the property that, for any two distinct elements of the universe, a randomly chosen function from the family maps them to the same slot only with low probability. Typically, this is implemented by having the implementation pick a random hash function from the family to use internally. Once this hash function is chosen, the hash table works as usual: you use this hash function to compute a hash code for an object, then put the object into the appropriate location. The hash table has to remember the choice of hash function it made and has to use it consistently throughout the program, since otherwise (as you've noted) it would forget where it mapped each element.
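For reference, the usual formal statement of this property (with m buckets) is that a family H is universal if, for every pair of distinct keys x and y, Pr[h(x) = h(y)] <= 1/m when h is drawn uniformly at random from H; the probability is over the choice of h, not over the keys.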
Hope this helps!
I have a large collection of objects of type foo. Each object of type foo has say 100 properties (all strings) plus an id. An object of type bar also has these 100 properties.
I want to find the matching object of type foo from the collection where all these properties match with that of bar.
Aside from the brute force method, is there an elegant algorithm where we can calculate a signature for foo objects once and do the same for the bar object and match more efficiently?
The foos are in the thousands and the bars are in the millions.
Darth Vader has a point there... and I never thought that I'd be siding with the dark side!
I'll go over what I think are the best tools for the trade:
Embedded database: Google's LevelDB - it's faster than most database solutions out there.
Hashing function: Google's CityHash - it's fast and it offers excellent hashing!
JSON Serialization
The Embedded Database
The goal of using an embedded database is that you will get performance that will beat most database solutions that you're likely to encounter. We can talk about just how fast LevelDB is, but plenty of other people have already talked about it quite a bit so I won't waste time. The embedded database allows you to store key/value pairs and quickly find them in your database.
The Hashing Function
A good hashing function will be fast and it will provide a good distribution of non-repeatable hashes. CityHash is very fast and it has very good distribution, but again: I won't waste time since a lot of other people have already talked about the performance of CityHash. You would use the hashing function to hash your objects and then use the unique key to look them up in the database.
JSON Serialization
JSON Serialization is the antithesis of what I've shown above: it's very slow and it will diminish any performance gain you achieved with CityHash, but it gives you a very simple way to hash an entire object. You serialize the object to a JSON string, then you hash the string using CityHash. Despite the fact that you've lost the performance gains of CityHash because you spent so much time serializing the object to JSON, you will still reap the benefits of having a really good hashing function.
The Conclusion
You can store billions of records in LevelDB and you will be able to quickly retrieve the exact value you're looking for just by providing the hash for it.
In order to generate a key, you can use JSON serialization and CityHash to hash the JSON string.
Use the key to find the matching object!
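A rough Java sketch of the serialize-then-hash idea; SHA-256 from the standard library stands in for CityHash, an in-memory Map stands in for LevelDB, and toJson() is a placeholder for whatever JSON library you actually use:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.HashMap;
    import java.util.Map;

    public class SignatureLookup {
        // Placeholder: serialize the 100 properties in a fixed order with your JSON library of choice.
        static String toJson(String[] properties) {
            return "[\"" + String.join("\",\"", properties) + "\"]";
        }

        // Hash the serialized object to get a fixed-size key for the key/value store.
        static String signature(String[] properties) throws Exception {
            MessageDigest md = MessageDigest.getInstance("SHA-256"); // stand-in for CityHash
            byte[] digest = md.digest(toJson(properties).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        }

        public static void main(String[] args) throws Exception {
            Map<String, String> store = new HashMap<>();   // stand-in for the embedded database
            String[] fooProps = {"red", "large", "round"}; // imagine 100 of these per object
            store.put(signature(fooProps), "foo-42");      // key the foo's id by its signature

            String[] barProps = {"red", "large", "round"};
            System.out.println(store.get(signature(barProps))); // prints foo-42 when all properties match
        }
    }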
Enjoy!
If ALL the properties match, that means they are actually the same objects. Is that correct?
In any case, you want to use a Map/Dictionary/Table with a good hashing algorithm to find matching objects.
Whichever language you are using, you should override the GetHashCode and Equals methods to implement it.
If you have a good hashing algorithm, your access time will be O(1); otherwise it can be up to O(n).
Given your memory limitation, you want to store the foos in the map; storing the bars might require a lot of space, which you might not have.
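A minimal sketch of that approach in Java, with an illustrative PropertyKey wrapper (the class name and the three-property example are assumptions, standing in for the 100 properties):

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;

    class PropertyKey {
        private final String[] properties;
        PropertyKey(String... properties) { this.properties = properties.clone(); }

        @Override
        public boolean equals(Object o) {
            return o instanceof PropertyKey && Arrays.equals(properties, ((PropertyKey) o).properties);
        }

        @Override
        public int hashCode() {
            return Arrays.hashCode(properties);   // one hash over all properties
        }

        public static void main(String[] args) {
            // Store the (thousands of) foos once, keyed by their properties.
            Map<PropertyKey, String> fooIdsByProperties = new HashMap<>();
            fooIdsByProperties.put(new PropertyKey("red", "large", "round"), "foo-42");

            // Each of the (millions of) bars then costs a single O(1) lookup on average.
            String match = fooIdsByProperties.get(new PropertyKey("red", "large", "round"));
            System.out.println(match); // foo-42
        }
    }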
Hashing is very nice and simple to implement, but I want to suggest this algorithm:
Map your 100 string properties to one big string (for example, concatenate them with a fixed length per property); this should be a unique id for the object. So we have 1000 strings in the first set and 1 million strings in the second.
The problem reduces to finding, for each string in the second set, whether the first set contains it.
Build a trie data structure on the first set.
The complexity of checking whether a string S is in the trie is O(|S|), where |S| is the length of S.
So the complexity of the algorithm is O(Sum(|Ai|) + Sum(|Bi|)) = O(max(Sum(|Ai|), Sum(|Bi|))) = O(Sum(|Bi|)) for your problem, where Ai is the string unique id of the i-th object in the first set and Bi is the string unique id of the i-th object in the second set.
UPDATE:
The trie takes O(Sum(|Ai|) * |Alphabet|) space in the worst case.
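A minimal sketch of that trie (my own naming; it uses a HashMap for child links, which trades the |Alphabet| space factor mentioned above for slightly slower hops):

    import java.util.HashMap;
    import java.util.Map;

    class Trie {
        private static class Node {
            final Map<Character, Node> children = new HashMap<>();
            boolean terminal;   // true if one of the concatenated ids ends at this node
        }

        private final Node root = new Node();

        // O(|s|): insert the concatenated id of one object from the first set (the foos).
        void insert(String s) {
            Node n = root;
            for (char c : s.toCharArray()) {
                n = n.children.computeIfAbsent(c, k -> new Node());
            }
            n.terminal = true;
        }

        // O(|s|): check the concatenated id of one object from the second set (the bars).
        boolean contains(String s) {
            Node n = root;
            for (char c : s.toCharArray()) {
                n = n.children.get(c);
                if (n == null) {
                    return false;
                }
            }
            return n.terminal;
        }
    }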