I need some advice for implementing a good hash table structure. I've researching something, but I would like some external opinions. Thanks!
Whatever hash function you choose you have to fullfil the following requirements:
provide a uniform distribution of hash values: a non-uniform distribution will increase the amount of collisions between mapped values.
good schema for collision resolution: it's almost impossible to avoid them, so you will have to implement some strategies such as "separate chaining" or "open addressing". A good starting point is http://task3.cc/44/hash-maps-with-linear-probing-and-separate-chaining/.
Related
Hash tables allow mapping keys to values by using a hashing function. Here the hashing function actually computes the index of the key mapped to a specific value. But I just can't get my head around why we even use hash tables it in the first place? Why do you need a hash table? Are maps/dictionaries not good enough? Why not declare a dictionary ({'key1': 'value1'} in Python) and use it in the places where a hash table is required? I read a lot about it and still don't get it. Can you help me understand this?
why do you need a hashtable, is the map/dictionary not good
This is like asking why you need an automotive engine, isn't a car good enough? An engine is how a car works; you just don't see the engine when you are driving the car. But if you are learning to become an automotive engineer, then you should learn how engines work and how to design, build and maintain them.
Likewise, a hash table is how a dictionary works, you just don't see the hash table if you are writing code that uses a dictionary. But if you are learning to become a computer scientist, then you should learn how hash tables and other data structures work, and how to design, build and maintain them.
I am researching about hash tables and hash maps, everything I have read or watched gives a very vague description of the differences. From messing around on Netbeans with them both, they seem to have the same functions and do the same things, what are the fundamental differences between these two data structures?
There are no differences, but you can find that the same thing called differently in different programming languages, so how people call something depends on their background and programming language they use. For example: in c++ it will be HashMap and in java it will be HashTable.
Also, there could be one difference concluded based on the naming: HashTable allows only store hashed keys, but not values whereas HashMap allows to retrieve a value by hashed key. Internally the both will use the same algorithm and can be considered as same data structure.
HashTable sounds to me like a concrete data structure, although it has numerous variants depending on what happens when a collision occurs, when the table fills up, when it empties.
Map sounds like a abstract data structure, something defined by the available operations (Dictionary would be a potential other name for the same data structure, but I'd not be surprised if some nomenclature defined both with a nuance somewhere).
HashMap sounds like an implementation of the Map abstract data structure using an HashTable concrete data structure.
Again, I'd not be surprised if a language or a library provided both, with a nuance somewhere (HashMap for instance could provide only the operations defined for a Map, but HashTable provides everything which make sense for an HashTable).
If my intention is only to have a good hash function that spreads data evenly into all of the buckets, then I need not come up with a family of hash functions, I could just do with one good hash function, is that correct?
The purpose of having a family of hash functions is only to make it harder for the enemy to build a pathological data set as when we pick a hash function randomly, he/she has no information about which hash function is employed. Is my understanding right?
EDIT:
Since someone is trying to close as unclear; This question is to know the real purpose of employing a Universal family of hash functions.
I could just do with one good hash function, is that correct?
As you note later in your question, an "enemy" who knows which hash function you're using could prepare a pathological data set.
Further, hashing is just the first stage in storing data into your table's buckets - if you're implementing open addressing / closed hashing, you also need to select alternative buckets to probe after collisions: simple approaches like linear and quadratic probing generally provide adequate collision avoidance, and are likely mathematically simpler and therefore faster than rehashing, but they don't maintain a probability of the next probe finding an unused bucket at the load factor. Rehashing with another good hash function (including another from a family of such functions) does, so if that's important to you you may prefer to use a family of hash functions.
Note too that sometimes an in-memory hash table is used to say at which offsets/sectors on disk data is stored, so extra rehashing calculations with already-in-memory data may be far more appealing than a higher probability (with linear/quadratic probing) of waiting on disk I/O only to find another collision.
I need an engine which consists of a world populated with axis-aligned bounding boxes (AABBs). A continuous loop will be executed, doing the following:
for box_a in world
box_a = do_something(box_a)
for box_b in world
if (box_a!=box_b and collides(box_a, box_b))
collide(box_a, box_b)
collide(box_b, box_a)
The problem with that is, obviously, that this is O(n^2). I have managed to make this loop much faster partitioning the space in chunks, so this became:
for box_a in world
box_a = do_something(box_a)
for chunk in box_a.neighbor_chunks
for box_b in chunk
if (box_a!=box_b and collides(box_a, box_b))
collide(box_a, box_b)
collide(box_b, box_a)
This is much faster but a little crude. Given that there is such a faster algorithm with not a lot of effort, I'd bet there is a data structure I'm not aware of that generalizes what I've done here, providing much better scalability for this algorithm.
So, my question is: what is the name of this problem and what are the optimal algorithms and data-structures to implement it?
this is indeed a generic problem of computer science : space partitionning.
its used in raytracing, path tracing, raster rendering, physics, IA, games, and pretty sure in HPC, databases, matrix maths, whatever science (molecules, pharmacy....), and I bet thousands of other stuff.
there is no 1 best structure, I have a friend who did his master on an algorithm to tesselate a point of cloud coming out of a laser scanner (billions of data) and in his case the best data structure was to mix a collection of uniforms 3D grids with some octree.
For other people kd-tree is the best, for other people, BVH trees are the best.
I like the grid system but it cannot work if the space is too wide because all cells has to exist.
One day I even implemented a sparse grid system using a hash map, it worked, I didn't bother to profile or investigate the performance so I wouldn't know if its an excellent way, I know its one way though.
To do that, you make a KEY class which is basically a 3D position vector hasher, first you apply an integer division on the coordinates to define the size of one grid cell. Then you stupidely hash all coordinates into one hash and provide a hash_value method or friend method. an equality operator and then its usable in a hash map.
You can use a google::sparse_map or something along these lines. I personally used boost::unordered and it was enough in my case.
Then the thing to consider is the presence of AABB into more than one cell. You can store a reference in every cell covered by your AABB, its just something to be aware of in every algorithm : "there is no 1-1 relationship between cell references and AABB." that's all.
good luck
I am looking to implement my own collection class. The characteristics I want are:
Iterable - order is not important
Insertion - either at end or at iterator location, it does not matter
Random Deletion - this is the tricky one. I want to be able to have a reference to a piece of data which is guaranteed to be within the list, and remove it from the list in O(1) time.
I plan on the container only holding custom classes, so I was thinking a doubly linked list that required the components to implement a simple interface (or abstract class).
Here is where I am getting stuck. I am wondering whether it would be better practice to simply have the items in the list hold a reference to their node, or to build the node right into them. I feel like both would be fairly simple, but I am worried about coupling these nodes into a bunch of classes.
I am wondering if anyone has an idea as to how to minimize the coupling, or possibly know of another data structure that has the characteristics I want.
It'd be hard to beat a hash map.
Take a look at tries.
Apparently they can beat hashtables:
Unlike most other algorithms, tries have the peculiar feature that the time to insert, or to delete or to find is almost identical because the code paths followed for each are almost identical. As a result, for situations where code is inserting, deleting and finding in equal measure tries can handily beat binary search trees or even hash tables, as well as being better for the CPU's instruction and branch caches.
It may or may not fit your usage, but if it does, it's likely one of the best options possible.
In C++, this sounds like the perfect fit for std::unordered_set (that's std::tr1::unordered_set or boost::unordered_set to you if you have an older compiler). It's implemented as a hash set, which has the characteristics you describe.
Here's the interface documentation. Note that the hash containers actually offer two sets of iterators, the usual ones and local ones which only go through one bucket.
Many other languages have "hash sets" as well, certainly Java and C#.