How does a hash (in a language like Ruby) work "under the hood"? - ruby

I've read here and there about hash maps/tables, and can kind of understand the concept that a hash table is essentially a finite-sized array. The function could use the modulus operator to determine which index in the array corresponds to a particular key. If collisions occur, then a linked-list can be implemented to store all the collided values. This is my very-novice understanding, and I hope someone can expound on it/correct it in the context of a Ruby hash. In Ruby, all you really have to do is
hash = {}
hash[key] = value
and this creates a key with the corresponding value. Say that you're just storing a bunch of symbols as keys and numbers as values:
hash[:a] = 1
hash[:b] = 2
...
What exactly is happening under the hood in terms of storing the values in arrays and linked-lists? What would be an example of a collision?

The Ruby Language Specification does not prescribe any particular implementation strategy for the Hash class. Every implementation is allowed to implement it however they want, provided they honor the contract.
For example, here is Rubinius's implementation, which, being written in Ruby, is pretty easy to follow: kernel/common/hash.rb This is a fairly traditional hashtable. (One other cool thing to note about this implementation is that it actually happens to be as fast as YARV's, which proves that Ruby code can be as efficient as hand-optimized C.)
Rubinius also alternatively implements the Hash class with a Hash Array Mapped Trie: kernel/common/hash_hamt.rb [Note: this implementation uses three VM primitives written in C++.]
You can switch between those two implementations using a configuration option. So, not only is the Hash implementation different between different Ruby implementations, it might even be different between two runs of the exact same program on the exact same version of the exact same Ruby implementation!
In IronRuby, Ruby's Hash class simply delegates to a .NET System.Collections.Generic.Dictionary<object, object>: Ruby/Builtins/Hash.cs
In previous versions, it didn't even delegate, it was just simply a subclass: Ruby/Builtins/Hash.cs

If you are hardcore about this you could look at the implementation directly. This is what the hash ends up using:
https://github.com/ruby/ruby/blob/c8b3f1b470e343e7408ab5883f046b1056d94ccc/st.c
The hash itself is here:
https://github.com/ruby/ruby/blob/trunk/hash.c
Most of the time, the article diego provided in the comments will be more than enough.

In Ruby 2.4 the hash table moved to an open addressing model, so I will describe only how the general hash-table structure works, not how it is implemented in 2.4 and above.
Let's imagine that we store all entries in an array. When we want to find something, we have to go through all the elements to find a match. This can take a long time if we have a lot of elements. A hash table lets us go directly to the cell holding the required value by computing the hash function for its key.
The hash table keeps all values in groups called bins ("storages"), held in a data structure similar to an array.
How does hash table work
When we add a new key-value pair, we need to work out which "storage" (bin) the pair goes into, and we do this using the .hash method (the hash function). The value it returns looks like a random number but is deterministic: it always produces the same number for the same value.
Roughly speaking, for a plain object hash returns the equivalent of a link to the memory location where the object is stored. For strings, however, the calculation is based on the contents.
Having received this pseudo-random number, we have to calculate the number of the "storage" where the key-value pair will be kept.
'a'.hash % 16 #=> 9
'a' - the key
16 - the number of storages
9 - the storage number
So, in Ruby the insertion works in the following way:
How insertion works
It takes the hash of the key using the internal hash function.
:c.hash #=> 2782
After getting the hash value, the modulo operation (2782 % 16 #=> 14) gives the storage number where the key-value pair is kept: :c.hash % 16
Add the key-value pair to the linked list of that bin.
The search works in much the same way:
Compute the hash of the key;
Find the "storage";
Then iterate through the list and retrieve the matching element.
In Ruby, the maximum average number of elements per bin (the density) is 5. As the number of records grows, the density of elements in each bin grows too (the table starts with only 16 storages).
If the density is large, for example 10_000 elements in one "storage", we have to go through all the elements of that linked list to find the corresponding record. That takes us back to O(n) time, which is pretty bad.
To avoid this, the table is rehashed: the number of "storages" is increased (to the next size in the sequence 16, 32, 64, 128, ...) and the position of every existing entry is recalculated.
A rehash occurs when the total number of elements is greater than the maximum density multiplied by the current table size:
num_entries > ST_DEFAULT_MAX_DENSITY * table->num_bins
81 > 5 * 16 - a rehash is triggered when we add the 81st element to a 16-bin table.
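The insert, search and rehash steps described above can be sketched in a few lines. This is only an illustration of the chained (pre-2.4 style) design, written in Python rather than taken from Ruby's st.c: the class name ChainedTable and its methods are made up for the example, Python lists stand in for the linked lists, and the table simply doubles its bin count when the density rule fires.

```python
MAX_DENSITY = 5  # plays the role of ST_DEFAULT_MAX_DENSITY from the text


class ChainedTable:
    """Toy separate-chaining hash table (hypothetical, not Ruby's real code)."""

    def __init__(self, num_bins=16):
        self.bins = [[] for _ in range(num_bins)]
        self.num_entries = 0

    def _bin_for(self, key):
        # hash the key, then reduce it to a bin number with modulo
        return self.bins[hash(key) % len(self.bins)]

    def __setitem__(self, key, value):
        chain = self._bin_for(key)
        for pair in chain:
            if pair[0] == key:      # key already present: overwrite its value
                pair[1] = value
                return
        chain.append([key, value])  # new key: add it to the bin's chain
        self.num_entries += 1
        if self.num_entries > MAX_DENSITY * len(self.bins):
            self._rehash()

    def __getitem__(self, key):
        for k, v in self._bin_for(key):
            if k == key:
                return v
        raise KeyError(key)

    def _rehash(self):
        # double the number of bins and recompute every entry's position
        old_pairs = [pair for chain in self.bins for pair in chain]
        self.bins = [[] for _ in range(2 * len(self.bins))]
        for pair in old_pairs:
            self._bin_for(pair[0]).append(pair)


table = ChainedTable()
for i in range(80):
    table[i] = i * i
bins_before = len(table.bins)  # 80 entries do not exceed 5 * 16
table[80] = 6400               # the 81st entry triggers the rehash
bins_after = len(table.bins)
```

After the 81st insert the table has 32 bins and every earlier key is still reachable, because its position was recomputed against the new size.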
Check this article for a more in-depth explanation: Do You Know How Hash Table Works? (Ruby Examples)

Related

What happens if you don't hash a value before selecting a hash map bucket?

Since a hash map works with a modulus/division operation to select the appropriate bucket to place the value in, it seems that the chance of collision is dependent on the number of buckets, not "how good the hash function is". How good the hash function is decides the likelihood of a same-hash return collision. However, 'collision' in a hash map refers to something else: the same value AFTER the modulus operation. Assuming the key value is an integer (say 64 bit), what can be expected if the hash function for a hash map is simply the key value itself? I would venture to say that retrieval would be a lot faster, as there wouldn't be a need to loop through a number of bytes and do hash operations, with an end result, with respect to hash table collisions, much the same. I mean, the exact values that end up colliding with an already occupied bucket are different values, but if the values are spread all over the place then overall the results should be very similar.
it seems that the chance of collision is dependent on the number of buckets, not "how good the hash function is"
No, that is not correct. Keys are generally not distributed evenly across bucket indexes. Hashing the keys tends to spread them across bucket indexes more evenly than using the raw keys.
index = key%bucket_n;
// vs
index = hash(key)%bucket_n;
Further: A good hash function works well with any bucket_n. A weak hash function improves when bucket_n is a prime.
There is a need to balance the number of entries in a table against the table size. If entries_n is much less than table_size, OP's assertions make some sense, yet this wastes a lot of memory.
If entries_n is much greater than table_size, collisions are common, and often even worse without a hash function.
IMO, the hash table size should exponentially grow with the entry count to maintain a density less than some threshold, say 1/3. A re-hash of the table may be needed to accommodate a size change.
Since a hash map works with a modulus/division operation to select the appropriate bucket to place the value in, it seems that the chance of collision is dependent on the number of buckets, not "how good the hash function is". How good the hash function is decides the likelihood of a same-hash return collision
Not quite. A poor hash function can cluster keys or make particular bits more likely than others to be set. That, in turn, can result in some buckets being more likely to be selected by the modulus operator.
Assuming the key value is an integer (say 64 bit), what can be expected if the hash function for a hash map is simply the key value itself?
In general, you can't say. There could very well be patterns in the keys that, if you just used the modulus operator, will cause some buckets to be much more full than others. A good hashing function essentially randomizes the bits so you're more likely to evenly distribute the keys in the buckets.
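A small experiment makes the point about patterned keys concrete. The sketch below compares index = key % bucket_n with index = hash(key) % bucket_n for keys that are all multiples of 16. The mixer mix64 is the splitmix64 finalizer, picked here only as an example of a bit-scrambling function; it is an assumption of this illustration, not something any particular hash map is known to use.

```python
MASK64 = (1 << 64) - 1


def mix64(x):
    """splitmix64-style finalizer: scrambles the bits of a 64-bit integer."""
    x = (x + 0x9E3779B97F4A7C15) & MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)


def buckets_used(keys, bucket_n, hash_fn):
    """Count how many distinct buckets the keys land in."""
    return len({hash_fn(k) % bucket_n for k in keys})


keys = [i * 16 for i in range(100)]  # patterned keys: every one is a multiple of 16

raw_used = buckets_used(keys, 16, lambda k: k)  # identity "hash"
mixed_used = buckets_used(keys, 16, mix64)      # scrambled before the modulus

print("raw keys fill", raw_used, "bucket(s); mixed keys fill", mixed_used)
```

With the identity function every key lands in bucket 0, so all 100 entries collide; after mixing, the same keys spread over many buckets.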
Assuming the key value is an integer (say 64 bit), what can be expected if the hash function for a hash map is simply the key value itself?
Many languages do exactly that. E.g. Java.
But you have to be careful: if your hash function is too trivial, it is also trivial for an attacker to exploit hash collisions to cause a DoS in your service. This is known as a collision attack (or hash flooding). Different libraries deal with that in different ways.
Java's HashMap falls back to a red-black tree whenever it detects too many collisions in a single bucket. Other languages introduce randomization in the hash function, so that it is harder for an attacker to exploit it.

Where do hash table's keys exist?

A hash table is a data structure that maps keys to values. Given a key, the hash function calculates the index of the slot/bucket that stores the value. If multiple keys map to the same slot, a linked list may be started from that slot. If there are not enough slots for the values, the table is resized into a bigger space.
1. Is the first level of a hash table's buckets always an array?
2. Where are the keys stored? Or is it the case that the keys don't have to be stored, since the hash function takes a key and calculates the position every time?
3. In the Ruby language, does a hash object such as {:name => "Wix", :age => 18} count as a hash table? If it does, I need the answer to question 2.
The ruby name Hash is somewhat misleading. To most developers, they are actually maps, meaning you give them a value and they give you another associated value. The fact that they are hashmaps is really just an implementation detail that makes them fast, and it is in fact the same principle of hashsets, which, given a value, just tell you if the value is in the set or not.
To simplify it a bit, imagine this:
Storing
You have an array of 10 elements. You are told to remember that 35 = "some data". You then hash the key (35), which I will simplify as just modulo-dividing it by the array length, so the result is 35 % 10 = 5.
We then store the data 35 = "some data" at that index, for example as a tuple [35, "some data"].
We then get some more data, 25 = "more data" and 78 = "cool stuff". So again, we hash the keys and get 5 and 8. Storing the second one is easy, we just have to store [78, "cool stuff"] at position 8 in the array.
But storing [25, "more data"] is a problem, because there's already a bucket at position 5. As you already pointed out, that is solved by storing a linked list. So we go back to the beginning and instead store [35, "some data", nil] for our first value.
To insert 25 we then just change it so that the first element points to the second, and get array[5] = [35, "some data", <pointer>] -> [25, "more data", nil]
Accessing
After a while the user wants to know what the value associated with "25" is.
Since we implement a hashmap, we can just hash the key, 25 % 10 = 5, and know our pair is stored at position 5. We then only have to iterate over a linked list with 2 elements looking for the key 25, and when we find it just take the second value and return it to the user.
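The storing and accessing steps above can be replayed directly. This is a minimal sketch of the same toy scheme (10 slots, key % 10 as the hash); the names store and lookup are invented for the example, and a Python list per slot stands in for the linked list.

```python
SLOTS = 10
table = [[] for _ in range(SLOTS)]  # one (initially empty) chain per slot


def store(key, value):
    bucket = table[key % SLOTS]     # "hash" the key to find its slot
    for pair in bucket:
        if pair[0] == key:          # key already stored: replace its value
            pair[1] = value
            return
    bucket.append([key, value])     # otherwise append to the slot's chain


def lookup(key):
    for k, v in table[key % SLOTS]:  # only this one chain is searched
        if k == key:
            return v
    return None


store(35, "some data")
store(25, "more data")   # collides with 35: both hash to slot 5
store(78, "cool stuff")
```

Note that each pair keeps the full key next to the value; that is what lets lookup(25) tell the two entries in slot 5 apart.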
In Practice
The above is, of course, an oversimplified example, but it shows the basic idea of how hash-maps operate.
In the real world, the hashing algorithm would, of course, be more complicated than just modulo-dividing, but the idea is the same. The hash of a key is always turned into an index in the array. A good hashing algorithm should be 1. fast and 2. random, to avoid having lots of empty buckets and a few buckets with lots of elements.
Also, our array wouldn't have a fixed length of 10, but be smart about it and try to both save memory by not being excessively big, but at the same time be generous enough with the memory to avoid unnecessary shrinking/growing all the time and keep the buckets reasonably short.
In the best case, you can have a map of a few thousand elements, and to access one you just hash it, which takes the same time independently of the size of the hash, instead of having to iterate all those thousands of elements and comparing each one to the one you're looking for.
Regarding your third question, the answer is yes.
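As for the second, keys are stored in the buckets together with their values. The key itself (not only its hashed value) has to be kept, because otherwise two keys that land in the same bucket could not be told apart; implementations may additionally cache the hash.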
I'm not sure how ruby internally stores the buckets, but generally they could be implemented in many ways, as arrays, structs, etc.

Perfect Hash Building

Why don't we use SHA-1, md5sum and other standard cryptographic hashes for hashing? They are smart enough to avoid collisions and are also not reversible. So rather than coming up with a set of new hash functions, which might have collisions, why don't we use them?
The only reason I can think of is that they require a large key, say 32 bits. But they still avoid collisions, so the lookup will definitely be O(1).
Because they are very slow, for two reasons: they aim to be cryptographically secure, not only collision-resistant in general, and they produce a much larger hash value than what you actually need in a hash table.
Also because they handle unstructured data (octet / byte streams), but the objects you need to hash are often structured and would require linearization first.
Why don't we use SHA-1, md5Sum and other standard cryptography hashes for hashing. They are smart enough to avoid collisions...
Wrong because:
Two inputs can still happen to have the same hash value. Say the hash value is 32 bit, a great general-purpose hash routine (i.e. one that doesn't utilise insights into the set of actual keys) still has at least 1/2^32 chance of returning the same hash value for any 2 keys, then 2/2^32 chance of colliding with one of those as a third key is hashed, 3/2^32 for the fourth etc..
Having distinct hash values is a very different thing from having the hash values map to distinct hash buckets in a hash table. Hash values are generally modded into the table size to select a bucket, so at best - and again for general-purpose hashing - the chance of a collision when adding an element to a hash table is #preexisting-elements / table-size.
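To see how quickly those 1/2^32, 2/2^32, 3/2^32, ... terms add up, here is a small worked calculation. It assumes an idealised hash whose 32-bit outputs are uniformly random and independent; the function name collision_probability is just for this sketch.

```python
import math


def collision_probability(n_keys, bits=32):
    """Chance that at least two of n_keys share a hash value,
    assuming uniformly random, independent hash values."""
    space = 2 ** bits
    # P(no collision) = product over i of (1 - i / space);
    # summing logarithms keeps the arithmetic accurate.
    log_no_collision = sum(math.log1p(-i / space) for i in range(n_keys))
    return 1.0 - math.exp(log_no_collision)


p_small = collision_probability(1_000)    # still small
p_half = collision_probability(77_163)    # roughly 0.5
print(round(p_small, 6), round(p_half, 3))
```

So even with an ideal 32-bit hash, a collision between full hash values becomes more likely than not after about 77 thousand keys, long before the hash values are reduced to a bucket index.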
So rather than coming up with a set of new hash functions, which might have collisions, why don't we use them?
Because speed is often the programmer's goal when choosing to use a hash table over say a binary tree. If the hash values are mathematically complicated to calculate, they may take a lot longer than using a slightly more (but still not particularly) collision prone but faster-to-calculate hash function. That said, there are times when more effort on the hashing can pay off - for example, when the hash table exists on magnetic disk and the I/O costs of seeking & reading records dwarfs hash calculation effort.
antti makes an interesting point about data too... general purpose hashing routines often work on blocks of binary data with a specific starting address and a number of bytes (they may even require that number of bytes to be a multiple of 2 or 4). In many applications, data that needs to be hashed will be intermingled with data that must not be included in the hash - such as cached values, file handles, pointers/references to other data or virtual dispatch tables etc.. A common solution is to hash the desired fields separately and combine the hash keys - perhaps using exclusive-or. As there can be bit fields that should be hashed in the same byte of memory as other data that should not be hashed, you sometimes need custom code to extract those values. Still, even if some copying and padding was required beforehand, each individual field could eventually be hashed using md5, SHA-1 or whatever and those hash values could be similarly combined, so this complication doesn't really categorically rule out the approach you're interested in.
The only reason I can think of is that they require a large key, say 32 bits.
All other things being equal, the larger the key the better, though if the hash function is mathematically ideal then any N of its bits - where 2^N >= # hash buckets - will produce minimal collisions.
But still avoiding collision so the look up will definitely be O(1).
Again, wrong as mentioned above.
(BTW... I stress general-purpose in a couple places above. That's just because there are trivial cases where you might have some insight into the keys you'll need to hash that allows you to position them perfectly within the available hash buckets. For example, if you knew the keys were the numbers 1000, 2000, 3000 etc. up to 100000 and that you had at least 100 hash buckets, you could trivially define your hash function as x/1000 and know you'd have perfect hashing sans collisions. This situation of knowing that all your keys map to distinct hash table buckets is known as "perfect hashing" - as per your question title - a good general-purpose hash like md5 is not a perfect hash, and indeed it makes no sense to talk about perfect hashing without knowing the complete set of possible keys).
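The perfect-hashing example from the note above is easy to check. In this sketch the hash is x/1000 as in the text; reducing it modulo the bucket count is an added assumption so that the key 100000 fits into exactly 100 buckets (it wraps around to bucket 0, which no other key uses).

```python
BUCKETS = 100
keys = list(range(1000, 100001, 1000))  # 1000, 2000, ..., 100000


def perfect_hash(x):
    # integer-divide by 1000, then wrap into the table (the wrap is an assumption)
    return (x // 1000) % BUCKETS


slots = [perfect_hash(k) for k in keys]
all_distinct = len(set(slots)) == len(keys)  # True: no two keys share a bucket
```

This only works because the complete key set is known in advance; feed the same function the key 1500 and it collides with 1000.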

Efficient mapping from 2^24 values to a 2^7 index

I have a data structure that stores amongst others a 24-bit wide value. I have a lot of these objects.
To minimize storage cost, I calculated the 2^7 most important values out of the 2^24 possible values and stored them in a static array. Thus I only have to save a 7-bit index to that array in my data structure.
The problem is: I get these 24-bit values and I have to convert them to my 7-bit index on the fly (no preprocessing possible). The computation is basically a search which one out of 2^7 values fits best. Obviously, this takes some time for a big number of objects.
An obvious solution would be to create a simple mapping array of bytes with the length 2^24. But this would take 16 MB of RAM. Too much.
One observation of the 16 MB array: On average 31 consecutive values are the same. Unfortunately there are also a number of consecutive values that are different.
How would you implement this conversion from a 24-bit value to a 7-bit index saving as much CPU and memory as possible?
Hard to say without knowing what the definition is of "best fit". Perhaps a kd-tree would allow a suitable search based on proximity by some metric or other, so that you quickly rule out most candidates, and only have to actually test a few of the 2^7 to see which is best?
This sounds similar to the problem that an image processor has when reducing to a smaller colour palette. I don't actually know what algorithms/structures are used for that, but I'm sure they're look-up-able, and might help.
As an idea...
Up the index table to 8 bits, then xor all 3 bytes of the 24 bit word into it.
then your table would consist of this 8 bit hash value, plus the index back to the original 24 bit value.
Since your data is RGB like, a more sophisticated hashing method may be needed.
bit24var & 0xff gives you the rightmost byte.
(bit24var >> 8) & 0xff gives you the one beside it.
(bit24var >> 16) & 0xff gives you the one beside that.
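The XOR idea can be written out as follows. It is a sketch only: the function name fold24_to_8 is made up, each byte is isolated with a 0xFF mask, and the result is an 8-bit hash rather than the final 7-bit index.

```python
def fold24_to_8(value):
    """XOR the three bytes of a 24-bit value into a single 8-bit hash."""
    low = value & 0xFF           # rightmost byte
    mid = (value >> 8) & 0xFF    # middle byte
    high = (value >> 16) & 0xFF  # leftmost byte
    return low ^ mid ^ high


h1 = fold24_to_8(0x123456)  # 0x12 ^ 0x34 ^ 0x56 == 0x70
h2 = fold24_to_8(0x563412)  # same bytes in another order: also 0x70
```

The second call shows the weakness of a plain XOR fold: any reordering of the same three bytes collides, which matters for RGB-like data.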
Yes, you are thinking correctly. It is quite likely that one or more of the 24 bit values will hash to the same index, due to the pigeonhole principle.
One method of resolving a hash clash is to use some sort of chaining.
Another idea would be to put your important values in a different array, then simply search it first. If you don't find an acceptable answer there, then you can, shudder, search the larger array.
How many of the 2^24 values do you actually have? Can you sort these values and count the runs of consecutive values?
Since you already know which of the 2^24 values you need to keep (i.e. the 2^7 values you have determined to be important), we can simply just filter incoming data and assign a value, starting from 0 and up to 2^7-1, to these values as we encounter them. Of course, we would need some way of keeping track of which of the important values we have already seen and assigned a label in [0,2^7) already. For that we can use some sort of tree or hashtable based dictionary implementation (e.g. std::map in C++, HashMap or TreeMap in Java, or dict in Python).
The code might look something like this (I'm using a much smaller range of values):
import random

def make_mapping(data, important):
    mapping = {}    # dictionary to hold the final mapping
    next_index = 0  # the next free label that can be assigned to an incoming value
    for elem in data:
        if elem in important:        # check that the element is important
            if elem not in mapping:  # check that this element hasn't been assigned a label yet
                mapping[elem] = next_index
                next_index += 1      # this label is assigned, the next new important value will get the next label
    return mapping

if __name__ == '__main__':
    important_values = [1, 5, 200000, 6, 24, 33]
    data = list(range(0, 300000))  # a list, so that random.shuffle can reorder it in place
    random.shuffle(data)
    answer = make_mapping(data, important_values)
    print(answer)
You can make the search much faster by using a hash- or tree-based set data structure for the set of important values. That would make the entire procedure O(n*log(k)) (or O(n) if it is a hash table), where n is the size of the input and k is the size of the set of important values.
Another idea is to represent the 24BitValue array as a bitmap. A nice unsigned char can hold 8 bits, so one would need 2^24 / 8 = 2^21 array elements. That's 2,097,152 bytes (2 MB). If the corresponding bit is set, then you know that that specific 24BitValue is present in the array and needs to be checked.
One would need an iterator, to walk through the array and find the next set bit. Some machines actually provide a "find first bit" operation in their instruction set.
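A bitmap along these lines might look like the sketch below. The helper names mark, is_marked and next_marked are invented for the illustration; one bit per 24-bit value, packed eight to a byte, comes to 2^21 bytes. The scan skips whole zero bytes as a poor man's "find first bit".

```python
BITS = 1 << 24
bitmap = bytearray(BITS // 8)  # one bit per 24-bit value: 2**21 bytes


def mark(value):
    bitmap[value >> 3] |= 1 << (value & 7)  # byte index, then bit within the byte


def is_marked(value):
    return bool(bitmap[value >> 3] & (1 << (value & 7)))


def next_marked(start):
    """Return the first marked value >= start, or None if there is none."""
    v = start
    while v < BITS:
        if v & 7 == 0 and bitmap[v >> 3] == 0:
            v += 8          # the whole byte is empty: skip 8 values at once
            continue
        if is_marked(v):
            return v
        v += 1
    return None


mark(5)
mark(200000)
first = next_marked(0)   # 5
second = next_marked(6)  # 200000
```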
Good luck on your quest.
Let us know how things turn out.
Evil.

Hashing - What Does It Do?

So I've been reading up on Hashing for my final exam, and I just cannot seem to grasp what is happening. Can someone explain Hashing to me the best way they understand it?
Sorry for the vague question, but I was hoping you guys would just be able to say "what hashing is" so I at least have a start, and if anyone knows any helpful ways to understand it, that would be helpful also.
Hashing is a fast heuristic for finding an object's equivalence class.
In smaller words:
Hashing is useful because it is computationally cheap. The cost is independent of the size of the equivalence class. http://en.wikipedia.org/wiki/Time_complexity#Constant_time
An equivalence class is a set of items that are equivalent. Think about string representations of numbers. You might say that "042", "42", "42.0", "84/2", "41.9..." are equivalent representations of the same underlying abstract concept. They would be in the same equivalence class. http://en.wikipedia.org/wiki/Equivalence_class
If I want to know whether "042" and "84/2" are probably equivalent, I can compute hashcodes for each (a cheap operation) and only if the hash codes are equal, then I try the more expensive check. If I want to divide representations of numbers into buckets, so that representations of the same number are in the buckets, I can choose bucket by hash code.
Hashing is heuristic, i.e. it does not always produce a perfect result, but its imperfections can be mitigated for by an algorithm designer who is aware of them. Hashing produces a hash code. Two different objects (not in the same equivalence class) can produce the same hash code but usually don't, but two objects in the same equivalence class must produce the same hash code. http://en.wikipedia.org/wiki/Heuristic#Computer_science
Hashing is summarizing.
The hash of the sequence of numbers (2,3,4,5,6) is a summary of those numbers. 20, for example, is one kind of summary that doesn't include all available bits in the original data very well. It isn't a very good summary, but it's a summary.
When the value involves more than a few bytes of data, some bits must get rejected. If you use sum and mod (to keep the sum under 2 billion, for example) you tend to keep a lot of right-most bits and lose all the left-most bits.
So a good hash is fair -- it keeps and rejects bits equitably. That tends to prevent collisions.
Our simplistic "sum hash", for example, will have collisions between other sequences of numbers that also happen to have the same sum.
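The "sum hash" is short enough to try out. A minimal sketch, with the modulus of 2 billion taken from the text and the function name sum_hash chosen for the example:

```python
def sum_hash(numbers, modulus=2_000_000_000):
    """Summarise a sequence as the sum of its elements, kept under the modulus."""
    return sum(numbers) % modulus


a = sum_hash((2, 3, 4, 5, 6))  # 20
b = sum_hash((6, 5, 4, 3, 2))  # 20: same numbers, different order
c = sum_hash((10, 10))         # 20: entirely different sequence
```

All three sequences get the same summary, which is exactly the kind of collision a fairer hash tries to make rare.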
First, we should describe the problem that a hashing algorithm solves.
Suppose you have some data (maybe an array, a tree, or database entries). You want to find a particular element in this datastore (for example, in an array) as fast as possible. How do you do it?
When you build this datastore, you can calculate a special value for every item you put in (it is called the HashValue). There are different ways to calculate this value, but ideally all of them should satisfy one condition: the calculated value should be unique for every item.
So, now you have a set of items, and for every item you have its HashValue. How do you use it? Consider an array of N elements. Let's put the items into this array at the positions given by their HashValues.
Suppose you have to answer this question: does the item "it1" exist in this array? To answer it, you can simply compute the HashValue for "it1" (let's call it f("it1")) and look in the array at position f("it1"). If the element at this position is not null (and equals our "it1" item), the answer is true. Otherwise the answer is false.
There is also the problem of collisions: how do you find a function that gives unique HashValues for all different elements? In general, no such function exists. There are, however, many good functions that give well-distributed values.
Some example for better understanding:
Suppose you have an array of Strings: A = {"aaa","bgb","eccc","dddsp",...}, and you have to answer the question: does this array contain the String S?
First, we have to choose a function for calculating HashValues. Let's take the function f that returns the length of a given string (it's actually a very bad hash function, but I chose it because it is easy to understand).
So, f("aaa") = 3, f("qwerty") = 6, and so on...
So now we calculate the HashValue for every element in array A: f("aaa")=3, f("eccc")=4,...
Let's take an array to hold these items (it is called the HashTable) and call it H (an array of strings). Now we put our elements into this array according to their HashValues:
H[3] = "aaa", H[4] = "eccc",...
And finally, how do we find a given String in this array?
Suppose you are given a String s = "eccc". f("eccc") = 4. So, if H[4] == "eccc", the answer is true; otherwise it is false.
But how do we handle the situation where two elements have the same HashValue (as "aaa" and "bgb" do here)? There are many ways to do it. One of them: each element of the HashTable contains a list of items. So H[4] will contain all items whose HashValue equals 4. How do we find a particular element? It's very easy: calculate the HashValue of the item and look through the list of items in HashTable[HashValue]. If one of these items equals the element we are searching for, the answer is true; otherwise the answer is false.
You take some data and deterministically, one-way calculate some fixed-length data from it that totally changes when you change the input a little bit.
a hash function applied to some data generates some new data.
it is always the same for the same data.
that's about it.
another constraint that is often put on it, which i think is not really part of the definition, is that you cannot recover the original data from the hash.
for me this is its own category, called cryptographic or one-way hashing.
there are a lot of demands on certain kinds of hash functions,
for example that the hash is always the same length,
or that hashes are distributed randomly for any given sequence of input data.
the only important point is that it's deterministic (always the same hash for the same data).
so you can use it to, for example, verify data integrity, validate passwords, etc.
read all about it here
http://en.wikipedia.org/wiki/Hash_function
You should read the wikipedia article first. Then come with questions on the topics you don't understand.
To put it short, quoting the article, to hash means:
to chop and mix
That is, given a value, you get another (usually) shorter value from it (chop), but that obtained value should change even if a small part of the original value changes (mix).
Let's take x % 9 as an example hashing function.
345 % 9 = 3
355 % 9 = 4
344 % 9 = 2
2345 % 9 = 5
You can see that this hashing method takes into account all parts of the input and changes if any of the digits change. That makes it a good hashing function.
On the other hand, if we took x % 10, we would get
345 % 10 = 5
355 % 10 = 5
344 % 10 = 4
2345 % 10 = 5
As you can see most of the hashed values are 5. This tells us that x%10 is a worse hashing function than x%9.
Note that x%10 is still a hashing function. The identity function could be considered a hash function as well.
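The two tables above can be reproduced with a couple of lines; the variable names are only for this sketch.

```python
inputs = [345, 355, 344, 2345]

mod9 = [x % 9 for x in inputs]    # [3, 4, 2, 5]
mod10 = [x % 10 for x in inputs]  # [5, 5, 4, 5]

distinct_mod9 = len(set(mod9))    # 4 different hash values
distinct_mod10 = len(set(mod10))  # only 2 different hash values
```

For these four inputs x % 9 gives four distinct results while x % 10 gives only two, because x % 10 looks at nothing but the last decimal digit.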
I'd say linut's answer is pretty good, but I'll amplify it a little. Computers are very good at accessing things in arrays. If I know that an item is in MyArray[19], I can access it directly. A hash function is a means of mapping lookup keys to array subscripts. If I have 193,372 different strings stored in an array, and I have a function which will return 0 for one of the strings, 1 for another, 2 for another, etc. up to 193,371 for the last one, I can see if any given string is in the array by running that function and then seeing if the given string matches the one in that spot in the array. Nice and easy.
Unfortunately, in practice, things are seldom so nice and tidy. While it's often possible to write a function which will map inputs to unique integers in a nice easy range (if nothing else:
if (inputstring == thefirststring) return 0;
if (inputstring == thesecondstring) return 1;
if (inputstring == thethirdstring) return 2;
... and so on, up to the last string, which returns 193371
), in many cases a 'perfect' function would take so much effort to compute that it wouldn't be worth the effort.
What is done instead is to design a system where a hash function says where one should start looking for the data, and then some other means is used to search for the data from there. A few common approaches are:
Linear hashing (more commonly called linear probing) -- If two items map to the same hash value, store one of them in the array slot following the one indicated by the hash code. When looking for an item, search in the indicated slot, and then next one, then the next, etc. until the item is found or one hits an empty slot. Linear hashing is simple, but it works poorly unless the table is much bigger than the number of items in it (leaving lots of empty slots). Note also that deleting items from such a hash table can be difficult, since the existence of an item may have prevented some other item from going into its indicated spot.
Double hashing -- If two items map to the same value, compute a different hash value for the second one added, and shove the second item that many slots away (if that slot is full, keep stepping by that increment until a vacant slot is found). If the hash values are independent, this approach can work well with a more-dense table. It's even harder to delete items from such a table, though, than with a linear hash table, since there's no nice way to find items which were displaced by the item to be deleted.
Nested hashing -- Each slot in the hash table contains a hash table using a different function from the main table. This can work well if the two hash functions are independent, but is apt to work very poorly if they aren't.
Chain-bucket hashing -- Each slot in the hash table holds a list of things that map to that hash value. If N things map to a particular slot, finding one of them will take time O(N). If the hash function is decent, however, most non-empty slots will contain only one item, most of those with more than that will contain only two, etc. so no slot will hold very many items.
When dealing with a fixed data set (e.g. a compiler's set of keywords), linear hashing is often good; in cases where it works badly, one can tweak the hash function so it will work well. When dealing with an unknown data set, chain bucket hashing is often the best approach. The overhead of dealing with extra lists may make it more expensive than double hashing, but it's far less likely to perform really horribly.
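As an illustration of the first approach in the list (what the answer calls linear hashing, usually known as linear probing), here is a minimal sketch. The class and method names are hypothetical, the table has a fixed capacity, and deletion is deliberately left out because, as noted above, it is the awkward part.

```python
class LinearProbingTable:
    """Toy open-addressing table with linear probing (no deletion, no resizing)."""

    def __init__(self, capacity=8):
        self.slots = [None] * capacity

    def _probe(self, key):
        # start at the key's home slot and step forward one slot at a time,
        # wrapping around at the end of the array
        start = hash(key) % len(self.slots)
        for step in range(len(self.slots)):
            yield (start + step) % len(self.slots)

    def insert(self, key, value):
        for i in self._probe(key):
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return i  # report where the item ended up
        raise RuntimeError("table is full")

    def find(self, key):
        for i in self._probe(key):
            if self.slots[i] is None:  # an empty slot ends the search
                return None
            if self.slots[i][0] == key:
                return self.slots[i][1]
        return None


t = LinearProbingTable(capacity=8)
first = t.insert(3, "a")    # home slot 3 is free
second = t.insert(11, "b")  # 11 % 8 == 3 is taken, so it moves to slot 4
third = t.insert(19, "c")   # 19 % 8 == 3 again, ends up in slot 5
```

Looking up 11 starts at slot 3, finds a different key there, and moves on to slot 4; a lookup for a missing key such as 27 walks slots 3, 4, 5 and stops at the empty slot 6.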

Resources