when to resize a hash table? - algorithm

In various hash table implementations, I have seen "magic numbers" for when a mutable hash table should resize (grow). Usually this number is somewhere between 65% to 80% of the values added per allocated slots. I am assuming the trade off is that a higher number will give the potential for more collisions and a lower number less at the expense of using more memory.
My question is how is this number arrived at?
Is it arbitrary? based on testing? based on some other logic?

At a guess, most people at least start from the numbers in a book (e.g., Knuth, Volume 3), which were produced by testing. Depending on the situation, some may carry out testing afterwards, and make adjustments accordingly -- but from what I've seen, these are probably in the minority.
As I outlined in a previous answer, the "right" number also depends heavily on how you resolve collisions. For better or worse, this fact seems to be widely ignored -- people frequently don't pick numbers that are particularly appropriate for the collision resolution they use.
OTOH, the other point I found in my testing is that it only rarely makes a whole lot of difference. You can pick numbers across a fairly broad range and get pretty similar overall speed. The main thing is to be careful to avoid pushing the number too high, especially if you're using something like linear probing for collision resolution.

I think you don't want to consider "how full" the table is (how many "buckets" out of total buckets have values) but rather the number of collisions it might take to find a spot for a new item.
I read some compiler book years ago (can't remember title or authors) that suggested just using linked lists until you have more than 10 to 12 items. That would seem to support more than 10 collisions means time to re-size.
The Design and Implementation of Dynamic. Hashing for Sets and Tables in Icon suggests that an average hash chain length of 5 (in that algorithm, the average number of collisions) is enough to trigger a rehash. Seems supported by testing, but I'm not sure I'm reading the paper correctly.
It looks like the resize condition is mainly the result of testing.

That depends on the keys. If you know that your hash function is perfect for all possible keys (for example, using gperf), then you know that you'll have only few collisions, so the number is higher.
But most of the time, you don't know much about the keys except that they are text. In this case, you have to guess since you don't even have test data to figure out in advance how your hash function is behaving.
So you hope for the best. If you hash function is very bad for the keys, then you will have a lot of collisions and the point of growth will never be reached. In this case, the chosen figure is irrelevant.
If your hash function is adequate, then it should create only a few collisions (less than 50%), so a number between 65% and 80% seems reasonable.
That said: Unless your hash table must be perfect (= huge size or lots of accesses), don't bother. If you have, say, ten elements, considering these issues is a waste of time.

As far as I'm aware the number is a heuristic based on empirical testing.
With a reasonably good distribution of hash values it seems that the magic load factor is -- as you say -- usually around 70%. A smaller load factor means that you're wasting space for no real benefit; a higher load factor means that you'll use less space but spend more time dealing with hash collisions.
(Of course, if you know that your hash values are perfectly distributed then your load factor can be 100% and you'll still have no wasted space and no hash collisions.)

Collisions depend highly on data and used hash function.
Most of numbers based on heuristics or on assumption about normal distribution of hash values. (AFAIK values about 70% are typical for extendible hash tables, but one can always construct such data stream, that you get much more/less collisions)

Related

Fast hash function with collision possibility near SHA-1

I'm using SHA-1 to detect duplicates in a program handling files. It is not required to be cryptographic strong and may be reversible. I found this list of fast hash functions https://code.google.com/p/xxhash/ (list has been moved to https://github.com/Cyan4973/xxHash)
What do I choose if I want a faster function and collision on random data near to SHA-1?
Maybe a 128 bit hash is good enough for file deduplication? (vs 160 bit sha-1)
In my program the hash is calculated on chuncks from 0 - 512 KB.
Maybe this will help you:
https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed
collisions rare: FNV-1, FNV-1a, DJB2, DJB2a, SDBM & MurmurHash
I don't know about xxHash but it looks also promising.
MurmurHash is very fast and version 3 supports 128bit length, I would choose this one. (Implemented in Java and Scala.)
Since the only relevant property of hash algorithms in your case is the collision probability, you should estimate it and choose the fastest algorithm which fulfills your requirements.
If we suppose your algorithm has absolute uniformity, the probability of a hash collision among n files using hashes with d possible values will be:
For example, if you need a collision probability lower than one in a million among one million of files, you will need to have more than 5*10^17 distinct hash values, which means your hashes need to have at least 59 bits. Let's round to 64 to account for possibly bad uniformity.
So I'd say any decent 64-bit hash should be sufficient for you. Longer hashes will further reduce collision probability, at a price of heavier computation and increased hash storage volume. Shorter caches like CRC32 will require you to write some explicit collision handling code.
Google developed and uses (I think) FarmHash for performance-critical hashing. From the project page:
FarmHash is a successor to CityHash, and includes many of the same tricks and techniques, several of them taken from Austin Appleby’s MurmurHash.
...
On CPUs with all the necessary machine instructions, about six different hash functions can contribute to FarmHash's lineup. In some cases we've made significant performance gains over CityHash by using newer instructions that are now commonly available. However, we've also squeezed out some more speed in other ways, so the vast majority of programs using CityHash should gain at least a bit when switching to FarmHash.
(CityHash was already a performance-optimized hash function family by Google.)
It was released a year ago, at which point it was almost certainly the state of the art, at least among the published algorithms. (Or else Google would have used something better.) There's a good chance it's still the best option.
The facts:
Good hash functions, specially the cryptographic ones (like SHA-1),
require considerable CPU time because they have to honor a number of
properties that wont be very useful for you in this case;
Any hash function will give you only one certainty: if the hash values of two files are different, the files are surely different. If, however, their hash values are equal, chances are that the files are also equal, but the only way to tell for sure if this "equality" is not just a hash collision, is to fall back to a binary comparison of the two files.
The conclusion:
In your case I would try a much faster algorithm like CRC32, that has pretty much all the properties you need, and would be capable of handling more than 99.9% of the cases and only resorting to a slower comparison method (like binary comparison) to rule out the false positives. Being a lot faster in the great majority of comparisons would probably compensate for not having an "awesome" uniformity (possibly generating a few more collisions).
128 bits is indeed good enough to detect different files or chunks. The risk of collision is infinitesimal, at least as long as no intentional collision is being attempted.
64 bits can also prove good enough if the number of files or chunks you want to track remain "small enough" (i.e. no more than a few millions ones).
Once settled the size of the hash, you need a hash with some very good distribution properties, such as the ones listed with Q.Score=10 in your link.
It kind of depends on how many hashes you are going to compute over in an iteration.
Eg, 64bit hash reaches a collision probability of 1 in 1000000 with 6 million hashes computed.
Refer to : Hash collision probabilities
Check out MurmurHash2_160. It's a modification of MurmurHash2 which produces 160-bit output.
It computes 5 unique results of MurmurHash2 in parallel and mixes them thoroughly. The collision probability is equivalent to SHA-1 based on the digest size.
It's still fast, but MurmurHash3_128, SpookyHash128 and MetroHash128 are probably faster, albeit with a higher (but still very unlikely) collision probability. There's also CityHash256 which produces a 256-bit output which should be faster than SHA-1 as well.

if huge array is faster than hash-map for look-up?

I'm receiving "order update" from stock exchange. Each order id is between 1 and 100 000 000, so I can use 100 million array to store 100 million orders and when update is received I can look-up order from array very fast just accessing it by index arrray[orderId]. I will spent several gigabytes of memory but this is OK.
Alternatively I can use hashmap, and because at any moment the number of "active" orders is limited (to, very roughly, 100 000), look-up will be pretty fast too, but probaly a little bit slower then array.
The question is - will hashmap be actually slower? Is it reasonably to create 100 millions array?
I need latency and nothing else, I completely don't care about memory, what should I choose?
Whenever considering performance issues, one experiment is worth a thousand expert opinions. Test it!
That said, I'll take a wild stab in the dark: it's likely that if you can convince your OS to keep your multi-gigabyte array resident in physical memory (this isn't necessarily easy - consider looking at the mlock and munlock syscalls), you'll have relatively better performance. Any such performance gain you notice (should one exist) will likely be by virtue of bypassing the cost of the hashing function, and avoiding the overheads associated with whichever collision-resolution and memory allocation strategies your hashmap implementation uses.
It's also worth cautioning that many hash table implementations have non-constant complexity for some operations (e.g., separate chaining could degrade to O(n) in the worst case). Given that you are attempting to optimize for latency, an array with very aggressive signaling to the OS memory manager (e.g., madvise and mlock) are likely to result in the closest to constant-latency lookups that you can get on a microprocessor easily.
While the only way to objectively answer this question is with performance tests, I will argue for using a Hashtable Map. (Caching and memory access can be so full of surprises; I do not have the expertise to speculate on which one will be faster, and when. Also consider that localized performance differences may be marginalized by other code.)
My first reason for "initially choosing" a hash is based off of the observation that there are 100M distinct keys but only 0.1M active records. This means that if using an array, index utilization will only be 0.1% - this is a very sparse array.
If the data is stored as values in the array then it needs to be relatively small or the array size will balloon. If the data is not stored in the array (e.g. array is of pointers) then the argument for locality of data in the array is partially mitigated. Either way, the simple array approach requires lots of unused space.
Since all the keys are already integers, the distribution (hash) function and can be efficiently implemented - there is no need to create a hash of a complex type/sequence so the "cost" of this function should approach zero.
So, my simple proposed hash:
Use linear probing backed by contiguous memory. It is simple, has good locality (especially during the probe), and avoids needing to do any form of dynamic allocation.
Pick a suitable initial bucket size; say, 2x (or 0.2M buckets, primed). Don't even give the hash a chance of resize. Note that this suggested bucket array size is only 0.2% the size of the simple array approach and could be reduced further as the size vs. collision rate can be tuned.
Create a good distribution function for the hash. It can also exploit knowledge of the ID range.
While I've presented specialized hashtable rules "optimized" for the given case, I would start with a normal Map implementation (be it a hashtable or tree) and test it .. if a standard implementation works suitably well, why not use it?
Now, test different candidates under expected and extreme loads - and pick the winner.
This seems to depend on the clustering of the IDs.
If the active IDs are clustered suitably already then, without hashing, the OS and/or L2 cache have a fair shot at holding on to the good data and keeping it low-latency.
If they're completely random then you're going to suffer just as soon as the number of active transactions exceeds the number of available cache lines or the size of those transactions exceeds the size of the cache (it's not clear which is likely to happen first in your case).
However, if the active IDs work out to have some unfortunate pattern which causes a high rate of contention (eg., it's a bit-pack of different attributes, and the frequently-varying attribute hits the hardware where it hurts), then you might benefit from using a 1:1 hash of the index to get back to the random case, even though that's usually considered a pretty bad case on its own.
As far as hashing for compaction goes; noting that some people are concerned about worst-case fallback behaviour for a hash collision, you might simply implement a cache of the full-sized table in contiguous memory, since that has a reasonably constrained worst case. Simply keep the busiest entry in the map, and fall back to the full table on collisions. Move the other entry into the map if it's more active (if you can find a suitable algorithm to decide this).
Even so, it's not clear that the necessary hash table size is sufficient to reduce the working set to being cacheable. How big are your orders?
The overhead of a hashmap vs. an array is almost none. I would bet on a hashmap of 100,000 records over an array of 100,000,000, without a doubt.
Remember also that, while you "don't care about memory", this also means you'd better have the memory to back it up - an array of 100,000,000 integers will take up 400mb, even if all of them are empty. You run the risk of your data being swapped out. If your data gets swapped out, you will get a performance hit of several orders of magnitude.
You should test and profile, as others have said. My random stab in the dark, though: A high-load-factor hash table will be the way to go here. One huge array is going to cost you a TLB miss and then a last-level cache miss per access. This is expensive. A hash table, given the working set size you mentioned, is probably only going to cost some arithmetic and an L1 miss.
Again, test both alternatives on representative examples. We're all just stabbing in the dark.

Is there a guaranteed fair variation on consistent hashing?

I'm looking for something like Consistent Hashing, but with a guarantee that a distribution ends up as fair as possible (not just on average for random keys) - is there such a thing and where can I find it if so?
Edit: In my specific case, the set of keys is known up front (and "small"). Exactly these keys will always be present and must be allocated to exactly one node each at any given point in time.
Sounds to me like you're looking for a minimal perfect hash.
not just on average for random keys
This is not an accurate description of the guarantees provided by consistent hashing. First, "on average" does not capture the fact that, with random placement of a large number of virtual nodes on the circle and a good family of hash functions (e.g., one that is log-wise independent), the load imbalance is seriously unlikely to be large (I believe the usual imbalance should be on the order of square root of the number of keys assigned to a particular machine). Second, the keys don't have to be random as long as they don't depend on the randomly chosen hash function (oblivious adversary).
Since you want hashing that is fair always, randomization won't help, since the RNG might have outcomes indistinguishable from this one. No deterministic algorithm can assign node preferences to keys statically without the possibility of imbalance, unless the keys are known offline.
If you have sufficiently few items that you care about square root imbalances, you can do old-fashioned stateful load balancing.

If consistent hash is efficient,why don't people use it everywhere?

I was asked some shortcommings of consistent hash. But I think it just costs a little more than a traditional hash%N hash. As the title mentioned, if consistent hash is very good, why not we just use it?
Do you know more? Who can tell me some?
Implementing consistent hashing is not trivial and in many cases you have a hash table that rarely or never needs remapping or which can remap rather fast.
The only substantial shortcoming of consistent hashing I'm aware of is that implementing it is more complicated than simple hashing. More code means more places to introduce a bug, but there are freely available options out there now.
Technically, consistent hashing consumes a bit more CPU; consulting a sorted list to determine which server to map an object to is an O(log n) operation, where n is the number of servers X the number of slots per server, while simple hashing is O(1).
In practice, though, O(log n) is so fast it doesn't matter. (E.g., 8 servers X 1024 slots per server = 8192 items, log2(8192) = 13 comparisons at most in the worst case.) The original authors tested it and found that computing the cache server using consistent hashing took only 20 microseconds in their setup. Likewise, consistent hashing consumes space to store the sorted list of server slots, while simple hashing takes no space, but the amount required is minuscule, on the order of Kb.
Why is it not better known? If I had to guess, I would say it's only because it can take time for academic ideas to propagate out into industry. (The original paper was written in 1997.)
I assume you're talking about hash tables specifically, since you mention mod N. Please correct me if I'm wrong in that assumption, as hashes are used for all sorts of different things.
The reason is that consistent hashing doesn't really solve a problem that hash tables pressingly need to solve. On a rehash, a hash table probably needs to reassign a very large fraction of its elements no matter what, possibly a majority of them. This is because we're probably rehashing to increase the size of our table, which is usually done quadratically; it's very typical, for instance, to double the amount of nodes, once the table starts to get too full.
So in consistent hashing terms, we're not just adding a node; we're doubling the amount of nodes. That means, one way or another, best case, we're moving half of the elements. Sure, a consistent hashing technique could cut down on the moves, and try to approach this ideal, but the best case improvement is only a constant factor of 2x, which doesn't change our overall complexity.
Approaching from the other end, hash tables are all about cache performance, in most applications. All interest in making them go fast is on computing stuff as quickly as possible, touching as little memory as possible. Adding consistent hashing is probably going to be more than a 2x slowdown, no matter how you look at this; ultimately, consistent hashing is going to be worse.
Finally, this entire issue is sort of unimportant from another angle. We want rehashing to be fast, but it's much more important that we don't rehash at all. In any normal practical scenario, when a programmer sees he's having a problem due to rehashing, the correct answer is nearly always to find a way to avoid (or at least limit) the rehashing, by choosing an appropriate size to begin with. Given that this is the typical scenario, maintaining a fairly substantial side-structure for something that shouldn't even be happening is obviously not a win, and again, makes us overall slower.
Nearly all of the optimization effort on hash tables is either in how to calculate the hash faster, or how to perform collision resolution faster. These are things that happen on a much smaller time scale than we're talking about for consistent hashing, which is usually used where we're talking about time scales measured in microseconds or even milliseconds because we have to do I/O operations.
The reason is because Consistent Hashing tends to cause more work on the Read side for range scan queries.
For example, if you want to search for entries that are sorted by a particular column then you'd need to send the query to EVERY node because consistent hashing will place even "adjacent" items in separate nodes.
It's often preferred to instead use a partitioning that is going to match the usage patterns. Better yet replicate the same data in a host of different partitions/formats

Is it worthwhile to use a bit vector/array rather than a simple array of bools?

When I want an array of flags it has typically pained me to use an entire byte (or word) to store each one, as would be the result if I made an array of bools or some other numeric type that could be set to 0 or 1. But now I wonder whether using a structure that is more space-efficient is worth it given the (albeit hopefully very slight) additional overhead of shifting and bit testing.
In my company we use Rogue Wave tools (though hopefully not for much longer) and it's their RWBitVec that I've used for this purpose up until now.
It's mostly about saving memory. If your array of bools is large enough that a 8x improvement on storage space is meaningful, then by all means, use a bitarray.
Note that the memory access is pretty expensive compared to the shift/and, so the bitarray approach is slightly faster than the array-of-chars. Basically it comes down to memory versus programmer time. Remember that premature optimization is a waste of time. I'd use whichever approach is the easiest to develop, and then refactor only after it shows that it's a primary performance bottleneck.
Don't use vector<bool>, it's not really a Container:
http://www.informit.com/guides/content.aspx?g=cplusplus&seqNum=98
Use std::bitset (for fixed size bitsets) and boost::dynamic_bitset (for resizeable ones) where appropriate. They aren't Containers either, but they don't look as if they ought to be, so are less likely to cause confusion.
Whether the trade-off is worth it depends, obviously, on how big the arrays are in your program. I think you're right that the overhead of bit access is usually negligible, but if the memory overhead is negligible too then you've nothing to go on there either.
bitsets have the advantage that they do exactly what they say on the tin - none of this "declare an array of chars/ints, but the only legal values are 0 and 1" nonsense. Your code will read about the same as if you'd used an array.
I wrote some code once to unpack a bitmap image line into separate bytes per pixel, then pack it back again after processing. For the code I was benchmarking, it was actually faster to do it that way than to work at the bit level.
I've used a bit array for indexing a HUGE tree. The algorithm was:
Check bitarray if entry exists
if entry doesn't exists
return null
else do binary search in tree
return value
The advantage is that the Tree has huge enough that searching for a non existent entry would cause several cache misses before completing. Thus the algorithm was taking longer or not depending on the existence of the value.
However adding that initial bit array search meant I'd reduce cache misses, and would avoid searching the tree at all if the answer wasn't there. By adding this extra step the algorithm became much more robust (actual performance time on a Computer, became nearly linear although the Big-O would say differently), and overall performance increased by an order of magnitude.
Like they say sometimes taking hardware into consideration is more important than the "ideal" mathematical algorithm.
Modern computers have barrel shifters so that a shift of any number of bits up to 31 takes a few cycles (less than many other instructions). Compilers take advantage of this and bit operations are not only space efficient but in most cases time efficient.
But it really depends on how you're using and testing the bits - there are some inefficient methods that would make using a whole integer faster.
-Adam
Is it worth it? Only if you know that you have a problem with memory usage.
But unless you're either:
Working on an embedded processor with very limited resources, or
Storing an astronomical number of bools
then the answer is no. You'll have to work somewhat harder to achieve the same level of readability in your source by using a bitmap than you will using bools, and unless you're operating under either of the previous two conditions you'll likely find that it doesn't make any noticeable difference to your memory footprint.

Resources