Is Universal family of hash functions only to prevent enemy attack? - data-structures

If my intention is only to have a good hash function that spreads data evenly into all of the buckets, then I need not come up with a family of hash functions, I could just do with one good hash function, is that correct?
The purpose of having a family of hash functions is only to make it harder for the enemy to build a pathological data set as when we pick a hash function randomly, he/she has no information about which hash function is employed. Is my understanding right?
EDIT:
Since someone is trying to close as unclear; This question is to know the real purpose of employing a Universal family of hash functions.

I could just do with one good hash function, is that correct?
As you note later in your question, an "enemy" who knows which hash function you're using could prepare a pathological data set.
Further, hashing is just the first stage in storing data into your table's buckets - if you're implementing open addressing / closed hashing, you also need to select alternative buckets to probe after collisions: simple approaches like linear and quadratic probing generally provide adequate collision avoidance, and are likely mathematically simpler and therefore faster than rehashing, but they don't maintain a probability of the next probe finding an unused bucket at the load factor. Rehashing with another good hash function (including another from a family of such functions) does, so if that's important to you you may prefer to use a family of hash functions.
Note too that sometimes an in-memory hash table is used to say at which offsets/sectors on disk data is stored, so extra rehashing calculations with already-in-memory data may be far more appealing than a higher probability (with linear/quadratic probing) of waiting on disk I/O only to find another collision.

Related

Why aren't cryptographic hash functions used in data structures?

While some algorithms like MD5 haven't quite stood the test of time with regards to the security industry, others like the SHA family of functions have (thus far). Yet despite the discovery, or theoretical existence of collisions within their domains, cryptographic hash functions still provide an incredibly well distributed range of fixed length output mappings for data of arbitrary length and type – why aren’t they used in data structures more often? Isn’t the goal of a hash table (provided a good function) to map every input to a unique key, such that chaining, nested tables and other collision handling techniques become entirely moot? It’s certainly convenient being able to feed almost anything to a function, and know the exact length of the key you will receive! Seems like an ideal use for retired security protocols to me.
Cryptographic hash function can and are used as the hash function in hash tables. Only not so often. The drawback for the cryptographic hashes is that they are very 'expensive' in terms of processing power needed compared to the more traditional hash functions used in hash tables.
Traditional hash functions have all the characteristics that you need for a hash table, but require way less CPU cycles to perform. This has changes a bit now most chipsets include hardware acceleration for these cryptographic hashes though.
And the 'index' generated with a cryptographic hash function is too large. SO you need to trim it down by either a reduction or masking. (You don't need 16 bytes of hash table indexes ;))
All in all, they are often not worth the hassle..

if huge array is faster than hash-map for look-up?

I'm receiving "order update" from stock exchange. Each order id is between 1 and 100 000 000, so I can use 100 million array to store 100 million orders and when update is received I can look-up order from array very fast just accessing it by index arrray[orderId]. I will spent several gigabytes of memory but this is OK.
Alternatively I can use hashmap, and because at any moment the number of "active" orders is limited (to, very roughly, 100 000), look-up will be pretty fast too, but probaly a little bit slower then array.
The question is - will hashmap be actually slower? Is it reasonably to create 100 millions array?
I need latency and nothing else, I completely don't care about memory, what should I choose?
Whenever considering performance issues, one experiment is worth a thousand expert opinions. Test it!
That said, I'll take a wild stab in the dark: it's likely that if you can convince your OS to keep your multi-gigabyte array resident in physical memory (this isn't necessarily easy - consider looking at the mlock and munlock syscalls), you'll have relatively better performance. Any such performance gain you notice (should one exist) will likely be by virtue of bypassing the cost of the hashing function, and avoiding the overheads associated with whichever collision-resolution and memory allocation strategies your hashmap implementation uses.
It's also worth cautioning that many hash table implementations have non-constant complexity for some operations (e.g., separate chaining could degrade to O(n) in the worst case). Given that you are attempting to optimize for latency, an array with very aggressive signaling to the OS memory manager (e.g., madvise and mlock) are likely to result in the closest to constant-latency lookups that you can get on a microprocessor easily.
While the only way to objectively answer this question is with performance tests, I will argue for using a Hashtable Map. (Caching and memory access can be so full of surprises; I do not have the expertise to speculate on which one will be faster, and when. Also consider that localized performance differences may be marginalized by other code.)
My first reason for "initially choosing" a hash is based off of the observation that there are 100M distinct keys but only 0.1M active records. This means that if using an array, index utilization will only be 0.1% - this is a very sparse array.
If the data is stored as values in the array then it needs to be relatively small or the array size will balloon. If the data is not stored in the array (e.g. array is of pointers) then the argument for locality of data in the array is partially mitigated. Either way, the simple array approach requires lots of unused space.
Since all the keys are already integers, the distribution (hash) function and can be efficiently implemented - there is no need to create a hash of a complex type/sequence so the "cost" of this function should approach zero.
So, my simple proposed hash:
Use linear probing backed by contiguous memory. It is simple, has good locality (especially during the probe), and avoids needing to do any form of dynamic allocation.
Pick a suitable initial bucket size; say, 2x (or 0.2M buckets, primed). Don't even give the hash a chance of resize. Note that this suggested bucket array size is only 0.2% the size of the simple array approach and could be reduced further as the size vs. collision rate can be tuned.
Create a good distribution function for the hash. It can also exploit knowledge of the ID range.
While I've presented specialized hashtable rules "optimized" for the given case, I would start with a normal Map implementation (be it a hashtable or tree) and test it .. if a standard implementation works suitably well, why not use it?
Now, test different candidates under expected and extreme loads - and pick the winner.
This seems to depend on the clustering of the IDs.
If the active IDs are clustered suitably already then, without hashing, the OS and/or L2 cache have a fair shot at holding on to the good data and keeping it low-latency.
If they're completely random then you're going to suffer just as soon as the number of active transactions exceeds the number of available cache lines or the size of those transactions exceeds the size of the cache (it's not clear which is likely to happen first in your case).
However, if the active IDs work out to have some unfortunate pattern which causes a high rate of contention (eg., it's a bit-pack of different attributes, and the frequently-varying attribute hits the hardware where it hurts), then you might benefit from using a 1:1 hash of the index to get back to the random case, even though that's usually considered a pretty bad case on its own.
As far as hashing for compaction goes; noting that some people are concerned about worst-case fallback behaviour for a hash collision, you might simply implement a cache of the full-sized table in contiguous memory, since that has a reasonably constrained worst case. Simply keep the busiest entry in the map, and fall back to the full table on collisions. Move the other entry into the map if it's more active (if you can find a suitable algorithm to decide this).
Even so, it's not clear that the necessary hash table size is sufficient to reduce the working set to being cacheable. How big are your orders?
The overhead of a hashmap vs. an array is almost none. I would bet on a hashmap of 100,000 records over an array of 100,000,000, without a doubt.
Remember also that, while you "don't care about memory", this also means you'd better have the memory to back it up - an array of 100,000,000 integers will take up 400mb, even if all of them are empty. You run the risk of your data being swapped out. If your data gets swapped out, you will get a performance hit of several orders of magnitude.
You should test and profile, as others have said. My random stab in the dark, though: A high-load-factor hash table will be the way to go here. One huge array is going to cost you a TLB miss and then a last-level cache miss per access. This is expensive. A hash table, given the working set size you mentioned, is probably only going to cost some arithmetic and an L1 miss.
Again, test both alternatives on representative examples. We're all just stabbing in the dark.

Do any hashtables (in-memory, non-distributed) use consistent hashing?

I'm not talking about distributed key/value systems, such as typically used with memcached, which use consistent hashing to make adding/removing nodes a relatively cheap procedure.
I'm talking about your standard in-memory hashtable like python's dict or perl's hash.
It would seem like the benefits of using consistent hashing would also apply to these standard data structures, by lowering the cost of resizing the hashtable. Real-time systems (and other latency-sensitive systems) would benefit from / require hashtables optimized for low-cost growth, even if overall throughput declines slightly.
Wikipedia alludes to "incremental resizing" but basically talks about a hot/cold replacement approach to resizing; there is a separate article about "extendible hashing" that uses a trie for bucket lookup to accomplish cheap rehashing.
Just curious if anyone's heard of in-core, single-node hashtables that use consistent hashing to lower growth cost. Or is this requirement better met using something other approach (ala the two wikipedia bits listed above)?
or ... is my whole question misguided? Do memory paging considerations make the complexity not worth it? That is, the extra indirection of consistent hashing lets you rehash only a fraction of the total keys, but perhaps that doesn't matter because you'll probably have to read from each existing page, so memory latency is your primary factor, and whether you rehash some or all of the keys doesn't matter compared to the cost of the memory access.... but on the other hand, with consistent hashing, all of your key remaps have the same destination page, so there's going to be less memory thrashing than if your keys remap to any of the existing pages.
EDIT: added "data-structures" tag, clarified final sentence to say "page" instead of "bucket".
I haven't heard of this in the wild, but it may be a good idea if you choose the right consistent hash implementation. Specifically, Jump Consistent Hashing by Google et al. First I'll go into why Jump, then I'll go into how it can be useful in a local data structure.
Jump Consistent Hashing
Jump Consistent Hashing (which I'll shorten to Jump) is great for this space for a few reasons. Jump assumes that nodes don't fail, which is great for local data structures because they, well, don't fail! This allows Jump to merely be a mapping to a range of numbers [0, numBuckets), requiring only 2-4 bytes of space.
Further the implementation is simple and fast. And it is even faster if we remove the reference implementation's floating point divides and replace them with half as many integer divides. (Which we can, by the way.)
All this can be used for a variation on...
ConcurrentHashMap
But first, Java's Concurrent Hash Map at a high-level.
Java's ConcurrentHashMap is parameterized by a number of buckets. This sharding factor is constant through the life of the map. Each of these buckets is itself a hash map with its own lock.
When inserting a key-value pair into the map, the key is hashed into one of the buckets. The lock for that key is taken, and the item is inserted into the bucket's hash map before releasing the lock. Whilst inserting into bucket x another thread can be inserting concurrently into bucket y, but it will wait for the lock if inserting into bucket x. Thus Java's ConcurrentHashMap has n-way concurrency, where n is the bucket parameter of the constructor.
Just like any hash map, a bucket in ConcurrentHashMap can fill up and need to grow. Just like the regular hash map it does this by doubling its size and rehashing everything in the bucket back into its bigger self. Except that 'its bigger self' is only the bucket's 'self'. If a bucket is a hot spot and gets more than its fair share of keys, the bucket will grow disproportionately compared to the other buckets. And each time a bucket grows it takes longer and longer to rehash into itself. This last point is not only a problem for hot spots, but when the hash table plain old gets more keys.
Imagine if we could grow the number of buckets as the number of keys grows. With this we could dampen the amount of growth each individual bucket grows.
Enter consistent hashing, which allows us to add more buckets!
ConcurrentHashMap take 2: Consistent Hashing Style
We can get ConcurrentHashMap to grow its number of buckets in a two easy steps.
First replace the function that maps to each bucket with the jump consistent hash function. So far everything should work the the same.
Second split off a new bucket when a bucket is filled; also grow the filled bucket. Actually, only split off a new bucket if the filled bucket becomes the largest biggest in terms of capacity. That can be calculated without iterating the buckets.
With consistent hashing the split will only direct keys into the new bucket and not backwards into any of the old buckets.
End notes
I'm sure there can be improvements on this scheme. To wit, splitting off a bucket requires a full table scan to move keys into the new bucket. This is surely no worse than a vanilla hash map, and likely better, but it is at a disadvantage to the ConcurrentHashMap implementation which likely doesn't have to do a full scan.

If consistent hash is efficient,why don't people use it everywhere?

I was asked some shortcommings of consistent hash. But I think it just costs a little more than a traditional hash%N hash. As the title mentioned, if consistent hash is very good, why not we just use it?
Do you know more? Who can tell me some?
Implementing consistent hashing is not trivial and in many cases you have a hash table that rarely or never needs remapping or which can remap rather fast.
The only substantial shortcoming of consistent hashing I'm aware of is that implementing it is more complicated than simple hashing. More code means more places to introduce a bug, but there are freely available options out there now.
Technically, consistent hashing consumes a bit more CPU; consulting a sorted list to determine which server to map an object to is an O(log n) operation, where n is the number of servers X the number of slots per server, while simple hashing is O(1).
In practice, though, O(log n) is so fast it doesn't matter. (E.g., 8 servers X 1024 slots per server = 8192 items, log2(8192) = 13 comparisons at most in the worst case.) The original authors tested it and found that computing the cache server using consistent hashing took only 20 microseconds in their setup. Likewise, consistent hashing consumes space to store the sorted list of server slots, while simple hashing takes no space, but the amount required is minuscule, on the order of Kb.
Why is it not better known? If I had to guess, I would say it's only because it can take time for academic ideas to propagate out into industry. (The original paper was written in 1997.)
I assume you're talking about hash tables specifically, since you mention mod N. Please correct me if I'm wrong in that assumption, as hashes are used for all sorts of different things.
The reason is that consistent hashing doesn't really solve a problem that hash tables pressingly need to solve. On a rehash, a hash table probably needs to reassign a very large fraction of its elements no matter what, possibly a majority of them. This is because we're probably rehashing to increase the size of our table, which is usually done quadratically; it's very typical, for instance, to double the amount of nodes, once the table starts to get too full.
So in consistent hashing terms, we're not just adding a node; we're doubling the amount of nodes. That means, one way or another, best case, we're moving half of the elements. Sure, a consistent hashing technique could cut down on the moves, and try to approach this ideal, but the best case improvement is only a constant factor of 2x, which doesn't change our overall complexity.
Approaching from the other end, hash tables are all about cache performance, in most applications. All interest in making them go fast is on computing stuff as quickly as possible, touching as little memory as possible. Adding consistent hashing is probably going to be more than a 2x slowdown, no matter how you look at this; ultimately, consistent hashing is going to be worse.
Finally, this entire issue is sort of unimportant from another angle. We want rehashing to be fast, but it's much more important that we don't rehash at all. In any normal practical scenario, when a programmer sees he's having a problem due to rehashing, the correct answer is nearly always to find a way to avoid (or at least limit) the rehashing, by choosing an appropriate size to begin with. Given that this is the typical scenario, maintaining a fairly substantial side-structure for something that shouldn't even be happening is obviously not a win, and again, makes us overall slower.
Nearly all of the optimization effort on hash tables is either in how to calculate the hash faster, or how to perform collision resolution faster. These are things that happen on a much smaller time scale than we're talking about for consistent hashing, which is usually used where we're talking about time scales measured in microseconds or even milliseconds because we have to do I/O operations.
The reason is because Consistent Hashing tends to cause more work on the Read side for range scan queries.
For example, if you want to search for entries that are sorted by a particular column then you'd need to send the query to EVERY node because consistent hashing will place even "adjacent" items in separate nodes.
It's often preferred to instead use a partitioning that is going to match the usage patterns. Better yet replicate the same data in a host of different partitions/formats

when to resize a hash table?

In various hash table implementations, I have seen "magic numbers" for when a mutable hash table should resize (grow). Usually this number is somewhere between 65% to 80% of the values added per allocated slots. I am assuming the trade off is that a higher number will give the potential for more collisions and a lower number less at the expense of using more memory.
My question is how is this number arrived at?
Is it arbitrary? based on testing? based on some other logic?
At a guess, most people at least start from the numbers in a book (e.g., Knuth, Volume 3), which were produced by testing. Depending on the situation, some may carry out testing afterwards, and make adjustments accordingly -- but from what I've seen, these are probably in the minority.
As I outlined in a previous answer, the "right" number also depends heavily on how you resolve collisions. For better or worse, this fact seems to be widely ignored -- people frequently don't pick numbers that are particularly appropriate for the collision resolution they use.
OTOH, the other point I found in my testing is that it only rarely makes a whole lot of difference. You can pick numbers across a fairly broad range and get pretty similar overall speed. The main thing is to be careful to avoid pushing the number too high, especially if you're using something like linear probing for collision resolution.
I think you don't want to consider "how full" the table is (how many "buckets" out of total buckets have values) but rather the number of collisions it might take to find a spot for a new item.
I read some compiler book years ago (can't remember title or authors) that suggested just using linked lists until you have more than 10 to 12 items. That would seem to support more than 10 collisions means time to re-size.
The Design and Implementation of Dynamic. Hashing for Sets and Tables in Icon suggests that an average hash chain length of 5 (in that algorithm, the average number of collisions) is enough to trigger a rehash. Seems supported by testing, but I'm not sure I'm reading the paper correctly.
It looks like the resize condition is mainly the result of testing.
That depends on the keys. If you know that your hash function is perfect for all possible keys (for example, using gperf), then you know that you'll have only few collisions, so the number is higher.
But most of the time, you don't know much about the keys except that they are text. In this case, you have to guess since you don't even have test data to figure out in advance how your hash function is behaving.
So you hope for the best. If you hash function is very bad for the keys, then you will have a lot of collisions and the point of growth will never be reached. In this case, the chosen figure is irrelevant.
If your hash function is adequate, then it should create only a few collisions (less than 50%), so a number between 65% and 80% seems reasonable.
That said: Unless your hash table must be perfect (= huge size or lots of accesses), don't bother. If you have, say, ten elements, considering these issues is a waste of time.
As far as I'm aware the number is a heuristic based on empirical testing.
With a reasonably good distribution of hash values it seems that the magic load factor is -- as you say -- usually around 70%. A smaller load factor means that you're wasting space for no real benefit; a higher load factor means that you'll use less space but spend more time dealing with hash collisions.
(Of course, if you know that your hash values are perfectly distributed then your load factor can be 100% and you'll still have no wasted space and no hash collisions.)
Collisions depend highly on data and used hash function.
Most of numbers based on heuristics or on assumption about normal distribution of hash values. (AFAIK values about 70% are typical for extendible hash tables, but one can always construct such data stream, that you get much more/less collisions)

Resources