HyperLogLog correctness on mapreduce - hadoop

Something that has been bugging me about the HyperLogLog algorithm is its reliance on the hash of the keys. The issue I have is that the paper seems to assume we have a totally random distribution of data on each partition; however, in the context where it is often used (MapReduce-style jobs), things are often distributed by their hash values, so all duplicate keys will be on the same partition. To me this means that we should actually be adding the cardinalities generated by HyperLogLog rather than using some sort of averaging technique (in the case where we partition by hashing the same thing that HyperLogLog hashes).
So my question is: is this a real issue with HyperLogLog, or have I not read the paper in enough detail?

This is a real issue if you use non-independent hash functions for both tasks.
Let's say the partitioner decides the node by the first b bits of the hashed values. If you use the same hash function for both partitioning and HyperLogLog, the algorithm will still work properly, but precision will be sacrificed. In practice it will be equivalent to using m/2^b buckets (log2 m' = log2 m - b), because the first b bits will always be the same within a partition, so only log2 m - b bits effectively vary when choosing the HLL bucket.
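A minimal sketch of that precision loss, assuming a 64-bit hash whose top b bits pick the partition and whose top p bits (p = log2 m) pick the HLL bucket when the same hash is reused; the constants and bit layout here are illustrative, not from any particular implementation:

#include <cstdint>
#include <cstdio>

constexpr int b = 4;   // partitioner uses the top 4 bits (16 partitions)
constexpr int p = 14;  // HLL uses the top 14 bits (m = 2^14 buckets)

uint32_t partition_of(uint64_t h) { return uint32_t(h >> (64 - b)); }
uint32_t hll_bucket_of(uint64_t h) { return uint32_t(h >> (64 - p)); }

int main() {
    // Within a single partition the top b bits of every hash are identical,
    // so the bucket index can only take 2^(p-b) distinct values: the HLL on
    // that partition effectively has m / 2^b usable buckets.
    uint64_t fixed_prefix = uint64_t(0b1010) << (64 - b);
    for (uint64_t x = 0; x < 4; ++x) {
        uint64_t h = fixed_prefix | (x << (64 - p)); // only lower bucket bits vary
        printf("partition=%u bucket=%u\n", partition_of(h), hll_bucket_of(h));
    }
    return 0;
}

Running this prints the same partition for every hash, and bucket indices that all share the same high bits, which is exactly why the effective bucket count drops from m to m/2^b.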

Related

Improving query access performance for unordered maps that are unchanging

I am looking for suggestions in improving the query time access for unordered maps. My code essentially just consists of 2 steps. In the first step, I populate the unordered map. After the first step, no more entries are ever added to the map. In the second step, the unordered map is only queried. Since the map is essentially unchanging, is there something that can be done to speed up the query time?
For instance, does the STL provide any function that can adjust the internal allocations in the map to improve query time access? In other words, it is possible that more than one key was mapped to the same bucket in the unordered map. If more memory were allocated to the map, then the chances of such a collision occurring would be reduced. In that sense, I am curious whether there is anything that can be done, knowing that the unordered map will remain unchanged.
If measurements show this is important for you, then I'd suggest taking measurements for other hash table implementations outside the Standard Library, e.g. google's. Using closed hashing aka open addressing may well work better for you, especially if your hash table entries are small enough to store directly in the hash table buckets.
More generally, Marshall suggests finding a good hash function. Be careful though - sometimes a generally "bad" hash function performs better than a "good" one, if it works nicely with some of the properties of your keys. For example, if you tend to have incrementing numbers, perhaps with a few gaps, then an identity (aka trivial) hash function that just returns the key can select hash buckets with far fewer collisions than a cryptographic hash that pseudo-randomly (but repeatably) scatters keys differing by as little as a single bit into uncorrelated buckets. Identity hashing can also help if you're looking up several nearby key values, as their buckets are probably nearby too and you'll get better cache utilisation. But, you've told us nothing about your keys, values, number of entries etc. - so I'll leave the rest with you.
You have two knobs that you can twist: the hash function and the number of buckets in the map. One is fixed at compile time (the hash function), and the other you can modify (somewhat) at run time.
A good hash function will give you very few collisions (non-equal values that have the same hash value). If you have many collisions, then there's not really much you can do to improve your lookup times. Worst case (all inputs hash to the same value) gives you O(N) lookup times. So that's where you want to focus your effort.
Once you have a good hash function, you can play games with the number of buckets (via rehash), which can reduce collisions further.
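As a concrete illustration of those two knobs with std::unordered_map (the element count and load factor here are made-up numbers, not recommendations):

#include <string>
#include <unordered_map>

int main() {
    std::unordered_map<int, std::string> m;

    // Step 1: populate (keys here are just an example).
    for (int i = 0; i < 1000000; ++i)
        m[i] = "value";

    // Once populated and read-only, trade memory for fewer collisions.
    m.max_load_factor(0.25f); // aim for at most ~0.25 elements per bucket on average
    m.rehash(0);              // re-bucket to satisfy the new load factor

    // Step 2: query only.
    return m.find(42) != m.end() ? 0 : 1;
}

The hash function knob is the third template parameter of std::unordered_map; swapping in a hasher better suited to your keys is the compile-time half of the tuning.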

When to use hash tables?

What are the cases when using a hash table can improve performance, and when does it not? And what are the cases when hash tables are not applicable?
What are the cases when using a hash table can improve performance, and when does it not?
If you have reason to care, implement using hash tables and whatever else you're considering, put your actual data through, and measure which performs better.
That said, if the hash table has the operations you need (i.e. you're not expecting to iterate it in sorted order, or compare it quickly to another hash table), and has millions or more (billions, trillions...) of elements, then it'll probably be your best choice. But a lot depends on the hash table implementation (especially the choice of closed vs. open hashing), object size, hash function quality and calculation cost/runtime, comparison cost, the oddities of your computer's memory performance at different cache levels... in short: too many things to make even an educated guess a better choice than measuring, when it matters.
And what are the cases when hash tables are not applicable?
Mainly when:
The input can't be hashed (e.g. you're given binary blobs and don't know which bits in there are significant, but you do have an int cmp(const T&, const T&) function you could use for a std::map), or
the available/possible hash functions are very collision prone, or
you want to avoid worst-case performance hits for:
handling lots of hash-colliding elements (perhaps "engineered" by someone trying to crash or slow down your software)
resizing the hash table: unless presized to be large enough (which can be wasteful and slow when excessive memory is used), the majority of implementations will outgrow the array they're using for the hash table every now and then, then allocate a bigger array and copy content across: this can make the specific insertions that trigger the rehashing much slower than the normal O(1) behaviour, even though the average is still O(1); if you need more consistent behaviour in all cases, something like a balanced binary tree may serve (there's a small sketch of this growth pattern after this list)
your access patterns are quite specialised (e.g. frequently operating on elements with keys that are "nearby" in some specific sort order), such that cache efficiency is better for other storage models that keep them nearby in memory (e.g. bucket sorted elements), even if you're not exactly relying on the sort order for e.g. iteration
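A minimal sketch of that occasional-rehash cost using std::unordered_set (the element count is arbitrary); it prints each insertion at which the implementation had to grow and re-distribute its bucket array, while the average cost stays amortised O(1):

#include <cstdio>
#include <unordered_set>

int main() {
    std::unordered_set<int> s;
    std::size_t buckets = s.bucket_count();
    for (int i = 0; i < 1000000; ++i) {
        s.insert(i);
        if (s.bucket_count() != buckets) {
            // This particular insert paid for allocating a bigger bucket
            // array and rehashing every existing element into it.
            buckets = s.bucket_count();
            printf("after %d inserts: %zu buckets\n", i + 1, buckets);
        }
    }
    return 0;
}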
We use hash tables to get an access time of O(1). Imagine a dictionary. When you are looking for a word, e.g. "happy", you jump straight to 'H'. Here the hash function is determined by the starting letter. Then you look for "happy" within the H bucket (actually the H bucket, then the HA bucket, then the HAP bucket and so on).
It doesn't make sense to use Hash Tables when your data is ordered or needs ordering like sorted numbers. (Alphabets are ordered ABCD....XYZ but it wouldn't matter if you switched A and Z, provided you know it is switched in your dictionary.)

Perfect Hash Building

Why don't we use SHA-1, md5Sum and other standard cryptographic hashes for hashing? They are smart enough to avoid collisions and are also not reversible. So rather than coming up with a set of new hash functions, which might have collisions, why don't we use them?
The only reason I am able to think of is that they require a large key, say 32 bits. But they still avoid collisions, so the lookup will definitely be O(1).
Because they are very slow, for two reasons:
They aim to be cryptographically secure, not only collision-resistant in general
They produce a much larger hash value than what you actually need in a hash table
Because they handle unstructured data (octet / byte streams) but the objects you need to hash are often structured and would require linearization first
Why don't we use SHA-1, md5Sum and other standard cryptographic hashes for hashing? They are smart enough to avoid collisions...
Wrong because:
Two inputs can still happen to have the same hash value. Say the hash value is 32 bits: a great general-purpose hash routine (i.e. one that doesn't utilise insights into the set of actual keys) still has at least a 1/2^32 chance of returning the same hash value for any 2 keys, then a 2/2^32 chance of colliding with one of those as a third key is hashed, 3/2^32 for the fourth, etc.
Having distinct hash values is a very different thing from having the hash values map to distinct hash buckets in a hash table. Hash values are generally modded into the table size to select a bucket, so at best - and again for general-purpose hashing - the chance of a collision when adding an element to a hash table is #preexisting-elements / table-size.
So rather than coming up with a set of new hash functions, which might have collisions, why don't we use them?
Because speed is often the programmer's goal when choosing to use a hash table over say a binary tree. If the hash values are mathematically complicated to calculate, they may take a lot longer than using a slightly more (but still not particularly) collision prone but faster-to-calculate hash function. That said, there are times when more effort on the hashing can pay off - for example, when the hash table exists on magnetic disk and the I/O costs of seeking & reading records dwarfs hash calculation effort.
antti makes an interesting point about data too... general purpose hashing routines often work on blocks of binary data with a specific starting address and a number of bytes (they may even require that number of bytes to be a multiple of 2 or 4). In many applications, data that needs to be hashed will be intermingled with data that must not be included in the hash - such as cached values, file handles, pointers/references to other data or virtual dispatch tables etc.. A common solution is to hash the desired fields separately and combine the hash keys - perhaps using exclusive-or. As there can be bit fields that should be hashed in the same byte of memory as other data that should not be hashed, you sometimes need custom code to extract those values. Still, even if some copying and padding was required beforehand, each individual field could eventually be hashed using md5, SHA-1 or whatever and those hash values could be similarly combined, so this complication doesn't really categorically rule out the approach you're interested in.
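A minimal sketch of that "hash the wanted fields separately and combine" approach, assuming std::hash for each field and a simple rotate-and-XOR combiner (the Record struct and the combiner are illustrative, not from any library):

#include <cstdint>
#include <cstdio>
#include <functional>
#include <string>

struct Record {
    int id;               // part of the record's identity: hash it
    std::string name;     // part of the record's identity: hash it
    void* cached_lookup;  // transient cache, must NOT be hashed
};

// Combine two hash values; the rotate keeps identical fields from
// cancelling each other out the way a plain XOR would.
std::size_t combine(std::size_t a, std::size_t b) {
    return (a << 7 | a >> (sizeof(a) * 8 - 7)) ^ b;
}

std::size_t hash_record(const Record& r) {
    std::size_t h = std::hash<int>{}(r.id);
    h = combine(h, std::hash<std::string>{}(r.name));
    return h;
}

int main() {
    Record r{42, "example", nullptr};
    printf("%zu\n", hash_record(r));
}

Each per-field hash here could just as well be an md5 or SHA-1 of that field's bytes, as noted above; the combining step is the same either way.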
The only reason I am able to think of is that they require a large key, say 32 bits.
All other things being equal, the larger the key the better, though if the hash function is mathematically ideal then any N of its bits - where 2^N >= # hash buckets - will produce minimal collisions.
But they still avoid collisions, so the lookup will definitely be O(1).
Again, wrong as mentioned above.
(BTW... I stress general-purpose in a couple places above. That's just because there are trivial cases where you might have some insight into the keys you'll need to hash that allows you to position them perfectly within the available hash buckets. For example, if you knew the keys were the numbers 1000, 2000, 3000 etc. up to 100000 and that you had at least 100 hash buckets, you could trivially define your hash function as x/1000 and know you'd have perfect hashing sans collisions. This situation of knowing that all your keys map to distinct hash table buckets is known as "perfect hashing" - as per your question title - a good general-purpose hash like md5 is not a perfect hash, and indeed it makes no sense to talk about perfect hashing without knowing the complete set of possible keys).
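The 1000, 2000, ... 100000 example above as code, just to make the "perfect hashing" point concrete (the key set and bucket count are exactly the assumptions stated in the text):

#include <cassert>
#include <cstddef>

// Keys are known to be exactly 1000, 2000, ..., 100000 and there are at
// least 100 buckets, so this trivial function is a perfect hash: every
// key maps to its own bucket and collisions are impossible.
std::size_t perfect_hash(int key) {
    return static_cast<std::size_t>(key / 1000 - 1); // buckets 0..99
}

int main() {
    for (int key = 1000; key <= 100000; key += 1000)
        assert(perfect_hash(key) <= 99);
    return 0;
}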

Choosing a Data structure for very large data

I have x (millions) positive integers, where their values can be as big as allowed (+2,147,483,647). Assuming they are unique, what is the best way to store them for a lookup intensive program.
So far I have thought of using a binary AVL tree or a hash table, where the integer is the key to the mapped data (a name). However, I am not too sure whether I can handle such large keys in such large quantity with a hash table (wouldn't that create a >0.8 load factor, in addition to being prone to collisions?)
Could I get some advice on which data structure might be suitable for my situation?
The choice of structure depends heavily on how much memory you have available. I'm assuming based on the description that you need lookup but not to loop over them, find nearest, or other similar operations.
Best is probably a bucketed hash table. By placing hash collisions into buckets and keeping separate arrays in the bucket for keys and values, you can both reduce the size of the table proper and take advantage of CPU cache speedup when searching a bucket. Linear search within a bucket may even end up faster than binary search!
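A minimal sketch of such a bucketed table for int keys, with keys and values kept in separate parallel arrays per bucket so the key scan stays contiguous in memory (the hash mix and sizes are illustrative, not a tuned implementation):

#include <cstdint>
#include <cstdio>
#include <vector>

struct BucketedTable {
    struct Bucket {
        std::vector<int32_t> keys;     // scanned linearly; cache-friendly
        std::vector<uint32_t> values;  // e.g. an index into a separate name array
    };
    std::vector<Bucket> buckets;

    explicit BucketedTable(std::size_t bucket_count) : buckets(bucket_count) {}

    std::size_t bucket_of(int32_t key) const {
        // Cheap integer mixing; any decent integer hash would do here.
        return (static_cast<uint32_t>(key) * 2654435761u) % buckets.size();
    }

    void insert(int32_t key, uint32_t value) {
        Bucket& b = buckets[bucket_of(key)];
        b.keys.push_back(key);
        b.values.push_back(value);
    }

    bool find(int32_t key, uint32_t* out) const {
        const Bucket& b = buckets[bucket_of(key)];
        for (std::size_t i = 0; i < b.keys.size(); ++i)
            if (b.keys[i] == key) { *out = b.values[i]; return true; }
        return false;
    }
};

int main() {
    BucketedTable t(1u << 20);   // ~1M buckets for a few million keys
    t.insert(2147483647, 0);     // value 0 = index of the mapped name
    uint32_t v = 0;
    printf("%d\n", t.find(2147483647, &v) ? static_cast<int>(v) : -1);
}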
AVL trees are nice for data sets that are read-intensive but not read-only AND require ordered enumeration, find-nearest and similar operations, but they're an annoying amount of work to implement correctly. You may get better performance with a B-tree because of CPU cache behavior, though, especially a cache-oblivious B-tree algorithm.
Have you looked into B-trees? The efficiency runs between log_m(n) and log_(m/2)(n), so if you choose m to be around 8-10 you should be able to keep your search depth below 10.
Bit vector, with the index set if the number is present. You can tweak it to keep the number of occurrences of each number. There is a nice column about bit vectors in Bentley's Programming Pearls.
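A minimal sketch of that bit vector sized for the full non-negative 31-bit range from the question; it costs a fixed 2^31 bits = 256 MB regardless of how many numbers are actually present (the helper lambdas are just illustration):

#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    // 2^31 possible values, one bit each, packed 64 per word.
    std::vector<uint64_t> bits((1ull << 31) / 64, 0);

    auto set  = [&](uint32_t x) { bits[x >> 6] |= 1ull << (x & 63); };
    auto test = [&](uint32_t x) { return (bits[x >> 6] >> (x & 63)) & 1ull; };

    set(42u);
    set(2147483647u);
    printf("%d %d %d\n", int(test(42u)), int(test(43u)), int(test(2147483647u)));
    return 0;
}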
If memory isn't an issue, a hash map is probably your best bet. Hash map lookups are O(1) on average, meaning that as you scale up the number of items to be looked up, the time it takes to find a value stays the same.
A map where the key is the int, and the value is the name.
Do try hash tables first. There are some variants that can tolerate being very dense without significant slowdown (like Brent's variation).
If you only need to store the 32-bit integers and not any associated record, use a set and not a map, like hash_set in most C++ libraries. It would use only 4-byte records plus some constant overhead and a little slack to avoid being 100% full. In the worst case, to handle 'millions' of numbers you'd need a few tens of megabytes. Big, but nothing unmanageable.
If you need it to be much tighter, just store them sorted in a plain array and use binary search to fetch them. It will be O(log n) instead of O(1), but for 'millions' of records it's still just twentysomething steps to get any one of them. In C you have bsearch(), which is as fast as it can get.
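A minimal sketch of the sorted-array approach in C++, where std::binary_search plays the role of bsearch() (the sample keys are made up):

#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> keys = {2147483647, 7, 2000, 1000};
    std::sort(keys.begin(), keys.end());  // sort once, after loading

    // O(log n) membership test: for a few million keys that is
    // roughly twenty-something comparisons per lookup.
    bool present = std::binary_search(keys.begin(), keys.end(), 2000);
    printf("%d\n", static_cast<int>(present));
    return 0;
}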
edit: just saw in your question that you talk about some 'mapped data (a name)'. Are those names unique? Do they also have to be in memory? If yes, they would definitely dominate the memory requirements. Even so, if the names are typical English words, most would be 10 bytes or less, keeping the total size in the 'tens of megabytes'; maybe up to a hundred megs, still very manageable.

When do hashes collide?

I understand that according to the pigeonhole principle, if the number of items is greater than the number of containers, then at least one container will have more than one item. Does it matter which container it will be? How does this apply to MD5, SHA1, SHA2 hashes?
No, it doesn't matter which container it is, and in fact this is not that important to cryptographic hashes; much more important is the birthday paradox, which says that you only need to hash about sqrt(numberNeededByPigeonholePrinciple) values, on average, before finding a collision.
Thus, the hash needs to be large enough that the square root of the search space is too large to brute-force. The square root of the search space for SHA1 is 2^80, and as of March 2012, no two values have ever been found with the same SHA1 hash (though I predict that will happen within the next year or two...); the same goes for SHA2, a family of hashes which all have an even larger search space. MD5 has been broken for a while, though.
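A small worked check of that square-root rule, using the standard birthday approximation P(collision) ≈ 1 - exp(-k(k-1)/(2N)) for k uniformly random values in a space of size N; the 32-bit figures are just for illustration:

#include <cmath>
#include <cstdio>

int main() {
    // Probability of at least one collision among k random values
    // drawn from a space of size N (birthday approximation).
    auto p_collision = [](double k, double N) {
        return 1.0 - std::exp(-k * (k - 1.0) / (2.0 * N));
    };

    double N32 = std::pow(2.0, 32);                          // a 32-bit hash
    printf("k = 2^16: %.2f\n", p_collision(65536.0, N32));   // ~0.39
    printf("k = 2^17: %.2f\n", p_collision(131072.0, N32));  // ~0.86

    // For SHA-1, N = 2^160, so k has to approach 2^80 (the square root
    // of the search space) before this probability becomes noticeable.
    return 0;
}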
If you have more items to hash than you have slots, then you'll have hash collisions. But if you have a poor hashing algorithm, then you'll see collisions even when the items / slots ratio is very small. A good hashing algorithm (including most of the ones you'll see in the wild) will attempt to spread the resulting hashes over the entire output space as evenly as possible, and thus minimize collisions.
Note that a hash collision is not the end of the world. When used in a hash table, for instance, it just means that more than one item is stored in a slot, and the table code will have to traverse a little bit more to find or add the target item, increasing lookup time slightly.
You'll see people refer to MD5 as a "broken" hashing algorithm, when in reality, it's just a poor one to use as a cryptographic hash. It'll be better than one you build yourself.
The point of a hash function is to randomly distribute items into containers. For any good hash function, it doesn't/shouldn't "matter" which container is which as they must be indistinguishable.
This does not apply to "perfect hash" implementations which attempt to do better than random distribution — unlike the algorithms you mentioned.
As Michael mentioned, collisions happen LONG before there are as many items as slots. You must have graceful collision handling (or a perfect hash) if you want to handle the birthday paradox.
I think which application you're using the hash function for is an important distinction. Frequent collision in hashing containers, for example, can degrade performance. Frequent collision in cryptography will have far more devastating consequences (see: cryptographic hash function on Wikipedia).
Collisions happen relatively easily even with "decent" hashing algorithms. For example, in Java,
String s = new String(new char[size]);
always hashes to 0. That is, all strings containing only \0 hash to 0 in Java.
As for "does it matter which container will it be?", again it depends on the application. You can design hash functions that would hash "similar" objects to nearby values. This is useful when you want to search for similar objects, for example. Just hash them all and see where they fall. In this case, collisions or near-collisions are desirable, because it groups objects that are similar.
In other applications, you want even the slightest change in the object to result in an entirely different hash value. This is the case in cryptography, for example, where you want to be as certain as possible that something has not been modified. It is far more difficult to find different objects that hash to the same value in this case.
Depending on your application, cryptographic hashes like MD5, SHA1/2 etc. may not be the ideal choice, precisely because they appear as if entirely random, thus giving you collisions as predicted by the birthday paradox. Traditionally, one reason for using simple hashes based on the remainder operation is that keys were expected to be serial numbers or similar, so that a remainder operation would sustain fewer collisions than expected at random. E.g. if the keys are the integers 1..1000, you might have no collisions at all in a container of size 1009 if your hash function is the key mod 1009. People would sometimes hand-tune systems by carefully picking container size and hash function to achieve an even split.
Of course, if you have to worry about people maliciously choosing keys that will cause you difficulty, or an upstream system sending you very biased keys (because e.g. it has its own hash table and decides to process all keys that hash to X at once), you may wish to use a hash based on a keyed cryptographic hash function to defend against this.
