Are there optimal sizes for a Hashtbl in OCaml? - performance

Say I need to store 20 keys/values, would it be more efficient to use a power of 2, e.g. 32? I read a paper where the authors used a size of 251 (for an unknown number of keys/values). Is this just a random number, or is there some reasoning behind it?
I’m talking about the n in Hashtbl.create n.

It's not entirely clear what you're asking. Since you ask about Hashtbl by name, I assume you're talking about the standard hash table module. This module always allocates tables in power-of-2 sizes. So you don't have to worry about it.
There are two basic "extra good" sizes for hash tables. Powers of two are good because they make it easy to find your hash bucket. The last step of a hashing procedure is to take the hash value modulo the size of your table. If the table size is a power of two, this modulo operation can be done very quickly with a masking operation. I'm not sure this matters in today's world, unless your hash function itself is very fast to compute.
The second good value is a prime number. A prime number is good because it tends to spread values throughout the table. If you have hash values that happen to be predominantly a multiple of some number, this will cause dense clusters in the hash table unless the hash table size is relatively prime to the predominant number. A large-ish prime number is relatively prime to virtually everything, so it prevents clustering. So, 251 is good because it's a prime number.
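For the OCaml case specifically, you can check what the standard library actually does with the n you pass to create by looking at Hashtbl.stats (available since 4.00). A minimal sketch, assuming a recent compiler; the keys and values are just filler:

(* Inspect the real bucket count behind Hashtbl.create 20. *)
let () =
  let tbl = Hashtbl.create 20 in
  List.iter (fun k -> Hashtbl.add tbl k (k * k)) (List.init 20 (fun i -> i));
  let s = Hashtbl.stats tbl in
  Printf.printf "bindings = %d, buckets = %d\n"
    s.Hashtbl.num_bindings s.Hashtbl.num_buckets
(* On current compilers the bucket count printed is a power of two, and the
   table resizes itself as the load factor grows, so the n you pass to
   create is only a starting hint. *)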

Related

Use of universal hashing

I'm trying to understand the usefulness of universal hashing over normal hashing, other than that the function is randomly chosen every time, while reading Cormen's book.
From what I understand, in universal hashing we choose the function to be
H(x) = ((ax + b) mod p) mod m
with p being a prime number larger than all the keys, m the size of the hash table, and a, b random numbers.
So, for example, if I want to read the IDs of 80 people, and each ID has a value in [0, 200], then m would be 80 and p would be 211 (the next prime number). Right?
I could use, let's say, the function
H(x) = ((100x + 50) mod 211) mod 80
But why would this help? There is a high chance that I'm going to end up with a lot of empty slots in the table, taking up space for no reason. Wouldn't it be more useful to lower the number m in order to get a smaller table, so space isn't wasted?
Any help appreciated
I think the best way to answer your question is to abstract away from the particulars of the formula that you're using to compute hash codes and to think more about, generally, what the impact is of changing the size of a hash table.
The parameter m that you're considering tuning adjusts how many slots are in your hash table. Let's imagine that you're planning on dropping n items into your hash table. The ratio n / m is called the load factor of the hash table and is typically denoted by the letter α.
If you have a table with a high load factor (large α, small m), then you'll have less wasted space in the table. However, you'll also increase the cost of doing a lookup, since with lots of objects distributed into a small space you're likely to get a bunch of collisions that will take time to resolve.
On the other hand, if you have a table with a low load factor (small α, large m), then you decrease the likelihood of collisions and therefore will improve the cost of performing lookups. However, if α gets too small - say, you have 1,000 slots per element actually stored - then you'll have a lot of wasted space.
Part of the engineering aspect of crafting a good hash table is figuring out how to draw the balance between these two options. The best way to see what works and what doesn't is to pull out a profiler and measure how changes to α change your runtime.
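To make the trade-off concrete, here is a small sketch (not from the question or from Cormen) that uses the question's own family H(x) = ((ax + b) mod p) mod m with a = 100, b = 50, p = 211, and counts how many of 80 distinct IDs land in an already-occupied slot for a few choices of m. It's written in OCaml purely for illustration:

(* Hash family from the question. *)
let p = 211
let h ~a ~b ~m x = ((a * x + b) mod p) mod m

(* How many keys collide (land in an occupied slot) for a given table size m.
   A larger m (smaller load factor n/m) should give fewer collisions, at the
   cost of more empty slots. *)
let collisions ~a ~b ~m keys =
  let slots = Array.make m false in
  List.fold_left
    (fun coll x ->
       let i = h ~a ~b ~m x in
       if slots.(i) then coll + 1 else (slots.(i) <- true; coll))
    0 keys

let () =
  Random.self_init ();
  (* 80 distinct IDs drawn from 0..200: Fisher-Yates shuffle, then a prefix. *)
  let ids = Array.init 201 (fun i -> i) in
  for i = 200 downto 1 do
    let j = Random.int (i + 1) in
    let t = ids.(i) in ids.(i) <- ids.(j); ids.(j) <- t
  done;
  let keys = Array.to_list (Array.sub ids 0 80) in
  List.iter
    (fun m ->
       Printf.printf "m = %3d  collisions = %d\n" m (collisions ~a:100 ~b:50 ~m keys))
    [40; 80; 160; 320]

Running it a few times should show the collision count dropping as m grows past n = 80, which is exactly the alpha = n / m trade-off described above.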

Shuffling a huge range of numbers using minimal storage

I've got a very large range/set of numbers, (1..1236401668096), that I would basically like to 'shuffle', i.e. randomly traverse without revisiting the same number. I will be running a Web service, and each time a request comes in it will increment a counter and pull the next 'shuffled' number from the range. The algorithm will have to accommodate the server going offline, restarting traversal from the persisted value of the counter (something like how you can seed a pseudo-random number generator, and get the same pseudo-random number given the seed and which iteration you are on).
I'm wondering if such an algorithm exists or is feasible. I've seen the Fisher-Yates shuffle, but its first step is to "write down the numbers from 1 to N", which would take terabytes of storage for my entire range. Generating a pseudo-random number for each request might work for a while, but as the database/tree gets full, collisions will become more common and could degrade performance (already a 0.08% chance of collision after 1 billion hits, according to my calculation). Is there a more ideal solution for my scenario, or is this just a pipe dream?
The reason for the shuffling is that being able to correctly guess the next number in the sequence could lead to a minor DoS vulnerability in my app, but also because the presentation layer will look much nicer with a wider number distribution (I'd rather not go into details about exactly what the app does). At this point I'm considering just using a PRNG and dealing with collisions, or shuffling range slices (starting with (1..10000000).to_a.shuffle, then (10000001..20000000).to_a.shuffle, etc. as each range's numbers start to run out).
Any mathemagicians out there have any better ideas/suggestions?
Concatenate a PRNG or LFSR sequence with /dev/random bits
There are several algorithms that can generate pseudo-random numbers with arbitrarily large and known periods. The two obvious candidates are the linear congruential generator (LCG) and the LFSR, but there are more, such as the Mersenne Twister.
The period of these generators can easily be chosen to fit your requirements, and then you simply won't have collisions.
You could deal with the predictable behavior of PRNGs and LFSRs by adding 10, 20, or 30 bits of cryptographically hashed entropy from an interface like /dev/random. Because the deterministic part of your number is known to be unique, it makes no difference if you ever repeat the actually random part of it.
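As a concrete illustration of the LCG route (an adaptation, not the answerer's code): rather than building a generator whose period is exactly 1,236,401,668,096, it is often easier to take a power-of-two modulus just above the range and "cycle walk", i.e. skip any state that falls outside it. A rough OCaml sketch, with illustrative constants and assuming a 64-bit platform:

(* Full-period LCG over 0 .. 2^41 - 1, plus cycle walking to skip states
   outside 1 .. 1_236_401_668_096.  By the Hull-Dobell theorem the period is
   the full 2^41 because the increment is odd and the multiplier is 1 mod 4,
   so no value repeats before the whole range has been visited. *)
let bits = 41
let mask = (1 lsl bits) - 1                 (* modulus 2^41, applied by masking *)
let a    = 2862933555777941757 land mask    (* multiplier, a mod 4 = 1 *)
let c    = 1442695040888963407 land mask    (* increment, odd *)
let n    = 1_236_401_668_096                (* size of the range we actually want *)

(* One LCG step; masking keeps the low 41 bits even when the product wraps. *)
let step x = (a * x + c) land mask

(* From the persisted value (the last number handed out, or 0 initially),
   walk past out-of-range states until the next value in 1..n appears.
   Since n is more than half of 2^41 this takes under two steps on average. *)
let rec next state =
  let s = step state in
  if s >= 1 && s <= n then s else next s

The value handed out is the generator's state, so persisting the last number served is enough to resume after a restart. The sequence is still predictable on its own, which is where the extra /dev/random bits suggested above come in.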
Divide and conquer? Break the range down into manageable chunks and shuffle them. You could divide the number range, e.g., by value modulo n. The list of groups is easy to construct and quite small, depending on n. Once a group is exhausted, you can use the next one.
For example, if you choose an n of 1000, you create 1000 different groups. Pick a random number between 1 and 1000 (let's call this x) and shuffle the numbers whose value modulo 1000 equals x. Once you have exhausted that range, you can choose a new random number between 1 and 1000 (excluding x, obviously) to get the next subset to shuffle. It shouldn't exactly be challenging to keep track of which numbers of the 1..1000 range have already been used, so you'd just need a repeatable shuffle algorithm for the numbers in the subset (e.g. Fisher-Yates on their "indices").
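A sketch of the mechanics (illustrative, not the answerer's code): the members of one residue class are easy to enumerate, and a Fisher-Yates shuffle driven by an explicit, persisted seed makes the order repeatable. Note that with n = 1000 and the full 1.2e12 range each chunk still has about 1.2 billion members, so in practice a much larger modulus would be needed to keep a chunk in memory:

(* Repeatable shuffle of the numbers in 1..n_total whose value mod groups = x. *)
let shuffle_group ~seed ~n_total ~groups ~x =
  (* Members of the residue class: x, x + groups, x + 2*groups, ... <= n_total. *)
  let count = (n_total - x) / groups + 1 in
  let a = Array.init count (fun i -> x + i * groups) in
  (* Fisher-Yates driven by an explicit PRNG state, so the same seed always
     reproduces the same order. *)
  let st = Random.State.make [| seed |] in
  for i = count - 1 downto 1 do
    let j = Random.State.int st (i + 1) in
    let t = a.(i) in a.(i) <- a.(j); a.(j) <- t
  done;
  a
(* e.g. shuffle_group ~seed:42 ~n_total:1_000_000 ~groups:1000 ~x:7 *)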
I guess the best option is to use a GUID/UUID. They are made for this type of thing, and it shouldn't be hard to find an existing implementation to suit your needs.
While collisions are theoretically possible, they are extremely unlikely. To quote Wikipedia:
The probability of one duplicate would be about 50% if every person on earth owns 600 million UUIDs

Mapping function

I have a set of 128-bit numbers and the size of the set is < 2^32, so theoretically I can have a mapping function that maps all the 128-bit numbers to 32-bit numbers. How can I construct the mapping function?
Seems like you are looking for a minimal perfect hash which maps n keys to n consecutive integers.
The wiki page linked in the above sentence mentions two libraries which implement this.
Also see this for more detail: http://burtleburtle.net/bob/hash/perfect.html
Without knowing the nature of the input data, it's impossible to give the optimal hashing algorithm. But if the input is evenly distributed then you could use the lower 32 bits of the input. This means the possibility of collisions, so you have to deal with that.
The generic construction is to keep all your 128-bit values in a big array, sorted in ascending order. Then, each value is "mapped" to its index in the array. To "compute" the map, you do a binary search in the array, to get the precise index of the value in the array. With 2^32 values, the array has size 64 GB, and the binary search entails 35-or-so lookups in the array.
In all generality you cannot really do better than that. However, if your 128-bit values have a reasonably uniform spread (it depends on where they come from), then the big array structure can be compressed by a large margin, especially if you can guarantee that all inputs to your map will always be part of the set of 128-bit values; my bet is that you can trim it down to a couple of gigabytes -- but the lookup will be more expensive.
For a more practical solution, you will have to work with the structure of your 128-bit values: where they come from, what they represent...
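For what the "big sorted array" construction looks like in code, here is a small OCaml sketch (assumptions: the 128-bit values are stored as (high, low) pairs of int64, and OCaml >= 4.08 for Int64.unsigned_compare):

type u128 = int64 * int64

(* Unsigned comparison, high half first. *)
let compare_u128 (h1, l1) (h2, l2) =
  match Int64.unsigned_compare h1 h2 with
  | 0 -> Int64.unsigned_compare l1 l2
  | c -> c

(* values must be sorted with compare_u128 and hold fewer than 2^32 entries;
   returns Some index (the 32-bit image) or None if v is not in the set. *)
let map_to_32 (values : u128 array) (v : u128) : int option =
  let rec go lo hi =
    if lo > hi then None
    else
      let mid = (lo + hi) / 2 in
      match compare_u128 v values.(mid) with
      | 0 -> Some mid
      | c when c < 0 -> go lo (mid - 1)
      | _ -> go (mid + 1) hi
  in
  go 0 (Array.length values - 1)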
Set the position of your number to the division of its value by 2^32.

Efficiently estimating the number of unique elements in a large list

This problem is a little similar to the one solved by reservoir sampling, but not the same. I think it's also a rather interesting problem.
I have a large dataset (typically hundreds of millions of elements), and I want to estimate the number of unique elements in this dataset. There may be anywhere from a few, to millions of unique elements in a typical dataset.
Of course the obvious solution is to maintain a running hashset of the elements you encounter and count them at the end. This would yield an exact result, but would require me to carry a potentially large amount of state with me as I scan through the dataset (i.e. all unique elements encountered so far).
Unfortunately, in my situation this would require more RAM than is available to me (noting that the dataset may be far larger than available RAM).
I'm wondering if there would be a statistical approach to this that would allow me to do a single pass through the dataset and come up with an estimated unique element count at the end, while maintaining a relatively small amount of state while I scan the dataset.
The input to the algorithm would be the dataset (an Iterator in Java parlance), and it would return an estimated unique object count (probably a floating point number). It is assumed that these objects can be hashed (i.e. you can put them in a HashSet if you want to). Typically they will be strings or numbers.
You could use a Bloom Filter for a reasonable lower bound. You just do a pass over the data, counting and inserting items which were definitely not already in the set.
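A rough OCaml sketch of that idea (the bit-array size and the crude salted second hash are illustrative choices, not part of the answer):

(* Count elements that were definitely not in the filter before insertion. *)
let bloom_lower_bound ~num_bits (elements : string Seq.t) =
  let bits = Bytes.make num_bits '\000' in    (* one byte per bit, for simplicity *)
  let positions e =
    [ Hashtbl.hash e mod num_bits;
      Hashtbl.hash (e ^ "#salt") mod num_bits ]   (* crude second hash *)
  in
  let count = ref 0 in
  Seq.iter
    (fun e ->
       let ps = positions e in
       (* If any position is still unset, this element cannot have been seen. *)
       let definitely_new = List.exists (fun i -> Bytes.get bits i = '\000') ps in
       if definitely_new then incr count;
       List.iter (fun i -> Bytes.set bits i '\001') ps)
    elements;
  !count

Every element counted here was genuinely unseen, so the result can only undercount, which is what makes it a lower bound.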
This problem is well-addressed in the literature; a good review of various approaches is http://www.edbt.org/Proceedings/2008-Nantes/papers/p618-Metwally.pdf. The simplest approach (and most compact for very high accuracy requirements) is called Linear Counting. You hash elements to positions in a bitvector just like you would a Bloom filter (except only one hash function is required), but at the end you estimate the number of distinct elements by the formula D = -total_bits * ln(unset_bits/total_bits). Details are in the paper.
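A minimal OCaml sketch of Linear Counting as just described (Hashtbl.hash stands in for a real hash function, and total_bits is whatever size you can afford; if every bit ends up set, the bitmap was too small and the estimate blows up):

(* Hash each element to a position in a bitvector, then estimate
   D = -total_bits * ln(unset_bits / total_bits). *)
let estimate_distinct ~total_bits (elements : string Seq.t) =
  let bits = Bytes.make total_bits '\000' in   (* one byte per bit, for simplicity *)
  Seq.iter
    (fun e -> Bytes.set bits (Hashtbl.hash e mod total_bits) '\001')
    elements;
  let unset = ref 0 in
  Bytes.iter (fun c -> if c = '\000' then incr unset) bits;
  let m_f = float_of_int total_bits in
  -. m_f *. log (float_of_int !unset /. m_f)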
If you have a hash function that you trust, then you could maintain a hashset just like you would for the exact solution, but throw out any item whose hash value is outside of some small range. E.g., use a 32-bit hash, but only keep items where the first two bits of the hash are 0. Then multiply by the appropriate factor at the end to approximate the total number of unique elements.
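Sketching that in OCaml (the 1/4 sampling fraction, and Hashtbl.hash, which returns values below 2^30, are illustrative choices):

(* Keep only elements whose hash falls in a fixed slice of the hash range,
   count the kept distinct elements exactly, then scale up. *)
let sampled_estimate (elements : string Seq.t) =
  let keep h = h lsr 28 = 0 in        (* top two bits of a 30-bit hash are 0: ~1/4 *)
  let kept = Hashtbl.create 1024 in
  Seq.iter
    (fun e ->
       let h = Hashtbl.hash e in
       if keep h then Hashtbl.replace kept e ())
    elements;
  4 * Hashtbl.length kept             (* scale by the inverse sampling fraction *)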
Nobody has mentioned the approximate algorithm designed specifically for this problem: HyperLogLog.

Does a hash function output need to be bounded less than the number of buckets?

I was reading about this person's interview "at a well-known search company".
http://asserttrue.blogspot.com/2009/05/one-of-toughest-job-interview-questions.html
He was asked a question which led to him implementing a hash table. He said the following:
HASH = INITIAL_VALUE;
FOR EACH ( CHAR IN WORD ) {
    HASH *= MAGIC_NUMBER
    HASH ^= CHAR
    HASH %= BOUNDS
}
RETURN HASH
I explained that the hash table array length should be prime, and the BOUNDS number is less than the table length, but coprime to the table length.
Why should the BOUNDS number be less than the number of buckets? What does being coprime to the table length do? Isn't it supposed to be coprime to the BOUNDS?
I would hazard that he is completely wrong. BOUNDS should be the number of buckets or the last few buckets are going to be underused.
Further, the bounding of the output to the number of buckets should be OUTSIDE the hash function. This is an implementation detail of that particular hash table. You might have a very large table using lots of buckets and another using few. Both should share the same string->hash function.
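To illustrate that separation, here is an OCaml sketch (assuming OCaml >= 4.13 for String.fold_left; the magic number and the table sizes are arbitrary):

(* The string -> hash function knows nothing about any table:
   a polynomial rolling hash with no BOUNDS inside the loop. *)
let hash_string (s : string) : int =
  String.fold_left (fun h c -> (h * 31) lxor Char.code c) 17 s

(* Reducing the hash to a bucket index is the table's job; two tables of
   different sizes share the same hash function and differ only here. *)
let bucket_index ~num_buckets s = (hash_string s land max_int) mod num_buckets

let () =
  let s = "example" in
  Printf.printf "small table: %d, large table: %d\n"
    (bucket_index ~num_buckets:97 s)
    (bucket_index ~num_buckets:10_000 s)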
Further, if you read the page that you linked to, it is quite interesting. I would have implemented his hash table with something like 10,000 buckets (for those who haven't read it, the article suggests ~4,000,000,000 buckets to store 1,000,000 or so possible words). For collisions, each bucket has a vector of word structures, each containing a count, a plaintext string, and a hash (unique within the bucket). This would use far less memory and work better with modern caches, since your working set would be much smaller.
To further reduce memory usage you could experiment with culling words from the hash during the input phase that look like they are below the top 100,000 based on the current count.
I once interviewed for a job at a well-known search company. I got the exact same question. I tried to tackle it by using hash table.
One thing that I learnt from that interview was that at a well-known search company, you do not propose hashes as solutions. You can use any tree-like structure you like, but you always use an ordered structure, not a hash table.
A simple explicit suffix tree would only use, worst case, maybe 500k of memory (with a moderately efficient implementation, 4-byte character encodings, and relatively long English words that have minimal overlap) to do the same thing.
I think the guy in the article outsmarted himself.
