When building a hash table using linear probing for collision resolution, is the extra term always added to the hash or only when a collision occurs? - algorithm

I'm building a table, where an attempt to insert a new key into the table when there is a collision follows the sequence { hash(x) + i, where i = 1,2,3, ... }. If I'm building a hash table using linear probing would my Insert() algorithm do something like this:
hashValue = hash(x)
while hashValue is taken in table
hashValue += 1
where I only add the increment value when there's a collision, or would I add the increment value to the hash right from the start when i = 1 , so something like this:
hashValue = hash(x) + 1
while hashValue is taken in table
hashValue += 1

As long as you do it consistently, it does not matter. The effect of adding one (or any other constant, for that matter) to hash code has no effect on the composition of the table, except that the bucket numbering would be "shifted off" by a constant "offset". Since bucket numbering is a private matter of your has table, nobody should care.
In essence, a linear probing hash function is
H(x, i) = (H(x) + i) % N
where N is the number of buckets. It is conventional to start i at zero, which means incrementing the value of hash only when you get a collision.

It does not hurt (it simply shifts the probe sequence by one element), but it doesn't have any benefits either, and conceptually it's a bit silly. That's why the canonical form starts at hash(x) and increments only when encountering collisions.

Related

Hashing function to distribute over n values (with a twist)

I was wondering if there are any hashing functions to distribute input over n values. The distribution should of course be fairly uniform. But there is a twist. with small changes of n, few elements should get a new hash. Optimally it should split all k uniformly over n values and if n increases to n+1 only k/n-k/(n+1) values would have to move to uniformly distribute in the new hash. Obviously having a hash which simply creates uniform values and then mod it would work, but that would move a lot of hashes to fill the new node. The goal here is that as few values as possible falls into a new bucket.
Suppose 2^{n-1} < N <= 2^n. Then there is a standard trick for turning a hash function H that produces (at least) n bits into one that produces a number from 0 to N.
Compute H(v).
Keep just the first n bits.
If that's smaller than N, stop and output it. Otherwise, start from the top with H(v) instead of v.
Some properties of this technique:
You might worry that you have to repeat the loop many times in some cases. But actually the expected number of loops is at most 2.
If you bump up N and n doesn't have to change, very few things get a new hash: only those ones that had exactly N somewhere in their chain of hashes. (Of course, identifying which elements have this property is kind of hard -- in general it may require rehashing every element!)
If you bump up N and n does have to change, about half of the elements have to be rebucketed. But this happens more and more rarely the bigger N is -- it is an amortized O(1) cost on each bump.
Edit to add an additional comment about the "have to rehash everything" requirement: One might consider modifying step 3 above to "start from the top with the first n bits of H(v)" instead. This reduces the problem with identifying which elements need to be rehashed -- since they'll be in the bucket for the hash of N -- though I'm not confident the resulting hash will have quite as good collision avoidance properties. It certainly makes the process a bit more fragile -- one would want to prove something special about the choice of H (that the bottom few bits aren't "critical" to its collision avoidance properties somehow).
Here is a simple example implementation in Python, together with a short main that shows that most strings do not move when bumping normally, and about half of strings get moved when bumping across a 2^n boundary. Forgive me for any idiosyncracies of my code -- Python is a foreign language.
import math
def ilog2(m): return int(math.ceil(math.log(m,2)))
def hash_into(obj, N):
cur_hash = hash(obj)
mask = pow(2, ilog2(N)) - 1
while (cur_hash & mask) >= N:
# seems Python uses the identity for its hash on integers, which
# doesn't iterate well; let's use literally any other hash at all
cur_hash = hash(str(cur_hash))
return cur_hash & mask
def same_hash(obj, N, N2):
return hash_into(obj, N) == hash_into(obj, N2)
def bump_stat(objs, N):
return len([obj for obj in objs if same_hash(obj, N, N+1)])
alphabet = [chr(x) for x in range(ord('a'),ord('z')+1)]
ascending = alphabet + [c1 + c2 for c1 in alphabet for c2 in alphabet]
def main():
print len(ascending)
print bump_stat(ascending, 10)
print float(bump_stat(ascending, 16))/len(ascending)
# prints:
# 702
# 639
# 0.555555555556
Well, when you add a node, you will want it to fill up, so you will actually want k/(n+1) elements to move from their old nodes to the new one.
That is easily accomplished:
Just generate a hash value for each key as you normally would. Then, to assign key k to a node in [0,N):
Let H(k) be the hash of k.
int hash = H(k);
for (int n=N-1;n>0;--n) {
if ((mix(hash,n) % (i+1))==0) {
break;
}
}
//put it in node n
So, when you add node node 1, it steals half the items from node 0.
When you add node 2, it steals 1/3 of the items from the previous 2 nodes.
And so on...
EDIT: added the mix() function, to mix up the hash differently for every n -- otherwise you get non-uniformities when n is not prime.

Best way to resize a hash table

I am creating my own implementation to hash a table for education purposes.
What would be the best way to increase a hash table size?
I currently double the hash array size.
The hashing function I'm using is: key mod arraysize.
The problem with this is that if the keys are: 2, 4, 6, 8, then the array size will just keep increasing.
What is the best way of overcoming this issue? Is there a better way of increasing a hash table size? Would changing my hashing function help?
NOTE: My keys are all integers!
Hash tables often avoid this problem by making sure that the hash table size is a prime number. When you resize the table, double the size and then round up to the first prime number larger than that. Doing this avoids the clustering problems similar to what you describe.
Now, it does take a little bit of time to find the next prime number, but not a whole lot. When compared to the time involved in rehashing the hash table's contents, finding the next prime number takes almost no time at all. See Optimizing the wrong thing for a description.
OpenJDK uses powers of 2 for the capacity of a HashMap, which will lead to a lot of collisions if the keys are all multiples of a power of two. It prevents this by applying another hash function on top of the key's hashCode:
/**
* Applies a supplemental hash function to a given hashCode, which defends against poor quality hash functions.
* This is critical because HashMap uses power-of-two length hash tables, that otherwise encounter collisions
* for hashCodes that do not differ in lower bits. Note: Null keys always map to hash 0, thus index 0.
*/
static int hash(int h) {
// This function ensures that hashCodes that differ only by
// constant multiples at each bit position have a bounded
// number of collisions (approximately 8 at default load factor).
h ^= (h >>> 20) ^ (h >>> 12);
return h ^ (h >>> 7) ^ (h >>> 4);
}
If you try to implement your own hash table, here is some tips:
Chose a prime number for table size if you use the mod for the hash function.
Use Quadratic Probing to find the final position for collisions, h(x,i) = (Hash(x) + i*i) mod TableSize for the ith collision.
Double the size to the nearest prime number when hash table get half full which you will merely never do if your collision function is ok for your input.
Here is an elegant implement for Quadratic Probing:
//find a position to set the key
int findPos( int key, YourHashTable h )
{
int curPos;
int collisionNum = 0;
curPos = key % h.TableSize;
//while find a collision
while( h[curPos] != null && h[curPos] != key )
{
//f(i) = i*i = f(i-1) + 2*i -1
curPos += 2 * ++collisionNum - 1;
//do the mod only use - for efficiency
if( curPos >= h.TableSize )
curPos -= h.TableSize;
}
return curPos;
}
Hashing and hash functions are a complex topic, fortunately with lots of online resources.
It is not clear how you determine the array size in the first place.
In the Java HashMap implementation, the size of the underlying array is always a power of 2. This has the slight advantage that you don't need to compute the modulo, but can compute the array index as index = hashValue & (array.length-1) (which is equivalent to a modulo operation when array.length is a power of 2).
Additionally, the HashMap uses some "magic function" to reduce the number of hash collisions for the case that several hash values only differ by a constant factor, as in your example.
The actual size of the array is then determined by a "load factor". (You can even specify this as a constructor parameter of HashMap). When the number of array entries that are occupied exceeds loadFactor * array.length, then the length of the array will be doubled.
This load factor allows a certain trade-off: When the load factor is high (0.9 or so), then it will be more likely that hash collisions will occur. When it is low (0.3 or so), then hash collisions will be more unlikely, but there will be a lot of "wasted" space, because only few entries of the array will actually be occupied at any point in time.

Limit for quadratic probing a hash table

I was doing a program to compare the average and maximum accesses required for linear probing, quadratic probing and separate chaining in hash table.
I had done the element insertion part for 3 cases. While finding the element from hash table, I need to have a limit for ending the searching.
In the case of separate chaining, I can stop when next pointer is null.
For linear probing, I can stop when probed the whole table (ie size of table).
What should I use as limit in quadratic probing? Will table size do?
My quadratic probing function is like this
newKey = (key + i*i) % size;
where i varies from 0 to infinity. Please help me..
For such problems analyse the growth of i in two pieces:
First Interval : i goes from 0 to size-1
In this case, I haven't got the solution for now. Hopefully will update.
Second Interval : i goes from size to infinity
In this case i can be expressed as i = size + k, then
newKey = (key + i*i) % size
= (key + (size+k)*(size+k)) % size
= (key + size*size + 2*k*size + k*k) % size
= (key + k*k) % size
So it's sure that we will start probing previously probed cells, after i reaches to size. So you only need to consider the situation where i goes from 0 to size-1. Because rest is only the same story again and again.
What the story tells up to now: A simple analysis showed me that I need to probe at most size times because beyond size times I started probing the same cells.
See this link. If your table size is power of 2 and you are using a reprobe function f(i)=i*(i+1)/2, you are guaranteed to traverse the entire table. If your table size is a prime number, you are guaranteed to traverse at least half of the table. In general, you can check if at some point you are back to the original point. If that happens, you need to rehash.
After doing some simulations in Excel, it appears that iterating up to i = size / 2 would be all that needs to be tested. This is when using the standard method of adding sequential perfect squares to the single-hashed position.
The answer that you can quit if a position is revisited would not allow testing of all possible positions that could be reached by the quadratic-probe method, at least not for all array sizes. (I tested array size 21 and found that i=5 revisits the same position as i=2, but i=6 yields a previously not-calculated position.)

How to design a data structure that allows one to search, insert and delete an integer X in O(1) time

Here is an exercise (3-15) in the book "Algorithm Design Manual".
Design a data structure that allows one to search, insert, and delete an integer X in O(1) time (i.e. , constant time, independent of the total number of integers stored). Assume that 1 ≤ X ≤ n and that there are m + n units of space available, where m is the maximum number of integers that can be in the table at any one time. (Hint: use two arrays A[1..n] and B[1..m].) You are not allowed to initialize either A or B, as that would take O(m) or O(n) operations. This means the arrays are full of random garbage to begin with, so you must be very careful.
I am not really seeking for the answer, because I don't even understand what this exercise asks.
From the first sentence:
Design a data structure that allows one to search, insert, and delete an integer X in O(1) time
I can easily design a data structure like that. For example:
Because 1 <= X <= n, so I just have an bit vector of n slots, and let X be the index of the array, when insert, e.g., 5, then a[5] = 1; when delete, e.g., 5, then a[5] = 0; when search, e.g.,5, then I can simply return a[5], right?
I know this exercise is harder than I imagine, but what's the key point of this question?
You are basically implementing a multiset with bounded size, both in number of elements (#elements <= m), and valid range for elements (1 <= elementValue <= n).
Search: myCollection.search(x) --> return True if x inside, else False
Insert: myCollection.insert(x) --> add exactly one x to collection
Delete: myCollection.delete(x) --> remove exactly one x from collection
Consider what happens if you try to store 5 twice, e.g.
myCollection.insert(5)
myCollection.insert(5)
That is why you cannot use a bit vector. But it says "units" of space, so the elaboration of your method would be to keep a tally of each element. For example you might have [_,_,_,_,1,_,...] then [_,_,_,_,2,_,...].
Why doesn't this work however? It seems to work just fine for example if you insert 5 then delete 5... but what happens if you do .search(5) on an uninitialized array? You are specifically told you cannot initialize it, so you have no way to tell if the value you'll find in that piece of memory e.g. 24753 actually means "there are 24753 instances of 5" or if it's garbage.
NOTE: You must allow yourself O(1) initialization space, or the problem cannot be solved. (Otherwise a .search() would not be able to distinguish the random garbage in your memory from actual data, because you could always come up with random garbage which looked like actual data.) For example you might consider having a boolean which means "I have begun using my memory" which you initialize to False, and set to True the moment you start writing to your m words of memory.
If you'd like a full solution, you can hover over the grey block to reveal the one I came up with. It's only a few lines of code, but the proofs are a bit longer:
SPOILER: FULL SOLUTION
Setup:
Use N words as a dispatch table: locationOfCounts[i] is an array of size N, with values in the range location=[0,M]. This is the location where the count of i would be stored, but we can only trust this value if we can prove it is not garbage. >!
(sidenote: This is equivalent to an array of pointers, but an array of pointers exposes you being able to look up garbage, so you'd have to code that implementation with pointer-range checks.)
To find out how many is there are in the collection, you can look up the value counts[loc] from above. We use M words as the counts themselves: counts is an array of size N, with two values per element. The first value is the number this represents, and the second value is the count of that number (in the range [1,m]). For example a value of (5,2) would mean that there are 2 instances of the number 5 stored in the collection.
(M words is enough space for all the counts. Proof: We know there can never be more than M elements, therefore the worst-case is we have M counts of value=1. QED)
(We also choose to only keep track of counts >= 1, otherwise we would not have enough memory.)
Use a number called numberOfCountsStored that IS initialized to 0 but is updated whenever the number of item types changes. For example, this number would be 0 for {}, 1 for {5:[1 times]}, 1 for {5:[2 times]}, and 2 for {5:[2 times],6:[4 times]}.
                          1  2  3  4  5  6  7  8...
locationOfCounts[<N]: [☠, ☠, ☠, ☠, ☠, 0, 1, ☠, ...]
counts[<M]:           [(5,⨯2), (6,⨯4), ☠, ☠, ☠, ☠, ☠, ☠, ☠, ☠..., ☠]
numberOfCountsStored:          2
Below we flush out the details of each operation and prove why it's correct:
Algorithm:
There are two main ideas: 1) we can never allow ourselves to read memory without verifying that is not garbage first, or if we do we must be able to prove that it was garbage, 2) we need to be able to prove in O(1) time that the piece of counter memory has been initialized, with only O(1) space. To go about this, the O(1) space we use is numberOfItemsStored. Each time we do an operation, we will go back to this number to prove that everything was correct (e.g. see ★ below). The representation invariant is that we will always store counts in counts going from left-to-right, so numberOfItemsStored will always be the maximum index of the array that is valid.
.search(e) -- Check locationsOfCounts[e]. We assume for now that the value is properly initialized and can be trusted. We proceed to check counts[loc], but first we check if counts[loc] has been initialized: it's initialized if 0<=loc<numberOfCountsStored (if not, the data is nonsensical so we return False). After checking that, we look up counts[loc] which gives us a number,count pair. If number!=e, we got here by following randomized garbage (nonsensical), so we return False (again as above)... but if indeed number==e, this proves that the count is correct (★proof: numberOfCountsStored is a witness that this particular counts[loc] is valid, and counts[loc].number is a witness that locationOfCounts[number] is valid, and thus our original lookup was not garbage.), so we would return True.
.insert(e) -- Perform the steps in .search(e). If it already exists, we only need to increment the count by 1. However if it doesn't exist, we must tack on a new entry to the right of the counts subarray. First we increment numberOfCountsStored to reflect the fact that this new count is valid: loc = numberOfCountsStored++. Then we tack on the new entry: counts[loc] = (e,⨯1). Finally we add a reference back to it in our dispatch table so we can look it up quickly locationOfCounts[e] = loc.
.delete(e) -- Perform the steps in .search(e). If it doesn't exist, throw an error. If the count is >= 2, all we need to do is decrement the count by 1. Otherwise the count is 1, and the trick here to ensure the whole numberOfCountsStored-counts[...] invariant (i.e. everything remains stored on the left part of counts) is to perform swaps. If deletion would get rid of the last element, we will have lost a counts pair, leaving a hole in our array: [countPair0, countPair1, _hole_, countPair2, countPair{numberOfItemsStored-1}, ☠, ☠, ☠..., ☠]. We swap this hole with the last countPair, decrement numberOfCountsStored to invalidate the hole, and update locationOfCounts[the_count_record_we_swapped.number] so it now points to the new location of the count record.
Here is an idea:
treat the array B[1..m] as a stack, and make a pointer p to point to the top of the stack (let p = 0 to indicate that no elements have been inserted into the data structure). Now, to insert an integer X, use the following procedure:
p++;
A[X] = p;
B[p] = X;
Searching should be pretty easy to see here (let X' be the integer you want to search for, then just check that 1 <= A[X'] <= p, and that B[A[X']] == X'). Deleting is trickier, but still constant time. The idea is to search for the element to confirm that it is there, then move something into its spot in B (a good choice is B[p]). Then update A to reflect the pointer value of the replacement element and pop off the top of the stack (e.g. set B[p] = -1 and decrement p).
It's easier to understand the question once you know the answer: an integer is in the set if A[X]<total_integers_stored && B[A[X]]==X.
The question is really asking if you can figure out how to create a data structure that is usable with a minimum of initialization.
I first saw the idea in Cameron's answer in Jon Bentley Programming Pearls.
The idea is pretty simple but it's not straightforward to see why the initial random values that may be on the uninitialized arrays does not matter. This link explains pretty well the insertion and search operations. Deletion is left as an exercise, but is answered by one of the commenters:
remove-member(i):
if not is-member(i): return
j = dense[n-1];
dense[sparse[i]] = j;
sparse[j] = sparse[i];
n = n - 1

Is there an efficient data structure for row and column swapping?

I have a matrix of numbers and I'd like to be able to:
Swap rows
Swap columns
If I were to use an array of pointers to rows, then I can easily switch between rows in O(1) but swapping a column is O(N) where N is the amount of rows.
I have a distinct feeling there isn't a win-win data structure that gives O(1) for both operations, though I'm not sure how to prove it. Or am I wrong?
Without having thought this entirely through:
I think your idea with the pointers to rows is the right start. Then, to be able to "swap" the column I'd just have another array with the size of number of columns and store in each field the index of the current physical position of the column.
m =
[0] -> 1 2 3
[1] -> 4 5 6
[2] -> 7 8 9
c[] {0,1,2}
Now to exchange column 1 and 2, you would just change c to {0,2,1}
When you then want to read row 1 you'd do
for (i=0; i < colcount; i++) {
print m[1][c[i]];
}
Just a random though here (no experience of how well this really works, and it's a late night without coffee):
What I'm thinking is for the internals of the matrix to be a hashtable as opposed to an array.
Every cell within the array has three pieces of information:
The row in which the cell resides
The column in which the cell resides
The value of the cell
In my mind, this is readily represented by the tuple ((i, j), v), where (i, j) denotes the position of the cell (i-th row, j-th column), and v
The would be a somewhat normal representation of a matrix. But let's astract the ideas here. Rather than i denoting the row as a position (i.e. 0 before 1 before 2 before 3 etc.), let's just consider i to be some sort of canonical identifier for it's corresponding row. Let's do the same for j. (While in the most general case, i and j could then be unrestricted, let's assume a simple case where they will remain within the ranges [0..M] and [0..N] for an M x N matrix, but don't denote the actual coordinates of a cell).
Now, we need a way to keep track of the identifier for a row, and the current index associated with the row. This clearly requires a key/value data structure, but since the number of indices is fixed (matrices don't usually grow/shrink), and only deals with integral indices, we can implement this as a fixed, one-dimensional array. For a matrix of M rows, we can have (in C):
int RowMap[M];
For the m-th row, RowMap[m] gives the identifier of the row in the current matrix.
We'll use the same thing for columns:
int ColumnMap[N];
where ColumnMap[n] is the identifier of the n-th column.
Now to get back to the hashtable I mentioned at the beginning:
Since we have complete information (the size of the matrix), we should be able to generate a perfect hashing function (without collision). Here's one possibility (for modestly-sized arrays):
int Hash(int row, int column)
{
return row * N + column;
}
If this is the hash function for the hashtable, we should get zero collisions for most sizes of arrays. This allows us to read/write data from the hashtable in O(1) time.
The cool part is interfacing the index of each row/column with the identifiers in the hashtable:
// row and column are given in the usual way, in the range [0..M] and [0..N]
// These parameters are really just used as handles to the internal row and
// column indices
int MatrixLookup(int row, int column)
{
// Get the canonical identifiers of the row and column, and hash them.
int canonicalRow = RowMap[row];
int canonicalColumn = ColumnMap[column];
int hashCode = Hash(canonicalRow, canonicalColumn);
return HashTableLookup(hashCode);
}
Now, since the interface to the matrix only uses these handles, and not the internal identifiers, a swap operation of either rows or columns corresponds to a simple change in the RowMap or ColumnMap array:
// This function simply swaps the values at
// RowMap[row1] and RowMap[row2]
void MatrixSwapRow(int row1, int row2)
{
int canonicalRow1 = RowMap[row1];
int canonicalRow2 = RowMap[row2];
RowMap[row1] = canonicalRow2
RowMap[row2] = canonicalRow1;
}
// This function simply swaps the values at
// ColumnMap[row1] and ColumnMap[row2]
void MatrixSwapColumn(int column1, int column2)
{
int canonicalColumn1 = ColumnMap[column1];
int canonicalColumn2 = ColumnMap[column2];
ColumnMap[row1] = canonicalColumn2
ColumnMap[row2] = canonicalColumn1;
}
So that should be it - a matrix with O(1) access and mutation, as well as O(1) row swapping and O(1) column swapping. Of course, even an O(1) hash access will be slower than the O(1) of array-based access, and more memory will be used, but at least there is equality between rows/columns.
I tried to be as agnostic as possible when it comes to exactly how you implement your matrix, so I wrote some C. If you'd prefer another language, I can change it (it would be best if you understood), but I think it's pretty self descriptive, though I can't ensure it's correctedness as far as C goes, since I'm actually a C++ guys trying to act like a C guy right now (and did I mention I don't have coffee?). Personally, writing in a full OO language would do it the entrie design more justice, and also give the code some beauty, but like I said, this was a quickly whipped up implementation.

Resources