Limit for quadratic probing a hash table - algorithm

I am writing a program to compare the average and maximum accesses required for linear probing, quadratic probing and separate chaining in a hash table.
I have done the element insertion part for the 3 cases. While searching for an element in the hash table, I need a limit for ending the search.
In the case of separate chaining, I can stop when the next pointer is null.
For linear probing, I can stop when I have probed the whole table (i.e. the size of the table).
What should I use as the limit in quadratic probing? Will the table size do?
My quadratic probing function is like this:
newKey = (key + i*i) % size;
where i varies from 0 to infinity. Please help me.

For such problems analyse the growth of i in two pieces:
First Interval : i goes from 0 to size-1
In this case, I don't have a solution for now. Hopefully I will update.
Second Interval : i goes from size to infinity
In this case i can be expressed as i = size + k, then
newKey = (key + i*i) % size
= (key + (size+k)*(size+k)) % size
= (key + size*size + 2*k*size + k*k) % size
= (key + k*k) % size
So it is certain that we will start probing previously probed cells once i reaches size. Therefore you only need to consider the situation where i goes from 0 to size-1, because the rest is just the same story over and over again.
What the story tells us so far: a simple analysis shows that you need to probe at most size times, because beyond size probes you start hitting the same cells again.
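A quick sketch that demonstrates this repetition (the table size and hash value below are arbitrary illustrative choices):

size, key = 11, 7   # arbitrary table size and hash value for illustration

first_pass  = [(key + i * i) % size for i in range(size)]
second_pass = [(key + i * i) % size for i in range(size, 2 * size)]

print(first_pass == second_pass)   # True: once i reaches size, the same cells are probed again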

See this link. If your table size is a power of 2 and you are using the reprobe function f(i) = i*(i+1)/2, you are guaranteed to traverse the entire table. If your table size is a prime number, you are guaranteed to traverse at least half of the table. In general, you can check whether at some point you are back at the original position; if that happens, you need to rehash.
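A small sketch to check that claim for a power-of-two table with the triangular-number reprobe f(i) = i*(i+1)/2 (the table size below is an arbitrary power of 2):

def triangular_probe_positions(h, size):
    # slots visited by (h + i*(i+1)/2) % size for i = 0 .. size-1
    return {(h + i * (i + 1) // 2) % size for i in range(size)}

size = 16                                                  # any power of 2
print(len(triangular_probe_positions(0, size)) == size)   # True: every slot gets visited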

After doing some simulations in Excel, it appears that iterating up to i = size / 2 would be all that needs to be tested. This is when using the standard method of adding sequential perfect squares to the single-hashed position.
The answer that you can quit if a position is revisited would not allow testing of all possible positions that could be reached by the quadratic-probe method, at least not for all array sizes. (I tested array size 21 and found that i=5 revisits the same position as i=2, but i=6 yields a previously not-calculated position.)
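One way to check the size/2 claim numerically is to compare the set of cells reachable with i up to size/2 against the set reachable with i up to size-1 (beyond which, as the other answer shows, nothing new is visited). Since the hash value only shifts every position by the same constant, testing key = 0 is enough:

def reachable(key, size, limit):
    # cells visited by newKey = (key + i*i) % size for i = 0 .. limit
    return {(key + i * i) % size for i in range(limit + 1)}

for size in range(2, 500):
    assert reachable(0, size, size // 2) == reachable(0, size, size - 1)
print("i = 0 .. size/2 already reaches every cell that i = 0 .. size-1 can reach")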

Related

A special sample method in Map-Reduce implementation

I have a table with roughly 4*10^8 records, and I want to draw a sample of exactly 4*10^6 records from it.
But my way of getting the sample is somewhat special:
I select 1 record from the 4*10^8 records at random (every record has the same probability of being selected).
I repeat step 1 4*10^6 times (it does not matter if a record is selected multiple times).
I thought up a method to solve this:
Generate a table A(num int), where every record of table A holds a single random integer from 1 to n (n is the size of my original table, roughly 4*10^8 as mentioned above).
Load table A as a resource file into every mapper, and if the ordinal number of the record currently being processed is in table A, output this record, otherwise discard it.
I think my method is not so good, because if I want to sample more records from the original table, table A will become very large and can no longer be loaded as a resource file.
So, could anyone please give an elegant algorithm?
I'm not sure what "elegant" means, but perhaps you're interested in something analogous to reservoir sampling. Let k be the size of the sample and initialize a k-element array with nulls. The elements from which we are sampling arrive one by one. When the jth (counting from 1) element arrives, we iterate through the array and, for each cell, replace its contents by the current element independently with probability 1/j.
Naively, the running time is pretty bad -- to sample k elements from n with replacement costs O(k n). The number of writes into the array, however, is O(k log n) in expectation, because later elements in the stream rarely result in writes. Here's an efficient method based on the exponential distribution (warning: lightly tested Python ahead). The running time is O(n + k log n).
import math
import random

def sample_from(population, k):
    # Sample k elements with replacement from a stream; in expectation only
    # O(k log n) writes into `sample` are performed.
    for i, x in enumerate(population):
        if i == 0:
            sample = [x] * k
        else:
            # Each of the k cells should be overwritten by x independently with
            # probability 1/(i+1); the exponential race below generates the
            # overwrites without visiting every cell.
            t = float(k) * math.log(1.0 - 1.0 / float(i + 1))
            while True:
                t -= math.log(1.0 - random.random())
                if t >= 0.0:
                    break
                sample[random.randrange(k)] = x
    return sample
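For example, drawing a sample of 4 with replacement from 10 items (the output varies from run to run, since the sampling is random):

sample = sample_from(range(10), 4)
print(len(sample))   # 4
print(sample)        # e.g. [6, 1, 6, 9] -- repeats are possible, since this samples with replacement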

When building a hash table using linear probing for collision resolution, is the extra term always added to the hash or only when a collision occurs?

I'm building a table, where an attempt to insert a new key into the table when there is a collision follows the sequence { hash(x) + i, where i = 1,2,3, ... }. If I'm building a hash table using linear probing would my Insert() algorithm do something like this:
hashValue = hash(x)
while hashValue is taken in table
    hashValue += 1
where I only add the increment value when there's a collision, or would I add the increment value to the hash right from the start when i = 1 , so something like this:
hashValue = hash(x) + 1
while hashValue is taken in table
    hashValue += 1
As long as you do it consistently, it does not matter. Adding one (or any other constant, for that matter) to the hash code has no effect on the composition of the table, except that the bucket numbering is "shifted" by a constant offset. Since bucket numbering is a private matter of your hash table, nobody should care.
In essence, a linear probing hash function is
H(x, i) = (H(x) + i) % N
where N is the number of buckets. It is conventional to start i at zero, which means incrementing the value of hash only when you get a collision.
It does not hurt (it simply shifts the probe sequence by one element), but it doesn't have any benefits either, and conceptually it's a bit silly. That's why the canonical form starts at hash(x) and increments only when encountering collisions.
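As a toy sketch of that canonical form (the fixed-size list and the hash_fn default are illustrative choices, not part of any particular library): the hash is used as-is for the home slot, and the offset grows only after a collision.

EMPTY = None

def insert(table, key, hash_fn=hash):
    n = len(table)
    index = hash_fn(key) % n              # i = 0: try the home slot first
    for _ in range(n):                    # probe at most N slots
        if table[index] is EMPTY or table[index] == key:
            table[index] = key
            return index
        index = (index + 1) % n           # i += 1 only after a collision
    raise RuntimeError("table is full")

table = [EMPTY] * 8
for k in (3, 11, 19):                     # all three hash to bucket 3 (k % 8 == 3)
    print(insert(table, k))               # prints 3, then 4, then 5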

How to read all 1's in an array of 1's and 0's spread all over the array randomly

I have an array with 1's and 0's spread over it randomly.
int arr[N] = {1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,0,0,0,1....................N}
Now I want to retrieve all the 1's in the array as fast as possible, but the condition is that I must not lose the exact position (based on index) in the array, so sorting is not an option.
So the only option left is linear search, i.e. O(n). Is there anything better than this?
The main problem with a linear scan is that I need to run the scan X times. So I feel I need some other data structure which maintains this list once the first linear scan happens, so that I need not run the linear scan again and again.
Let me be clear about the final expectations:
I just need to find the number of 1's within a certain range of the array, for example the number of 1's within the range 40-100. The range can be arbitrary, and I need the count of 1's within it. I can't just precompute one total sum; I would have to iterate over the array over and over again because of the different range requirements.
I'm surprised you considered sorting as a faster alternative to linear search.
If you don't know where the ones occur, then there is no better way than linear searching. Perhaps if you used bits or char datatypes you could do some optimizations, but it depends on how you want to use this.
The best optimization you can make here is to avoid the cost of branch misprediction. Because each value is zero or one, you can use it directly to advance the index of the array that stores the one-indices.
Simple approach:
int end = 0;
int indices[N];
for( int i = 0; i < N; i++ )
{
    if( arr[i] ) indices[end++] = i; // Slow due to branch prediction
}
Without branching:
int end = 0;
int indices[N];
for( int i = 0; i < N; i++ )
{
    indices[end] = i;
    end += arr[i];
}
[edit] I tested the above, and found the version without branching was almost 3 times faster (4.36s versus 11.88s for 20 repeats on a randomly populated 100-million element array).
Coming back here to post results, I see you have updated your requirements. What you want is really easy with a dynamic programming approach...
All you do is create a new array that is one element larger, which stores the number of ones from the beginning of the array up to (but not including) the current index.
arr : 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 1
count : 0 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 4 5 6 6 6 6 7
(I've offset arr above so it lines up better)
Now you can compute the number of 1s in any range in O(1) time. To compute the number of 1s between index A and B, you just do:
int num = count[B+1] - count[A];
Obviously you can still use the non-branch-prediction version to generate the counts initially. All this should give you a pretty good speedup over the naive approach of summing for every query:
int *count = new int[N+1];
int total = 0;
count[0] = 0;
for( int i = 0; i < N; i++ )
{
    total += arr[i];
    count[i+1] = total;
}
// to compute the ranged sum:
int range_sum( int *count, int a, int b )
{
    if( b < a ) return range_sum( count, b, a );
    return count[b+1] - count[a];
}
Well, a one-time linear scan is fine. Since you are looking for multiple queries across ranges of the array, I think each of those can be done in constant time. Here you go:
Scan the array and create a map where the key is the array index (1, 2, 3, 4, 5, 6, ...). The value stored for each key is a tuple <IsOne, cumulativeSum>, where IsOne says whether there is a one at that position and cumulativeSum is the running count of 1's as and when you encounter them.
Array = 1 1 0 0 1 0 1 1 1 0 1 0
Tuple: (1,1) (1,2) (0,2) (0,2) (1,3) (0,3) (1,4) (1,5) (1,6) (0,6) (1,7) (0,7)
CASE 1: When lower bound of cumulativeSum has a 0. Number of 1's [6,11] =
cumulativeSum at 11th position - cumulativeSum at 6th position = 7 - 3 = 4
CASE 2: When lower bound of cumulativeSum has a 1. Number of 1's [2,11] =
cumulativeSum at 11th position - cumulativeSum at 2nd position + 1 = 7-2+1 = 6
Step 1 is O(n)
Step 2 is O(1)
Total complexity is linear, no doubt, but for your task, where you have to work with ranges several times, the above algorithm seems better if you have ample memory :)
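Here is a sketch of that tuple idea in Python (positions are 1-indexed to match the example above; note the +1 adjustment when the lower bound itself holds a 1, as in CASE 2):

def build_tuples(arr):
    # tuples[i] = (IsOne, cumulativeSum) for the 1-indexed position i
    tuples, running = [None], 0           # index 0 unused, so positions are 1-based
    for bit in arr:
        running += bit
        tuples.append((bit, running))
    return tuples

def count_ones(tuples, lo, hi):
    is_one, low_sum = tuples[lo]
    return tuples[hi][1] - low_sum + is_one   # CASE 2 adds the 1 back when the lower bound is a 1

arr = [1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0]
t = build_tuples(arr)
print(count_ones(t, 6, 11))   # 4
print(count_ones(t, 2, 11))   # 6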
Does it have to be a simple linear array data structure? Or can you create your own data structure which happens to have the desired properties, for which you're able to provide the required API, but whose implementation details can be hidden (encapsulated)?
If you can implement your own and if there is some guaranteed sparsity (of either 1s or 0s) then you might be able to offer better than linear performance. I see that you want to preserve (or be able to regenerate) the exact stream, so you'll have to store an array or bitmap or run-length encoding for that. (RLE will be useless if the stream is actually random rather than arbitrary, but could be quite useful if there is significant sparsity or there are patterns with long strings of one value or the other. For example, a black-and-white raster of a bitmapped image is often a good candidate for RLE.)
Let's say that you're guaranteed that the stream will be sparse: that no more than 10%, for example, of the bits will be 1s (or, conversely, that more than 90% will be). If that's the case then you might model your solution on RLE and maintain a count of all 1s (simply incremented as you set bits and decremented as you clear them). If there might be a need to quickly get the number of set bits for arbitrary ranges of these elements, then instead of a single counter you can have a conveniently sized array of counters for partitions of the stream. (Conveniently sized, in this case, means something which fits easily within memory, your caches, or your register sets, but which offers a reasonable trade-off between summing whole partitions and doing linear scans.) The result for any arbitrary range is the sum of all the partitions fully enclosed by the range, plus the results of linear scans for any fragments that are not aligned on your partition boundaries.
For a very, very, large stream you could even have a multi-tier "index" of partition sums --- traversing from the largest (most coarse) granularity down toward the "fragments" to either end (using the next layer of partition sums) and finishing with the linear search of only the small fragments.
Obviously such a structure represents trade offs between the complexity of building and maintaining the structure (inserting requires additional operations and, for an RLE, might be very expensive for anything other than appending/prepending) vs the expense of performing arbitrarily long linear search/increment scans.
If:
the purpose is to be able to find the number of 1s in the array at any time,
given that relatively few of the values in the array might change between one moment when you want to know the number and another moment, and
if you have to find the number of 1s in a changing array of n values m times,
... you can certainly do better than examining every cell in the array m times by using a caching strategy.
The first time you need the number of 1s, you certainly have to examine every cell, as others have pointed out. However, if you then store the number of 1s in a variable (say sum) and track changes to the array (by, for instance, requiring that all array updates occur through a specific update() function), every time a 0 is replaced in the array with a 1, the update() function can add 1 to sum and every time a 1 is replaced in the array with a 0, the update() function can subtract 1 from sum.
Thus, sum is always up-to-date after the first time that the number of 1s in the array is counted and there is no need for further counting.
(EDIT to take the updated question into account)
If the need is to return the number of 1s in a given range of the array, that can be done with a slightly more sophisticated caching strategy than the one I've just described.
You can keep a count of the 1s in each subset of the array and update the relevant subset count whenever a 0 is changed to a 1 or vice versa within that subset. Finding the total number of 1s in a given range within the array would then be a matter of adding the number of 1s in each subset that is fully contained within the range and then counting the number of 1s that are in the range but not in the subsets that have already been counted.
Depending on circumstances, it might be worthwhile to have a hierarchical arrangement in which (say) the number of 1s in the whole array is at the top of the hierarchy, the number of 1s in each 1/q th of the array is in the second level of the hierarchy, the number of 1s in each 1/(q^2) th of the array is in the third level of the hierarchy, etc. e.g. for q = 4, you would have the total number of 1s at the top, the number of 1s in each quarter of the array at the second level, the number of 1s in each sixteenth of the array at the third level, etc.
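A minimal single-level sketch of that caching idea (the class name and block size are illustrative choices; update() keeps the per-block counts in sync with the array, and count() adds whole-block counts plus linear scans of the unaligned fragments):

class RangeOnesCounter:
    def __init__(self, arr, block=64):
        self.arr = list(arr)
        self.block = block
        self.block_counts = [0] * ((len(self.arr) + block - 1) // block)
        for i, bit in enumerate(self.arr):
            self.block_counts[i // block] += bit

    def update(self, i, value):
        # keep the cached per-block count consistent with the array
        self.block_counts[i // self.block] += value - self.arr[i]
        self.arr[i] = value

    def count(self, lo, hi):
        # number of 1s in arr[lo..hi] inclusive
        total, i = 0, lo
        while i <= hi:
            if i % self.block == 0 and i + self.block - 1 <= hi:
                total += self.block_counts[i // self.block]   # whole block inside the range
                i += self.block
            else:
                total += self.arr[i]                          # unaligned fragment at either end
                i += 1
        return total

c = RangeOnesCounter([1, 0, 1, 1, 0, 0, 1, 0] * 50, block=16)
print(c.count(40, 100))   # count of 1s in positions 40..100 inclusive
c.update(41, 1)
print(c.count(40, 100))   # stays correct after the update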
Are you using C (or a derived language)? If so, can you control the encoding of your array? You could, for example, use a bitmap for the counting. The nice thing about a bitmap is that you can use a lookup table to sum the counts, though if your subrange ends aren't divisible by 8 you'll have to deal with the partial bytes at the ends specially, but the speedup will be significant.
If that's not the case, can you at least encode the values as single bytes? In that case, you may be able to exploit sparseness if it exists (more specifically, the hope that there are often multi-index swaths of zeros).
So for:
u8 input[N] = {1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,0,0,0,1....................N};
You can write something like (untested):
#include <stdint.h>

typedef uint8_t  u8;
typedef uint64_t u64;
typedef unsigned int uint;

uint countBytesBy1FromTo(u8 *input, uint start, uint stop)
{   // function for counting one byte at a time; use it for short ranges,
    // and the function below for longer ranges
    // assume the values are just ones and zeros, otherwise we have to test/branch
    uint sum = 0;
    u8 *end = input + stop;
    for (u8 *each = input + start; each < end; each++)
        sum += *each;
    return sum;
}

uint countBytesBy8FromTo(u8 *input, uint start, uint stop)
{   // assumes input + start is suitably aligned for u64 access
    u64 *chunks = (u64*)(input + start);
    u64 *end = chunks + ((stop - start) >> 3);
    // count the trailing bytes that don't fill a whole 8-byte chunk
    uint sum = countBytesBy1FromTo((u8*)end, 0, (uint)((input + stop) - (u8*)end));
    for (; chunks < end; chunks++)
    {
        if (*chunks)   // skip 8-byte chunks that are all zero
        {
            sum += countBytesBy1FromTo((u8*)chunks, 0, 8);
        }
    }
    return sum;
}
The basic trick is exploiting the ability to cast slices of your target array to single entities your language can look at in one swoop, and test in one comparison whether all of the values in the slice are zero, in which case you skip the whole block. The more zeros, the better it will work. In the case where your large cast integer always has at least one set byte, this approach just adds overhead. You might find that using a u32 is better for your data, or that adding a u32 test between the 1-byte and 8-byte passes helps. For datasets where zeros are much more common than ones, I've used this technique to great advantage.
Why is sorting invalid? You can clone the original array, sort the clone, and count and/or mark the locations of the 1s as needed.

Algorithm for detecting duplicates in a dataset which is too large to be completely loaded into memory

Is there an optimal solution to this problem?
Describe an algorithm for finding duplicates in a file of one million phone numbers. The algorithm, when running, would only have two megabytes of memory available to it, which means you cannot load all the phone numbers into memory at once.
My 'naive' solution would be an O(n^2) solution which iterates over the values and just loads the file in chunks instead of all at once.
For i = 0 to 999,999
    string currentVal = get the item at index i
    for j = i+1 to 999,999
        if (j - i mod fileChunkSize == 0)
            load file chunk into array
        if data[j] == currentVal
            add currentVal to duplicateList and exit for
There must be another approach where you can load the whole dataset in a really clever way and verify whether a number is duplicated. Anyone have one?
Divide the file into M chunks, each of which is large enough to be sorted in memory. Sort them in memory.
For each set of two chunks, we will then carry out the last step of mergesort on two chunks to make one larger chunk (c_1 + c_2) (c_3 + c_4) .. (c_m-1 + c_m)
Point at the first element on c_1 and c_2 on disk, and make a new file (we'll call it c_1+2).
if c_1's pointed-to element is a smaller number than c_2's pointed-to element, copy it into c_1+2 and point to the next element of c_1.
Otherwise, copy c_2's pointed-to element into c_1+2 and point to the next element of c_2.
Repeat the previous step until both arrays are empty. You only need to use the space in memory needed to hold the two pointed-to numbers. During this process, if you encounter c_1 and c_2's pointed-to elements being equal, you have found a duplicate - you can copy it in twice and increment both pointers.
The resulting m/2 arrays can be recursively merged in the same manner- it will take log(m) of these merge steps to generate the correct array. Each number will be compared against each other number in a way that will find the duplicates.
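As a sketch of the merge phase in Python (this uses heapq.merge for a k-way merge rather than the pairwise passes described above; the sorted chunks would really be files on disk, represented here by lists):

import heapq

def find_duplicates(sorted_chunks):
    # in the fully merged sorted stream, duplicates show up as equal adjacent elements
    duplicates = set()
    previous = None
    for number in heapq.merge(*sorted_chunks):   # holds only one element per chunk in memory
        if number == previous:
            duplicates.add(number)
        previous = number
    return duplicates

chunk1 = [5551234, 5553333, 5559999]   # each chunk is already sorted
chunk2 = [5550000, 5553333, 5559999]
print(sorted(find_duplicates([chunk1, chunk2])))   # [5553333, 5559999]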
Alternately, a quick and dirty solution as alluded to by @Evgeny Kluev is to make a Bloom filter which is as large as you can reasonably fit in memory. You can then make a list of the index of each element that the Bloom filter flags as possibly seen before, and loop through the file a second time in order to test these members for duplication.
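If you go the Bloom filter route, a minimal sketch might look like the following (the bit-array size and hash count are illustrative; a real filter would size them from the expected number of elements and the acceptable false-positive rate):

import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # derive num_hashes bit positions from one digest (double hashing)
        digest = hashlib.sha256(str(item).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def maybe_seen(self, item):
        # False means definitely new; True means "possibly seen", so re-check on a second pass
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

bf = BloomFilter(num_bits=8 * 1024 * 1024, num_hashes=7)   # about 1MB of bits
candidates = []
for index, phone in enumerate([5551234, 5550000, 5551234]):
    if bf.maybe_seen(phone):
        candidates.append(index)   # indices to verify on a second pass over the file
    else:
        bf.add(phone)
print(candidates)   # [2] -- the repeated 5551234 (false positives are possible but rare)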
I think @airza's solution is heading in a good direction, but since sorting is not what you want and it is more expensive, you can do the following, combining it with @angelatlarge's approach:
Take a chunk C that fits in memory, of size M/2.
Get the chunk Ci.
Iterate through Ci and hash each element into a hash table. If the element is already there, then you know it is a duplicate, and you can mark it as one (e.g. add its index to an array).
Get the next chunk Ci+1 and check if any of its keys already exist in the hash table. If an element exists, mark it for deletion.
Repeat with all remaining chunks, so you know they do not contain any duplicates from chunk Ci.
Repeat steps 1 and 2 with chunk Ci+1.
Delete all elements marked for deletion (this could be done during the scan, whichever is more appropriate; it might be more expensive to delete one at a time if you have to shift everything else around).
This runs in O((N/M)*|C|) , where |C| is the chunk size. Notice that if M > 2N, then we only have one chunk, and this runs in O(N), which is optimal for deleting duplicates.
We simply hash them and make sure that all collisions are deleted.
Edit: Per requested, I'm providing details:
N is the number of phone numbers.
The size of the chunk will depend on the memory, it should be of size M/2.
This is the size of memory that will load a chunk of the file, since the whole file is too big to be loaded to memory.
This leaves another M/2 bytes to keep the hash table [2], and/or a duplicate list [1].
Hence, there should be N/(M/2) chunks, each of size |C| = M/2
The run time will be the number of chunks(N/(M/2)), times the size of each chunk |C| (or M/2). Overall, this should be linear (plus or minus the overhead of changing from one chunk to the other, which is why the best way to describe it is O( (N/M) * |C| )
a. Loading a chunk Ci. O(|C|)
b. Iterate through each element, test whether it is in the hash table and insert it if not; with hashing, insertion and lookup should each take O(1).
c. If the element is already there, you can delete it. [1]
d. Get the next chunk, rinse and repeat (2N/M chunks, so O(N/M))
[1] Removing an element might cost O(N), unless we keep a list and remove them all in one go, thereby avoiding shifting all the remaining elements whenever a single element is removed.
[2] If the phone numbers can all be represented as integers < 2^32 - 1, we can avoid having a full hash table and just use a flag map, saving piles of memory (we'll only need N bits of memory).
Here's a somewhat detailed pseudo-code:
void DeleteDuplicate(File file, int numberOfPhones, int maxMemory)
{
    // Assume 1,000,000 phone numbers, each fitting in 32 bits.
    // Assume 2MB of memory.
    // Assume that arrays of bool are coalesced into 8 bools per byte instead of 1 bool per byte.
    int chunkSize = maxMemory / 2; // 2MB / 2 / 4-bytes per int = 1MB or 256K integers

    // numberOfPhones bits. C++ vector<bool>, for example, would be space efficient.
    // Coalesced-size ~= 122KB | Non-Coalesced-size (worst-case) ~= 977KB
    bool[] exists = new bool[numberOfPhones]; // (assumes a phone value can be used to index this flag array)

    byte[] numberData = new byte[chunkSize];
    int fileIndex = 0;
    int bytesLoaded;
    do // O(chunkNumber)
    {
        bytesLoaded = file.GetNextBytes(chunkSize, /*out*/ numberData);
        List<int> toRemove = new List<int>(); // we still have some 30KB-odd to spare, enough for some 6 thousand-odd duplicates
        for (int ii = 0; ii < bytesLoaded; ii += 4) // O(chunkSize)
        {
            int phone = BytesToInt(numberData, ii);
            if (exists[phone])
                toRemove.push(ii);
            else
                exists[phone] = true;
        }
        for (int ii = toRemove.Length - 1; ii >= 0; --ii)
            numberData.removeAt(toRemove[ii], 4);
        File.Write(fileIndex, numberData);
        fileIndex += bytesLoaded;
    } while (bytesLoaded > 0); // while there is still stuff to load
}
If you can store temporary files you can load the file in chunks, sort each chunk, write it to a file, and then iterate through the chunks and look for duplicates. You can easily tell if a number is duplicated by comparing it to the next number in the file and the next number in each of the chunks. Then move to the next lowest number of all of the chunks and repeat until you run out of numbers.
Your runtime is O(n log n) due to the sorting.
I like @airza's solution, but perhaps there is another algorithm to consider: maybe one million phone numbers cannot be loaded into memory at once because they are expressed inefficiently, i.e. using more bytes per phone number than necessary. In that case, you might be able to get an efficient solution by hashing the phone numbers and storing the hashes in a (hash) table. Hash tables support dictionary operations (such as in) that let you find dupes easily.
To be more concrete about it, if each phone number is 13 bytes (such as a string in the format (NNN)NNN-NNNN), the string represents one of a billion numbers. As an integer, this can be stored in 4 bytes (instead of 13 in the string format). We then might be able to store this 4-byte "hash" in a hash table, because now our 1 billion hashed numbers take up as much space as 308 million numbers, not one billion. Ruling out impossible numbers (everything in area codes 000, 555, etc.) might allow us to reduce the hash size further.
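A sketch of that packing step, assuming the (NNN)NNN-NNNN string format used in the example above (phone_to_int is a hypothetical helper name):

import re

def phone_to_int(formatted):
    # "(123)456-7890" -> 1234567890; values below 2**32 fit in 4 bytes
    return int(re.sub(r"\D", "", formatted))

key = phone_to_int("(123)456-7890")
print(key)           # 1234567890
print(key < 2**32)   # True, so this particular value packs into 4 bytes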

How to design a data structure that allows one to search, insert and delete an integer X in O(1) time

Here is an exercise (3-15) in the book "Algorithm Design Manual".
Design a data structure that allows one to search, insert, and delete an integer X in O(1) time (i.e. , constant time, independent of the total number of integers stored). Assume that 1 ≤ X ≤ n and that there are m + n units of space available, where m is the maximum number of integers that can be in the table at any one time. (Hint: use two arrays A[1..n] and B[1..m].) You are not allowed to initialize either A or B, as that would take O(m) or O(n) operations. This means the arrays are full of random garbage to begin with, so you must be very careful.
I am not really seeking the answer, because I don't even understand what this exercise is asking.
From the first sentence:
Design a data structure that allows one to search, insert, and delete an integer X in O(1) time
I can easily design a data structure like that. For example:
Because 1 <= X <= n, I can just have a bit vector of n slots and let X be the index into the array. To insert, e.g., 5, set a[5] = 1; to delete 5, set a[5] = 0; to search for 5, simply return a[5], right?
I know this exercise is harder than I imagine, but what's the key point of this question?
You are basically implementing a multiset with bounded size, both in number of elements (#elements <= m), and valid range for elements (1 <= elementValue <= n).
Search: myCollection.search(x) --> return True if x inside, else False
Insert: myCollection.insert(x) --> add exactly one x to collection
Delete: myCollection.delete(x) --> remove exactly one x from collection
Consider what happens if you try to store 5 twice, e.g.
myCollection.insert(5)
myCollection.insert(5)
That is why you cannot use a bit vector. But it says "units" of space, so the elaboration of your method would be to keep a tally of each element. For example you might have [_,_,_,_,1,_,...] then [_,_,_,_,2,_,...].
Why doesn't this work however? It seems to work just fine for example if you insert 5 then delete 5... but what happens if you do .search(5) on an uninitialized array? You are specifically told you cannot initialize it, so you have no way to tell if the value you'll find in that piece of memory e.g. 24753 actually means "there are 24753 instances of 5" or if it's garbage.
NOTE: You must allow yourself O(1) initialization space, or the problem cannot be solved. (Otherwise a .search() would not be able to distinguish the random garbage in your memory from actual data, because you could always come up with random garbage which looked like actual data.) For example you might consider having a boolean which means "I have begun using my memory" which you initialize to False, and set to True the moment you start writing to your m words of memory.
If you'd like a full solution, you can hover over the grey block to reveal the one I came up with. It's only a few lines of code, but the proofs are a bit longer:
SPOILER: FULL SOLUTION
Setup:
Use N words as a dispatch table: locationOfCounts is an array of size N, with values in the range location=[0,M]. locationOfCounts[i] is the location where the count of i would be stored, but we can only trust this value if we can prove it is not garbage.
(sidenote: This is equivalent to an array of pointers, but an array of pointers exposes you being able to look up garbage, so you'd have to code that implementation with pointer-range checks.)
To find out how many i's there are in the collection, you look up the value counts[loc], where loc = locationOfCounts[i] from above. We use M words as the counts themselves: counts is an array of size M, with two values per element. The first value is the number it represents, and the second value is the count of that number (in the range [1,m]). For example a value of (5,2) would mean that there are 2 instances of the number 5 stored in the collection.
(M words is enough space for all the counts. Proof: We know there can never be more than M elements, therefore the worst-case is we have M counts of value=1. QED)
(We also choose to only keep track of counts >= 1, otherwise we would not have enough memory.)
Use a number called numberOfCountsStored that IS initialized to 0 but is updated whenever the number of item types changes. For example, this number would be 0 for {}, 1 for {5:[1 times]}, 1 for {5:[2 times]}, and 2 for {5:[2 times],6:[4 times]}.
                          1  2  3  4  5  6  7  8...
locationOfCounts[<N]: [☠, ☠, ☠, ☠, ☠, 0, 1, ☠, ...]
counts[<M]:           [(5,⨯2), (6,⨯4), ☠, ☠, ☠, ☠, ☠, ☠, ☠, ☠..., ☠]
numberOfCountsStored:          2
Below we flush out the details of each operation and prove why it's correct:
Algorithm:
There are two main ideas: 1) we can never allow ourselves to read memory without verifying that it is not garbage first, or if we do, we must be able to prove that it was garbage; 2) we need to be able to prove in O(1) time that a piece of counter memory has been initialized, using only O(1) space. To go about this, the O(1) space we use is numberOfCountsStored. Each time we do an operation, we will go back to this number to prove that everything was correct (e.g. see ★ below). The representation invariant is that we will always store counts in counts going from left to right, so numberOfCountsStored will always be the number of valid entries (one past the maximum valid index) of that array.
.search(e) -- Check locationOfCounts[e]. We assume for now that the value is properly initialized and can be trusted. We proceed to check counts[loc], but first we check if counts[loc] has been initialized: it's initialized if 0<=loc<numberOfCountsStored (if not, the data is nonsensical so we return False). After checking that, we look up counts[loc] which gives us a (number, count) pair. If number!=e, we got here by following randomized garbage (nonsensical), so we return False (again as above)... but if indeed number==e, this proves that the count is correct (★proof: numberOfCountsStored is a witness that this particular counts[loc] is valid, and counts[loc].number is a witness that locationOfCounts[number] is valid, and thus our original lookup was not garbage), so we would return True.
.insert(e) -- Perform the steps in .search(e). If it already exists, we only need to increment the count by 1. However if it doesn't exist, we must tack on a new entry to the right of the counts subarray. First we increment numberOfCountsStored to reflect the fact that this new count is valid: loc = numberOfCountsStored++. Then we tack on the new entry: counts[loc] = (e,⨯1). Finally we add a reference back to it in our dispatch table so we can look it up quickly locationOfCounts[e] = loc.
.delete(e) -- Perform the steps in .search(e). If it doesn't exist, throw an error. If the count is >= 2, all we need to do is decrement the count by 1. Otherwise the count is 1, and the trick here, to ensure the whole numberOfCountsStored-counts[...] invariant (i.e. everything remains stored on the left part of counts), is to perform swaps. Since the deletion removes a whole counts pair, it leaves a hole in our array: [countPair0, countPair1, _hole_, countPair2, ..., countPair{numberOfCountsStored-1}, ☠, ☠, ☠..., ☠]. We swap this hole with the last countPair, decrement numberOfCountsStored to invalidate the hole, and update locationOfCounts[the_count_record_we_swapped.number] so it now points to the new location of the count record.
Here is an idea:
treat the array B[1..m] as a stack, and make a pointer p to point to the top of the stack (let p = 0 to indicate that no elements have been inserted into the data structure). Now, to insert an integer X, use the following procedure:
p++;
A[X] = p;
B[p] = X;
Searching should be pretty easy to see here (let X' be the integer you want to search for, then just check that 1 <= A[X'] <= p, and that B[A[X']] == X'). Deleting is trickier, but still constant time. The idea is to search for the element to confirm that it is there, then move something into its spot in B (a good choice is B[p]). Then update A to reflect the pointer value of the replacement element and pop off the top of the stack (e.g. set B[p] = -1 and decrement p).
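A sketch of that scheme in Python with set semantics (the lists stand in for the uninitialized arrays and are deliberately filled with garbage to show that no cell of A or B is ever read before it has been written; the arrays are sized n+1 and m+1 so they can be indexed from 1 as in the exercise):

import random

class ConstantTimeSet:
    def __init__(self, n, m):
        # "uninitialized" arrays: filled with garbage on purpose
        self.A = [random.randint(-10**9, 10**9) for _ in range(n + 1)]
        self.B = [random.randint(-10**9, 10**9) for _ in range(m + 1)]
        self.p = 0   # the only word we actually initialize

    def search(self, x):
        a = self.A[x]
        return 1 <= a <= self.p and self.B[a] == x

    def insert(self, x):
        if self.search(x):
            return                  # already present (set semantics)
        self.p += 1
        self.A[x] = self.p
        self.B[self.p] = x

    def delete(self, x):
        if not self.search(x):
            return
        slot = self.A[x]
        top = self.B[self.p]        # move the top of the stack into the freed slot
        self.B[slot] = top
        self.A[top] = slot
        self.p -= 1

s = ConstantTimeSet(n=100, m=10)
s.insert(5); s.insert(42)
print(s.search(5), s.search(7))    # True False
s.delete(5)
print(s.search(5), s.search(42))   # False True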
It's easier to understand the question once you know the answer: an integer is in the set if A[X]<total_integers_stored && B[A[X]]==X.
The question is really asking if you can figure out how to create a data structure that is usable with a minimum of initialization.
I first saw the idea in Cameron's answer in Jon Bentley's Programming Pearls.
The idea is pretty simple, but it's not straightforward to see why the initial random values that may be in the uninitialized arrays do not matter. This link explains the insertion and search operations pretty well. Deletion is left as an exercise, but is answered by one of the commenters:
remove-member(i):
    if not is-member(i): return
    j = dense[n-1]
    dense[sparse[i]] = j
    sparse[j] = sparse[i]
    n = n - 1
