How many probes are needed to avoid collisions in hashing? - data-structures

I have placed the even values (i.e., 0, 2, 4, 6, ..., 19996, 19998) in my hash table such that:
Value 0 is stored at home address 0 and value 2 is stored at home address 2. Similarly, value 16,382 is stored at home address 16,382, but values 16,384 to 19,998 will collide.
Given these collisions, how many probes are needed to search for each target value from 0 to 19,999?

Related

How to remap group indices of a sparse set to a compact set?

Assume we have a list of data, for example
data [0] = /* employee 0 name, age, DOB, position */
data [1] = /* employee 1 name, age, DOB, position */
...
data [n-1] = /* employee n-1 name, age, DOB, position */
We also have a list of groups/teams, which is a list of lists:
group [0] = {0, 1, 72}
group [1] = {38, 1, 40}
...
group [k] = {0, 70, 72, 90}
Groups can have any non-zero number of indices, and an index can appear in any number of groups.
The input is guaranteed such that every index from [0, n-1] is present in at least one group.
A random list of teams to be deleted is given to you, for example remove {1, 6,7,8} means to remove the groups on the group list at indices 1,6,7,8.
Assume you do remove the groups.
You now potentially have indices in the data that belong to no group.
You want to remove any such datum, but you also want to keep indices contiguous. So for example if the input is
data has 4 elements
group[0] = {0, 1, 2}
group[1] = {0, 2, 3}
group[2] = {2, 3}
And you are to remove group 0, then datum 1 is to be removed, meaning you must shift the indices in groups 1 and 2.
The new data would look like:
data has 3 elements
group[1] = {0, 1, 2}
group[2] = {1, 2}
I want to implement this in an efficient way.
My current solution is to delete the requested groups, iterate over the data checking for entries without an assigned group, and build a permutation map for the surviving indices.
Then I copy all surviving groups through the permutation map.
This is very, very slow for large data. Is there a way to do this without using O(n) additional memory? Or, at the bare minimum, with a data structure that has better cache performance than a map?
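For reference, the permutation-map approach can use flat arrays instead of a map, which tends to be more cache-friendly. Here is a minimal sketch in Python; all names are illustrative, and `removed` is assumed to be a set of group indices to delete:

```python
def compact(data, groups, removed):
    # Drop the removed groups.
    groups = [g for i, g in enumerate(groups) if i not in removed]

    # Mark which data indices are still referenced by any group.
    alive = [False] * len(data)
    for g in groups:
        for idx in g:
            alive[idx] = True

    # Build the permutation as a flat array: old index -> new index,
    # compacting the surviving data as we go.
    perm = [0] * len(data)
    new_data = []
    for old, keep in enumerate(alive):
        if keep:
            perm[old] = len(new_data)
            new_data.append(data[old])

    # Rewrite all surviving groups through the permutation.
    new_groups = [[perm[idx] for idx in g] for g in groups]
    return new_data, new_groups
```

This still uses O(n) extra memory for the two auxiliary arrays, but they are contiguous and scanned sequentially, which is usually much faster than a hash map of the same size.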

Encoding a number with a suffix or prefix in a compact way

Let's say I have 6-digit order ids:
000000
000001
000003
...
000020
...
999999
And assume each of these comes from a different node in a distributed system, and I would like to encode the node id into the order id.
The easiest way to do this would be to simply reserve the first 2 digits for the node id, like this:
010000 - second node
020001 - third node
010003 - second node again
150004 - 16th node
...
This works sort of fine, but since I know for sure I'm only expecting a small number of nodes (let's say 16), I'm losing lots of possible ids by limiting myself to basically 10^4 per node instead of 10^6. Is there a smart way to encode the 15 unique nodes without limiting the possible numbers? Ideally, I would have 10^6 - 15 possibilities.
EDIT: I'm looking for a solution that won't equally distribute a range to each node id. I'm looking for a way to encode the node id in an already existing unique id, without losing (ideally) more than the number of nodes of possibilities.
EDIT2: The reason for which this has to be the string representation of a 6 digit number is because the API I'm working with requires this. There's no way around it, unfortunately.
I'm losing lots of possible ids limiting myself to basically 10^4 instead of 10^6.
We still have 10^4 * 16 ids in total.
Is there a smart way to encode the 15 unique nodes without limiting the possible numbers?
This problem is similar to the distributed hash table keyspace partitioning. The best known solution for the problem is to create lots of virtual nodes, divide the keyspace among those virtual nodes and then assign those virtual nodes to physical in a particular manner (round-robin, random, on demand etc).
The easiest way to implement keyspace partition is to make sure each node generates such an id, that:
vnode_id = order_id % total_number_of_vnodes
For example, if we have just 3 vnodes [0, 1, 2] then:
vnode 0 must generate ids: 0, 3, 6, 9...
vnode 1 must generate ids: 1, 4, 7, 10...
vnode 2 must generate ids: 2, 5, 8, 11...
If we have 7 vnodes [0, 1, 2, 3, 4, 5, 6] then:
vnode 0 must generate ids: 0, 7, 14, 21...
vnode 1 must generate ids: 1, 8, 15, 22...
vnode 2 must generate ids: 2, 9, 16, 23...
...
vnode 6 must generate ids: 6, 13, 20, 27...
Then all physical nodes must map to the virtual nodes in a known, shared way, for example a 1:1 mapping:
physical node 0 takes vnode 0
physical node 1 takes vnode 1
physical node 2 takes vnode 2
on demand mapping:
physical node 0 takes vnode 0, 3, 7 (many orders)
physical node 1 takes vnode 1, 4 (less orders)
physical node 2 takes vnode 2 (no orders)
I hope you grasp the idea.
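The generation rule above can be sketched in a few lines of Python (the names and the vnode count are illustrative, not from the question):

```python
TOTAL_VNODES = 16  # illustrative: the agreed total number of virtual nodes

def next_id_for_vnode(vnode_id, last_id=None):
    # Each vnode only generates ids congruent to its own id modulo
    # TOTAL_VNODES: vnode_id, vnode_id + 16, vnode_id + 32, ...
    if last_id is None:
        return vnode_id
    return last_id + TOTAL_VNODES

def vnode_of(order_id):
    # Any id can be attributed to its vnode with a single modulo.
    return order_id % TOTAL_VNODES
```

Because attribution is a single modulo, no id space is reserved up front; the keyspace is interleaved rather than partitioned into blocks.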
Ideally, I would have 10^6 - 15 possibilities.
Unfortunately, that is not possible. Consider this: we have 10^6 possible ids and 15 different nodes, each generating unique ids.
Basically, this means that one way or another we are dividing our ids among the nodes, i.e. each node gets on average 10^6 / 15, which is much less than the desired 10^6 - 15.
Using the method described above we still have 10^6 ids in total, but they will be partitioned among vnodes which in turn will be mapped to physical nodes. That is the best practical solution for your problem AFAIK.
I'm looking for a solution that won't equally distribute a range to each node id. I'm looking for a way to encode the node id in an already existing unique id, without losing (ideally) more that the number of nodes of possibilities.
Do not expect a miracle, but there are lots of other tricks worth trying.
For example, suppose the Server and all Clients know that the next order id must be 235, but Client 5 generates order id 240 (235 + 5) and sends it to the Server.
The Server expects order id 235 but receives order id 240, so now the Server knows that this order comes from Client 5 (240 - 235).
Or we can try to use another field to store the client id. For instance, if you have a time field (HH:MM:SS), we might use the seconds to store the client id.
Just some examples, I guess you get the idea...
Let n be the number represented by the first 2 digits of the 6 digit input. Assuming you have 16 nodes we can do:
nodeId = n % 16
Also:
highDigit = n / 16
Where / represents integer division. For 16 nodes, highDigit ∈ [0..6].
If m is the number represented by the last 4 digits of the input, then we can recover the original order id by:
orderId = highDigit*10^4 + m
With this scheme, and 16 nodes, you can represent 6*10^4 + 10^4 = 7*10^4 order ids.
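A sketch of this scheme, assuming 16 nodes; function names are illustrative. Note the constraint that the combined two leading digits n = 16*highDigit + nodeId must stay below 100, which bounds the order ids each node can carry:

```python
NODES = 16  # illustrative node count

def encode_id(order_id, node_id):
    # Pack node_id into the two leading digits as described above.
    high, m = divmod(order_id, 10**4)
    n = high * NODES + node_id
    assert n < 100, "order_id too large for this node_id"
    return n * 10**4 + m

def decode_id(encoded):
    # Invert the packing: split off the two leading digits, then
    # recover highDigit and node_id from them.
    n, m = divmod(encoded, 10**4)
    high, node_id = divmod(n, NODES)
    return high * 10**4 + m, node_id
```

For example, order id 12345 on node 7 encodes to 232345, since the leading digits become 1*16 + 7 = 23.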
You could split the 10^6 possible IDs into close-to-equal chunks: the chunk size is 10^6 divided by the number of chunks, rounded down, and each chunk begins at the chunk size times the chunk index. In your example there are sixteen chunks:
10^6 / 16 = 62,500
chunk1: [ 0, 62500)
chunk2: [ 62500, 125000)
chunk3: [125000, 187500)
chunk4: [187500, 250000)
chunk5: [250000, 312500)
chunk6: [312500, 375000)
chunk7: [375000, 437500)
chunk8: [437500, 500000)
chunk9: [500000, 562500)
chunk10: [562500, 625000)
chunk11: [625000, 687500)
chunk12: [687500, 750000)
chunk13: [750000, 812500)
chunk14: [812500, 875000)
chunk15: [875000, 937500)
chunk16: [937500, 1000000)
To compute a global ID from a local ID on node X, calculate 62500 * X + local ID. To recover the node and local ID from a global ID, calculate node = global ID / 62500, rounded down, and local ID = global ID mod 62500.
Doing this you get to use basically all of the available indices, up to a rounding error. Integer division and modulus should be relatively quick compared to the I/O between nodes.
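The chunk arithmetic above can be sketched as follows (a minimal example assuming 16 nodes; names are illustrative):

```python
NUM_NODES = 16
CHUNK = 10**6 // NUM_NODES  # 62500 ids per node

def to_global(node, local_id):
    # local_id must lie in [0, CHUNK)
    return CHUNK * node + local_id

def from_global(global_id):
    # Returns (node, local_id): the chunk index and the offset in it.
    return divmod(global_id, CHUNK)
```

For instance, local id 5 on node 2 maps to global id 125005, and dividing 125005 by 62500 recovers node 2 with remainder 5.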
Since you've chosen to use digits (rather than bits, where we could pack this entire exercise into a 32-bit number), here's one way to encode node ids. Perhaps others can come up with some more ideas.
Extend the digit alphabet with the letters A to J. Imagine the bits of the node's ID distributed over the six digits: for each set bit, map the corresponding decimal digit of the order ID to a letter:
0 -> A
1 -> B
2 -> C
...
9 -> J
For example:
{759243, 5} -> 759C4D
Now you can encode all 10^6 order IDs together with a 6-bit node ID.
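A sketch of this encoding in Python, assuming bit 0 of the node ID controls the last digit, bit 1 the next one, and so on (function names are illustrative):

```python
def encode_order(order_id, node_id):
    # order_id in [0, 10**6), node_id in [0, 64) (6 bits).
    # A set bit turns the corresponding digit into a letter
    # (0 -> A, 1 -> B, ..., 9 -> J).
    digits = list(f"{order_id:06d}")
    for bit in range(6):
        if node_id >> bit & 1:
            pos = 5 - bit
            digits[pos] = chr(ord('A') + int(digits[pos]))
    return ''.join(digits)

def decode_order(s):
    # Letters reveal the set bits of the node id; mapping them back
    # to digits recovers the original order id.
    node_id = 0
    digits = []
    for pos, ch in enumerate(s):
        if ch.isalpha():
            node_id |= 1 << (5 - pos)
            digits.append(str(ord(ch) - ord('A')))
        else:
            digits.append(ch)
    return int(''.join(digits)), node_id
```

This reproduces the example above: order id 759243 on node 5 (binary 000101) becomes 759C4D.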

Storing a specific set of numbers in ruby

How do you generate some sort of checksum for an array of 5 numbers that distinguishes one set of numbers from another, regardless of their order?
For example:
[ 1, 2, 3, 4, 5] has the same checksum as [ 2, 3, 4, 5, 1]
I want to generate millions of 5-digit combinations and compare them against a predetermined set of numbers. I want to be able to checksum the ones I generate and then compare them against a bank of numbers I've already generated.
Let me explain:
I create an array of numbers as an array
I generate 6 numbers in an array using Rand()
I compare the generated numbers to the array I created, exit if they match
If the numbers don't match, create a hash that I can compare future arrays to. The arrangement of the numbers inside of the array do not matter.
I thought about using an md5sum, but then if the order of the elements changes, the md5 changes too.
I could just store the arrays in memory, but I'm trying to minimize the amount of numbers I store in memory
[1,2,3,4,5].sort.hash #=> 1777030444607087813
[2,3,4,5,1].sort.hash #=> 1777030444607087813
should make the sets distinguishable.
It should also be a more memory-friendly solution, because
5.size #=> 8
1777030444607087813.size #=> 8
The problem with hashes is that you always need to worry about collisions. Here is a way to make sure each set gets a unique value, thanks to unique prime factorization (and it's even O(N)):
require 'prime'
pr = Prime.take(10)
[ 1, 2, 3, 4, 5].map{|x| pr[x]}.reduce(&:*)
=> 15015
[ 2, 3, 4, 5, 1].map{|x| pr[x]}.reduce(&:*)
=> 15015

Union of two sets given a certain ordering in O(n) time

[Note: I am hoping this problem can be solved in O(n) time. If not, I am looking for a proof that it cannot be solved in O(n) time. If I get the proof, I'll try to implement a new algorithm to reach to the union of these sorted sets in a different way.]
Consider the sets:
(1, 4, 0, 6, 3)
(0, 5, 2, 6, 3)
The resultant should be:
(1, 4, 0, 5, 2, 6, 3)
Please note that the problem of union of sorted sets is easy. These are also sorted sets, but the ordering is defined by some other property from which these indices have been derived. The ordering (whatever it is) is consistent across both sets: for any i, j ∈ Set X, if i comes before j in X, then i also comes before j in Set Y (whenever both appear in Y).
EDIT: I am sorry, I missed something very important that I have covered in one of the comments below: the intersection of the two sets is not a null set, i.e. the two sets have common elements.
Insert each item in the first set into a hash table.
Go through each item in the second set, looking up that value.
If not found, insert that item into the resulting set.
If found, insert all items from the first set, from after the last item we inserted up to and including this value.
At the end, insert all remaining items from the first set into the resulting set.
Running time
Expected O(n).
Side note
With the constraints given, the union is not necessarily unique.
For example, with inputs (1) and (2), the resulting set can be either (1, 2) or (2, 1).
This answer will pick (2, 1).
Implementation note
Obviously looping through the first set to find the last inserted item is not going to result in an O(n) algorithm. Instead we must keep an iterator into the first set (not the hash table), and then we can simply continue from the last position that iterator had.
Here's some pseudo-code, assuming both sets are arrays (for simplicity):
for i = 0 to input1.length - 1
    hashTable.insert(input1[i])
i = 0 // this will be our 'iterator' into the first set
for j = 0 to input2.length - 1
    if hashTable.contains(input2[j])
        do
            output.append(input1[i])
            i++
        while input1[i-1] != input2[j]
    else
        output.append(input2[j])
while i < input1.length
    output.append(input1[i])
    i++
The do-while loop inside the for loop may look suspicious, but note that each iteration of that inner loop increases i, so across the whole run it executes at most input1.length times in total.
Example
Input:
(1, 4, 0, 6, 8, 3)
(0, 5, 2, 6, 3)
Hash table: (1, 4, 0, 6, 8, 3)
Then, go through the second set.
Look up 0, found, so insert 1, 4, 0 into the resulting set
(no item from first set inserted yet, so insert all items from the start until we get 0).
Look up 5, not found, so insert 5 into the resulting set.
Look up 2, not found, so insert 2 into the resulting set.
Look up 6, found, so insert 6 into the resulting set
(last item inserted from first set is 0, so only 6 needs to be inserted).
Look up 3, found, so insert 8, 3 into the resulting set
(last item inserted from first set is 6, so insert all items from after 6 until we get 3).
Output: (1, 4, 0, 5, 2, 6, 8, 3)
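The whole algorithm can be written out in Python along these lines (a sketch; `ordered_union` is an illustrative name, and a built-in set stands in for the hash table):

```python
def ordered_union(a, b):
    in_a = set(a)  # the "hash table" of the first set
    out = []
    i = 0  # iterator into the first set
    for x in b:
        if x in in_a:
            # Emit everything from `a` up to and including x.
            while True:
                out.append(a[i])
                i += 1
                if out[-1] == x:
                    break
        else:
            out.append(x)
    # Append whatever remains of the first set.
    out.extend(a[i:])
    return out
```

Running it on the worked example above, `ordered_union([1, 4, 0, 6, 8, 3], [0, 5, 2, 6, 3])` yields `[1, 4, 0, 5, 2, 6, 8, 3]`.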
We have two ordered sets of indices A and B, which are ordered by some function f(), so we know that f(A[i]) < f(A[j]) iff i < j, and the same holds for set B.
This gives a linear mapping to sorted sequences, reducing this to the problem of union of sorted sets.
This also doesn't have the best space complexity, but you can try:
a = [1, 2, 3, 4, 5]
b = [4, 2, 79, 8]
union = {}
for each in a:
    union[each] = 1
for each in b:
    union[each] = 1
for each in union:
    print(each, end=' ')
Output:
>>> 1 2 3 4 5 79 8

Selecting a surviving population in a "voter" Genetic Algorithm

I've been working on a genetic algorithm where there is a population of individuals, each with a color and a preference. Preference and color are drawn from a small number of finite states, probably around 4 or 5 (for example: 1|1, 5|2, 3|3, etc.).
Every individual casts a "vote" for their preference, which benefits the individuals whose color matches that vote.
My current idea is to cycle through every individual, calculate the chance that they should survive based on the number of votes, etc., and then roll a die to see if they live.
I'm currently doing it so that if v[x] is the fraction of votes for color x, an individual k with color c has chance v[c] of surviving. However, this means that if there are equal numbers of all 5 types of (a|a) individuals, 4/5 of them perish, and that's not good.
Does anyone have an idea for a randomized selection method I could use to determine an individual's chance of survival? For instance, an algorithm in which, for v votes for color c, v individuals with color c survive (on statistical average).
Assign your fitness (likelyness of survival in your case) to each individual as is, then sort them on descending fitness and use binary tournament selection or something similar to sample another population of your chosen size.
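Binary tournament selection can be sketched as follows (a hedged example; `fitness`, the population shape, and the sample size `k` are placeholders, not from the question):

```python
import random

def binary_tournament(population, fitness, k):
    # Draw two individuals at random and keep the fitter one;
    # repeat k times to sample the surviving population.
    survivors = []
    for _ in range(k):
        a, b = random.choice(population), random.choice(population)
        survivors.append(a if fitness(a) >= fitness(b) else b)
    return survivors
```

A nice property of tournament selection is that only the ranking of fitness values matters, so equal fitness across the population does not wipe out 4/5 of it.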
Well, you can weight the probabilities according to the value returned by passing each member of the population to the cost function. That seems to me the most straightforward way, consistent with the genetic meta-heuristic.
More common, though, is to divide the current population into segments based on the value returned from passing them to the cost function. So for instance, if each generation consists of 100 members, then the top N members with the lowest cost-function result (N is just a user-defined parameter, often something like 5-10% of the total) are carried forward to the next generation just as they are (elitism). Perhaps this is what you mean by 'survive'. If so, then again, these 'survivors' are determined by ranking the members of the population according to cost-function value and selecting those members above your defined elitism fraction constant. The rest (the majority) of the next generation are created either by mutation or crossover.
mutation:
# one member of the current population:
[4, 5, 1, 7, 4, 2, 8, 9]
# small random change in one member of the prior generation, to create a mutant
# that is a member of the next generation
[4, 9, 1, 7, 4, 2, 8, 9]
crossover:
# two of the 'top' members of the current generation
[4, 5, 1, 7, 4, 2, 8, 9]
[2, 3, 6, 9, 2, 1, 6, 4]
# offspring is a member of the next generation
[4, 5, 1, 7, 2, 1, 6, 4]
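The mutation and crossover examples above can be sketched as (illustrative, assuming members are integer vectors with genes in 0-9):

```python
import random

def mutate(member):
    # Small random change at one position of a prior-generation member.
    child = list(member)
    child[random.randrange(len(child))] = random.randint(0, 9)
    return child

def crossover(parent_a, parent_b):
    # One-point crossover: prefix from one parent, suffix from the other,
    # as in the offspring example above (crossover point 4 there).
    point = random.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]
```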
