Finding a hash function that has specific properties - algorithm

My question relates closely to this topic:
Hash function on list independent of order of items in it
Basically, I have a set of N numbers. N is fixed and typically quite large, e.g. 1000. These numbers can be integers or floating-point. Some or all of them may be equal. No number can be zero.
Every combination of K numbers where K is anything between 1 and N leads to the calculation of a hash.
Let's take an example with 3 numbers, that I will call A, B and C. I need to calculate a hash for the following combinations:
A
B
C
A+B
B+C
A+B+C
A+C
Things are order-independent: C+A is just equal to A+C. '+' can be a real addition or something different, like XOR, but it is fixed. Likewise, every value may first go through a function, e.g.
f(A)
f(B)
f(A)+f(B)+f(C)
...
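To make the setup concrete, here is a minimal Python sketch of such an order-independent combination, where XOR stands in for '+' and a simple placeholder mixer stands in for f (both are arbitrary illustrations, not a proposed solution):

import struct

def f(x):
    # placeholder mixer: hash the number's 64-bit float pattern
    return hash(struct.pack('>d', float(x))) & 0xFFFFFFFFFFFFFFFF

def combo_hash(values):
    acc = 0
    for v in values:
        acc ^= f(v)  # XOR is commutative and associative, so order is irrelevant
    return acc

A, B, C = 1.5, 7, 42
print(combo_hash([A, C]) == combo_hash([C, A]))  # True: C+A equals A+C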
Now, I need to avoid collisions, but only in a specific way.
Each combination is tagged with a number, either 0 or 1.
Collisions should, as far as possible, only occur between combinations tagged with the same number (0 or 1). Such collisions are actually welcome, especially if they keep the hash value compact; ideally, the best hash would be just 1 bit long (0 or 1)!
Collisions between combinations tagged with different numbers (0 and 1) should happen as rarely as possible - this is the whole point.
Let's take an example (combination -> tag -> calculated hash value):
Combination   Tag   Hash
A        ->    0  ->  0
B        ->    1  ->  1
C        ->    0  ->  2
A+B      ->    0  ->  0
B+C      ->    1  ->  1
A+B+C    ->    1  ->  3
A+C      ->    0  ->  2
Here, the hash values are valid because there is no collision between combinations with different tags. A collides with A+B, for instance, but they're both tagged '0'.
However, the hash is not very good overall: it takes four distinct values (2 bits), which seems a lot for only three input numbers.
How can I find a good (or good enough) hash function for this purpose?
Thank you for your insight.

Related

Quick way to compute n-th sequence of bits of size b with k bits set?

I want to develop a way to represent all combinations of b bits with k bits set (equal to 1), such that, given an index, I can quickly get the corresponding binary sequence, and the other way around too. The traditional approach I thought of is to generate the numbers in order, like:
For b=4 and k=2:
0- 0011
1- 0101
2- 0110
3- 1001
4- 1010
5- 1100
If I am given the sequence '1010', I want to be able to quickly produce the number 4 as a response, and given the number 4, to quickly produce the sequence '1010'. However, I can't figure out a way to do either without generating all the sequences that come before (or after).
It is not necessary to generate the sequences in that order; you could do 0- 1001, 1- 0110, 2- 0011 and so on. But there must be no repetition between 0 and (b choose k) - 1, and all sequences have to be represented.
How would you approach this? Is there a better algorithm than the one I'm using?
pkpnd's suggestion is on the right track: essentially, process one digit at a time and, if it's a 1, count the number of options that exist below it via standard combinatorics.
nCr() can be replaced by a table precomputation requiring O(n^2) storage/time. There may be another property you can exploit to reduce the number of nCr values you need to store, by leveraging the absorption property along with the standard recursive formula.
Even with thousands of bits, that table shouldn't be intractably large. Storing the answer also shouldn't be too bad, as 2^1000 is ~300 digits. If you meant hundreds of thousands of bits, that would be a different question. :)
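As a sketch, that precomputation can fill a table with Pascal's rule, C(n, r) = C(n-1, r-1) + C(n-1, r) (the bound N here is arbitrary):

N = 1000  # largest b we expect to handle
C = [[0] * (N + 1) for _ in range(N + 1)]
for n in range(N + 1):
    C[n][0] = 1
    for r in range(1, n + 1):
        C[n][r] = C[n - 1][r - 1] + C[n - 1][r]
# now C[n][r] equals nCr(n, r) for all 0 <= r <= n <= N (and 0 when r > n)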
import math

def nCr(n, r):
    return math.factorial(n) // math.factorial(r) // math.factorial(n - r)

def get_index(value):
    b = len(value)                    # positions remaining
    k = sum(c == '1' for c in value)  # ones still to place
    count = 0
    for digit in value:
        b -= 1                        # positions remaining after the current one
        if digit == '1':
            if b >= k:
                count += nCr(b, k)    # all sequences with a '0' here rank lower
            k -= 1
    return count
print(get_index('0011')) # 0
print(get_index('0101')) # 1
print(get_index('0110')) # 2
print(get_index('1001')) # 3
print(get_index('1010')) # 4
print(get_index('1100')) # 5
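The code above only ranks a sequence; for the other direction (index to sequence), here is a sketch of the inverse under the same ordering, reusing nCr():

def get_sequence(index, b, k):
    bits = []
    for _ in range(b):
        b -= 1
        # sequences with a '0' here keep all k ones for the remaining b positions
        zeros_here = nCr(b, k) if b >= k else 0
        if index < zeros_here:
            bits.append('0')
        else:
            bits.append('1')
            index -= zeros_here
            k -= 1
    return ''.join(bits)

print(get_sequence(4, 4, 2))  # 1010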
Nice question, btw.

Comparing numbers using a modular hash function?

I want to compare b base-b numbers of b digits each to determine which ones are the same, using a hash table. If I use a modular hash function, should I use h(a) = a mod (b) or h(a) = a mod (b-1)? I am not sure how to determine if these are suitable or not.
So you have b numbers in the range 0 ... b^b - 1 (e.g. 10 numbers in the range 0 ... 9999999999).
If you want to guarantee that the hash function is collision-free, you cannot use mod. If you use e.g. a mod 10, then 31 and 56465421 both get a hash of 1 and collide, and the same problem exists for every modulus below 10000000000.
So you can only reduce the probability of hash collisions. The smallest modulus with any chance of avoiding collisions is b (though you'll most probably still run into collisions). Without doing proper probability computations, I'd go for something like mod b*b, effectively taking the two trailing base-b digits.
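A quick empirical check of that reasoning (a Python sketch; the random inputs are just for illustration):

import random

b = 10
numbers = random.sample(range(b ** b), b)  # b distinct numbers in 0 ... b^b - 1

def colliding_pairs(mod):
    buckets = {}
    for a in numbers:
        buckets.setdefault(a % mod, []).append(a)
    return sum(len(v) * (len(v) - 1) // 2 for v in buckets.values())

print(colliding_pairs(b))      # mod b: collisions very likely
print(colliding_pairs(b * b))  # mod b*b: noticeably fewer on average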

Generating a perfect hash function given known list of strings?

Suppose I have a list of N strings, known at compile-time.
I want to generate (at compile-time) a function that will map each string to a distinct integer between 1 and N inclusive. The function should take very little time or space to execute.
For example, suppose my strings are:
{"apple", "orange", "banana"}
Such a function may return:
f("apple") -> 2
f("orange") -> 1
f("banana") -> 3
What's a strategy to generate this function?
I was thinking of analyzing the strings at compile time and looking for a couple of constants I could mod or add by, or something like that?
The compile-time generation time/space can be quite expensive (but obviously not ridiculously so).
Say you have m distinct strings, and let a_{i,j} be the j-th character of the i-th string. In the following, I'll assume that they all have the same length. This can easily be handled in any reasonable programming language by treating a_{i,j} as the null character if j ≥ |a_i|.
The idea I suggest is composed of two parts:
Find (at most) m - 1 positions differentiating the strings, and store these positions.
Create a perfect hash function by considering the strings as length-(m - 1) vectors (one entry per stored position), and storing the parameters of the perfect hash function.
Obviously, in general, the hash function must check at least m - 1 positions. It's easy to see this by induction. For 2 strings, at least 1 character must be checked. Assume it's true for i strings: i - 1 positions must be checked. Now create a new set of strings by appending 0 to the end of each of the i strings, and add a new string that is identical to one of them, except it has a 1 at the end: these i + 1 strings require i checked positions.
Conversely, it's obvious that it's possible to find at most m - 1 positions sufficient for differentiating the strings (for some sets the number of course might be lower, as low as log to the base of the alphabet size of m). Again, it's easy to see so by induction. Two distinct strings must differ at some position. Placing the strings in a matrix with m rows, there must be some column where not all characters are the same. Partitioning the matrix into two or more parts, and applying the argument recursively to each part with more than 2 rows, shows this.
Say the m - 1 positions are p_1, ..., p_{m-1}. In the following, recall the convention above for a_{i,p_j} when p_j ≥ |a_i|: it is the null character.
Let us define h(a_i) = (∑_{j=1}^{m-1} q_j a_{i,p_j}) mod n, for random q_j and some n. Then h is known to be a universal hash function: the probability of a pair collision, P(x ≠ y ∧ h(x) = h(y)), is at most 1/n.
Given a universal hash function, there are known constructions for creating a perfect hash function from it. Perhaps the simplest is creating a vector of size m^2 and repeatedly trying the above h with n = m^2 and fresh randomized coefficients until there are no collisions. The expected number of attempts is 2, and the probability that more attempts are needed decreases exponentially.
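As a rough Python sketch of that construction (simplified: it hashes every character position instead of only the m - 1 differentiating ones, the coefficient range is arbitrary, and the resulting values land in 0 ... m^2 - 1 rather than 1 ... N):

import random

def make_perfect_hash(strings):
    m = len(strings)
    n = m * m                                         # table of size m^2
    width = max(len(s) for s in strings)
    padded = [s.ljust(width, '\0') for s in strings]  # null-pad, as above
    while True:                                       # expected ~2 attempts
        q = [random.randrange(n) for _ in range(width)]
        h = lambda s: sum(qj * ord(c) for qj, c in zip(q, s)) % n
        if len({h(s) for s in padded}) == m:          # collision-free: perfect
            return lambda s: h(s.ljust(width, '\0'))

f = make_perfect_hash(["apple", "orange", "banana"])
print([f(s) for s in ["apple", "orange", "banana"]])  # three distinct values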
It is simple. Make a dictionary and assign 1 to the first word, 2 to the second, ... No need to make things complicated; just number your words.
To make the lookup efficient, use a trie, binary search, or whatever tool your language provides.

Cartesian product in J

I'm trying to reproduce APL code for the game of life function in J. A YouTube video explaining this code can be found after searching "Game of Life in APL". Currently I have a matrix R which is:
0 0 0 0 0 0 0
0 0 0 1 1 0 0
0 0 1 1 0 0 0
0 0 0 1 0 0 0
0 0 0 0 0 0 0
I wrote J code which produces the adjacency list (number of living cells in adjacent squares) which is the following:
+/ ((#:i.4),-(#:1+i.2),(1 _1),.(_1 1)) |. R
And produces:
0 0 1 2 2 1 0
0 1 3 4 3 1 0
0 1 4 5 3 0 0
0 1 3 2 1 0 0
0 0 1 1 0 0 0
My main issue with this code is that it isn't elegant, as ((#:i.4),-(#:1+i.2),(1 _1),.(_1 1)) is needed just to produce:
0 0
0 1
1 0
1 1
0 _1
_1 0
_1 1
1 _1
Which is really just the outer product or Cartesian product between vectors 1 0 _1 and itself. I could not find an easy way to produce this Cartesian product, so my end question is how would I produce the required vector more elegantly?
A Complete Catalog
Michael Berry's answer is very clear and concise. A sterling example of the J table idiom (f"0/~). I love it because it demonstrates how the subtle design of J has permitted us to generalize and extend a concept familiar to anyone from 3rd grade: arithmetic tables (addition tables, +/~ i. 10, and multiplication tables, */~ i.12¹), which even in APL were relatively clunky.
In addition to that fine answer, it's also worth noting that there is a primitive built into J to calculate the Cartesian product, the monad {.
For example:
> { 2 # <1 0 _1 NB. Or i:_1 instead of 1 0 _1
1 1
1 0
1 _1
0 1
0 0
0 _1
_1 1
_1 0
_1 _1
Taking Inventory
Note that the input to monad { is a list of boxes, and the number of boxes in that list determines the number of elements in each combination. A list of two boxes produces an array of 2-tuples, a list of 3 boxes produces an array of 3-tuples, and so on.
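For readers who know Python better than J, monad { here behaves like itertools.product applied to the contents of the boxes (an analogy only, not a claim about J's semantics):

from itertools import product

# two iterables in, 2-tuples out; three in, 3-tuples out, and so on
print(list(product([1, 0, -1], repeat=2)))
# [(1, 1), (1, 0), (1, -1), (0, 1), (0, 0), (0, -1), (-1, 1), (-1, 0), (-1, -1)]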
A Tad Excessive
Given that full outer products (Cartesian products) are so expensive (O(n^m)), it occurs to one to ask: why does J have a primitive for this?
A similar misgiving arises when we inspect monad {'s output: why is it boxed? Boxes are used in J when, and only when, we want to consolidate arrays of incompatible types and shapes. But all the results of { y will have identical types and shapes, by the very definition of {.
So, what gives? Well, it turns out these two issues are related, and justified, once we understand why the monad { was introduced in the first place.
I'm Feeling Ambivalent About This
We must recall that all verbs in J are ambivalent perforce. J's grammar does not admit a verb which is only a monad, or only a dyad. Certainly, one valence or another might have an empty domain (i.e. no valid inputs, like monad E. or dyad ~. or either valence of [:), but it still exists.
A valence with an empty domain is "real", but its range of valid inputs is empty (an extension of the idea that the range of valid inputs to e.g. + is numbers, and anything else, like characters, produces a "domain error").
Ok, fine, so all verbs have two valences, so what?
A Selected History
Well, one of the primary design goals Ken Iverson had for J, after long experience with APL, was ditching the bracket notation for array indexing (e.g. A[3 4;5 6;2]), and recognizing that selection from an array is a function.
This was an enormous insight, with a serious impact on both the design and use of the language, which unfortunately I don't have space to get into here.
And since all functions need a name, we had to give one to the selection function. All primitive verbs in J are spelled with either a glyph, an inflected glyph (in my head, the ., :, .: etc suffixes are diacritics), or an inflected alphanumeric.
Now, because selection is so common and fundamental to array-oriented programming, it was given some prime real estate (a mark of distinction in J's orthography), a single-character glyph: {².
So, since { was defined to be selection, and selection is of course dyadic (i.e having two arguments: the indices and the array), that accounts for the dyad {. And now we see why it's important to note that all verbs are ambivalent.
I Think I'm Picking Up On A Theme
When designing the language, it would be nice to give the monad { some thematic relationship to "selection"; having the two valences of a verb be thematically linked is a common pattern in J, for elegance and mnemonic purposes.
That broad pattern is also a topic worthy of a separate discussion, but for now let's focus on why catalog / Cartesian product was chosen for monad {. What's the connection? And what accounts for the other quirk, that its results are always boxed?
Bracketectomy
Well, remember that { was introduced to replace -- replace completely -- the old bracketing subscript syntax of APL (and of many other programming languages and notations). This at once made selection easier and more useful, and also simplified J's syntax: in APL, the grammar, and consequently the parser, had to have special rules for indexing like:
A[3 4;5 6;2]
The syntax was an anomaly. But boy, wasn't it useful and expressive from the programmer's perspective, huh?
But why is that? What accounts for the multi-dimensional bracketing notation's economy? How is it that we can say so much in such little space?
Well, let's look at what we're saying. In the expression above A[3 4;5 6;2], we're asking for the 3rd and 4th rows, the 5th and 6th columns, and the 2nd plane.
That is, we want
plane 2, row 3, column 5, and
plane 2, row 3, column 6, and
plane 2, row 4, column 5 and
plane 2, row 4, column 6
Think about that a second. I'll wait.
The Moment Ken Blew Your Mind
Boom, right?
Indexing is a Cartesian product.
Always has been. But Ken saw it.
So, now, instead of saying A[3 4;5 6;2] in APL (with some hand-waving about whether []IO is 1 or 0), in J we say:
(3 4;5 6;2) { A
which is, of course, just shorthand, or syntactic sugar, for:
idx =. { 3 4;5 6;2 NB. Monad {
idx { A NB. Dyad {
So we retained the familiar, convenient, and suggestive semicolon syntax (what do you want to bet link being spelled ; is also not a coincidence?) while getting all the benefits of turning { into a first-class function, as it always should have been³.
Opening The Mystery Box
Which brings us back to that other, final, quibble. Why the heck are monad {'s results boxed, if they're all regular in type and shape? Isn't that superfluous and inconvenient?
Well, yes, but remember that an unboxed, i.e. numeric, LHA in x { y only selects items from y.
This is convenient because it's a frequent need to select the same item multiple times (e.g. in replacing 'abc' with 'ABC' and defaulting any non-abc character to '?', we'd typically say ('abc' i. y) { 'ABC','?', but that only works because we're allowed to select index 3, which is '?', multiple times).
But that convenience precludes using straight numeric arrays to also do multidimensional indexing. That is, the convenience of unboxed numbers to select items (most common use case) interferes with also using unboxed numeric arrays to express, e.g. A[17;3;8] by 17 3 8 { A. We can't have it both ways.
So we needed some other notation to express multi-dimensional selections, and since dyad { has left-rank 0 (precisely because of the foregoing), and a single, atomic box can encapsulate an arbitrary structure, boxes were the perfect candidate.
So, to express A[17;3;8], instead of 17 3 8 { A, we simply say (< 17;3;8) { A, which again is straightforward, convenient, and familiar, and allows us to do any number of multi-dimensional selections simultaneously, e.g. ((< 17;3;8), (< 42;7;2)) { A, which is what you'd want and expect in an array-oriented language.
Which means, of course, that in order to produce the kinds of outputs that dyad { expects as inputs, monad { must produce boxes⁴. QED.
Oh, and PS: since, as I said, boxing permits arbitrary structure in a single atom, what happens if we don't box a plain list, or even a list of boxes, but box a boxed box? Well, have you ever wanted a way to say "I want every index except the last", or the 3rd, or the 42nd and 55th? Well...
Footnotes:
¹ Note that in the arithmetic tables +/~ i.10 and */~ i.12, we can elide the explicit "0 (present in ,"0/~ _1 0 1) because arithmetic verbs are already scalar (obviously)
² But why was selection given that specific glyph, {?
Well, Ken intentionally never disclosed the specific mnemonic choices used in J's orthography, because he didn't want to dictate such a personal choice for his users, but to me, Dan, { looks like a little funnel pointing right-to-left. That is, a big stream of data on the right, and a much smaller stream coming out the left, like a faucet dripping.
Similarly, I've always seen dyad |: like a little coffee table or Stonehenge trilithon kicked over on its side, i.e. transposed.
And monad # is clearly mnemonic (count, tally, number of items), but the dyad was always suggestive to me because it looked like a little net, keeping the items of interest and letting everything else "fall through".
But, of course, YMMV. Which is precisely why Ken never spelled this out for us.
³ Did you also notice that while in APL the indices, which are control data, are listed to the right of the array, whereas in J they're now on the left, where control data belong?
⁴ Though this Jer would still like to see monad { produce unboxed results, at the cost of some additional complexity within the J engine, i.e. at the expense of the single implementer, and to the benefit of every single user of the language
There is a lot of interesting literature which goes into this material in more detail, but unfortunately I do not have time now to dig it up. If there's enough interest, I may come back and edit the answer with references later. For now, it's worth reading Mastering J, an early paper on J by one of the luminaries of J, Donald McIntyre, which makes mention of the eschewing of the "anomalous bracket notation" of APL, and perhaps the tl;dr version of this answer I personally posted to the J forums in 2014.
,"0/ ~ 1 0 _1
will get you the Cartesian product you ask for (but you may want to reshape it to 9 by 2).
The Cartesian product is the monadic verb catalog: {
{ ;~(1 0 _1)
┌────┬────┬─────┐
│1 1 │1 0 │1 _1 │
├────┼────┼─────┤
│0 1 │0 0 │0 _1 │
├────┼────┼─────┤
│_1 1│_1 0│_1 _1│
└────┴────┴─────┘
Ravel (,) and unbox (>) for a 9,2 list:
>,{ ;~(1 0 _1)
1 1
1 0
1 _1
0 1
0 0
0 _1
_1 1
_1 0
_1 _1

How to implement Random(a,b) with only Random(0,1)? [duplicate]

Possible Duplicate:
how to get uniformed random between a, b by a known uniformed random function RANDOM(0,1)
In the book Introduction to Algorithms, there is an exercise:
Describe an implementation of the procedure Random(a, b) that only makes calls to Random(0,1). What is the expected running time of your procedure, as a function of a and b? The result of Random(a,b) should be uniformly distributed, just as Random(0,1) is.
The Random function returns integers between a and b, inclusive. For example, Random(0,1) generates either 0 or 1; Random(a, b) generates a, a+1, a+2, ..., b.
My solution is like this:
for i = 1 to b-a
    r = a + Random(0,1)
return r
The running time is T = b-a.
Is this correct? Are the results of my solution uniformly distributed?
Thanks
What if my new solution is like this:
r = a
for i = 1 to b-a   // including b-a
    r += Random(0,1)
return r
If it is not correct, why does r += Random(0,1) make r not uniformly distributed?
Others have explained why your solution doesn't work. Here's the correct solution:
1) Find the smallest number, p, such that 2^p > b-a.
2) Perform the following algorithm:
r = 0
for i = 1 to p
    r = 2*r + Random(0,1)
3) If r is greater than b-a, go to step 2.
4) Your result is r+a
So let's try Random(1,3).
So b-a is 2.
2^1 = 2, so p will have to be 2 so that 2^p is greater than 2.
So we'll loop two times. Let's try all possible outputs:
00 -> r=0, 0 is not > 2, so we output 0+1 or 1.
01 -> r=1, 1 is not > 2, so we output 1+1 or 2.
10 -> r=2, 2 is not > 2, so we output 2+1 or 3.
11 -> r=3, 3 is > 2, so we repeat.
So 1/4 of the time, we output 1. 1/4 of the time we output 2. 1/4 of the time we output 3. And 1/4 of the time we have to repeat the algorithm a second time. Looks good.
Note that if you have to do this a lot, two optimizations are handy:
1) If you use the same range a lot, have a class that computes p once so you don't have to compute it each time.
2) Many CPUs have fast ways to perform step 1 that aren't exposed in high-level languages. For example, x86 CPUs have the BSR instruction.
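Putting the whole scheme together, a Python sketch (random01() is a stand-in for the given Random(0,1) primitive):

import random
from collections import Counter

def random01():
    return random.randint(0, 1)    # stand-in for the given Random(0,1)

def random_ab(a, b):
    span = b - a                   # we need r uniform in 0 .. span
    p = 1
    while (1 << p) <= span:        # step 1: smallest p with 2^p > b-a
        p += 1
    while True:
        r = 0
        for _ in range(p):         # step 2: build p random bits
            r = 2 * r + random01()
        if r <= span:              # step 3: reject and retry if too large
            return r + a           # step 4

print(Counter(random_ab(1, 3) for _ in range(30000)))  # roughly equal counts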
No, it's not correct: that method will concentrate around (a+b)/2. It's a binomial distribution.
Are you sure that Random(0,1) produces integers? It would make more sense if it produced floating-point values between 0 and 1. Then the solution would be an affine transformation, with running time independent of a and b.
An idea I just had, in case it's about integer values: use bisection. At each step, you have a range low..high. If Random(0,1) returns 0, the next range is low..(low+high)/2; otherwise, (low+high)/2..high.
Details and complexity left to you, since it's homework.
That should create (approximately) a uniform distribution.
Edit: approximately is the important word there. Uniform if b-a+1 is a power of 2, not too far off if it's close, but not good enough generally. Ah, well it was a spontaneous idea, can't get them all right.
No, your solution isn't correct. The sum will have a binomial distribution.
However, you can generate a pure random sequence of 0s and 1s and treat it as a binary number:
repeat
    result = a
    steps = ceiling(log2(b - a))
    for i = 0 to steps
        result += (2 ^ i) * Random(0, 1)
until result <= b
I read the other answers. For fun, here is another way to find the random number:
Allocate an array with b-a+1 elements.
Set all the values to 1.
Iterate through the array. For each nonzero element, flip the coin, as it were. If it comes up 0, set the element to 0.
When, after a complete iteration, only one nonzero element remains, you have your random number: a+i, where i is the index of the nonzero element (assuming we start indexing at 0). All numbers are then equally likely. (You would have to deal with the case where a round eliminates every remaining element, but I leave that as an exercise for you.)
This has a worst-case running time of O(infinity)... :)
On average, though, half the remaining numbers are eliminated per iteration, so the average-case running time is about log_2(b-a).
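A Python sketch of this elimination game, handling the everyone-eliminated tie by simply redoing the round:

import random

def random_by_elimination(a, b):
    alive = list(range(a, b + 1))
    while len(alive) > 1:
        survivors = [x for x in alive if random.randint(0, 1)]
        if survivors:          # a round that eliminates everyone is a tie: redo it
            alive = survivors
    return alive[0]

print(random_by_elimination(1, 5))  # each of 1..5 equally likely by symmetry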
First of all, I assume you are actually accumulating the result, not just adding 0 or 1 to a on each step.
Using some probabilities, you can prove that your solution is not uniformly distributed: the chance that the resulting value r lands near (a+b)/2 is greatest. For instance, if a is 0 and b is 7, the chance that you get the value 4 is C(7,4) divided by 2^7. The reason is that no matter which 4 out of the 7 flips are 1, the result will still be 4.
The running time you estimate is correct.
Your solution's pseudocode should look like:
r = a
for i = 1 to b-a
    r += Random(0,1)
return r
As for uniform distribution: assuming the random number generator underneath is perfectly uniform, the odds of getting 0 or 1 on each call are 50%. The number you end up with is the result of that choice made over and over again.
So for a=1, b=5, there are 4 choices made.
Getting 1 requires all 4 decisions to be 0; the odds of that are 0.5^4 = 6.25%.
Getting 5 requires all 4 decisions to be 1; the odds of that are likewise 0.5^4 = 6.25%.
As you can see from this, the distribution is not uniform -- the odds of each of the five numbers should be 20%.
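A quick simulation makes the non-uniformity visible (a sketch; counts pile up in the middle):

import random
from collections import Counter

def accumulate(a, b):
    r = a
    for _ in range(b - a):
        r += random.randint(0, 1)
    return r

print(Counter(accumulate(1, 5) for _ in range(100000)))
# 3 shows up ~37.5% of the time, while 1 and 5 each show up only ~6.25%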
The algorithm you created is indeed not equally distributed.
In your first version, the result r will always be either a or a+1, because it is overwritten on every pass; it will never go beyond that.
The accumulating version should look something like this:
r = 0
for i = 1 to b-a
    r = r + Random(0,1)
return a + r
By carrying r across iterations, you include the "randomness" of all the previous runs of the for loop.
