Is there a pushable/poppable hash function for stack-like objects? - algorithm

I know of rolling hash functions that are similar to a hash on a bounded queue. Is there anything similar for stacks?
My use case is that I am doing a depth first search of possible program traces (with loop unrolling, so these stacks can get biiiiig) and I need to identify branching via these traces. Rather than store a bunch of stacks of depth 1000 I want to hash them so that I can index by int. However, if I have stacks of depth 10000+ this hash is going to be expensive, so I want to keep track of my last hash so that when I push/pop from my stack I can hash/unhash the new/old item respectively.
In particular, I am looking for a hash h(Object, Hash) with an unhash u(Object, Hash) with the property that for object x to be hashed we have:
u(x, h(x, baseHash)) = baseHash
Additionally, this hash shouldn't be commutative, since order matters.
One thought I had was matrix multiplication over GL(2, F(2^k)), maybe using a Cayley graph? For example, take two invertible matrices A_0, A_1, with inverses B_0 and B_1, in GL(2, F(2^k)), and compute the hash of an object x by first computing some integer hash with bits b31b30...b1b0, and then compute
H(x) = A_b31 . A_b30 . ... . A_b1 . A_b0
This has an inverse
U(x) = B_b0 . B_b1 . ... . B_b30 . B_b31.
Thus h(x, baseHash) = H(x) . baseHash and u(x, baseHash) = U(x) . baseHash, so that
u(x, h(x, base)) = U(x) . H(x) . base = base,
as desired.
This seems like it might be more expensive than is necessary, but for 2x2 matrices it shouldn't be too bad?

Most incremental hash functions can be made from two kinds of operations:
1) An invertible diffusion function that mixes up the previous hash. Invertible functions are chosen for this so that they don't lose information. Otherwise the hash would tend towards a few values; and
2) An invertible mixing function to mix new data into the hash. Invertible functions are used for this so that every part of the input has equivalent influence over the final hash value.
Since both these things are invertible, it's very easy to undo the last part of an incremental hash and "pop" off the previous value.
For instance, the most common kind of simple hash functions in use are polynomial hash functions. To update a previous hash value with a new input 'x', you calculate:
h' = (h*A + x) mod M
The multiplication is the diffusion function. In order for this to be invertible, A must have a multiplicative inverse mod M -- commonly either M is chosen to be prime, or M is a power of 2 and A is odd.
Because the multiplicative inverse exists, it's easy to pop off the last value from the hash, as long as you still have access to it:
h = (h' - x)*(1/A) mod M
You can use the extended Euclidean algorithm to find the inverse of A: https://en.wikipedia.org/wiki/Extended_Euclidean_algorithm
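As a rough illustration of the push/pop idea above, here is a minimal Python sketch. The modulus M, the multiplier A, and the function names push and pop are assumptions chosen for the example, not part of the original answer; items are assumed to be already reduced to integers.

```python
# Minimal sketch of a pushable/poppable polynomial hash.
# M and A are illustrative choices, not prescribed by the text above.
M = (1 << 61) - 1        # a Mersenne prime, so every nonzero A has an inverse
A = 1_000_003            # arbitrary multiplier (assumption)
A_INV = pow(A, -1, M)    # modular inverse of A; needs Python 3.8+


def push(h, x):
    """Mix the integer item x into the running hash h."""
    return (h * A + x) % M


def pop(h, x):
    """Undo push(previous, x); x must be the item that was pushed last."""
    return ((h - x) * A_INV) % M


base = 12345
h1 = push(base, 7)
h2 = push(h1, 99)
print(pop(h2, 99) == h1, pop(h1, 7) == base)
```

Because the multiplication is applied before each new item is added, pushing 7 then 99 gives a different value from pushing 99 then 7, which is the order sensitivity the question asks for.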
Most other common non-cryptographic hashes, like CRCs, FNV, murmurHash, etc. are similarly easy to pop values off.
Some of these hashes have a final diffusion step after the incremental work, but that step is pretty much always invertible as well, to ensure that the hash can take on any value, so you can undo it to get back to the incremental part.
Diffusion operations are often made from sequences of primitive invertible operations. To undo them you would undo each operation in reverse order. Some of the common types you'll see are:
cyclic shifts
invertible multiplication (as above)
x = x XOR (x >> shift)
Feistel rounds (see https://simple.wikipedia.org/wiki/Feistel_cipher)
Mixing operations are usually + or XOR.
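To make "undo each operation in reverse order" concrete, here is a small Python sketch that inverts two of the primitives listed above on 32-bit words. The 32-bit width and the function names are assumptions for the example.

```python
MASK32 = 0xFFFFFFFF  # work on 32-bit words (assumption for this sketch)


def xorshift(x, shift):
    """Forward step: x = x XOR (x >> shift)."""
    return (x ^ (x >> shift)) & MASK32


def undo_xorshift(y, shift):
    """Inverse step: XOR together y, y >> shift, y >> 2*shift, ...

    The terms telescope, leaving the original x once the shift
    amount reaches the word size.
    """
    x = y
    s = shift
    while s < 32:
        x ^= y >> s
        s += shift
    return x & MASK32


def rotl(x, r):
    """Cyclic left shift by r bits (0 < r < 32)."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def undo_rotl(y, r):
    """A left rotation by r is undone by a left rotation by 32 - r."""
    return rotl(y, 32 - r)
```

A diffusion step built as rotl followed by xorshift would be undone by undo_xorshift followed by undo_rotl.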

Related

Generating a perfect hash function given known list of strings?

Suppose I have a list of N strings, known at compile-time.
I want to generate (at compile-time) a function that will map each string to a distinct integer between 1 and N inclusive. The function should take very little time or space to execute.
For example, suppose my strings are:
{"apple", "orange", "banana"}
Such a function may return:
f("apple") -> 2
f("orange") -> 1
f("banana") -> 3
What's a strategy to generate this function?
I was thinking to analyze the strings at compile time and look for a couple of constants I could mod or add by or something?
The compile-time generation time/space can be quite expensive (but obviously not ridiculously so).
Say you have m distinct strings, and let a_{i,j} be the jth character of the ith string. In the following, I'll assume that they all have the same length. This can be easily translated into any reasonable programming language by treating a_{i,j} as the null character if j ≥ |a_i|.
The idea I suggest is composed of two parts:
1. Find (at most) m - 1 positions differentiating the strings, and store these positions.
2. Create a perfect hash function by considering the strings as length-m vectors, and storing the parameters of the perfect hash function.
Obviously, in general, the hash function must check at least m - 1 positions. It's easy to see this by induction. For 2 strings, at least 1 character must be checked. Assume it's true for i strings: i - 1 positions must be checked. Create a new set of strings by appending 0 to the end of each of the i strings, and add a new string that is identical to one of the strings, except it has a 1 at the end.
Conversely, it's obvious that it's possible to find at most m - 1 positions sufficient for differentiating the strings (for some sets the number of course might be lower, as low as log to the base of the alphabet size of m). Again, it's easy to see so by induction. Two distinct strings must differ at some position. Placing the strings in a matrix with m rows, there must be some column where not all characters are the same. Partitioning the matrix into two or more parts, and applying the argument recursively to each part with more than 2 rows, shows this.
Say the m - 1 positions are p_1, ..., p_{m-1}. In the following, recall the meaning above for a_{i,p_j} when p_j ≥ |a_i|: it is the null character.
Let us define h(a_i) = ∑_{j=1}^{m-1} [q_j · a_{i,p_j} % n], for random q_j and some n. Then h is known to be a universal hash function: the probability of pair-collision P(x ≠ y ∧ h(x) = h(y)) ≤ 1/n.
Given a universal hash function, there are known constructions for creating a perfect hash function from it. Perhaps the simplest is creating a vector of size m^2 and successively trying the above h with n = m^2 with randomized coefficients, until there are no collisions. The expected number of attempts needed is at most 2, and the probability that more attempts are needed decreases exponentially.
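The retry construction can be sketched in Python as follows. This is only an illustration under stated assumptions: for brevity it uses every character position up to the longest string rather than first selecting m - 1 distinguishing positions, it reduces the whole weighted sum modulo n, and the names build_perfect_hash and char_at, the seed, and the attempt limit are all made up for the example.

```python
import random


def build_perfect_hash(strings, seed=0, max_attempts=1000):
    """Try random coefficient vectors until no two strings collide."""
    m = len(strings)
    n = m * m                                  # table size m^2, as in the text
    width = max(len(s) for s in strings)       # simplification: use all positions
    rng = random.Random(seed)

    def char_at(s, j):
        # Positions past the end of a string count as the null character.
        return ord(s[j]) if j < len(s) else 0

    for _ in range(max_attempts):
        q = [rng.randrange(n) for _ in range(width)]

        def h(s, q=q):
            return sum(qj * char_at(s, j) for j, qj in enumerate(q)) % n

        if len({h(s) for s in strings}) == m:  # collision-free on the input set
            return h
    raise RuntimeError("no collision-free coefficients found")


words = ["apple", "orange", "banana"]
h = build_perfect_hash(words)
print({w: h(w) for w in words})
```

The returned values are distinct slots in a table of size m^2; mapping them down to 1..N would need a second, small lookup table.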
It is simple. Make a dictionary and assign 1 to the first word, 2 to the second, ... No need to make things complicated, just number your words.
To make the lookup effective, use trie or binary search or whatever tool your language provides.

Hashing function to distribute over n values (with a twist)

I was wondering if there are any hashing functions to distribute input over n values. The distribution should of course be fairly uniform. But there is a twist: with small changes of n, few elements should get a new hash. Optimally it should split all k uniformly over n values and if n increases to n+1 only k/n-k/(n+1) values would have to move to uniformly distribute in the new hash. Obviously having a hash which simply creates uniform values and then mod it would work, but that would move a lot of hashes to fill the new node. The goal here is that as few values as possible fall into a new bucket.
Suppose 2^{n-1} < N <= 2^n. Then there is a standard trick for turning a hash function H that produces (at least) n bits into one that produces a number from 0 to N-1.
1. Compute H(v).
2. Keep just the first n bits.
3. If that's smaller than N, stop and output it. Otherwise, start from the top with H(v) instead of v.
Some properties of this technique:
You might worry that you have to repeat the loop many times in some cases. But actually the expected number of loops is at most 2.
If you bump up N and n doesn't have to change, very few things get a new hash: only those ones that had exactly N somewhere in their chain of hashes. (Of course, identifying which elements have this property is kind of hard -- in general it may require rehashing every element!)
If you bump up N and n does have to change, about half of the elements have to be rebucketed. But this happens more and more rarely the bigger N is -- it is an amortized O(1) cost on each bump.
Edit to add an additional comment about the "have to rehash everything" requirement: One might consider modifying step 3 above to "start from the top with the first n bits of H(v)" instead. This reduces the problem with identifying which elements need to be rehashed -- since they'll be in the bucket for the hash of N -- though I'm not confident the resulting hash will have quite as good collision avoidance properties. It certainly makes the process a bit more fragile -- one would want to prove something special about the choice of H (that the bottom few bits aren't "critical" to its collision avoidance properties somehow).
Here is a simple example implementation in Python, together with a short main that shows that most strings do not move when bumping normally, and about half of strings get moved when bumping across a 2^n boundary. Forgive me for any idiosyncrasies of my code -- Python is a foreign language.
import math

def ilog2(m):
    # smallest n such that 2^n >= m
    return int(math.ceil(math.log(m, 2)))

def hash_into(obj, N):
    cur_hash = hash(obj)
    mask = pow(2, ilog2(N)) - 1
    while (cur_hash & mask) >= N:
        # seems Python uses the identity for its hash on integers, which
        # doesn't iterate well; let's use literally any other hash at all
        cur_hash = hash(str(cur_hash))
    return cur_hash & mask

def same_hash(obj, N, N2):
    return hash_into(obj, N) == hash_into(obj, N2)

def bump_stat(objs, N):
    # number of objects whose bucket is unchanged when N grows to N+1
    return len([obj for obj in objs if same_hash(obj, N, N+1)])

alphabet = [chr(x) for x in range(ord('a'), ord('z')+1)]
ascending = alphabet + [c1 + c2 for c1 in alphabet for c2 in alphabet]

def main():
    print(len(ascending))
    print(bump_stat(ascending, 10))
    print(float(bump_stat(ascending, 16)) / len(ascending))

main()
# the original run printed:
# 702
# 639
# 0.555555555556
# (Python 3 randomizes string hashes per process by default, so the
# last two numbers vary from run to run)
Well, when you add a node, you will want it to fill up, so you will actually want k/(n+1) elements to move from their old nodes to the new one.
That is easily accomplished:
Just generate a hash value for each key as you normally would. Then, to assign key k to a node in [0,N):
Let H(k) be the hash of k.
int hash = H(k);
int n;  // declared outside the loop so it is still in scope afterwards
for (n = N-1; n > 0; --n) {
    if ((mix(hash, n) % (n+1)) == 0) {
        break;
    }
}
//put it in node n
So, when you add node 1, it steals half the items from node 0.
When you add node 2, it steals 1/3 of the items from the previous 2 nodes.
And so on...
EDIT: added the mix() function, to mix up the hash differently for every n -- otherwise you get non-uniformities when n is not prime.
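Here is a small Python sketch of the node-selection loop, mainly to show the "only moves to the new node" property. The answer does not say what mix() is; the SHA-256 based version below is purely an assumption, as is the name assign_node. The loop tests against n+1, which is what gives node n a 1/(n+1) share.

```python
import hashlib


def mix(key_hash, n):
    """Hypothetical mixer: re-hash the key's hash together with n."""
    data = f"{key_hash}:{n}".encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def assign_node(key_hash, num_nodes):
    """Pick a node in [0, num_nodes), trying the newest node first."""
    for n in range(num_nodes - 1, 0, -1):
        if mix(key_hash, n) % (n + 1) == 0:
            return n
    return 0


# Growing from 10 to 11 nodes: a key either stays put or moves to node 10.
moved = sum(assign_node(k, 10) != assign_node(k, 11) for k in range(2000))
print(moved, "of 2000 keys moved")
```

With a well-behaved mix(), roughly 1/11 of the keys should move in this run, all of them to the new node.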

Hashing sets of integers

I'm looking for a hash function over sets H(.) and a relation R(.,.) such that if A is included in B then R(H(A), H(B)). Of course, R(.,.) must be easy to verify (constant time), and H(A) should be computed in linear time.
One example of H and R is:
H(A) = OR over 1 << (h(x) % k), for x in A, k a fixed integer and h(x) a hash function over integers.
R(H(A), H(B)) = ((H(A) & H(B)) == H(A))
Are there any other good examples? (good is hard to define but intuitively if R(H(A), H(B)) then whp A is included in B).
After thinking about this, I ended up with the example you gave. I.e. each element in B sets a bit in the hash, and A is only contained in B if each bit which is set in H(A) is also set in H(B).
Maybe a Bloom filter is applicable in your case. It seems to use the same bit trick, but with multiple hash functions.
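The bit-signature scheme from the question can be sketched in Python like this. The signature width K = 64 and the multiplicative element hash h are illustrative assumptions; a Bloom filter would set several bits per element using several such hashes.

```python
K = 64  # signature width in bits (assumption for this sketch)


def h(x):
    """Illustrative integer hash: Knuth-style multiplicative hashing."""
    return (x * 2654435761) % (1 << 32)


def signature(items):
    """H(A): OR together one bit per element."""
    sig = 0
    for x in items:
        sig |= 1 << (h(x) % K)
    return sig


def maybe_subset(sig_a, sig_b):
    """R(H(A), H(B)): every bit set in H(A) is also set in H(B)."""
    return (sig_a & sig_b) == sig_a


A = {1, 5, 9}
B = {1, 5, 9, 42, 100}
print(maybe_subset(signature(A), signature(B)))
```

If A is a subset of B the check always passes; a passing check for a non-subset is a false positive, which becomes more likely as B fills up the K bits.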

Lists Hash function

I'm trying to make a hash function so I can tell if two lists with the same size contain the same elements.
For example, this is what I want:
f((1 2 3))=f((1 3 2))=f((2 1 3))=f((2 3 1))=f((3 1 2))=f((3 2 1)).
Any idea how I can approach this problem? I've tried summing the squares of all elements, but it turned out that there are collisions, for example f((2 2 5))=33=f((1 4 4)), which is wrong as the lists are not the same.
I'm looking for a simple approach if there is any.
Sort the list and then:
hash = 0
list.each do |current_element|
  hash = (37 * hash + current_element) % MAX_HASH_VALUE
end
You're probably out of luck if you really want no collisions. There are N choose k sets of size k with elements in 1..N (and worse, if you allow repeats). So imagine you have N=256, k=8, then N choose k is ~4 x 10^14. You'd need a very large integer to distinctly hash all of these sets.
Possibly you have N, k such that you could still make this work. Good luck.
If you allow occasional collisions, you have lots of options. From simple things like your suggestion (add squares of elements) and computing xor the elements, to complicated things like sort them, print them to a string, and compute MD5 on them. But since collisions are still possible, you have to verify any hash match by comparing the original lists (if you keep them sorted, this is easy).
So you are looking for something that provides these properties:
1. If h(x1) == y1, then there is an inverse function h_inverse(y1) == x1
2. Because the inverse function exists, there cannot be a value x2 such that x1 != x2, and h(x2) == y1.
Knuth's Multiplicative Method
In Knuth's "The Art of Computer Programming", section 6.4, a multiplicative hashing scheme is introduced as a way to write a hash function. The key is multiplied by the golden ratio of 2^32 (2654435761) to produce a hash result.
hash(i)=i*2654435761 mod 2^32
Since 2654435761 and 2^32 have no common factors, the multiplication produces a complete mapping of the key to hash result with no overlap. This method works pretty well if the keys have small values. Bad hash results are produced if the keys vary in the upper bits. As is true in all multiplications, variations of upper digits do not influence the lower digits of the multiplication result.
Robert Jenkins' 96 bit Mix Function
Robert Jenkins has developed a hash function based on a sequence of subtraction, exclusive-or, and bit shift.
All the sources in this article are written as Java methods, where the operator '>>>' represents the concept of unsigned right shift. If the source were to be translated to C, then the Java 'int' data type should be replaced with C 'uint32_t' data type, and the Java 'long' data type should be replaced with C 'uint64_t' data type.
The following source is the mixing part of the hash function.
int mix(int a, int b, int c)
{
    a=a-b; a=a-c; a=a^(c >>> 13);
    b=b-c; b=b-a; b=b^(a << 8);
    c=c-a; c=c-b; c=c^(b >>> 13);
    a=a-b; a=a-c; a=a^(c >>> 12);
    b=b-c; b=b-a; b=b^(a << 16);
    c=c-a; c=c-b; c=c^(b >>> 5);
    a=a-b; a=a-c; a=a^(c >>> 3);
    b=b-c; b=b-a; b=b^(a << 10);
    c=c-a; c=c-b; c=c^(b >>> 15);
    return c;
}
You can read details from here
If all the elements are numbers and they have a maximum, this is not too complicated, you sort those elements and then you put them together one after the other in the base of your maximum+1.
Hard to describe in words...
For example, if your maximum is 9 (that makes it easy to understand), you'd have :
f(2 3 9 8) = f(3 8 9 2) = 2389
If your maximum were 99, you'd have:
f(16 2 76 8) = (0)2081676
In your example with 2,2 and 5, if you know you would never get anything higher than 5, you could "compose" the result in base 6, so that would be :
f(2 2 5) = 2*6^2 + 2*6 + 5 = 89
f(1 4 4) = 1*6^2 + 4*6 + 4 = 64
Combining hash values is hard, I've found this way (no explanation, though perhaps someone would recognize it) within Boost:
template <class T>
void hash_combine(size_t& seed, T const& v)
{
    seed ^= hash_value(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}
It should be fast since there is only shifting, additions and xor taking place (apart from the actual hashing).
However, the requirement that the order of the list does not influence the end result would mean that you first have to sort it, which is an O(N log N) operation, so it may not fit.
Also, since it's impossible without more stringent boundaries to provide a collision-free hash function, you'll still have to actually compare the sorted lists whenever the hashes are equal...
I'm trying to make a hash function so I can tell if two lists with same sizes contain the same elements.
[...] but it turned out that there are collisions
These two sentences suggest you are using the wrong tool for the job. The point of a hash (unless it is a 'perfect hash', which doesn't seem appropriate to this problem) is not to guarantee equality, or to provide a unique output for every given input. In the general usual case, it cannot, because there are more potential inputs than potential outputs.
Whatever hash function you choose, your hashing system is always going to have to deal with the possibility of collisions. And while different hashes imply inequality, it does not follow that equal hashes imply equality.
As regards your actual problem: a start might be to sort the list in ascending order, then use the sorted values as if they were the prime powers in the prime decomposition of an integer. Reconstruct this integer (modulo the maximum hash value) and there is a hash value.
For example:
2 1 3
sorted becomes
1 2 3
Treating this as prime powers gives
2^1.3^2.5^3
which evaluates to
2.9.125 = 2250
giving 2250 as your hash value, which will be the same hash value as for any other ordering of 1 2 3, and also different from the hash value for any other sequence of three numbers that do not overflow the maximum hash value when computed.
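A minimal Python sketch of this prime-power idea follows, assuming the elements are non-negative integers. The helper names and the 2^64 modulus are illustrative choices; once the product wraps around the modulus, collisions become possible again.

```python
def first_primes(count):
    """Return the first `count` primes by simple trial division."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def multiset_hash(values, modulus=1 << 64):
    """Use the sorted values as exponents of 2, 3, 5, ... and multiply."""
    result = 1
    for p, e in zip(first_primes(len(values)), sorted(values)):
        result = (result * pow(p, e, modulus)) % modulus
    return result


print(multiset_hash([2, 1, 3]))  # 2^1 * 3^2 * 5^3 = 2250
```

Sorting first is what makes every ordering of the same elements produce the same exponents, and hence the same product.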
A naïve approach to solving your essential problem (comparing lists in an order-insensitive manner) is to convert all lists being compared to a set (set in Python or HashSet in Java). This is more effective than making a hash function since a perfect hash seems essential to your problem. For almost any other approach collisions are inevitable depending on input.

Map strings to numbers maintaining the lexicographic ordering

I'm looking for an algorithm or function that is able to map a string to a number in such way that the resulting values correspond the lexicographic ordering of strings. Example:
"book" -> 50000
"car" -> 60000
"card" -> 65000
"a longer string" -> 15000
"another long string" -> 15500
"awesome" -> 16000
As a function it should be something like: f(x) = y, so that for any x1 < x2 => f(x1) < f(x2), where x is an arbitrary string and y is a number.
If the input set of x is finite, then I could always do a sort and assign the proper values, but I'm looking for something generic for an unlimited input set for x.
If you require that f map to integers this is impossible.
Suppose that there is such a map f. Consider the strings a, aa, aaa, etc. Consider the values f(a), f(aa), f(aaa), etc. As we require that f(a) < f(aa) < f(aaa) < ... we see that f(a_n) tends to infinity as n tends to infinity; here I am using the obvious notation that a_n is the character a repeated n times. Now consider the string b. We require that f(a_n) < f(b) for all n. But f(b) is some finite integer and we just showed that f(a_n) goes to infinity. We have a contradiction. No such map is possible.
Maybe you could tell us what you need this for? This is fairly abstract and we might be able to suggest something more suitable. Further, don't necessarily worry about solving "it" generally. YAGNI and all that.
As a corollary to Jason's answer, if you can map your strings to rational numbers, such a mapping is very straightforward. If code(c) is the ASCII code of the character c and s[i] is the ith character in the string s, just sum as follows (for lowercase strings over a..z):
result <- 0
scale <- 1
for i from 1 to length(s)
    scale <- scale / 27
    index <- (1 + code(s[i]) - code('a'))
    result <- result + index * scale
end for
return result
This maps the empty string to 0, and every other string to a rational number between 0 and 1, maintaining lexicographical order. The base is 27 rather than 26 because the digits run from 1 to 26, with 0 reserved for "end of string"; in base 26, "yz" and "z" would both map to 1. If you have arbitrary-precision decimal floating-point numbers, you can replace the division by powers of 27 with powers of 100 and still have exactly representable numbers; with arbitrary precision binary floating-point numbers, you can divide by powers of 32.
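For a runnable version of this idea, here is a Python sketch using exact rationals from the standard fractions module. It assumes lowercase a..z input and treats each letter as a digit 1..26 in base 27, leaving digit 0 free to mean "end of string"; the function name is made up for the example.

```python
from fractions import Fraction


def string_to_rational(s):
    """Map a lowercase a..z string to an exact rational in [0, 1)."""
    result = Fraction(0)
    scale = Fraction(1)
    for ch in s:
        scale /= 27                      # next base-27 digit position
        digit = 1 + ord(ch) - ord('a')   # 'a' -> 1, ..., 'z' -> 26
        result += digit * scale
    return result


words = ["card", "z", "book", "yz", "car", "awesome"]
print(sorted(words, key=string_to_rational))
```

Sorting by the mapped value gives the same order as sorting the strings directly, including prefix cases such as "car" before "card".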
What you are asking for is a temporary suspension of the pigeonhole principle (http://en.wikipedia.org/wiki/Pigeonhole_principle).
The strings are the pigeons, the numbers are the holes.
There are more pigeons than holes, so you can't put each pigeon in its own hole.
You would be much better off writing a comparator which you can supply to a sort function. The comparator takes two strings and returns -1, 0, or 1. Even if you could create such a map, you still have to sort on it. If you need both a "hash" and the order, then keep stuff in two data structures - one that preserves the order, and one that allows fast access.
Maybe a Radix Tree is what you're looking for?
A radix tree, Patricia trie/tree, or crit bit tree is a specialized set data structure based on the trie that is used to store a set of strings. In contrast with a regular trie, the edges of a Patricia trie are labelled with sequences of characters rather than with single characters. These can be strings of characters, bit strings such as integers or IP addresses, or generally arbitrary sequences of objects in lexicographical order. Sometimes the names radix tree and crit bit tree are only applied to trees storing integers and Patricia trie is retained for more general inputs, but the structure works the same way in all cases.
LWN.net also has an article describing this data structure's use in the Linux kernel.
I have posted a question here: https://stackoverflow.com/questions/22798824/what-lexicographic-order-means
As workaround you can append empty symbols with code zero to right side of the string, and use expansion from case II.
Without such expansion with extra empty symbols, I actually don't know how to make such a mapping.
But if you have a finite set of symbols (V), then |V*| is equivalent to |N| -- a fact from discrete math.
