Hashing sets of integers

Hashing sets of integers - algorithm

I'm looking for a hash function over sets H(.) and a relation R(.,.) such that if A is included in B then R(H(A), H(B)). Of course, R(.,.) must be easy to verify (constant time), and H(A) should be computed in linear time.
One example of H and R is:
H(A) = OR over 1 << (h(x) % k), for x in A, k a fixed integer and h(x) a hash function over integers.
R(H(A), H(B)) = ((H(A) & H(B)) == H(A))
Are there any other good examples? (good is hard to define but intuitively if R(H(A), H(B)) then whp A is included in B).
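For reference, the H and R from the question can be sketched in Python roughly like this. The integer hash `h` and the default `k = 64` are placeholders I picked for illustration, not something the question specifies.

```python
def h(x: int) -> int:
    """Stand-in integer hash (an assumption; any well-mixed hash would do)."""
    return (x * 2654435761) & 0xFFFFFFFF


def set_signature(items, k: int = 64) -> int:
    """H(A): OR together one bit per element, at position h(x) % k."""
    sig = 0
    for x in items:
        sig |= 1 << (h(x) % k)
    return sig


def may_be_subset(sig_a: int, sig_b: int) -> bool:
    """R(H(A), H(B)): every bit set in H(A) is also set in H(B)."""
    return (sig_a & sig_b) == sig_a
```

If A is included in B the check always passes. The converse only holds with some probability, which drops as more bits of H(B) fill up.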

After thinking about this, I ended up with the example you gave. I.e. each element in B sets a bit in the hash, and A is only contained in B if each bit which is set in H(A) is also set in H(B).
Maybe a Bloom filter is applicable in your case. It seems to use the same bit trick, but with multiple hash functions.

Related

Is there a pushable/poppable hash function for stack-like objects?

I know of rolling hash functions that are similar to a hash on a bounded queue. Is there anything similar for stacks?
My use case is that I am doing a depth first search of possible program traces (with loop unrolling, so these stacks can get biiiiig) and I need to identify branching via these traces. Rather than store a bunch of stacks of depth 1000 I want to hash them so that I can index by int. However, if I have stacks of depth 10000+ this hash is going to be expensive, so I want to keep track of my last hash so that when I push/pop from my stack I can hash/unhash the new/old item respectively.
In particular, I am looking for a hash h(Object, Hash) with an unhash u(Object, Hash) with the property that for object x to be hashed we have:
u(x, h(x, baseHash)) = baseHash
Additionally, this hash shouldn't be commutative, since order matters.
One thought I had was matrix multiplication over GL(2, F(2^k)), maybe using a Cayley graph? For example, take two invertible matrices A_0, A_1, with inverses B_0 and B_1, in GL(2, F(2^k)), and compute the hash of an object x by first computing some integer hash with bits b31b30...b1b0, and then compute
H(x) = A_b31 . A_b30 . ... . A_b1 . A_b0
This has an inverse
U(x) = B_b0 . B_b1 . ... . B_b30 . B_b31.
Thus the h(x, baseHash) = H(x) . baseHash and u(x, baseHash) = U(x) . baseHash, so that
u(x, h(x, base)) = U(x) . H(x) . base = base,
as desired.
This seems like it might be more expensive than is necessary, but for 2x2 matrices it shouldn't be too bad?

Most incremental hash functions can be made from two kinds of operations:
1) An invertible diffusion function that mixes up the previous hash. Invertible functions are chosen for this so that they don't lose information. Otherwise the hash would tend towards a few values; and
2) An invertible mixing function to mix new data into the hash. Invertible functions are used for this so that every part of the input has equivalent influence over the final hash value.
Since both these things are invertible, it's very easy to undo the last part of an incremental hash and "pop" off the previous value.
For instance, the most common kind of simple hash functions in use are polynomial hash functions. To update a previous hash value with a new input 'x', you calculate:
h' = h*A + x mod M
The multiplication is the diffusion function. In order for this to be invertible, A must have a multiplicative inverse mod M -- commonly either M is chosen to be prime, or M is a power of 2 and A is odd.
Because the multiplicative inverse exists, it's easy to pop off the last value from the hash, as long as you still have access to it:
h = (h' - x)*(1/A) mod M
You can use the extended Euclidean algorithm to find the inverse of A: https://en.wikipedia.org/wiki/Extended_Euclidean_algorithm
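A minimal sketch of the push/pop pair for the polynomial hash above, assuming Python 3.8+ (where the built-in `pow(A, -1, M)` computes the modular inverse, so the extended Euclidean algorithm does not have to be written by hand). The constants `M` and `A` are illustrative choices, not recommendations.

```python
M = (1 << 61) - 1       # a Mersenne prime, picked here only for illustration
A = 1_000_003           # any multiplier that is coprime to M
A_INV = pow(A, -1, M)   # multiplicative inverse of A mod M (Python 3.8+)


def push(h: int, x: int) -> int:
    """h' = h*A + x mod M"""
    return (h * A + x) % M


def pop(h: int, x: int) -> int:
    """h = (h' - x) * (1/A) mod M; x must be the item that was pushed last."""
    return ((h - x) * A_INV) % M
```

Pushing 5 and then 7 gives a different value than pushing 7 and then 5, so the hash is order-sensitive, and popping in reverse order walks back through the earlier hash values.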
Most other common non-cryptographic hashes, like CRCs, FNV, murmurHash, etc. are similarly easy to pop values off.
Some of these hashes have a final diffusion step after the incremental work, but that step is pretty much always invertible as well, to ensure that the hash can take on any value, so you can undo it to get back to the incremental part.
Diffusion operations are often made from sequences of primitive invertible operations. To undo them you would undo each operation in reverse order. Some of the common types you'll see are:
cyclic shifts
invertible multiplication (as above)
x = x XOR (x >> shift)
Feistel rounds (see https://simple.wikipedia.org/wiki/Feistel_cipher)
mixing operations are usually + or XOR.
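To make "undo each operation in reverse order" concrete, here is a small sketch on 32-bit values. The particular sequence (xor-shift, odd multiply, xor-shift) and the constants are made up for the example and are not taken from any specific hash.

```python
MASK = 0xFFFFFFFF
MULT = 0x9E3779B1                    # odd, so it is invertible mod 2**32
MULT_INV = pow(MULT, -1, 1 << 32)    # Python 3.8+


def xorshift(x: int, shift: int) -> int:
    return (x ^ (x >> shift)) & MASK


def unxorshift(y: int, shift: int) -> int:
    # The top `shift` bits of y are already correct; every pass fixes
    # `shift` more bits, so 32 // shift + 1 passes are always enough.
    x = y
    for _ in range(32 // shift + 1):
        x = y ^ (x >> shift)
    return x


def diffuse(x: int) -> int:
    x = xorshift(x, 15)
    x = (x * MULT) & MASK
    return xorshift(x, 13)


def undiffuse(y: int) -> int:
    # Same steps as diffuse(), inverted and applied in reverse order.
    y = unxorshift(y, 13)
    y = (y * MULT_INV) & MASK
    return unxorshift(y, 15)
```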

Check if a value belongs to a hash

I'm not sure if this is actually possible, so I'm asking here. Does anyone know of an algorithm that would allow something like this?
const values = ['a', 'b', 'c', 'd'];
const hash = createHash(values); // => xjaks14sdffdghj23h4kjhgd9f81nkjrsdfg9aiojd
hash.includes('b'); // => true
hash.includes('v'); // => false
What this snippet does, is it first creates some sort of hash from a list of values, then checks if the certain value belongs to that hash.

Hash functions in general
The primary idea of hash functions is to reduce the space, that is the functions are not injective as they map from a bigger domain to a smaller.
So they produce collisions. That is, there are different elements x and y that get mapped to the same hash value:
h(x) = h(y)
So basically you lose information about the given argument x.
However, in order to answer the question whether all values are contained you would need to keep all information (or at least all non-duplicates). This is obviously not possible for nearly all practical hash-functions.
A possible hash function would be the identity function:
h(x) = x for all x
but this doesn't reduce the space, not practical.
A natural idea would be to compute hash values of the individual elements and then concatenate them, like
h(a, b, c) = (h(a), h(b), h(c))
But this again doesn't reduce the space, hash values are as long as the message, not practical.
Another possibility is to drop all duplicates, so given values [a, b, c, a, b] we only keep [a, b, c]. But this, in most examples, only reduces the space marginally, again not practical.
But no matter what you do, you can not reduce more than the amount of non-duplicates. Else you wouldn't be able to answer the question for some values. For example if we use [a, b, c, a] but only keep [a, b], we are unable to answer "was c contained" correctly.
Perfect hash functions
However, there is the field of perfect hash functions (Wikipedia). Those are hash-functions that are injective, they don't produce collisions.
In some areas they are of interest.
For those you may be able to answer that question, for example if computing the inverse is easy.
Cryptographic hash functions
If you talk about cryptographic hash functions, the answer is no.
Those need to have three properties (Wikipedia):
Pre-image resistance - Given h it should be difficult to find m : hash(m) = h
Second pre-image resistance - Given m it should be difficult to find m' : hash(m) = hash(m')
Collision resistance - It should be difficult to find (m, m') : hash(m) = hash(m')
Informally you have especially:
A small change to a message should change the hash value so extensively that the new hash value appears uncorrelated with the old hash value.
If you now would have such a hash value you would be able to easily reconstruct it by asking whether some values are contained. Using that you can easily construct collisions on purpose and stuff like that.
Details would however depend on the specific hash algorithm.
For a toy-example let's use the previous algorithm that simply removes all duplicates:
[a, b, c, a, a] -> [a, b, c]
In that case we find messages like
[a, b, c]
[a, b, c, a]
[a, a, b, b, c]
...
that all map to the same hash value.

If the hash function produces collisions (as almost all hash functions do), this is not possible.
Think about it this way: if, for example, h('abc') = x and h('abd') = x, how can you decide based on x whether the original string contains 'd'?
You could arguably decide to use the identity as a hash function, which would do the job.

Trivial solution will be a simple hash concatenation.
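function createHash(values) {
  let hash = ''; // start from an empty string so that += concatenates the digests
  for (const v of values) {
    hash += MD5(v); // MD5 is assumed to be a helper that returns a hex digest string
  }
  return hash;
}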
Can it be done with fixed length hash and variable input? I'd bet it's impossible.
In case of string hash (such as used in HashMaps), because it is additive, I think we can match partially (prefix match but not suffix).
const values = ['a', 'b', 'c', 'd'];
const hash = createStringHash(values); // => xjaks14sdffdghj23h4kjhgd9f81nkjrsdfg9aiojd
hash.includes('a'); // => true
hash.includes('a', 'b'); // => true
hash.includes('a', 'b', 'v'); // => false

Bit arrays
If you don't care what the resulting hash looks like, I'd recommend just using a bit array.
Take the range of all possible values
Map this to the range of integers starting from 0
Let each bit in our hash indicate whether or not this value appears in the input
This will require 1 bit for every possible value (which could be a lot of bits for large ranges).
Note: this representation is optimal in terms of the number of bits used, assuming there's no limit on the number of elements you can have (beyond 1 of each possible value) - if it were possible to use any fewer bits, you'd have an algorithm that's capable of providing guaranteed compression of any data, which is impossible by the pigeonhole principle.
For example:
If your range is a-z, you can map this to 0-25, then [a,d,g,h] would map to:
10010011000000000000000000 = 38535168 = 0x24c0000
(abcdefghijklmnopqrstuvwxyz)
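The a-z example can be reproduced with a few lines of Python. This is only a sketch for the 26-letter range, with 'a' as the leftmost (most significant) bit as in the layout shown; the function names are mine.

```python
def letters_to_bits(letters) -> int:
    """Set one bit per letter; 'a' is the leftmost of 26 bits, 'z' the rightmost."""
    bits = 0
    for ch in letters:
        index = ord(ch) - ord('a')     # map a-z to 0-25
        bits |= 1 << (25 - index)
    return bits


def contains(bits: int, ch: str) -> bool:
    """Check whether the bit that belongs to `ch` is set."""
    index = ord(ch) - ord('a')
    return ((bits >> (25 - index)) & 1) == 1
```

`letters_to_bits("adgh")` returns 38535168, the value shown in the example, and `contains` answers membership by testing a single bit.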
More random-looking hashes
If you care what the hash looks like, you could take the output from the above and perform a perfect hash on it to map it either to the same length hash or a longer hash.
One trivial example of such a map would be to increment the resulting hash by a randomly chosen but deterministic value (i.e. it's the same for every hash we convert) - you can also do this for each byte (with wrap-around) if you want (e.g. byte0 = (byte0+5)%256, byte1 = (byte1+18)%256).
To determine whether an element appears, the simplest approach would be to reverse the above operation (subtract instead of add) and then just check if the corresponding bit is set. Depending on what you did, it might also be possible to only convert a single byte.
Bloom filters
If you don't mind false positives, I might recommend just using a bloom filter instead.
In short, this sets multiple bits for each value, and then checks each of those bits to check whether a value is in our collection. But the bits that are set for one value can overlap with the bits for other values, which allows us to significantly reduce the number of bits required at the cost of a few false positives (assuming the total number of elements isn't too large).
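A very small Bloom filter sketch, just to show the mechanics. The sizes (256 bits, 3 hash functions) and the way the hash functions are derived from SHA-256 are arbitrary choices for the example; a real filter would be sized for the expected number of elements.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits: int = 256, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0   # a Python int used as the bit array

    def _positions(self, value: str):
        # Derive num_hashes bit positions by salting the value with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def __contains__(self, value: str) -> bool:
        # True may be a false positive; False is always correct.
        return all((self.bits >> pos) & 1 for pos in self._positions(value))
```

After adding 'a', 'b', 'c' and 'd', `'b' in bf` is guaranteed to be true, while `'v' in bf` is false unless all three of its bits happen to have been set by other values.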

Generating a perfect hash function given known list of strings?

Suppose I have a list of N strings, known at compile-time.
I want to generate (at compile-time) a function that will map each string to a distinct integer between 1 and N inclusive. The function should take very little time or space to execute.
For example, suppose my strings are:
{"apple", "orange", "banana"}
Such a function may return:
f("apple") -> 2
f("orange") -> 1
f("banana") -> 3
What's a strategy to generate this function?
I was thinking to analyze the strings at compile time and look for a couple of constants I could mod or add by or something?
The compile-time generation time/space can be quite expensive (but obviously not ridiculously so).

Say you have m distinct strings, and let a_{i,j} be the jth character of the ith string. In the following, I'll assume that they all have the same length. This can be easily translated into any reasonable programming language by treating a_{i,j} as the null character if j ≥ |a_i|.
The idea I suggest is composed of two parts:
Find (at most) m - 1 positions differentiating the strings, and store these positions.
Create a perfect hash function by considering the strings as length-m vectors, and storing the parameters of the perfect hash function.
Obviously, in general, the hash function must check at least m - 1 positions. It's easy to see this by induction. For 2 strings, at least 1 character must be checked. Assume it's true for i strings: i - 1 positions must be checked. Create a new set of strings by appending 0 to the end of each of the i strings, and add a new string that is identical to one of the strings, except it has a 1 at the end.
Conversely, it's obvious that it's possible to find at most m - 1 positions sufficient for differentiating the strings (for some sets the number of course might be lower, as low as log to the base of the alphabet size of m). Again, it's easy to see so by induction. Two distinct strings must differ at some position. Placing the strings in a matrix with m rows, there must be some column where not all characters are the same. Partitioning the matrix by the character in that column into two or more parts, and applying the argument recursively to each part with at least 2 rows, shows this.
Say the m - 1 positions are p_1, ..., p_{m-1}. In the following, recall the meaning above for a_{i,p_j} for p_j ≥ |a_i|: it is the null character.
Let us define h(a_i) = ∑_{j=1}^{m-1} [q_j a_{i,p_j} % n], for random q_j and some n. Then h is known to be a universal hash function: the probability of pair-collision P(x ≠ y ∧ h(x) = h(y)) ≤ 1/n.
Given a universal hash function, there are known constructions for creating a perfect hash function from it. Perhaps the simplest is creating a vector of size m^2 and successively trying the above h with n = m^2 and randomized coefficients, until there are no collisions. The expected number of attempts needed is 2, and the probability that more attempts are needed decreases exponentially.
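The retry loop could look roughly like the sketch below. It takes two shortcuts that are my assumptions, not part of the answer: it uses every character position up to the longest string instead of first selecting m - 1 differentiating positions, and it reduces the weighted sum modulo n once at the end. The function names and the `max_attempts` guard are made up for the example.

```python
import random


def char_at(s: str, j: int) -> int:
    """Character code at position j, or 0 (the null character) past the end."""
    return ord(s[j]) if j < len(s) else 0


def build_perfect_hash(strings, rng=None, max_attempts=1000):
    """Try random coefficients until no two strings collide in a table of size m^2."""
    rng = rng or random.Random()
    m = len(strings)
    n = max(m * m, 2)                       # table size m^2 (at least 2)
    width = max(len(s) for s in strings)
    for _ in range(max_attempts):
        q = [rng.randrange(1, n) for _ in range(width)]

        def h(s, q=q):
            return sum(qj * char_at(s, j) for j, qj in enumerate(q)) % n

        if len({h(s) for s in strings}) == m:   # collision-free on the known set
            return h, n
    raise RuntimeError("no collision-free coefficients found")
```

The returned `h` maps each known string to a distinct slot in `range(n)`. Getting from there to the 1..N numbering asked for in the question would need one extra lookup table of size n.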

It is simple. Make a dictionary and assign 1 to the first word, 2 to the second, ... No need to make things complicated, just number your words.
To make the lookup effective, use trie or binary search or whatever tool your language provides.

Lists Hash function

I'm trying to make a hash function so I can tell if two lists with the same size contain the same elements.
For example, this is what I want:
f((1 2 3))=f((1 3 2))=f((2 1 3))=f((2 3 1))=f((3 1 2))=f((3 2 1)).
Any idea how I can approach this problem? I've tried taking the sum of squares of all elements, but it turned out that there are collisions, for example f((2 2 5))=33=f((1 4 4)), which is wrong as the lists are not the same.
I'm looking for a simple approach if there is any.

Sort the list and then:
hash = 0 # start value; without it the first multiplication fails on nil
list.each do |current_element|
  hash = (37 * hash + current_element) % MAX_HASH_VALUE
end

You're probably out of luck if you really want no collisions. There are N choose k sets of size k with elements in 1..N (and worse, if you allow repeats). So imagine you have N=256, k=8, then N choose k is ~4 x 10^14. You'd need a very large integer to distinctly hash all of these sets.
Possibly you have N, k such that you could still make this work. Good luck.
If you allow occasional collisions, you have lots of options. From simple things like your suggestion (add squares of elements) and xor-ing the elements, to complicated things like sorting them, printing them to a string, and computing MD5 on it. But since collisions are still possible, you have to verify any hash match by comparing the original lists (if you keep them sorted, this is easy).

So you are looking for something that provides these properties:
1. If h(x1) == y1, then there is an inverse function h_inverse(y1) == x1
2. Because the inverse function exists, there cannot be a value x2 such that x1 != x2, and h(x2) == y1.
Knuth's Multiplicative Method
In Knuth's "The Art of Computer Programming", section 6.4, a multiplicative hashing scheme is introduced as a way to write a hash function. The key is multiplied by the golden ratio of 2^32 (2654435761) to produce a hash result.
hash(i)=i*2654435761 mod 2^32
Since 2654435761 and 2^32 have no common factors, the multiplication produces a complete mapping of the key to hash result with no overlap. This method works pretty well if the keys have small values. Bad hash results are produced if the keys vary in the upper bits. As is true in all multiplications, variations of upper digits do not influence the lower digits of the multiplication result.
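Both claims are easy to check in Python (3.8+ for the modular inverse via `pow`). This is just a sketch of the multiplicative step on 32-bit keys; the function names are mine.

```python
KNUTH = 2654435761
MOD = 1 << 32
KNUTH_INV = pow(KNUTH, -1, MOD)   # exists because KNUTH is odd


def knuth_hash(i: int) -> int:
    return (i * KNUTH) % MOD


def knuth_unhash(h: int) -> int:
    """Inverse mapping, which is why no two 32-bit keys can collide."""
    return (h * KNUTH_INV) % MOD
```

The weakness with upper bits shows up if you only look at the low byte of the result: `knuth_hash(0x12345678) & 0xFF` equals `knuth_hash(0x78) & 0xFF`, because the upper bits of the key never reach the lower bits of the product.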
Robert Jenkins' 96 bit Mix Function
Robert Jenkins has developed a hash function based on a sequence of subtraction, exclusive-or, and bit shift.
All the sources in this article are written as Java methods, where the operator '>>>' represents the concept of unsigned right shift. If the source were to be translated to C, then the Java 'int' data type should be replaced with C 'uint32_t' data type, and the Java 'long' data type should be replaced with C 'uint64_t' data type.
The following source is the mixing part of the hash function.
int mix(int a, int b, int c)
{
a=a-b; a=a-c; a=a^(c >>> 13);
b=b-c; b=b-a; b=b^(a << 8);
c=c-a; c=c-b; c=c^(b >>> 13);
a=a-b; a=a-c; a=a^(c >>> 12);
b=b-c; b=b-a; b=b^(a << 16);
c=c-a; c=c-b; c=c^(b >>> 5);
a=a-b; a=a-c; a=a^(c >>> 3);
b=b-c; b=b-a; b=b^(a << 10);
c=c-a; c=c-b; c=c^(b >>> 15);
return c;
}
You can read details from here

If all the elements are numbers and they have a maximum, this is not too complicated, you sort those elements and then you put them together one after the other in the base of your maximum+1.
Hard to describe in words...
For example, if your maximum is 9 (that makes it easy to understand), you'd have :
f(2 3 9 8) = f(3 8 9 2) = 2389
If your maximum was 99, you'd have :
f(16 2 76 8) = (0)2081676
In your example with 2,2 and 5, if you know you would never get anything higher than 5, you could "compose" the result in base 6, so that would be :
f(2 2 5) = 2*6^2 + 2*6 + 5 = 89
f(1 4 4) = 1*6^2 + 4*6 + 4 = 64
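In code, the "sort, then write the elements as digits in base maximum+1" idea might look like this. It is a sketch that assumes non-negative integers no larger than `maximum` and lists of equal length, as in the question.

```python
def sorted_base_hash(values, maximum: int) -> int:
    """Sort the values and read them as digits of a number in base maximum + 1."""
    base = maximum + 1
    result = 0
    for v in sorted(values):
        result = result * base + v
    return result
```

This reproduces the numbers above: `sorted_base_hash([2, 3, 9, 8], 9)` is 2389, `sorted_base_hash([2, 2, 5], 5)` is 89 and `sorted_base_hash([1, 4, 4], 5)` is 64. The result grows with the list length, so it only fits a machine word for short lists with a small maximum.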

Combining hash values is hard, I've found this way (no explanation, though perhaps someone would recognize it) within Boost:
template <class T>
void hash_combine(size_t& seed, T const& v)
{
seed ^= hash_value(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}
It should be fast since there is only shifting, additions and xor taking place (apart from the actual hashing).
However, the requirement that the order of the list does not influence the end result would mean that you first have to sort it, which is an O(N log N) operation, so it may not fit.
Also, since it's impossible without more stringent boundaries to provide a collision-free hash function, you'll still have to actually compare the sorted lists whenever the hashes are equal...

I'm trying to make a hash function so I can tell if two lists with same sizes contain the same elements.
[...] but it turned out that there are collisions
These two sentences suggest you are using the wrong tool for the job. The point of a hash (unless it is a 'perfect hash', which doesn't seem appropriate to this problem) is not to guarantee equality, or to provide a unique output for every given input. In the general usual case, it cannot, because there are more potential inputs than potential outputs.
Whatever hash function you choose, your hashing system is always going to have to deal with the possibility of collisions. And while different hashes imply inequality, it does not follow that equal hashes imply equality.
As regards your actual problem: a start might be to sort the list in ascending order, then use the sorted values as if they were the prime powers in the prime decomposition of an integer. Reconstruct this integer (modulo the maximum hash value) and there is a hash value.
For example:
2 1 3
sorted becomes
1 2 3
Treating this as prime powers gives
2^1.3^2.5^3
which constructs
2.9.125 = 2250
giving 2250 as your hash value, which will be the same hash value as for any other ordering of 1 2 3, and also different from the hash value for any other sequence of three numbers that do not overflow the maximum hash value when computed.
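A sketch of the prime-power construction, assuming non-negative integer elements. The helper `first_primes` and the optional `modulus` parameter are my own additions for the example.

```python
def first_primes(count: int) -> list:
    """Return the first `count` primes by trial division (fine for short lists)."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def prime_power_hash(values, modulus=None) -> int:
    """Use the sorted values as exponents of 2, 3, 5, ... and multiply."""
    result = 1
    for p, e in zip(first_primes(len(values)), sorted(values)):
        result *= p ** e
    return result if modulus is None else result % modulus
```

`prime_power_hash([2, 1, 3])` gives 2250 for every ordering of the three elements. Python integers do not overflow, so collisions only appear once a `modulus` is applied.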

A naïve approach to solving your essential problem (comparing lists in an order-insensitive manner) is to convert all lists being compared to a set (set in Python or HashSet in Java). This is more effective than making a hash function since a perfect hash seems essential to your problem. For almost any other approach collisions are inevitable depending on input.

Finding sets that are a subset of a specific set

Let's say I have 4 different values A,B,C,D with sets of identifiers attached.
A={1,2,3,4,5}
B={8,9,4}
C={3,4,5}
D={12,8}
And given set S of identifiers {1,30,3,4,5,12,8} I want it to return C and D. i.e. retrieve all sets from a group of sets for which S is a superset.
Are there any algorithms to perform this task efficiently (preferably with low memory complexity; using an external device for storing data is not an option)?
A trivial solution would be, for each member in the superset S, to retrieve the list of sets that include that member (basically an inverted index) and, for each returned set, check that all of its members are in the superset. Unfortunately, because on average the superset will include at least one member of each set, there is a significant and unacceptable performance hit with this approach.
I am trying to do this in Java. Sets consist of integers and the value they identify is an object.
Collection of sets is not static and bound to change during the course of execution. There will be some limit on the set number though.
Set size is not limited. But on average it's between 1 and 20.

1. Go through each element x in S.
2. For each set t for which x ∈ t, increment a counter (call it tcount) associated with t.
3. After all that, for each set t for which tcount = | t |, you know that t ⊆ S.
Application.
After step 2.
Acount = 4,
Bcount = 2,
Ccount = 3,
Dcount = 2.
Step 3 processing.
Acount ≠ |A| (4 ≠ 5): Reject,
Bcount ≠ |B| (2 ≠ 3): Reject,
Ccount = |C| (3 = 3): Accept,
Dcount = |D| (2 = 2): Accept.
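The counting steps can be sketched in Python as follows (the question asks for Java, so treat this as pseudocode that happens to run). The inverted index from element to set names is rebuilt on every call here for brevity; in practice it would be maintained as sets are added and removed. The function names are hypothetical.

```python
from collections import defaultdict


def count_hits(sets_by_name: dict, s) -> dict:
    """For each named set, count how many of its members occur in s."""
    index = defaultdict(list)              # element -> names of sets containing it
    for name, members in sets_by_name.items():
        for x in members:
            index[x].append(name)
    counts = defaultdict(int)
    for x in set(s):                       # each element of S is visited once
        for name in index.get(x, ()):
            counts[name] += 1
    return dict(counts)


def subsets_of(sets_by_name: dict, s) -> list:
    """Names of all non-empty sets whose count equals their size, i.e. t ⊆ S."""
    counts = count_hits(sets_by_name, s)
    return sorted(n for n, c in counts.items() if c == len(sets_by_name[n]))
```

With the four sets from the question and S = {1, 30, 3, 4, 5, 12, 8}, `subsets_of` returns `['C', 'D']`.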

Note after cgkanchi note: The following algorithm is under the assumption that you don't really use sets but arrays. If that is not the case, you should look for a method which implements intersection of sets and then the problem is trivial. This is about how to implement the notion of intersection using arrays.
1. Sort all sets using heapsort, which sorts in place with O(1) extra space. It runs in O(n log n), and the cost soon pays for itself.
2. For each set L of all sets:
2.1. j = 0
2.2. For the i-th element in L:
2.2.1. Starting from the j-th element, find an S[j] for which L[i] = S[j]; otherwise reject. If L and S are large enough, use binary search or interpolation search (for the second one, have a look at your data distribution).
2.3. Accept
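A sketch of the inner check (steps 2.1 to 2.3) using a plain linear scan, assuming both arrays are already sorted and contain no duplicates. For large arrays the `while` loop could be replaced by a binary search from position j.

```python
def is_sorted_subset(l: list, s: list) -> bool:
    """Return True if every element of sorted list l occurs in sorted list s."""
    j = 0
    for value in l:
        # Skip the elements of s that are too small to match.
        while j < len(s) and s[j] < value:
            j += 1
        if j == len(s) or s[j] != value:
            return False                   # reject: value is missing from s
        j += 1                             # matched, continue after this position
    return True                            # accept
```

Because j never moves backwards, one check costs at most len(l) + len(s) steps.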

As for Java, I’d use a Hashtable for the lookup table of the elements in S. Then for each element in X, the set you want to test if it’s a subset of S, test if it’s in the lookup table. If all elements of X are also in S, then S is a superset of X.
