Check if a value belongs to a hash - algorithm

I'm not sure if this is actually possible, thus I ask here. Does anyone know of an algorithm that would allow something like this?
const values = ['a', 'b', 'c', 'd'];
const hash = createHash(values); // => xjaks14sdffdghj23h4kjhgd9f81nkjrsdfg9aiojd
hash.includes('b'); // => true
hash.includes('v'); // => false
What this snippet does is first create some sort of hash from a list of values, then check whether a certain value belongs to that hash.

Hash functions in general
The primary idea of hash functions is to reduce the space; that is, the functions are not injective, as they map from a bigger domain to a smaller one.
So they produce collisions. That is, there are different elements x and y that get mapped to the same hash value:
h(x) = h(y)
So you basically lose information about the given argument x.
However, in order to answer membership queries for arbitrary values, you would need to keep all the information (or at least all non-duplicates). This is obviously not possible for nearly all practical hash functions.
One possible hash function would be the identity function:
h(x) = x for all x
but this doesn't reduce the space at all, so it is not practical.
A natural idea would be to compute hash values of the individual elements and then concatenate them, like
h(a, b, c) = (h(a), h(b), h(c))
But this again doesn't reduce the space: the hash values are as long as the message, so it is not practical.
Another possibility is to drop all duplicates: given the values [a, b, c, a, b], we only keep [a, b, c]. But in most cases this only reduces the space marginally; again, not practical.
But no matter what you do, you cannot reduce more than the amount of non-duplicates. Otherwise you wouldn't be able to answer the question for some values. For example, if we hash [a, b, c, a] but only keep [a, b], we are unable to answer "was c contained?" correctly.
Perfect hash functions
However, there is the field of perfect hash functions (Wikipedia). Those are hash functions that are injective on a given set of inputs, so they produce no collisions for that set.
In some areas they are of interest.
For those you may be able to answer the membership question, for example if computing the inverse is easy.
Cryptographic hash functions
If you talk about cryptographic hash functions, the answer is no.
Those need to have three properties (Wikipedia):
Pre-image resistance - Given h, it should be difficult to find m : hash(m) = h
Second pre-image resistance - Given m, it should be difficult to find m' ≠ m : hash(m) = hash(m')
Collision resistance - It should be difficult to find distinct (m, m') : hash(m) = hash(m')
Informally, you especially have:
A small change to a message should change the hash value so extensively that the new hash value appears uncorrelated with the old hash value.
If you could query such a hash value for membership, you would be able to easily reconstruct the message by asking whether each possible value is contained. Using that, you could easily construct collisions on purpose, and so on.
Details would however depend on the specific hash algorithm.
For a toy example, let's use the previous algorithm that simply removes all duplicates:
[a, b, c, a, a] -> [a, b, c]
In that case we find messages like
[a, b, c]
[a, b, c, a]
[a, a, b, b, c]
...
that all map to the same hash value.

If the hash function produces collisions (as almost all hash functions do), this is not possible.
Think about it this way: if, for example, h('abc') = x and h('abd') = x, how can you decide based on x whether the original string contains 'd'?
You could arguably decide to use the identity as a hash function, which would do the job.

A trivial solution would be simple hash concatenation:
const crypto = require('crypto'); // Node's built-in crypto module

function createHash(values) {
  let hash = '';
  for (const v of values) {
    // append the MD5 digest of each value to the running hash string
    hash += crypto.createHash('md5').update(v).digest('hex');
  }
  return hash;
}
Can it be done with a fixed-length hash and variable input? I'd bet it's impossible.
In the case of a string hash (such as those used in HashMaps), because it is built up additively, I think we can match partially (a prefix match, but not a suffix match):
const values = ['a', 'b', 'c', 'd'];
const hash = createStringHash(values); // => xjaks14sdffdghj23h4kjhgd9f81nkjrsdfg9aiojd
hash.includes('a'); // => true
hash.includes('a', 'b'); // => true
hash.includes('a', 'b', 'v'); // => false

Bit arrays
If you don't care what the resulting hash looks like, I'd recommend just using a bit array.
Take the range of all possible values
Map this to the range of integers starting from 0
Let each bit in our hash indicate whether or not this value appears in the input
This will require 1 bit for every possible value (which could be a lot of bits for large ranges).
Note: this representation is optimal in terms of the number of bits used, assuming there's no limit on the number of elements you can have (beyond 1 of each possible value) - if it were possible to use any fewer bits, you'd have an algorithm that's capable of providing guaranteed compression of any data, which is impossible by the pigeonhole principle.
For example:
If your range is a-z, you can map this to 0-25, then [a,d,g,h] would map to:
10010011000000000000000000 = 38535168 = 0x24c0000
(abcdefghijklmnopqrstuvwxyz)
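For instance, a small TypeScript sketch of this scheme (the helper names are mine, and it puts 'a' at the low-order bit, the mirror image of the 26-bit string written above):
// One bit per possible value in the range a-z.
function createBitHash(values: string[]): number {
  let bits = 0;
  for (const v of values) {
    bits |= 1 << (v.charCodeAt(0) - 97); // 'a' -> bit 0, ..., 'z' -> bit 25
  }
  return bits;
}

function includes(bits: number, value: string): boolean {
  return (bits & (1 << (value.charCodeAt(0) - 97))) !== 0;
}

const hash = createBitHash(['a', 'd', 'g', 'h']); // => 201 = 0xc9
includes(hash, 'd'); // => true
includes(hash, 'b'); // => false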
More random-looking hashes
If you care what the hash looks like, you could take the output from the above and perform a perfect hash on it to map it either to the same length hash or a longer hash.
One trivial example of such a map would be to increment the resulting hash by a randomly chosen but deterministic value (i.e. it's the same for every hash we convert) - you can also do this for each byte (with wrap-around) if you want (e.g. byte0 = (byte0+5)%256, byte1 = (byte1+18)%256; the modulus must be 256, not 255, or the per-byte map isn't invertible).
To determine whether an element appears, the simplest approach would be to reverse the above operation (subtract instead of add) and then just check if the corresponding bit is set. Depending on what you did, it might also be possible to only convert a single byte.
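A sketch of that byte-wise variant (the offsets are arbitrary but deterministic choices of mine):
// Disguise the bit-array hash by adding a fixed offset to each byte
// mod 256, and subtract it back before testing a bit. Using 256 (not
// 255) keeps each byte map invertible over the full 0..255 range.
const OFFSETS = [5, 18, 73, 41]; // chosen once, reused for every hash

function scramble(bytes: Uint8Array): Uint8Array {
  return bytes.map((b, i) => (b + OFFSETS[i % OFFSETS.length]) % 256);
}

function unscramble(bytes: Uint8Array): Uint8Array {
  return bytes.map((b, i) => (b - OFFSETS[i % OFFSETS.length] + 256) % 256);
}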
Bloom filters
If you don't mind false positives, I'd recommend just using a Bloom filter instead.
In short, this sets multiple bits for each value, and then checks each of those bits to check whether a value is in our collection. But the bits that are set for one value can overlap with the bits for other values, which allows us to significantly reduce the number of bits required at the cost of a few false positives (assuming the total number of elements isn't too large).
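A minimal Bloom filter sketch (my own illustration, not part of the original answer); it derives the k bit positions per value from two FNV-1a hashes via double hashing, and m, k, and the '#' salt are arbitrary choices:
// FNV-1a hash over a string, kept to 32 bits.
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

class BloomFilter {
  private bits: Uint8Array;
  constructor(private m = 1024, private k = 3) {
    this.bits = new Uint8Array(Math.ceil(m / 8));
  }
  private positions(value: string): number[] {
    const h1 = fnv1a(value);
    const h2 = (fnv1a(value + '#') | 1) >>> 0; // force odd so strides vary
    const out: number[] = [];
    for (let i = 0; i < this.k; i++) out.push((h1 + i * h2) % this.m);
    return out;
  }
  add(value: string): void {
    for (const p of this.positions(value)) this.bits[p >> 3] |= 1 << (p & 7);
  }
  mightContain(value: string): boolean {
    return this.positions(value).every(p => (this.bits[p >> 3] & (1 << (p & 7))) !== 0);
  }
}

// const f = new BloomFilter();
// ['a', 'b', 'c', 'd'].forEach(v => f.add(v));
// f.mightContain('b'); // => true
// f.mightContain('v'); // => false (with high probability)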

Related

Is there a pushable/poppable hash function for stack-like objects?

I know of rolling hash functions that are similar to a hash on a bounded queue. Is there anything similar for stacks?
My use case is that I am doing a depth first search of possible program traces (with loop unrolling, so these stacks can get biiiiig) and I need to identify branching via these traces. Rather than store a bunch of stacks of depth 1000 I want to hash them so that I can index by int. However, if I have stacks of depth 10000+ this hash is going to be expensive, so I want to keep track of my last hash so that when I push/pop from my stack I can hash/unhash the new/old item respectively.
In particular, I am looking for a hash h(Object, Hash) with an unhash u(Object, Hash) with the property that for object x to be hashed we have:
u(x, h(x, baseHash)) = baseHash
Additionally, this hash shouldn't be commutative, since order matters.
One thought I had was matrix multiplication over GL(2, F(2^k)), maybe using a Cayley graph? For example, take two invertible matrices A_0, A_1, with inverses B_0 and B_1, in GL(2, F(2^k)), and compute the hash of an object x by first computing some integer hash with bits b31b30...b1b0, and then compute
H(x) = A_b31 . A_b30 . ... . A_b1 . A_b0
This has an inverse
U(x) = B_b0 . B_b1 . ... . B_b30 . B_b31.
Thus h(x, baseHash) = H(x) . baseHash and u(x, baseHash) = U(x) . baseHash, so that
u(x, h(x, base)) = U(x) . H(x) . base = base,
as desired.
This seems like it might be more expensive than is necessary, but for 2x2 matrices it shouldn't be too bad?
Most incremental hash functions can be made from two kinds of operations:
1) An invertible diffusion function that mixes up the previous hash. Invertible functions are chosen for this so that they don't lose information; otherwise the hash would tend towards a few values; and
2) An invertible mixing function to mix new data into the hash. Invertible functions are used for this so that every part of the input has equivalent influence over the final hash value.
Since both these things are invertible, it's very easy to undo the last part of an incremental hash and "pop" off the previous value.
For instance, the most common kind of simple hash function in use is the polynomial hash. To update a previous hash value with a new input x, you calculate:
h' = h*A + x mod M
The multiplication is the diffusion function. In order for this to be invertible, A must have a multiplicative inverse mod M -- commonly either M is chosen to be prime, or M is a power of 2 and A is odd.
Because the multiplicative inverse exists, it's easy to pop off the last value from the hash, as long as you still have access to it:
h = (h' - x)*(1/A) mod M
You can use the extended Euclidean algorithm to find the inverse of A: https://en.wikipedia.org/wiki/Extended_Euclidean_algorithm
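Here is a minimal sketch of push/pop for this polynomial hash, assuming a prime modulus (2^61 - 1 here, an arbitrary choice) so that the inverse of A exists; modPow finds it via Fermat's little theorem, which serves the same purpose as the extended Euclidean algorithm:
const M = 2305843009213693951n; // 2^61 - 1, prime
const A = 131n;                 // multiplier

// Square-and-multiply modular exponentiation.
function modPow(base: bigint, exp: bigint, mod: bigint): bigint {
  let result = 1n;
  base %= mod;
  while (exp > 0n) {
    if (exp & 1n) result = (result * base) % mod;
    base = (base * base) % mod;
    exp >>= 1n;
  }
  return result;
}

const A_INV = modPow(A, M - 2n, M); // A^(-1) mod M, since M is prime

// h' = h*A + x mod M
function push(h: bigint, x: bigint): bigint {
  return (h * A + x) % M;
}

// h = (h' - x) * A^(-1) mod M
function pop(h: bigint, x: bigint): bigint {
  return (((h - x) % M + M) % M) * A_INV % M;
}

// pop(push(h, 42n), 42n) === h, for any h in [0, M)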
Most other common non-cryptographic hashes, like CRCs, FNV, murmurHash, etc. are similarly easy to pop values off.
Some of these hashes have a final diffusion step after the incremental work, but that step is pretty much always invertible as well, to ensure that the hash can take on any value, so you can undo it to get back to the incremental part.
Diffusion operations are often made from sequences of primitive invertible operations; to undo them, you undo each operation in reverse order (see the sketch after this list). Some of the common types you'll see are:
cyclic shifts
invertible multiplication (as above)
x = x XOR (x >> shift)
Feistel rounds (see https://simple.wikipedia.org/wiki/Feistel_cipher)
mixing operations are usually + or XOR.
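For example, a sketch of undoing the "x = x XOR (x >> shift)" step from the list above, for 32-bit words (>>> is the unsigned right shift; each pass recovers another group of shift bits, from the top down):
function xorShiftRight(x: number, shift: number): number {
  return (x ^ (x >>> shift)) >>> 0;
}

function invertXorShiftRight(y: number, shift: number): number {
  let x = y;
  // Each iteration extends the recovered prefix by `shift` more bits.
  for (let i = shift; i < 32; i += shift) {
    x = (y ^ (x >>> shift)) >>> 0;
  }
  return x;
}

// invertXorShiftRight(xorShiftRight(0xdeadbeef, 13), 13) === 0xdeadbeef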

Difference on arrays, lists, hashes

Is there a concept and/or algorithms that deal with the minimal sequence of elementary operations to handle differences between structured objects such as arrays/lists/hashes? I am imagining something like various notions of string distances, but I want to handle not only strings. For example, the difference between the first and the second arrays below:
["a", {b: 1}, false, "b"]
[{b: 1}, "a", false, true]
can be represented by two operations: transposing the elements at indices 0 and 1, and replacing the element at index 3 with true. Replacing the whole array might seem like less work (a single operation), but that involves larger objects and should not be counted as the minimal operation. Is there a notion like this in programming?
I don't know exactly what should be considered elementary operations that would make sense. I imagine insertion, deletion, and perhaps transposition, substitution, and/or assignment of a value under a different key in the case of a hash. They should all deal with the structural difference; I clearly don't want to include operations like "add +3 to a number."
If you can encode the data structure as a string, then you could use a variant of something like the Levenshtein distance algorithm. Assign a different symbol to each unique element of the two data structures - in this case "a" = A, {b: 1} = B, false = C, true = D, and "b" = E - then you'd be looking for the edit distance between the strings ABCE and BACD.
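A sketch of that approach; it assumes the elements are JSON-serializable so they can be given symbols, and uses plain Levenshtein (the Damerau variant would add the transpositions the question mentions):
// Give each distinct element a symbol via its JSON form, then run the
// textbook Levenshtein DP over the symbol sequences.
function editDistance(a: unknown[], b: unknown[]): number {
  const sym = new Map<string, number>();
  const encode = (arr: unknown[]) =>
    arr.map(v => {
      const key = JSON.stringify(v);
      if (!sym.has(key)) sym.set(key, sym.size);
      return sym.get(key)!;
    });
  const s = encode(a);
  const t = encode(b);
  // dp[i][j] = edit distance between s[0..i) and t[0..j)
  const dp = Array.from({ length: s.length + 1 }, (_, i) =>
    Array.from({ length: t.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= s.length; i++) {
    for (let j = 1; j <= t.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                  // deletion
        dp[i][j - 1] + 1,                                  // insertion
        dp[i - 1][j - 1] + (s[i - 1] === t[j - 1] ? 0 : 1) // substitution
      );
    }
  }
  return dp[s.length][t.length];
}

// editDistance(["a", {b: 1}, false, "b"], [{b: 1}, "a", false, true]); // => 3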

Algorithm/Data Structure for finding combinations of minimum values easily

I have a symmetric matrix like the one shown in the image attached below.
I've made up the notation A.B which represents the value at grid point (A, B). Furthermore, writing A.B.C gives me the minimum grid point value like so: MIN((A,B), (A,C), (B,C)).
As another example A.B.D gives me MIN((A,B), (A,D), (B,D)).
My goal is to find the minimum values for ALL combinations of letters (not repeating) for one row at a time, e.g. for this example I need to find the min values with respect to row A, which are given by the calculations:
A.B = 6
A.C = 8
A.D = 4
A.B.C = MIN(6,8,6) = 6
A.B.D = MIN(6, 4, 4) = 4
A.C.D = MIN(8, 4, 2) = 2
A.B.C.D = MIN(6, 8, 4, 6, 4, 2) = 2
I realize that certain calculations can be reused, which becomes increasingly important as the matrix size increases, but the problem is finding the most efficient way to implement this reuse.
Can you point me in the right direction to an efficient algorithm/data structure I can use for this problem?
You'll want to think about the lattice of subsets of the letters, ordered by inclusion. Essentially, you have a value f(S) given for every subset S of size 2 (that is, every off-diagonal element of the matrix - the diagonal elements don't seem to occur in your problem), and the problem is to find, for each subset T of size greater than two, the minimum f(S) over all S of size 2 contained in T. (And then you're interested only in sets T that contain a certain element "A" - but we'll disregard that for the moment.)
First of all, note that if you have n letters, this amounts to asking Omega(2^n) questions, roughly one for each subset. (Excluding the zero- and one-element subsets and those that don't include "A" saves you n + 1 sets and a factor of two, respectively, which the big Omega absorbs.) So if you want to store all these answers for even moderately large n, you'll need a lot of memory. If n is large in your application, it might be best to store some collection of pre-computed data and do some computation whenever you need a particular data point; I haven't thought about what would work best, but for example computing data only for a binary tree contained in the lattice would not necessarily gain you anything over precomputing nothing at all.
With these things out of the way, let's assume you actually want all the answers computed and stored in memory. You'll want to compute these "layer by layer", that is, starting with the three-element subsets (since the two-element subsets are already given by your matrix), then four-element, then five-element, etc. This way, for a given subset T, when we're computing f(T) we will already have computed all f(S) for S strictly contained in T. There are several ways you can make use of this, but I think the easiest is the following: let t1 and t2 be two different elements of T that you may select however you like; let S be the subset of T that you get when you remove t1 and t2. Write S1 for S plus t1 and write S2 for S plus t2. Now every pair of letters contained in T is either fully contained in S1, or it is fully contained in S2, or it is {t1, t2}. Look up f(S1) and f(S2) in your previously computed values, then look up f({t1, t2}) directly in the matrix, and store f(T) = the minimum of these 3 numbers.
If you never select "A" for t1 or t2, then indeed you can compute everything you're interested in while not computing f for any sets T that don't contain "A". (This is possible because the steps outlined above are only interesting whenever T contains at least three elements.) Good! This leaves just one question - how to store the computed values f(T). What I would do is use a 2^(n-1)-sized array; represent each subset-of-your-alphabet-that-includes-"A" by the (n-1)-bit number where the ith bit is 1 whenever the (i+1)th letter is in that set (so 0010110, which has bits 1, 2, and 4 set, represents the subset {"A", "C", "D", "F"} out of the alphabet "A" .. "H" - note I'm counting bits starting at 0 from the right, and letters starting at "A" = 0). This way, you can actually iterate through the sets in numerical order and don't need to think about how to iterate through all k-element subsets of an n-element set. (You do need to include a special case for when the set under consideration has 0 or 1 elements, in which case you'll want to do nothing, and for 2 elements, in which case you just copy the value from the matrix.)
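A compact sketch of this layer-by-layer scheme; for simplicity it uses a full n-bit mask over all letters rather than the (n-1)-bit, "A"-always-included encoding described above:
// f[mask] holds the minimum matrix entry over all pairs of letters in
// mask. Letters are bits: 0 = A, 1 = B, and so on.
function precomputeMins(matrix: number[][]): Float64Array {
  const n = matrix.length;
  const f = new Float64Array(1 << n).fill(Infinity);
  for (let mask = 0; mask < 1 << n; mask++) {
    const t1 = mask & -mask;               // lowest set bit
    const t2 = (mask ^ t1) & -(mask ^ t1); // second-lowest set bit
    if (t2 === 0) continue;                // 0 or 1 elements: nothing to do
    const i = Math.log2(t1);
    const j = Math.log2(t2);
    if ((mask ^ t1 ^ t2) === 0) {
      f[mask] = matrix[i][j];              // exactly 2 elements: copy from matrix
    } else {
      // Every pair in mask lies in mask^t1, in mask^t2, or is {t1, t2}.
      f[mask] = Math.min(f[mask ^ t1], f[mask ^ t2], matrix[i][j]);
    }
  }
  return f;
}

// Using the pair values read off the question's calculations
// (A.B=6, A.C=8, A.D=4, B.C=6, B.D=4, C.D=2):
// const X = Infinity; // diagonal is unused
// const f = precomputeMins([[X,6,8,4],[6,X,6,4],[8,6,X,2],[4,4,2,X]]);
// f[0b0011]; // A.B     => 6
// f[0b1011]; // A.B.D   => 4
// f[0b1111]; // A.B.C.D => 2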
Well, it looks simple to me, but perhaps I misunderstand the problem. I would do it like this:
let P be a pattern string in your notation X1.X2. ... .Xn, where Xi is a column in your matrix
first compute the array CS = [ (X1, X2), (X1, X3), ... (X1, Xn) ], which contains all combinations of X1 with every other element in the pattern; CS has n-1 elements, and you can easily build it in O(n)
now you must compute min (CS), i.e. finding the minimum value of the matrix elements corresponding to the combinations in CS; again you can easily find the minimum value in O(n)
done.
Note: since your matrix is symmetric, given P you just need to compute CS by combining the first element of P with all other elements: (X1, Xi) is equal to (Xi, X1)
If your matrix is very large, and you want to do some optimization, you may consider prefixes of P: let me explain with an example
when you have solved the problem for P = X1.X2.X3, store the result in an associative map, where X1.X2.X3 is the key
later on, when you solve a problem P' = X1.X2.X3.X7.X9.X10.X11 you search for the longest prefix of P' in your map: you can do this by starting with P' and removing one component (Xi) at a time from the end until you find a match in your map or you end up with an empty string
if you find a prefix of P' in your map, then you already know the solution for that subproblem, so you just have to solve the problem resulting from combining the first element of the prefix with the suffix, and then compare the two results: in our example the prefix is X1.X2.X3, so you just have to solve the problem for
X1.X7.X9.X10.X11, and then compare the two values and choose the min (don't forget to update your map with the new pattern P')
if you don't find any prefix, then you must solve the entire problem for P' (and again don't forget to update the map with the result, so that you can reuse it in the future)
This technique is essentially a form of memoization.

Hashing sets of integers

I'm looking for a hash function over sets, H(.), and a relation, R(.,.), such that if A is included in B, then R(H(A), H(B)) holds. Of course, R(.,.) must be easy to verify (constant time), and H(A) should be computable in linear time.
One example of H and R is:
H(A) = OR over 1 << (h(x) % k), for x in A, with k a fixed integer and h(x) a hash function over integers.
R(H(A), H(B)) = ((H(A) & H(B)) == H(A))
Are there any other good examples? (Good is hard to define, but intuitively: if R(H(A), H(B)), then with high probability A is included in B.)
After thinking about this, I ended up with the example you gave. I.e. each element in B sets a bit in the hash, and A is only contained in B if each bit which is set in H(A) is also set in H(B).
Maybe a Bloom filter is applicable in your case. It seems to use the same bit trick, but with multiple hash functions.
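For concreteness, a small sketch of the H/R pair from the question (the multiplicative integer hash and k = 32 are arbitrary choices of mine):
// H ORs one bit per element; R checks that every bit of H(A) is also
// set in H(B).
function hashInt(x: number): number {
  return Math.imul(x, 0x9e3779b1) >>> 0; // Knuth-style multiplicative hash
}

function H(set: number[], k = 32): number {
  let sig = 0;
  for (const x of set) sig |= 1 << (hashInt(x) % k);
  return sig >>> 0;
}

function R(hA: number, hB: number): boolean {
  return ((hA & hB) >>> 0) === hA; // A's bits are a subset of B's bits
}

// If A is a subset of B, R(H(A), H(B)) always holds; the converse
// holds only with high probability.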

Map strings to numbers maintaining the lexicographic ordering

I'm looking for an algorithm or function that can map a string to a number in such a way that the resulting values correspond to the lexicographic ordering of the strings. Example:
"book" -> 50000
"car" -> 60000
"card" -> 65000
"a longer string" -> 15000
"another long string" -> 15500
"awesome" -> 16000
As a function it should be something like f(x) = y, such that for any x1 < x2 we have f(x1) < f(x2), where x is an arbitrary string and y is a number.
If the input set of x is finite, then I could always do a sort and assign the proper values, but I'm looking for something generic for an unlimited input set for x.
If you require that f map to integers, this is impossible.
Suppose that there is such a map f. Consider the strings a, aa, aaa, etc. Consider the values f(a), f(aa), f(aaa), etc. As we require that f(a) < f(aa) < f(aaa) < ... we see that f(a_n) tends to infinity as n tends to infinity; here I am using the obvious notation that a_n is the character a repeated n times. Now consider the string b. We require that f(a_n) < f(b) for all n. But f(b) is some finite integer and we just showed that f(a_n) goes to infinity. We have a contradiction. No such map is possible.
Maybe you could tell us what you need this for? This is fairly abstract and we might be able to suggest something more suitable. Further, don't necessarily worry about solving "it" generally. YAGNI and all that.
As a corollary to Jason's answer, if you can map your strings to rational numbers, such a mapping is very straightforward. If code(c) is the ASCII code of the character c and s[i] is the ith character in the string s, just sum as follows:
result <- 0
scale <- 1
for i from 1 to length(s)
    scale <- scale / 27
    index <- (1 + code(s[i]) - code('a'))
    result <- result + index * scale
end for
return result
This maps the empty string to 0 and every other string to a rational number strictly between 0 and 1, maintaining lexicographical order. (Note the base of 27 - one more than the alphabet size: with base 26 and digit values 1 to 26, "az" and "b" would both map to 2/26.) If you have arbitrary-precision decimal floating-point numbers, you can replace the division by powers of 27 with powers of 100 and still have exactly representable numbers; with arbitrary-precision binary floating-point numbers, you can divide by powers of 32.
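A quick sketch of this mapping with exact arithmetic, representing the result as a BigInt fraction and comparing by cross-multiplication (the helper names are mine):
// The value is num/den in [0, 1); no floating-point error is involved.
function lexRational(s: string): { num: bigint; den: bigint } {
  let num = 0n;
  let den = 1n;
  for (const ch of s) {
    const index = BigInt(ch.charCodeAt(0) - 96); // 'a' -> 1, ..., 'z' -> 26
    num = num * 27n + index;
    den *= 27n;
  }
  return { num, den };
}

function lexLess(a: string, b: string): boolean {
  const x = lexRational(a);
  const y = lexRational(b);
  return x.num * y.den < y.num * x.den; // x.num/x.den < y.num/y.den
}

// lexLess("az", "b"); // => true (53/729 < 2/27)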
What you are asking for is a temporary suspension of the pigeonhole principle (http://en.wikipedia.org/wiki/Pigeonhole_principle).
The strings are the pigeons, the numbers are the holes.
There are more pigeons than holes, so you can't put each pigeon in its own hole.
You would be much better off writing a comparator that you can supply to a sort function. The comparator takes two strings and returns -1, 0, or 1. Even if you could create such a map, you would still have to sort on it. If you need both a "hash" and the order, then keep the data in two structures - one that preserves the order, and one that allows fast access.
Maybe a Radix Tree is what you're looking for?
A radix tree, Patricia trie/tree, or crit bit tree is a specialized set data structure based on the trie that is used to store a set of strings. In contrast with a regular trie, the edges of a Patricia trie are labelled with sequences of characters rather than with single characters. These can be strings of characters, bit strings such as integers or IP addresses, or generally arbitrary sequences of objects in lexicographical order. Sometimes the names radix tree and crit bit tree are only applied to trees storing integers and Patricia trie is retained for more general inputs, but the structure works the same way in all cases.
LWN.net also has an article describing this data structure's use in the Linux kernel.
I have posted a question here: https://stackoverflow.com/questions/22798824/what-lexicographic-order-means
As a workaround you can append empty symbols with code zero to the right side of the string, and use the expansion from case II.
Without such an expansion with extra empty symbols, I actually don't know how to make such a mapping...
But if you have a finite set of symbols V, then |V*| is equal to |N| - a fact from discrete math.
