I know of rolling hash functions that are similar to a hash on a bounded queue. Is there anything similar for stacks?
My use case is that I am doing a depth first search of possible program traces (with loop unrolling, so these stacks can get biiiiig) and I need to identify branching via these traces. Rather than store a bunch of stacks of depth 1000 I want to hash them so that I can index by int. However, if I have stacks of depth 10000+ this hash is going to be expensive, so I want to keep track of my last hash so that when I push/pop from my stack I can hash/unhash the new/old item respectively.
In particular, I am looking for a hash h(Object, Hash) with an unhash u(Object, Hash) with the property that for object x to be hashed we have:
u(x, h(x, baseHash)) = baseHash
Additionally, this hash shouldn't be commutative, since order matters.
One thought I had was matrix multiplication over GL(2, F(2^k)), maybe using a Cayley graph? For example, take two invertible matrices A_0, A_1, with inverses B_0 and B_1, in GL(2, F(2^k)), and compute the hash of an object x by first computing some integer hash with bits b31b30...b1b0, and then compute
H(x) = A_b31 . A_b30 . ... . A_b1 . A_b0
This has an inverse
U(x) = B_b0 . B_b1 . ... . B_b30 . B_b31.
Thus the h(x, baseHash) = H(x) . baseHash and u(x, baseHash) = U(x) . baseHash, so that
u(x, h(x, base)) = U(x) . H(x) . base = base,
as desired.
This seems like it might be more expensive than is necessary, but for 2x2 matrices it shouldn't be too bad?
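A rough sketch of this idea in Python, using GL(2, Z_p) for a prime p instead of GL(2, F(2^k)) to keep the field arithmetic simple; the prime, the generator matrices and the helper names (mat_mul, H, U, push, pop) are all arbitrary choices for illustration:
# Cayley-style stack hash: each object maps to a product of invertible 2x2
# matrices (one factor per bit), a push multiplies the running hash by H(x),
# a pop multiplies by U(x) = H(x)^-1. Uses Z_p instead of F(2^k) for simplicity.
P = (1 << 61) - 1                                   # a Mersenne prime, arbitrary choice

def mat_mul(X, Y):
    (a, b), (c, d) = X
    (e, f), (g, h) = Y
    return (((a * e + b * g) % P, (a * f + b * h) % P),
            ((c * e + d * g) % P, (c * f + d * h) % P))

def mat_inv(X):
    (a, b), (c, d) = X
    det_inv = pow((a * d - b * c) % P, P - 2, P)    # Fermat inverse of the determinant
    return (((d * det_inv) % P, (-b * det_inv) % P),
            ((-c * det_inv) % P, (a * det_inv) % P))

I = ((1, 0), (0, 1))
A = [((1, 1), (0, 1)), ((1, 0), (1, 1))]            # A_0, A_1 (both invertible)
B = [mat_inv(m) for m in A]                         # B_0, B_1

def H(x):                                           # H(x) = A_b31 . ... . A_b0
    bits, m = hash(x) & 0xFFFFFFFF, I               # any 32-bit object hash works here
    for i in range(31, -1, -1):
        m = mat_mul(m, A[(bits >> i) & 1])
    return m

def U(x):                                           # U(x) = B_b0 . ... . B_b31
    bits, m = hash(x) & 0xFFFFFFFF, I
    for i in range(32):
        m = mat_mul(m, B[(bits >> i) & 1])
    return m

def push(stack_hash, x):                            # h(x, base) = H(x) . base
    return mat_mul(H(x), stack_hash)

def pop(stack_hash, x):                             # u(x, base) = U(x) . base
    return mat_mul(U(x), stack_hash)

base = I
assert pop(push(base, "some object"), "some object") == base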
Most incremental hash functions can be made from two kinds of operations:
1) An invertible diffusion function that mixes up the previous hash. Invertible functions are chosen for this so that they don't lose information; otherwise the hash would tend towards a few values; and
2) An invertible mixing function to mix new data into the hash. Invertible functions are used for this so that every part of the input has equivalent influence over the final hash value.
Since both these things are invertible, it's very easy to undo the last part of an incremental hash and "pop" off the previous value.
For instance, the most common kind of simple hash functions in use are polynomial hash functions. To update a previous hash value with a new input 'x', you calculate:
h' = h*A + x mod M
The multiplication is the diffusion function. In order for this to be invertible, A must have a multiplicative inverse mod M -- commonly either M is chosen to be prime, or M is a power of 2 and A is odd.
Because the multiplicative inverse exists, it's easy to pop off the last value from the hash, as long as you still have access to it:
h = (h' - x)*(1/A) mod M
You can use the extended Euclidean algorithm to find the inverse of A: https://en.wikipedia.org/wiki/Extended_Euclidean_algorithm
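A minimal sketch of this in Python, assuming a 64-bit state with M = 2^64 and an arbitrarily chosen odd multiplier A; push and pop are just illustrative names:
# Push/pop for the polynomial hash h' = h*A + x mod M. M is a power of two and
# A is odd, so A has a multiplicative inverse mod M.
M = 1 << 64
A = 6364136223846793005
A_INV = pow(A, -1, M)            # modular inverse (Python 3.8+)

def push(h, x):
    return (h * A + x) % M

def pop(h, x):
    # undo the last push, given the element x that was pushed most recently
    return ((h - x) * A_INV) % M

h = push(push(0, 42), 7)         # hash of the stack [42, 7]
assert pop(h, 7) == push(0, 42)  # popping 7 restores the hash of [42]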
Most other common non-cryptographic hashes, like CRCs, FNV, murmurHash, etc. are similarly easy to pop values off.
Some of these hashes have a final diffusion step after the incremental work, but that step is pretty much always invertible as well, to ensure that the hash can take on any value, so you can undo it to get back to the incremental part.
Diffusion operations are often made from sequences of primitive invertible operations. To undo them you would undo each operation in reverse order. Some of the common types you'll see are:
cyclic shifts
invertible multiplication (as above)
x = x XOR (x >> shift)
Feistel rounds (see https://simple.wikipedia.org/wiki/Feistel_cipher)
mixing operations are usually + or XOR.
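For instance, here is how you could undo the x = x XOR (x >> shift) step from the list above (a Python sketch assuming a 64-bit word; the function names are just for illustration):
# Inverting y = x ^ (x >> s) for 64-bit values: the top s bits of y already
# equal those of x, so re-applying the shift enough times recovers every bit.
def xorshift(x, s):
    return x ^ (x >> s)

def un_xorshift(y, s, bits=64):
    x, shift = y, s
    while shift < bits:
        x = y ^ (x >> s)
        shift += s
    return x

x = 0x0123456789ABCDEF
assert un_xorshift(xorshift(x, 13), 13) == x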
Non-Functional way:
arr = [1, 2, 3] becomes arr = [1, 5, 3]. Here we change the same array.
This is discouraged in functional programming. I know that since computers are becoming faster every day and have more memory, functional programming seems more feasible, for better readability and clean code.
Functional way:
arr = [1, 2, 3] isn't changed; instead we create arr2 = [1, 5, 3]. I see a general trend that we use more memory and time just to change one variable.
Here, we doubled our memory and the time complexity changed from O(1) to O(n).
This might be costly for bigger algorithms. Where is this compensated? Or, since we can afford costlier calculations (say, when quantum computing becomes mainstream), do we just trade speed off for readability?
Functional data structures don't necessarily take up a lot more space or require more processing time. The important aspect here is that purely functional data structures are immutable, but that doesn't mean you always make a complete copy of something. In fact, the immutability is precisely the key to working efficiently.
I'll provide a simple list as an example. Suppose we have the list (1, 2, 3). The head of the list is element 1. The tail of the list is (2, 3). Suppose this list is entirely immutable.
Now we want to add an element at the start of that list, so our new list must look like (0, 1, 2, 3).
You can't change the existing list; it is immutable. So we have to make a new one, right? However, note how the tail of our new list is (1, 2, 3). That's identical to the old list, so you can just re-use it. The new list is simply the element 0 with a pointer to the start of the old list as its tail.
If our lists were mutable, this would not be safe. If you changed something in the old list (for example, replacing element 2 with a different one) the change would reflect in the new list as well. That's exactly where the danger is in mutability: concurrent access on data structures needs to be synchronized to avoid unpredictable results, and changes can have unintended side-effects. But, because that can't happen with immutable data structures, it's safe to re-use part of another structure in a new one. Sometimes you want changes in one thing to reflect in another; for example, when you remove an entry in the key set of a Map in Java, you want the mapping itself to be removed too. But in other situations mutability leads to trouble (the infamous Calendar class in Java).
So how can this work, if you can't change the data structure itself? How do you make a new list? Remember that if we're working purely functionally, we move away from the classical data structures with changeable pointers, and instead evaluate functions.
In functional languages, lists are made with the cons function. cons makes a "cell" of two elements. If you want to make a list with only one element, the second one is nil. So a list containing only the element 3 is:
(cons 3 nil)
If the above is a function and you ask what its head is, you get 3. Ask for the tail, you get nil. Now, the tail itself can be a function, like cons.
Our first list then is expressed as such:
(cons 1 (cons 2 (cons 3 nil)))
Ask the head of the above function and you get 1. Ask for the tail and you get (cons 2 (cons 3 nil)).
If we want to append 0 in the front, you just make a new function that evaluates to cons with 0 as head and the above as tail.
(cons 0 (cons 1 (cons 2 (cons 3 nil))))
Since the functions we make are immutable, our lists become immutable. Things like adding elements is a matter of making a new function that calls the old one in the right place. Traversing a list in the imperative and object-oriented way is going through pointers to get from one element to another. Traversing a list in the functional way is evaluating functions.
I like to think of data structures as this: a data structure is basically storing the result of running some algorithm in memory. It "caches" the result of computation, so we don't have to do the computation every time. Purely functional data structures model the computation itself via functions.
This in fact means that it can be quite memory efficient because a lot of data copying can be avoided. And with an increasing focus on parallelization in processing, immutable data structures can be very useful.
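As a rough illustration in Python (the Cons type below is just a hand-rolled example for this answer, not a standard library class):
# Minimal immutable cons cell: prepending re-uses the old list instead of copying it.
from typing import Any, NamedTuple, Optional

class Cons(NamedTuple):
    head: Any
    tail: Optional["Cons"]

old = Cons(1, Cons(2, Cons(3, None)))   # the list (1, 2, 3)
new = Cons(0, old)                      # the list (0, 1, 2, 3)

assert new.tail is old                  # the tail is literally the old list, not a copy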
EDIT
Given the additional questions in the comments, I'll add a bit to the above to the best of my abilities.
What about my example? Is it something like cons(1 fn) and that function can be cons(2 fn2) where fn2 is cons(3 nil) and in some other case cons(5 fn2)?
The cons function is best compared to a single-linked list. As you might imagine, if you're given a list composed of cons cells, what you're getting is the head and thus random access to some index isn't possible. In your array you can just call arr[1] and get the second item (since it's 0-indexed) in the array, in constant time. If you state something like val list = (cons 1 (cons 2 (cons 3 nil))) you can't just ask the second item without traversing it, because list is now actually a function you evaluate. So access requires linear time, and access to the last element will take longer than access to the head element. Also, given that it's equivalent to a single-linked list, traversal can only be in one direction. So the behavior and performance is more like that of a single-linked list than of, say, an arraylist or array.
Purely functional data structures don't necessarily provide better performance for some operations such as indexed access. A "classic" data structure may have O(1) for some operation where a functional one may have O(log n) for the same one. That's a trade-off; functional data structures aren't a silver bullet, just like object-orientation wasn't. You use them where they make sense. If you're always going to traverse a whole list or part of it and want to be capable of safe parallel access, a structure composed of cons cells works perfectly fine. In functional programming, you'd often traverse a structure using recursive calls where in imperative programming you'd use a for loop.
There are of course many other functional data structures, some of which come much closer to modeling an array that allows random access and updates. But they're typically a lot more complex than the simple example above. There's of course advantages: parallel computation can be trivially easy thanks to immutability; memoization allows us to cache the results of function calls based on inputs since a purely functional approach always yields the same result for the same input.
What are we actually storing underneath? If we need to traverse a list, we need a mechanism to point to next elements right? or If I think a bit, I feel like it is irrelevant question to traverse a list since whenever a list is required it should probably be reconstructed everytime?
We store data structures containing functions. What is a cons? A simple structure consisting of two elements: a head and tail. It's just pointers underneath. In an object-oriented language like Java, you could model it as a class Cons that contains two final fields head and tail assigned on construction (immutable) and has corresponding methods to fetch these. This in a LISP variant
(cons 1 (cons 2 nil))
would be equivalent to
new Cons(1, new Cons(2, null))
in Java.
The big difference in functional languages is that functions are first-class types. They can be passed around and assigned to variables just like object references. You can compose functions. I could just as easily do this in a functional language
val list = (cons 1 (max 2 3))
and if I ask list.head I get 1, if I ask list.tail I get (max 2 3) and evaluating that just gives me 3. You compose functions. Think of it as modeling behavior instead of data. Which brings us to
Could you elaborate "Purely functional data structures model the computation itself via functions."?
Calling list.tail on our above list returns something that can be evaluated and then returns a value. In other words, it returns a function. If I call list.tail in that example it returns (max 2 3), clearly a function. Evaluating it yields 3 as that's the highest number of the arguments. In this example
(cons 1 (cons 2 nil))
calling tail evaluates to a new cons (the (cons 2 nil) one) which in turn can be used.
Suppose we want a sum of all the elements in our list. In Java, before the introduction of lambdas, if you had an array int[] array = new int[] {1, 2, 3} you'd do something like
int sum = 0;
for (int i = 0; i < array.length; ++i) {
    sum += array[i];
}
In a functional language it would be something like (simplified pseudo-code)
(define sum (arg)
  (eq arg nil
    (0)
    (+ arg.head (sum arg.tail))
  )
)
This uses prefix notation like we've used with our cons so far. So a + b is written as (+ a b). define lets us define a function, taking as arguments the name (sum), a list of arguments for the function ((arg)), and then the actual function body (the rest).
The function body consists of an eq function which we'll define as comparing its first two arguments (arg and nil) and if they're equal it evaluates to its next argument ((0) in this case), otherwise to the argument after that (the sum). So think of it as (eq arg1 arg2 true false) with true and false whatever you want (a value, a function...).
The recursion bit then comes in the sum (+ arg.head (sum arg.tail)). We're stating that we take the addition of the head of the argument with a recursive call to the sum function itself on the tail. Suppose we do this:
val list = (cons 1 (cons 2 (cons 3 nil)))
(sum list)
Mentally step through what that last line would do to see how it evaluates to the sum of all the elements in list.
Note, now, how sum is a function. In the Java example we had some data structure and then iterated over it, performing access on it, to create our sum. In the functional example the evaluation is the computation. A useful aspect of this is that sum as a function could be passed around and evaluated only when it's actually needed. That is lazy evaluation.
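Here's roughly the same recursive sum sketched in Python, where a cons cell is modelled as a plain (head, tail) pair and nil as None; this is an illustrative stand-in, not how a real functional language represents it:
# Recursive sum over cons cells, mirroring the pseudo-code above.
def cons(head, tail):
    return (head, tail)

def list_sum(cell):
    if cell is None:                    # the (eq arg nil (0) ...) base case
        return 0
    head, tail = cell
    return head + list_sum(tail)        # (+ arg.head (sum arg.tail))

lst = cons(1, cons(2, cons(3, None)))
assert list_sum(lst) == 6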
Another example of how data structures and algorithms are actually the same thing in a different form. Take a set. A set can contain only one instance of an element, for some definition of equality of elements. For something like integers it's simple; if they are the same value (like 1 == 1) they're equal. For objects, however, we typically have some equality check (like equals() in Java). So how can you know whether a set already contains an element? You go over each element in the set and check if it is equal to the one you're looking for.
A hash set, however, computes some hash function for each element and places elements with the same hash in a corresponding bucket. For a good hash function there will rarely be more than one element in a bucket. If you now provide some element and want to check if it's in the set, the actions are:
Get the hash of the provided element (typically takes constant time).
Find the hash bucket in the set for that hash (again should take constant time).
Check if there's an element in that bucket which is equal to the given element.
The requirement is that two equal elements must have the same hash.
So now you can check if something is in the set in constant time. The reason being that our data structure itself has stored some computation information: the hashes. If you store each element in a bucket corresponding to its hash, we have put some computation result in the data structure itself. This saves time later if we want to check whether the set contains an element. In that way, data structures are actually computations frozen in memory. Instead of doing the entire computation every time, we've done some work up-front and re-use those results.
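A toy sketch of those steps in Python; the fixed bucket count and helper names are chosen purely for illustration:
# Toy hash set: elements are placed into buckets by hash up-front, so a
# membership check only inspects one bucket instead of every element.
NUM_BUCKETS = 16

def build_buckets(elements):
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for e in elements:
        buckets[hash(e) % NUM_BUCKETS].append(e)    # the "frozen" computation
    return buckets

def contains(buckets, element):
    bucket = buckets[hash(element) % NUM_BUCKETS]   # constant-time bucket lookup
    return any(e == element for e in bucket)        # bucket is usually tiny

s = build_buckets(["apple", "pear", "plum"])
assert contains(s, "pear") and not contains(s, "fig")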
When you think of data structures and algorithms as being analogous in this way, it becomes clearer how functions can model the same thing.
Make sure to check out the classic book "Structure and Interpretation of Computer Programs" (often abbreviated as SICP). It'll give you a lot more insight. You can read it for free here: https://mitpress.mit.edu/sicp/full-text/book/book.html
This is a really broad question with a lot of room for opinionated answers, but G_H provides a really nice breakdown of some of the differences.
Could you elaborate "Purely functional data structures model the computation itself via functions."?
This is one of my favourite topics, so I'm happy to share an example in JavaScript, because it will allow you to run the code here in the browser and see the answer for yourself.
Below you will see a linked list implemented using functions. I use a couple of Numbers for example data and I use a String so that I can log something to the console for you to see, but other than that, it's just functions – no fancy objects, no arrays, no other custom stuff.
const cons = (x,y) => f => f(x,y)
const head = f => f((x,y) => x)
const tail = f => f((x,y) => y)
const nil = () => {}
const isEmpty = x => x === nil
const comp = f => g => x => f(g(x))
const reduce = f => y => xs =>
  isEmpty(xs) ? y : reduce (f) (f (y, head(xs))) (tail(xs))
const reverse = xs =>
  reduce ((acc,x) => cons(x,acc)) (nil) (xs)
const map = f =>
  comp (reverse) (reduce ((acc, x) => cons(f(x), acc)) (nil))
// this function is required so we can visualise the data
// it effectively converts a linked list of functions to readable strings
const list2str = xs =>
  isEmpty(xs) ? 'nil' : `(${head(xs)} . ${list2str(tail(xs))})`
// example input data
const xs = cons(1, cons(2, cons(3, cons(4, nil))))
// example derived data
const ys = map (x => x * x) (xs)
console.log(list2str(xs))
// (1 . (2 . (3 . (4 . nil))))
console.log(list2str(ys))
// (1 . (4 . (9 . (16 . nil))))
Of course this isn't of practical use in real-world JavaScript, but that's beside the point. It's just showing you how functions alone could be used to represent complex data structures.
Here's another example of implementing rational numbers using nothing but functions and numbers. Again, we're only using strings so we can convert the functional structure to a visual representation we can understand in the console; this exact scenario is examined thoroughly in the SICP book that G_H mentions.
We even implement our higher-order data type rat using cons. This shows how functional data structures can easily be made up of (composed of) other functional data structures.
const cons = (x,y) => f => f(x,y)
const head = f => f((x,y) => x)
const tail = f => f((x,y) => y)
const mod = y => x =>
  y > x ? x : mod (y) (x - y)
const gcd = (x,y) =>
  y === 0 ? x : gcd(y, mod (y) (x))
const rat = (n,d) =>
  (g => cons(n/g, d/g)) (gcd(n,d))
const numer = head
const denom = tail
const ratAdd = (x,y) =>
  rat(numer(x) * denom(y) + numer(y) * denom(x),
      denom(x) * denom(y))
const rat2str = r => `${numer(r)}/${denom(r)}`
// example complex data
let x = rat(1,2)
let y = rat(1,4)
console.log(rat2str(x)) // 1/2
console.log(rat2str(y)) // 1/4
console.log(rat2str(ratAdd(x,y))) // 3/4
Suppose I have the data set {A,B,C,D}, of arbitrary type, and I want to compare it to another data set. I want the comparison to be true for {A,B,C,D}, {B,C,D,A}, {C,D,A,B}, and {D,A,B,C}, but not for {A,C,B,D} or any other set that is not ordered similarly. What is a fast way to do this?
Storing them in arrays, rotating, and doing the comparison that way is an O(n^2) task, so that's not very good.
My first intuition would be to store the data as a doubled sequence like {A,B,C,D,A,B,C} and then search for the other list as a contiguous subsequence, which is only O(n). Can this be done any faster?
There is a fast algorithm for finding the minimum rotation of a string - https://en.wikipedia.org/wiki/Lexicographically_minimal_string_rotation. So you can store and compare the minimum rotation.
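For instance, a hedged Python sketch of that idea, using a naive O(n^2) normalization rather than Booth's O(n) algorithm (min_rotation is just an illustrative helper name):
# Normalize each sequence to its lexicographically minimal rotation, then compare.
def min_rotation(seq):
    seq = list(seq)
    n = len(seq)
    return min(seq[i:] + seq[:i] for i in range(n)) if n else seq

assert min_rotation("ABCD") == min_rotation("CDAB")
assert min_rotation("ABCD") != min_rotation("ACBD")
assert min_rotation("AABC") == min_rotation("ABCA")   # duplicates are handled too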
One option is to use a directed graph. Set up a graph with the following transitions:
A -> B
B -> C
C -> D
D -> A
All other transitions will put you in an error state. Thus, provided each member is unique (which is implied by your use of the word set), you will be able to determine membership provided you end on the same graph node on which you started.
If a value can appear multiple times in your search, you'll need a smarter set of states and transitions.
This approach is useful if you precompute a single search and then match it to many data points. It's not so useful if you have to constantly regenerate the graph. It could also be cache-inefficient if your state table is large.
Well Dr Zoidberg, if you are interested in order, as you are, then you need to store your data in a structure that preserves order and also allows for easy rotation.
In Python a list would do.
Find the smallest element of the list, then rotate each list you want to compare until its smallest element is at the beginning. Note: this is not a sort, but a rotation. With all the lists for comparison normalised this way, a straightforward list compare between any two will tell if they are the same after rotation.
>>> def rotcomp(lst1, lst2):
        while min(lst1) != lst1[0]:
            lst1 = lst1[1:] + [lst1[0]]
        while min(lst2) != lst2[0]:
            lst2 = lst2[1:] + [lst2[0]]
        return lst1 == lst2
>>> rotcomp(list('ABCD'), list('CDAB'))
True
>>> rotcomp(list('ABCD'), list('CDBA'))
False
>>>
>>> rotcomp(list('AABC'), list('ABCA'))
False
>>> def rotcomp2(lst1, lst2):
        return repr(lst1)[1:-1] in repr(lst2 + lst2)
>>> rotcomp2(list('ABCD'), list('CDAB'))
True
>>> rotcomp2(list('ABCD'), list('CDBA'))
False
>>> rotcomp2(list('AABC'), list('ABCA'))
True
>>>
NEW SECTION: WITH DUPLICATES?
If the input may contain duplicates then (following the possible twin question mentioned under the question), an algorithm is to check whether one list is a sub-list of the other list repeated twice.
Function rotcomp2 above uses that algorithm together with a textual comparison of the repr of the list contents.
How can I select some parts of a matrix and cut the singleton dimensions?
Example: B = zeros(100,100,3,'double');
When I select B(2,3,:) I get a 1x1x3 matrix as result. This is not the expected result, because for some operations (like norm) I need a vector as the result. To handle this problem I used squeeze, but this operation seems to be very time consuming, especially when heavily used.
How can I select only the vector and 'cut' the singleton dimensions?
In your case you could use the colon operator, like this:
x = B(2,3,:);
x = x(:);
This places all elements of x into a number-of-elements by 1 vector.
You could also permute the dimensions to bring the non-singleton one to front. Either:
>> permute(B(2,3,:),[3 1 2])
ans =
0.97059
0.69483
0.2551
or
>> permute(B(2,3,:),[1 3 2])
ans =
0.97059 0.69483 0.2551
depending on whether you want a row or a column vector.
I'm trying to make a hash function so I can tell if two lists with the same sizes contain the same elements.
For example, this is what I want:
f((1 2 3)) = f((1 3 2)) = f((2 1 3)) = f((2 3 1)) = f((3 1 2)) = f((3 2 1)).
Any idea how I can approach this problem? I've tried doing the sum of squares of all elements, but it turned out that there are collisions; for example f((2 2 5)) = 33 = f((1 4 4)), which is wrong as the lists are not the same.
I'm looking for a simple approach if there is any.
Sort the list and then:
hash = 0
list.each do |current_element|
  hash = (37 * hash + current_element) % MAX_HASH_VALUE
end
You're probably out of luck if you really want no collisions. There are N choose k sets of size k with elements in 1..N (and worse, if you allow repeats). So imagine you have N=256, k=8, then N choose k is ~4 x 10^14. You'd need a very large integer to distinctly hash all of these sets.
Possibly you have N, k such that you could still make this work. Good luck.
If you allow occasional collisions, you have lots of options. From simple things like your suggestion (add squares of elements) and computing xor the elements, to complicated things like sort them, print them to a string, and compute MD5 on them. But since collisions are still possible, you have to verify any hash match by comparing the original lists (if you keep them sorted, this is easy).
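For example, in Python a sort-based hash plus verification might look like this (a sketch, not a definitive implementation; the helper names are made up):
# Order-insensitive hash plus verification: hash the sorted contents and, on a
# hash match, still compare the sorted lists to rule out a collision.
def multiset_hash(lst):
    return hash(tuple(sorted(lst)))

def same_elements(a, b):
    if multiset_hash(a) != multiset_hash(b):   # different hashes: definitely different
        return False
    return sorted(a) == sorted(b)              # equal hashes still need verification

assert same_elements([2, 1, 3], [3, 2, 1])
assert not same_elements([2, 2, 5], [1, 4, 4])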
So you are looking for something that provides these properties:
1. If h(x1) == y1, then there is an inverse function h_inverse(y1) == x1.
2. Because the inverse function exists, there cannot be a value x2 such that x1 != x2 and h(x2) == y1.
Knuth's Multiplicative Method
In Knuth's "The Art of Computer Programming", section 6.4, a multiplicative hashing scheme is introduced as a way to write hash functions. The key is multiplied by 2654435761, a constant close to 2^32 divided by the golden ratio, to produce a hash result.
hash(i) = i * 2654435761 mod 2^32
Since 2654435761 and 2^32 have no common factors, the multiplication produces a complete mapping of the key to the hash result with no overlap. This method works pretty well if the keys have small values. Bad hash results are produced if the keys vary mostly in the upper bits. As is true in all multiplications, variations of upper digits do not influence the lower digits of the multiplication result.
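In Python this looks roughly like the following, with a 32-bit mask standing in for the mod 2^32 (knuth_hash is just an illustrative name):
# Knuth-style multiplicative hash: multiply by 2654435761 and keep the low 32 bits.
def knuth_hash(i):
    return (i * 2654435761) & 0xFFFFFFFF   # equivalent to mod 2^32

print(knuth_hash(1))   # 2654435761
print(knuth_hash(2))   # 1013904226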
Robert Jenkins' 96 bit Mix Function
Robert Jenkins has developed a hash function based on a sequence of subtraction, exclusive-or, and bit shift.
All the sources in this article are written as Java methods, where the operator '>>>' represents the concept of unsigned right shift. If the source were to be translated to C, then the Java 'int' data type should be replaced with C 'uint32_t' data type, and the Java 'long' data type should be replaced with C 'uint64_t' data type.
The following source is the mixing part of the hash function.
int mix(int a, int b, int c)
{
    a=a-b; a=a-c; a=a^(c >>> 13);
    b=b-c; b=b-a; b=b^(a << 8);
    c=c-a; c=c-b; c=c^(b >>> 13);
    a=a-b; a=a-c; a=a^(c >>> 12);
    b=b-c; b=b-a; b=b^(a << 16);
    c=c-a; c=c-b; c=c^(b >>> 5);
    a=a-b; a=a-c; a=a^(c >>> 3);
    b=b-c; b=b-a; b=b^(a << 10);
    c=c-a; c=c-b; c=c^(b >>> 15);
    return c;
}
You can read details from here
If all the elements are numbers and they have a maximum, this is not too complicated: you sort those elements and then put them together, one after the other, as digits in base maximum+1.
Hard to describe in words...
For example, if your maximum is 9 (which makes it easy to understand), you'd have:
f(2 3 9 8) = f(3 8 9 2) = 2389
If your maximum was 99, you'd have:
f(16 2 76 8) = (0)2081676
In your example with 2, 2 and 5, if you know you would never get anything higher than 5, you could "compose" the result in base 6, so that would be:
f(2 2 5) = 2*6^2 + 2*6 + 5 = 89
f(1 4 4) = 1*6^2 + 4*6 + 4 = 64
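A small Python sketch of that scheme, assuming the elements are non-negative integers no larger than max_value (the function name is just for illustration):
# Sort the elements, then read them as digits in base (max_value + 1).
def sorted_base_hash(elements, max_value):
    base = max_value + 1
    h = 0
    for e in sorted(elements):
        h = h * base + e
    return h

assert sorted_base_hash([2, 3, 9, 8], 9) == sorted_base_hash([3, 8, 9, 2], 9) == 2389
assert sorted_base_hash([2, 2, 5], 5) == 89
assert sorted_base_hash([1, 4, 4], 5) == 64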
Combining hash values is hard. I've found this approach (no explanation given, though perhaps someone will recognize it) within Boost:
template <class T>
void hash_combine(size_t& seed, T const& v)
{
seed ^= hash_value(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}
It should be fast since only shifts, additions and xors take place (apart from the actual hashing).
However, the requirement that the order of the list must not influence the end result means that you first have to sort it, which is an O(N log N) operation, so it may not fit.
Also, since it's impossible without more stringent constraints to provide a collision-free hash function, you'll still have to actually compare the sorted lists whenever the hashes are equal...
I'm trying to make a hash function so I can tell if two lists with same sizes contain the same elements.
[...] but it turned out that there are collisions
These two sentences suggest you are using the wrong tool for the job. The point of a hash (unless it is a 'perfect hash', which doesn't seem appropriate to this problem) is not to guarantee equality, or to provide a unique output for every given input. In the general case, it cannot, because there are more potential inputs than potential outputs.
Whatever hash function you choose, your hashing system is always going to have to deal with the possibility of collisions. And while different hashes imply inequality, it does not follow that equal hashes imply equality.
As regards your actual problem: a start might be to sort the list in ascending order, then use the sorted values as the exponents of successive primes in the prime decomposition of an integer. Reconstruct this integer (modulo the maximum hash value) and there is your hash value.
For example:
2 1 3
sorted becomes
1 2 3
Treating these as prime powers gives
2^1 * 3^2 * 5^3
which evaluates to
2 * 9 * 125 = 2250
giving 2250 as your hash value, which will be the same hash value as for any other ordering of 1 2 3, and also different from the hash value for any other sequence of three numbers that do not overflow the maximum hash value when computed.
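A small Python sketch of this idea; the prime list, the modulus and the function name are arbitrary illustrative choices:
# Sorted values become the exponents of successive primes; the product (taken
# modulo some maximum hash value) is the hash. Handles lists up to len(PRIMES) elements.
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19]
MAX_HASH = (1 << 61) - 1

def prime_power_hash(lst):
    h = 1
    for prime, value in zip(PRIMES, sorted(lst)):
        h = (h * pow(prime, value, MAX_HASH)) % MAX_HASH
    return h

assert prime_power_hash([2, 1, 3]) == prime_power_hash([1, 2, 3]) == 2250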
A naïve approach to solving your essential problem (comparing lists in an order-insensitive manner) is to convert all lists being compared to a set (set in Python or HashSet in Java). This is more effective than making a hash function since a perfect hash seems essential to your problem. For almost any other approach collisions are inevitable depending on input.