Perfect Hashing Time Complexity - algorithm

Would a perfect hash have delete, insert, and search in O(1) time? If so, then why don't computer scientists use perfect hashing all the time? If not, what would be the time complexity?

Would a perfect hash have delete, insert and search in O(1) time?
Yes, as you'd get only one-element buckets. So you get as many buckets as there are elements.
If so, then why don't computer scientists use perfect hashing all the time? If not, what would be the time complexity?
One of the reasons is theoretical.
A perfect hash function is injective, and defining such a function might be difficult, if not impossible.
Consider the following trivial structure:
{
    int x;
    int y;
}
Basically, you want your hash() function to give unique results for every possible combination of x and y, and this might not be possible. For trivial candidates like x + y, x * y, or x ^ y, you can always construct another input that gives the same result.
On the other hand, such a function is possible in principle (as |N^2| = |N| = aleph-null) - see the Cantor pairing function.
The other reason is practical - your hash function has to have a return type, and that return type must be able to hold every possible result of the injection. So a perfect hash of two 32-bit values needs 64 bits, of three values 3 * 32 bits, and so on (compare how the above-mentioned pairing function effectively multiplies its arguments).
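To make those two points concrete, here is a small sketch (C++; my own illustration, not part of the original answer). The Cantor pairing function is injective, and in both variants the return type has to be wider than the inputs, which is exactly the practical cost described. For genuinely full-range 32-bit inputs the Cantor value can exceed 64 bits, so treat it as a sketch only:

#include <cstdint>
#include <iostream>

// Cantor pairing function: an injective map from pairs of non-negative
// integers to a single non-negative integer.
uint64_t cantorPair(uint32_t x, uint32_t y) {
    uint64_t s = static_cast<uint64_t>(x) + y;
    return s * (s + 1) / 2 + y;
}

// Simpler alternative for fixed-width inputs: concatenate the bits.
uint64_t concatPair(uint32_t x, uint32_t y) {
    return (static_cast<uint64_t>(x) << 32) | y;
}

int main() {
    std::cout << cantorPair(3, 5) << " " << cantorPair(5, 3) << "\n";  // 41 39: distinct results
    std::cout << concatPair(3, 5) << " " << concatPair(5, 3) << "\n";  // distinct as well
}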

Related

Why do we have double hashing function as [(hash1(key) + i * hash2(key)) % TABLE_SIZE] but not simply as [(i * hash2(key)) % TABLE_SIZE]?

I learned the notation of double hashing [(hash1(key) + i * hash2(key)) % TABLE_SIZE] a couple of days ago. There is a part I couldn't understand after thinking about it and searching for an answer for days.
Why don't we discard the [hash1(key)] part from the double hashing function and simply make it as [(i * hash2(key)) % TABLE_SIZE]?
I couldn't find any downside of doing this, except all the hashcodes would start from 0 (when i = 0).
The main purpose of using double hashing, avoiding clusters, can still be achieved.
I would be super thankful if anyone can help :D
A quick summary of this answer:
There is a practical performance hit to your modified version, though it's not very large.
I think this is due to there not being as many different probe sequences as in regular double hashing, leading to some extra collisions due to the birthday paradox.
Now, the actual answer. :-)
Let's start off with some empirical analysis. What happens if you switch from the "standard" version of double hashing to the variant of double hashing that you've proposed?
I wrote a C++ program that generates uniformly-random choices of h1 and h2 values for each of the elements. It then inserts them into two different double-hashing tables, one using the normal approach and one using the variant. It repeats this process multiple times and reports the average number of probes required across each insertion. Here's what I found:
#include <iostream>
#include <vector>
#include <random>
#include <utility>
using namespace std;

/* Table size is picked to be a prime number. */
const size_t kTableSize = 5003;

/* Load factor for the hash table. */
const double kLoadFactor = 0.9;

/* Number of rounds to use. */
const size_t kNumRounds = 100000;

/* Random generator. */
static mt19937 generator;

/* Creates and returns an empty double hashing table. */
auto emptyTable(const size_t numSlots) {
    return vector<bool>(numSlots, false);
}

/* Simulation of double hashing. Each vector entry represents an item to store.
 * The first element of the pair is the value of h1(x), and the second element
 * of the pair is the value of h2(x).
 */
auto hashCodes(const size_t numItems) {
    vector<pair<size_t, size_t>> result;
    uniform_int_distribution<size_t> first(0, kTableSize - 1), second(1, kTableSize - 1);

    for (size_t i = 0; i < numItems; i++) {
        result.push_back({ first(generator), second(generator) });
    }

    return result;
}

/* Returns the probe location to use given a number of steps taken so far.
 * If modified is true, we ignore h1.
 */
size_t locationOf(size_t tableSize, size_t numProbes, size_t h1, size_t h2, bool modified) {
    size_t result = (numProbes == 0 || !modified)? h1 : 0;
    result += h2 * numProbes;
    return result % tableSize;
}

/* Performs a double-hashing insert, returning the number of probes required to
 * settle the element into its place.
 */
size_t insert(vector<bool>& table, size_t h1, size_t h2, bool modified) {
    size_t numProbes = 0;
    while (table[locationOf(table.size(), numProbes, h1, h2, modified)]) {
        numProbes++;
    }
    table[locationOf(table.size(), numProbes, h1, h2, modified)] = true;
    return numProbes + 1; // Count the original location as one probe
}

int main() {
    size_t normalProbes = 0, variantProbes = 0;
    for (size_t round = 0; round < kNumRounds; round++) {
        auto normalTable = emptyTable(kTableSize);
        auto variantTable = emptyTable(kTableSize);

        /* Insert a collection of items into the table. */
        for (auto [h1, h2]: hashCodes(kTableSize * kLoadFactor)) {
            normalProbes += insert(normalTable, h1, h2, false);
            variantProbes += insert(variantTable, h1, h2, true);
        }
    }

    cout << "Normal probes: " << normalProbes << endl;
    cout << "Variant probes: " << variantProbes << endl;
}
Output:
Normal probes: 1150241942
Variant probes: 1214644088
So, empirically, it looks like the modified approach leads to about 5% more probes being needed to place all the elements. The question, then, is why this is.
I do not have a fully-developed theoretical explanation as to why the modified version is slower, but I do have a reasonable guess as to what's going on. Intuitively, double hashing works by assigning each element that's inserted a random probe sequence, which is some permutation of the table slots. It's not a truly-random permutation, since not all permutations can be achieved, but it's random enough for some definition of "random enough" (see, for example, Guibas and Szemerédi's "The Analysis of Double Hashing").
Let's think about what happens when we do an insertion. How many times, on expectation, will we need to look at the probe sequence beyond just h1? The first item has 0 probability of needing to look at h2. The second item has 1/T probability, since it hits the first element with probability 1/T. The third item has 2/T probability, since it has a 2/T chance of hitting the first two items. More generally, using linearity of expectation, we can show that the expected number of times an item will land in a spot that's already taken is given by
1/T + 2/T + 3/T + ... + (n-1)/T
= (1 + 2 + 3 + ... + (n-1)) / T
= n(n-1) / 2T
Now, let's imagine that the load factor on our hash table is α, meaning that αT = n. Then the expected number of collisions works out to
αT(αT - 1) / 2T
≈ α²T / 2.
In other words, we should expect to see a pretty decent number of times where we need to inspect h2 when using double hashing.
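For the parameters used in the simulation above (α = 0.9, T = 5003), that works out to roughly 0.9² · 5003 / 2 ≈ 2026 expected collisions over the course of filling one table.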
Now, what happens when we look at the probe sequence? The number of different probe sequences using traditional double hashing is T(T-1), where T is the number of slots in the table. This is because there are T possible choices of h1(x) and T-1 choices for h2(x).
The math behind the birthday paradox says that once approximately √(2T(T-1)) ≈ T√2 items have been inserted into the table, we have a 50% chance that two of them will end up having the same probe sequence assigned. The good news here is that it's not possible to insert T√2 items into a T-slot hash table - that's more elements than slots! - and so there's a fairly low probability that we see elements that get assigned the same probe sequences. That means that the conflicts we get in the table are mostly due to collisions between elements that have different probe sequences, but coincidentally end up landing smack on top of one another.
On the other hand, let's think about your variation. Technically speaking, there are still T(T-1) distinct probe sequences. However, I'm going to argue that there are "effectively" more like only T-1 distinct probe sequences. The reason for this is that
probe sequences don't really matter unless you have a collision when you do an insertion, and
once there's a collision after an insertion, the probe sequence for a given element is determined purely by its h2 value.
This is not a rigorous argument - it's more of an intuition - but it shows that we have less variation in how our permutations get chosen.
Because there are only T-1 different probe sequences to pick from once we've had a collision, the birthday paradox says that we need to see about √(2T) collisions before we find two items with identical values of h2. And indeed, given that we see, on expectation, α²T/2 items that need to have h2 inspected, this means that we have a very good chance of finding items whose positions will be determined by the exact same sequence of h2 values. This means that we have a new source of collisions compared with "classical" double hashing: collisions from h2 values overlapping one another.
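To put numbers on that: with T = 5003, the birthday threshold is about √(2 · 5003) ≈ 100 collisions, while the estimate above says we expect on the order of 2026 of them per table, so repeated h2 values are essentially guaranteed.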
Now, even if we do have collisions with h2 values, it's not a huge deal. After all, it'll only take a few extra probes to skip past items placed with the same h2 values before we get to new slots. But I think that this might be the source of the extra probes seen during the modified version.
Hope this helps!
The proof of double hashing goes through under some weak assumptions about h1 and h2, namely, they're drawn from universal families. The result you get is that every operation is expected constant time (for every access that doesn't depend on the choice of h1 and h2).
With h2 only, you need to either strengthen the condition on h2 or give up the time bound as stated. Pick a prime p congruent to 1 mod 4, let P* = {1, …, p−1} be the set of units mod p, and consider the following mediocre but universal family of hash functions from {L, R} × P* to P*. Draw a random scalar c ← P* and a random function f ← (P* → P*) and define
h2((L, x)) = cx mod p
h2((R, x)) = f(x).
This family is universal because the (L, x) keys never collide with each other, and every other pair collides with probability exactly 1/|P*|. It's a bad choice for the double-hashing-with-only-h2 algorithm because it's linear on half of its range, and linearity preserves arithmetic sequences.
Consider the following sequence of operations. Fill half of the hash table at random by inserting (R, 1), …, (R, (p−1)/2), then insert half again as many elements (L, (p−1)/4), …, (L, 1). The table load is at most 3/4, so everything should run in expected constant time. Consider what happens, however, when we insert (L, 1). With probability 1/2, the location h2((L, 1)) is occupied by one of the R keys. The ith probe, i·h2((L, 1)), hits the same location as h2((L, i)), which for i ≤ (p−1)/4 is guaranteed to be full by earlier operations. Therefore the expected cost of this operation is linear in p, even though the sequence of keys didn't depend on the hash function, which is unacceptable.
Putting h1 back in the mix smashes this structure.
(Ugh this didn't quite work because the proof of expected constant time assumes strong universality, not universality as stated in the abstract.)
Taking another bite at the apple, this time with strong universality. Leaving my other answer up because this one uses a result by Patrascu and Thorup as a black box (and your choice of some deep number theory or some handwaving), which is perhaps less satisfying. The result is about linear probing, namely, that for every table size m that's a power of 4, there exists a 2-universal (i.e., strongly universal) hash family and a sequence of operations such that, in expectation over the random hash function, one operation (referred to as The Query) probes Θ(√m) cells.
In order to use this result, we'd really like a table of size p−1 where p is a prime, so fixing m and the bad-for-linear-probing hash family H_m (whose functions have codomain {0, …, m−1}), choose p to be the least prime greater than m. (Alternatively, the requirement that m be a power of 4 is basically for convenience in writing up the proof; it seems tedious but possible to generalize Patrascu and Thorup's result to other table sizes.) Define the distribution H_p by drawing a function h'' ← H_m and then defining each value of h' independently according to the distribution
h'(x) = h''(x) + 1   with probability m/(p−1)
      = m + 1        with probability 1/(p−1)
        ...
      = p − 1        with probability 1/(p−1)
Letting K be the field mod p, the functions h' have codomain K* = {1, …, p−1}, the units of K. Unless I botched the definition, it's straightforward to verify that H_p is strongly universal. We need to pull in some heavy-duty number theory to show that p − m is O(m^(2/3)) (this follows from the existence of primes between sufficiently large successive cubes), which means that our linear probe sequence of length O(√m) for The Query remains intact with constant probability for Ω(m^(1/3)) steps, well more than constant.
Now, in order to change this family from a linear probing wrecker to a "double" hash wrecker, we need to give the key involved in The Query a name, let's say q. (We know for sure which one it is because the operation sequence doesn't depend on the hash function.) We define a new distribution of hash functions h by drawing h' as before, drawing c ← K*, and then defining
h(x) = c h'(x)   if x ≠ q
     = c         if x = q
Let's verify that this is still 2-universal. Given keys x and y both not equal to q, it's clear that (h(x), h(y)) = (c h'(x), c h'(y)) has uniform distribution over K* × K*. Given q and some other key x, we examine the distribution of (h(q), h(x)) = (c, c h'(x)), which is as needed because c is uniform, and h'(x) is uniform and independent of c, hence so is c h'(x).
OK, the point of this exercise at last. The probe sequence for The Query will be c, 2c, 3c, etc. Which keys hash to (e.g.) 2c? They are the x's that satisfy the equation
h(x) = c h'(x) = 2c
from which we derive
h'(x) = 2,
i.e., the keys whose preferred slot is right after The Query's in linear probe order. Generalizing from 2 to i, we conclude that the bad linear probe sequence for The Query for h' becomes a bad "double" hashing probe sequence for The Query for h, QED.

Design of a data structure that can search over objects with 2 attributes

I'm trying to think of a way to design a data structure that I can efficiently insert into, remove from, and search in.
The catch is that the search function gets a similar object as input, with 2 attributes, and I need to find an object in my dataset such that both the 1st and 2nd attributes of the object in my dataset are equal to or bigger than those of the search function's input.
So for example, if I send as input, the following object:
object[a] = 9; object[b] = 14
Then a valid found object could be:
object[a] = 9; object[b] = 79
but not:
object[a] = 8; object[b] = 28
Is there any way to store the data such that the search complexity is better than linear?
EDIT:
I forgot to include this in my original question. The search has to return the smallest possible object in the dataset, by multiplication of the 2 attributes.
Meaning that the value of object[a]*object[b] of an object that fits the original condition is smaller than that of any other object in the dataset that also fits.
You may want to use the k-d tree data structure, which is typically used to index k-dimensional points. A search operation like the one you describe takes O(log n) on average.
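As a rough illustration of that idea, here is a minimal 2-d tree sketch (C++; all names are mine). The pruning only skips subtrees that cannot contain a qualifying point and does not prune on the product, so it is not guaranteed O(log n); it is a sketch of the approach, not a tuned implementation:

#include <algorithm>
#include <iostream>
#include <memory>
#include <vector>

// Item with the two attributes from the question (names are illustrative).
struct Item { long a, b; };

struct Node {
    Item item;
    std::unique_ptr<Node> left, right;
};

// Build a balanced 2-d tree, splitting alternately on a (even depth) and b (odd depth).
std::unique_ptr<Node> build(std::vector<Item>& pts, int lo, int hi, int depth) {
    if (lo >= hi) return nullptr;
    int mid = (lo + hi) / 2;
    auto byA = [](const Item& x, const Item& y) { return x.a < y.a; };
    auto byB = [](const Item& x, const Item& y) { return x.b < y.b; };
    if (depth % 2 == 0)
        std::nth_element(pts.begin() + lo, pts.begin() + mid, pts.begin() + hi, byA);
    else
        std::nth_element(pts.begin() + lo, pts.begin() + mid, pts.begin() + hi, byB);
    auto node = std::make_unique<Node>();
    node->item = pts[mid];
    node->left = build(pts, lo, mid, depth + 1);
    node->right = build(pts, mid + 1, hi, depth + 1);
    return node;
}

// Among items with a >= qa and b >= qb, track the one with the smallest a * b.
void query(const Node* n, long qa, long qb, int depth, const Item*& best) {
    if (!n) return;
    const Item& it = n->item;
    if (it.a >= qa && it.b >= qb && (!best || it.a * it.b < best->a * best->b))
        best = &it;
    long split = (depth % 2 == 0) ? it.a : it.b;
    long bound = (depth % 2 == 0) ? qa : qb;
    query(n->right.get(), qa, qb, depth + 1, best);    // larger coordinates: always worth visiting
    if (split >= bound)                                // smaller coordinates can only qualify if
        query(n->left.get(), qa, qb, depth + 1, best); // the split value itself meets the bound
}

int main() {
    std::vector<Item> data = {{9, 79}, {8, 28}, {12, 15}, {9, 14}};
    auto root = build(data, 0, static_cast<int>(data.size()), 0);
    const Item* best = nullptr;
    query(root.get(), 9, 14, 0, best);
    if (best) std::cout << best->a << " " << best->b << "\n";  // 9 14 (smallest product, 126)
}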
This approach may help when the attributes are hierarchically linked, like name and forename. For points in a 2D space, a k-d tree is better suited, as explained by fajarkoe.
class Person {
    string name;
    string forename;
    // ... other non-key attributes
}
You have to write a comparator function which takes two objects of the class as input and returns -1, 0 or +1 for the <, = and > cases.
Libraries like glibc, with qsort() and bsearch(), or higher-level languages like Java with its java.util.Comparator interface and java.util.SortedMap containers (the java.util.TreeMap implementation, for instance), use comparators.
Other languages use equivalent concepts.
The comparator method may be written following your spec like:
int compare( Person left, Person right ) {
    if( left.name < right.name ) {
        return -1;
    }
    if( left.name > right.name ) {
        return +1;
    }
    if( left.forename < right.forename ) {
        return -1;
    }
    if( left.forename > right.forename ) {
        return +1;
    }
    return 0;
}
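For illustration only, here is roughly how such a -1/0/+1 comparator plugs into the C library's qsort() and bsearch() (the fixed-size char arrays below are my own assumption, chosen so the struct stays trivially copyable, which qsort() requires):

#include <cstdlib>
#include <cstring>
#include <iostream>

// Hypothetical Person layout for the sketch; not from the original answer.
struct Person {
    char name[32];
    char forename[32];
};

int comparePerson(const void* l, const void* r) {
    const Person* left  = static_cast<const Person*>(l);
    const Person* right = static_cast<const Person*>(r);
    if (int c = std::strcmp(left->name, right->name)) return c < 0 ? -1 : +1;
    if (int c = std::strcmp(left->forename, right->forename)) return c < 0 ? -1 : +1;
    return 0;
}

int main() {
    Person people[] = {{"Smith", "Anna"}, {"Doe", "John"}, {"Doe", "Jane"}};
    const size_t n = sizeof(people) / sizeof(people[0]);
    std::qsort(people, n, sizeof(Person), comparePerson);     // sort by (name, forename)
    Person key = {"Doe", "John"};
    const Person* hit = static_cast<const Person*>(
        std::bsearch(&key, people, n, sizeof(Person), comparePerson));
    std::cout << (hit ? "found" : "not found") << "\n";       // prints "found"
}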
Complexity of qsort()
Quicksort, or partition-exchange sort, is a sorting algorithm
developed by Tony Hoare that, on average, makes O(n log n) comparisons
to sort n items. In the worst case, it makes O(n²) comparisons, though
this behavior is rare. Quicksort is often faster in practice than
other O(n log n) algorithms. Additionally, quicksort's sequential
and localized memory references work well with a cache. Quicksort is a
comparison sort and, in efficient implementations, is not a stable
sort. Quicksort can be implemented with an in-place partitioning
algorithm, so the entire sort can be done with only O(log n)
additional space used by the stack during the recursion.
Complexity of bsearch()
If the list to be searched contains more than a few items (a dozen,
say) a binary search will require far fewer comparisons than a linear
search, but it imposes the requirement that the list be sorted.
Similarly, a hash search can be faster than a binary search but
imposes still greater requirements. If the contents of the array are
modified between searches, maintaining these requirements may even
take more time than the searches. And if it is known that some items
will be searched for much more often than others, and it can be
arranged so that these items are at the start of the list, then a
linear search may be the best.

A function to reproduce the same number for each input

say you have some data consisting of 2 columns and 1 billion rows, like:
0,0
1,0
2,3
3,2
etc
I want to create a function that will always give what's in column 2 when given an input from column 1, so that it maps values from column 1 to column 2 the same way they appear in the data.
Column 1 is sequential from 0 to 1E9 (one billion)
Column 2 can ONLY be {0,1,2,3}
I don't want to just store the data in an array.. I want code that can calculate this map.
Any ideas?
Thanks in advance
If the keys are dense, a 1d array should be fine, where weights[key] = weight.
Otherwise, a lookup structure such as a dictionary would work if the keys are sparse.
Not sure if you also needed help with the random part, but a cumulative sum and a rand(sum(weights)) will select randomly with a bias toward numbers with larger weights.
Edited for clarity: weights is the array.
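A small sketch of both ideas (C++; the data and names are illustrative): weights[key] is the direct dense-array lookup, and the cumulative sum plus a random draw shows the biased selection mentioned above:

#include <algorithm>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

int main() {
    std::vector<int> weights = {0, 0, 3, 2};       // column 2, indexed by column 1

    // Direct map: key -> value.
    std::cout << weights[2] << "\n";               // prints 3

    // Weighted random selection via cumulative sums.
    std::vector<int> cumulative(weights.size());
    std::partial_sum(weights.begin(), weights.end(), cumulative.begin());
    std::mt19937 gen(std::random_device{}());
    std::uniform_int_distribution<int> dist(1, cumulative.back());
    int r = dist(gen);
    auto key = std::upper_bound(cumulative.begin(), cumulative.end(), r - 1)
               - cumulative.begin();
    std::cout << "picked key " << key << "\n";     // 2 with prob 3/5, 3 with prob 2/5
}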
Assuming #munch1324 is correct, and the problem is:
Given a collection of 1000 data points, dynamically generate a function that matches the data set.
then yes, I think it is possible. However, if your goal is for the function to be a more compact representation of the data collection, then I think you are out of luck.
Here are two possibilities:
Piecewise-defined function
int foo(int x)
{
    if (x == 0) return 0;
    if (x == 1) return 0;
    if (x == 2) return 3;
    if (x == 3) return 2;
    ...
}
Polynomial interpolation
N data points can be fit exactly by a polynomial of degree N-1.
Given the collection of 1000 data points, use your favorite method to solve for the 1000 coefficients of a 999-degree polynomial.
Your resulting function would then be:
int[] c; // Array of 1000 polynomial coefficients that you solved for when given the data collection
...
int foo(int x)
{
    return c[999]*x^999 + c[998]*x^998 + ... + c[1]*x + c[0];
}
This has obvious issues, because you have 1000 coefficients to store, and will have numerical issues raising x values to such high powers.
If you are looking for something a little more advanced, the Lagrange polynomial will give you the polynomial of least degree that fits all of your data points.
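If it helps, here is a small sketch of evaluating the Lagrange interpolating polynomial directly (C++, double precision, using the question's four sample rows as data; purely illustrative, and, as noted above, numerically fragile for anything like 1000 points):

#include <iostream>
#include <vector>

// Evaluate the Lagrange interpolating polynomial through (xs[i], ys[i]) at x.
double lagrangeEval(const std::vector<double>& xs, const std::vector<double>& ys, double x) {
    double result = 0.0;
    for (size_t i = 0; i < xs.size(); ++i) {
        double term = ys[i];
        for (size_t j = 0; j < xs.size(); ++j)
            if (j != i)
                term *= (x - xs[j]) / (xs[i] - xs[j]);
        result += term;
    }
    return result;
}

int main() {
    std::vector<double> xs = {0, 1, 2, 3}, ys = {0, 0, 3, 2};  // sample rows from the question
    for (double x : xs)
        std::cout << lagrangeEval(xs, ys, x) << " ";           // reproduces 0 0 3 2
    std::cout << "\n";
}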

Lists Hash function

I'm trying to make a hash function so I can tell if two lists with the same size contain the same elements.
For example, this is what I want:
f((1 2 3))=f((1 3 2))=f((2 1 3))=f((2 3 1))=f((3 1 2))=f((3 2 1)).
Any idea how I can approach this problem? I've tried doing the sum of squares of all elements, but it turned out that there are collisions; for example f((2 2 5)) = 33 = f((1 4 4)), which is wrong as the lists are not the same.
I'm looking for a simple approach if there is any.
Sort the list and then:
list.each do |current_element|
  hash = (37 * hash + current_element) % MAX_HASH_VALUE
end
You're probably out of luck if you really want no collisions. There are N choose k sets of size k with elements in 1..N (and worse, if you allow repeats). So imagine you have N=256, k=8, then N choose k is ~4 x 10^14. You'd need a very large integer to distinctly hash all of these sets.
Possibly you have N, k such that you could still make this work. Good luck.
If you allow occasional collisions, you have lots of options: from simple things like your suggestion (adding the squares of the elements) or xor'ing the elements together, to complicated things like sorting them, printing them to a string, and computing MD5 on that. But since collisions are still possible, you have to verify any hash match by comparing the original lists (if you keep them sorted, this is easy).
So you are looking for something that provides these properties:
1. If h(x1) == y1, then there is an inverse function h_inverse(y1) == x1
2. Because the inverse function exists, there cannot be a value x2 such that x1 != x2, and h(x2) == y1.
Knuth's Multiplicative Method
In Knuth's "The Art of Computer Programming", section 6.4, a multiplicative hashing scheme is introduced as a way to write hash functions. The key is multiplied by the golden ratio of 2^32 (2654435761) to produce a hash result.
hash(i)=i*2654435761 mod 2^32
Since 2654435761 and 2^32 have no common factors, the multiplication produces a complete mapping of the key to hash result with no overlap. This method works pretty well if the keys have small values. Bad hash results are produced if the keys vary in the upper bits. As is true in all multiplications, variations of upper digits do not influence the lower digits of the multiplication result.
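As a quick sketch (C++): unsigned 32-bit arithmetic wraps mod 2^32, so the whole scheme is a single multiplication with no explicit modulo:

#include <cstdint>

// Knuth-style multiplicative hash: uint32_t overflow is well defined
// and wraps mod 2^32, so no explicit modulo is needed.
uint32_t multiplicativeHash(uint32_t key) {
    return key * 2654435761u;
}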
Robert Jenkins' 96 bit Mix Function
Robert Jenkins has developed a hash function based on a sequence of subtraction, exclusive-or, and bit shift.
All the sources in this article are written as Java methods, where the operator '>>>' represents the concept of unsigned right shift. If the source were to be translated to C, then the Java 'int' data type should be replaced with C 'uint32_t' data type, and the Java 'long' data type should be replaced with C 'uint64_t' data type.
The following source is the mixing part of the hash function.
int mix(int a, int b, int c)
{
    a=a-b; a=a-c; a=a^(c >>> 13);
    b=b-c; b=b-a; b=b^(a << 8);
    c=c-a; c=c-b; c=c^(b >>> 13);
    a=a-b; a=a-c; a=a^(c >>> 12);
    b=b-c; b=b-a; b=b^(a << 16);
    c=c-a; c=c-b; c=c^(b >>> 5);
    a=a-b; a=a-c; a=a^(c >>> 3);
    b=b-c; b=b-a; b=b^(a << 10);
    c=c-a; c=c-b; c=c^(b >>> 15);
    return c;
}
You can read details from here
If all the elements are numbers and they have a maximum, this is not too complicated: you sort those elements and then you put them together one after the other in base maximum+1.
Hard to describe in words...
For example, if your maximum is 9 (that makes it easy to understand), you'd have :
f(2 3 9 8) = f(3 8 9 2) = 2389
If your maximum was 99, you'd have:
f(16 2 76 8) = (0)2081676
In your example with 2,2 and 5, if you know you would never get anything higher than 5, you could "compose" the result in base 6, so that would be :
f(2 2 5) = 2*6^2 + 2*6 + 5 = 89
f(1 4 4) = 1*6^2 + 4*6 + 4 = 64
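A small sketch of this scheme (C++; names are mine), assuming the elements are non-negative and bounded by maxValue. Note the result grows quickly and will overflow 64 bits for long lists or large maxima:

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

// Sort the elements, then treat them as digits in base (maxValue + 1).
uint64_t orderInsensitiveHash(std::vector<uint64_t> elems, uint64_t maxValue) {
    std::sort(elems.begin(), elems.end());
    uint64_t base = maxValue + 1, result = 0;
    for (uint64_t e : elems)
        result = result * base + e;   // overflows for long lists / large maxima
    return result;
}

int main() {
    std::cout << orderInsensitiveHash({2, 2, 5}, 5) << "\n";  // 89, as in the example above
    std::cout << orderInsensitiveHash({1, 4, 4}, 5) << "\n";  // 64
}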
Combining hash values is hard, I've found this way (no explanation, though perhaps someone would recognize it) within Boost:
template <class T>
void hash_combine(size_t& seed, T const& v)
{
    seed ^= hash_value(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}
It should be fast since there is only shifting, additions and xor taking place (apart from the actual hashing).
However, the requirement that the order of the list must not influence the end result means that you first have to sort it, which is an O(N log N) operation, so it may not fit.
Also, since it's impossible without more stringent bounds to provide a collision-free hash function, you'll still have to actually compare the sorted lists if ever the hashes are equal...
I'm trying to make a hash function so I can tell if two lists with same sizes contain the same elements.
[...] but it turned out that there are collisions
These two sentences suggest you are using the wrong tool for the job. The point of a hash (unless it is a 'perfect hash', which doesn't seem appropriate to this problem) is not to guarantee equality, or to provide a unique output for every given input. In the general case, it cannot, because there are more potential inputs than potential outputs.
Whatever hash function you choose, your hashing system is always going to have to deal with the possibility of collisions. And while different hashes imply inequality, it does not follow that equal hashes imply equality.
As regards your actual problem: a start might be to sort the list in ascending order, then use the sorted values as the exponents of successive primes in the prime decomposition of an integer. Reconstruct this integer (modulo the maximum hash value) and there is your hash value.
For example:
2 1 3
sorted becomes
1 2 3
Treating these as the exponents of successive primes gives
2^1 · 3^2 · 5^3
which works out to
2 · 9 · 125 = 2250
giving 2250 as your hash value, which will be the same hash value as for any other ordering of 1 2 3, and also different from the hash value for any other sequence of three numbers that does not overflow the maximum hash value when computed.
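Here is a minimal sketch of that idea (C++; the prime list, the modulus and the names are my own illustrative choices):

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

// Sort the values, use them as exponents of consecutive primes, and take the
// product modulo a large prime standing in for "the maximum hash value".
uint64_t primeExponentHash(std::vector<uint64_t> values, uint64_t mod = 1000000007ULL) {
    static const uint64_t primes[] = {2, 3, 5, 7, 11, 13, 17, 19};
    std::sort(values.begin(), values.end());
    uint64_t h = 1;
    for (size_t i = 0; i < values.size() && i < 8; ++i)
        for (uint64_t e = 0; e < values[i]; ++e)   // naive exponentiation; fine for small values
            h = (h * primes[i]) % mod;
    return h;
}

int main() {
    std::cout << primeExponentHash({2, 1, 3}) << "\n";  // 2250, same for every ordering of 1 2 3
    std::cout << primeExponentHash({3, 2, 1}) << "\n";  // 2250
}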
A naïve approach to solving your essential problem (comparing lists in an order-insensitive manner) is to convert all lists being compared to a set (set in Python or HashSet in Java). This is more effective than making a hash function since a perfect hash seems essential to your problem. For almost any other approach collisions are inevitable depending on input.

Good hash function for permutations?

I have got numbers in a specific range (usually from 0 to about 1000). An algorithm selects some numbers from this range (about 3 to 10 numbers). This selection is done quite often, and I need to check if a permutation of the chosen numbers has already been selected.
e.g. one step selects [1, 10, 3, 18] and another one [10, 18, 3, 1]; then the second selection can be discarded because it is a permutation.
I need to do this check very fast. Right now I put all arrays in a hashmap, and use a custom hash function: just sums up all the elements, so 1+10+3+18=32, and also 10+18+3+1=32. For equals I use a bitset to quickly check if elements are in both sets (I do not need sorting when using the bitset, but it only works when the range of numbers is known and not too big).
This works ok, but can generate lots of collisions, so the equals() method is called quite often. I was wondering if there is a faster way to check for permutations?
Are there any good hash functions for permutations?
UPDATE
I have done a little benchmark: generate all combinations of numbers in the range 0 to 6, and array length 1 to 9. There are 3003 possible permutations, and a good hash should generate close to this many different hashes (I use 32-bit numbers for the hash):
41 different hashes for just adding (so there are lots of collisions)
8 different hashes for XOR'ing values together
286 different hashes for multiplying
3003 different hashes for (R + 2e) and multiplying as abc has suggested (using 1779033703 for R)
So abc's hash can be calculated very fast and is a lot better than all the rest. Thanks!
PS: I do not want to sort the values when I do not have to, because this would get too slow.
One potential candidate might be this.
Fix an odd integer R.
For each element e you want to hash compute the factor (R + 2*e).
Then compute the product of all these factors.
Finally divide the product by 2 to get the hash.
The factor 2 in (R + 2e) guarantees that all factors are odd, so the product can never become 0. The division by 2 at the end is because the product will always be odd, hence the division just removes a constant bit.
E.g. I choose R = 1779033703. This is an arbitrary choice, doing some experiments should show if a given R is good or bad. Assume your values are [1, 10, 3, 18].
The product (computed using 32-bit ints) is
(R + 2) * (R + 20) * (R + 6) * (R + 36) = 3376724311
Hence the hash would be
3376724311/2 = 1688362155.
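For concreteness, a sketch of this hash in C++ (uint32_t arithmetic wraps mod 2^32, matching the 32-bit products described above):

#include <cstdint>
#include <iostream>
#include <vector>

// Product-of-odd-factors hash: order-independent because multiplication commutes.
uint32_t productHash(const std::vector<uint32_t>& values, uint32_t r = 1779033703u) {
    uint32_t product = 1;
    for (uint32_t e : values)
        product *= r + 2 * e;   // every factor is odd, so the product never becomes 0
    return product / 2;         // the product is always odd; drop the constant low bit
}

int main() {
    std::cout << productHash({1, 10, 3, 18}) << "\n";
    std::cout << productHash({10, 18, 3, 1}) << "\n";  // identical to the line above
}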
Summing the elements is already one of the simplest things you could do. But I don't think it's a particularly good hash function w.r.t. pseudo randomness.
If you sort your arrays before storing them or computing hashes, every good hash function will do.
If it's about speed: Have you measured where the bottleneck is? If your hash function is giving you a lot of collisions and you have to spend most of the time comparing the arrays bit-by-bit the hash function is obviously not good at what it's supposed to do. Sorting + Better Hash might be the solution.
If I understand your question correctly you want to test equality between sets where the items are not ordered. This is precisely what a Bloom filter will do for you. At the expense of a small number of false positives (in which case you'll need to make a call to a brute-force set comparison) you'll be able to compare such sets by checking whether their Bloom filter hash is equal.
The algebraic reason why this holds is that the OR operation is commutative. This holds for other semirings, too.
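A minimal sketch of that suggestion (C++; the filter size, the number of bit positions per element and the mixing constants are arbitrary illustrative choices, and note that duplicates collapse, since a plain Bloom filter represents a set rather than a multiset):

#include <bitset>
#include <cstdint>
#include <iostream>
#include <vector>

const std::size_t kBits = 256;   // filter size, illustrative

// Set a few bit positions per element; OR-ing bits is commutative, so the
// resulting filter does not depend on the order of the elements.
std::bitset<kBits> bloomOf(const std::vector<uint64_t>& items) {
    std::bitset<kBits> filter;
    for (uint64_t x : items) {
        uint64_t h1 = x * 0x9e3779b97f4a7c15ULL;
        uint64_t h2 = (x * 0xc2b2ae3d27d4eb4fULL) | 1;
        for (uint64_t i = 0; i < 3; ++i)
            filter.set((h1 + i * h2) % kBits);
    }
    return filter;
}

int main() {
    std::cout << (bloomOf({1, 10, 3, 18}) == bloomOf({10, 18, 3, 1})) << "\n";  // 1: permutations match
    std::cout << (bloomOf({1, 10, 3, 18}) == bloomOf({1, 10, 3, 19})) << "\n";  // almost surely 0
}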
Depending on whether you have a lot of collisions (i.e. the same hash but not a permutation), you might presort the arrays while hashing them. In that case you can do a more aggressive kind of hashing where you don't only add up the numbers but add some bit magic to it as well to get quite different hashes.
This is only beneficial if you get loads of unwanted collisions because the hash you are doing now is too poor. If you hardly get any collisions, the method you are using seems fine.
I would suggest this:
1. Check if the lengths of the permutations are the same (if not, they are not equal).
2. Sort only one array. Instead of sorting the other array, iterate through the elements of the 1st array and search for the presence of each of them in the 2nd array (compare only while the elements in the 2nd array are smaller - do not iterate through the whole array).
Note: if you can have the same numbers in your permutations (e.g. [1,2,2,10]), then you will need to remove elements from the 2nd array when they match a member from the 1st one.
pseudo-code:
if length(arr1) <> length(arr2) return false;
sort(arr2);
for i=1 to length(arr1) {
    elem = arr1[i];
    j = 1;
    while (j <= length(arr2) and arr2[j] < elem) j = j + 1;
    if j > length(arr2) or elem <> arr2[j] return false;
}
return true;
The idea is that instead of sorting the other array, we can just try to match all of its elements against the sorted array.
You can probably reduce the collisions a lot by using the product as well as the sum of the terms.
1*10*3*18=540 and 10*18*3*1=540
so the sum-product hash would be [32,540]
you still need to do something about collisions when they do happen though
I like using a string's default hash code (Java, C#; not sure about other languages), as it generates pretty unique hash codes.
So first sort the array, and then generate a unique string from it using some delimiter.
You can then do the following (Java):
int[] arr = selectRandomNumbers();
Arrays.sort(arr);
int hash = (arr[0] + "," + arr[1] + "," + arr[2] + "," + arr[3]).hashCode();
If performance is an issue, you can change the suggested inefficient string concatenation to use StringBuilder or String.format:
String.format("%d,%d,%d,%d", arr[0], arr[1], arr[2], arr[3]);
A string's hash code of course doesn't guarantee that two distinct strings have different hashes, but considering the suggested formatting, collisions should be extremely rare.
