Order-preserving hash function - algorithm

Is there any hash function with unique hash codes (like MD5) that is also order-preserving?
NOTE:
I don't care about security; I need it for sorting. I have a lot of chunks (~1 MB each) and I want to sort them. Of course I can use an index sort, but I want to reduce the compare time.
Theoretically:
If I have 1'000'000 chunks of 1 MB (1'048'576 bytes) and all of them differ only in the last 10 bytes, then comparing one chunk to another takes O(k-10) time, and if I use QuickSort (which makes ~n*log2(n) compares) the total compare time will be n*log2(n)*(k-10) (where k is the chunk size):
1'000'000 * 20 * (1'048'576 - 10)
That's why I want to generate order-preserving hash codes of fixed size (for example, 16 bytes), sort the chunks once, and save the result (for example, to a file).

CHM (Z.J. Czech, G. Havas, and B.S. Majewski) is an algorithm which generates a minimal perfect hash that preserves ordering (i.e. if A < B, then h(A) < h(B)). It uses approximately 8 bytes of storage per key.
See: http://cmph.sourceforge.net/chm.html
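For intuition, the property CHM guarantees can be illustrated with a naive stand-in (this is not the CHM graph construction; it just assigns each key its rank in sorted order, which trivially yields a minimal perfect order-preserving hash at the cost of storing the whole key set):

```python
def build_opmph(keys):
    # Rank of each key in sorted order: if a < b then h(a) < h(b),
    # and the hash values are exactly 0 .. len(keys)-1 (minimal, perfect).
    return {k: i for i, k in enumerate(sorted(keys))}

h = build_opmph([b"banana", b"apple", b"cherry"])
# h[b"apple"] < h[b"banana"] < h[b"cherry"]
```

CHM achieves the same mapping far more compactly (roughly 8 bytes per key) via a graph-based construction, without storing the keys themselves.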

In the general case, such a function is impossible unless the size of the hash is at least the size of the object.
The argument is simple: if there are N objects but only M < N hash values, then by the pigeonhole principle two different objects map to the same hash value, so their order is not preserved.
If however we have additional properties of the objects guaranteed or the requirements relaxed, a custom or probabilistic solution may become possible.

According to NIST (I'm no expert) a Pearson hash can be order-preserving. The hash uses an auxiliary table. Such a table can (in theory) be constructed so that the resulting hash is order-preserving.
It doesn't meet your full requirements, though, because it doesn't reduce the size as you would like. I'm posting this in case other people are looking for a solution.
Some pointers:
The NIST page: http://xlinux.nist.gov/dads/HTML/pearsonshash.html
Wikipedia: http://en.wikipedia.org/wiki/Pearson_hashing
The original Pearson Hash paper: http://cs.mwsu.edu/~griffin/courses/2133/downloads/Spring11/p677-pearson.pdf
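For reference, the basic (non-order-preserving) Pearson mechanism is tiny; the theoretical point above is that, for a known key set, the permutation table T could be chosen so the outputs come out in key order. A sketch of the standard version:

```python
import random

random.seed(0)            # fixed seed only to make the table reproducible here
T = list(range(256))
random.shuffle(T)         # the auxiliary permutation table

def pearson_hash(data: bytes) -> int:
    h = 0
    for b in data:
        h = T[h ^ b]      # one table lookup per input byte
    return h              # a single byte; run with several tables for a wider hash
```

Note the output is only 8 bits, which is one reason it cannot meet the 16-byte, collision-resistant requirement on its own.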

Sorting an array of N strings, each of length K, can be done in just O(NK) or O(N^2 + NK) character comparisons.
For example, construct a trie.
Or do a kind of insertion sort. Construct the sorted set of strings S by adding strings to it one by one. For each new string P, traverse it while maintaining the (non-decreasing) index of the greatest string Q in S such that Q <= P. When P ends, insert it into S just after Q. Each of the O(N) insertions can be done in O(N+K) operations: the index increases at most O(N) times in total, spread across the K characters traversed.
When you have indices of the strings in sorted order, just use them for your purposes instead of the "hashes" you want.
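A minimal sketch of the trie approach for byte strings (an end-marker handles the case where one string is a prefix of another; the `sorted` call inside walks at most 256 children per node, so the total work stays O(NK) up to alphabet-size factors):

```python
def trie_sort(strings):
    END = object()                     # marks "a string ends here"
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node[END] = node.get(END, 0) + 1
    out = []

    def walk(node, prefix):
        if END in node:
            out.extend([bytes(prefix)] * node[END])
        for ch in sorted(k for k in node if k is not END):
            walk(node[ch], prefix + [ch])

    walk(root, [])
    return out
```

`trie_sort([b"ba", b"ab", b"ab"])` returns `[b"ab", b"ab", b"ba"]`.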

Let's construct such a function from the requirements:
You want a function that outputs a 16 byte hash. So you will have collisions. You can't preserve perfect order and you don't want to. Best you can do is:
H(x) < H(y) => x < y
H(x) > H(y) => x > y
Values close to each other will have the same hash.
For each x there is an i_x > 0 so that H(x) = H(x + i_x) < H(x + i_x + 1). (Except for the end where x + i_x + 1 would overflow your 1MB chunks.)
Extending that you get: H(x) < H(x + i_x + n) for any n > 0.
Same argument works for j_x > 0 in the other direction. Combine them and you get:
H(x - j_x) == H(x - j_x + 1) == ... == H(x + i_x - 1) == H(x + i_x)
Or in other words for each hash value there is a single segment [a, b] mapping to the same value. No value outside this segment can have the same hash value or the ordering would be violated.
Your hash function can then be described by the segments you choose:
Let a_i be 1 MB chunks with 0 <= i < 256^16 and a_i <= a_(i+1). Then
H(x) = i where a_i <= x < a_(i+1)
You want a more or less uniform distribution of hash values. Otherwise one value would get far more collisions than the others, and you would spend all your time doing full compares whenever that value is hit. So all the segments [a, b] should be about the same size.
The only way to have exactly the same size for each segment is to have
a_i = i * 2^(8 * (1'048'576 - 16))
or in other words: H(x) = first 16 bytes of x.
Any other order preserving hash function with a 16 byte output would be less efficient for a random set of input blocks.
And yes, if all but the last few bits of each input block are the same then every test will be a collision. That's a worst case scenario that always exists. If you know your inputs aren't uniformly random then you can adjust the size of each segment to have the same probability to be hit. But that requires knowledge of likely inputs.
Note: If you really want to sort 1'000'000 1 MB chunks and you fear such a worst case, then you can use bucket sort, resulting in 1'000'000 * 1'048'576 byte compares every time. Half that if you compare 16-bit values at a time, which still gives a reasonable number of buckets (65536).
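In code, the conclusion above amounts to: use the 16-byte prefix as the primary sort key and fall back to a full compare only on prefix collisions, which Python's tuple comparison does for us (function name is mine):

```python
def sort_chunks(chunks):
    # key = (16-byte prefix, full chunk): the full chunk is only
    # examined when two prefixes collide
    return sorted(chunks, key=lambda c: (c[:16], c))
```

The precomputed prefixes play exactly the role of the 16-byte order-preserving "hash" asked for in the question.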

In theory there is no such thing. If you want, you can create a composed hash:
index:md5
I think this will meet your needs.


Sample number with equal probability which is not part of a set

I have a number n and a set S of numbers from [1..n] with size s (which is substantially smaller than n). I want to sample a number k ∈ [1..n] with equal probability, but k is not allowed to be in the set S.
I am trying to solve the problem in at worst O(log n + s). I am not sure whether it's possible.
A naive approach is creating an array of numbers from 1 to n excluding all numbers in S and then pick one array element. This will run in O(n) and is not an option.
Another approach may be just generating random numbers ∈ [1..n] and rejecting them if they are contained in S. This has no theoretical bound, as any number in the set could be drawn repeatedly. But on average this might be a practical solution if s is substantially smaller than n.
Say s is sorted. Generate a random number between 1 and n-s, call it k. We've chosen the k'th element of {1,...,n} - s. Now we need to find it.
Use binary search on s to find the count of the elements of s <= k. This takes O(log |s|). Add this to k. In doing so, we may have passed or arrived at additional elements of s. We can adjust for this by incrementing our answer for each such element that we pass, which we find by checking the next larger element of s from the point we found in our binary search.
E.g., n = 100, s = {1,4,5,22}, and our random number is 3. So our approach should return the third element of [2,3,6,7,...,21,23,24,...,100], which is 6. Binary search finds that 1 element is at most 3, so we increment to 4. Now we compare to the next larger element of s, which is 4, so we increment to 5. Repeating this finds 5 in s, so we increment to 6. We check s once more, see that 6 isn't in it, and stop.
E.g., n = 100, s = {1,4,5,22}, and our random number is 4. So our approach should return the fourth element of [2,3,6,7,...,21,23,24,...,100], which is 7. Binary search finds that 2 elements are at most 4, so we increment to 6. Now we compare to the next larger element of s, which is 5, so we increment to 7. We check s once more, see that the next number is > 7, and stop.
If we assume that "s is substantially smaller than n" means |s| <= log(n), then we will increment at most log(n) times, and in any case at most s times.
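A Python sketch of the sorted case (the `k` parameter is exposed so the walkthroughs above can be checked deterministically; names are mine):

```python
import bisect
import random

def sample_not_in_sorted(n, s, k=None):
    # s is sorted and duplicate-free; pick the k'th element of {1..n} \ s
    if k is None:
        k = random.randint(1, n - len(s))
    i = bisect.bisect_right(s, k)      # count of elements of s that are <= k
    k += i
    # we may have passed or landed on further elements of s; skip them one by one
    while i < len(s) and s[i] <= k:
        k += 1
        i += 1
    return k
```

With n = 100 and s = [1, 4, 5, 22] this returns 6 for k = 3 and 7 for k = 4, matching the two examples above.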
If s is not sorted, then we can do the following. Create an array of bits of size s. Generate k. Parse s and do two things: 1) count the number of elements < k; call this r. At the same time, set the i'th bit to 1 if k+i is in s (0-indexed, so if k is in s then the first bit is set).
Now, increment k a number of times equal to r plus the number of set bits in the array with an index <= the number of times incremented.
E.g., n = 100, s = {1,4,5,22}, and our random number is 4. So our approach should return the fourth element of [2,3,6,7,...,21,23,24,...,100] which is 7. We parse s and 1) note that 1 element is below 4 (r=1), and 2) set our array to [1, 1, 0, 0]. We increment once for r=1 and an additional two times for the two set bits, ending up at 7.
This is O(s) time, O(s) space.
This is an O(1) solution with O(s) initial setup that works by mapping each non-allowed number > s to an allowed number <= s.
Let S be the set of non-allowed values, with elements S(i) for i = 1 .. s, where s = |S|.
Here's a two part algorithm. The first part constructs a hash table based only on S in O(s) time, the second part finds the random value k ∈ {1..n}, k ∉ S in O(1) time, assuming we can generate a uniform random number in a contiguous range in constant time. The hash table can be reused for new random values and also for new n (assuming S ⊂ { 1 .. n } still holds of course).
To construct the hash H: first set j = 1. Then iterate over S(i), the elements of S (they do not need to be sorted). If S(i) > s, first increment j until j ∉ S, then add the key-value pair (S(i), j) to the hash table and increment j.
To find a random value k, first generate a uniform random value in the range s + 1 to n, inclusive. If k is a key in H, then set k = H(k). I.e., we do at most one hash lookup to ensure k is not in S.
Python code to generate the hash:
def substitute(S):
    H = dict()
    j = 1
    for s in S:
        if s > len(S):
            while j in S: j += 1
            H[s] = j
            j += 1
    return H
For the actual implementation to be O(s), one might need to convert S into something like a frozenset to ensure the test for membership is O(1), and also move the len(S) loop invariant out of the loop. Assuming the j in S test and the insertion into the hash (H[s] = j) are constant time, this has complexity O(s).
The generation of a random value is simply:
def myrand(n, s, H):
    k = random.randint(s + 1, n)
    return (H[k] if k in H else k)
If one is only interested in a single random value per S, then the algorithm can be optimized to improve the common case, while the worst case remains the same. This still requires S be in a hash table that allows for a constant time "element of" test.
def rand_not_in(n, S):
    k = random.randint(len(S) + 1, n)
    if k not in S: return k
    j = 1
    for s in S:
        if s > len(S):
            while j in S: j += 1
            if s == k: return j
            j += 1
Optimizations are: Only generate the mapping if the random value is in S. Don't save the mapping to a hash table. Short-circuit the mapping generation when the random value is found.
Actually, the rejection method seems like the practical approach.
Generate a number in 1...n and check whether it is forbidden; regenerate until the generated number is not forbidden.
The probability of a single rejection is p = s/n.
Thus the expected number of random number generations is 1 + p + p^2 + p^3 + ... which is 1/(1-p), which in turn is equal to n/(n-s).
Now, if s is much less than n, or even as large as n/2, this expected number is at most 2.
It would take s almost equal to n to make this infeasible in practice.
Multiply the expected time by log s if you use a tree-set to check whether the number is in the set, or by just 1 (expected value again) if it is a hash-set. So the average time is O(1) or O(log s) depending on the set implementation. There is also O(s) memory for storing the set, but unless the set is given in some special way, implicitly and concisely, I don't see how it can be avoided.
(Edit: As per comments, you do this only once for a given set.
If, additionally, we are out of luck, and the set is given as a plain array or list, not some fancier data structure, we get O(s) expected time with this approach, which still fits into the O(log n + s) requirement.)
If attacks against the unbounded algorithm are a concern (and only if they truly are), the method can include a fall-back algorithm for the cases when a certain fixed number of iterations didn't provide the answer.
Similarly to how IntroSort is QuickSort but falls back to HeapSort if the recursion depth gets too high (which is almost certainly a result of an attack resulting in quadratic QuickSort behavior).
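The rejection loop itself is only a few lines (assuming `forbidden` supports O(1) membership tests, e.g. a set):

```python
import random

def sample_by_rejection(n, forbidden):
    # expected number of draws is n / (n - s); at most 2 whenever s <= n/2
    while True:
        k = random.randint(1, n)
        if k not in forbidden:
            return k
```

A bounded-iteration variant with a deterministic fallback, as described above, would wrap this loop in a fixed retry count.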
Find all numbers that are in the forbidden set and less than or equal to n-s. Call this array A.
Find all numbers that are not in the forbidden set and greater than n-s. Call this array B. This can be done in O(s) if the set is sorted.
Note that the lengths of A and B are equal; create the mapping map[A[i]] = B[i].
Generate a number t up to n-s. If map[t] exists, return it; otherwise return t.
This works in O(s) insertions into a map + 1 lookup, which is either O(s) on average or O(s log s).
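A sketch of this remapping in Python (names are mine): forbidden values that are <= n-s get paired with allowed values that are > n-s, so a uniform draw from [1, n-s] needs at most one map lookup:

```python
import random

def make_sampler(n, forbidden):
    s = len(forbidden)
    low_forbidden = [x for x in forbidden if x <= n - s]
    high_allowed = [x for x in range(n - s + 1, n + 1) if x not in forbidden]
    # the two lists always have equal length, so the pairing is total
    remap = dict(zip(low_forbidden, high_allowed))

    def sample():
        t = random.randint(1, n - s)
        return remap.get(t, t)

    return sample
```

After the O(s) setup, each call to `sample()` is O(1) and uniform over the allowed values.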

Online algorithm for random permutation of N integers

Imagine a standard permute function that takes an integer N and returns a vector of the first N natural numbers in a random permutation. If you only need k (<= N) of them, but don't know k beforehand, do you still have to perform the O(N) generation of the permutation? Is there a better algorithm than:
for x in permute(N):
    if f(x):
        break
I'm imagining an API such as:
p = permuter(N)
for x = p.next():
    if f(x):
        break
where the initialization is O(1) (including memory allocation).
This question is often viewed as a choice between two competing algorithms:
Strategy FY: A variation on the Fisher-Yates shuffle where one shuffle step is performed for each desired number, and
Strategy HT: Keep all generated numbers in a hash table. At each step, random numbers are produced until a number which is not in the hash table is found.
The choice is performed depending on the relationship between k and N: if k is sufficiently large, strategy FY is used; otherwise, strategy HT. The argument is that if k is small relative to N, maintaining an array of size N is a waste of space and incurs a large initialization cost. On the other hand, as k approaches N, more and more random numbers need to be discarded, and towards the end producing new values will be extremely slow.
Of course, you might not know in advance the number of samples which will be requested. In that case, you might pessimistically opt for FY, or optimistically opt for HT, and hope for the best.
In fact, there is no real need for trade-off, because the FY algorithm can be implemented efficiently with a hash table. There is no need to initialize an array of N integers. Instead, the hash-table is used to store only the elements of the array whose values do not correspond with their indices.
(The following description uses 1-based indexing; that seemed to be what the question was looking for. Hopefully it is not full of off-by-one errors. So it generates numbers in the range [1, N]. From here on, I use k for the number of samples which have been requested to date, rather than the number which will eventually be requested.)
At each point in the incremental FY algorithm a single index r is chosen at random from the range [k, N]. Then the values at indices k and r are swapped, after which k is incremented for the next iteration.
As an efficiency point, note that we don't really need to do the swap: we simply yield the value at r and then set the value at r to be the value at k. We'll never again look at the value at index k so there is no point updating it.
Initially, we simulate the array with a hash table. To look up the value at index i in the (virtual) array, we see if i is present in the hash table: if so, that's the value at index i. Otherwise the value at index i is i itself. We start with an empty hash table (which saves initialization costs), which represents an array whose value at every index is the index itself.
To do the FY iteration, for each sample index k we generate a random index r as above, yield the value at that index, and then set the value at index r to the value at index k. That's exactly the procedure described above for FY, except for the way we look up values.
This requires exactly two hash-table lookups, one insertion (at an already looked-up index, which in theory can be done more quickly), and one random number generation for each iteration. That's one more lookup than strategy HT's best case, but we have a bit of a saving because we never need to loop to produce a value. (There is another small potential saving when we rehash because we can drop any keys smaller than the current value of k.)
As the algorithm proceeds, the hash table will grow; a standard exponential rehashing strategy is used. At some point, the hash table will reach the size of a vector of N-k integers. (Because of hash table overhead, this point will be reached at a value of k much less than N, but even if there were no overhead this threshold would be reached at N/2.) At that point, instead of rehashing, the hash is used to create the tail of the now non-virtual array, a procedure which takes less time than a rehash and never needs to be repeated; remaining samples will be selected using the standard incremental FY algorithm.
This solution is slightly slower than FY if k eventually reaches the threshold point, and slightly slower than HT if k never gets big enough for random numbers to be rejected. But it is not much slower in either case, and it never suffers from pathological slowdown when k has an awkward value.
In case that was not clear, here is a rough Python implementation:
from random import randint

def sampler(N):
    k = 1
    # First phase: Use the hash
    diffs = {}
    # Only do this until the hash table is smallish (See note)
    while k < N // 4:
        r = randint(k, N)
        yield diffs[r] if r in diffs else r
        diffs[r] = diffs[k] if k in diffs else k
        k += 1
    # Second phase: Create the vector, ignoring keys less than k
    vbase = k
    v = list(range(vbase, N + 1))
    for i, s in diffs.items():
        if i >= vbase:
            v[i - vbase] = s
    del diffs
    # Now we can generate samples until we hit N
    while k <= N:
        r = randint(k, N)
        rv = v[r - vbase]
        v[r - vbase] = v[k - vbase]
        yield rv
        k += 1
Note: N // 4 is probably pessimistic; computing the correct value would require knowing too much about hash-table implementation. If I really cared about speed, I'd write my own hash table implementation in a compiled language, and then I'd know :)

Find "important" entries in a sorted log

I have a log file consisting of several thousand integers, each on its own line. I've parsed this into a sorted array of such integers. Now my issue is finding the "important" integers in this log: the ones that show up some user-configurable portion of the time.
For example, given the log, the user can filter to only see entries that appear a certain scaled number of times.
Currently I'm scanning the whole array and keeping count of the number of times each entry appears. Surely there is a better method?
First, I need to note that the following is just a theoretical solution; you probably should use what is proposed by @MBo.
Take every (m = n/l)-th element of the sorted array. Only those elements can be important, as no run of identical elements of length m can fit between positions i*m and (i+1)*m without covering one of them.
For each element x, find with binary search its lower bound and upper bound in the array. Subtracting indexes, you can know count, and decide to keep or discard x as unimportant.
Total complexity would be O((n/m) * log n) = O(l * log n). For large m it could be (asymptotically) better than O(n). To get an improvement in practice, however, you need very specific circumstances:
Array is given to you presorted (otherwise just use counting sort and you get an answer immediately)
You can access i-th element of the array in O(1) without reading the whole array. Otherwise, again, use counting sort with hash table.
Let's assume you have a file "data.bin" consisting of sorted fixed-width integers (variable width is possible too, but requires some extra effort). Then, in pseudocode, the algorithm could be something like this:
def find_all_important(l, n):
    m = n / l
    for i = m to n step m:
        x = read_integer_at_offset("data.bin", i)
        lower_bound = find_lower_bound(x, 0, i)
        upper_bound = find_upper_bound(x, i, n)
        if upper_bound - lower_bound >= m:
            report(x)

def find_lower_bound(x, begin, end):
    if end - begin == 0:
        return begin
    mid = (end + begin) / 2
    y = read_integer_at_offset("data.bin", mid)
    if y < x:
        return find_lower_bound(x, mid + 1, end)
    else:
        return find_lower_bound(x, begin, mid)
As a guess, you will not gain any noticeable improvement compared to naive O(n) on modern hardware, unless your file is very large (hundreds of MBs). And of course it is viable if your data can't fit in RAM. But as always with optimization, it might be worth testing.
Your sorting takes O(N log N) time, perhaps. Do you need to make (n/l) queries many times on the same data set?
If yes, walk through the sorted array, make (Value; Count) pairs, and sort them by the Count field. Now you can easily separate the pairs with high counts using binary search.
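A sketch of that preprocessing (names are mine): one pass over the sorted array builds the (value, count) pairs, and sorting them by count lets each subsequent threshold query run as a binary search.

```python
import bisect
from itertools import groupby

def build_pairs(sorted_log):
    # one pass over the sorted array: (value, count) pairs, sorted by count
    pairs = [(v, sum(1 for _ in g)) for v, g in groupby(sorted_log)]
    pairs.sort(key=lambda p: p[1])
    return pairs

def entries_with_count_at_least(pairs, min_count):
    counts = [c for _, c in pairs]            # counts are ascending
    i = bisect.bisect_left(counts, min_count)
    return [v for v, _ in pairs[i:]]
</```

When the user changes the threshold, only `entries_with_count_at_least` needs to rerun, in O(log u + output) for u distinct values.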

Filter index and Hash functions in Bloom Filter

A Bloom filter uses a bit array of m bits, so valid indexes run from 0 to m-1. But the hash functions I'm using return a 32-bit hash, which can be anywhere from 0 to 2^32-1. Since the hash is used as an index into the bit array (filter), it is quite possible for the hash to be greater than m, in which case the value will not be mapped onto the bit array. Should I take the hash mod m (hash % m) so that the resulting value always corresponds to an index in the bit array? Will it increase the number of false positives (IMO it will)?
A hash function h: S -> uint is loosely defined as one which exhibits a high degree of entropy over the set S. Suppose I have a certain hash function h that has very high entropy over S, but whose output h(x) for x in S is always even. This restriction simply means that one bit of the output of h is being wasted, which is only 1/32 of the bits.
Now suppose I have a Bloom filter for which m is an even number. Then h(x) % m is always going to be an even number - which means only half of the bits of the Bloom filter will ever be used! That's bad!
As others have suggested, simply taking the first m bits of h(x) as an index into 2^m Bloom filter bins is a better strategy, because, assuming you're being given a hash function that exhibits a high degree of entropy over the set S, the "first m bits" hash function g(x) = h(x)[0:m-1] should exhibit a nearly proportional amount of entropy.
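Concretely, for b index bits the two strategies look like this (using the hypothetical always-even hash from above to show the difference):

```python
def index_low_bits(h32, b):
    # equivalent to h32 % 2**b: reuses the low b bits of the hash
    return h32 & ((1 << b) - 1)

def index_high_bits(h32, b):
    # the "first bits" strategy: take the top b bits of a 32-bit hash
    return h32 >> (32 - b)

# If h(x) is always even, index_low_bits can only ever produce even bins
# for b where the lowest bit matters, while index_high_bits is unaffected,
# because only the lowest bit of the hash is degenerate.
```

Note that for a power-of-two m the masked form and `% m` are identical; the gain comes from choosing which bits of the hash you trust.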
Yes, using mod increases the probability of false positives. Stephan T. Lavavej gave a great talk about this (that mod creates bias) at GoingNative 2013; see HERE.
He also mentions what @btilly said: it's better to simply cut bits; if your hash function is good, then that's OK.

Check if array B is a permutation of A

I tried to find a solution to this but couldn't get much out of my head.
We are given two unsorted integer arrays A and B. We have to check whether array B is a permutation of A. How can this be done? Even XORing the numbers won't work, as there are counterexamples that have the same XOR value but are not permutations of each other.
The solution needs to be O(n) time and O(1) space.
Any help is welcome!!
Thanks.
The question is theoretical, but you can do it in O(n) time and O(1) space. Allocate an array of 2^32 counters and set them all to zero. This is an O(1) step because the array has constant size. Then iterate through the two arrays. For array A, increment the counters corresponding to the integers read. For array B, decrement them. If you run into a negative counter value during the iteration of array B, stop: the arrays are not permutations of each other. Otherwise at the end (assuming A and B have the same size, a prerequisite) the counter array is all zero and the two arrays are permutations of each other.
This is an O(1) space, O(n) time solution. It is not practical, but it would easily pass as a solution to the interview question. At least it should.
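The same structure, with a small universe U standing in for 2^32 so the sketch actually runs:

```python
def is_perm_counting(A, B, U=256):
    # counter array indexed by value; "O(1)" space because U is a fixed constant
    if len(A) != len(B):
        return False
    counts = [0] * U
    for a in A:
        counts[a] += 1
    for b in B:
        counts[b] -= 1
        if counts[b] < 0:      # B contains more copies of b than A does
            return False
    return True
```

The early exit on a negative counter is what makes the second pass safe without a final scan of all U counters.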
More obscure solutions
Using a nondeterministic model of computation, checking that the two arrays are not permutations of each other can be done in O(1) space and O(n) time by guessing an element that has a differing count in the two arrays, and then counting the instances of that element in both arrays.
In a randomized model of computation, construct a random commutative hash function and calculate the hash values of the two arrays. If the hash values differ, the arrays are not permutations of each other. Otherwise they might be. Repeat many times to bring the probability of error below the desired threshold. This is also an O(1) space, O(n) time approach, but randomized.
In a parallel computation model, let n be the size of the input array. Allocate n threads. Every thread i = 1 .. n reads the i'th number of the first array; let that be x. Then the same thread counts the number of occurrences of x in the first array, and checks for the same count in the second array. Every single thread uses O(1) space and O(n) time.
Interpret an integer array [a1, ..., an] as the polynomial x^a1 + x^a2 + ... + x^an, where x is a free variable, and check numerically for the equivalence of the two polynomials obtained. Use floating-point arithmetic for an O(1) space, O(n) time operation. This is not an exact method, because of rounding errors and because numerical checking for equivalence is probabilistic. Alternatively, interpret the polynomial over the integers modulo a prime number, and perform the same probabilistic check.
If we are allowed to freely access a large list of primes, you can solve this problem by leveraging properties of prime factorization.
For both arrays, calculate the product of Prime[i] for each integer i in the array, where Prime[i] is the i'th prime number. The products of the two arrays are equal iff they are permutations of one another.
Prime factorization helps here for two reasons.
Multiplication is commutative and associative, so the order of the operands when calculating the product is irrelevant. (Some alluded to the fact that if the arrays were sorted, this problem would be trivial. By multiplying, we are implicitly sorting.)
Prime numbers multiply losslessly. If we are given a number and told it is a product of primes, we can calculate exactly which prime numbers were multiplied and how many times.
Example:
a = 1,1,3,4
b = 4,1,3,1
Product of ith primes in a = 2 * 2 * 5 * 7 = 140
Product of ith primes in b = 7 * 2 * 5 * 2 = 140
That said, we probably aren't allowed access to a list of primes, but this seems a good solution otherwise, so I thought I'd post it.
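A sketch with a simple stand-in for that prime list (assumes all values are positive integers; `gen_primes` is my naive substitute for the assumed precomputed list):

```python
import math

def gen_primes(count):
    # naive trial division; a stand-in for the assumed list of primes
    primes, n = [], 2
    while len(primes) < count:
        if all(n % p for p in primes):
            primes.append(n)
        n += 1
    return primes

def same_multiset_by_primes(a, b):
    primes = gen_primes(max(a + b))          # primes[i-1] is the i'th prime
    pa = math.prod(primes[x - 1] for x in a)
    pb = math.prod(primes[x - 1] for x in b)
    return len(a) == len(b) and pa == pb
```

For the example above, both [1, 1, 3, 4] and [4, 1, 3, 1] map to 2 * 2 * 5 * 7 = 140. The products grow very quickly, which is another practical obstacle besides the prime list itself.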
I apologize for posting this as an answer as it should really be a comment on antti.huima's answer, but I don't have the reputation yet to comment.
The size of each counter seems to be O(log(n)) bits, as it depends on the number of instances of a given value in the input array.
For example, let the input array A be all 1's with a length of (2^32) + 1. This requires a counter of 33 bits to encode (which, in practice, would double the size of the counter array, but let's stay with theory). Double the length of A (still all 1 values) and you need 34 bits for each counter, and so on.
This is a very nit-picky argument, but these interview questions tend to be very nit-picky.
If we need not sort this in-place, then the following approach might work:
Create a HashMap with the array element as key and the number of occurrences as value (to handle multiple occurrences of the same number).
Traverse array A.
Insert the array elements in the HashMap.
Next, traverse array B.
Search for every element of B in the HashMap. If the corresponding value is 1, delete the entry; else, decrement the value by 1.
If we are able to process the entire array B and the HashMap is empty at the end, success; else failure.
The HashMap uses O(n) space in the worst case, and you traverse each array only once.
Not sure if this is what you are looking for. Let me know if I have missed any constraint about space/time.
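The steps above, sketched in Python with a plain dict (name is mine):

```python
def is_perm_hashmap(A, B):
    counts = {}
    for a in A:                        # value -> number of occurrences in A
        counts[a] = counts.get(a, 0) + 1
    for b in B:
        if b not in counts:
            return False
        if counts[b] == 1:
            del counts[b]              # last occurrence: drop the entry
        else:
            counts[b] -= 1
    return not counts                  # empty map <=> B is a permutation of A
```

Deleting exhausted entries makes the final emptiness test equivalent to checking that every count reached zero.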
You're given two constraints: Computational O(n), where n means the total length of both A and B and memory O(1).
If two series A and B are permutations of each other, then there is also a series C resulting from a permutation of either A or B. So the problem becomes permuting both A and B into series C_A and C_B and comparing them.
One such permutation is sorting. There are several sorting algorithms which work in place, so you can sort A and B in place. In the best case, Smoothsort sorts with O(n) computational and O(1) memory complexity; in the worst case with O(n log n) / O(1).
The per-element comparison then happens in O(n), but since in O notation O(2n) = O(n), using Smoothsort plus a comparison gives you an O(n) / O(1) check of whether two series are permutations of each other. In the worst case, however, it will be O(n log n) / O(1).
The solution needs to be O(n) time and with space O(1).
This rules out sorting, and the O(1) space requirement is a hint that you probably should compute a hash of the strings and compare them.
If you have access to a prime number list do as cheeken's solution.
Note: If the interviewer says you don't have access to a prime number list, then generate the prime numbers and store them. This is O(1) because the alphabet length is a constant.
Otherwise, here's my alternative idea. I will define the alphabet as {a,b,c,d,e} for simplicity.
The values for the letters are defined as:
a, b, c, d, e
1, 2, 4, 8, 16
note: if the interviewer says this is not allowed, then make a lookup table for the Alphabet, this takes O(1) space because the size of the Alphabet is a constant
Define a function which can find the distinct letters in a string.
// set bit value of char c in variable i and return result
distinct(char c, int i) : int
E.g. distinct('a', 0) returns 1
E.g. distinct('a', 1) returns 1
E.g. distinct('b', 1) returns 3
Thus if you iterate the string "aab" the distinct function should give 3 as the result
Define a function which can calculate the sum of the letters in a string.
// return sum of c and i
sum(char c, int i) : int
E.g. sum('a', 0) returns 1
E.g. sum('a', 1) returns 2
E.g. sum('b', 2) returns 4
Thus if you iterate the string "aab" the sum function should give 4 as the result
Define a function which can calculate the length of the letters in a string.
// return length of string s
length(string s) : int
E.g. length("aab") returns 3
Running the methods on two strings and comparing the results takes O(n) running time. Storing the hash values takes O(1) in space.
e.g.
distinct of "aab" => 3
distinct of "aba" => 3
sum of "aab" => 4
sum of "aba" => 4
length of "aab" => 3
length of "aba" => 3
Since all the values are equal for both strings, they must be a permutation of each other.
EDIT: The solution is not correct with the given alphabet values, as pointed out in the comments.
You can convert one of the two arrays into an in-place hashtable. This will not be exactly O(N), but it will come close, in non-pathological cases.
Just use [number % N] as its desired index, or in the chain that starts there. If an element has to be replaced, it can be placed at the index where the offending element started. Rinse, wash, repeat.
UPDATE:
This is a similar (N=M) hash table. It used chaining, but it could be downgraded to open addressing.
I'd use a randomized algorithm that has a low chance of error.
The key is to use a universal hash function.
def hash(array, hash_fn):
    cur = 0
    for item in array:
        cur ^= hash_fn(item)   # XOR makes the combined hash order-independent
    return cur

def are_perm(a1, a2):
    hash_fn = pick_random_universal_hash_func()
    return hash(a1, hash_fn) == hash(a2, hash_fn)
If the arrays are permutations, it will always be right. If they are different, the algorithm might incorrectly say that they are the same, but it will do so with very low probability. Further, you can get an exponential decrease in chance for error with a linear amount of work by asking many are_perm() questions on the same input, if it ever says no, then they are definitely not permutations of each other.
I just found a counterexample, so the assumption below is incorrect.
I cannot prove it, but I think this may possibly be true.
Since all elements of the arrays are integers, suppose each array has 2 elements, and we have
a1 + a2 = s
a1 * a2 = m
b1 + b2 = s
b1 * b2 = m
then {a1, a2} == {b1, b2}.
If this is true, it's also true for arrays with n elements.
So we compare the sum and product of each array; if they are equal, one is a permutation of the other.
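One such counterexample for three elements (my example, not necessarily the one found above, but easy to verify): {1, 5, 8} and {2, 2, 10} have equal sums and equal products yet are different multisets.

```python
import math

a = [1, 5, 8]
b = [2, 2, 10]
assert sum(a) == sum(b) == 14                 # equal sums
assert math.prod(a) == math.prod(b) == 40     # equal products
assert sorted(a) != sorted(b)                 # yet not permutations of each other
```

So sum and product together determine a multiset of size 2 (they fix the roots of a quadratic), but not of size 3 or more.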
