Proof of the MinHash probability equation

I'm reading about the MinHash technique for estimating the similarity between two sets: given sets A and B, let h be the hash function and hmin(S) the minimum hash of set S, i.e. hmin(S) = min(h(s)) for s in S. We have the equation:
Pr[hmin(A) = hmin(B)] = |A ∩ B| / |A ∪ B|
which means the probability that the minimum hash of A equals the minimum hash of B is the Jaccard similarity of A and B.
I am trying to prove the above equation and came up with my own proof: take a ∈ A and b ∈ B such that h(a) = hmin(A) and h(b) = hmin(B). So if hmin(A) = hmin(B) then h(a) = h(b). Assume that the hash function h maps distinct keys to distinct hash values; then h(a) = h(b) if and only if a = b, which has probability |A ∩ B| / |A ∪ B|. However, my proof is not complete, since a hash function can return the same value for different keys. So I'm asking for your help to find a proof that applies regardless of the hash function.

I can't be sure what your exact question is, but if you are looking for a way to prove that the probability that the minimum hash of A equals the minimum hash of B is the Jaccard similarity of A and B, try having a look at section 3.3.3 of Mining of Massive Datasets, by Anand Rajaraman and Jeff Ullman.

Think of the hash function just as a means of producing a random permutation of (A ∪ B). Now, think about that permutation.
Put every element of (A ∪ B) as a row in a table, in the order given by the permutation p you have chosen, with two columns indicating membership in A and B, like this:
A = {1, 3, 5, 6}
B = {2, 3, 4, 6}
p = {5, 6, 1, 2, 4, 3}
The table (rows in permutation order):
element  A  B
      5  1  0
      6  1  1
      1  1  0
      2  0  1
      4  0  1
      3  1  1
There are only two types of rows. Type X: rows where both A and B are 1. Type Y: rows where A != B.
There are |A ∪ B| rows in total, of which |A ∩ B| are of type X. Now, hmin(A) = hmin(B) exactly when the first row of the table is of type X: if the first element of the permutation belongs to both sets, it is the minimum for both; if it belongs to only one, the two minimums differ. Since the permutation is random, the chance that the first row is of type X is X/(X+Y). Hence Pr[hmin(A) = hmin(B)] = |A ∩ B| / |A ∪ B|.
This is exactly what the book Nilesh linked says, but I tried to explain with another example.
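To make the permutation argument concrete, here is a small Monte Carlo sketch (the helper names are mine) that draws random permutations of A ∪ B and checks how often the two minimums agree; for the example sets above the frequency should approach the Jaccard similarity 2/6 ≈ 0.33.

import random

def jaccard(A, B):
    return len(A & B) / len(A | B)

def minhash_agreement(A, B, trials=100000):
    # estimate Pr[hmin(A) = hmin(B)] empirically: each random permutation
    # of the universe plays the role of one random hash function
    universe = list(A | B)
    hits = 0
    for _ in range(trials):
        order = random.sample(universe, len(universe))
        rank = {x: i for i, x in enumerate(order)}
        if min(rank[a] for a in A) == min(rank[b] for b in B):
            hits += 1
    return hits / trials

A, B = {1, 3, 5, 6}, {2, 3, 4, 6}
print(jaccard(A, B))            # 0.333...
print(minhash_agreement(A, B))  # should be close to 0.333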

This can't be proved "regardless of the hash function". Just consider: you could use a very poor hash function that produces extremely frequent collisions (such as simply binary-ANDing all values together). MinHash would no longer approximate Jaccard similarity at all, but would report much higher similarities. Proofs of MinHash that I've seen have assumed that hash collisions will be rare enough to be insignificant.

Assume collisions never happen, or are negligible: you just choose a length for your hashes such that the chance of a collision becomes arbitrarily small. This article describes the bounds for various numbers of items and hash sizes. https://en.wikipedia.org/wiki/Birthday_attack
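For instance, a quick check of the birthday-bound approximation p ≈ 1 - exp(-n^2 / 2^(b+1)) described in that article (the helper name is mine):

import math

def collision_probability(n_items, hash_bits):
    # birthday-bound approximation for the chance of any collision among
    # n_items values drawn uniformly from a hash_bits-bit space
    return 1.0 - math.exp(-float(n_items) ** 2 / 2.0 ** (hash_bits + 1))

print(collision_probability(10**6, 32))  # ~1.0: 32 bits is far too small
print(collision_probability(10**6, 64))  # ~2.7e-8: 64 bits is plenty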

Related

How to more effectively find the minimal composition from N sets that satisfies the given condition?

We have N sets of triples, like
1. { (4; 0.1), (5; 0.3), (7; 0.6) }
2. { (7; 0.2), (8; 0.4), (1; 0.4) }
...
N. { (6; 0.3), (1; 0.2), (9; 0.5) }
and need to choose exactly one pair from each triple, so that the sum of the first members of the chosen pairs is minimal, subject to the condition that the sum of the second members is not less than a given number P.
We can solve this by enumerating all possible pair combinations (3^N of them), sorting them by the sum of their first members, and choosing the first one in that sorted list which also satisfies the second condition.
Could you please suggest a better, non-trivial solution for this problem?
If there are no constraints on the values inside your triples, then we are facing a pretty general version of an integer programming problem, more specifically a 0-1 linear programming problem, as it can be represented as a system of equations with every variable being 0 or 1. You can find the possible approaches on the wiki page, but there is no fast-and-easy solution for this problem in general.
Alternatively, if the second numbers of each pair (the ones that need to sum up to >= P) are from a small enough range, we can view this as a dynamic programming problem similar to the Knapsack problem. "Small enough" is a bit hard to define here because the original data has non-integer numbers. If they were integers, the algorithmic complexity of the solution I will describe is O(P * N). For non-integer numbers, they first need to be converted to integers by multiplying them all, as well as P, by a large enough factor. In your example, the precision of each number is one digit after the decimal point, so multiplying by 10 is enough. Hence, the actual complexity is O(M * P * N), where M is the factor everything was multiplied by to make the numbers integers.
After this, we are essentially solving a modified Knapsack problem: instead of constraining the weight from above, we are constraining it from below, and on each step we are choosing a pair from a triplet, as opposed to deciding whether to put an item into the knapsack or not.
Let's define a function minimum_sum[i][s] which represents the minimum possible sum (of the first numbers in each pair we took) we can achieve if the sum of the second numbers of the pairs taken so far equals s and we have already considered the first i triples. One exception to this definition: minimum_sum[i][P] holds the minimum over all second-number sums equal to or exceeding P. If we can compute all values of this function, then minimum_sum[N][P] is the answer. The values can be computed with something like this:
minimum_sum[0][0] = 0, all other values are set to infinity
for i = 0..N-1:
    for s = 0..P:
        for j = 0..2:
            minimum_sum[i+1][min(P, s+B[i][j])] =
                min(minimum_sum[i+1][min(P, s+B[i][j])], minimum_sum[i][s] + A[i][j])
A[i][j] here denotes the first number of the i-th triple's j-th pair, and B[i][j] denotes the second number of the same pair.
This solution is viable if N is large, but P is small and the precision of the Bs isn't too high. For instance, if N=50 there is little hope of computing all 3^N possibilities, but with M*P = 1000000 this approach works extremely fast.
Python implementation of the idea above:
def compute(A, B, P):
    n = len(A)
    # note that I use 1,000,000 as "infinity" here, which might need to be increased depending on input data
    best = [[1000000 for i in range(P + 1)] for j in range(n + 1)]
    best[0][0] = 0
    for i in range(n):
        for s in range(P + 1):
            for j in range(3):
                best[i+1][min(P, s+B[i][j])] = min(best[i+1][min(P, s+B[i][j])], best[i][s] + A[i][j])
    return best[n][P]
Testing:
A=[[4, 5, 7], [7, 8, 1], [6, 1, 9]]
# second numbers in each pair after scaling them up to be integers
B=[[1, 3, 6], [2, 4, 4], [3, 2, 5]]
In [7]: compute(A, B, 0)
Out[7]: 6
In [14]: compute(A, B, 7)
Out[14]: 6
In [15]: compute(A, B, 8)
Out[15]: 7
In [20]: compute(A, B, 13)
Out[20]: 14

Select k numbers maximizing sum of pairwise xor

Given a range [l, r] (where l < r), and a number k (where k <= r - l), I want to select a set S of k distinct numbers in [l, r] which maximizes the sum of pairwise xors. For example, if [l, r] = [2, 10] and k = 3 and we choose S = {4, 5, 6}, the sum of xors (counting the differing bits of each pair, i.e. Hamming distances) is d(4, 5) + d(4, 6) + d(5, 6) = 1 + 1 + 2 = 4.
Here's my thinking so far: in [l, r], for each bit index i less than or equal to the index of the highest set bit in r, the number of elements in S ^ S with the ith bit set is equal to j * (k-j), where j is the count of the elements in S with the ith bit set. To optimize this we want to select S such that, for each bit i, S contains k/2 elements with the ith bit set. This is easy for k = 2, but I'm stuck on generalizing this for k > 2.
At first glance it seems that there is no algebraic solution for this problem; it looks like an NP-hard optimization problem that is not solvable in polynomial time.
As is almost always possible, one can brute-force through the feasible space.
Intuitively, I can suggest looking into Locality Sensitive Hashing. In LSH one normally tries to find similarities between two sets, but in your case you can abuse the algorithm in the following sense.
The domain is subdivided into a few buckets.
You sample points randomly in the space [l, r].
Promising points (those at large Hamming distance from one another) end up placed in the same buckets.
In the end you brute-force within the most promising bucket.
The expectation is that points with large pairwise Hamming distances land in the same neighborhood (hence the name Locality Sensitive Hashing). However, it is just an idea.
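Since brute force is the suggested baseline, here is a minimal exhaustive-search sketch (the function names are mine) that scores a subset by the total number of differing bits over all pairs, matching the d(4, 5) + d(4, 6) + d(5, 6) = 4 example in the question. It is only viable for small ranges, but is useful for checking any heuristic against the true optimum.

from itertools import combinations

def pairwise_xor_bits(S):
    # total Hamming distance (set bits of a ^ b) summed over all pairs
    return sum(bin(a ^ b).count("1") for a, b in combinations(S, 2))

def best_subset(l, r, k):
    # exhaustive search over all k-subsets of [l, r]
    return max(combinations(range(l, r + 1), k), key=pairwise_xor_bits)

S = best_subset(2, 10, 3)
print(S, pairwise_xor_bits(S))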

From a given number, determine three close numbers whose product is the original number

I have a number n, and I want to find three numbers whose product is n but are as close to each other as possible. That is, if n = 12 then I'd like to get 3, 2, 2 as a result, as opposed to 6, 1, 2.
Another way to think of it is that if n is the volume of a cuboid then I want to find the lengths of the sides so as to make the cuboid as much like a cube as possible (that is, the lengths as similar as possible). These numbers must be integers.
I know there is unlikely to be a perfect solution to this, and I'm happy to use something which gives a good answer most of the time, but I just can't think where to go with coming up with this algorithm. Any ideas?
Here's my first algorithm sketch, granted that n is relatively small:
Compute the prime factors of n.
Pick out the three largest and assign them to f1, f2, f3. If there are fewer than three factors, assign 1.
Loop over the remaining factors in decreasing order, multiplying each into the currently smallest of the three.
Edit
Let's take n=60.
Its prime factors are 5 3 2 2.
Set f1=5, f2=3 and f3=2.
The remaining 2 is multiplied into f3, because it is the smallest.
We end up with 5 * 4 * 3 = 60.
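A quick Python sketch of this greedy idea (the helper names are mine, and trial division is only reasonable for small n):

def prime_factors(n):
    # trial division; returns the prime factors in descending order
    fs, d = [], 2
    while d * d <= n:
        while n % d == 0:
            fs.append(d)
            n //= d
        d += 1
    if n > 1:
        fs.append(n)
    return sorted(fs, reverse=True)

def close_factors(n):
    # seed with the three largest prime factors (padding with 1s),
    # then fold each remaining factor into the currently smallest slot
    fs = prime_factors(n)
    f = (fs + [1, 1, 1])[:3]
    for p in fs[3:]:
        f[f.index(min(f))] *= p
    return sorted(f)

print(close_factors(60))     # [3, 4, 5]
print(close_factors(17550))  # [15, 30, 39] -- not optimal, see the comment below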
Edit
This algorithm will not find the optimum; notice btilly's comment:
Consider 17550 = 2 * 3 * 3 * 3 * 5 * 5 * 13. Your algorithm would give 15, 30, 39 when the best is 25, 26, 27.
Edit
Ok, here's my second algorithm sketch with a slightly better heuristic:
Set the list L to the prime factors of n.
Set r to the cube root of n.
Create the set of three factors F, each initially set to 1.
Iterate over the prime factors in descending order:
Try to multiply the current factor L[i] with each of the entries of F in descending order.
If the result is less than r, perform the multiplication and move on to the next prime factor.
If not, try the next entry of F. If no entry of F works, multiply with the one that yields the smallest result.
This will work for the case of 17550:
n=17550
L=13,5,5,3,3,3,2
r=25.98
F = { 1, 1, 1 }
Iteration 1:
F[0] * 13 = 13 is less than r, set F to {13,1,1}.
Iteration 2:
F[0] * 5 = 65 is greater than r.
F[1] * 5 = 5 is less than r, set F to {13,5,1}.
Iteration 3:
F[0] * 5 = 65 is greater than r.
F[1] * 5 = 25 is less than r, set F to {13,25,1}.
Iteration 4:
F[0] * 3 = 39 is greater than r.
F[1] * 3 = 75 is greater than r.
F[2] * 3 = 3 is less than r, set F to {13,25,3}.
Iteration 5:
F[0] * 3 = 39 is greater than r.
F[1] * 3 = 75 is greater than r.
F[2] * 3 = 9 is less than r, set F to {13,25,9}.
Iteration 6:
F[0] * 3 = 39 is greater than r.
F[1] * 3 = 75 is greater than r.
F[2] * 3 = 27 is greater than r, but it is the smallest F we can get. Set F to {13,25,27}.
Iteration 7:
F[0] * 2 = 26 is greater than r, but it is the smallest F we can get. Set F to {26,25,27}.
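Here is a rough Python rendering of this second heuristic (it reuses the prime_factors helper from the sketch above, and the slot-ordering details are my reading of the steps):

def close_factors_v2(n):
    # try to grow the largest slot that stays below the cube root;
    # if no slot fits, grow whichever slot yields the smallest product
    L = prime_factors(n)      # prime factors in descending order
    r = n ** (1 / 3)
    F = [1, 1, 1]
    for p in L:
        for i in sorted(range(3), key=lambda i: F[i], reverse=True):
            if F[i] * p < r:
                F[i] *= p
                break
        else:
            i = min(range(3), key=lambda i: F[i] * p)
            F[i] *= p
    return sorted(F)

print(close_factors_v2(17550))  # [25, 26, 27]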
Here's a purely math-based approach that returns the optimal solution and does not involve any kind of sorting. Hell, it doesn't even need the prime factors.
Background:
1) Recall that for a cubic polynomial
p(x) = x^3 + a x^2 + b x + c
the sums and products of the roots are given by Vieta's formulas:
x_1 + x_2 + x_3 = -a
x_1 x_2 + x_1 x_3 + x_2 x_3 = b
x_1 x_2 x_3 = -c
where x_i are the roots.
2) Recall another elementary result from optimization theory: given two variables whose product is a constant, their sum is minimized when the two variables are equal to each other (by the AM-GM inequality).
A corollary of this would be that if the sum of two variables whose product is constant is a minimum, then the two variables are equal to each other.
Reformulate the original problem:
Your question above can now be reformulated as a polynomial root-finding exercise. We'll construct a polynomial that satisfies your conditions, and the roots of that polynomial will be your answer. If you need k numbers that are optimal, you'll have a polynomial of degree k. In this case, we can talk in terms of a cubic equation
p(x) = x^3 + a x^2 + b x + c = 0
We know that:
c is the negative of the input number n (assume n positive)
a is an integer and negative (since the sum of the roots is -a and the factors are all positive)
b is an integer (the sum of the products of the roots taken two at a time) and is positive.
The roots of p must be real (and positive, but that has already been addressed).
To solve the problem, we simply need to maximize a subject to the above set of conditions. The only part not explicitly known right now is condition 4, which we can easily enforce using the discriminant of the polynomial.
For a cubic polynomial p as above, the discriminant is
∆ = 18abc - 4a^3 c + a^2 b^2 - 4b^3 - 27c^2
and p has real and distinct roots if ∆ > 0, and real and coincident roots (either two or all three) if ∆ = 0. So constraint 4 now reads ∆ >= 0. This is now simple and easy to program.
Solution in Mathematica
Here's a solution in Mathematica that implements this (the code is given in text form below). A test on some of the numbers used in other answers/comments returned the optimal triplet in each case (the input/output table is not reproduced here).
NOTE:
I just noticed that OP never mentioned that the 3 numbers needed to be integers although everyone (including myself until now) assumed that they were (probably because of his first example). Re-reading the question, and going by the cube example, it doesn't seem like OP was fixated on integers.
This is an important point which will decide which class of algorithms to pursue and needs to be defined. If they need not be integers, there are several polynomial based solutions that can be provided, one of which is mine (after relaxing the integer constraint). If they should be integers, then perhaps an approach using branch-n-bound/branch-n-cut/cutting plane might be more appropriate.
The following was written assuming the OP meant the three numbers to be integers.
The way I've implemented it right now, it can give a non-integer solution in certain cases.
The reason this gives non-integer solutions for x is that I only maximized a, when actually b also needs to be minimal (and also because I haven't placed a constraint on the x_i being integers; it is possible to use the integer root theorem, but that would involve finding the prime factors and make things more complicated).
Mathematica code in text
Clear[poly, disc, f]
poly = x^3 + a x^2 + b x + c;
disc = Discriminant[poly, x];
f[n_Integer] :=
  Module[{p, \[CapitalDelta] = disc /. c -> -n},
    p = poly /. Maximize[{a, \[CapitalDelta] >= 0,
          b > 0 && a < 0 && {a, b} \[Element] Integers}, {a, b}][[2]] /. c -> -n;
    Solve[p == 0]
  ]
There may be a clever way to find the tightest triplet, as Anders Lindahl is pursuing, but I will focus on a more basic approach.
If I generate all triplets, then I can filter them afterward however I want, so I will start there. The best way I know to generate these uses recursion:
f[n_, 1] := {{n}}
f[n_, k_] :=
  Join @@ Table[
    {q, ##} & @@@ Select[f[n/q, k - 1], #[[1]] >= q &],
    {q, #[[2 ;; ⌈Length@#/k⌉]] &@Divisors@n}
  ]
This function f takes two integer arguments, the number to factor n, and the number of factors to produce k.
The section #[[2 ;; ⌈Length@#/k⌉]] &@Divisors@n uses Divisors to produce a list of all divisors of n (including 1), and then takes from these from the second (to drop the 1) up to the Ceiling of the number of divisors divided by k.
For example, for {n = 240, k = 3} the output is {2, 3, 4, 5, 6, 8}
The Table command iterates over this list while accumulating results, assigning each element to q.
The body of the Table is Select[f[n/q, k - 1], #[[1]] >= q &]. This calls f recursively, and then selects from the result all lists that begin with a number >= q.
{q, ##} & @@@ (also in the body) then "prepends" q to each of these selected lists.
Finally, Join @@ merges the lists of selected lists that are produced by each iteration of Table.
The result is all of the factorizations of n into k integer parts, in lexicographical order. Example:
In[]:= f[240, 3]
Out[]= {{2, 2, 60}, {2, 3, 40}, {2, 4, 30}, {2, 5, 24}, {2, 6, 20},
{2, 8, 15}, {2, 10, 12}, {3, 4, 20}, {3, 5, 16}, {3, 8, 10},
{4, 4, 15}, {4, 5, 12}, {4, 6, 10}, {5, 6, 8}}
With the output of the function/algorithm given above, one can then test triplets for quality however desired.
Notice that because of the ordering the last triplet in the output is the one with the greatest minimum factor. This will usually be the most "cubic" of the results, but occasionally it is not.
If the true optimum must be found, it makes sense to test starting from the right side of the list, abandoning the search if a better result is not found quickly, as the quality of the results decrease as you move left.
Obviously this method relies upon a fast Divisors function, but I presume that this is either a standard library function, or you can find a good implementation here on StackOverflow. With that in place, this should be quite fast. The code above finds all triplets for n from 1 to 10,000 in 1.26 seconds on my machine.
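For readers without Mathematica, here is a rough Python equivalent of f (the helper names are mine; I use a plain non-decreasing filter in place of the ⌈Length@#/k⌉ cut, which is just a pruning optimization):

def divisors(n):
    # all divisors of n in ascending order, by trial division to sqrt(n)
    small, large = [], []
    d = 1
    while d * d <= n:
        if n % d == 0:
            small.append(d)
            if d != n // d:
                large.append(n // d)
        d += 1
    return small + large[::-1]

def f(n, k):
    # all ways to write n as a product of k non-decreasing factors, none equal to 1
    if k == 1:
        return [[n]]
    out = []
    for q in divisors(n)[1:]:          # skip the divisor 1
        for rest in f(n // q, k - 1):
            if rest[0] >= q:           # keep the factors non-decreasing
                out.append([q] + rest)
    return out

print(f(240, 3))  # reproduces the fourteen triplets listed above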
Instead of reinventing the wheel, one should recognize this as a variation of a well-known NP-complete problem:
Compute the prime factors of n.
Compute the logarithms of these factors.
The problem now translates to partitioning these logs into three sums that are as close to each other as possible.
This is a variation of the Bin Packing problem known as Multiprocessor scheduling.
Given that the Multiprocessor scheduling problem is NP-complete, it's no wonder that it's hard to find an algorithm that avoids searching the whole problem space yet finds the optimum solution.
But I guess there are already several algorithms that deal with either Bin Packing or Multiprocessor scheduling and find near-optimum solutions in an efficient manner.
Another related problem (a generalization) is Job shop scheduling. See the Wikipedia description with many links to known algorithms.
What Wikipedia describes as the often-used LPT algorithm (Longest Processing Time) is exactly what Anders Lindahl came up with first.
EDIT
Here's a shorter explanation using more efficient code; KSetPartitions simplifies things considerably, as did some suggestions from Mr.W. The overall logic remains the same.
Assuming there are at least 3 prime factors of n:
Find the list of triplet KSetPartitions of the prime factors of n.
Multiply the elements (prime factors) within each subset to produce all possible combinations of three divisors of n (when multiplied together they yield n). You can think of the divisors as the length, width and height of an orthogonal parallelepiped.
The parallelepiped closest to a cube will have the shortest space diagonal. Sum the squares of the three divisors for each case and pick the smallest.
Here's the code in Mathematica:
Needs["Combinatorica`"]
g[n_] := Module[{factors = Join @@ ConstantArray @@@ FactorInteger[n]},
  Sort[Union[Sort /@ Apply[Times, Union[Sort /@
      KSetPartitions[factors, 3]], {2}]]
    /. {a_Integer, b_Integer, c_Integer} :>
      {Total[Power[{a, b, c}, 2]], {a, b, c}}][[1, 2]]]
It can handle fairly large numbers, but slows down considerably as the number of factors of n grows. The examples below show timings for 240, 2400, ...24000000.
This could be sped up in principle by taking into account cases where a prime factor appears more than once in a divisor. I don't have the know-how to do it yet.
In[28]:= g[240]
Out[28]= {5, 6, 8}
In[27]:= t = Table[Timing[g[24*10^n]][[1]], {n, 6}]
Out[27]= {0.001868, 0.012734, 0.102968, 1.02469, 10.4816, 105.444}
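For reference, the same shortest-space-diagonal selection can be expressed with the Python f sketch from earlier (assuming that function is in scope):

print(min(f(240, 3), key=lambda t: sum(v * v for v in t)))  # [5, 6, 8]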

Equal k subsets algorithm

Does anyone know a good and efficient algorithm for the equal k-subsets problem? Preferably in C or C++, something that could handle a 100-element vector, with a complexity and running-time estimate.
Example: a 9-element vector
x = {2,4,5,6,8,9,11,13,14}
I need to generate all k = 3 disjoint subsets with sum = 24.
The algorithm should check whether there are k disjoint subsets, each with a sum of elements equal to 24, and list them in ascending order (within each subset and between subsets), or determine that no solution exists.
Solutions
solution 1: {2 8 14} {4 9 11} {5 6 13}
solution 2: {2 9 13} {4 6 14} {5 8 11}
Thanks
Unfortunately the constrained k-subset problem is a hard problem ... and if you want to generate all such k-subsets, you have no choice but to evaluate many possible candidates.
There are a couple of optimizations you can perform to reduce the search space.
Given a domain x containing integer values,
Given a positive integer target M,
Given a positive integer size k for the subset:
When x only contains positive integers, and given an upper bound M, remove all items from x larger than or equal to M; for k > 1 these can't possibly be part of the subset.
Similarly, for k > 1, a given M, and x containing positive integers, remove all items from x which are larger than M - (min_0 + min_1 + ... + min_{k-2}), i.e. M minus the sum of the k-1 smallest values. Essentially, remove all of the large values which can't possibly be part of the subset, since even when combined with the smallest values they would result in a sum in excess of M.
You can also use the even/odd exclusion principle to pare down your search space. For instance, if k is odd and M is even - say k = 3 - you know that the sum will either contain three even numbers or two odd and one even. You can use this information to reduce the search space by eliminating candidate selections from x that can't produce the required parity.
Sort the vector x - this allows you to rapidly exclude values that can't possibly be included in the sum.
Many of these optimizations (other than the even/odd exclusion) are no longer useful/valid when the vector x contains negative values. In this case, you pretty much have to do an exhaustive search.
As Jilles De Wit points out, if x contains negative numbers you could add the absolute value of the smallest value in x to each member of x (raising the target M by k times that offset accordingly). This would shift all values back into the positive range, making some of the optimizations I describe above possible again. It requires, however, that you are able to accurately represent the values in the enlarged range. One way to achieve this is to internally use a wider type (say long instead of int) to perform the subset selection search. If you do this, remember to shift the result subsets back down by the same offset when you return your results.
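A minimal backtracking sketch in Python (the names are mine; it assumes, as in the example above, that every element is used and the per-subset target is sum(x)/k, and it returns one partition rather than enumerating all of them, which would be a straightforward extension):

def equal_k_subsets(x, k, target):
    # place items one by one into k buckets, pruning any bucket that
    # would exceed the target; returns one valid partition or None
    x = sorted(x, reverse=True)     # placing large items first fails fast
    buckets = [[] for _ in range(k)]
    sums = [0] * k

    def place(i):
        if i == len(x):
            return all(s == target for s in sums)
        tried = set()
        for b in range(k):
            if sums[b] + x[i] <= target and sums[b] not in tried:
                tried.add(sums[b])  # buckets with equal sums are symmetric
                buckets[b].append(x[i])
                sums[b] += x[i]
                if place(i + 1):
                    return True
                sums[b] -= x[i]
                buckets[b].pop()
        return False

    return [sorted(b) for b in buckets] if place(0) else None

print(equal_k_subsets([2, 4, 5, 6, 8, 9, 11, 13, 14], 3, 24))
# prints one of the two solutions listed in the question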

Algorithm - find the smallest subset of cells representing all the rows

I have several lists, that you can consider as rows of integers.
For example :
[1 3 5]
[3 7]
[3 5 7]
[1 5 9]
[3 9]
[1 7]
[5 9 11]
I wish to find the smallest possible set of integers represented on these rows, so that :
each row has at least one of the selected integers,
in case of cardinality ties, select the set having the highest sum.
In my example, I believe the result should be [5 7 9] (preferred to [3 5 7] or [1 3 11] or ... many possibilities).
The second part is trivial (selecting the highest sum), but the generation of all minimum-cardinality subsets seems to be hard.
Do you know a good way to achieve this ?
Edit
Size of data is growing slowly with iterations, but I need exact matches.
The minimum cardinality version is NP-complete: Set Cover can be reduced to it. Requiring the maximum sum among those only makes the problem harder.
Btw, the other answer talking about boolean satisfiability is wrong! You need to reduce boolean satisfiability (or another NP-complete problem) to this problem to show NP-completeness, not the other way round.
Set cover is basically:
Given a collection of sets S1, S2, ..., Sn of subsets of a set X, find the smallest sub-collection (in terms of number of sets) whose union covers all the elements in S1 ∪ S2 ∪ ... ∪ Sn.
To reduce this to our problem:
Let S = S1 ∪ S2 ∪ ... ∪ Sn = {x_1, x_2, ..., x_m}.
Let C_i = { j such that x_i is in Sj }.
Feed the rows C_i to our problem.
Now if our problem were solvable in polynomial time, we could find a minimum-cardinality set of integers hitting every C_i, which would give a minimum set cover for the Sj, and vice versa.
This can normally be solved as an integer programming problem (which is NP-hard too).
For approximate solutions, it can be relaxed to a linear programming problem (which has polynomial-time algorithms), and randomized rounding can be applied to convert the fractional LP solutions to integers.
Also, unfortunately, it has been shown that this problem is NP-hard to approximate to within any constant factor (in fact I believe the best achievable ratio is O(log n)).
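For the small exact instances the question describes, a brute-force search by increasing cardinality with the max-sum tie-break may be all that's needed (the function name is mine; it is exponential in the worst case, and the LP relaxation only pays off at larger scale):

from itertools import combinations

def smallest_hitting_set(rows):
    # try candidate sets of increasing size; among the minimum-size sets
    # hitting every row, return the one with the highest sum
    universe = sorted(set().union(*rows))
    for size in range(1, len(universe) + 1):
        best = None
        for cand in combinations(universe, size):
            chosen = set(cand)
            if all(chosen.intersection(row) for row in rows):
                if best is None or sum(cand) > sum(best):
                    best = cand
        if best is not None:
            return sorted(best)
    return []

rows = [{1, 3, 5}, {3, 7}, {3, 5, 7}, {1, 5, 9}, {3, 9}, {1, 7}, {5, 9, 11}]
print(smallest_hitting_set(rows))  # [5, 7, 9], matching the example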
