Fast computation of pairs with least hamming distance - algorithm

Problem
Suppose you have N (~100k-1m) integers/bitstrings each K (e.g. 256) bits long. The algorithm should return the k pairs with the lowest pairwise Hamming distance.
Example
N = 4
K = 8
i1 = 00010011
i2 = 01010101
i3 = 11000000
i4 = 11000011
HammingDistance(i1,i2) = 3
HammingDistance(i1,i3) = 5
HammingDistance(i1,i4) = 3
HammingDistance(i2,i3) = 4
HammingDistance(i2,i4) = 4
HammingDistance(i3,i4) = 2
For k=1 it should return the pairlist {(i3,i4)}. For k=3 it should return {(i1,i2), (i1,i4), (i3,i4)}. And so on.
Algorithm
The naive implementation computes all pairwise distances, sorts the pairs and returns the k with the lowest distance: O(N^2). Are there any better data structures or algorithms? It looks like the ideas from Efficiently find binary strings with low Hamming distance in large set can not be used since there is no single query integer.

The recent paper "The Closest Pair Problem under the Hamming Metric" has only algorithms involving an n^2 factor (unless K is very large). That is even for finding only a single pair. So it seems that it is hard to improve this unless you make further assumptions about the structure of your instances. For example, if you assume the Hamming distance is not very large, you could sample a few columns, hash the strings into buckets according to these under the assumptions that these columns match exactly, and then do pairwise comparison in each bucket separately. Repeat this for another set of random columns to minimize the probability you miss some pairs.

Related

Find the m by m square that contains the most "conflicting pairs"?

There are two types of units on a 2d plane, green units (G) and red units (R).
The plane is represented as an n by n matrix, each unit is represented as an element in the matrix.
A pair of two units is called a "conflicting pair" if the two are of different colours. The goal is to find the m by m submatrix that contains the most "conflicting pairs".
Example
[R R 0 0 0
R R 0 0 0
0 0 R R 0
0 0 0 G G
0 0 0 G G]
In the above 5 by 5 matrix, the "most conflicting" 3 by 3 submatrix is at the lower right corner, where there are two red units and four green units, which amounts to 8 conflicting pairs within the submatrix.
A naive solution will take O(m^2n^2) for iterating every element in every possible submatrix.
I also thought of using dynamic programming like the Summed-area table algorithm, the time complexity will then be O(n^2), which looks good since it's already O(n^2) for scanning each element once.
However the n by n matrix may be large and sparse and given in a sparse format (like CSR), in that case an O(n^2) algorithm may not be efficient. Any suggeststions on how do I do better for sparse matrices (and dense matrices)?
If you have k non-empty cells (with R or G) then you can solve with time complexity O(k^2) (squeeze the matrix) because optimal submatrix has one non-empty cell on the border of the matrix.
Or time complexity maybe O(k * (log n)^2) if use two dimension sparse segments tree for getting sum on a rectangle.
The answer is given by
idx = argmax SUM(X_r,m) * SUM(X_g,m)
where SUM(X,m) returns a matrix with the summation of units in each m x m window, X_r and X_g are the matrices with only red and green units enabled respectively, and idx is the m x m window with the largest number of conflicting nodes.
The question then becomes can SUM(X,m) be more efficiently calculated for sparse matrices. I think the answer is: it really depends on the structure of X and the value of m.
An obvious way to make use of the sparsity of X is to compute SUM(X,m) by using the identity
SUM(X,m) = transpose(SUM1d( transpose(SUM1d(X,m) ), m )) (1)
where SUM1d(X,m) is the results of summing intervals of length m along rows of X. Clearly, SUM1d can be implemented in O(n) time for each row, and O(n^2) for the entire matrix, in a similar fashion to the Sum-Area-Table algorithm. This yields the same complexity O(n^2) for the entire algorithm. But that is rather uninteresting as it's the same runtime as a Sum-Area-Table algorithm.
What is interesting is asking whether SUM1d(X,m) can be implemented to take advantage of any sparsity of X. It's clear that SUM1d can be implemented to take full advantage of the sparsity of the input matrix; however, depending on the structure of X and the size of m the output matrix may not be sparse.
Assuming, m is much less than n then implementing SUM1d(X,m) as described in eq (1) above can be done in O(nz_row) time where nz_row is the max number of non-zero elements on any of the rows of X. Furthermore, SUM1d(X,m) will produce a sparse matrix, albeit with O(m) less sparsity. Since we assume m is much less than n this is still a sparse matrix and will still translate to efficiency gains.
Therefore, we should expect O(n*nz_row) for the first call to SUM1d in eq (1) and O(n*m*nz_col) for the second call to SUM1d.

Generate random intervals with given density and overlapping

I am in search of algorithm (possibly, approximate) that will generate test data.
We have large integer interval [0, n), n may be up to 10^9. We want to generate a number of smaller intervals (possibly, overlapping) of length k each, all of which fit inside this large interval and also satisfy following properties:
Number of "cells" covered by these intervals divided by n must be equal to density (<=1.0)
Every cell covered by at least one interval is actually covered by overlapping (>=1.0) intervals on average. E.g. degenerate case of overlap_factor=1.0 means that no two intervals intersect.
Interval positions should be distributed uniformly randomly in all other respects
Achieving both (1) and (2) is what makes this problem difficult. The algorithm should produce an array of interval positions.
Image below demonstrates one of the solutions for n=20, k=4, density=0.5, overlapping=1.6:
◼◼◼◼
◼◼◼◼
◼◼◼◼ ◼◼◼◼
↓↓↓
◻◻◻◼◼◼◼◼◼◻◻◻◼◼◼◼◻◻◻◻
0 19
density = 10/20 = 0.5
overlapping = 4*4/10 = 1.6
Real-world applications will opearte with larger values: n ≤ 10^9, k ∈ [1 .. 10^6], density ∈ [0.01 .. 1.0], overlapping ∈ [1.0..5.0].
Because this algorithm is intended to generate test data, approximate solution would be fine.

minimize variance of k integers from n ordered integers

Given a series of n integers and a number k, n>k, what's the solution of minimizing the variance of k new integers? You may add up any successive integers to a new integer and thus reduce n integers to k integers.
Here is an example. Given n=4, k=2, the series of integers are 4,4,1,1. The solution is 4,6 instead of 8,2 or 9,1.
I have come up with a greedy algorithm which goes like this: for every possible new integers, minimize the absolute value of the difference of this integer and the average of all the integers. But this won't work in some cases. Is there any efficient algorithm works?
The variance of a random variable X is E[(X - E[X])^2]. Here X is a random element of the output list. We know that E[X] is equal to the sum of the input numbers divided by k, so this objective is equivalent to the sum of (x - sum/k)^2 over output values x. This can be accomplished by slightly modifying a word wrap algorithm: Word wrap to X lines instead of maximum width (Least raggedness)

Partitioning a list of integers to minimize difference of their sums

Given a list of integers l, how can I partition it into 2 lists a and b such that d(a,b) = abs(sum(a) - sum(b)) is minimum. I know the problem is NP-complete, so I am looking for a pseudo-polynomial time algorithm i.e. O(c*n) where c = sum(l map abs). I looked at Wikipedia but the algorithm there is to partition it into exact halves which is a special case of what I am looking for...
EDIT:
To clarify, I am looking for the exact partitions a and b and not just the resulting minimum difference d(a, b)
To generalize, what is a pseudo-polynomial time algorithm to partition a list of n numbers into k groups g1, g2 ...gk such that (max(S) - min(S)).abs is as small as possible where S = [sum(g1), sum(g2), ... sum(gk)]
A naive, trivial and still pseudo-polynomial solution would be to use the existing solution to subset-sum, and repeat for sum(array)/2to 0 (and return the first one found).
Complexity of this solution will be O(W^2*n) where W is the sum of the array.
pseudo code:
for cand from sum(array)/2 to 0 descending:
subset <- subsetSumSolver(array,cand)
if subset != null:
return subset
The above will return the maximal subset that is lower/equals sum(array)/2, and the other part is the complement for the returned subset.
However, the dynamic programming for subset-sum should be enough.
Recall that the formula is:
f(0,i) = true
f(x,0) = false | x != 0
f(x,i) = f(x-arr[i],i-1) OR f(x,i-1)
When building the matrix, the above actually creates you each row with value lower than the initial x, if you input sum(array)/2 - it's basically all values.
After you generate the DP matrix, just find the maximal value of x such that f(x,n)=true, and this is the best partition you can get.
Complexity in this case is O(Wn)
You can phrase this as a 0/1 integer linear programming optimization problem. Let wi be the ith number, and let xi be a 0/1 variable which indicates whether wi is in the first set or not. Then you want to minimize sum(xi wi) - sum((1 - xi) wi) subject to
sum(xi wi) >= sum((1 - xi) wi)
and also subject to all xi being 0 or 1. There has been a lot of research into optimizing 0/1 linear programming solvers. For large total sum W this may be an improvement over the O(W n) pseudo-polynomial time algorithm presented because the W factor is scary.
My first thought is to:
Sort list of integers
Create two empty lists A and B
While iterating from biggest integer to smallest integer...add next integer to the list with the smallest current sum.
This is, of course, not guaranteed to give you the best result but you can bound the result it will give you by the size of the biggest integer in your list

Algorithm for making two histograms proportional, minimizing units removed

Imagine you have two histograms with an equal number of bins. N observations are distributed among the bins. Each bin now has between 0 and N observations.
What algorithm would be appropriate for determining the minimum number of observations to remove from both histograms in order to make them proportional? They do not need to be equal in absolute number, only proportional to each other. That is, there must be a common factor by which all the bins in one histogram can be multiplied in order to make it equal to the other histogram.
For example, imagine the following two histograms, where the item i in each histogram refers to the number of observations in bin i for the respective histogram.
Histogram 1: 4, 7, 4, 9
Histogram 2: 2, 0, 2, 1
For these histograms, the solution would be to remove from histogram 1 all 7 observations in bin 2 and another 7 observations from bin 4, such that (histogram 1)*2 = histogram 2.
But what general algorithm could be used to find the subsets of the two histograms that maximized the number of total observations between them while making them proportional? You can drop observations from both histograms or just one.
Thanks!
Seems to me that the problem is equivalent (if you consider each histogram as a N-dimensional vector), to minimizing the Manhattan length |R|, where R=xA-B, A and B are your 'vectors' and x is your proportional scale.
|R| has a single minimum (not necessarily an integer) so you can find it fairly rapidly using a simple bisection algorithm (or something akin to Newton's method).
Then, assuming you want a solution where the proportion is an integer, test the two cases ceil(x), and floor(x), to find which has the smallest Manhattan length (and that is the number of observations you need to remove).
Proof that the problem is not NP-hard:
Consider an inefficient 'solution' whereby you removed all N observations from all the bins. Now both A and B are equal to the 'zero' histogram 0 = (0,0,0,...). The two histograms are equal and thus proportional as 0 = s * 0 for all proportional values s, so a hard maximum for the number of observations to remove is N.
Now assume a more efficient solution exists with assitions/removals < N and a proportional scale s > 2*N (i.e after removal of some observations A = N * B or B=N * A ). If both A = 0 and B = 0, we have the previous solution with N removals (which contradicts the assumption that there are less than N removals). If A = 0 and B ≠ 0 then there is no s <> 0 such that 0 = s * B and no s such that s * 0 = B (with a similar argument for B = 0 and S ≠ 0). So it must be the case that both A ≠ 0 and B ≠ 0. Assume for a moment that A is the histogram to be scaled (so A * s = B), A must have at least one non-zero entry A[i] with minimum value 1 (after removal of extra observations), so when scaled it will have minimum value ≥. Therefore the equivalent entry B[i] must also have at least 2*N observations. But the total number of observations was initially N, so we have needed to add at least N observations to B[i], which contradicts the assumption that the improved solution had less than N additions/removals. So no 'efficient' solution requires a proportional scale greater than N.
So to find an efficient solution requires, at worst, testing the 'best fit' solution for scaling factors in the range 0-N.
The 'best fit' solution for scaling factor s in A = s * B, where A and B have M bins each requires
Sum(i=1 to M) of { Abs(A[i]- s * B[i]) mod s + Abs(A[i]- s * B[i]) div s } additions/removals.
This is an order M operation, so to test for each scaling factor in the range 0-N will be an algorithm of order O(M*N)
I am fairly certain (but haven't got a formal proof), that the scale factor cannot exceed the number of observations in the most filled bin. In practice it is typically very much smaller. For two histograms with two hundred bins and randomly chosen 30-300 observations per bin: if there were Na > Nb total observations in all the bins of A and B respectively the scaling factor was either almost always found in the range Na/Nb-4 < s < Na/Nb + 4, (or s = 0 if Na >> Nb).

Resources