Search in locality sensitive hashing - algorithm

I'm trying to understand Section 5 of this paper about LSH, in particular how to bucket the generated hashes. Quoting the linked paper:
Given bit vectors consisting of d bits each, we choose N = O(n^(1/(1+epsilon))) random permutations of the bits. For each random permutation σ, we maintain a sorted order O_σ of the bit vectors, in lexicographic order of the bits permuted by σ. Given a query bit vector q, we find the approximate nearest neighbor by doing the following: For each permutation σ, we perform a binary search on O_σ to locate the two bit vectors closest to q (in the lexicographic order obtained by bits permuted by σ). We now search in each of the sorted orders O_σ, examining elements above and below the position returned by the binary search in order of the length of the longest prefix that matches q. This can be done by maintaining two pointers for each sorted order O_σ (one moves up and the other down). At each step we move one of the pointers up or down corresponding to the element with the longest matching prefix. (Here the length of the longest matching prefix in O_σ is computed relative to q with its bits permuted by σ.) We examine 2N = O(n^(1/(1+epsilon))) bit vectors in this way. Of all the bit vectors examined, we return the one that has the smallest Hamming distance to q.
I'm confused by this algorithm and I don't think I understand how it works.
I already found this question on the topic, but I didn't understand the answer in the comments. Also, in this question (point 2) the same algorithm is described, but again I don't understand how it works.
Can you please explain to me, step by step, how it works, keeping it as simple as possible?
I even tried to make a list of the things I don't understand, but in practice the passage is so badly written that I don't understand most of the sentences!
EDIT (after gsamaras's answer):
I mostly understood the answer, but I still have some doubts:
Is it correct to say that the total cost of performing the N permutations is O(N·n·log n), since we have to sort each one of them?
Is the permutation+sorting process described above performed only once, during pre-processing, or for every query q? It already seems pretty expensive, O(N·n·log n), even in pre-processing; if we have to do this at query time it's a disaster :D
At the last point, where we compare v0 and v4 to q, do we compare their permuted versions or the original ones (before the permutation)?

This question is somewhat broad, so I am just going to give a minimal (abstract) example here:
We have 6 (= n) vectors in our dataset, with d bits each. Let's assume that we do 2 (= N) random permutations.
Let the 1st random permutation begin! Remember that we permute the bits, not the order of the vectors. After permuting the bits, the vectors end up in some order, for example:
v1
v5
v0
v3
v2
v4
Now the query vector q arrives, but it's (almost certainly) not going to be identical to any vector in our dataset (after the permutation), thus we won't find it by performing binary search.
However, the binary search will leave us between two vectors. So now we can imagine the scenario to be like this (for example, q lies between v0 and v3):
v1
v5
v0 <-- up pointer
<-- q lies here
v3 <-- down pointer
v2
v4
Now we move either the up or the down pointer, seeking the vector vi that matches q in the most bits. Let's say it was v0.
Similarly, we do the second permutation and find another candidate vector vi, let's say v4. We now compare v0 from the first permutation with v4, to see which one is closest to q, i.e. which one has the most bits equal to q.
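To make the whole pipeline concrete, here is a rough Python sketch of the scheme (my own illustrative code, not the paper's; for brevity it examines a fixed window of neighbours around the binary-search position instead of walking outward by longest matching prefix):

import random
from bisect import bisect_left

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def build_index(vectors, d, N, seed=0):
    # Pre-processing (done once): choose N random bit permutations and, for each
    # permutation sigma, keep the dataset sorted lexicographically by the permuted bits.
    rng = random.Random(seed)
    index = []
    for _ in range(N):
        sigma = rng.sample(range(d), d)                      # random permutation of bit positions
        order = sorted(vectors, key=lambda v: tuple(v[i] for i in sigma))
        keys = [tuple(v[i] for i in sigma) for v in order]   # permuted keys, kept for binary search
        index.append((sigma, order, keys))
    return index

def query(q, index, probes=4):
    # For each permutation: binary-search for the permuted q, then examine a few
    # neighbours above and below that position. Among everything examined, return
    # the vector with the smallest Hamming distance to q.
    best, best_dist = None, float("inf")
    for sigma, order, keys in index:
        qp = tuple(q[i] for i in sigma)
        pos = bisect_left(keys, qp)
        for v in order[max(0, pos - probes):pos + probes]:
            dist = hamming(v, q)
            if dist < best_dist:
                best, best_dist = v, dist
    return best, best_dist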
Edit:
Is it correct to say that the total cost of performing the N permutations is O(N·n·log n), since we have to sort each one of them?
If they actually sort every permutation from scratch, then yes, but it's not clear to me how they do it.
The permutation+sorting process described above is performed only once during the pre-processing or for every query q?
ONCE.
At the last point, where we compare v0 and v4 to q, we compare their permuted version or the original one (before their permutation)?
I think they do it with the permuted version (see the parenthetical remark before 2N in the paper). But that wouldn't make any difference, since they permute q too, with the same permutation σ.
This quora answer may shed some light too.

Related

Number of subsets whose XOR contains less than two set bits

I have an array A (size ≤ 10^5) of numbers (≤ 10^8), and I need to answer some queries (50,000 of them): for L, R, how many subsets of the elements in the range [L, R] have an XOR that is a number with 0 or 1 set bits (i.e. zero or a power of 2)? Also, point modifications to the array happen in between the queries, so I can't really do offline processing or use techniques like square root decomposition.
I have an approach where I use DP to calculate for a given range, something on the lines of this:
https://www.geeksforgeeks.org/count-number-of-subsets-having-a-particular-xor-value/
But this is clearly too slow. This feels like a classical segment tree problem, but I can't seem to figure out what data to store at each node so that the left child and right child can be combined to compute the answer for a given range.
Yeah, that DP won't be fast enough.
What will be fast enough is applying some linear algebra over GF(2), the Galois field with two elements. Each number can be interpreted as a bit-vector; adding/subtracting vectors is XOR; scalar multiplication isn't really relevant.
The data you need for each segment is (1) how many numbers there are in the segment, and (2) a basis for the subspace generated by the numbers in the segment, which will consist of at most 27 numbers because all numbers are less than 2^27. The basis for a one-element segment is just that number if it's nonzero, else the empty set. To find a basis for the span of the union of two bases, use Gaussian elimination and discard the zero vectors.
Given the length of an interval and a basis for it, you can count the number of good subsets using the rank-nullity theorem. Basically, for each target number (zero and each power of 2), use your Gaussian elimination routine to test whether the target belongs to the subspace. If so, that target is achieved by 2^(length of interval minus size of basis) subsets; if not, it contributes zero.
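To make the bookkeeping concrete, here is a rough Python sketch of these ingredients (the array-indexed basis representation, the names, and MAXBIT = 27 are my own choices; the segment-tree wiring around them is omitted):

MAXBIT = 27  # the values in this problem are below 2^27

def insert(basis, x):
    # basis[i] holds a vector whose leading set bit is i, or 0 if none yet;
    # a leaf holding value x is built as: basis = [0] * MAXBIT; insert(basis, x)
    for i in range(MAXBIT - 1, -1, -1):
        if not (x >> i) & 1:
            continue
        if basis[i] == 0:
            basis[i] = x
            return
        x ^= basis[i]

def merge(a, b):
    # basis spanning the union of two segments (used when combining tree nodes)
    merged = a[:]
    for x in b:
        if x:
            insert(merged, x)
    return merged

def in_span(basis, target):
    # is target an XOR of some subset of the basis vectors?
    for i in range(MAXBIT - 1, -1, -1):
        if (target >> i) & 1:
            target ^= basis[i]
    return target == 0

def count_good_subsets(length, basis):
    # rank-nullity: each representable target is produced by exactly
    # 2^(length - rank) subsets (the count for target 0 includes the empty subset)
    rank = sum(1 for x in basis if x)
    total = 0
    for target in [0] + [1 << i for i in range(MAXBIT)]:
        if in_span(basis, target):
            total += 1 << (length - rank)
    return total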

Ordered lattice point enumeration

Setup: Let e₁, …, eₙ be an orthogonal basis for n-dimensional Euclidean space, but suppose that each eᵢ has irrational (L1) norm. Let L be the set of points obtained by taking linear combinations of the eᵢ with coefficients in the natural numbers (including zero). Now order the points in L first by their L1 norm and then lexicographically.
Question: Is there an efficient algorithm for producing the points in L in increasing order, up to some predefined bound? Note that I do not want to produce the points and then sort them; rather, I want to walk the lattice in order.
Observation: This is easy to do if the eᵢ are an orthonormal basis. For instance, this problem is solved here. In principle something similar would work here, but determining the radii to iterate over is almost as hard as solving the enumeration problem, so it isn't very useful.
How about this:
Let L₁ and L₂ be lists of vectors, where L₁ is the list of visited/processed lattice vectors and L₂ is a list of lists of vectors that will be visited next.
1. Set L₁ = { } and L₂ = {[0]}, where 0 is the zero vector.
2. Let v be the smallest vector of the first list in L₂.
3. Visit/process the vector v.
4. Add the list L = {v+e₁, ..., v+eₙ} to L₂, such that the lists are sorted by their smallest element. Only generate v+eᵢ as long as its norm is smaller than your predefined bound.
5. Insert v at the end of L₁ and remove it from the front of the first list in L₂.
6. If the first list is now empty, remove it from L₂. If not, move it to the correct place.
7. If L₂ is not empty, go to step 2.
This algorithm requires the eᵢ to be sorted by their norm, from smallest to largest.
This algorithm adds at most n vectors to L₂ per round. Let B be your predefined upper bound; then there are at most kⁿ − 1 vectors you are going to visit, where k = 1 + B/||e₁||. For roughly the first k′ⁿ rounds the newly created lists have size n, where k′ = B/||eₙ||. So in total you have to store fewer than N = k′ⁿ + (kⁿ − 1)/(k′ⁿ + 1) lists. You can generate a new list in O(n) and place it in L₂ in O(log N) (binary-search for the correct place and insert it there).
So the overall complexity would be something like O(N·n·log N), but notice that N is about the number of vectors you are looking for.
Notice: most likely there is a faster algorithm, but this is something you can try.
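For concreteness, here is a rough Python sketch of the same idea, using a heap keyed by (L1 norm, coefficient tuple) plus a visited set instead of the list-of-lists bookkeeping above (ties in norm are broken by the coefficient tuple here, which may or may not match the lexicographic order on points you intend):

import heapq

def enumerate_lattice(basis_norms, bound):
    # Yield (norm, coefficient tuple) in increasing order of sum(c_i * ||e_i||),
    # up to `bound`. basis_norms should be sorted from smallest to largest.
    n = len(basis_norms)
    heap = [(0.0, (0,) * n)]
    seen = {(0,) * n}
    while heap:
        norm, coeffs = heapq.heappop(heap)
        yield norm, coeffs
        for i in range(n):
            new_norm = norm + basis_norms[i]
            if new_norm > bound:
                continue
            new_coeffs = coeffs[:i] + (coeffs[i] + 1,) + coeffs[i + 1:]
            if new_coeffs not in seen:
                seen.add(new_coeffs)
                heapq.heappush(heap, (new_norm, new_coeffs))

# Example: two basis vectors with irrational L1 norms sqrt(2) and sqrt(3), bound 5.
for norm, coeffs in enumerate_lattice([2 ** 0.5, 3 ** 0.5], 5.0):
    print(round(norm, 3), coeffs)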

Find the pair of bitstrings with the largest number of common set bits

I want to find an algorithm to find the pair of bitstrings in an array that have the largest number of common set bits (among all pairs in the array). I know it is possible to do this by comparing all pairs of bitstrings in the array, but this is O(n²). Is there a more efficient algorithm? Ideally, I would like the algorithm to work incrementally by processing one incoming bitstring in each iteration.
For example, suppose we have this array of bitstrings (of length 8):
B1:01010001
B2:01101010
B3:01101010
B4:11001010
B5:00110001
The best pair here is B2 and B3, which have four common set bits.
I found a paper that appears to describe such an algorithm (S. Taylor & T. Drummond (2011); "Binary Histogrammed Intensity Patches for Efficient and Robust Matching"; Int. J. Comput. Vis. 94:241–265), but I don't understand this description from page 252:
This can be incrementally updated in each iteration as the only [bitstring] overlaps that need recomputing are those for the new parent feature and any other [bitstrings] in the root whose "most overlapping feature" was one of the two selected for combination. This avoids the need for the O(N²) overlap comparison in every iteration and allows a forest for a typically-sized database of 700 features to be built in under a second.
As far as I can tell, Taylor & Drummond (2011) do not purport to give an O(n) algorithm for finding the pair of bitstrings in an array with the largest number of common set bits. They sketch an argument that a record of the best such pairs can be updated in O(n) after a new bitstring has been added to the array (and two old bitstrings removed).
Certainly the explanation of the algorithm on page 252 is not very clear, and I think their sketch argument that the record can be updated in O(n) is incomplete at best, so I can see why you are confused.
Anyway, here's my best attempt to explain Algorithm 1 from the paper.
Algorithm
The algorithm takes an array of bitstrings and constructs a lookup tree. A lookup tree is a binary forest (set of binary trees) whose leaves are the original bitstrings from the array, whose internal nodes are new bitstrings, and where if node A is a parent of node B, then A & B = A (that is, all the set bits in A are also set in B).
For example, the original answer illustrates this with a small input array of bitstrings and the resulting lookup tree (the figures are not reproduced here).
The algorithm as described in the paper proceeds as follows:
1. Let R be the initial set of bitstrings (the root set).
2. For each bitstring f1 in R that has no partner in R, find and record its partner (the bitstring f2 in R − {f1} which has the largest number of set bits in common with f1) and record the number of bits they have in common.
3. If there is no pair of bitstrings in R with any common set bits, stop.
4. Let f1 and f2 be the pair of bitstrings in R with the largest number of common set bits.
5. Let p = f1 & f2 be the parent of f1 and f2.
6. Remove f1 and f2 from R; add p to R.
7. Go to step 2.
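Here is a rough Python sketch of this loop as I read it (illustrative only; the partner search of step 2 is recomputed naively each iteration, so this is the O(n³) version rather than the paper's incremental one):

def popcount(x):
    return bin(x).count("1")

def build_lookup_forest(bitstrings):
    # Repeatedly merge the pair with the most common set bits; return the
    # list of (parent, child1, child2) merges performed.
    R = list(bitstrings)              # the root set
    merges = []
    while True:
        best = None                   # (common_bits, f1, f2)
        for i in range(len(R)):
            for j in range(i + 1, len(R)):
                common = popcount(R[i] & R[j])
                if common and (best is None or common > best[0]):
                    best = (common, R[i], R[j])
        if best is None:              # no pair shares any set bit: stop
            break
        _, f1, f2 = best
        p = f1 & f2                   # the parent keeps exactly the shared bits
        R.remove(f1)
        R.remove(f2)
        R.append(p)
        merges.append((p, f1, f2))
    return merges

# The example from the question: B2 and B3 (four common set bits) merge first.
strings = [0b01010001, 0b01101010, 0b01101010, 0b11001010, 0b00110001]
for p, a, b in build_lookup_forest(strings):
    print(format(a, "08b"), "+", format(b, "08b"), "->", format(p, "08b"))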
Analysis
Suppose that the array contains n bitstrings of fixed length. Then the algorithm as described is O(n³), because step 2 is O(n²) and there are O(n) iterations (at each iteration we remove two bitstrings from R and add one).
The paper contains an argument that step 2 is Ω(n²) only on the first time around the loop, and on later iterations it is O(n) because we only have to find the partner of p "and any other bitstrings in R whose partner was one of the two selected for combination." However, this argument does not convince me: it is not clear that there are only O(1) other such bitstrings. (Maybe there's a better argument?)
We could bring the algorithm down to O(n²) by storing the number of common set bits between every pair of bitstrings. This requires O(n²) extra space.
Reference
S. Taylor & T. Drummond (2011). "Binary Histogrammed Intensity Patches for Efficient and Robust Matching". Int. J. Comput. Vis. 94:241–265.
Well, for each bit position you could maintain two sets: the bitstrings with that position set and those with it unset. The sets could be stored in two binary trees, for example.
Then you just perform set intersections: first over all eight bit positions, then over every combination of 7, and so on, until you find an intersection with two elements.
The complexity here grows exponentially in the bit size, but if it is small and fixed this isn't a problem.
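For what it's worth, here is a rough Python sketch of this idea (my own code; it is indeed exponential in the bitstring width):

from itertools import combinations

def best_pair_by_positions(bitstrings, width=8):
    # index[i] = indices of the bitstrings that have bit i set
    index = [set() for _ in range(width)]
    for idx, s in enumerate(bitstrings):
        for i in range(width):
            if (s >> i) & 1:
                index[i].add(idx)
    # Try position subsets from largest to smallest; the first intersection
    # containing two strings gives a best pair.
    for size in range(width, 0, -1):
        for positions in combinations(range(width), size):
            common = set.intersection(*(index[i] for i in positions))
            if len(common) >= 2:
                return sorted(common)[:2], size
    return None, 0

# The example from the question: returns ([1, 2], 4), i.e. B2 and B3 with 4 common bits.
print(best_pair_by_positions([0b01010001, 0b01101010, 0b01101010, 0b11001010, 0b00110001]))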
Another way to do it might be to look at the n k-bit strings as n points in a kD space, and your task is to find the two points closest together. There are a number of geometric algorithms to do this.

Divide and conquer on sorted input with Haskell

For a part of a divide and conquer algorithm, I have the following question where the data structure is not fixed, so set is not to be taken literally:
Given a set X sorted with respect to some ordering of its elements, and subsets A and B that together contain all elements of X, can sorted versions A' and B' of A and B be constructed in time linear in the number of elements of X?
At the moment I am doing a standard sort at each recursive step giving the recursion
T(n) = 2*T(n/2) + O(n*log n)
for the complexity rather than
T(n) = 2*T(n/2) + O(n)
like in the procedural version, where one can utilize a structure with constant-time lookup on A and B to form A' and B' in linear time.
The added log n factor carries over to the overall complexity, giving O(n·(log n)²) instead of O(n·log n).
EDIT:
Perhaps I am misunderstanding the term lookup. The creation of A' and B' in linear time is easy to do if membership in A and B can be checked in constant time.
I didn't succeed in my attempt at making things clearer by abstracting away the specifics, so here is the actual problem:
I am implementing the algorithm for the closest pair problem. Given a finite collection P of points in the plane, it finds a pair of points in P with the minimal distance. It works roughly as follows: if P has at least 4 points, form Px and Py, the points in P sorted by x- and y-coordinate. By splitting Px, form L and R, the left- and right-most halves of the points. Recursively compute the closest pair distance in L and R, and let d be the minimum of the two. Now the minimum distance in P is either d or the distance from a point in L to a point in R. If the minimal distance is between points from separate halves, it will appear between a pair of points lying in the strip of width 2*d centered around the line x = x0, where x0 is the x-coordinate of a right-most point in L. It turns out that to find a potential minimal-distance pair in the strip, it is enough to compute, for every point in the strip, its distance to the seven following points, provided the strip points are kept in a collection sorted by y-coordinate.
It is in the steps that form the sorted collections to pass into the recursion, and that sort the strip points by y-coordinate, where I don't see how to exploit, in Haskell, having sorted P at the beginning of the recursion.
The following function may interest you:
partition :: (a -> Bool) -> [a] -> ([a], [a])
partition f xs = (filter f xs, filter (not . f) xs)
If you can compute set-membership in constant time, that is, there is a predicate of type a -> Bool that runs in constant time, then partition will run in time linear in the length of its input list. Furthermore, partition is stable, so that if its input list is sorted, then so are both output lists.
I would also like to point out that the above definition is meant to give the semantics of partition only; the real implementation in GHC walks its input list only once, even if the entire output is forced.
Of course, the real crux of the question is providing a constant-time predicate. The way you phrased the question leaves sets A and B quite unstructured -- you demand that we can handle any particular partitioning. In that case, I don't know of any particularly Haskell-y way of doing constant-time lookup in arbitrary sets. However, often these problems are a bit more structured: often, rather than set-membership, you are actually interested in whether some easily-computable property holds or not. In this case, the above is just what the doctor ordered.
I know very, very little about Haskell, but here's a shot anyway.
Given that (A+B) == X, can't you just iterate through X (in sorted order) and append each element to A' or B' according to whether it lies in A or B? Given constant-time lookup of an element x in the sets A and B, that would be linear.

Generate all subset sums within a range faster than O((k+N) * 2^(N/2))?

Is there a way to generate all of the subset sums s1, s2, ..., sk that fall in a range [A, B] faster than O((k+N)·2^(N/2)), where k is the number of sums there are in [A, B]? Note that k is only known after we have enumerated all subset sums within [A, B].
I'm currently using a modified Horowitz-Sahni algorithm. For example, I first call it for the smallest sum greater than or equal to A, giving me s1. Then I call it again for the next smallest sum greater than s1, giving me s2. I repeat this until I find a sum s(k+1) greater than B. There is a lot of computation repeated between iterations, even without rebuilding the initial two 2^(N/2)-element lists, so is there a way to do better?
In my problem, N is about 15, and the magnitude of the numbers is on the order of millions, so I haven't considered the dynamic programming route.
Check out the subset sum problem on Wikipedia. As far as I know, it's the fastest known algorithm, and it runs in O(2^(N/2)) time.
Edit:
If you're looking for multiple possible sums, instead of just 0, you can save the end arrays and just iterate through them again (which is roughly an O(2^(N/2)) operation) and save re-computing them. The set of all possible subset sums doesn't change with the target.
Edit again:
I'm not wholly sure what you want. Are we running K searches for one independent value each, or looking for any subset that has a value in a specific range that is K wide? Or are you trying to approximate the second by using the first?
Edit in response:
Yes, you do get a lot of duplicate work even without rebuilding the list. But if you don't rebuild the list, that's not O(k * N * 2^(N/2)). Building the list is O(N * 2^(N/2)).
If you know A and B right now, you could begin iteration, and then simply not stop when you find the right answer (the bottom bound), but keep going until it goes out of range. That should be roughly the same as solving subset sum for just one solution, involving only +k more ops, and when you're done, you can ditch the list.
More edit:
You have a range of sums, from A to B. First, you solve subset sum problem for A. Then, you just keep iterating and storing the results, until you find the solution for B, at which point you stop. Now you have every sum between A and B in a single run, and it will only cost you one subset sum problem solve plus K operations for K values in the range A to B, which is linear and nice and fast.
s = *i + *j; if s > B then ++i; else if s < A then ++j; else { print s; ... what_goes_here? ... }
No, no, no. I get the source of your confusion now (I misread something), but it's still not as complex as what you had originally. If you want to find ALL combinations within the range, instead of one, you will just have to iterate over all combinations of both lists, which isn't too bad.
Excuse my use of auto. C++0x compiler.
std::vector<int> sums;
std::vector<int> firstlist;
std::vector<int> secondlist;
// Fill in first/secondlist.
std::sort(firstlist.begin(), firstlist.end());
std::sort(secondlist.begin(), secondlist.end());
auto firstit = firstlist.begin();
// Since we want all in a range, rather than just the first, we need to check all combinations. Horowitz/Sahni is only designed to find one.
for (; firstit != firstlist.end(); ++firstit) {
    // Restart the inner iterator for every element of the first list.
    for (auto secondit = secondlist.begin(); secondit != secondlist.end(); ++secondit) {
        int sum = *firstit + *secondit;
        if (sum >= A && sum <= B)   // the target range [A, B] is inclusive
            sums.push_back(sum);
    }
}
It's still not great. But it could be optimized if you know in advance that N is very large, for example by mapping or hash-mapping sums to iterators, so that any given firstit can find all suitable partners in secondlist, reducing the running time.
It is possible to do this in O(N·2^(N/2)), using ideas similar to Horowitz-Sahni, but we try to do some optimizations to reduce the constants in the big-O.
We do the following
Step 1: Split into sets of N/2, and generate all possible 2^(N/2) sets for each split. Call them S1 and S2. This we can do in O(2^(N/2)) (note: the N factor is missing here, due to an optimization we can do).
Step 2: Next sort the larger of S1 and S2 (say S1) in O(N*2^(N/2)) time (we optimize here by not sorting both).
Step 3: Find Subset sums in range [A,B] in S1 using binary search (as it is sorted).
Step 4: Next, for each sum in S2, use binary search to find the sets in S1 whose sum combined with it lies in the range [A,B]. This is O(N*2^(N/2)). At the same time, check whether the corresponding sum from S2 alone is in the range [A,B]. The optimization here is to combine loops. Note: This gives you a representation of the sets (as a pair of indexes, one into S1 and one into S2), not the sets themselves. If you want all the sets, this becomes O(K + N*2^(N/2)), where K is the number of sets.
Further optimizations might be possible, for instance when sum from S2, is negative, we don't consider sums < A etc.
Since Steps 2,3,4 should be pretty clear, I will elaborate further on how to get Step 1 done in O(2^(N/2)) time.
For this, we use the concept of Gray Codes. Gray codes are a sequence of binary bit patterns in which each pattern differs from the previous pattern in exactly one bit.
Example: 00 -> 01 -> 11 -> 10 is a gray code with 2 bits.
There are gray codes which go through all possible N/2 bit numbers and these can be generated iteratively (see the wiki page I linked to), in O(1) time for each step (total O(2^(N/2)) steps), given the previous bit pattern, i.e. given current bit pattern, we can generate the next bit pattern in O(1) time.
This enables us to form all the subset sums, by using the previous sum and changing that by just adding or subtracting one number (corresponding to the differing bit position) to get the next sum.
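A small sketch of what Step 1 looks like in Python (illustrative only): with the reflected Gray code, each successive subset differs from the previous one in a single element, so every new sum costs one addition or subtraction.

def subset_sums_gray(nums):
    # Yield (sum, bitmask) for all 2^len(nums) subsets in Gray-code order.
    total = 0
    mask = 0
    yield total, mask
    for i in range(1, 2 ** len(nums)):
        # The bit that flips between Gray codes i-1 and i is the lowest set bit of i.
        bit = (i & -i).bit_length() - 1
        if mask & (1 << bit):
            total -= nums[bit]
        else:
            total += nums[bit]
        mask ^= 1 << bit
        yield total, mask

print(sorted(s for s, _ in subset_sums_gray([3, 5, 8])))   # [0, 3, 5, 8, 8, 11, 13, 16]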
If you modify the Horowitz-Sahni algorithm in the right way, then it's hardly slower than the original Horowitz-Sahni. Recall that Horowitz-Sahni works with two lists of subset sums: sums of subsets in the left half of the original list, and sums of subsets in the right half. Call these two lists of sums L and R. To obtain subsets that sum to some fixed value A, you can sort R, and then look up a number in R that matches each number in L using a binary search. However, the algorithm is asymmetric only to save a constant factor in space and time. It's a good idea for this problem to sort both L and R.
In my code below I also reverse L. Then you can keep two pointers into R, updated for each entry in L: A pointer to the last entry in R that's too low, and a pointer to the first entry in R that's too high. When you advance to the next entry in L, each pointer might either move forward or stay put, but they won't have to move backwards. Thus, the second stage of the Horowitz-Sahni algorithm only takes linear time in the data generated in the first stage, plus linear time in the length of the output. Up to a constant factor, you can't do better than that (once you have committed to this meet-in-the-middle algorithm).
Here is Python code with example input:
# Input
terms = [29371, 108810, 124019, 267363, 298330, 368607,
         438140, 453243, 515250, 575143, 695146, 840979, 868052, 999760]
(A, B) = (500000, 600000)

# Subset iterator stolen from Sage
def subsets(X):
    yield []
    pairs = []
    for x in X:
        pairs.append((2**len(pairs), x))
        for w in range(2**(len(pairs)-1), 2**len(pairs)):
            yield [x for m, x in pairs if m & w]

# Modified Horowitz-Sahni with toolow and toohigh indices
L = sorted([(sum(S), S) for S in subsets(terms[:len(terms)//2])])
R = sorted([(sum(S), S) for S in subsets(terms[len(terms)//2:])])
(toolow, toohigh) = (-1, 0)
for (Lsum, S) in reversed(L):
    # Both pointers only ever move forward, because Lsum is decreasing.
    while toolow < len(R)-1 and R[toolow+1][0] < A-Lsum:
        toolow += 1
    while toohigh < len(R) and R[toohigh][0] <= B-Lsum:
        toohigh += 1
    for n in range(toolow+1, toohigh):
        print('+'.join(map(str, S + R[n][1])), '=', sum(S + R[n][1]))
"Moron" (I think he should change his user name) raises the reasonable issue of optimizing the algorithm a little further by skipping one of the sorts. Actually, because each list L and R is a list of sizes of subsets, you can do a combined generate and sort of each one in linear time! (That is, linear in the lengths of the lists.) L is the union of two lists of sums, those that include the first term, term[0], and those that don't. So actually you should just make one of these halves in sorted form, add a constant, and then do a merge of the two sorted lists. If you apply this idea recursively, you save a logarithmic factor in the time to make a sorted L, i.e., a factor of N in the original variable of the problem. This gives a good reason to sort both lists as you generate them. If you only sort one list, you have some binary searches that could reintroduce that factor of N; at best you have to optimize them somehow.
At first glance, a factor of O(N) could still be there for a different reason: If you want not just the subset sum, but the subset that makes the sum, then it looks like O(N) time and space to store each subset in L and in R. However, there is a data-sharing trick that also gets rid of that factor of O(N). The first step of the trick is to store each subset of the left or right half as a linked list of bits (1 if a term is included, 0 if it is not included). Then, when the list L is doubled in size as in the previous paragraph, the two linked lists for a subset and its partner can be shared, except at the head:
0
|
v
1 -> 1 -> 0 -> ...
Actually, this linked list trick is an artifact of the cost model and never truly helpful. Because, in order to have pointers in a RAM architecture with O(1) cost, you have to define data words with O(log(memory)) bits. But if you have data words of this size, you might as well store each word as a single bit vector rather than with this pointer structure. I.e., if you need less than a gigaword of memory, then you can store each subset in a 32-bit word. If you need more than a gigaword, then you have a 64-bit architecture or an emulation of it (or maybe 48 bits), and you can still store each subset in one word. If you patch the RAM cost model to take account of word size, then this factor of N was never really there anyway.
So, interestingly, the time complexity for the original Horowitz-Sahni algorithm isn't O(N*2^(N/2)), it's O(2^(N/2)). Likewise the time complexity for this problem is O(K+2^(N/2)), where K is the length of the output.
