Index structure for top-k queries on bitstrings - algorithm

Given an array of bitstrings (all of the same length) and a query string Q, find the top-k strings most similar to Q, where the similarity between strings A and B is defined as the number of 1 bits in A AND B (the AND operation is applied bitwise).
I think there should be a classical result for this problem.
k is small, in the hundreds, while the number of vectors is in the hundreds of millions and the length of each vector is 512 or 1024.

One way to tackle this problem is to construct a K-Nearest Neighbor Graph (K-NNG) (digraph) with a Russell-Rao similarity function.
Note that efficient K-NNG construction is still an open problem, and none of the known solutions for this problem is general, efficient and scalable [quoting from Efficient K-Nearest Neighbor Graph Construction for Generic Similarity Measures - Dong, Charikar, Li 2011].
Your distance function is often called Russell-Rao similarity (see for example A Survey of Binary Similarity and Distance Measures - Choi, Cha, Tappert 2010). Note that Russell-Rao similarity is not a metric (see Properties of Binary Vector Dissimilarity Measures - Zhang, Srihari 2003): The "if" part of "d(x, y) = 0 iff x == y" is false.
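To make the non-metric remark concrete, here is a minimal sketch. It assumes the usual Russell-Rao dissimilarity, 1 minus the number of shared 1 bits divided by the vector length; the function name and the use of Python integers as bitstrings are illustrative choices, not something taken from the cited papers.

```python
def russell_rao_dissimilarity(x: int, y: int, n_bits: int) -> float:
    """1 - (number of positions where both x and y have a 1) / n_bits."""
    return 1.0 - bin(x & y).count("1") / n_bits


x = 0b1010
# A vector compared with itself does not get dissimilarity 0
# unless every one of its bits is set.
print(russell_rao_dissimilarity(x, x, 4))            # 0.5
print(russell_rao_dissimilarity(0b1111, 0b1111, 4))  # 0.0
```

So x == y does not imply d(x, y) = 0, which is why index structures that rely on metric properties do not apply directly.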
In A Fast Algorithm for Finding k-Nearest Neighbors with Non-metric Dissimilarity - Zhang, Srihari 2002, the authors propose a fast hierarchical search algorithm to find k-NNs using a non-metric measure in a binary vector space. They use a parametric binary vector distance function D(β). When β=0, this function reduces to the Russell-Rao distance function. I wouldn't call it a "classical result", but this is the only paper I could find that examines this problem.
You may want to check these two surveys: On nonmetric similarity search problems in complex domains - Skopal, Bustos 2011 and A Survey on Nearest Neighbor Search Methods - Reza, Ghahremani, Naderi 2014. Maybe you'll find something I missed.

This problem can be solved by writing a simple Map and Reduce job. I'm not claiming that this is the best solution, nor that it is the only one.
Also, you have disclosed in the comments that k is in the hundreds, that there are hundreds of millions of bitstrings, and that the size of each of them is 512 or 1024.
Mapper pseudo-code:
Given Q;
For every bitstring b, compute similarity = popcount(b & Q), i.e. the number of 1 bits in b & Q
Emit (similarity, b)
Now, the combiner can consolidate the list of all bitStrings from every mapper that have the same similarity.
Reducer pseudo-code:
Consume (similarity, listOfBitStringsWithThisSimilarity);
Output them in decreasing order of similarity value.
From the output of the reducer you can extract the top-k bitstrings.
So, the MapReduce paradigm is probably the classical solution that you are looking for.
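The mapper and reducer steps above can be sketched in a single process as follows. This is only an illustration under assumptions: bitstrings are held as Python integers, k is at least 1, and the function names are made up for the example. A real job would run the mapper over shards on a cluster. The heap-based variant is an addition that avoids grouping and sorting every similarity value when only k results are needed.

```python
import heapq
from collections import defaultdict


def popcount(x: int) -> int:
    """Number of 1 bits in x."""
    return bin(x).count("1")


def mapper(bitstrings, q):
    """Emit (similarity, bitstring) for every bitstring."""
    for b in bitstrings:
        yield popcount(b & q), b


def reducer(pairs, k):
    """Group by similarity, then output in decreasing order until k items."""
    groups = defaultdict(list)
    for similarity, b in pairs:
        groups[similarity].append(b)
    result = []
    for similarity in sorted(groups, reverse=True):
        for b in groups[similarity]:
            result.append((similarity, b))
            if len(result) == k:
                return result
    return result


def top_k_heap(bitstrings, q, k):
    """Same answer for distinct similarities, using a bounded heap."""
    return heapq.nlargest(k, ((popcount(b & q), b) for b in bitstrings))


data = [0b1100, 0b1111, 0b0001, 0b1010]
query = 0b1110
print(reducer(mapper(data, query), 2))  # [(3, 15), (2, 12)]
```

Note that this is still a full scan per query, so it spreads the work over machines rather than avoiding it with an index.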

Related

Set distance as similarity metric for MinHashing algorithm

I am currently working on document clustering using the MinHashing technique. However, I am not getting the desired results, as MinHash is a rough estimate of Jaccard similarity and it doesn't suit my requirements.
This is my scenario:
I have a huge set of books, and if a single page is given as a query, I need to find the book from which this page was taken. The limitation is that I have features for the entire book, and it's impossible to get page-by-page features for the books. In this case, Jaccard similarity gives poor results if the book is too big. What I really want is the distance between the query page and the books (not vice versa). That is:
Given 2 sets A, B: I want the distance from A to B,
dis(A->B) = (A & B)/A
Is there a similar distance metric that gives the distance from set A to set B? Further, is it still possible to use the MinHashing algorithm with this kind of similarity metric?
We can estimate your proposed distance function using an approach similar to the MinHash algorithm.
For some hash function h(x), compute the minimal values of h over A and B. Denote these values h_min(A) and h_min(B). The MinHash algorithm relies on the fact that the probability that h_min(A) = h_min(B) is (A & B) / (A | B), where each set expression stands for the size of that set. We may observe that the probability that h_min(A) <= h_min(B) is A / (A | B), because this event occurs exactly when the minimum of h over A | B is attained by an element of A. We can then compute (A & B) / A as the ratio of these two probabilities.
Like in the regular MinHash algorithm, we can approximate these probabilities by repeated sampling until the desired variance is achieved.
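A minimal sketch of this estimator, under assumptions: the seeded hash built from hashlib.blake2b, the function names and the default number of hash functions are all illustrative choices, and both sets are assumed to be non-empty. It counts how often the two minima are equal and how often the minimum over A is not larger than the minimum over B, then returns the ratio of the two counts.

```python
import hashlib


def h(seed: int, item) -> int:
    """Seeded 64-bit hash; each seed acts as a different hash function."""
    data = f"{seed}:{item}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def estimate_containment(a, b, num_hashes=2000):
    """Estimate |A & B| / |A| from MinHash-style samples."""
    equal = 0   # samples with h_min(A) == h_min(B)
    a_le_b = 0  # samples with h_min(A) <= h_min(B)
    for seed in range(num_hashes):
        min_a = min(h(seed, x) for x in a)
        min_b = min(h(seed, x) for x in b)
        if min_a == min_b:
            equal += 1
        if min_a <= min_b:
            a_le_b += 1
    return equal / a_le_b if a_le_b else 0.0


page = set(range(50))       # hypothetical query page features
book = set(range(25, 225))  # hypothetical book features
# True value is 25 / 50 = 0.5; the estimate should land close to it.
print(estimate_containment(page, book))
```

In practice the minima for each book would be precomputed once per hash function and stored, so a query only needs the minima of the page.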

Find all k-nearest neighbors

Problem:
I have N (~100m) strings, each D (e.g. 100) characters long, over a small alphabet (e.g. 4 possible characters). I would like to find the k nearest neighbors for every one of those N points (k ~ 0.1D). Neighbors are defined via Hamming distance. The solution doesn't have to be the best possible, but the closer the better.
Thoughts about the problem
I have a bad feeling that this is a non-trivial problem. I have read many papers and algorithms, but most of them give poor results in high dimensions and only work well when the dimension is less than 5. For example, this paper suggests an efficient algorithm, but its constant depends exponentially on the dimension.
Currently, I am investigating how to reduce the dimension in a way that the Hamming distance is preserved or can still be computed.
Another option is locality-sensitive hashing: points that are close to each other under the chosen metric are mapped to the same bucket with high probability. Any help? Which option do you prefer?
One of the previously asked questions has some good discussion, so you can refer to that:
Nearest neighbors in high-dimensional data?
Other than this, you can also look at:
http://web.cs.swarthmore.edu/~adanner/cs97/s08/papers/dahl_wootters.pdf
A few papers that analyze different approaches:
http://www.jmlr.org/papers/volume11/radovanovic10a/radovanovic10a.pdf
https://www.cse.ust.hk/~yike/sigmod09-lsb.pdf
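As an illustration of the bucketing idea raised in the question, here is a small sketch. It is not taken from the papers linked above, and it swaps random hash functions for disjoint blocks of positions: the D positions are split into num_blocks disjoint groups, and each string is stored under the characters it has in each group. By the pigeonhole principle, any string that differs from the query in fewer than num_blocks positions agrees with it exactly on at least one block, so it is guaranteed to appear among the candidates. All names and parameters here are assumptions made for the example.

```python
import random
from collections import defaultdict


def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))


def build_tables(strings, num_blocks, seed=0):
    """One hash table per disjoint block of positions."""
    positions = list(range(len(strings[0])))
    random.Random(seed).shuffle(positions)
    blocks = [positions[i::num_blocks] for i in range(num_blocks)]
    tables = []
    for block in blocks:
        table = defaultdict(list)
        for idx, s in enumerate(strings):
            table["".join(s[p] for p in block)].append(idx)
        tables.append((block, table))
    return tables


def query(strings, tables, q, k):
    """Collect bucket-mates of q, then rank them by exact Hamming distance."""
    candidates = set()
    for block, table in tables:
        candidates.update(table.get("".join(q[p] for p in block), ()))
    ranked = sorted(candidates, key=lambda i: (hamming(strings[i], q), i))
    return ranked[:k]


strings = ["ACGTACGTACGT", "ACGTACGTACGA", "TTTTTTTTTTTT", "ACGAACGTACGA"]
tables = build_tables(strings, num_blocks=4)
print(query(strings, tables, "ACGTACGTACGT", k=2))  # [0, 1]
```

With k ~ 0.1D the true neighbors may be farther away than num_blocks mismatches, in which case this only returns approximate results and the block count becomes a recall versus candidate-count trade-off.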

How to optimize finding similarities?

I have a set of 30 000 documents represented by vectors of floats. All vectors have 100 elements. I can find the similarity of two documents by computing the cosine measure between their vectors. The problem is that it takes too much time to find the most similar documents. Is there any algorithm that can help me speed this up?
EDIT
Right now, my code just computes the cosine similarity between the first vector and all the others. It takes about 3 sec. I would like to speed it up ;) The algorithm doesn't have to be exact, but it should give results similar to a full search.
The sum of the elements of each vector is equal to 1.
import time

start = time.time()
first = allVectors[0]
for vec in allVectors[1:]:
    # element 0 of each vector is skipped, as in the original code
    cosine_measure(vec[1:], first[1:])
print(str(time.time() - start))
Would locality sensitive hashing (LSH) help?
In the case of LSH, the hash function maps similar items near each other with a probability of your choice. It is claimed to be especially well suited for high-dimensional similarity search / nearest neighbor search / near-duplicate detection, and it looks to me like that's exactly what you are trying to achieve.
See also How to understand Locality Sensitive Hashing?
There is a paper, How to Approximate the Inner-product: Fast Dynamic Algorithms for Euclidean Similarity, describing how to perform a fast approximation of the inner product. If this is not good or fast enough, I suggest building an index containing all your documents. A structure similar to a quadtree but based on a geodesic grid would probably work really well, see Indexing the Sphere with the Hierarchical Triangular Mesh.
UPDATE: I completely forgot that you are dealing with 100 dimensions. Indexing high dimensional data is notoriously hard and I am not sure how well indexing a sphere will generalize to 100 dimensions.
If your vectors are normalized, the cosine is related to the Euclidean distance: ||a - b||² = (a - b)² = ||a||² + ||b||² - 2 ||a|| ||b|| cos(t) = 1 + 1 - 2 cos(t). So you can recast your problem in terms of Euclidean nearest neighbors.
A nice approach is that of kD trees, a spatial data structure that generalizes binary search (http://en.wikipedia.org/wiki/K-d_tree). However, kD trees are known to be inefficient in high dimensions (your case), so the so-called best-bin-first search is preferred (http://en.wikipedia.org/wiki/Best-bin-first_search).
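A quick numeric check of the identity relating cosine and Euclidean distance above, as a sketch with made-up example vectors. One assumption worth stating: the vectors in the question sum to 1, which is not the same as having unit Euclidean norm, so they need to be divided by their L2 norm before the identity applies.

```python
import math


def normalize(v):
    """Scale v to unit Euclidean (L2) norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


a = [0.2, 0.3, 0.5]  # elements sum to 1, like the vectors in the question
b = [0.6, 0.1, 0.3]
lhs = squared_euclidean(normalize(a), normalize(b))
rhs = 2.0 - 2.0 * cosine(a, b)
print(lhs, rhs)  # the two values agree up to rounding
```

Since the squared distance is a decreasing function of the cosine, the nearest Euclidean neighbors of the normalized vectors are exactly the most cosine-similar documents.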

Two algorithms to find nearest neighbor with Locality-sensitive hashing, which one?

Currently I'm studying how to find a nearest neighbor using Locality-sensitive hashing. However, while reading papers and searching the web I found two algorithms for doing this:
1- Use L hash tables with L random LSH functions, thus increasing the chance that two similar documents get the same signature. For example, if two documents are 80% similar, then there's an 80% chance that they will get the same signature from one LSH function. However, if we use multiple LSH functions, then there's a higher chance that the documents get the same signature from at least one of the LSH functions. This method is explained in Wikipedia and I hope my understanding is correct:
http://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search
2- The other algorithm uses a method from a paper (section 5) called Similarity Estimation Techniques from Rounding Algorithms by Moses S. Charikar. It's based on using one LSH function to generate the signature, then applying P permutations to it and sorting the list. Actually, I don't understand the method very well and I hope someone can clarify it.
My main question is: why would anyone use the second method rather than the first method? I find the first one easier and faster.
I really hope someone can help!!!
EDIT:
Actually, I'm not sure whether @Raff.Edward was mixing up the "first" and the "second", because only the second method uses a radius, and the first just uses a new hash family g composed from the hash family F. Please check the Wikipedia link. They just use many g functions to generate different signatures, and each g function has a corresponding hash table. To find the nearest neighbor of a point, you just pass the point through the g functions and check the corresponding hash tables for collisions. This is how I understood it: more functions mean more chances for collisions.
I didn't find any mention of a radius for the first method.
For the second method, they generate only one signature for each feature vector and then apply P permutations to them. Now we have P lists of permuted signatures, each containing n signatures. They then sort each of the P lists. After that, given a query point q, they generate its signature, apply the P permutations to it, and then use binary search on each of the P permuted and sorted lists to find the signature most similar to the query q. I concluded this after reading many papers about it, but I still don't understand why anyone would use such a method, because it doesn't seem fast at finding the Hamming distance!
For me, I would simply do the following to find the nearest neighbor for a query point q. Given a list N of signatures, I would generate the signature for the query point q, then scan the list N and compute the Hamming distance between each element of N and the signature of q. Thus I would end up with the nearest neighbor for q, and it takes O(N)!
Your understanding of the first one is a little off. The probability of a collision occurring is not proportional to the similarity, but depends on whether or not the distance is less than the pre-defined radius. The goal is that anything within the radius will have a high chance of colliding, and anything outside the radius * (1+eps) will have a low chance of colliding (and the area in between is a little murky).
The first algorithm is actually fairly difficult to implement well, but can get good results. In particular, the first algorithm is for the L1 and L2 (and technically a few more) metrics.
The second algorithm is very simple to implement, though a naive implementation may use up too much memory to be useful depending on your problem size. In this case, the probability of collision is proportional to the similarity of the inputs. However, it only works for the Cosine Similarity (or distance metrics based on a transform of the similarity.)
So which one you would use is based primarily on which distance metric you are using for Nearest Neighbor (or whatever other application).
The second one is actually much easier to understand and implement than the first one, the paper is just very wordy.
The short version: Take a random vector V and give each index an independent random unit normal value. Create as many vectors as you want the signature length to be. The signature is the sign of each entry of the result when you do a matrix-vector product. Now the Hamming distance between any two signatures is related to the cosine similarity between the respective data points.
Because you can encode the signature into an int array and use an XOR with a bit count instruction to get the hamming distance very quickly, you can get approximate cosine similarity scores very quickly.
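A rough sketch of the short version above, in plain Python. It is not the JSAT implementation; the function names, the seed handling and the choice to pack the signature into a single Python integer are assumptions for illustration. The cosine estimate uses the standard property of random hyperplanes that two vectors at angle t get different sign bits with probability t / pi.

```python
import math
import random


def make_hyperplanes(dim, num_bits, seed=0):
    """One random vector with unit normal entries per signature bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vec, hyperplanes):
    """Pack the signs of the matrix-vector product into an integer."""
    sig = 0
    for i, plane in enumerate(hyperplanes):
        if sum(p * x for p, x in zip(plane, vec)) >= 0:
            sig |= 1 << i
    return sig


def hamming(a: int, b: int) -> int:
    """XOR followed by a bit count."""
    return bin(a ^ b).count("1")


def estimated_cosine(sig_a, sig_b, num_bits):
    """Fraction of differing bits approximates angle / pi."""
    return math.cos(math.pi * hamming(sig_a, sig_b) / num_bits)


planes = make_hyperplanes(dim=3, num_bits=1024)
sig_a = signature([1.0, 0.0, 0.0], planes)
sig_b = signature([0.0, 1.0, 0.0], planes)
# Orthogonal vectors: the estimate should be near 0, up to sampling noise.
print(estimated_cosine(sig_a, sig_b, 1024))
```

Longer signatures reduce the noise of the estimate at the cost of more memory per vector.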
LSH algorithms don't have a lot of standardization, and the two papers (and others) use different definitions, so it's all a bit confusing at times. I only recently implemented both of these algorithms in JSAT, and am still working on fully understanding them both.
EDIT: Replying to your edit. The Wikipedia article is not great for LSH. If you read the original paper, the first method you are talking about only works for a fixed radius. The hash functions are then created based on that radius and concatenated to increase the probability of getting nearby points in a collision. They then construct a system for doing k-NN on top of this by determining the maximum value of k they want, and then finding the largest reasonable distance at which they would find the k-th nearest neighbor. In this way, a radius search will very likely return the set of k-NNs. To speed this up, they also create a few extra, smaller radii, since the density is often not uniform, and the smaller the radius you use, the faster the results.
The wikipedia section you linked is taken from the paper description for the "Stable Distribution" section, which presents the hash function for a search of radius r=1.
For the second paper, the "sorting" you describe is not part of the hashing, but part of one scheme for searching the Hamming space more quickly. As I mentioned, I recently implemented this, and you can see from a quick benchmark I did that even a brute-force search over the signatures is still much faster than the naive method of NN. Again, you would also pick this method if you need the cosine similarity rather than the L2 or L1 distance. You will find many other papers proposing different schemes for searching the Hamming space created by the signatures.
If you need help convincing yourself that it can be faster even if you were still doing brute force, just look at it this way: let's say that the average sparse document has 40 words in common with another document (a very conservative number in my experience). You have n documents to compare against. Brute-force cosine similarity would then involve about 40*n floating-point multiplications (and some extra work). If you have a 1024-bit signature, that's only 32 32-bit integers. That means we could do a brute-force LSH search in 32*n integer operations, which are considerably faster than floating-point operations.
There are also other factors at play here as well. For a sparse data set we have to keep both the doubles and integer indices to represent the non zero indexes, so the sparse dot product is doing a lot of additional integer operations to see which indices they have in common. LSH also allows us to save memory, because we don't need to store all of these integers and doubles for each vector, instead we can just keep its hash around - which is only a few bytes.
Reduced memory use can help us better exploit the CPU cache.
Your O(n) is the naive way I have used in my blog post. And it is fast. However, if you sort the bit signatures beforehand, you can do the binary search in O(log(n)). Even if you have L of these lists, L << n, and so it should be faster. The only issue is that it gets you approximate Hamming NNs, which are already approximating the cosine similarity, so the results can become a bit worse. It depends on what you need.
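The sorted-list search can be sketched like this. It loosely follows the scheme discussed above (permute the bits, sort, binary search, inspect nearby entries) and is not the exact procedure from the paper; the window size, the function names and the use of Python integers as signatures are assumptions for the example. Each permutation moves a different set of bits into the most significant positions, so signatures that agree on those bits end up next to each other in that sorted list.

```python
import bisect
import random


def permute_bits(sig: int, perm) -> int:
    """Bit perm[i] of sig becomes bit i of the result."""
    out = 0
    for i, src in enumerate(perm):
        out |= ((sig >> src) & 1) << i
    return out


def build_index(signatures, num_bits, num_perms, seed=0):
    """For each random permutation, a sorted list of (permuted signature, id)."""
    rng = random.Random(seed)
    index = []
    for _ in range(num_perms):
        perm = list(range(num_bits))
        rng.shuffle(perm)
        entries = sorted((permute_bits(s, perm), i) for i, s in enumerate(signatures))
        index.append((perm, entries))
    return index


def search(signatures, index, q, window=2):
    """Binary search each list, then rank the neighbors found by Hamming distance."""
    candidates = set()
    for perm, entries in index:
        pos = bisect.bisect_left(entries, (permute_bits(q, perm), -1))
        for _, i in entries[max(0, pos - window):pos + window]:
            candidates.add(i)
    return min(candidates, key=lambda i: (bin(signatures[i] ^ q).count("1"), i))


signatures = [0b00000000, 0b11110000, 0b11111111, 0b00001111]
index = build_index(signatures, num_bits=8, num_perms=3)
print(search(signatures, index, 0b11110000))  # 1
```

Each query costs one binary search per list plus a handful of Hamming distance computations, instead of a scan over all n signatures, but a true nearest neighbor can be missed if it never lands inside any window.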

Indexing for similarity search

I have about 100M numeric vectors (Minhash fingerprints), each vector contains 100 integer numbers between 0 and 65536, and I'm trying to do a fast similarity search against this database of fingerprints using Jaccard similarity, i.e. given a query vector (e.g. [1,0,30, 9, 42, ...]) find the ratio of intersection/union of this query set against the database of 100M sets.
The requirement is to return k "nearest neighbors" of the query vector in <1 sec (not including indexing/File IO time) on a laptop. So obviously some kind of indexing is required, and the question is what would be the most efficient way to approach this.
notes:
I thought of using SimHash, but in this case I actually need to know the size of the intersection of the sets to identify containment rather than pure similarity/resemblance, and SimHash would lose that information.
I've tried using a simple locality sensitive hashing technique as described in ch3 of Jeffrey Ullman's book, by dividing each vector into 20 "bands" or snippets of length 5, converting these snippets into strings (e.g. [1, 2, 45, 2, 3] -> "124523") and using these strings as keys in a hash table, where each key contains "candidate neighbors". But the problem is that it creates too many candidates for some of these snippets, and changing the number of bands doesn't help.
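For reference, the banding scheme described in the question might look like the sketch below; the function names are made up for the example. It differs from the description in two ways that may reduce spurious candidates, though whether they explain the problem above is only a guess: the key is a tuple rather than a concatenated string (with plain concatenation, [1, 2, 45, 2, 3] and [12, 4, 5, 2, 3] both become "124523"), and the band number is part of the key so that a snippet only matches the same band of another fingerprint.

```python
from collections import defaultdict


def band_keys(fingerprint, num_bands):
    """Yield one (band number, tuple of values) key per band."""
    rows = len(fingerprint) // num_bands
    for band in range(num_bands):
        yield band, tuple(fingerprint[band * rows:(band + 1) * rows])


def build_band_index(fingerprints, num_bands):
    """Map every band key to the ids of the fingerprints that contain it."""
    buckets = defaultdict(set)
    for idx, fp in enumerate(fingerprints):
        for key in band_keys(fp, num_bands):
            buckets[key].add(idx)
    return buckets


def candidates(buckets, query, num_bands):
    """Ids that share at least one whole band with the query."""
    found = set()
    for key in band_keys(query, num_bands):
        found |= buckets.get(key, set())
    return found


fps = [
    [1, 2, 45, 2, 3, 7, 8, 9, 10, 11],
    [12, 4, 5, 2, 3, 0, 0, 0, 0, 0],
    [9, 9, 9, 9, 9, 7, 8, 9, 10, 11],
]
buckets = build_band_index(fps, num_bands=2)
print(candidates(buckets, [1, 2, 45, 2, 3, 5, 5, 5, 5, 5], num_bands=2))  # {0}
```

The candidates would then be re-ranked by comparing the full fingerprints.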
I might be a bit late, but I would suggest IVFADC indexing by Jegou et al.: Product Quantization for Nearest Neighbor Search
It works for L2 Distance/dot product similarity measures and is a bit complex, but it's particularly efficient in terms of both time and memory.
It is also implemented in the FAISS library for similarity search, so you could also take a look at that.
One way to go about this is the following:
(1) Arrange the vectors into a tree (a radix tree).
(2) Query the tree with a fuzzy criterion; in other words, a match occurs if the difference in values at each node of the tree is within a threshold
(3) From (2) generate a subtree that contains all the matching vectors
(4) Now, repeat process (2) on the sub tree with a smaller threshold
Continue until the subtree has K items. If the subtree has too few items, then take the previous tree, calculate the Jaccard distance for each member of that subtree, and sort to eliminate the worst matches until you have only K items left.
Answering my own question after 6 years: there is a benchmark for approximate nearest neighbor search with many algorithms that solve this problem: https://github.com/erikbern/ann-benchmarks. The current winner is "Hierarchical Navigable Small World graphs": https://github.com/nmslib/hnswlib
You can use off-the-shelf similarity search services such as AWS-ES or Pinecone.io.

Resources