Indexing for similarity search - algorithm

I have about 100M numeric vectors (MinHash fingerprints), each containing 100 integers between 0 and 65536, and I'm trying to do a fast similarity search against this database of fingerprints using Jaccard similarity, i.e. given a query vector (e.g. [1, 0, 30, 9, 42, ...]), find the ratio of intersection to union of the query set against each of the 100M sets in the database.
The requirement is to return the k "nearest neighbors" of the query vector in under 1 second (not including indexing/file I/O time) on a laptop. So some kind of indexing is clearly required, and the question is what the most efficient approach would be.
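For reference, with MinHash signatures the Jaccard similarity is usually estimated as the fraction of signature positions that agree, so the brute-force baseline any index has to beat looks roughly like this (a minimal NumPy sketch with illustrative names; a full scan of 100M x 100 integers will not fit the 1-second budget):

```python
import numpy as np

def top_k_jaccard(query_sig, db_sigs, k):
    # MinHash estimator: Jaccard(A, B) ~ fraction of positions where the signatures agree
    sims = (db_sigs == query_sig).mean(axis=1)   # db_sigs: (N, 100) array, query_sig: (100,)
    top = np.argpartition(-sims, k)[:k]          # unsorted top-k candidates
    return top[np.argsort(-sims[top])]           # sorted by estimated similarity, best first
```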
Notes:
I thought of using SimHash, but in this case I actually need to know the size of the intersection of the sets to identify containment rather than pure similarity/resemblance, and SimHash would lose that information.
I've tried the simple locality-sensitive hashing technique described in chapter 3 of Jeffrey Ullman's book: divide each vector into 20 "bands" (snippets of length 5), convert each snippet into a string (e.g. [1, 2, 45, 2, 3] -> "124523"), and use these strings as keys in a hash table, where each key holds a list of "candidate neighbors". The problem is that this creates too many candidates for some of the snippets, and changing the number of bands doesn't help.
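For reference, a minimal sketch of the banding scheme I tried (names are illustrative); using tuples as bucket keys rather than concatenated digit strings also avoids spurious collisions such as [1, 24, 5, 2, 3] and [12, 4, 5, 2, 3] mapping to the same key:

```python
from collections import defaultdict

def build_band_tables(signatures, bands=20, rows=5):
    # one hash table per band; each bucket holds the ids of signatures sharing that band
    tables = [defaultdict(list) for _ in range(bands)]
    for doc_id, sig in enumerate(signatures):
        for b in range(bands):
            key = tuple(sig[b * rows:(b + 1) * rows])   # tuple key, not a concatenated string
            tables[b][key].append(doc_id)
    return tables

def candidate_neighbors(query_sig, tables, bands=20, rows=5):
    # union of all buckets the query falls into; these still need exact Jaccard verification
    cands = set()
    for b in range(bands):
        cands.update(tables[b].get(tuple(query_sig[b * rows:(b + 1) * rows]), ()))
    return cands
```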

I might be a bit late, but I would suggest IVFADC indexing by Jégou et al.: "Product Quantization for Nearest Neighbor Search".
It works for L2 distance and dot-product similarity measures, and while it is a bit complex, it's particularly efficient in terms of both time and memory.
It is implemented in the FAISS similarity-search library, so you could also take a look at that.
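For illustration, a minimal FAISS IVF-PQ (IVFADC) sketch; note that FAISS indexes dense float32 vectors under L2 or inner-product, so applying it to MinHash fingerprints means treating the signatures as ordinary vectors rather than searching Jaccard similarity directly (parameter values below are illustrative):

```python
import numpy as np
import faiss  # pip install faiss-cpu

d, nlist, m, nbits = 100, 1024, 20, 8        # dimensionality, coarse cells, PQ sub-quantizers, bits per code
quantizer = faiss.IndexFlatL2(d)             # coarse quantizer for the inverted file
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

xb = np.random.rand(100_000, d).astype("float32")   # stand-in for the fingerprint database
index.train(xb)                              # learn coarse centroids and PQ codebooks
index.add(xb)

index.nprobe = 16                            # coarse cells visited per query (speed/recall trade-off)
distances, ids = index.search(xb[:5], 10)    # approximate 10-NN for 5 query vectors
```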

One way to go about this is the following:
(1) Arrange the vectors into a tree (a radix tree).
(2) Query the tree with a fuzzy criterion; in other words, a match occurs when the difference in values at each node of the tree is within a threshold.
(3) From (2), generate a subtree that contains all the matching vectors.
(4) Now repeat process (2) on the subtree with a smaller threshold.
Continue until the subtree has K items. If the subtree ends up with fewer than K items, take the previous subtree, calculate the Jaccard distance for each of its members, and sort to eliminate the worst matches until only K items are left.

Answering my own question after 6 years: there is now a benchmark for approximate nearest neighbor search that compares many algorithms for this problem: https://github.com/erikbern/ann-benchmarks. The current winner is "Hierarchical Navigable Small World graphs": https://github.com/nmslib/hnswlib
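A minimal hnswlib usage sketch (sizes are illustrative; hnswlib ships 'l2', 'ip' and 'cosine' spaces, so Jaccard over MinHash signatures would have to be mapped onto one of those or handled by a library that offers a Jaccard space):

```python
import numpy as np
import hnswlib

dim, num_elements = 100, 100_000
data = np.random.rand(num_elements, dim).astype("float32")   # stand-in for the database

index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
index.add_items(data, np.arange(num_elements))

index.set_ef(50)                                 # query-time accuracy/speed trade-off (ef >= k)
labels, distances = index.knn_query(data[:5], k=10)
```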

You can use off-the-shelf similarity search services such as AWS-ES or Pinecone.io.

Related

Index structure for top-k queries on bitstrings

Given an array of bitstrings (all of the same length) and a query string Q, find the top-k most similar strings to Q, where the similarity between strings A and B is defined as the number of 1s in A AND B (the AND operation applied bitwise).
I think there should be a classical result for this problem.
k is small, in the hundreds, while the number of vectors is in the hundreds of millions and the length of each vector is 512 or 1024.
One way to tackle this problem is to construct a K-Nearest Neighbor Graph (K-NNG) (digraph) with a Russell-Rao similarity function.
Note that efficient K-NNG construction is still an open problem, and "none of the known solutions for this problem is general, efficient and scalable" [quoting from Efficient K-Nearest Neighbor Graph Construction for Generic Similarity Measures - Dong, Charikar, Li 2011].
Your similarity function is often called Russell-Rao similarity (see, for example, A Survey of Binary Similarity and Distance Measures - Choi, Cha, Tappert 2010). Note that Russell-Rao similarity is not a metric (see Properties of Binary Vector Dissimilarity Measures - Zhang, Srihari 2003): the "if" part of "d(x, y) = 0 iff x == y" is false.
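Concretely, for bitstrings packed into integers, the similarity from the question and Russell-Rao differ only by the normalization (a small illustrative snippet):

```python
def and_count(a: int, b: int) -> int:
    # the similarity from the question: number of positions where both bits are 1
    return bin(a & b).count("1")

def russell_rao(a: int, b: int, n_bits: int) -> float:
    # Russell-Rao similarity: the same count divided by the vector length
    return and_count(a, b) / n_bits
```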
In A Fast Algorithm for Finding k-Nearest Neighbors with Non-metric Dissimilarity - Zhang, Srihari 2002, the authors propose a fast hierarchical search algorithm for finding k-NNs under a non-metric measure in a binary vector space. They use a parametric binary vector distance function D(β); when β=0, this function reduces to the Russell-Rao distance function. I wouldn't call it a "classical result", but this is the only paper I could find that examines this problem.
You may want to check these two surveys: On nonmetric similarity search problems in complex domains - Skopal, Bustos 2011 and A Survey on Nearest Neighbor Search Methods - Reza, Ghahremani, Naderi 2014. Maybe you'll find something I missed.
This problem can be solved by writing a simple Map and Reduce job. I'm not claiming that this is the best solution, nor that it is the only one.
Also, you disclosed in the comments that k is in the hundreds, that there are millions of bitstrings, and that each of them is 512 or 1024 bits long.
Mapper pseudo-code:
Given Q;
For every bitstring b, compute similarity = popcount(b & Q)
Emit (similarity, b)
Now, the combiner can consolidate the lists of bitstrings from every mapper that share the same similarity value.
Reducer pseudo-code:
Consume (similarity, listOfBitStringsWithThisSimilarity);
Output them in decreasing order of similarity value.
From the output of the reducer you can extract the top-k bitstrings.
So, the MapReduce paradigm is probably the classical solution that you are looking for.
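For comparison, the same AND-popcount scoring can also be run on a single machine; here is a minimal sketch (names illustrative) that keeps a running top-k in a bounded min-heap instead of sorting everything in a reduce step:

```python
import heapq

def top_k_by_and_count(query: int, database, k: int):
    # database: iterable of Python ints holding the 512- or 1024-bit strings
    heap = []                                   # min-heap of (similarity, index)
    for i, b in enumerate(database):
        sim = bin(b & query).count("1")         # popcount of the bitwise AND
        if len(heap) < k:
            heapq.heappush(heap, (sim, i))
        elif sim > heap[0][0]:
            heapq.heapreplace(heap, (sim, i))   # drop the current worst of the top-k
    return sorted(heap, reverse=True)           # (similarity, index), best first
```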

How to find the nearest neighbor of a sparse vector

I have about 500 vectors, each of dimension 1500,
and almost every vector is very sparse: only about 30-70 of its dimensions are non-zero.
Now, the problem is that given another 1500-dimensional vector, I need to compare it against the 500 vectors to find which of the 500 is the nearest one (in Euclidean distance).
There is no doubt that the brute-force method is a solution, but it means calculating the distance 500 times, which takes a long time.
Yesterday I read the paper "Object retrieval with large vocabularies and fast spatial matching", which says that using an inverted index will help.
But after my test, it made almost no difference: imagine a 1500-dimensional vector in which 50 of the dimensions are non-zero; compared with another such vector, the two will often share some non-zero dimensions. In other words, this approach can only rule out a few vectors, and I still need to compare against many of the remaining ones.
Thank you for reading this far. My questions are:
1. Will this algorithm make sense?
2. Is there any other way to do what I want, such as FLANN or a k-d tree?
But I want the exact nearest neighbor; an approximate one is not enough.
This kind of index is called inverted lists, and is commonly used for text.
For example, Apache Lucene uses this kind of indexing for text similarity search.
Essentially, you use a columnar layout, and you only store the non-zero values. For on-disk efficiency, various compression techniques can be employed.
You can then compute many similarities using set operations on these lists.
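A minimal sketch of such an inverted (columnar) index for sparse vectors (dict-based, names illustrative); since ||q - v||^2 = ||q||^2 + ||v||^2 - 2 q·v, the exact Euclidean nearest neighbor can be recovered from these dot products plus precomputed norms:

```python
from collections import defaultdict

def build_inverted_lists(vectors):
    # vectors: list of {dimension: value} dicts storing only the non-zero entries
    lists = defaultdict(list)
    for vec_id, vec in enumerate(vectors):
        for dim, value in vec.items():
            lists[dim].append((vec_id, value))
    return lists

def dot_products(query, lists):
    # accumulate q.v only over dimensions where the query itself is non-zero
    scores = defaultdict(float)
    for dim, q_val in query.items():
        for vec_id, v_val in lists.get(dim, ()):
            scores[vec_id] += q_val * v_val
    return scores                                # vec_id -> dot product with the query
```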
k-d-trees cannot be used here. They will be extremely inefficient if you have many duplicate (zero) values.
I don't know your context, but if you don't mind a long preprocessing step and you have to make this check often and fast, you can build a neighborhood graph and sort each point's neighbors by distance.
To build this graph efficiently, you can use the taxicab distance or the squared distance to sort the points by distance (this avoids heavier calculations).
Then if you want the nearest neighbor you just have to pick the first neighbor :p.

Is it possible to query Elastic Search with a feature vector?

I'd like to store an n-dimensional feature vector, e.g. <1.00, 0.34, 0.22, ..., 0>, with each document, and then provide another feature vector as a query, with the results sorted in order of cosine similarity. Is this possible with Elastic Search?
I don't have an answer particular to Elastic Search because I've never used it (I use Lucene on which Elastic search is built). However, I'm trying to give a generic answer to your question.
There are two standard ways to obtain the nearest vectors given a query vector, described as follows.
K-d tree
The first approach is to store the vectors in memory with the help of a data structure that supports nearest neighbour queries, e.g. k-d trees. A k-d tree is a generalization of the binary search tree in the sense that every level of the tree partitions the points along one of the k dimensions. If you have enough space to load all the points in memory, it is possible to apply the nearest neighbour search algorithm on k-d trees to obtain a list of retrieved vectors sorted by the cosine similarity values. The obvious disadvantage of this method is that it does not scale to huge sets of points, as often encountered in information retrieval.
Inverted Quantized Vectors
The second approach is to use inverted quantized vectors. A simple range-based quantization assigns pseudo-terms or labels to the real numbers of a vector so that these can then later be indexed by Lucene (or for that matter Elastic search).
For example, we may assign the label A to the range [0, 0.1), B to the range [0.1, 0.2), and so on. The sample vector in your question is then encoded as (J, D, C, ..., A), because [0.9, 1] is J, [0.3, 0.4) is D, and so on.
Consequently, a vector of real numbers is transformed into a string (which can be treated as a document) and hence indexed with a standard information retrieval (IR) tool. A query vector is likewise transformed into a bag of pseudo-terms, and one can then retrieve the set of vectors in the collection that are most similar (in terms of cosine similarity or another measure) to it.
The main advantage of this method is that it scales well to massive collections of real-valued vectors. The key disadvantage is that the computed similarity values are mere approximations of the true cosine similarities (due to the loss incurred in quantization). A smaller quantization range achieves a better approximation at the cost of increased index size.
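A small sketch of that quantization step (illustrative; prefixing each pseudo-term with its dimension index is one way to keep positional information once the vector is treated as a bag of terms):

```python
def quantize(vector, bins=10):
    # map each value in [0, 1] to a letter: [0, 0.1) -> 'A', [0.1, 0.2) -> 'B', ..., [0.9, 1] -> 'J'
    terms = []
    for i, x in enumerate(vector):
        bucket = min(int(x * bins), bins - 1)            # clamp 1.0 into the top bucket
        terms.append(f"d{i}_{chr(ord('A') + bucket)}")   # dimension prefix keeps terms from colliding
    return " ".join(terms)

# quantize([1.00, 0.34, 0.22, 0.0]) -> "d0_J d1_D d2_C d3_A"
```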
Version 7.4 of Elasticsearch actually has built-in vector comparison functions, including cosine similarity. See: https://www.elastic.co/guide/en/elasticsearch/reference/7.4/query-dsl-script-score-query.html#vector-functions.
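For example, a script_score query of this shape (shown here as a Python dict for the elasticsearch client; the index and field names are assumptions, and the exact way the dense_vector field is referenced changed across later 7.x releases, so check the linked docs for your version):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()
body = {
    "query": {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                # 7.4-style syntax: the dense_vector field is accessed through doc[...]
                "source": "cosineSimilarity(params.query_vector, doc['feature_vector']) + 1.0",
                "params": {"query_vector": [1.00, 0.34, 0.22]},
            },
        }
    }
}
response = es.search(index="my_index", body=body)   # hits come back sorted by the script score
```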

Two algorithms to find nearest neighbor with Locality-sensitive hashing, which one?

Currently I'm studying how to find a nearest neighbor using locality-sensitive hashing. However, while reading papers and searching the web, I found two algorithms for doing this:
1- Use L hash tables with L random LSH functions, thereby increasing the chance that two similar documents get the same signature. For example, if two documents are 80% similar, then there's an 80% chance that they will get the same signature from one LSH function; if we use multiple LSH functions, there's a higher chance that they get the same signature from at least one of them. This method is explained on Wikipedia, and I hope my understanding is correct:
http://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search
2- The other algorithm uses a method from a paper (section 5) called "Similarity Estimation Techniques from Rounding Algorithms" by Moses S. Charikar. It's based on using one LSH function to generate the signature, then applying P permutations to it, and then sorting the lists. Actually, I don't understand the method very well, and I'd appreciate it if someone could clarify it.
My main question is: why would anyone use the second method rather than the first? It seems easier and faster to me.
I really hope someone can help!!!
EDIT:
Actually, I'm not sure whether @Raff.Edward was mixing up the "first" and the "second" method. Only the second method uses a radius; the first just uses a new hash family g composed from the hash family F. Please check the Wikipedia link. They just use many g functions to generate different signatures, and each g function has a corresponding hash table. To find the nearest neighbor of a point, you pass the point through the g functions and check the corresponding hash tables for collisions. That is how I understood it: more functions, more chance of collisions.
I didn't find any mention of a radius for the first method.
For the second method, they generate only one signature for each feature vector and then apply P permutations to it. Now we have P lists of permutations, each containing n signatures. They then sort each of the P lists. After that, given a query point q, they generate its signature, apply the P permutations to it, and then use binary search on each permuted and sorted list to find the signature most similar to the query q. I concluded this after reading many papers about it, but I still don't understand why anyone would use such a method, because it doesn't seem fast for finding the Hamming distance!!!!
For me, I would simply do the following to find the nearest neighbor for a query point q. Given a list of signatures N, I would generate the signature for the query point q, then scan the list N and compute the Hamming distance between each element of N and the signature of q. Thus I would end up with the nearest neighbor for q. And it takes O(N)!!!
Your understanding of the first one is a little off. The probability of a collision is not proportional to the similarity; it depends on whether or not the distance is less than the pre-defined radius. The goal is that anything within the radius will have a high chance of colliding, and anything outside radius * (1 + eps) will have a low chance of colliding (and the area in between is a little murky).
The first algorithm is actually fairly difficult to implement well, but can get good results. In particular, the first algorithm is for the L1 and L2 (and technically a few more) metrics.
The second algorithm is very simple to implement, though a naive implementation may use up too much memory to be useful, depending on your problem size. In this case, the probability of collision is proportional to the similarity of the inputs. However, it only works for cosine similarity (or distance metrics based on a transform of the similarity).
So which one you would use is based primarily on which distance metric you are using for Nearest Neighbor (or whatever other application).
The second one is actually much easier to understand and implement than the first one, the paper is just very wordy.
The short version: take a random vector V and give each index an independent random unit-normal value. Create as many such vectors as you want the signature length to be. The signature is the sign of each entry when you do a matrix-vector product. Now the Hamming distance between any two signatures is related to the cosine similarity between the respective data points.
Because you can pack the signature into an int array and use XOR with a bit-count instruction to get the Hamming distance, you can compute approximate cosine similarity scores very quickly.
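A minimal sketch of that signing scheme in NumPy (illustrative; real implementations pack the sign bits into machine words so the Hamming distance becomes XOR plus popcount):

```python
import numpy as np

def random_hyperplane_signatures(data, n_bits, seed=0):
    # one random Gaussian hyperplane per signature bit; the signature is the sign pattern
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((data.shape[1], n_bits))
    return data @ planes >= 0                    # boolean (n_points, n_bits) matrix

def hamming(sig_a, sig_b):
    # expected Hamming distance grows with the angle between the original vectors
    return int(np.count_nonzero(sig_a != sig_b))
```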
LSH algorithms don't have a lot of standardization, and the two papers (and others) use different definitions, so it's all a bit confusing at times. I only recently implemented both of these algorithms in JSAT, and I'm still working on fully understanding them both.
EDIT: Replying to your edit. The Wikipedia article is not great for LSH. If you read the original paper, the first method you are talking about only works for a fixed radius. The hash functions are then created based on that radius and concatenated to increase the probability of nearby points colliding. They then construct a system for doing k-NN on top of this by determining the maximum value of k they want, and then finding the largest reasonable distance within which they would find the k'th nearest neighbor. In this way, a radius search will very likely return the set of k-NNs. To speed this up, they also create a few extra smaller radii, since the density is often not uniform, and the smaller the radius you use, the faster the results.
The Wikipedia section you linked is taken from the paper's description of the "Stable Distribution" scheme, which presents the hash function for a search of radius r = 1.
For the second paper, the "sorting" you describe is not part of the hashing, but part of one scheme for searching the Hamming space more quickly. As I mentioned, I recently implemented this, and you can see from a quick benchmark I did that even a brute-force search over the signatures is still much faster than the naive NN method. Again, you would pick this method if you need cosine similarity rather than the L2 or L1 distance. You will find many other papers proposing different schemes for searching the Hamming space created by the signatures.
If you need help convincing yourself that it can be faster even if you are still doing brute force, just look at it this way: let's say that the average sparse document has 40 words in common with another document (a very conservative number in my experience). You have n documents to compare against. Brute-force cosine similarity would then involve about 40*n floating-point multiplications (and some extra work). If you have a 1024-bit signature, that's only 32 integers. That means we could do a brute-force LSH search in 32*n integer operations, which are considerably faster than floating-point operations.
There are other factors at play here as well. For a sparse data set we have to keep both the doubles and the integer indices of the non-zero entries, so the sparse dot product does a lot of additional integer operations to determine which indices the two vectors have in common. LSH also allows us to save memory, because instead of storing all of these integers and doubles for each vector, we can just keep its hash around, which is only a few bytes.
Reduced memory use can help us better exploit the CPU cache.
Your O(n) approach is the naive way I used in my blog post, and it is fast. However, if you sort the signatures beforehand, you can do a binary search in O(log(n)). Even if you have L of these lists, L << n, so it should still be faster. The only issue is that it gets you approximate Hamming NNs, which are already approximating the cosine similarity, so the results can become a bit worse. It depends on what you need.

How to find the closest pairs (Hamming Distance) of a string of binary bins in Ruby without O^2 issues?

I've got a MongoDB with about 1 million documents in it. These documents all have a string that represents a 256 bit bin of 1s and 0s, like:
0110101010101010110101010101
Ideally, I'd like to query for near binary matches between documents. Yes, this is Hamming distance.
This is NOT currently supported in Mongo. So, I'm forced to do it in the application layer.
So, given this, I am trying to find a way to avoid doing individual Hamming distance comparisons between all the documents; that would make the time to do this basically prohibitive.
I have a LOT of RAM. And, in Ruby, there seems to be a great gem (algorithms) that can create a number of trees, none of which I can seem to make work (yet) in a way that would reduce the number of queries I'd need to make.
Ideally, I'd like to make 1 million queries, find the near duplicate strings, and be able to update them to reflect that.
Anyone's thoughts would be appreciated.
I ended up retrieving all the documents into memory (a subset with just the id and the string).
Then I used a BK-tree to compare the strings.
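For reference, a minimal BK-tree over the Hamming distance (sketched in Python rather than Ruby; the triangle inequality is what lets the query prune whole branches):

```python
class BKTree:
    def __init__(self, distance):
        self.distance = distance
        self.root = None                          # node = (item, {distance_to_child: child_node})

    def add(self, item):
        if self.root is None:
            self.root = (item, {})
            return
        node = self.root
        while True:
            d = self.distance(item, node[0])
            if d in node[1]:
                node = node[1][d]                 # descend along the edge labelled d
            else:
                node[1][d] = (item, {})
                return

    def query(self, item, max_dist):
        results, stack = [], [self.root] if self.root else []
        while stack:
            node = stack.pop()
            d = self.distance(item, node[0])
            if d <= max_dist:
                results.append((d, node[0]))
            for child_d, child in node[1].items():
                if d - max_dist <= child_d <= d + max_dist:   # triangle-inequality pruning
                    stack.append(child)
        return results

def hamming(a, b):
    # a, b: equal-length strings of '0'/'1'
    return sum(x != y for x, y in zip(a, b))
```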
The Hamming distance defines a metric space, so you could use the O(n log n) algorithm to find the closest pair of points, which is of the typical divide-and-conquer nature.
You can then apply this repeatedly until you have "enough" pairs.
Edit: I see now that Wikipedia doesn't actually give the algorithm, so here is one description.
Edit 2: The algorithm can be modified to give up if there are no pairs at distance less than n. For the case of the Hamming distance: simply count the level of recursion you are in. If you haven't found something at level n in any branch, then give up (in other words, never enter n + 1). If you are using a metric where splitting on one dimension doesn't always yield a distance of 1, you need to adjust the level of recursion where you give up.
As far as I could understand, you have an input string X and you want to query the database for a document containing string field b such that Hamming distance between X and document.b is less than some small number d.
You can do this in linear time, just by scanning all of your N=1M documents and calculating the distance (which takes small fixed time per document). Since you only want documents with distance smaller than d, you can give up comparison after d unmatched characters; you only need to compare all 256 characters if most of them match.
You can try to scan fewer than N documents, that is, to get better than linear time.
Let ones(s) be the number of 1s in string s. For each document, store ones(document.b) as a new indexed field ones_count. Then you can query only the documents where the number of ones is close enough to ones(X), specifically ones(X) - d <= document.ones_count <= ones(X) + d. The Mongo index should kick in here.
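With pymongo, that prefilter might look like this (collection and field names are assumptions); the candidates it returns still need an exact Hamming check in the application:

```python
def candidate_docs(collection, x_bits: str, d: int):
    # x_bits: the query as a '0'/'1' string; d: maximum allowed Hamming distance
    ones_x = x_bits.count("1")
    return collection.find({
        "ones_count": {"$gte": ones_x - d, "$lte": ones_x + d}   # indexed range scan
    })
```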
If you want to find all close-enough pairs in the set, see @Philippe's answer.
This sounds like an algorithmic problem of some sort. You could try comparing those with a similar number of 1 or 0 bits first, then work down through the list from there. Those that are identical will, of course, come out on top. I don't think having tons of RAM will help here.
You could also try working with smaller chunks. Instead of dealing with 256-bit sequences, could you treat them as 32 8-bit sequences, or 16 16-bit sequences? At that point you can look up differences in a precomputed table and use that as a sort of index.
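A small sketch of that lookup-table idea (assuming the 256-bit value is held as a Python int): precompute the popcount of every 16-bit chunk once, so a full Hamming distance is 16 table lookups instead of 256 bit comparisons.

```python
# popcount of every possible 16-bit value, built once
POPCOUNT16 = [bin(i).count("1") for i in range(1 << 16)]

def hamming_256(a: int, b: int) -> int:
    x = a ^ b                                    # differing bits
    total = 0
    for _ in range(16):                          # 16 chunks of 16 bits = 256 bits
        total += POPCOUNT16[x & 0xFFFF]
        x >>= 16
    return total
```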
Depending on how "different" you care to match on, you could just permute changes on the source binary value and do a keyed search to find the others that match.
