Searching strings where Hamming Distance is less than a threshold - algorithm

Currently I am working on an application where I have a large number of hash values (strings).
When a query hash value (string) is given, the search process goes through those strings and returns the strings whose Hamming distance to the query string is less than a given threshold.
Hash values are not binary strings, e.g. "1000302014771944008"
All hash values (strings) have the same fixed length.
The threshold value is not small (normally t > 25) and can vary.
I want to implement this search process using an efficient algorithm rather than using brute-force approach.
I have read some research papers (like this & this), but they are for binary strings or for low threshold values. I also tried Locality-sensitive hashing, but implementations I found were focused on binary strings.
Are there any algorithms or data structures to address this problem?
Any suggestions are also welcome. Thank you in advance.
Additional Information
Hamming Distance between non-binary strings
string 1: 0014479902266110001131133
string 2: 0014409902226110001111133
-------------------------
3 mismatching positions  -->  Hamming distance = 3
Considered brute-force approach
1. Calculate the Hamming distance between the first hash string and the query hash string.
2. If the Hamming distance is less than the threshold, add the hash string to the results list.
3. Repeat steps 1 and 2 for all hash strings (a sketch of this loop is given below).
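A minimal Python sketch of this brute-force filter, including the per-character Hamming distance from the example above (the early exit once the threshold is exceeded is a small optimization of mine, not something the approach requires):

    def hamming_distance(a, b, limit=None):
        # Count positions where the two equal-length strings differ;
        # optionally stop early once the count exceeds `limit`.
        dist = 0
        for x, y in zip(a, b):
            if x != y:
                dist += 1
                if limit is not None and dist > limit:
                    return dist
        return dist

    def brute_force_search(query, hashes, threshold):
        results = []
        for h in hashes:
            if hamming_distance(query, h, limit=threshold) < threshold:
                results.append(h)
        return results

    hashes = ["0014479902266110001131133", "0014409902226110001111133"]
    print(brute_force_search("0014409902226110001111133", hashes, threshold=5))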

Read the 7th section of the paper:
"HmSearch: An Efficient Hamming Distance Query Processing Algorithm".
The state-of-the-art result for the d-query problem can be found in
"Dictionary matching and indexing with errors and don't cares", which solves the d-query problem in O(m + log^d(nm) + occ) time using O(n * log^d(nm)) space, where occ is the number of query results.
If the threshold value is not small, there are practical solutions for binary strings, described in HmSearch.
I think it is possible to apply the same practical solutions from HmSearch to arbitrary strings, but I have never seen such solutions.

Something like this could work for you.
http://blog.mafr.de/2011/01/06/near-duplicate-detection/

Related

Locality-sensitive hashing of strings?

Is there a hash function for strings, such that strings within a small edit distance (for example, misspellings) would map to the same, or very close, hash values, while dissimilar strings would tend not to?
One option is to calculate the set of all k-mers (substrings of length k), hash them, and take the minimum.
So you are combining the idea of shingles with the idea of minhashing.
(Repeat multiple times to get better results, as usual with LSH schemes.)
The reason this works is that the probability of two strings having the same minhash equals the Jaccard similarity of their k-mer sets.
The similarity of k-mer sets is related to edit distance (but not the same thing).
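A minimal Python sketch of this k-mer minhash idea (the seeded md5 hashing and the parameter choices are my own illustrative assumptions, not part of the answer above):

    import hashlib

    def kmers(s, k=4):
        # All length-k substrings (shingles) of s.
        return {s[i:i + k] for i in range(len(s) - k + 1)}

    def minhash_signature(s, k=4, num_hashes=16):
        # One minimum per seeded hash function; repeating the hash gives
        # the usual LSH amplification.
        sig = []
        for seed in range(num_hashes):
            sig.append(min(
                int(hashlib.md5(f"{seed}:{m}".encode()).hexdigest(), 16)
                for m in kmers(s, k)
            ))
        return sig

    # Two strings agree on any one minhash with probability equal to the
    # Jaccard similarity of their k-mer sets, so the fraction of agreeing
    # positions estimates that similarity.
    sig_a = minhash_signature("0014479902266110001131133")
    sig_b = minhash_signature("0014409902226110001111133")
    print(sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a))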

data structure for finding the substring from large number of strings

My problem statement is that I am given millions of strings, and I have to find the strings that contain a given substring.
e.g. given "xyzoverflowasxs", "werstackweq", etc., and a substring such as "stack", the search should return "werstackweq". What kind of data structure can we use to solve this problem?
I think we can use a suffix tree for this, but I wanted some more suggestions for this problem.
I think the way to go is with a dictionary holding the actual words, and another data structure pointing to entries within this dictionary. One way to go would be with suffix trees and their variants, as mentioned in the question and the comments. I think the following is a far simpler (heuristic) alternative.
Say you choose some integer k. For each of your strings, computing the Rabin fingerprints of all its length-k substrings should be efficient and easy (any language has an implementation).
So, for a given k, you could hold two data structures:
A dictionary of the words, say a hash table based on collision lists
A dictionary mapping each fingerprint to an array of the linked-list node pointers in the first data structure.
Given a word of length k or greater, you would choose a length-k subword, calculate its Rabin fingerprint, find the words which contain this fingerprint, and check whether they indeed contain the query word.
The question is which k to use, and whether to use multiple such k. I would try this experimentally (starting with a few small values of k simultaneously, say 1, 2, and 3, and also a couple of larger ones). The performance of this heuristic depends on the distribution of your dictionary and queries anyway.
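A simplified Python sketch of this scheme. For brevity it indexes the raw length-k substrings in a hash table instead of computing true rolling Rabin fingerprints, and it skips the linked-list indirection described above:

    def kgrams(s, k):
        return {s[i:i + k] for i in range(len(s) - k + 1)}

    def build_index(words, k):
        # Map each length-k substring to the set of word ids containing it.
        index = {}
        for i, w in enumerate(words):
            for g in kgrams(w, k):
                index.setdefault(g, set()).add(i)
        return index

    def find_containing(query, words, index, k):
        # Probe with one length-k piece of the query, then verify that the
        # full query really occurs in each candidate word.
        probe = query[:k]
        return [words[i] for i in index.get(probe, ()) if query in words[i]]

    words = ["xyzoverflowasxs", "werstackweq"]
    k = 3
    index = build_index(words, k)
    print(find_containing("stack", words, index, k))   # ['werstackweq']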

Indexing for similarity search

I have about 100M numeric vectors (Minhash fingerprints), each vector contains 100 integer numbers between 0 and 65536, and I'm trying to do a fast similarity search against this database of fingerprints using Jaccard similarity, i.e. given a query vector (e.g. [1,0,30, 9, 42, ...]) find the ratio of intersection/union of this query set against the database of 100M sets.
The requirement is to return k "nearest neighbors" of the query vector in <1 sec (not including indexing/File IO time) on a laptop. So obviously some kind of indexing is required, and the question is what would be the most efficient way to approach this.
notes:
I thought of using SimHash, but in this case I actually need to know the size of the intersection of the sets to identify containment rather than pure similarity/resemblance, and SimHash would lose that information.
I've tried using a simple locality-sensitive hashing technique as described in ch. 3 of Jeffrey Ullman's book, by dividing each vector into 20 "bands" or snippets of length 5, converting these snippets into strings (e.g. [1, 2, 45, 2, 3] -> "124523") and using these strings as keys in a hash table, where each key maps to a list of "candidate neighbors". But the problem is that it creates too many candidates for some of these snippets, and changing the number of bands doesn't help.
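For reference, a minimal Python sketch of the banding scheme described above (the random stand-in data and the use of tuples as bucket keys are my own illustrative assumptions):

    import random
    from collections import defaultdict

    random.seed(0)
    # Stand-in data: 1,000 minhash vectors of 100 integers in [0, 65536).
    fingerprints = [[random.randrange(65536) for _ in range(100)]
                    for _ in range(1000)]

    def band_keys(vec, bands=20):
        # Split the vector into equal-width bands; vectors sharing any
        # band key become candidate neighbors.
        width = len(vec) // bands
        return [(b, tuple(vec[b * width:(b + 1) * width]))
                for b in range(bands)]

    buckets = defaultdict(list)
    for doc_id, vec in enumerate(fingerprints):
        for key in band_keys(vec):
            buckets[key].append(doc_id)

    def candidates(query_vec):
        out = set()
        for key in band_keys(query_vec):
            out.update(buckets.get(key, ()))
        return out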
I might be a bit late, but I would suggest IVFADC indexing by Jegou et al.: Product Quantization for Nearest Neighbor Search
It works for L2 Distance/dot product similarity measures and is a bit complex, but it's particularly efficient in terms of both time and memory.
It is also implemented in the FAISS library for similarity search, so you could also take a look at that.
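If FAISS is an option, an IVFPQ index along these lines is a reasonable starting point (the dimensionality, nlist, m, and nprobe values below are illustrative guesses, and the random data is only a stand-in):

    import numpy as np
    import faiss  # pip install faiss-cpu

    d = 100                                  # vector dimensionality
    xb = np.random.random((100000, d)).astype('float32')   # database vectors
    xq = np.random.random((5, d)).astype('float32')        # query vectors

    nlist, m = 1024, 20                      # coarse cells, PQ sub-quantizers (m must divide d)
    quantizer = faiss.IndexFlatL2(d)
    index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)    # 8 bits per sub-code
    index.train(xb)
    index.add(xb)
    index.nprobe = 16                        # cells visited per query (speed/recall trade-off)
    D, I = index.search(xq, 10)              # distances and ids of 10 nearest neighbors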
One way to go about this is the following:
(1) Arrange the vectors into a tree (a radix tree).
(2) Query the tree with a fuzzy criterion; in other words, a match is one where the difference in values at each node of the tree is within a threshold
(3) From (2) generate a subtree that contains all the matching vectors
(4) Now, repeat process (2) on the subtree with a smaller threshold
Continue until the subtree has K items. If it has too few items, take the previous subtree, calculate the Jaccard distance for each of its members, and sort to eliminate the worst matches until only K items are left. (A rough sketch of this idea follows below.)
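One literal Python reading of this idea, using a plain (uncompressed) trie over the vector elements and a shrinking per-element tolerance; the starting tolerance and the halving schedule are assumptions of mine, not part of the answer:

    def build_trie(vectors):
        # Each path from the root spells out one vector, element by element.
        root = {}
        for idx, vec in enumerate(vectors):
            node = root
            for v in vec:
                node = node.setdefault(v, {})
            node["__id__"] = idx
        return root

    def fuzzy_query(node, query, depth, tol, out):
        # Descend only into children whose value is within tol of the
        # query element at this depth.
        if depth == len(query):
            if "__id__" in node:
                out.append(node["__id__"])
            return
        for value, child in node.items():
            if value != "__id__" and abs(value - query[depth]) <= tol:
                fuzzy_query(child, query, depth + 1, tol, out)

    def k_candidates(root, query, k, start_tol=1024):
        # Tighten the tolerance until fewer than k vectors survive, then
        # return the last subtree that still had at least k.
        tol, best = start_tol, []
        while tol >= 0:
            out = []
            fuzzy_query(root, query, 0, tol, out)
            if len(out) < k:
                break
            best = out
            tol = tol // 2 if tol > 0 else -1
        return best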
Answering my own question after 6 years: there is now a benchmark for approximate nearest neighbor search with many algorithms that solve this problem: https://github.com/erikbern/ann-benchmarks, and the current winner is "Hierarchical Navigable Small World graphs": https://github.com/nmslib/hnswlib
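For completeness, basic usage of hnswlib looks roughly like this (the dimensionality, M, ef values and the random data are placeholder assumptions):

    import numpy as np
    import hnswlib  # pip install hnswlib

    d = 100
    data = np.float32(np.random.random((100000, d)))

    index = hnswlib.Index(space='l2', dim=d)     # 'ip' and 'cosine' also supported
    index.init_index(max_elements=len(data), ef_construction=200, M=16)
    index.add_items(data)
    index.set_ef(50)                             # query-time recall/speed trade-off
    labels, distances = index.knn_query(data[:5], k=10)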
You can use off-the-shelf similarity search services such as AWS-ES or Pinecone.io.

How to find the closest pairs (Hamming Distance) of a string of binary bins in Ruby without O^2 issues?

I've got a MongoDB with about 1 million documents in it. These documents all have a string that represents a 256-bit binary string of 1s and 0s, like:
0110101010101010110101010101
Ideally, I'd like to query for near binary matches, i.e. documents whose strings differ from a given string in only a small number of bit positions. Yes, this is Hamming distance.
This is NOT currently supported in Mongo. So, I'm forced to do it in the application layer.
So, given this, I am trying to find a way to avoid having to do individual Hamming distance comparisons between all the documents; that makes the time to do this basically impossible.
I have a LOT of RAM. And, in Ruby, there seems to be a great gem (algorithms) that can create a number of trees that would reduce the number of queries I'd need to make, although I can't seem to make any of them work (yet).
Ideally, I'd like to make 1 million queries, find the near duplicate strings, and be able to update them to reflect that.
Anyone's thoughts would be appreciated.
I ended up retrieving all the documents into memory (a subset with just the id and the string).
Then, I used a BK Tree to compare the strings.
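The answer doesn't show its code, but a generic BK-tree over Hamming distance looks roughly like this (a minimal sketch, not the author's implementation):

    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    class BKTree:
        def __init__(self, distance):
            self.distance = distance
            self.root = None                # (value, {edge_distance: child})

        def add(self, item):
            if self.root is None:
                self.root = (item, {})
                return
            node = self.root
            while True:
                d = self.distance(item, node[0])
                if d in node[1]:
                    node = node[1][d]
                else:
                    node[1][d] = (item, {})
                    return

        def query(self, item, threshold):
            results, stack = [], [self.root] if self.root else []
            while stack:
                value, children = stack.pop()
                d = self.distance(item, value)
                if d <= threshold:
                    results.append((d, value))
                # Triangle inequality: only subtrees whose edge distance lies
                # in [d - threshold, d + threshold] can contain matches.
                for edge, child in children.items():
                    if d - threshold <= edge <= d + threshold:
                        stack.append(child)
            return results

    tree = BKTree(hamming)
    for s in ["0110", "0111", "1111", "0000"]:
        tree.add(s)
    print(tree.query("0110", 1))   # [(0, '0110'), (1, '0111')]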
The Hamming distance defines a metric space, so you could use the O(n log n) algorithm to find the closest pair of points, which is of the typical divide-and-conquer nature.
You can then apply this repeatedly until you have "enough" pairs.
Edit: I see now that Wikipedia doesn't actually give the algorithm, so here is one description.
Edit 2: The algorithm can be modified to give up if there are no pairs at distance less than n. For the case of the Hamming distance: simply count the level of recursion you are in. If you haven't found something at level n in any branch, then give up (in other words, never enter n + 1). If you are using a metric where splitting on one dimension doesn't always yield a distance of 1, you need to adjust the level of recursion where you give up.
As far as I could understand, you have an input string X and you want to query the database for a document containing string field b such that Hamming distance between X and document.b is less than some small number d.
You can do this in linear time, just by scanning all of your N=1M documents and calculating the distance (which takes small fixed time per document). Since you only want documents with distance smaller than d, you can give up comparison after d unmatched characters; you only need to compare all 256 characters if most of them match.
You can try to scan fewer than N documents, that is, to get better than linear time.
Let ones(s) be the number of 1s in string s. For each document, store ones(document.b) as a new indexed field ones_count. Then you can query only those documents whose number of ones is close enough to ones(X); specifically, ones(X) - d <= document.ones_count <= ones(X) + d. The Mongo index should kick in here.
If you want to find all close enough pairs in the set, see #Philippe's answer.
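In application code (ignoring the Mongo index itself), the prefilter plus the early-exit comparison from this answer might look like the following sketch:

    def ones(s):
        return s.count("1")

    def hamming_within(a, b, d):
        # Early exit: give up as soon as more than d positions differ.
        mismatches = 0
        for x, y in zip(a, b):
            if x != y:
                mismatches += 1
                if mismatches > d:
                    return None
        return mismatches

    def near_matches(query, documents, d):
        # `documents` stands in for the strings fetched from Mongo; in
        # practice ones() would be the stored, indexed ones_count field.
        target = ones(query)
        results = []
        for doc in documents:
            if abs(ones(doc) - target) > d:    # cheap popcount prefilter
                continue
            dist = hamming_within(query, doc, d)
            if dist is not None:
                results.append((dist, doc))
        return results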
This sounds like an algorithmic problem of some sort. You could try comparing those with a similar number of 1 or 0 bits first, then work down through the list from there. Those that are identical will, of course, come out on top. I don't think having tons of RAM will help here.
You could also try and work with smaller chunks. Instead of dealing with 256 bit sequences, could you treat that as 32 8-bit sequences? 16 16-bit sequences? At that point you can compute differences in a lookup table and use that as a sort of index.
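As a concrete example of the lookup-table idea, a 256-entry popcount table lets you compare a 256-bit string 8 bits at a time (a sketch, assuming the bits are stored as a '0'/'1' character string):

    # One-time table: Hamming weight of every byte value.
    POPCOUNT = [bin(i).count("1") for i in range(256)]

    def hamming_256(a, b):
        # a, b: 256-character strings of '0'/'1'.
        total = 0
        for i in range(0, 256, 8):
            total += POPCOUNT[int(a[i:i + 8], 2) ^ int(b[i:i + 8], 2)]
        return total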
Depending on how "different" you care to match on, you could just permute changes on the source binary value and do a keyed search to find the others that match.

Is there an edit distance algorithm that takes "chunk transposition" into account?

I put "chunk transposition" in quotes because I don't know whether or what the technical term should be. Just knowing if there is a technical term for the process would be very helpful.
The Wikipedia article on edit distance gives some good background on the concept.
By taking "chunk transposition" into account, I mean that
Turing, Alan.
should match
Alan Turing
more closely than it matches
Turing Machine
I.e. the distance calculation should detect when substrings of the text have simply been moved within the text. This is not the case with the common Levenshtein distance formula.
The strings will be a few hundred characters long at most -- they are author names or lists of author names which could be in a variety of formats. I'm not doing DNA sequencing (though I suspect people that do will know a bit about this subject).
In the case of your application you should probably think about adapting some algorithms from bioinformatics.
For example, you could first normalize your strings by making sure that all separators are spaces (or anything else you like), so that you would compare "Alan Turing" with "Turing Alan". Then split one of the strings and run an exact string matching algorithm (like the Horspool algorithm) with the pieces against the other string, counting the number of matching substrings.
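A toy Python version of that token-matching idea (Python's substring operator stands in for Horspool here, and the normalization rule is an assumption):

    def token_match_score(a, b):
        # Normalize separators to spaces, then count how many tokens of `a`
        # occur verbatim in `b`.
        def tokens(s):
            return "".join(c if c.isalnum() else " " for c in s).split()
        return sum(1 for t in tokens(a) if t in b)

    print(token_match_score("Turing, Alan.", "Alan Turing"))     # 2
    print(token_match_score("Turing, Alan.", "Turing Machine"))  # 1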
If you would like to find matches that are merely similar but not equal, something along the lines of a local alignment might be more suitable since it provides a score that describes the similarity, but the referenced Smith-Waterman-Algorithm is probably a bit overkill for your application and not even the best local alignment algorithm available.
Depending on your programming environment there is a possibility that an implementation is already available. I personally have worked with SeqAn lately, which is a bioinformatics library for C++ and definitely provides the desired functionality.
Well, that was a rather abstract answer, but I hope it points you in the right direction, but sadly it doesn't provide you with a simple formula to solve your problem.
Have a look at the Jaccard distance metric (JDM). It's an oldie-but-goodie that's pretty adept at token-level discrepancies such as last name first, first name last. For two string comparands, the JDM calculation is simply the number of unique characters the two strings have in common divided by the total number of unique characters between them (in other words the intersection over the union). For example, given the two arguments "JEFFKTYZZER" and "TYZZERJEFF," the numerator is 7 and the denominator is 8, yielding a value of 0.875. My choice of characters as tokens is not the only one available, BTW--n-grams are often used as well.
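The character-level Jaccard calculation from this answer in a few lines of Python, reproducing the worked example:

    def jaccard_similarity(a, b):
        # Intersection over union of the unique characters of each string.
        sa, sb = set(a), set(b)
        return len(sa & sb) / len(sa | sb)

    print(jaccard_similarity("JEFFKTYZZER", "TYZZERJEFF"))   # 0.875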
One of the easiest and most effective modern alternatives to edit distance is called the Normalized Compression Distance, or NCD. The basic idea is easy to explain. Choose a popular compressor that is implemented in your language, such as zlib. Then, given string A and string B, let C(A) be the compressed size of A and C(B) be the compressed size of B. Let AB mean "A concatenated with B", so that C(AB) means "the compressed size of A concatenated with B". Next, compute the fraction

    NCD(A, B) = (C(AB) - min(C(A), C(B))) / max(C(A), C(B))

This value measures similarity in a way similar to edit distance, but supports more forms of similarity depending on which data compressor you choose. Certainly, zlib supports the "chunk" style of similarity that you are describing. If two strings are similar, the compressed size of the concatenation will be near the size of each alone, so the numerator and the result will be near 0. If two strings are very dissimilar, the compressed size of the concatenation will be roughly the sum of the compressed sizes, so the result will be near 1.

This formula is much easier to implement than edit distance or almost any other explicit string similarity measure if you already have access to a data compression library like zlib, because most of the "hard" work, such as heuristics and optimization, has already been done in the compressor, and the formula simply extracts the amount of similar patterns it found using generic, language-agnostic information theory. Moreover, this technique will be much faster than most explicit similarity measures (such as edit distance) for the few-hundred-byte size range you describe. For more information and a sample implementation, search for Normalized Compression Distance (NCD) or have a look at the following paper and github project:
http://arxiv.org/abs/cs/0312044 "Clustering by Compression"
https://github.com/rudi-cilibrasi/libcomplearn C language implementation
There are many other implementations and papers on this subject in the last decade that you may use as well in other languages and with modifications.
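As an illustration of the formula above, a tiny zlib-based sketch (note that for strings this short the compressor's fixed overhead blurs the scores, so treat the printed values as indicative only):

    import zlib

    def C(s):
        return len(zlib.compress(s.encode()))

    def ncd(a, b):
        # (C(AB) - min(C(A), C(B))) / max(C(A), C(B))
        return (C(a + b) - min(C(a), C(b))) / max(C(a), C(b))

    print(ncd("Turing, Alan.", "Alan Turing"))      # should be lower than...
    print(ncd("Turing, Alan.", "Turing Machine"))   # ...this, in principle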
I think you're looking for Jaro-Winkler distance which is precisely for name matching.
You might find compression distance useful for this. See an answer I gave for a very similar question.
Or you could use a k-tuple based counting system:
Choose a small value of k, e.g. k=4.
Extract all length-k substrings of your string into a list.
Sort the list (O(kn log n) time).
Do the same for the other string you're comparing to. You now have two sorted lists.
Count the number of k-tuples shared by the two strings. If the strings are of length n and m, this can be done in O(n+m) time using a list merge, since the lists are in sorted order.
The number of k-tuples in common is your similarity score.
With small alphabets (e.g. DNA) you would usually maintain a vector storing the count for every possible k-tuple instead of a sorted list, although that's not practical when the alphabet is any character at all -- for k=4, you'd need a 256^4 array.
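A minimal Python sketch of the sorted-list variant of this k-tuple counting scheme (the parameters and test strings are illustrative):

    def ktuples(s, k):
        return sorted(s[i:i + k] for i in range(len(s) - k + 1))

    def shared_ktuples(a, b, k=4):
        # Merge two sorted k-tuple lists and count matches in O(n + m)
        # after the O(kn log n) sorts.
        ta, tb = ktuples(a, k), ktuples(b, k)
        i = j = shared = 0
        while i < len(ta) and j < len(tb):
            if ta[i] == tb[j]:
                shared += 1
                i += 1
                j += 1
            elif ta[i] < tb[j]:
                i += 1
            else:
                j += 1
        return shared

    print(shared_ktuples("Turing, Alan.", "Alan Turing"))     # 4
    print(shared_ktuples("Turing, Alan.", "Turing Machine"))  # 3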
I'm not sure whether what you really want is edit distance (which works simply on strings of characters) or semantic distance (choosing the most appropriate or similar meaning). You might want to look at topics in information retrieval for ideas on how to distinguish which is the most appropriate matching term/phrase given a specific term or phrase. In a sense, what you're doing is comparing very short documents rather than strings of characters.
