Is there a hash function for strings, such that strings within a small edit distance (for example, misspellings) would map to the same, or very close, hash values, while dissimilar strings would tend not to?
One option is to compute the set of all k-mers (substrings of length k) of each string, hash each k-mer, and take the minimum hash value.
So you are combining the idea of shingles with the idea of MinHashing.
(Repeat with multiple independent hash functions to get better results, as usual with LSH schemes.)
The reason this works is that the probability of two strings having the same MinHash equals the Jaccard similarity of their k-mer sets.
The similarity of the k-mer sets is related to edit distance (but is not the same thing).
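Here is a rough Java sketch of that pipeline (my own illustration, with an ad-hoc seeded hash; any decent hash family will do). The fraction of seeds on which two strings agree estimates the Jaccard similarity of their k-mer sets.

import java.util.HashSet;
import java.util.Set;

public class KmerMinHash {

    // All k-mers (length-k substrings) of s.
    static Set<String> kmers(String s, int k) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i + k <= s.length(); i++) {
            out.add(s.substring(i, i + k));
        }
        return out;
    }

    // SplitMix64-style finalizer used as a cheap seeded hash.
    static long mix64(long z) {
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    // One MinHash value: the minimum seeded hash over all k-mers of s.
    static long minHash(String s, int k, long seed) {
        long min = Long.MAX_VALUE;
        for (String kmer : kmers(s, k)) {
            long h = mix64(kmer.hashCode() ^ (seed * 0x9E3779B97F4A7C15L));
            if (h < min) min = h;
        }
        return min;
    }

    public static void main(String[] args) {
        // Repeating with many seeds: the fraction of matching MinHashes
        // estimates the Jaccard similarity of the two k-mer sets.
        int k = 3, numHashes = 128, matches = 0;
        for (long seed = 0; seed < numHashes; seed++) {
            if (minHash("necessary", k, seed) == minHash("neccessary", k, seed)) matches++;
        }
        System.out.println("Estimated Jaccard similarity: " + (double) matches / numHashes);
    }
}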
Related
I am currently working on an application where I have a large number of hash values (strings).
When a query hash value (string) is given, the search process goes through those strings and returns the strings whose Hamming distance to the query string is less than a given threshold.
Hash values are not binary strings, e.g. "1000302014771944008".
All hash values (strings) have the same fixed length.
The threshold value is not small (normally t > 25) and can vary.
I want to implement this search process using an efficient algorithm rather than a brute-force approach.
I have read some research papers (like this & this), but they are for binary strings or for low threshold values. I also tried locality-sensitive hashing, but the implementations I found were focused on binary strings.
Are there any algorithms or data structures to address this problem?
Any suggestions are also welcome. Thank you in advance.
Additional Information
Hamming Distance between non-binary strings
string 1: 0014479902266110001131133
string 2: 0014409902226110001111133
          -------------------------
               1     1        1
(the strings differ at positions 6, 12, and 21, so the Hamming distance is 3)
Considered brute-force approach
1. Calculate the Hamming distance between the first hash string and the query hash string.
2. If the Hamming distance is less than the threshold, add the hash string to the results list.
3. Repeat steps 1 and 2 for all hash strings.
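In Java, this brute force looks roughly like the following (a sketch with my own names; it assumes all stored hash strings have the same length as the query string):

import java.util.ArrayList;
import java.util.List;

public class BruteForceHammingSearch {

    static int hammingDistance(String a, String b) {
        int d = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) d++;
        }
        return d;
    }

    static List<String> search(List<String> hashStrings, String query, int threshold) {
        List<String> results = new ArrayList<>();
        for (String s : hashStrings) {
            if (hammingDistance(s, query) < threshold) {  // "less than the threshold"
                results.add(s);
            }
        }
        return results;
    }
}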
Read Section 7 of the paper "HmSearch: An Efficient Hamming Distance Query Processing Algorithm".
The state-of-the-art result for the d-query problem can be found in "Dictionary matching and indexing with errors and don't cares", which solves the d-query problem in O(m + (log(nm))^d + occ) time using O(n · (log(nm))^d) space, where occ is the number of query results.
If the threshold value is not small, there are practical solutions for binary strings, described in HmSearch.
I think it is possible to apply the same practical solutions from HmSearch to arbitrary strings, but I have never seen that done.
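This is not HmSearch itself, but a sketch of the pigeonhole-partitioning idea that this family of algorithms builds on, and it already works for arbitrary (non-binary) strings, assuming the threshold t is smaller than the string length. Split each string into t+1 segments; any string within Hamming distance t of the query must match it exactly on at least one segment, so an index keyed by (segment number, segment value) yields a small candidate set to verify. Names are my own.

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class PartitionHammingIndex {
    private final int t;                                  // distance threshold
    private final int length;                             // fixed string length, should be > t
    private final List<Map<String, List<String>>> index;  // one map per segment

    public PartitionHammingIndex(int length, int t) {
        this.t = t;
        this.length = length;
        this.index = new ArrayList<>();
        for (int i = 0; i <= t; i++) index.add(new HashMap<>());
    }

    // Segment boundaries: t+1 roughly equal pieces.
    private int segStart(int i) { return (int) ((long) i * length / (t + 1)); }

    public void add(String s) {
        for (int i = 0; i <= t; i++) {
            String seg = s.substring(segStart(i), segStart(i + 1));
            index.get(i).computeIfAbsent(seg, k -> new ArrayList<>()).add(s);
        }
    }

    public List<String> query(String q) {
        // If hamming(q, s) <= t, then q and s agree exactly on at least one of the t+1 segments.
        Set<String> candidates = new LinkedHashSet<>();
        for (int i = 0; i <= t; i++) {
            String seg = q.substring(segStart(i), segStart(i + 1));
            candidates.addAll(index.get(i).getOrDefault(seg, Collections.emptyList()));
        }
        List<String> results = new ArrayList<>();
        for (String c : candidates) {                     // verify each candidate exactly
            if (hamming(q, c) <= t) results.add(c);
        }
        return results;
    }

    private static int hamming(String a, String b) {
        int d = 0;
        for (int i = 0; i < a.length(); i++) if (a.charAt(i) != b.charAt(i)) d++;
        return d;
    }
}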
Something like this could work for you.
http://blog.mafr.de/2011/01/06/near-duplicate-detection/
"LSH has some huge advantages. First, it is simple. You just compute the hash for all
points in your database, then make a hash table from them. To query, just compute the hash of
the query point, then retrieve all points in the same bin from the hash table."
Referring to the answer on another question, I am looking for clarification of the LSH analysis process.
Suppose I have sparse feature vectors (binary, mostly 0) and would like to use cosine distance as the measure with a threshold alpha, which might vary.
My first step is to compute the hash for each of the vectors. Does the distance measure matter? (I suppose yes.) Does the threshold matter? (I suppose no.) How can I find an appropriate hash function?
If programming, I would have a function like:
byte[] getHash(Vector featureVec)
Then I would put the results into a Map<Long, byte[]> of (vectorId -> hashCode), call it vectorHashMap.
Then I make a hash table from the hashes (putting the hashes into bins). I suppose the threshold should matter at least here. How can I do that?
If programming, it would be like:
createHashTable(Map<Long, byte[]> vectorHashMap, long threshold)
which returns two maps: a Map of (hashCode -> bucketId) and a Map of (bucketId -> list of vectorIds).
Then I could easily retrieve the neighbours, taking a vectorId as input and returning a list of vectorIds as output.
The hash has nothing to do with the distance measure. You get each bit of the hash by dotting the vector with a randomly chosen vector. The bit says which side of that random vector (a hyperplane, really) the hashed vector lies on. The bits together form the hash.
Yes, then you index your vectors by hash value for easy retrieval. You don't need a 'bucket ID'; your hash is your bucket.
The only catch is that it is not true that all the nearest vectors land in the bucket a vector hashes to; they merely tend to be near. If that matters, you may have to search 'similar' buckets, ones that differ in just a few bits, to consider more candidates and do a better job of finding the truly closest neighbours.
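A minimal Java sketch of that hashing step (my own names; it assumes dense double[] vectors of a known dimension for simplicity, but sparse vectors just need a sparse dot product):

import java.util.BitSet;
import java.util.Random;

public class RandomHyperplaneHasher {
    private final double[][] hyperplanes; // one random vector per hash bit

    public RandomHyperplaneHasher(int numBits, int dimension, long seed) {
        Random rnd = new Random(seed);
        hyperplanes = new double[numBits][dimension];
        for (int b = 0; b < numBits; b++) {
            for (int d = 0; d < dimension; d++) {
                hyperplanes[b][d] = rnd.nextGaussian();
            }
        }
    }

    // The hash itself is the bucket key: no separate "bucket ID" needed.
    public BitSet getHash(double[] featureVec) {          // featureVec.length == dimension
        BitSet hash = new BitSet(hyperplanes.length);
        for (int b = 0; b < hyperplanes.length; b++) {
            double dot = 0;
            for (int d = 0; d < featureVec.length; d++) {
                dot += hyperplanes[b][d] * featureVec[d];
            }
            hash.set(b, dot >= 0);                        // which side of the hyperplane
        }
        return hash;
    }
}

You would then index vectors in something like a Map<BitSet, List<Long>> keyed directly by the hash. The number of bits controls how fine the buckets are; the similarity threshold only enters when you decide how many independent hash tables to build or how many neighbouring buckets to probe.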
Many randomized algorithms and data structures (such as the Count-Min Sketch) require hash functions with the pairwise independence property. Intuitively, this means that the probability of a hash collision with a specific element is small, even if the output of the hash function for that element is known.
I have found many descriptions of pairwise independent hash functions for fixed-length bitvectors based on random linear functions. However, I have not yet seen any examples of pairwise independent hash functions for strings.
Are there any families of pairwise independent hash functions for strings?
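For reference, here is a sketch of the kind of fixed-length construction I mean (the classic multiply-add family over a prime field), in Java with illustrative names; it is not an answer for unbounded strings, just the baseline the question is contrasting against.

import java.util.Random;

// h_{a,b}(x) = ((a*x + b) mod p) mod m, with p prime, keys in [0, p),
// and a, b drawn uniformly from {0, ..., p-1}. The mod-p output is exactly
// pairwise independent; the final "mod m" makes it only approximately so
// unless m divides p.
public class PairwiseIndependentHash {
    private static final long P = 2147483647L; // 2^31 - 1, a Mersenne prime
    private final long a, b;
    private final int m;                       // number of buckets

    public PairwiseIndependentHash(int m, Random rnd) {
        this.m = m;
        this.a = rnd.nextInt((int) P);         // uniform in [0, P-1]
        this.b = rnd.nextInt((int) P);         // uniform in [0, P-1]
    }

    // x must be a non-negative key smaller than P.
    public int hash(int x) {
        long v = (a * x + b) % P;              // fits in a signed 64-bit long, no overflow
        return (int) (v % m);
    }
}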
I'm pretty sure they exist, but there's a bit of measure-theoretic subtlety to your question. You might be better off asking on mathoverflow. I'm very rusty with this stuff, but I think I can show that, even if they do exist, you don't actually want one.
To begin with, you need a probability measure on the strings, and any such measure will necessarily look very different from any notion of "uniform." (It's a countable set and all the sigma-algebras over countable sets just clump together sets of elements and assign a probability to each of those sets. You'll want all of the clumps to be singletons.)
Now, if you only give finitely many strings positive probability, you're back in the finite case. So let's ignore that for now and assume that, for any epsilon > 0, you can find a string whose probability is strictly between 0 and epsilon.
Suppose we restrict to the case where the hash functions map strings to {0,1}.
Your family of hash functions will need to be infinite as well and you'll want to talk about it as a probability space of hash functions. If you have a set H of hash functions that has positive probability, then every string is mapped to both 0 and 1 by (different) elements of H. In particular, no single element of H has positive probability. So H has to be uncountable and you've suddenly run into difficult representability issues.
I'd be very happy if someone who hasn't forgotten measure theory would chime in here.
Not with a seed of bounded length and an output of nonzero bounded length.
A fairly crude argument to this effect: for a finite family of hash functions H, consider the map f that sends an element x to the tuple of values h(x) for every h in H. The codomain of each h, and hence of f, is finite, but the set of strings is infinite, so there exist two distinct strings mapped the same way by every h in H; given that there are at least two possible hash values, this contradicts pairwise independence.
I am trying to solve a problem that involves comparing a large number of word sequences, each of which contains many ordered words drawn from a vocabulary of around 600+ words (very high dimensionality!), for similarity, and then clustering them into distinct groupings. The solution needs to be as unsupervised as possible.
The data looks like
[Apple, Banana, Orange...]
[Apple, Banana, Grape...]
[Jelly, Anise, Orange...]
[Strawberry, Banana, Orange...]
...etc
The order of the words in each set matters ([Apple, Banana, Orange] is distinct from [Apple, Orange, Banana]).
The approach I have been using so far is to compute Levenshtein distance (limited by a distance threshold) as the metric in a Python script, treating each word as a single token, generate a similarity matrix from the distances, and throw that matrix into k-medoids in KNIME for the groupings.
My questions are:
Is Levenshtein the most appropriate distance metric to use for this problem?
Is mean/medoid prototype clustering the best way to go about the groupings?
I haven't yet put much thought into validating the choice for 'k' in the clustering. Would evaluating an SSE curve of the clustering be the best way to go about this?
Are there any flaws in my methodology?
As an extension to the solution in the future, given training data, would anyone happen to have any ideas for assigning probabilities to cluster assignments? For example, set 1 has an 80% chance of being in cluster 1, etc.
I hope my questions don't seem too silly or the answers painfully obvious, I'm relatively new to data mining.
Thanks!
Yes, Levenshtein is a very suitable way to do this. But if the sequences vary in size much, you might be better off normalising these distances by dividing by the sum of the sequence lengths -- otherwise you will find that observed distances tend to increase for pairs of long sequences whose "average distance" (in the sense of the average distance between corresponding k-length substrings, for some small k) is constant.
Example: The pair ([Apple, Banana], [Carrot, Banana]) could be said to have the same "average" distance as ([Apple, Banana, Widget, Xylophone], [Carrot, Banana, Yam, Xylophone]) since every 2nd item matches in both, but the latter pair's raw Levenshtein distance will be twice as great.
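A small Java sketch of what I mean, treating each word as a token and dividing by the sum of the sequence lengths (my own illustration, not a tuned implementation):

import java.util.List;

public class SequenceLevenshtein {

    static int levenshtein(List<String> a, List<String> b) {
        int[][] d = new int[a.size() + 1][b.size() + 1];
        for (int i = 0; i <= a.size(); i++) d[i][0] = i;
        for (int j = 0; j <= b.size(); j++) d[0][j] = j;
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                int cost = a.get(i - 1).equals(b.get(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,      // deletion
                                            d[i][j - 1] + 1),     // insertion
                                   d[i - 1][j - 1] + cost);       // substitution
            }
        }
        return d[a.size()][b.size()];
    }

    static double normalizedDistance(List<String> a, List<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 0.0;
        return (double) levenshtein(a, b) / (a.size() + b.size());
    }

    public static void main(String[] args) {
        List<String> s1 = List.of("Apple", "Banana", "Orange");
        List<String> s2 = List.of("Apple", "Orange", "Banana");
        System.out.println(normalizedDistance(s1, s2)); // order matters: > 0
    }
}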
Also bear in mind that Levenshtein does not make special allowances for "block moves": if you take a string, and move one of its substrings sufficiently far away, then the resulting pair (of original and modified strings) will have the same Levenshtein score as if the 2nd string had completely different elements at the position where the substring was moved to. If you want to take this into account, consider using a compression-based distance instead. (Although I say there that it's useful for computing distances without respect to order, it does of course favour ordered similarity to disordered similarity.)
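If you want to experiment with a compression-based distance, one simple instance is the normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)). Here is a rough sketch using the JDK's Deflater as the compressor C (my own illustration; general-purpose compressors are noisy on very short inputs, so treat the values as rough):

import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.Deflater;

public class CompressionDistance {

    static int compressedSize(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        byte[] buf = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf);   // count output bytes, reuse the buffer
        }
        deflater.end();
        return total;
    }

    static double ncd(String x, String y) {
        int cx = compressedSize(x.getBytes(StandardCharsets.UTF_8));
        int cy = compressedSize(y.getBytes(StandardCharsets.UTF_8));
        int cxy = compressedSize((x + " " + y).getBytes(StandardCharsets.UTF_8));
        return (double) (cxy - Math.min(cx, cy)) / Math.max(cx, cy);
    }

    // For word sequences, join the words before comparing.
    static double ncd(List<String> a, List<String> b) {
        return ncd(String.join(" ", a), String.join(" ", b));
    }
}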
Check out SimMetrics on SourceForge, a platform supporting a variety of metrics that you can use to evaluate which works best for a task.
For a commercial version, check out K-Similarity from K-Now.co.uk.
This problem is a little similar to the one solved by reservoir sampling, but not the same. I think it's also a rather interesting problem.
I have a large dataset (typically hundreds of millions of elements), and I want to estimate the number of unique elements in this dataset. There may be anywhere from a few, to millions of unique elements in a typical dataset.
Of course, the obvious solution is to maintain a running HashSet of the elements you encounter and count them at the end. This would yield an exact result, but it would require me to carry a potentially large amount of state as I scan through the dataset (i.e. all unique elements encountered so far).
Unfortunately, in my situation this would require more RAM than is available to me (noting that the dataset may be far larger than available RAM).
I'm wondering if there would be a statistical approach to this that would allow me to do a single pass through the dataset and come up with an estimated unique element count at the end, while maintaining a relatively small amount of state while I scan the dataset.
The input to the algorithm would be the dataset (an Iterator in Java parlance), and it would return an estimated unique object count (probably a floating point number). It is assumed that these objects can be hashed (ie. you can put them in a HashSet if you want to). Typically they will be strings, or numbers.
You could use a Bloom filter to get a reasonable lower bound. You just do a pass over the data, counting and inserting the items that were definitely not already in the set.
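A sketch of what I mean, with a hand-rolled Bloom filter (my own code; in practice you would use a library implementation). An item is counted only when at least one of its bits was still unset, i.e. it was definitely new; false positives can only lose counts, so the result is a lower bound on the number of distinct elements.

import java.util.BitSet;
import java.util.Iterator;

public class BloomDistinctLowerBound {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    public BloomDistinctLowerBound(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Derive the k bit positions from two hash values (the usual double-hashing trick).
    private boolean insertAndCheckNew(Object element) {
        int h1 = element.hashCode();
        int h2 = Integer.reverse(h1) * 0x9E3779B1;   // crude second hash
        boolean wasNew = false;
        for (int i = 0; i < numHashes; i++) {
            int pos = Math.floorMod(h1 + i * h2, numBits);
            if (!bits.get(pos)) {
                wasNew = true;
                bits.set(pos);
            }
        }
        return wasNew;
    }

    public long countDistinctLowerBound(Iterator<?> data) {
        long count = 0;
        while (data.hasNext()) {
            if (insertAndCheckNew(data.next())) count++;
        }
        return count;
    }
}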
This problem is well-addressed in the literature; a good review of various approaches is http://www.edbt.org/Proceedings/2008-Nantes/papers/p618-Metwally.pdf. The simplest approach (and most compact for very high accuracy requirements) is called Linear Counting. You hash elements to positions in a bitvector just like you would a Bloom filter (except only one hash function is required), but at the end you estimate the number of distinct elements by the formula D = -total_bits * ln(unset_bits/total_bits). Details are in the paper.
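A sketch of Linear Counting as described (my own code, using hashCode() as a stand-in for a well-mixed hash function):

import java.util.BitSet;
import java.util.Iterator;

public class LinearCounting {
    private final BitSet bitmap;
    private final int totalBits;

    public LinearCounting(int totalBits) {
        this.bitmap = new BitSet(totalBits);
        this.totalBits = totalBits;
    }

    public void add(Object element) {
        // One hash function: set one position in the bitvector per element.
        bitmap.set(Math.floorMod(element.hashCode(), totalBits));
    }

    public double estimateDistinct() {
        int unsetBits = totalBits - bitmap.cardinality();
        if (unsetBits == 0) return Double.POSITIVE_INFINITY;     // bitmap saturated: use more bits
        // D = -total_bits * ln(unset_bits / total_bits)
        return -totalBits * Math.log((double) unsetBits / totalBits);
    }

    public static double estimate(Iterator<?> data, int totalBits) {
        LinearCounting lc = new LinearCounting(totalBits);
        while (data.hasNext()) lc.add(data.next());
        return lc.estimateDistinct();
    }
}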
If you have a hash function that you trust, then you could maintain a hashset just like you would for the exact solution, but throw out any item whose hash value is outside of some small range. E.g., use a 32-bit hash, but only keep items where the first two bits of the hash are 0. Then multiply by the appropriate factor at the end to approximate the total number of unique elements.
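A sketch of that idea (my own code; it keeps the surviving hash values rather than the items themselves, which is equivalent for counting and even smaller). Only elements whose hash falls in a 1/2^k slice of the hash space are kept, the distinct survivors are counted exactly, and the count is scaled back up by 2^k.

import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

public class HashSampledDistinctCount {

    public static double estimate(Iterator<?> data, int prefixBits) {
        Set<Integer> sampledHashes = new HashSet<>();
        while (data.hasNext()) {
            int h = data.next().hashCode();          // stand-in for a hash you trust
            // Keep only items whose top prefixBits bits are all zero.
            if (prefixBits == 0 || (h >>> (32 - prefixBits)) == 0) {
                sampledHashes.add(h);
            }
        }
        // Each surviving distinct hash represents about 2^prefixBits distinct elements.
        return sampledHashes.size() * Math.pow(2, prefixBits);
    }
}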
Nobody has mentioned the approximate algorithm designed specifically for this problem: HyperLogLog.
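For completeness, a simplified sketch of the HyperLogLog idea (my own code; the constants follow the original Flajolet et al. paper, and the large-range correction is omitted to keep it short): hash each element to 64 bits, use the first p bits to pick a register, keep in each register the maximum "leading zeros + 1" seen in the remaining bits, and estimate the cardinality from the harmonic mean of 2^register.

public class SimpleHyperLogLog {
    private final int p;          // number of index bits; m = 2^p registers (use p >= 7 here)
    private final int m;
    private final byte[] registers;

    public SimpleHyperLogLog(int p) {
        this.p = p;
        this.m = 1 << p;
        this.registers = new byte[m];
    }

    public void add(String element) {
        long h = fnv1a64(element);
        int idx = (int) (h >>> (64 - p));                 // first p bits -> register index
        long rest = h << p;                               // remaining bits, left-aligned
        int rho = Long.numberOfLeadingZeros(rest) + 1;    // position of leftmost 1-bit
        if (rho > 64 - p + 1) rho = 64 - p + 1;           // all remaining bits were zero
        if (rho > registers[idx]) registers[idx] = (byte) rho;
    }

    public double estimate() {
        double sum = 0;
        int zeroRegisters = 0;
        for (byte r : registers) {
            sum += Math.pow(2, -r);
            if (r == 0) zeroRegisters++;
        }
        double alpha = 0.7213 / (1 + 1.079 / m);          // bias correction for m >= 128
        double e = alpha * m * m / sum;
        if (e <= 2.5 * m && zeroRegisters > 0) {          // small-range correction
            e = m * Math.log((double) m / zeroRegisters); // fall back to Linear Counting
        }
        return e;
    }

    // Simple 64-bit FNV-1a hash; in practice use a stronger hash you trust.
    private static long fnv1a64(String s) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }
}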