I'm learning about LSH and minhashing, and I'm trying to understand the rationale for hashing the signature matrix:
We divide the signature matrix into bands and hash (using which hash function?) each band's portion of a column to one of k buckets. Why does that make sense? If we use a regular hash function, then even a slight difference between two columns would probably send them to different buckets.
I do understand the relation between the signature matrix and Jaccard distance, but I don't understand the next step, which is essentially a hash that distributes items evenly.
Related
Is there a hash function for strings, such that strings within a small edit distance (for example, misspellings) would map to the same, or very close, hash values, while dissimilar strings would tend not to?
One option is to compute the set of all k-mers (substrings of length k), hash each of them, and take the minimum hash value.
So you are combining the idea of shingles with the idea of minhashing.
(Repeat multiple times with different hash functions to get better results, as usual with LSH schemes.)
This works because the probability that two strings have the same minhash equals the Jaccard similarity of their k-mer sets.
Similarity of k-mer sets is related to edit distance (but is not the same thing).
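As a rough sketch of the idea above in Python: the function names, the choice of BLAKE2b as the underlying hash, and the defaults k=3 and 64 repetitions are all my own assumptions for illustration, not part of any particular library.

```python
import hashlib

def kmers(s, k=3):
    """Return the set of all substrings of length k (the shingles of s)."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def _hash(kmer, seed):
    """Deterministic 64-bit hash of a k-mer; the seed selects one hash function."""
    data = f"{seed}:{kmer}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_signature(s, k=3, num_hashes=64):
    """One minimum per seeded hash function, taken over the k-mers of s."""
    grams = kmers(s, k)
    if not grams:
        raise ValueError("string is shorter than k")
    return [min(_hash(g, seed) for g in grams) for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two minhashes agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def exact_jaccard(a, b, k=3):
    """Exact Jaccard similarity of the two k-mer sets, for comparison."""
    ka, kb = kmers(a, k), kmers(b, k)
    return len(ka & kb) / len(ka | kb)
```

Comparing, say, minhash_signature("misspelling") with minhash_signature("mispelling") via estimated_jaccard should give a value close to exact_jaccard of the two strings, and more repetitions tighten the estimate.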
In hashing, what does a uniform distribution of hash values mean? Please explain in layman's terms using an appropriate example.
Thank You
It simply means that if your hash table has some size (say n) and you hash k values, with k < n, then:
A sequence of outputs from the function must appear to be a random sequence, even if the input numbers are sequential.
Also, the fundamental requirement for your hash function is obviously to minimize collisions, but at the same time, even for a skewed input, the output of the hash function should be evenly distributed.
EDIT:
As asked, here is what uniform distribution means. Say your hash table has size n and you insert k (< n) elements into it. Then every block of n/k consecutive slots in the table should contain about one element. Likewise, if k = r*c, every block of n/c slots should contain about r elements.
Obviously, perfect uniform distribution is not possible...but output distribution should not be skewed.
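To make the contrast concrete, here is a small sketch that hashes sequential keys into 10 buckets with a well-mixed hash and with a deliberately poor one. Both hash functions and all names are made up for this illustration; SHA-256 stands in for "any hash that mixes well".

```python
import hashlib
from collections import Counter

def good_hash(key, n_buckets):
    """Bucket index from a well-mixed hash (SHA-256, purely for illustration)."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets

def skewed_hash(key, n_buckets):
    """A poor hash: uses only the number of decimal digits in the key."""
    return len(str(key)) % n_buckets

def bucket_counts(hash_fn, keys, n_buckets):
    """How many keys land in each bucket, in bucket order."""
    counts = Counter(hash_fn(key, n_buckets) for key in keys)
    return [counts.get(b, 0) for b in range(n_buckets)]

keys = range(10_000)                      # sequential inputs 0..9999
uniform = bucket_counts(good_hash, keys, 10)
skewed = bucket_counts(skewed_hash, keys, 10)
print(uniform)   # roughly 1000 keys per bucket
print(skewed)    # almost everything piles into one bucket
```

With the digit-count hash, 9000 of the 10000 keys end up in bucket 4 and six buckets stay empty, which is exactly the kind of skew a good hash function should avoid.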
When implementing a bloom filter there are a few potential moving parts:
m = size of bit vector
n = items (expected to be) inserted into filter
k = number of hashes to be used
I understand that there are optimum relationships between m/n and k; however, I haven't found a clear explanation of how to map k hashes onto the bit vector for larger values of m.
In nearly every example I read, people use values of m that are trivial (<256) and they show the hash functions heavily overlapping. For fewer than 256 bits it's easy to imagine having k 256-bit hash functions and ORing them into the vector.
As m gets larger to reduce the false positive rate for large values of n, I'm not sure how the hashes should be mapped to the vector. I've seen hints of ideas such as partitioning the vector and applying "independent" (e.g. different murmur seeds) hashes to each 128-bit section of the vector. However, I haven't seen a concrete example of how to implement bloom filters for larger n/m values.
When I studied Bloom filters, the page below helped me a lot:
http://matthias.vallentin.net/blog/2011/06/a-garden-variety-of-bloom-filters/
Everything described there is implemented as an open-source library. Looking at its source was also very helpful.
With that many elements, I would rather rely on simplified cryptographic hash functions :)
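On the mapping question itself: one common approach is to treat each of the k hash functions as producing a single bit index in the range [0, m), usually by reducing a wide hash value modulo m, so no partitioning is needed however large m gets. The sketch below is my own illustration, not taken from the linked page or its library. It swaps in the well-known "double hashing" trick, deriving all k indices from two halves of one BLAKE2b digest; the class and method names are hypothetical.

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m = m                            # number of bits in the vector
        self.k = k                            # number of hash functions
        self.bits = bytearray((m + 7) // 8)   # m bits, packed 8 per byte

    def _positions(self, item):
        """Derive k bit positions from one 128-bit digest via double hashing."""
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1   # forced odd, so never zero
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        """Set the k bits selected for this item."""
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        """True if all k bits are set (may be a false positive)."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter(m=1_000_000, k=7)
for word in ("alpha", "beta", "gamma"):
    bf.add(word)
print("alpha" in bf)   # True: added items are always reported as present
```

The same code works unchanged for m in the millions or billions, because each hash only ever names one bit; using k separately seeded hashes (e.g. different murmur seeds) in place of the double-hashing step would follow the same pattern.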
"LSH has some huge advantages. First, it is simple. You just compute the hash for all
points in your database, then make a hash table from them. To query, just compute the hash of
the query point, then retrieve all points in the same bin from the hash table."
Referring to the answer on another question I am looking for clarification of the process of LSH analysis.
Suppose I have sparse feature vectors (binary, mostly 0) and would like to use cosine distance as the measure, with a threshold alpha which might vary.
My first step is to compute the hash for each of the vectors. Does the distance measure matter? (I suppose yes.) Does the threshold matter? (I suppose no.) How can I find the appropriate hash function?
If programming, I would have function like:
bytes[] getHash(Vector featureVec)
Then I would put the results in a Map(long vectorId, bytes[] hashcode) <-vectorHashMap
Then I make a hash table from the hashes (putting hashes into bins). I suppose that at least here the threshold should matter. How can I do that?
If programming, it would be like:
Map,Map createHashTable(Map vectorHashMap, long threshold)
which returns two maps: Map of (hashCode, bucketId) and Map of (bucketId, ListOfVectorIds).
Then I could easily retrieve the neighbors, with a vectorId as input and a list of vectorIds as output.
The hash has nothing to do with the distance measure. You get each bit of the hash by taking the dot product of the vector with a randomly chosen vector. The bit says which side of the random vector (really a hyperplane) the hashed vector is on. The bits together are the hash.
Yes, then you index your vectors by hash value for easy retrieval. You don't need a 'bucket ID': your hash is your bucket.
The only catch here is that it is not true that all the nearest vectors are in the bucket that the query hashed to. They merely tend to be near. If it mattered, you might have to search 'similar' buckets, ones that differ in just a few bits, to consider more candidates and do a better job of finding the truly closest neighbors.
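Here is a minimal Python sketch of what I described, under a few assumptions of mine: dense vectors as plain lists, Gaussian random vectors for the hyperplanes, and "similar buckets" meaning buckets whose hash differs in exactly one bit. All function names are made up for the example.

```python
import random
from collections import defaultdict

def make_hyperplanes(dim, num_bits, seed=0):
    """One random Gaussian vector per hash bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def hash_vector(vec, hyperplanes):
    """One bit per hyperplane: 1 if the dot product is non-negative, else 0."""
    bits = 0
    for i, plane in enumerate(hyperplanes):
        dot = sum(p * v for p, v in zip(plane, vec))
        if dot >= 0:
            bits |= 1 << i
    return bits

def build_index(vectors, hyperplanes):
    """Map hash value -> list of vector ids; the hash itself is the bucket key."""
    index = defaultdict(list)
    for vec_id, vec in vectors.items():
        index[hash_vector(vec, hyperplanes)].append(vec_id)
    return index

def candidates(query, index, hyperplanes, probe_neighbors=True):
    """Ids in the query's bucket, plus buckets that differ in exactly one bit."""
    h = hash_vector(query, hyperplanes)
    found = list(index.get(h, []))
    if probe_neighbors:
        for i in range(len(hyperplanes)):
            found.extend(index.get(h ^ (1 << i), []))
    return found

planes = make_hyperplanes(dim=4, num_bits=8)
vectors = {1: [1, 0, 0, 1], 2: [2, 0, 0, 2], 3: [0, 1, 1, 0]}
index = build_index(vectors, planes)
print(candidates([1, 0, 0, 1], index, planes))
```

The candidates are only a shortlist; you would still compute the real cosine distance against each one and apply your threshold alpha at that point, so the threshold never enters the hashing itself.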
Currently I'm studying how to find a nearest neighbor using Locality-sensitive hashing. However while I'm reading papers and searching the web I found two algorithms for doing this:
1- Use L hash tables with L random LSH functions, thus increasing the chance that two similar documents get the same signature. For example, if two documents are 80% similar, then there's an 80% chance that they will get the same signature from one LSH function. However, if we use multiple LSH functions, then there's a higher chance that the documents get the same signature from at least one of them. This method is explained on Wikipedia and I hope my understanding is correct:
http://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search
2- The other algorithm uses a method from a paper (section 5) called: Similarity Estimation Techniques from Rounding Algorithms by Moses S. Charikar. It's based on using one LSH function to generate the signature, then applying P permutations to it and sorting the list. Actually I don't understand the method very well and I hope someone can clarify it.
My main question is: why would anyone use the second method rather than the first method? I find the first one easier and faster.
I really hope someone can help!!!
EDIT:
Actually I'm not sure whether @Raff.Edward was mixing up the "first" and the "second", because only the second method uses a radius, and the first just uses a new hash family g composed from the hash family F. Please check the Wikipedia link. They just use many g functions to generate different signatures, and each g function has a corresponding hash table. In order to find the nearest neighbor of a point, you just pass the point through the g functions and check the corresponding hash tables for collisions. That is how I understood it: more functions, more chances for collisions.
I didn't find any mentioning about radius for the first method.
For the second method they generate only one signature for each feature vector and then apply P permutations to them. Now we have P lists of permuted signatures, each containing n signatures. They then sort each of the P lists. After that, given a query point q, they generate the signature for it, apply the P permutations to it, and then use binary search on each of the P permuted and sorted lists to find the signature most similar to the query q. I concluded this after reading many papers about it, but I still don't understand why anyone would use such a method, because it doesn't seem fast at finding the Hamming distance!!!!
For me I would simply do the following to find the nearest neighbor for a query point q. Given a list of signatures N, I would generate the signature for the query point q and then scan the list N and compute the hamming distance between each element in N and the signature of q. Thus I would end up with the nearest neighbor for q. And it takes O(N)!!!
Your understanding of the first one is a little off. The probability of a collision occurring is not proportional to the similarity, but whether or not it is less than the pre-defined radius. The goal is that anything within the radius will have a high chance of colliding, and anything outside the radius * (1+eps) will have a low chance of colliding (and the area in-between is a little murky).
The first algorithm is actually fairly difficult to implement well, but can get good results. In particular, the first algorithm is for the L1 and L2 (and technically a few more) metrics.
The second algorithm is very simple to implement, though a naive implementation may use up too much memory to be useful depending on your problem size. In this case, the probability of collision is proportional to the similarity of the inputs. However, it only works for the Cosine Similarity (or distance metrics based on a transform of the similarity.)
So which one you would use is based primarily on which distance metric you are using for Nearest Neighbor (or whatever other application).
The second one is actually much easier to understand and implement than the first one, the paper is just very wordy.
The short version: Take a random vector V and give each index an independent random unit normal value. Create as many such vectors as you want bits in the signature. The signature is the sign of each entry of the matrix-vector product of those vectors with your data point. The Hamming distance between any two signatures is then related to the cosine similarity between the respective data points.
Because you can encode the signature into an int array and use an XOR with a bit count instruction to get the hamming distance very quickly, you can get approximate cosine similarity scores very quickly.
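A small sketch of that in Python, with some assumptions of mine: a Python int stands in for the int array (Python ints are arbitrary precision, so one int holds the whole signature), and the estimate uses the standard relation for random hyperplanes that each bit differs with probability angle/pi. The function names are illustrative and not taken from JSAT.

```python
import math
import random

def signature(vec, hyperplanes):
    """Pack the sign bits of the matrix-vector product into one Python int."""
    sig = 0
    for i, plane in enumerate(hyperplanes):
        if sum(p * v for p, v in zip(plane, vec)) >= 0:
            sig |= 1 << i
    return sig

def hamming(sig_a, sig_b):
    """XOR leaves a 1 wherever the signatures disagree; count those bits."""
    return bin(sig_a ^ sig_b).count("1")

def estimated_cosine(sig_a, sig_b, num_bits):
    """Each bit differs with probability angle/pi, so invert that relation."""
    angle = math.pi * hamming(sig_a, sig_b) / num_bits
    return math.cos(angle)

rng = random.Random(0)
num_bits, dim = 1024, 5
planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

a = [1.0, 2.0, 0.0, 0.0, 1.0]
b = [1.0, 1.5, 0.0, 0.5, 1.0]
print(estimated_cosine(signature(a, planes), signature(b, planes), num_bits))
```

In a language with fixed-width integers you would loop over the words of the int array and sum a hardware popcount of each XOR; the logic is the same.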
LSH algorithms don't have a lot of standardization, and the two papers (and others) use different definitions, so it's all a bit confusing at times. I only recently implemented both of these algorithms in JSAT, and am still working on fully understanding them both.
EDIT: Replying to your edit. The Wikipedia article is not great for LSH. If you read the original paper, the first method you are talking about only works for a fixed radius. The hash functions are then created based on that radius, and concatenated to increase the probability of getting nearby points in a collision. They then construct a system for doing k-NN on top of this by determining the maximum value of k they want, and then finding the largest reasonable distance at which they would find the k'th nearest neighbor. In this way, a radius search will very likely return the set of k-NNs. To speed this up, they also create a few extra, smaller radii, since the density is often not uniform, and the smaller the radius you use, the faster the results.
The wikipedia section you linked is taken from the paper description for the "Stable Distribution" section, which presents the hash function for a search of radius r=1.
For the second paper, the "sorting" you describe is not part of the hashing, but part of one scheme for searching the Hamming space more quickly. As I mentioned, I recently implemented this, and in a quick benchmark I did, even a brute force search over the signatures is still much faster than the naive method of NN. Again, you would also pick this method if you need the cosine similarity rather than the L2 or L1 distance. You will find many other papers proposing different schemes for searching the Hamming space created by the signatures.
If you need help convincing yourself it can be faster even if you were still doing brute force, just look at it this way: Let's say that the average sparse document has 40 words in common with another document (a very conservative number in my experience). You have n documents to compare against. Brute force cosine similarity would then involve about 40*n floating point multiplications (and some extra work). If you have a 1024-bit signature, that's only 32 integers (at 32 bits each). That means we could do a brute force LSH search in 32*n integer operations, which are considerably faster than floating point operations.
There are also other factors at play here as well. For a sparse data set we have to keep both the doubles and integer indices to represent the non zero indexes, so the sparse dot product is doing a lot of additional integer operations to see which indices they have in common. LSH also allows us to save memory, because we don't need to store all of these integers and doubles for each vector, instead we can just keep its hash around - which is only a few bytes.
Reduced memory use can help us better exploit the CPU cache.
Your O(n) is the naive way I used in my blog post. And it is fast. However, if you sort the signatures beforehand, you can do the binary search in O(log(n)). Even if you have L of these lists, L << n, so it should be faster. The only issue is that it gives you approximate Hamming NNs, which are themselves already approximating the cosine similarity, so the results can become a bit worse. It depends on what you need.
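For what it's worth, here is a simplified Python sketch of the sorted-list idea. It is only loosely based on the permutation scheme described in the question, not a faithful copy of the paper's procedure or of my JSAT code. The window size, the function names, and the choice to check true Hamming distance on a few neighbours around the binary-search position are all assumptions made for the example.

```python
import bisect
import random

def permute_bits(sig, perm):
    """Move bit perm[i] of sig to position i of the result."""
    out = 0
    for i, src in enumerate(perm):
        if (sig >> src) & 1:
            out |= 1 << i
    return out

def hamming(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")

def build_sorted_lists(signatures, num_bits, num_perms, seed=0):
    """For each random bit permutation, keep (permuted signature, id) pairs sorted."""
    rng = random.Random(seed)
    tables = []
    for _ in range(num_perms):
        perm = list(range(num_bits))
        rng.shuffle(perm)
        entries = sorted((permute_bits(s, perm), i) for i, s in enumerate(signatures))
        tables.append((perm, entries))
    return tables

def approx_nearest(query, signatures, tables, window=4):
    """Binary-search each sorted list, then check true Hamming distance nearby."""
    best_id, best_dist = None, None
    for perm, entries in tables:
        key = permute_bits(query, perm)
        pos = bisect.bisect_left(entries, (key, -1))
        lo, hi = max(0, pos - window), min(len(entries), pos + window)
        for _, idx in entries[lo:hi]:
            dist = hamming(query, signatures[idx])
            if best_dist is None or dist < best_dist:
                best_id, best_dist = idx, dist
    return best_id, best_dist

rng = random.Random(1)
sigs = [rng.getrandbits(64) for _ in range(200)]
tables = build_sorted_lists(sigs, num_bits=64, num_perms=8)
print(approx_nearest(sigs[17] ^ 1, sigs, tables))
```

Each query costs one binary search plus a handful of Hamming computations per list, instead of n of them. The price is that the true nearest signature can be missed when its differing bits land in the high-order positions of every permutation.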