Set distance as similarity metric for MinHashing algorithm - algorithm

I am currently working on document clustering using MinHashing technique. However, I am not getting desired results as MinHash is a rough estimation of Jaccard similarity and it doesn't suits my requirement.
This is my scenario:
I have a huge set of books and if a single page is given as a query, I need to find the corresponding book from which this page is obtained from. The limitation is, I have features for the entire book and it's impossible to get page-by-page features for the books. In this case, Jaccard similarity is giving poor results if the book is too big. What I really want is the distance between query page and the books (not vice-versa). That is:
Given 2 sets A, B: I want the distance from A to B,
dis(A->B) = (A & B)/A
Is there similar distance metric that gives distance from set A to set B. Further, is it still possible to use MinHashing algorithm with this kind of similarity metric?

We can estimate your proposed distance function using a similar approach as the MinHash algorithm.
For some hash function h(x), compute the minimal values of h over A and B. Denote these values h_min(A) and h_min(B). The MinHash algorithm relies on the fact that the probability that h_min(A) = h_min(B) is (A & B) / (A | B). We may observe that the probability that h_min(A) <= h_min(B) is A / (A | B). We can then compute (A & B) / A as the ratio of these two probabilities.
Like in the regular MinHash algorithm, we can approximate these probabilities by repeated sampling until the desired variance is achieved.

Related

Compute the similarity between two lists of objects

I'd like to compute the similarity between two lists of various lengths. In particular, the similarity has to take into account different conditions:
-Given 2 list A and B, if A=B then similarity(A,B)=1
-In general, if B contains A, then similarity (A,B)->1. However, the measure of similarity should also take into consideration the number of elements of the two list. (E.g. if A contains 1000 objects and B just one, which is also contained in A, then similarity(A,B)->0).
-Similarity(A,B) defines also a threshold T. Values of similarity grater than T indicate that the two lists are similar.
Cosine similarity is probably related to this problem, but I have no idea how to work with subset and treshold.
I have found also different approaches, but threshold pararameter is snot specified:
-A Similarity Measure for Indefinite Rankings
-Kendall rank correlation coefficient
I think you are looking for some kind of set similarity.
The two most prominent measures for that are Jaccard Index and Sørensen–Dice coefficient
In your case, Using Jaccard similarity coefficient might help.

Index structure for top-k queries on bitstrings

Given array of bitstrings (all of the same length) and query string Q find top-k most similar strings to Q, where similarity between strings A and B is defined as number of 1 in A and B, (operation and is applied bitwise).
I think there is should be a classical result for this problem.
k is small, in hundreds, while number of vectors in hundreds of millions and length of the vectors is 512 or 1024
One way to tackle this problem is to construct a K-Nearest Neighbor Graph (K-NNG) (digraph) with a Russell-Rao similarity function.
Note that efficient K-NNG construction is still an open problem,and none of the known solutions for this problem is general, efficient and scalable [quoting from Efficient K-Nearest Neighbor Graph Construction for Generic Similarity Measures - Dong, Charikar, Li 2011].
Your distance function is often called Russell-Rao similarity (see for example A Survey of Binary Similarity and Distance Measures - Choi, Cha, Tappert 2010). Note that Russell-Rao similarity is not a metric (see Properties of Binary Vector Dissimilarity Measures - Zhang, Srihari 2003): The "if" part of "d(x, y) = 0 iff x == y" is false.
In A Fast Algorithm for Finding k-Nearest Neighbors with Non-metric Dissimilarity - Zhang, Srihari 2002, the authors propose a fast hierarchical search algorithm to find k-NNs using a non-metric measure in a binary vector space. They use a parametric binary vector distance function D(β). When β=0, this function is reduced to the Russell-Rao distance function. I wouldn't call it a "classical result", but this is the the only paper I could find that examines this problem.
You may want to check these two surveys: On nonmetric similarity search problems in complex domains - Skopal, Bustos 2011 and A Survey on Nearest Neighbor Search Methods - Reza, Ghahremani, Naderi 2014. Maybe you'll find something I missed.
This problem can be solved by writing simple Map and Reduce job. I'm neither claiming that this is the best solution, nor I'm claiming that this is the only solution.
Also, you have disclosed in the comments that k is in hundreds, there are millions of bitstrings and that the size of each of them is 512 or 1024.
Mapper pseudo-code:
Given Q;
For every bitstring b, compute similarity = b & Q
Emit (similarity, b)
Now, the combiner can consolidate the list of all bitStrings from every mapper that have the same similarity.
Reducer pseudo-code:
Consume (similarity, listOfBitStringsWithThisSimilarity);
Output them in decreasing order for similarity value.
From the output of reducer you can extract the top-k bitstrings.
So, MapReduce paradigm is probably the classical solution that you are looking for.

How to optimize finding similarities?

I have a set of 30 000 documents represented by vectors of floats. All vectors have 100 elements. I can find similarity of two documents by comparing them using cosine measure between their vectors. The problem is that it takes to much time to find the most similar documents. Is there any algorithm which can help me with speeding up this?
EDIT
Now, my code just counts cosine similarity between first and all others vectors. It takes about 3 sec. I would like to speed it up ;) algorithm doesn't have to be accurate but should give similar results to full search.
Sum of elements of each vector is equal 1.
start = time.time()
first = allVectors[0]
for vec in allVectors[1:]:
cosine_measure(vec[1:], first[1:])
print str(time.time() - start)
Would locality sensitive hashing (LHS) help?
In case of LHS, the hashing function maps similar items near each other with a probability of your choice. It is claimed to be especially well-suited for high-dimensional similarity search / nearest neighbor search / near duplicate detection and it looks like to me that's exactly what you are trying to achieve.
See also How to understand Locality Sensitive Hashing?
There is a paper How to Approximate the Inner-product: Fast Dynamic Algorithms for Euclidean Similarity describing how to perform a fast approximation of the inner product. If this is not good or fast enough, I suggest to build an index containing all your documents. A structure similar to a quadtree but based on a geodesic grid would probably work really well, see Indexing the Sphere with the Hierarchical Triangular Mesh.
UPDATE: I completely forgot that you are dealing with 100 dimensions. Indexing high dimensional data is notoriously hard and I am not sure how well indexing a sphere will generalize to 100 dimensions.
If your vectors are normalized, the cosine is related to the Euclidean distance: ||a - b||² = (a - b)² = ||a||² + ||b||² - 2 ||a|| ||b|| cos(t) = 1 + 1 - 2 cos(t). So you can recast your problem in terms of Euclidean nearest neighbors.
A nice approach if that of the kD trees, a spatial data structure that generalizes the binary search (http://en.wikipedia.org/wiki/K-d_tree). Anyway, kD trees are known to be inefficient in high dimensions (your case), so that the so-called best-bin-first-search is preferred (http://en.wikipedia.org/wiki/Best-bin-first_search).

Machine Learning Algorithm for Completing Sparse Matrix Data

I've seen some machine learning questions on here so I figured I would post a related question:
Suppose I have a dataset where athletes participate at running competitions of 10 km and 20 km with hilly courses i.e. every competition has its own difficulty.
The finishing times from users are almost inverse normally distributed for every competition.
One can write this problem as a matrix:
Comp1 Comp2 Comp3
User1 20min ?? 10min
User2 25min 20min 12min
User3 30min 25min ??
User4 30min ?? ??
I would like to complete the matrix above which has the size 1000x20 and a sparseness of 8 % (!).
There should be a very easy way to complete this matrix, since I can calculate parameters for every user (ability) and parameters for every competition (mu, lambda of distributions). Moreover the correlation between the competitions are very high.
I can take advantage of the rankings User1 < User2 < User3 and Item3 << Item2 < Item1
Could you maybe give me a hint which methods I could use?
Your astute observation that this is a matrix completion problem gets
you most of the way to the solution. I'll codify your intuition that
the combination of ability of a user and difficulty of the course
yields the time of a race, then present various algorithms.
Model
Let the vector u denote the speed of the users so that u_i is user i's
speed. Let the vector v denote the difficulty of the courses so
that v_j is course j's difficulty. Also when available, let t_ij be user i's time on
course j, and define y_ij = 1/t_ij, user i's speed on course j.
Since you say the times are inverse Gaussian distributed, a sensible
model for the observations is
y_ij = u_i * v_j + e_ij,
where e_ij is a zero-mean Gaussian random variable.
To fit this model, we search for vectors u and v that minimize the
prediction error among the observed speeds:
f(u,v) = sum_ij (u_i * v_j - y_ij)^2
Algorithm 1: missing value Singular Value Decomposition
This is the classical Hebbian
algorithm. It
minimizes the above cost function by gradient descent. The gradient of
f wrt to u and v are
df/du_i = sum_j (u_i * v_j - y_ij) v_j
df/dv_j = sum_i (u_i * v_j - y_ij) u_i
Plug these gradients into a Conjugate Gradient solver or BFGS
optimizer, like MATLAB's fmin_unc or scipy's optimize.fmin_ncg or
optimize.fmin_bfgs. Don't roll your own gradient descent unless you're willing to implement a very good line search algorithm.
Algorithm 2: matrix factorization with a trace norm penalty
Recently, simple convex relaxations to this problem have been
proposed. The resulting algorithms are just as simple to code up and seem to
work very well. Check out, for example Collaborative Filtering in a Non-Uniform World:
Learning with the Weighted Trace Norm. These methods minimize
f(m) = sum_ij (m_ij - y_ij)^2 + ||m||_*,
where ||.||_* is the so-called nuclear norm of the matrix m. Implementations will end up again computing gradients with respect to u and v and relying on a nonlinear optimizer.
There are several ways to do this, perhaps the best architecture to try first is the following:
(As usual, as a preprocessing step normalize your data into a uniform function with 0 mean and 1 std deviation as best you can. You can do this by fitting a function to the distribution of all race results, applying its inverse, and then subtracting the mean and dividing by the std deviation.)
Select a hyperparameter N (you can tune this as usual with a cross validation set).
For each participant and each race create an N-dimensional feature vector, initially random. So if there are R races and P participants then there are R+P feature vectors with a total of N(R+P) parameters.
The prediction for a given participant and a given race is a function of the two corresponding feature vectors (as a first try use the scalar product of these two vectors).
Alternate between incrementally improving the participant feature vectors and the race feature vectors.
To improve a feature vector use gradient descent (or some more complex optimization method) on the known data elements (the participant/race pairs for which you have a result).
That is your loss function is:
total_error = 0
forall i,j
if (Participant i participated in Race j)
actual = ActualRaceResult(i,j)
predicted = ScalarProduct(ParticipantFeatures_i, RaceFeatures_j)
total_error += (actual - predicted)^2
So calculate the partial derivative of this function wrt the feature vectors and adjust them incrementally as per a usual ML algorithm.
(You should also include a regularization term on the loss function, for example square of the lengths of the feature vectors)
Let me know if this architecture is clear to you or you need further elaboration.
I think this is a classical task of missing data recovery. There exist some different methods. One of them which I can suggest is based on Self Organizing Feature Map (Kohonen's Map).
Below it's assumed that every athlet record is a pattern, and every competition data is a feature.
Basically, you should divide your data into 2 sets: first - with fully defined patterns, and second - patterns with partially lost features. I assume this is eligible because sparsity is 8%, that is you have enough data (92%) to train net on undamaged records.
Then you feed first set to the SOM and train it on this data. During this process all features are used. I'll not copy algorithm here, because it can be found in many public sources, and even some implementations are available.
After the net is trained, you can feed patterns from the second set to the net. For each pattern the net should calculate best matching unit (BMU), based only on those features that exist in the current pattern. Then you can take from the BMU its weigths, corresponding to missing features.
As alternative, you could not divide the whole data into 2 sets, but train the net on all patterns including the ones with missing features. But for such patterns learning process should be altered in the similar way, that is BMU should be calculated only on existing features in every pattern.
I think you can have a look at the recent low rank matrix completion methods.
The assumption is that your matrix has a low rank compared to the matrix dimension.
min rank(M)
s.t. ||P(M-M')||_F=0
M is the final result, and M' is the uncompleted matrix you currently have.
This algorithm minimizes the rank of your matrix M. P in the constraint is an operator that takes the known terms of your matrix M', and constraint those terms in M to be the same as in M'.
The optimization of this problem has a relaxed version, which is:
min ||M||_* + \lambda*||P(M-M')||_F
rank(M) is relaxed to its convex hull ||M||_* Then you trade off the two terms by controlling the parameter lambda.

Computing similarity between two lists

EDIT:
as everyone is getting confused, I want to simplify my question. I have two ordered lists. Now, I just want to compute how similar one list is to the other.
Eg,
1,7,4,5,8,9
1,7,5,4,9,6
What is a good measure of similarity between these two lists so that order is important. For example, we should penalize similarity as 4,5 is swapped in the two lists?
I have 2 systems. One state of the art system and one system that I implemented. Given a query, both systems return a ranked list of documents. Now, I want to compare the similarity between my system and the "state of the art system" in order to measure the correctness of my system. Please note that the order of documents is important as we are talking about a ranked system.
Does anyone know of any measures that can help me find the similarity between these two lists.
The DCG [Discounted Cumulative Gain] and nDCG [normalized DCG] are usually a good measure for ranked lists.
It gives the full gain for relevant document if it is ranked first, and the gain decreases as rank decreases.
Using DCG/nDCG to evaluate the system compared to the SOA base line:
Note: If you set all results returned by "state of the art system" as relevant, then your system is identical to the state of the art if they recieved the same rank using DCG/nDCG.
Thus, a possible evaluation could be: DCG(your_system)/DCG(state_of_the_art_system)
To further enhance it, you can give a relevance grade [relevance will not be binary] - and will be determined according to how each document was ranked in the state of the art. For example rel_i = 1/log(1+i) for each document in the state of the art system.
If the value recieved by this evaluation function is close to 1: your system is very similar to the base line.
Example:
mySystem = [1,2,5,4,6,7]
stateOfTheArt = [1,2,4,5,6,9]
First you give score to each document, according to the state of the art system [using the formula from above]:
doc1 = 1.0
doc2 = 0.6309297535714574
doc3 = 0.0
doc4 = 0.5
doc5 = 0.43067655807339306
doc6 = 0.38685280723454163
doc7 = 0
doc8 = 0
doc9 = 0.3562071871080222
Now you calculate DCG(stateOfTheArt), and use the relevance as stated above [note relevance is not binary here, and get DCG(stateOfTheArt)= 2.1100933062283396
Next, calculate it for your system using the same relecance weights and get: DCG(mySystem) = 1.9784040064803783
Thus, the evaluation is DCG(mySystem)/DCG(stateOfTheArt) = 1.9784040064803783 / 2.1100933062283396 = 0.9375907693942939
Kendalls tau is the metric you want. It measures the number of pairwise inversions in the list. Spearman's foot rule does the same, but measures distance rather than inversion. They are both designed for the task at hand, measuring the difference in two rank-ordered lists.
Is the list of documents exhaustive? That is, is every document rank ordered by system 1 also rank ordered by system 2? If so a Spearman's rho may serve your purposes. When they don't share the same documents, the big question is how to interpret that result. I don't think there is a measurement that answers that question, although there may be some that implement an implicit answer to it.
As you said, you want to compute how similar one list is to the other. I think simplistically, you can start by counting the number of Inversions. There's a O(NlogN) divide and conquer approach to this. It is a very simple approach to measure the "similarity" between two lists. e.g. you want to compare how 'similar' the music tastes are for two persons on a music website, you take their rankings of a set of songs and count the no. of inversions in it. Lesser the count, more 'similar' their taste is.
since you are already considering the "state of the art system" to be a benchmark of correctness, counting Inversions should give you a basic measure of 'similarity' of your ranking.
Of course this is just a starters approach, but you can build on it as how strict you want to be with the "inversion gap" etc.
D1 D2 D3 D4 D5 D6
-----------------
R1: 1, 7, 4, 5, 8, 9 [Rankings from 'state of the art' system]
R2: 1, 7, 5, 4, 9, 6 [ your Rankings]
Since rankings are in order of documents you can write your own comparator function based on R1 (ranking of the "state of the art system" and hence count the inversions comparing to that comparator.
You can "penalize" 'similarity' for each inversions found: i < j but R2[i] >' R2[j]
( >' here you use your own comparator)
Links you may find useful:
Link1
Link2
Link3
I actually know four different measures for that purpose.
Three have already been mentioned:
NDCG
Kendall's Tau
Spearman's Rho
But if you have more than two ranks that have to be compared, use Kendall's W.
In addition to what has already been said, I would like to point you to the following excellent paper: W. Webber et al, A Similarity Measure for Indefinite Rankings (2010). Besides containing a good review of existing measures (such as above-mentioned Kendall Tau and Spearman's footrule), the authors propose an intuitively appealing probabilistic measure that is applicable for varying length of result lists and when not all items occur in both lists. Roughly speaking, it is parameterized by a "persistence" probability p that a user scans item k+1 after having inspected item k (rather than abandoning). Rank-Biased Overlap (RBO) is the expected overlap ratio of results at the point the user stops reading.
The implementation of RBO is slightly more involved; you can take a peek at an implementation in Apache Pig here.
Another simple measure is cosine similarity, the cosine between two vectors with dimensions corresponding to items, and inverse ranks as weights. However, it doesn't handle items gracefully that only occur in one of the lists (see the implementation in the link above).
For each item i in list 1, let h_1(i) = 1/rank_1(i). For each item i in list 2 not occurring in list 1, let h_1(i) = 0. Do the same for h_2 with respect to list 2.
Compute v12 = sum_i h_1(i) * h_2(i); v11 = sum_i h_1(i) * h_1(i); v22 = sum_i h_2(i) * h_2(i)
Return v12 / sqrt(v11 * v22)
For your example, this gives a value of 0.7252747.
Please let me give you some practical advice beyond your immediate question. Unless your 'production system' baseline is perfect (or we are dealing with a gold set), it is almost always better to compare a quality measure (such as above-mentioned nDCG) rather than similarity; a new ranking will be sometimes better, sometimes worse than the baseline, and you want to know if the former case happens more often than the latter. Secondly, similarity measures are not trivial to interpret on an absolute scale. For example, if you get a similarity score of say 0.72, does this mean it is really similar or significantly different? Similarity measures are more helpful in saying that e.g. a new ranking method 1 is closer to production than another new ranking method 2.
I suppose you are talking about comparing two Information Retrieval System which trust me is not something trivial. It is a complex Computer Science problem.
For measuring relevance or doing kind of A/B testing you need to have couple of things:
A competitor to measure relevance. As you have two systems than this prerequisite is met.
You need to manually rate the results. You can ask your colleagues to rate query/url pairs for popular queries and then for the holes(i.e. query/url pair not rated you can have some dynamic ranking function by using "Learning to Rank" Algorithm http://en.wikipedia.org/wiki/Learning_to_rank. Dont be surprised by that but thats true (please read below of an example of Google/Bing).
Google and Bing are competitors in the horizontal search market. These search engines employ manual judges around the world and invest millions on them, to rate their results for queries. So for each query/url pairs generally top 3 or top 5 results are rated. Based on these ratings they may use a metric like NDCG (Normalized Discounted Cumulative Gain) , which is one of finest metric and the one of most popular one.
According to wikipedia:
Discounted cumulative gain (DCG) is a measure of effectiveness of a Web search engine algorithm or related applications, often used in information retrieval. Using a graded relevance scale of documents in a search engine result set, DCG measures the usefulness, or gain, of a document based on its position in the result list. The gain is accumulated from the top of the result list to the bottom with the gain of each result discounted at lower ranks.
Wikipedia explains NDCG in a great manner. It is a short article, please go through that.

Resources