tf-idf: am I understanding it right? - algorithm

I am interested in doing some document clustering, and right now I am considering using TF-IDF for this.
If I'm not wrong, TF-IDF is mainly used for evaluating the relevance of a document given a query. If I do not have a particular query, how can I apply TF-IDF to clustering?

For document clustering, the best approach is to use the k-means algorithm. If you know how many types of documents you have, you know what k is.
To make it work on documents:
a) Choose k initial documents at random as the starting cluster centroids.
b) Assign each document to the cluster whose centroid is at minimum distance from that document.
c) After all documents are assigned, build k new centroid "documents" by taking the centroid of each cluster.
Now, the questions are:
a) How to calculate the distance between two documents: it's nothing but the cosine similarity between the documents' term vectors, where each term's weight is its TF-IDF score (calculated earlier for each document).
b) What the centroid should be: for each term, the sum of its TF-IDF weights across the cluster's documents divided by the number of documents. Doing this for all terms that occur in a cluster gives you another n-dimensional "document".
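For a concrete starting point, here is a minimal sketch of that recipe using scikit-learn (my choice of library, not something from the question; you could equally implement it by hand). TfidfVectorizer L2-normalizes the vectors by default, so Euclidean k-means on them behaves roughly like cosine-based clustering:

# Minimal sketch of the recipe above, assuming scikit-learn is available.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [
    "the cat sat on the mat",
    "dogs and cats make great pets",
    "stock prices fell sharply today",
    "the market rallied after the earnings report",
]

# Build TF-IDF vectors (L2-normalized by default, so Euclidean distance
# orders document pairs roughly like cosine similarity would).
vectors = TfidfVectorizer().fit_transform(documents)

# Steps a)-c): pick k initial centroids, assign each document to the nearest
# centroid, recompute centroids, and repeat until convergence.
k = 2  # k = the number of document types, if you know it
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vectors)
print(labels)  # cluster index for each document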
Hope that helps!

Not exactly actually: tf-idf gives you the relevance of a term in a given document.
So you can perfectly use it for your clustering by computing a proximity measure, which would be something like
proximity(document_i, document_j) = sum(tf_idf(t,i) * tf_idf(t,j))
for each term t both in doc i and doc j.
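As a quick illustration of that formula (the TF-IDF weights below are made-up numbers, and the per-document {term: weight} dictionaries are an assumed representation):

# Sketch of the proximity formula above; tf-idf weights are precomputed per
# document and stored as {term: weight} dictionaries (hypothetical values).
doc_i = {"cat": 0.4, "mat": 0.7, "sat": 0.2}
doc_j = {"cat": 0.3, "dog": 0.9, "sat": 0.5}

def proximity(tfidf_i, tfidf_j):
    # sum tf_idf(t, i) * tf_idf(t, j) over terms t present in both documents
    shared_terms = tfidf_i.keys() & tfidf_j.keys()
    return sum(tfidf_i[t] * tfidf_j[t] for t in shared_terms)

print(proximity(doc_i, doc_j))  # 0.4*0.3 + 0.2*0.5 = 0.22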

TF-IDF serves a different purpose; unless you intend to reinvent the wheel, you are better off using a tool like Carrot. Googling for document clustering can give you many algorithms if you wish to implement one on your own.

Related

Indexing strategy for finding similar strings

I am working on devising an indexing strategy for finding similar hashes. The hashes are generated for images, i.e.
String A = "00007c3fff1f3b06738f390079c627c3ffe3fb11f0007c00fff07ff03f003000" //Image 1
String B = "6000fc3efb1f1b06638f1b0071c667c7fff3e738d0007c00fff03ff03f803000" //Image 2
These two hashes are similar (based on Hamming distance and Levenshtein distance) and hence the images are similar. I have more than 190 million such hashes. I have to select a suitable indexing data structure where the worst-case complexity for finding a similar hash is not O(n). A hash data structure won't work because it searches with <, = and > (or will it?). I can compute the Hamming distance or another distance to measure similarity, but in the worst case I would end up computing it 190 million times.
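For reference, here is how I compare any two such hashes (a minimal sketch of the bit-level Hamming distance, assuming both hashes are hex strings of equal length). The per-pair cost is cheap; my problem is only the 190 million candidates:

# Bit-level Hamming distance between two hex-encoded hashes
# (assumes both strings are valid hex of the same length).
def hamming_distance(hash_a, hash_b):
    # XOR the underlying bit patterns and count the differing bits
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")

a = "00007c3fff1f3b06738f390079c627c3ffe3fb11f0007c00fff07ff03f003000"  # Image 1
b = "6000fc3efb1f1b06638f1b0071c667c7fff3e738d0007c00fff03ff03f803000"  # Image 2
print(hamming_distance(a, b))  # small relative to the 256 bits -> similar images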
This is my strategy now:
Currently I am working with a B-tree where I rank all the keys in a node based on the number of consecutive identical characters and traverse the highest-ranked key first; if a child's keys rank lower than another key's rank in the parent node, I go back and start traversing that other key in the parent node. If all the keys in the parent have the same rank, I do a normal B-tree traversal (givenKey < nodeKey --> go to the child node of nodeKey, using ASCII comparison), which is where my issue is.
This would lead to a lot of false negatives in the search: in the worst case I traverse only one part of the tree, while a potentially similar key could be found along other traversals. Otherwise I have to search the entire tree, which is again O(n), in which case I might as well not have a tree.
I feel there has to be a better way and right now I am stuck and it would be great to hear any inputs on breaking down the problem. Please share your thoughts.
P.S.: I cannot use any external database.
First, this is a very difficult problem. Don't expect neat, tidy answers.
One approximate data structure I have seen is Spatial Approximation Sample Hierarchy (SASH).
A SASH (Spatial Approximation Sample Hierarchy) is a general-purpose data structure for efficiently computing approximate answers for similarity queries. Similarity queries naturally arise in a number of important computing contexts, in particular content-based retrieval on multimedia databases, and nearest-neighbor methods for clustering and classification.
SASH uses only a distance function to build a data structure, so the distance function (and in your case, the image hash function as well) needs to be "good". The basic intuition is roughly that if A ~ B (image A is close to image B) and B ~ C, then usually A ~ C. The data structure creates links between items that are relatively close, and you prune your search by only looking for things that are closer to your query. Whether this strategy actually works depends on the nature of your data and the distance function.
It has been 10 years or so since I looked at SASH, so there are probably newer developments as well. Michael Houle's page seems to indicate he has newer research on something called Rank Cover Trees, which seem similar in purpose to SASH. This should at least get you started on research in the area; read some papers and follow the reference trail.
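SASH itself is more involved, but the prune-by-distance intuition can be illustrated with a simpler metric-tree structure. The sketch below uses a BK-tree over Hamming distance, which is not SASH (and not something I am specifically recommending over it), just a compact demonstration of skipping whole subtrees via the triangle inequality:

# Illustration only: a BK-tree over Hamming distance. Not SASH, but it shows
# the same prune-by-distance idea: subtrees whose edge distance cannot satisfy
# the triangle inequality are never visited.
def hamming(a, b):
    return bin(int(a, 16) ^ int(b, 16)).count("1")

class BKTree:
    def __init__(self, distance):
        self.distance = distance
        self.root = None  # each node is [value, {edge_distance: child_node}]

    def add(self, value):
        if self.root is None:
            self.root = [value, {}]
            return
        node = self.root
        while True:
            d = self.distance(value, node[0])
            if d not in node[1]:
                node[1][d] = [value, {}]
                return
            node = node[1][d]

    def search(self, query, radius):
        results, stack = [], [self.root] if self.root else []
        while stack:
            value, children = stack.pop()
            d = self.distance(query, value)
            if d <= radius:
                results.append((d, value))
            # Triangle inequality: only edges within [d - radius, d + radius]
            # can lead to matches, so every other subtree is pruned.
            stack.extend(child for edge, child in children.items()
                         if d - radius <= edge <= d + radius)
        return results

tree = BKTree(hamming)
for h in ["00ff00ff", "00ff01ff", "ffff0000", "00fe01ff"]:  # toy 32-bit hashes
    tree.add(h)
print(tree.search("00ff00ff", radius=2))  # matches within Hamming distance 2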

metrics for evaluating ranking algorithms

I have created an algorithm that ranks entities. I was wondering what metrics I should use to evaluate my algorithm. Are there any algorithms of this type to which I can compare mine?
Normalized discounted cumulative gain (NDCG) is one of the standard methods of evaluating ranking algorithms. You will need to provide a score to each of the recommendations that you give. If your algorithm assigns a low (better) rank to a high-scoring entity, your NDCG score will be higher, and vice versa.
The score can depend on the aspect in the query.
You can also manually create a gold data set, with each result assigned a score. You can then use these scores to calculate NDCG.
Note that what I call score is referred to as relevance (rel_i, the relevance of the i-th result) in the formulas.
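A minimal sketch of the computation, assuming you already have a relevance score for each result in the order your algorithm ranked them (the function names are mine, not from any particular library):

# Sketch of NDCG for one ranked result list; `relevances` holds the relevance
# ("score") of each returned item, in the order your algorithm ranked them.
import math

def dcg(relevances):
    # rel_i discounted by log2(i + 1), with positions starting at 1
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    ideal = dcg(sorted(relevances, reverse=True))  # best possible ordering
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(ndcg([3, 2, 3, 0, 1, 2]))  # 1.0 only if the ranking were already ideal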

How do I measure the goodness of cosine similarity scores across different vector spaces?

I am a computer scientist working on a problem that requires some statistical measures, though (not being very well versed in statistics) I am not quite sure what statistics to use.
Overview:
I have a set of questions (from StackExchange sites, of course) and with this data, I am exploring algorithms that will find similar questions to one I provide. Yes, StackExchange sites already perform this function, as do many other Q&A sites. What I am trying to do is analyze the methods and algorithms that people employ to accomplish this task to see which methods perform best. My problem is finding appropriate statistical measures to quantitatively determine "which methods perform best."
The Data:
I have a set of StackExchange questions, each of which is saved like this: {'questionID':"...", 'questionText':"..."}. For each question, I have a set of other questions either linked to it or from it. It is common practice for question answer-ers on StackExchange sites to add links to other similar posts in their answers, i.e. "Have you read this post [insert link to post here] by so-and-so? They're solving a similar problem..." I am considering these linked questions to be 'similar' to one another.
More concretely, let's say we have question A.
Question A has a collection of linked questions {B, C, D}. So  A_linked = {B, C, D}.
My intuition tells me that the transitive property does not apply here. That is, just because A is similar to B, and A is similar to C, I cannot confirm that B is similar to C. (Or can I?)
However, I can confidently say that if A is similar to B, then B is similar to A.
So, to simplify these relationships, I will create a set of similar pairs: {A, B}, {A, C}, {A, D}
These pairs will serve as a ground truth of sorts. These are questions we know are similar to one another, so their similarity confidence values equals 1. So similarityConfidence({A,B}) = 1
Something to note about this set-up is that we know only a few similar questions for each question in our dataset. What we don't know is whether some other question E is also similar to A. It might be similar, it might not be similar, we don't know. So our 'ground truth' is really only some of the truth.
The algorithm:
A simplified pseudocode version of the algorithm is this:
for q in questions:  # remember q = {'questionID':"...", 'questionText':"..."}
    similarities = {}  # will hold a mapping from questionID to similarity to q
    q_Vector = vectorize(q)  # create a vector from question text (each word is a dimension, value is unimportant)
    for o in questions:  # such that q != o
        o_Vector = vectorize(o)
        similarities[o['questionID']] = cosineSimilarity(q_Vector, o_Vector)  # values will be in the range of 1.0 = identical to 0.0 = not similar at all
    # now what???
So now I have a complete mapping of cosine similarity scores between q and every other question in my dataset. My ultimate goal is to run this code for many variations of the vectorize() function (each of which will return a slightly different vector) and determine which variation performs best in terms of cosine scores.
The Problem:
So here lies my question. Now what? How do I quantitatively measure how good these cosine scores are?
These are some ideas of measurements I've brainstormed (though I feel like they're unrefined, incomplete):
Some sort of error function similar to Root Mean Square Error (RMSE). So for each document in the ground-truth similarities list, accumulate the squared error (with error roughly defined as 1-similarities[questionID]). We would then divide that accumulation by the total number of similar pairs *2 (since we will consider a->b as well as b->a). Finally, we'd take the square root of this error.
This requires some thought, since these values may need to be normalized. Though all variations of vectorize() will produce cosine scores in the range of 0 to 1, the cosine scores from two vectorize() functions may not compare to one another. vectorize_1() might have generally high cosine scores for each question, so a score of .5 might be a very low score. Alternatively, vectorize_2() might have generally low cosine scores for each question, so a .5 might be a very high score. I need to account for this variation somehow.
Also, I proposed an error function of 1-similarities[questionID]. I chose 1 because we know that the two questions are similar, therefore our similarity confidence is 1. However, a cosine similarity score of 1 means the two questions are identical. We are not claiming that our 'linked' questions are identical, merely that they are similar. Is this an issue?
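Setting those concerns aside for a moment, here is a sketch of this error function as I described it (the data structures are assumptions layered on the pseudocode above):

# Sketch of the RMSE-style error over the ground-truth pairs.
# similarities_by_question[q][o] = cosine similarity of question q to question o
import math

def rmse_error(similarities_by_question, ground_truth_pairs):
    squared, count = 0.0, 0
    for a, b in ground_truth_pairs:
        for x, y in ((a, b), (b, a)):  # count a->b as well as b->a
            squared += (1.0 - similarities_by_question[x][y]) ** 2
            count += 1
    return math.sqrt(squared / count)

print(rmse_error({"A": {"B": 0.8, "C": 0.6}, "B": {"A": 0.8}, "C": {"A": 0.6}},
                 [("A", "B"), ("A", "C")]))  # sqrt of the mean squared error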
We can find the recall (number of similar documents returned/number of similar documents), so long as we set a threshold for which questions we return as 'similar' and which we do not.
Although, for the reasons mentioned above, this shouldn't be a predefined threshold like similarity[documentID] > 0.7, because each vectorize() function may return a different range of values.
We could find recall @ k, where we only analyze the top k posts.
This could be problematic, though, because we don't have the full ground truth. If we set k=5, and only 1 document (B) of the 3 documents we knew to be relevant ({B,C,D}) is in the top 5, we do not know whether the other 4 top documents are actually equally or more similar to A than the 3 we knew about; it may just be that no one linked them.
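For what it's worth, a small sketch of recall@k against this partial ground truth (the names mirror the pseudocode above and are otherwise illustrative):

# Sketch: recall@k for one query, given the (partial) ground truth.
# `similarities` is the {questionID: cosine score} mapping from the pseudocode.
def recall_at_k(similarities, known_similar_ids, k):
    top_k = sorted(similarities, key=similarities.get, reverse=True)[:k]
    hits = sum(1 for qid in known_similar_ids if qid in top_k)
    return hits / len(known_similar_ids)

# e.g. ground truth for question A is {B, C, D}
print(recall_at_k({"B": 0.9, "E": 0.8, "C": 0.4, "D": 0.7, "F": 0.2},
                  known_similar_ids={"B", "C", "D"}, k=3))  # 2/3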
Do you have any other ideas? How can I quantitatively measure which vectorize() function performs best?
First note that this question is highly relevant to the Information Retrieval problem of similarity and near duplicate detection.
As far as I see it, your problem can be split to two problems:
Determining ground truth: in many 'competitions' where the ground truth is unclear, the way to determine which documents are relevant is to take the documents that were returned by X% of the candidates.
Choosing the best candidate: first note that usually comparing scores of two different algorithms is irrelevant. The scales could be completely different, and it is usually pointless. In order to compare between two algorithms, you should use the ranking of each algorithm - how each algorithm ranks documents, and how far is it from the ground truth.
A naive way to do it is simply using precision and recall - and you can compare them with the f-measure. Problem is, a document that is ranked 10th is as important as a document that is ranked 1st.
A better way to do it is NDCG - this is the most common way to compare algorithms in most articles I have encountered, and it is widely used in the main IR conferences: WWW, SIGIR. NDCG gives a score to a ranking, giving high importance to documents that were ranked 'better', and reduced importance to documents that were ranked 'worse'. Another common variation is NDCG@k, where NDCG is computed only up to the k-th document for each query.
Hope this background and advice helps.

Similarities Between Trees

I am working on a problem of Clustering of Results of Keyword Search on Graph. The results are in the form of trees and I need to cluster those trees into groups based on their similarities. Every node of a tree has two keys: one is the table name in the SQL database (semantic form) and the second is the actual values of a record of that table (label).
I have used the Zhang and Shasha, Klein, Demaine and RTED algorithms to find the tree edit distance between the trees based on these two keys. All of these algorithms use the number of deletion/insertion/relabel operations needed to make the trees look the same.
I want some more metrics to check the similarity between two trees, e.g. number of nodes, average fan-out, and so on, so that I can take a weighted average of these metrics to arrive at a very good similarity measure that takes into account both the structure of the tree (semantic form) and the information contained in the tree (labels at the nodes).
Can you please suggest some approach, or some literature (good papers) that could be of help?
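To make the kind of weighted combination I'm after concrete, here is a rough sketch of what I have in mind (the tree representation, the statistics, the normalisation and the weights are all just illustrative placeholders, not a principled recipe):

# Rough sketch: combine simple structural statistics with a tree edit distance
# into one weighted similarity score. The tuple representation, weights and
# normalisation below are arbitrary illustrative choices.
def node_count(tree):
    # assumed representation: tree = (table_name, label, [children])
    return 1 + sum(node_count(c) for c in tree[2])

def avg_fanout(tree):
    def walk(t):
        internal = 1 if t[2] else 0
        children = len(t[2])
        for c in t[2]:
            i, ch = walk(c)
            internal, children = internal + i, children + ch
        return internal, children
    internal, children = walk(tree)
    return children / internal if internal else 0.0

def similarity(t1, t2, edit_distance, w_edit=0.6, w_size=0.2, w_fanout=0.2):
    n1, n2 = node_count(t1), node_count(t2)
    size_sim = min(n1, n2) / max(n1, n2)
    f1, f2 = avg_fanout(t1), avg_fanout(t2)
    fanout_sim = 1.0 if max(f1, f2) == 0 else min(f1, f2) / max(f1, f2)
    edit_sim = 1.0 - edit_distance / (n1 + n2)  # edit distance is at most n1 + n2
    return w_edit * edit_sim + w_size * size_sim + w_fanout * fanout_sim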
Even if you had the (pseudo-)distances between each pair of possible trees, this is actually not what you're after. You actually want to do unsupervised learning (clustering) in which you combine structure learning with parameter learning. The types of data structures you want to perform inference on are trees. To postulate "some metric space" for your clustering method, you introduce something that is not really necessary. To find the proper distance measure is a very difficult problem. I'll point in different directions in the following paragraphs and hope they can help you on your way.
The following is not the only way to represent this problem... You can see your problem as Bayesian inference over all possible trees with all possible values at the tree nodes. You probably would have some prior knowledge on what kind of trees are more likely than others and/or what kind of values are more likely than others. The Bayesian approach would allow you to define priors for both.
One article you might like to read is "Learning with Mixtures of Trees" by Meila and Jordan, 2000 (pdf). It explains that it is possible to use a decomposable prior: the tree structure has a different prior from the values/parameters (this of course means that there is some assumption of independence at play here).
I know you were hinting at heuristics such as the average fan-out etc., but you might find it worthwhile to check out these newer applications of Bayesian inference. Note, for example, that within nonparametric Bayesian methods it is also feasible to reason about infinite trees, as done e.g. by Hutter, 2004 (pdf)!

Computing similarity between two lists

EDIT:
as everyone is getting confused, I want to simplify my question. I have two ordered lists. Now, I just want to compute how similar one list is to the other.
Eg,
1,7,4,5,8,9
1,7,5,4,9,6
What is a good measure of similarity between these two lists such that order is important? For example, the similarity should be penalized because 4 and 5 are swapped in the two lists.
I have 2 systems. One state of the art system and one system that I implemented. Given a query, both systems return a ranked list of documents. Now, I want to compare the similarity between my system and the "state of the art system" in order to measure the correctness of my system. Please note that the order of documents is important as we are talking about a ranked system.
Does anyone know of any measures that can help me find the similarity between these two lists?
The DCG [Discounted Cumulative Gain] and nDCG [normalized DCG] are usually good measures for ranked lists.
They give the full gain for a relevant document if it is ranked first, and the gain decreases as the rank decreases.
Using DCG/nDCG to evaluate the system compared to the SOA baseline:
Note: if you set all results returned by the "state of the art system" as relevant, then your system is identical to the state of the art if they received the same ranks using DCG/nDCG.
Thus, a possible evaluation could be: DCG(your_system)/DCG(state_of_the_art_system)
To further enhance it, you can give a relevance grade [so relevance will not be binary], determined according to how each document was ranked in the state of the art system. For example, rel_i = 1/log2(1+i) for the document at rank i in the state of the art system.
If the value received by this evaluation function is close to 1, your system is very similar to the baseline.
Example:
mySystem = [1,2,5,4,6,7]
stateOfTheArt = [1,2,4,5,6,9]
First you give a score to each document, according to the state of the art system [using the formula from above]:
doc1 = 1.0
doc2 = 0.6309297535714574
doc3 = 0.0
doc4 = 0.5
doc5 = 0.43067655807339306
doc6 = 0.38685280723454163
doc7 = 0
doc8 = 0
doc9 = 0.3562071871080222
Now you calculate DCG(stateOfTheArt) using the relevance grades as stated above [note that relevance is not binary here], and get DCG(stateOfTheArt) = 2.1100933062283396.
Next, calculate it for your system using the same relevance weights and get: DCG(mySystem) = 1.9784040064803783.
Thus, the evaluation is DCG(mySystem)/DCG(stateOfTheArt) = 1.9784040064803783 / 2.1100933062283396 = 0.9375907693942939
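These numbers can be reproduced with a few lines; this is just a sketch of the calculation described above (relevance derived from the state-of-the-art ranking, DCG discounted by log2(position + 1)):

# Sketch reproducing the example: rel = 1/log2(1 + rank in the state of the
# art), and DCG discounts each document by log2(position + 1).
import math

mySystem      = [1, 2, 5, 4, 6, 7]
stateOfTheArt = [1, 2, 4, 5, 6, 9]

rel = {doc: 1.0 / math.log2(1 + rank)
       for rank, doc in enumerate(stateOfTheArt, start=1)}

def dcg(ranking):
    return sum(rel.get(doc, 0.0) / math.log2(pos + 1)
               for pos, doc in enumerate(ranking, start=1))

print(dcg(stateOfTheArt))                  # 2.1100933062283396
print(dcg(mySystem))                       # 1.9784040064803783
print(dcg(mySystem) / dcg(stateOfTheArt))  # 0.9375907693942939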
Kendall's tau is the metric you want. It measures the number of pairwise inversions in the list. Spearman's footrule does the same, but measures distance rather than inversions. They are both designed for the task at hand: measuring the difference between two rank-ordered lists.
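If SciPy is available (an assumption on my part), this is essentially a one-liner on the example lists from the question:

# Kendall's tau on the two example rankings, assuming SciPy is available.
from scipy.stats import kendalltau

r1 = [1, 7, 4, 5, 8, 9]  # rankings from the state-of-the-art system
r2 = [1, 7, 5, 4, 9, 6]  # rankings from your system

tau, p_value = kendalltau(r1, r2)
print(tau)  # 0.6: 12 concordant pairs vs 3 discordant pairs out of 15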
Is the list of documents exhaustive? That is, is every document rank-ordered by system 1 also rank-ordered by system 2? If so, Spearman's rho may serve your purposes. When they don't share the same documents, the big question is how to interpret that result. I don't think there is a measurement that answers that question, although there may be some that implement an implicit answer to it.
As you said, you want to compute how similar one list is to the other. I think, simplistically, you can start by counting the number of inversions. There's an O(N log N) divide-and-conquer approach to this. It is a very simple way to measure the "similarity" between two lists. For example, if you want to compare how 'similar' the music tastes of two people on a music website are, you take their rankings of a set of songs and count the number of inversions. The lower the count, the more 'similar' their tastes are.
Since you are already considering the "state of the art system" to be a benchmark of correctness, counting inversions should give you a basic measure of the 'similarity' of your ranking.
Of course this is just a starting approach, but you can build on it depending on how strict you want to be with the "inversion gap", etc.
D1 D2 D3 D4 D5 D6
-----------------
R1: 1, 7, 4, 5, 8, 9 [Rankings from 'state of the art' system]
R2: 1, 7, 5, 4, 9, 6 [ your Rankings]
Since the rankings are in document order, you can write your own comparator function based on R1 (the rankings of the "state of the art" system) and then count the inversions with respect to that comparator.
You can "penalize" the 'similarity' for each inversion found: i < j but R2[i] >' R2[j]
(where >' is your own comparator).
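A small sketch of that idea (brute-force O(n^2) here; the O(N log N) merge-sort variant mentioned above is the optimization):

# Count inversions between the two rankings, i.e. document pairs that the two
# systems order differently. Brute force O(n^2) for clarity.
R1 = [1, 7, 4, 5, 8, 9]  # rankings from the 'state of the art' system
R2 = [1, 7, 5, 4, 9, 6]  # your rankings (same documents D1..D6, same order)

inversions = sum(
    1
    for i in range(len(R1))
    for j in range(i + 1, len(R1))
    if (R1[i] - R1[j]) * (R2[i] - R2[j]) < 0  # the pair is ordered differently
)
print(inversions)  # 3: the pairs (D2,D6), (D3,D4) and (D5,D6) are inverted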
I actually know four different measures for that purpose.
Three have already been mentioned:
NDCG
Kendall's Tau
Spearman's Rho
But if you have more than two ranks that have to be compared, use Kendall's W.
In addition to what has already been said, I would like to point you to the following excellent paper: W. Webber et al, A Similarity Measure for Indefinite Rankings (2010). Besides containing a good review of existing measures (such as above-mentioned Kendall Tau and Spearman's footrule), the authors propose an intuitively appealing probabilistic measure that is applicable for varying length of result lists and when not all items occur in both lists. Roughly speaking, it is parameterized by a "persistence" probability p that a user scans item k+1 after having inspected item k (rather than abandoning). Rank-Biased Overlap (RBO) is the expected overlap ratio of results at the point the user stops reading.
The implementation of RBO is slightly more involved; you can take a peek at an implementation in Apache Pig here.
Another simple measure is cosine similarity, the cosine between two vectors with dimensions corresponding to items, and inverse ranks as weights. However, it doesn't handle items gracefully that only occur in one of the lists (see the implementation in the link above).
For each item i in list 1, let h_1(i) = 1/rank_1(i). For each item i in list 2 not occurring in list 1, let h_1(i) = 0. Do the same for h_2 with respect to list 2.
Compute v12 = sum_i h_1(i) * h_2(i); v11 = sum_i h_1(i) * h_1(i); v22 = sum_i h_2(i) * h_2(i)
Return v12 / sqrt(v11 * v22)
For your example, this gives a value of 0.7252747.
Please let me give you some practical advice beyond your immediate question. Unless your 'production system' baseline is perfect (or we are dealing with a gold set), it is almost always better to compare a quality measure (such as above-mentioned nDCG) rather than similarity; a new ranking will be sometimes better, sometimes worse than the baseline, and you want to know if the former case happens more often than the latter. Secondly, similarity measures are not trivial to interpret on an absolute scale. For example, if you get a similarity score of say 0.72, does this mean it is really similar or significantly different? Similarity measures are more helpful in saying that e.g. a new ranking method 1 is closer to production than another new ranking method 2.
I suppose you are talking about comparing two information retrieval systems, which, trust me, is not something trivial. It is a complex computer science problem.
For measuring relevance or doing this kind of A/B testing you need to have a couple of things:
A competitor to measure relevance against. As you have two systems, this prerequisite is met.
You need to manually rate the results. You can ask your colleagues to rate query/url pairs for popular queries, and then for the holes (i.e. query/url pairs that were not rated) you can use a dynamic ranking function built with a "Learning to Rank" algorithm (http://en.wikipedia.org/wiki/Learning_to_rank). Don't be surprised by that, but that's true (please read below for an example from Google/Bing).
Google and Bing are competitors in the horizontal search market. These search engines employ manual judges around the world and invest millions in them to rate their results for queries. So for each query, generally the top 3 or top 5 results (query/url pairs) are rated. Based on these ratings they may use a metric like NDCG (Normalized Discounted Cumulative Gain), which is one of the finest and most popular metrics.
According to wikipedia:
Discounted cumulative gain (DCG) is a measure of effectiveness of a Web search engine algorithm or related applications, often used in information retrieval. Using a graded relevance scale of documents in a search engine result set, DCG measures the usefulness, or gain, of a document based on its position in the result list. The gain is accumulated from the top of the result list to the bottom with the gain of each result discounted at lower ranks.
Wikipedia explains NDCG in a great manner. It is a short article, please go through that.
