metrics for evaluating ranking algorithms - algorithm

I have created an algorithm that ranks entities. I was wondering which metrics I should use to evaluate it. Are there any algorithms of this type to which I can compare mine?

Normalized discounted cumulative gain (NDCG) is one of the standard methods for evaluating ranking algorithms. You will need to provide a score for each of the recommendations that you give. If your algorithm assigns a low (better) rank to a high-scoring entity, your NDCG score will be higher, and vice versa.
The score can depend on the aspect in the query.
You can also manually create a gold data set, with each result assigned a score. You can then use these scores to calculate NDCG.
Note that what I call score is referred to as relevance (rel_i, the relevance of the i-th result) in the formulas.
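As a minimal sketch of the calculation, assuming the common log2 discount (DCG = sum of rel_i / log2(i + 1)) and hypothetical relevance scores; the function names are illustrative and not taken from any library:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of relevance scores listed in ranked order."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical scores for the results in the order the algorithm returned them.
print(ndcg([3, 2, 1, 0]))  # best possible ordering of these scores
print(ndcg([0, 1, 2, 3]))  # worst possible ordering of these scores
```

The first call scores a perfectly ordered list, the second a fully reversed one, so the second value is strictly smaller.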

Related

Typical scoring parameter choices for cross-validation of ranking classifier rank:pairwise

I am building an XGBoost ranking classifier using Python xgboost.sklearn.XGBClassifier (XGBClassifier). In my problem, I try to classify ranking labels that take the values 0, 1, 2, 3. In the classifier setup, I used objective = "rank:pairwise". I now want to run cross-validation with sklearn.model_selection.cross_val_score (cross_val_score).
Are there any canonical choices of scoring function to assess the rank outcome classification performance?
I am thinking scoring = "neg_mean_squared_error" seems like an OK choice as it weights the distance between the two labels, i.e. accounts for the ranking character of the outcome.
I hope to get other comments/opinions/experiences on that.

KMeans evaluation metric not converging. Is this normal behavior or no?

I'm working on a problem that necessitates running KMeans separately on ~125 different datasets. Therefore, I'm looking to mathematically calculate the 'optimal' K for each respective dataset. However, the evaluation metric continues decreasing with higher K values.
For a sample dataset, there are 50K rows and 8 columns. Using sklearn's calinski-harabaz score, I'm iterating through different K values to find the optimum / minimum score. However, my code reached k=5,600 and the calinski-harabaz score was still decreasing!
Something weird seems to be happening. Does the metric not work well? Could my data be flawed (see my question about normalizing rows after PCA)? Is there another/better way to mathematically converge on the 'optimal' K? Or should I force myself to manually pick a constant K across all datasets?
Any additional perspectives would be helpful. Thanks!
I don't know anything about the calinski-harabaz score but some score metrics will be monotone increasing/decreasing with respect to increasing K. For instance the mean squared error for linear regression will always decrease each time a new feature is added to the model so other scores that add penalties for increasing number of features have been developed.
There is a very good answer here that covers CH scores well. A simple method that generally works well for these monotone scoring metrics is to plot K vs the score and choose the K where the score is no longer improving 'much'. This is very subjective but can still give good results.
SUMMARY
The metric decreases with each increase of K; this strongly suggests that you do not have a natural clustering upon the data set.
DISCUSSION
CH scores depend on the ratio between intra- and inter-cluster densities. For a relatively smooth distribution of points, each increase in K will give you clusters that are slightly more dense, with slightly lower density between them. Try a lattice of points: vary the radius and do the computations by hand; you'll see how that works. At the extreme end, K = n: each point is its own cluster, with infinite density, and 0 density between clusters.
OTHER METRICS
Perhaps the simplest metric is sum-of-squares, which is already part of the clustering computations. Sum the squares of distances from the centroid, divide by n-1 (n=cluster population), and then add/average those over all clusters.
I'm looking for a particular paper that discusses metrics for this very problem; if I can find the reference, I'll update this answer.
N.B. With any metric you choose (as with CH), a failure to find a local minimum suggests that the data really don't have a natural clustering.
WHAT TO DO NEXT?
Render your data in some form you can visualize. If you see a natural clustering, look at the characteristics; how is it that you can see it, but the algebra (metrics) cannot? Formulate a metric that highlights the differences you perceive.
I know, this is an effort similar to the problem you're trying to automate. Welcome to research. :-)
The problem with my question is that the 'best' Calinski-Harabaz score is the maximum, whereas my question assumed the 'best' was the minimum. It is computed by analyzing the ratio of between-cluster dispersion vs. within-cluster dispersion, the former/numerator you want to maximize, the latter/denominator you want to minimize. As it turned out, in this dataset, the 'best' CH score was with 2 clusters (the minimum available for comparison). I actually ran with K=1, and this produced good results as well. As Prune suggested, there appears to be no natural grouping within the dataset.
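For reference, a small pure-Python sketch of the ratio described above, assuming the usual definition of the index: (between-cluster dispersion / (k - 1)) / (within-cluster dispersion / (n - k)). The function name and the toy points are made up for illustration; in practice sklearn's own scoring function would be the natural choice.

```python
def calinski_harabasz(points, labels):
    """Ratio of between-cluster to within-cluster dispersion; higher is better."""
    n, dim = len(points), len(points[0])
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("the index needs 2 <= k < n clusters")
    overall = [sum(p[d] for p in points) / n for d in range(dim)]
    between = within = 0.0
    for members in clusters.values():
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        # Between: cluster size times squared distance of centroid to overall mean.
        between += len(members) * sum((c - o) ** 2 for c, o in zip(centroid, overall))
        # Within: squared distances of members to their own centroid.
        within += sum(sum((x - c) ** 2 for x, c in zip(p, centroid)) for p in members)
    if within == 0.0:
        return float("inf")
    return (between / (k - 1)) / (within / (n - k))

# Toy data: two obvious blobs, labelled sensibly and then badly.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(calinski_harabasz(pts, [0, 0, 0, 1, 1, 1]))
print(calinski_harabasz(pts, [0, 1, 0, 1, 0, 1]))
```

The sensible labelling yields the larger value, which is why the K to pick is the one that maximizes the score rather than minimizes it.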

Normalization of a multi-dimensional space, what algorithm is this?

I'm not a trained statistician so I apologize for the incorrect usage of some words. I'm just trying to get some good results from the Weka Nearest Neighbor algorithms. I'll use some redundancy in my explanation as a means to try to get the concept across:
Is there a way to normalize a multi-dimensional space so that the distances between any two instances are always proportional to the effect on the dependent variable?
In other words I have a statistical data set and I want to use a "nearest neighbor" algorithm to find instances that are most similar to a specified test instance. Unfortunately my initial results are useless because attributes that are close in value but only weakly correlated with the dependent variable incorrectly bias the distance calculation.
For example let's say you're trying to find the nearest-neighbor of a given car based on a database of cars: make, model, year, color, engine size, number of doors. We know intuitively that the make, model, and year have a bigger effect on price than the number of doors. So a car with identical color, door count, may not be the nearest neighbor to a car with different color/doors but same make/model/year. What algorithm(s) can be used to appropriately set the weights of each independent variable in the Nearest Neighbor distance calculation so that the distance will be statistically proportional (correlated, whatever) to the dependent variable?
Application: This can be used for a more accurate "show me products similar to this other product" on shopping websites. Back to the car example, this would have cars of same make and model bubbling up to the top, with year used as a tie-breaker, and then within cars of the same year, it might sort the ones with the same number of cylinders (4 or 6) ahead of the ones with the same number of doors (2 or 4). I'm looking for an algorithmic way to derive something similar to the weights that I know intuitively (make >> model >> year >> engine >> doors) and actually assign numerical values to them to be used in the nearest-neighbor search for similar cars.
A more specific example:
Data set:
Blue,Honda,6-cylinder
Green,Toyota,4-cylinder
Blue,BMW,4-cylinder
now find cars similar to:
Blue,Honda,4-cylinder
In this limited example, it would match the Green,Toyota,4-cylinder ahead of the Blue,Honda,6-cylinder because the two brands are statistically almost interchangeable and cylinder count is a stronger determinant of price than color. BMW would match lower because that brand tends to double the price, i.e. it places the item at a larger distance.
Final note: the prices are available during training of the algorithm, but not during calculation.
Possibly you should look at Solr/Lucene for this purpose. Solr provides a similarity search based on field value frequency, and it already has MoreLikeThis functionality for finding similar items.
Maybe nearest neighbor is not a good algorithm for this case? As you want to classify discrete values, it can become quite hard to define reasonable distances. I think a C4.5-like algorithm may better suit the application you describe. At each step the algorithm would optimize the information entropy, so you will always select the feature that gives you the most information.
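To make the entropy idea concrete, here is a rough sketch of plain information gain per attribute (C4.5 itself uses the related gain ratio). It assumes the price has been binned into discrete bands beforehand, which is my assumption and not part of the question; the rows and bands are invented purely for illustration.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy reduction obtained by splitting on one categorical attribute."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[attr_index]].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Invented rows (color, make) with a hypothetical price band as the target.
rows = [("Blue", "Honda"), ("Green", "Honda"), ("Blue", "BMW"), ("Green", "BMW")]
bands = ["low", "low", "high", "high"]
for index, name in enumerate(["color", "make"]):
    print(name, information_gain(rows, bands, index))
```

In this toy data the make separates the price bands completely while the color carries no information, so the gains could serve as a first guess at relative attribute weights.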
Found something in the IEEE website. The algorithm is called DKNDAW ("dynamic k-nearest-neighbor with distance and attribute weighted"). I couldn't locate the actual paper (probably needs a paid subscription). This looks very promising assuming that the attribute weights are computed by the algorithm itself.

Computing similarity between two lists

EDIT:
as everyone is getting confused, I want to simplify my question. I have two ordered lists. Now, I just want to compute how similar one list is to the other.
Eg,
1,7,4,5,8,9
1,7,5,4,9,6
What is a good measure of similarity between these two lists that takes order into account? For example, should we penalize similarity because 4 and 5 are swapped in the two lists?
I have 2 systems. One state of the art system and one system that I implemented. Given a query, both systems return a ranked list of documents. Now, I want to compare the similarity between my system and the "state of the art system" in order to measure the correctness of my system. Please note that the order of documents is important as we are talking about a ranked system.
Does anyone know of any measures that can help me find the similarity between these two lists?
The DCG [Discounted Cumulative Gain] and nDCG [normalized DCG] are usually a good measure for ranked lists.
It gives the full gain for relevant document if it is ranked first, and the gain decreases as rank decreases.
Using DCG/nDCG to evaluate the system compared to the state-of-the-art baseline:
Note: if you treat every result returned by the state-of-the-art system as relevant, your system scores identically to it under DCG/nDCG when it assigns those results the same ranks.
Thus, a possible evaluation could be: DCG(your_system)/DCG(state_of_the_art_system)
To further enhance it, you can give a relevance grade [relevance will not be binary], determined by how each document was ranked in the state of the art. For example rel_i = 1/log2(1+i) for the document at rank i in the state-of-the-art system.
If the value received from this evaluation function is close to 1, your system is very similar to the baseline.
Example:
mySystem = [1,2,5,4,6,7]
stateOfTheArt = [1,2,4,5,6,9]
First you give score to each document, according to the state of the art system [using the formula from above]:
doc1 = 1.0
doc2 = 0.6309297535714574
doc3 = 0.0
doc4 = 0.5
doc5 = 0.43067655807339306
doc6 = 0.38685280723454163
doc7 = 0
doc8 = 0
doc9 = 0.3562071871080222
Now you calculate DCG(stateOfTheArt), using the relevance as stated above [note that relevance is not binary here], and get DCG(stateOfTheArt) = 2.1100933062283396
Next, calculate it for your system using the same relevance weights and get: DCG(mySystem) = 1.9784040064803783
Thus, the evaluation is DCG(mySystem)/DCG(stateOfTheArt) = 1.9784040064803783 / 2.1100933062283396 = 0.9375907693942939
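The worked example can be sketched in a few lines of Python. This assumes base-2 logarithms for both the relevance grades and the rank discount, which is what the per-document scores listed above imply; the function names are my own.

```python
import math

def relevance_from_baseline(baseline):
    """Grade each baseline document by its rank: rel = 1 / log2(1 + rank)."""
    return {doc: 1.0 / math.log2(1 + rank) for rank, doc in enumerate(baseline, start=1)}

def dcg_against_baseline(ranking, relevance):
    """DCG of a ranking; documents unknown to the baseline get relevance 0."""
    return sum(relevance.get(doc, 0.0) / math.log2(1 + rank)
               for rank, doc in enumerate(ranking, start=1))

my_system = [1, 2, 5, 4, 6, 7]
state_of_the_art = [1, 2, 4, 5, 6, 9]

rel = relevance_from_baseline(state_of_the_art)
dcg_mine = dcg_against_baseline(my_system, rel)
dcg_base = dcg_against_baseline(state_of_the_art, rel)
print(dcg_mine, dcg_base, dcg_mine / dcg_base)
```

Up to floating-point rounding, this should reproduce the figures quoted above.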
Kendall's tau is the metric you want. It measures the number of pairwise inversions in the list. Spearman's footrule does the same, but measures distance rather than inversions. They are both designed for the task at hand: measuring the difference between two rank-ordered lists.
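A minimal sketch of Kendall's tau for two ranked lists without ties. Since the two lists in the question do not contain exactly the same items, this version first restricts both to the items they share; that restriction is my assumption, not part of the definition.

```python
from itertools import combinations

def kendall_tau(list_a, list_b):
    """Return (discordant_pairs, tau) over the items present in both lists."""
    shared = set(list_a) & set(list_b)
    a = [x for x in list_a if x in shared]
    b = [x for x in list_b if x in shared]
    pos_b = {item: i for i, item in enumerate(b)}
    pairs = len(a) * (len(a) - 1) // 2
    if pairs == 0:
        raise ValueError("need at least two shared items")
    # A pair is discordant when the two lists order it differently.
    discordant = sum(1 for x, y in combinations(a, 2) if pos_b[x] > pos_b[y])
    return discordant, (pairs - 2 * discordant) / pairs

print(kendall_tau([1, 7, 4, 5, 8, 9], [1, 7, 5, 4, 9, 6]))  # (1, 0.8)
```

For the lists from the question, only the pair (4, 5) is ordered differently among the five shared items, which gives one discordant pair out of ten.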
Is the list of documents exhaustive? That is, is every document rank ordered by system 1 also rank ordered by system 2? If so a Spearman's rho may serve your purposes. When they don't share the same documents, the big question is how to interpret that result. I don't think there is a measurement that answers that question, although there may be some that implement an implicit answer to it.
As you said, you want to compute how similar one list is to the other. Simplistically, you can start by counting the number of inversions. There is an O(N log N) divide-and-conquer approach to this. It is a very simple way to measure the "similarity" between two lists. For example, if you want to compare how similar the music tastes of two people on a music website are, you take their rankings of a set of songs and count the number of inversions between them. The lower the count, the more similar their tastes are.
Since you are already considering the "state of the art system" to be a benchmark of correctness, counting inversions should give you a basic measure of the similarity of your ranking.
Of course this is just a starting approach, but you can build on it depending on how strict you want to be about the "inversion gap" etc.
D1 D2 D3 D4 D5 D6
-----------------
R1: 1, 7, 4, 5, 8, 9 [Rankings from 'state of the art' system]
R2: 1, 7, 5, 4, 9, 6 [your rankings]
Since the rankings are in order of documents, you can write your own comparator function based on R1 (the ranking of the "state of the art system") and count the inversions with respect to that comparator.
You can "penalize" similarity for each inversion found: i < j but R2[i] >' R2[j]
(here >' denotes your own comparator)
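The divide-and-conquer count mentioned above can be sketched with a merge sort. The comparator is realized here by mapping each item to its position in the baseline list; dropping items the baseline does not contain is my own simplification.

```python
def count_inversions(seq):
    """Return (sorted_seq, inversions) using merge sort, O(N log N)."""
    if len(seq) <= 1:
        return list(seq), 0
    mid = len(seq) // 2
    left, inv_left = count_inversions(seq[:mid])
    right, inv_right = count_inversions(seq[mid:])
    merged, inversions = [], inv_left + inv_right
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            # right[j] precedes every remaining element of left.
            merged.append(right[j])
            j += 1
            inversions += len(left) - i
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, inversions

def inversions_against(baseline, ranking):
    """Inversions in `ranking` when items are compared by baseline position."""
    position = {item: i for i, item in enumerate(baseline)}
    return count_inversions([position[x] for x in ranking if x in position])[1]

print(inversions_against([1, 7, 4, 5, 8, 9], [1, 7, 5, 4, 9, 6]))  # 1
```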
I actually know four different measures for that purpose.
Three have already been mentioned:
NDCG
Kendall's Tau
Spearman's Rho
But if you have more than two rankings that have to be compared, use Kendall's W.
In addition to what has already been said, I would like to point you to the following excellent paper: W. Webber et al, A Similarity Measure for Indefinite Rankings (2010). Besides containing a good review of existing measures (such as above-mentioned Kendall Tau and Spearman's footrule), the authors propose an intuitively appealing probabilistic measure that is applicable for varying length of result lists and when not all items occur in both lists. Roughly speaking, it is parameterized by a "persistence" probability p that a user scans item k+1 after having inspected item k (rather than abandoning). Rank-Biased Overlap (RBO) is the expected overlap ratio of results at the point the user stops reading.
The implementation of RBO is slightly more involved; you can take a peek at an implementation in Apache Pig here.
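As a rough illustration only, the sketch below computes the truncated sum (1 - p) * sum over depth d of p^(d-1) * (overlap at depth d) / d, evaluated down to the shorter list's length. It omits the extrapolation step that the paper adds for finite lists, so it is not the full measure; the function name and the default p = 0.9 are my choices.

```python
def rbo_truncated(list_a, list_b, p=0.9):
    """Truncated rank-biased overlap, without the paper's extrapolation."""
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    depth = min(len(list_a), len(list_b))
    total = 0.0
    for d in range(1, depth + 1):
        # Agreement at depth d: fraction of the top-d items the lists share.
        overlap = len(set(list_a[:d]) & set(list_b[:d]))
        total += p ** (d - 1) * overlap / d
    return (1 - p) * total

print(rbo_truncated([1, 7, 4, 5, 8, 9], [1, 7, 5, 4, 9, 6], p=0.9))
```

Because the sum is cut off, even two identical lists of length n score 1 - p^n rather than 1, which is one reason the paper's extrapolated variant is preferable for short lists.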
Another simple measure is cosine similarity: the cosine between two vectors with dimensions corresponding to items, and inverse ranks as weights. However, it doesn't gracefully handle items that occur in only one of the lists (see the implementation in the link above).
For each item i in list 1, let h_1(i) = 1/rank_1(i). For each item i in list 2 not occurring in list 1, let h_1(i) = 0. Do the same for h_2 with respect to list 2.
Compute v12 = sum_i h_1(i) * h_2(i); v11 = sum_i h_1(i) * h_1(i); v22 = sum_i h_2(i) * h_2(i)
Return v12 / sqrt(v11 * v22)
For your example, this gives a value of 0.7252747.
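The three steps can be written down directly. This is only a sketch of the procedure as described, with 1-based ranks and a weight of 0 for items missing from a list; it assumes neither list contains duplicates, and the function name is mine.

```python
import math

def inverse_rank_cosine(list_a, list_b):
    """Cosine similarity of two rankings using 1/rank as the item weight."""
    h_a = {item: 1.0 / rank for rank, item in enumerate(list_a, start=1)}
    h_b = {item: 1.0 / rank for rank, item in enumerate(list_b, start=1)}
    # Items missing from one list contribute 0 to the dot product.
    v_ab = sum(h_a.get(item, 0.0) * h_b.get(item, 0.0) for item in set(h_a) | set(h_b))
    v_aa = sum(w * w for w in h_a.values())
    v_bb = sum(w * w for w in h_b.values())
    if v_aa == 0.0 or v_bb == 0.0:
        return 0.0
    return v_ab / math.sqrt(v_aa * v_bb)

print(inverse_rank_cosine(["a", "b", "c"], ["a", "b", "c"]))  # identical lists
print(inverse_rank_cosine(["a", "b", "c"], ["x", "y", "z"]))  # no shared items
```

Identical lists score 1 and lists with no items in common score 0; anything in between depends mostly on agreement near the top, since the weights shrink quickly with rank.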
Please let me give you some practical advice beyond your immediate question. Unless your 'production system' baseline is perfect (or we are dealing with a gold set), it is almost always better to compare a quality measure (such as above-mentioned nDCG) rather than similarity; a new ranking will be sometimes better, sometimes worse than the baseline, and you want to know if the former case happens more often than the latter. Secondly, similarity measures are not trivial to interpret on an absolute scale. For example, if you get a similarity score of say 0.72, does this mean it is really similar or significantly different? Similarity measures are more helpful in saying that e.g. a new ranking method 1 is closer to production than another new ranking method 2.
I suppose you are talking about comparing two information retrieval systems which, trust me, is not trivial. It is a complex computer science problem.
For measuring relevance or doing a kind of A/B testing you need a couple of things:
A competitor to measure relevance against. As you have two systems, this prerequisite is met.
You need to manually rate the results. You can ask your colleagues to rate query/url pairs for popular queries, and then for the holes (i.e. query/url pairs not rated) you can use a dynamic ranking function based on a "Learning to Rank" algorithm: http://en.wikipedia.org/wiki/Learning_to_rank. Don't be surprised by that; it is how it is done in practice (see the Google/Bing example below).
Google and Bing are competitors in the horizontal search market. These search engines employ manual judges around the world, and invest millions in them, to rate their results for queries. For each query, generally the top 3 or top 5 results are rated. Based on these ratings they may use a metric like NDCG (Normalized Discounted Cumulative Gain), which is one of the finest metrics and one of the most popular.
According to wikipedia:
Discounted cumulative gain (DCG) is a measure of effectiveness of a Web search engine algorithm or related applications, often used in information retrieval. Using a graded relevance scale of documents in a search engine result set, DCG measures the usefulness, or gain, of a document based on its position in the result list. The gain is accumulated from the top of the result list to the bottom with the gain of each result discounted at lower ranks.
Wikipedia explains NDCG in a great manner. It is a short article, please go through that.

What does a Bayesian Classifier score represent?

I'm using the ruby classifier gem whose classifications method returns the scores for a given string classified against the trained model.
Is the score a percentage? If so, is the maximum difference 100 points?
It's the logarithm of a probability. With a large trained set, the actual probabilities are very small numbers, so the logarithms are easier to compare. Theoretically, scores will range from infinitesimally close to zero down to negative infinity. 10**score * 100.0 will give you the actual probability, which indeed has a maximum difference of 100.
Actually, to calculate the probability from a typical naive Bayes classifier score, where b is the base of the logarithm, it is b^score/(1+b^score). This is the inverse logit (http://en.wikipedia.org/wiki/Logit). However, given the independence assumptions of the NBC, these scores tend to be too high or too low, and probabilities calculated this way will accumulate at the boundaries. It is better to calculate the scores on a holdout set and do a logistic regression of correctness (1 or 0) on score to get a better feel for the relationship between score and probability.
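The inverse logit above is a one-liner; a small sketch, where the default base e is my assumption (use whatever base the classifier's logarithm actually has):

```python
import math

def inverse_logit(score, base=math.e):
    """Map a log-odds style score to a probability: b^score / (1 + b^score)."""
    odds = base ** score
    return odds / (1.0 + odds)

print(inverse_logit(0.0))          # a score of 0 maps to even odds
print(inverse_logit(-3.0))         # negative scores map below 0.5
print(inverse_logit(1.0, base=10)) # same formula with base 10
```

This only converts the raw score; it does nothing about the overconfidence problem, for which the holdout calibration described above is still needed.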
From a Jason Rennie paper:
2.7 Naive Bayes Outputs Are Often Overconfident
Text databases frequently have 10,000 to 100,000 distinct vocabulary words; documents often contain 100 or more terms. Hence, there is great opportunity for duplication.
To get a sense of how much duplication there is, we trained a MAP Naive Bayes model with 80% of the 20 Newsgroups documents. We produced p(c|d;D) (posterior) values on the remaining 20% of the data and show statistics on max_c p(c|d;D) in table 2.3. The values are highly overconfident. 60% of the test documents are assigned a posterior of 1 when rounded to 9 decimal digits. Unlike logistic regression, Naive Bayes is not optimized to produce reasonable probability values. Logistic regression performs joint optimization of the linear coefficients, converging to the appropriate probability values with sufficient training data. Naive Bayes optimizes the coefficients one-by-one. It produces realistic outputs only when the independence assumption holds true. When the features include significant duplicate information (as is usually the case with text), the posteriors provided by Naive Bayes are highly overconfident.
