Overall rank from multiple ranked lists - algorithm

I've looked through a lot of literature available online, including this forum, without any luck, and I'm hoping someone can help with a statistical issue I currently face:
I have 5 lists of ranked data, each containing 10 items ranked from position 1 (best) to position 10 (worst). For the sake of context, the 10 items in each list are the same, but in different ranked orders, as the technique used to decide their rank differs.
Example data:

          List 1    List 2    List 3   ... etc
Item 1    Ranked 1  Ranked 2  Ranked 1
Item 2    Ranked 3  Ranked 1  Ranked 2
Item 3    Ranked 2  Ranked 3  Ranked 3
... etc
I am looking for a way to interpret and analyse the above data so that I get a final result showing the overall rank of each item based on each test and its position, e.g.
Result
Rank 1 = Item 1
Rank 2 = Item 3
Rank 3 = Item 4
... etc
Does anyone know how I can interpret this data using a statistically sound method (at a postgraduate / PhD-applicable level) so that I can understand the overall ranks signalling the importance of each item in the list across the 5 tests? Or, if there is another type of technique or statistical test I could look into, I would appreciate any hints or guidance.
(It may also be worth noting that I have already tried simpler mathematical techniques such as sums, averaging, and minimum-maximum tests, but I do not feel these are statistically rigorous enough at this level.)
Any help or advice would be greatly appreciated, thank you for your time.

You can use machine learning to get your ranked list. In the Information Retrieval research field this is called Learning to Rank, and there is a wide range of literature about it. This tutorial (heads up: it is a high-level tutorial) can help you understand the basic concepts and point you to articles for deeper reading.
You might also want to have a look at interleaved ranking. This was originally engineered for the evaluation of two lists, but it might also be good for your case.

A number of non-parametric statistical tests work by turning the data received into ranks and then analysing the ranks (this can make life easier if the data are very far from being normally distributed). If your ranks are plausibly derived from some underlying score or goodness that you can't observe directly, you could apply any of these tests - there is a short list at http://en.wikipedia.org/wiki/Ranking#Ranking_in_statistics, and any book on non-parametric statistics, such as Conover, should cover them.
If you can come up with a statistic you are interested in, such as the total rank of any one item, you could use a permutation test - http://en.wikipedia.org/wiki/Resampling_%28statistics%29#Permutation_tests - to work out the probability that the statistic concerned is at least as extreme as observed under the null hypothesis that all of the rankings are simply random: you just generate loads of data that follows the null hypothesis and look at the distribution of the statistic in the randomly generated data. You can then use this to get a P-value or, better, a confidence bound.
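
As a minimal sketch of that idea in Python (the 5 rankings here are random placeholders - you would substitute your observed rankings; the statistic is the minimum total rank):

import numpy as np

rng = np.random.default_rng(0)

# Placeholder data: 5 rankings (rows) of 10 items (columns); entry [j, i]
# is the rank technique j gave to item i. Substitute your observed ranks.
ranks = np.array([rng.permutation(10) + 1 for _ in range(5)])

observed = ranks.sum(axis=0)            # total rank per item (lower = better)
best_item = int(observed.argmin())

# Null hypothesis: each technique ranks the 10 items uniformly at random.
n_sims = 10_000
null_min = np.array([
    min(sum(rng.permutation(10) + 1 for _ in range(5)))
    for _ in range(n_sims)
])
# P(the best total rank under the null is at least as extreme as observed)
p_value = (null_min <= observed[best_item]).mean()
print(f"item {best_item}: total rank {observed[best_item]}, p = {p_value:.3f}")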

Related

How to give players a score on a ranking/prediction task?

I have a website built with PHP/MySQL, and I am looking for help in communicating to a programmer what I want him to do with a poll/prediction game that I am trying to create.
For purposes of discussion, assume a game where perhaps 100 players try to predict the top 5 finishers in a Golf Tournament of perhaps 9 Golfers.
I am looking for help in how to create and assign a score based upon the accuracy of prediction.
The players provide a rank ordering using a drag-and-drop function to order the golfers from 1 through 5. This ordering has already been coded, and the ranks are stored somehow in the DB (I do not know how).
My initial thinking is to ask the coder to create a script which will assign a score from 1 to 5 for each Golfer that the player nominated to be in the Top 5.
So, a player who predicted perfectly would be awarded a perfect score of 12345.
His first golfer received a 1 for finishing first, his second a 2 for finishing second, his third a 3 for finishing third, and so on.
Anybody less than perfect would have a score higher than 12345.
Players who got the first four positions correct would have to be differentiated on the basis of the finish of their fifth Golfer.
So, one might score 12347 and the other 12348 and the player with the highest score (12348) would be the loser in a matchup of the two players.
A player who did poorly might have a score of 53419.
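A sketch of the scheme just described, with made-up golfers and finishes (this assumes single-digit finishing positions, which holds for a 9-golfer field):

# hypothetical actual finishing positions of the 9 golfers
actual_finish = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5,
                 "F": 6, "G": 7, "H": 8, "I": 9}

prediction = ["A", "B", "C", "D", "F"]   # one player's top-5 picks, in order

# concatenate each pick's actual finish into one number; lower is better
score = int("".join(str(actual_finish[g]) for g in prediction))
print(score)   # 12346; a perfect prediction (A,B,C,D,E) scores 12345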
Question:
Is this a viable way of creating a score by which the players of my game can be ranked?
Is it possible to instead simply have something like a Spearman rank-order correlation calculated comparing the actual finish positions with the predicted finish positions for each player, and then rank players on the basis of the correlation coefficients for their rankings?
Thanks for any help in clarifying how to conceptualize this before approaching a programmer who gets annoyed when I don't really know what I want him to do ahead of time.
It's a quite interesting problem.
It seems that there are three components that need to be considered in the scoring: the number of correct predictions, the order of correct predictions, and the weight of correct predictions.
For example, assume the truth is:
1,5,10,15,20
Here are some predictions:
1,6,7,8,9 : only predicted the first one
2,1,10,21,30 : predicted 1 and 10, but the position of 1 is incorrect
20,15,1,5,30 : hit four of the top 5, but the positions are incorrect
It depends on what you value most. You might first check how many of the top 5 the user has predicted and add a value for each, and then penalize wrong orders. The weight for each position should also be different, so that
1,5,10,15,20 will rank higher than 1,5,10,20,15 and higher than 1,10,5,20,15
Spearman might work, but I feel it could be too coarse for your purpose.
This is actually a very similar problem to the one search engines have. E.g., in search engine evaluation, the actual outcomes are preferred results provided by humans, and the predicted outcomes are the results delivered by the search engine. In both your task and for search engines, I'd guess you care a lot more about the accuracy of the winner than the accuracy of the 5th-place finisher. If that is the case, then mean average precision is probably a good measure.
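
For completeness, a quick sketch of the Spearman option with made-up positions, using scipy's spearmanr:

from scipy.stats import spearmanr

# Hypothetical data for one player: the positions he predicted for his five
# picks versus where those golfers actually finished
predicted = [1, 2, 3, 4, 5]
actual    = [1, 2, 3, 5, 4]   # his 4th and 5th picks swapped places

rho, _ = spearmanr(predicted, actual)
print(f"Spearman rho = {rho:.3f}")   # 0.900; 1.0 would be a perfect prediction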

How to calculate difficulty metric?

Note: I have completely changed the original question!
I have several texts, each consisting of several words. Words are categorized into difficulty categories from 1 to 6, 1 being the easiest and 6 the hardest (or from most common to least common). However, obviously not all words can be put into these categories, because there are countless words in the English language.
Each category has twice as many words as the category before.
Level 1: 100 words in total (100 new)
Level 2: 200 words in total (100 new)
Level 3: 400 words in total (200 new)
Level 4: 800 words in total (400 new)
Level 5: 1600 words in total (800 new)
Level 6: 3200 words in total (1600 new)
When I use the term level 6 below, I mean introduced in level 6. So it is part of the 1600 new words and can't be found in the 1600 words up to level 5.
How would I rate the difficulty of an individual text? Compare these texts:
An easy one
would only consist of very basic vocabulary:
I drive a car.
Let's say these are 4 level 1 words.
A medium one
This old man is cretinous.
This is a very basic sentence which only comes with one difficult word.
A hard one
would have some advanced vocabulary in there too:
I steer a gas guzzler.
So how much more difficult is the second or third text than the first one? Let's compare text 1 and text 3. 'I' and 'a' are still level 1 words, 'gas' might be level 2, 'steer' is level 4, and 'guzzler' is not even in the list; 'cretinous' would be level 6.
How do I calculate the difficulty of these texts, now that I've classified the vocabulary?
I hope it is more clear what I want to do now.
The problem you are trying to solve is how to quantify your qualitative data.
The search term "quantifying qualitative data" may help you.
There is no general all-purpose algorithm for this. The best way to do it will depend upon what you want to use the metric for, and what your ratings of each individual task mean for the project as a whole in terms of practical impact on the factors you are interested in.
For example, if the hardest tasks are typically unsolvable, then as soon as a project involves a single type-6 task the whole project may become unsolvable, and your metric would need to reflect this.
You also need to find some way to address the missing data (unrated tasks). It's likely that a single numeric metric is not going to capture all the information you want about these projects.
Once you have understood what the metric will be used for, and how the task ratings relate to each other (linear increasing difficulty vs. categorical distinctions) then there are plenty of simple metrics that may codify this analysis.
For example, you may rate projects for risk based on a combination of the number of unknown tasks and the number of tasks with difficulty above a certain threshold. Alternatively you may rate projects for duration based on a weighted sum of task difficulty, using a default or estimated difficulty for unknown tasks.
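
Mapping this back to the texts in the question, a minimal weighted-sum sketch might look like the following; the word-to-level dictionary, the default level for unlisted words, and the doubling weights are all assumptions, not prescriptions:

word_level = {"i": 1, "drive": 1, "a": 1, "car": 1,
              "gas": 2, "steer": 4, "guzzler": 6}   # hypothetical ratings
DEFAULT_LEVEL = 6   # treat unlisted words as hardest - a judgment call

def difficulty(text):
    words = text.lower().rstrip(".").split()
    # exponential weights mirror the doubling category sizes: a level-6
    # word counts 2**5 = 32 times as much as a level-1 word
    return sum(2 ** (word_level.get(w, DEFAULT_LEVEL) - 1) for w in words) / len(words)

print(difficulty("I drive a car."))          # 1.0 (all level 1)
print(difficulty("I steer a gas guzzler."))  # 8.8 (dominated by 'guzzler')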

Rating Algorithm

I'm trying to develop a rating system for an application I'm working on. Basically, the app allows you to rate an object from 1 to 5 (represented by stars). But I of course know that just keeping a rating count and adding each new rating to the number itself is not feasible.
So the first thing that came to my mind was dividing each received rating by the total number of ratings given; e.g. if the object has received a rating of 2 from a user and the object has been rated 100 times, maybe add 2/100. However, I believe this method is not good enough, since 1) it is a naive approach, and 2) in order to get the number of times the object has been rated I have to do a lookup in the DB, which might end up having time complexity O(n).
So I was wondering: what alternative and possibly better ways are there to approach this problem?
You can keep 2 additional values in the DB - the number of times the object was rated and the total sum of all its ratings. This way, to update the object's rating you only need to:
Add the new rating to the total sum.
Divide the total sum by the total number of times it was rated.
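A minimal sketch of that bookkeeping (the class and names are made up for illustration; in practice the two values would be columns on the object's row):

class RatingAggregate:
    """Keeps only (count, total) per object, so every update is O(1)."""
    def __init__(self):
        self.count = 0   # number of times the object was rated
        self.total = 0   # sum of all ratings received

    def add(self, stars):
        self.count += 1
        self.total += stars

    def average(self):
        return self.total / self.count if self.count else 0.0

agg = RatingAggregate()
for stars in (5, 3, 4):
    agg.add(stars)
print(agg.average())   # 4.0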
There are many approaches to this, but before choosing one, check:
Whether all feedback givers are treated as equal or some have more weight than others (like a panel review, etc.)
Whether the objective is to provide only an average, or a score band or suchlike. Consider a scenario like this website, which shows a total reputation score.
And yes - if an average is to be computed, you need to have the total and count of feedback and then compute it - that's plain maths. But if you need any other method, be prepared for more compute cycles. Balance database hits against compute cycles, but that's the next stage of design. First get your requirements and approach to a solution in place.
I think you should keep separate counters for 1 star, 2 stars, ... To calculate the rating, you'd compute rating = (1*numOneStars + 2*numTwoStars + 3*numThreeStars + 4*numFourStars + 5*numFiveStars) / (numOneStars + numTwoStars + numThreeStars + numFourStars + numFiveStars)
This way you can, like Amazon, also show how many people voted 1 star and how many voted 5 stars...
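A tiny sketch of that idea with hypothetical tallies - the same per-star histogram yields both the average and the Amazon-style breakdown:

# hypothetical per-star tallies for one object
counts = {1: 4, 2: 2, 3: 10, 4: 30, 5: 54}

total_votes = sum(counts.values())
average = sum(stars * n for stars, n in counts.items()) / total_votes
print(f"average {average:.2f} from {total_votes} ratings")   # average 4.28 from 100 ratings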
Have you considered a vote up/down mechanism instead of a number of stars? It doesn't directly solve your problem, but it's worth noting that other sites such as YouTube, Facebook, StackOverflow, etc. all use +/- voting, as it is often much more effective than star-based ratings.

How to classify a set of samples via a continuous feature?

For example, I have the table below, which is simply a coarse distribution of 20 persons by age:
age    count of persons
2      1
5      5
8      2
10     3
15     1
16     2
17     1
20     4
21     1
Then, by using the same dataset, I could build another 'better' table:
age    count of persons
10-    8
10s    7
20+    5
In fact, I could make more tables containing different age range combinations by using the same dataset.
Now I wonder how I could find the best combination. The possible "goodness functions" we could use to measure whether a combination is good might follow three principles:
There should not be too many or too few classes.
The ranges of the classes should not vary too much.
The distribution should be smooth enough; that is, the number of items covered by each class should not vary too much.
Since this question represents a situation general enough to describe a whole class of specific problems, sophisticated solutions to it should already exist, but I have failed to find them. Could anyone give some suggestions, please?
I have gone through some algorithms like PCA, k-means, and "max entropy based" algorithms, but they seem too general to cover this specific problem while following all three of the above principles.
I would do the following:
Construct an evaluation function:
double goodness(double firstThreshold, double bucketWidth, int numBuckets)
which returns a goodness score based on your principles. I would then brute force a number of combinations of parameters and pick the combination with the best goodness score. If we try 4-10 values for each parameter then brute force will work, and probably give you nice round numbers for the cutoffs. If you want to get more sophisticated or have it run faster then you can try other search methods like hill-climbing, beam search or simulated annealing but I think that might be overkill for your situation.
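
A minimal sketch of that brute-force search over the age data from the question (the goodness weights and parameter grids are assumptions, not tuned values):

import statistics

# the 20 ages from the table above
ages = [2, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10,
        15, 16, 16, 17, 20, 20, 20, 20, 21]

def goodness(first_threshold, bucket_width, num_buckets):
    # num_buckets equal-width classes separated by num_buckets - 1 edges;
    # equal widths satisfy principle 2 by construction
    edges = [first_threshold + i * bucket_width for i in range(num_buckets - 1)]
    counts = [0] * num_buckets
    for a in ages:
        counts[sum(a >= e for e in edges)] += 1
    # principle 1: prefer a moderate number of classes;
    # principle 3: penalize uneven counts across classes
    return -abs(num_buckets - 3) - statistics.pstdev(counts)

candidates = [(f, w, n) for f in range(5, 16)
              for w in range(5, 16)
              for n in range(2, 6)]
best = max(candidates, key=lambda p: goodness(*p))
print(best)   # the winning (firstThreshold, bucketWidth, numBuckets)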

Classifying Text Based on Groups of Keywords?

I have a list of requirements for a software project, assembled from the remains of its predecessor. Each requirement should map to one or more categories. Each of the categories consists of a group of keywords. What I'm trying to do is find an algorithm that would give me a score ranking which of the categories each requirement is likely to fall into. The results would be used as a starting point to further categorize the requirements.
As an example, suppose I have the requirement:
The system shall apply deposits to a customer's specified account.
And categories/keywords:
Customer Transactions: deposits, deposit, customer, account, accounts
Balance Accounts: account, accounts, debits, credits
Other Category: foo, bar
I would want the algorithm to score the requirement highest in category 1, lower in category 2, and not at all in category 3. The scoring mechanism is mostly irrelevant to me, but needs to convey how much more likely category 1 applies than category 2.
I'm new to NLP, so I'm kind of at a loss. I've been reading Natural Language Processing with Python and was hoping to apply some of the concepts, but haven't seen anything that quite fits. I don't think a simple frequency distribution would work, since the text I'm processing is so small (a single sentence).
You might want to look at the category of "similarity measures" or "distance measures" (which is different, in data mining lingo, from "classification").
Basically, a similarity measure is a mathematical procedure where you:
Take two sets of data (in your case, words)
Do some computation/equation/algorithm
End up with a number which tells you how "similar" the data are
With similarity measures, this number is a number between 0 and 1, where "0" means "nothing matches at all" and "1" means "identical"
So you can actually think of your sentence as a vector - and each word in your sentence represents an element of that vector. Likewise for each category's list of keywords.
And then you can do something very simple: take the "cosine similarity" or "Jaccard index" (depending on how you structure your data).
What both of these metrics do is they take both vectors (your input sentence, and your "keyword" list) and give you a number. If you do this across all of your categories, you can rank those numbers in order to see which match has the greatest similarity coefficient.
As an example:
From your question:
Customer Transactions: deposits, deposit, customer, account, accounts
So you could construct a vector with 5 elements: (1, 1, 1, 1, 1). This means that, for the "Customer Transactions" keyword list, you have 5 words, and (this will sound obvious but) each of those words is present in your search string. Stay with me.
So now you take your sentence:
The system shall apply deposits to a customer's specified account.
This has 3 words from the "Customer Transactions" set: {deposits, account, customer}
(actually, this illustrates another nuance: you actually have "customer's". Is this equivalent to "customer"?)
The vector for your sentence might be (1, 0, 1, 1, 0)
The 1's in this vector are in the same position as the 1's in the first vector - because those words are the same.
So we could ask: in how many positions do these vectors differ? Let's compare:
(1,1,1,1,1)
(1,0,1,1,0)
Hm. They have the same "bit" 3 times - in the 1st, 3rd, and 4th positions. They differ in only 2 bits. So let's say that when we compare these two vectors, we have a "distance" of 2. Congrats, we just computed the Hamming distance! The lower your Hamming distance, the more "similar" the data.
(The difference between a "similarity" measure and a "distance" measure is that the former is normalized - it gives you a value between 0 and 1. A distance is just any number, so it only gives you a relative value.)
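A small sketch of all three comparisons on exactly these two vectors, in plain Python:

import math

keyword_vec  = [1, 1, 1, 1, 1]   # the "Customer Transactions" keyword list
sentence_vec = [1, 0, 1, 1, 0]   # which keywords appear in the sentence

# Hamming distance: positions where the vectors differ
hamming = sum(a != b for a, b in zip(keyword_vec, sentence_vec))

# cosine similarity: dot product over the product of the magnitudes
dot = sum(a * b for a, b in zip(keyword_vec, sentence_vec))
cosine = dot / (math.sqrt(sum(a * a for a in keyword_vec))
                * math.sqrt(sum(b * b for b in sentence_vec)))

# Jaccard index: |intersection| / |union| of the two word sets
inter = sum(a and b for a, b in zip(keyword_vec, sentence_vec))
union = sum(a or b for a, b in zip(keyword_vec, sentence_vec))
jaccard = inter / union

print(hamming, round(cosine, 3), round(jaccard, 3))   # 2 0.775 0.6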
Anyway, this might not be the best way to do natural language processing, but for your purposes it is the simplest and might actually work pretty well for your application, or at least as a starting point.
(PS: "classification" - as you have in your title - would be answering the question "If you take my sentence, which category is it most likely to fall into?" Which is a bit different than saying "how much more similar is my sentence to category 1 than category 2?" which seems to be what you're after.)
good luck!
The main characteristics of the problem are:
Externally defined categorization criteria (keyword list)
Items to be classified (lines from the requirements document) are made of a relatively small number of attribute values, over effectively a single dimension: "keyword".
As defined, there is no feedback/calibration (although it may be appropriate to suggest some of that)
These characteristics bring both good and bad news: the implementation should be relatively straightforward, but a consistent level of accuracy of the categorization process may be hard to achieve. Also, the small size of the various quantities (number of possible categories, max/average number of words in an item, etc.) should give us room to select solutions that may be CPU- and/or space-intensive, if need be.
Yet, even with this license to "go fancy", I suggest starting with (and staying close to) a simple algorithm, and expanding on this basis with a few additions and considerations, while remaining vigilant of the ever-present danger called overfitting.
Basic algorithm (conceptual, i.e. no focus on performance tricks at this time)
Parameters =
CatKWs = an array/hash of lists of strings. The list contains the possible
keywords, for a given category.
usage: CatKWs[CustTx] = ('deposits', 'deposit', 'customer' ...)
NbCats = integer number of pre-defined categories
Variables:
CatAccu = an array/hash of numeric values with one entry per each of the
possible categories. usage: CatAccu[3] = 4 (if array) or
CatAccu['CustTx'] += 1 (hash)
TotalKwOccurences = counts the total number of keyword matches (counting
multiple times when a word is found in several pre-defined categories)
Pseudo code: (for categorizing one input item)
1. for x in 1 to NbCats
CatAccu[x] = 0 // reset the accumulators
2. for each word W in Item
for each x in 1 to NbCats
if W found in CatKWs[x]
TotalKwOccurences++
CatAccu[x]++
3. for each x in 1 to NbCats
CatAccu[x] = CatAccu[x] / TotalKwOccurences // calculate rating
4. Sort CatAccu by value
5. Return the ordered list of (CategoryID, rating)
for all corresponding CatAccu[x] values above a given threshold.
Simple but plausible: we favor the categories that have the most matches, but we divide by the overall number of matches as a way of lessening the confidence rating when many words were found. Note that this division does not affect the relative ranking of the categories for a given item, but it may be significant when comparing the ratings of different items.
Now, several simple improvements come to mind (I'd seriously consider the first two, and give thought to the other ones; deciding on each of these is very much tied to the scope of the project, the statistical profile of the data to be categorized, and other factors...):
We should normalize the keywords read from the input items and/or match them in a fashion that is tolerant of misspellings. Since we have so few words to work with, we need to ensure we do not lose a significant one because of a silly typo.
We should give more importance to words found less frequently in CatKWs. For example, the word 'Account' should count less than the word 'foo' or 'credit'.
We could (but maybe that won't be useful or even helpful) give more weight to the ratings of items that have fewer [non-noise] words.
We could also include consideration of digrams (two consecutive words), since with natural languages (and requirements documents are not quite natural :-) ) word proximity is often a stronger indicator than the words themselves.
We could add a tiny bit of importance to the category assigned to the preceding (or even following, with look-ahead logic) item. Items will likely come in related series, and we can benefit from this regularity.
Also, aside from the calculation of the rating per se, we should also consider:
some metrics that would be used to rate the algorithm outcome itself (tbd)
some logic to collect the list of words associated with an assigned category and to eventually run statistics on these. This may allow the identification of words representative of a category that were not initially listed in CatKWs.
The question of metrics should be considered early, but this would also require a reference set of input items: a "training set" of sorts, even though we are working off a pre-defined category-keywords dictionary (typically, training sets are used to determine this very list of category-keywords, along with a weight factor). Of course, such a reference/training set should be both statistically significant and statistically representative [of the whole set].
To summarize: stick to simple approaches; the context doesn't leave room to be very fancy anyway. Consider introducing a way of measuring the efficiency of particular algorithms (or of particular parameters within a given algorithm), but beware that such metrics may be flawed and prompt you to specialize the solution for a given set to the detriment of the other items (overfitting).
I was also facing the same issue of creating a classifier based only on keywords. I had a class-keywords mapper file, which contained the class variable and the list of keywords occurring in each particular class. I came up with the following algorithm, and it is working really well.
# predictor algorithm (readContent, docKywrdmppr and removeStopWords come
# from the surrounding setup; fixes: accumulators are initialized per
# document, and the undefined `counter` is replaced with a plain increment)
predictedDoc = []
for doc in readContent:
    catAccum = [0] * len(docKywrdmppr)      # one accumulator per category
    for i in range(len(docKywrdmppr)):
        keywords = {w.casefold() for w in removeStopWords(docKywrdmppr['Keywords'][i])}
        for word in removeStopWords(doc):
            if word.casefold() in keywords:
                catAccum[i] += 1            # count every keyword match
    ind = catAccum.index(max(catAccum))     # index of the best-scoring category
    predictedDoc.append(docKywrdmppr['Document Type'][ind])
