how to formulate elo table - algorithm

I want to use the Elo table.
Is there a way to calculate the probability of winning without hard-coding the whole table?
From what I see, it isn't a linear relationship.

See http://en.wikipedia.org/wiki/Elo_rating_system
Look at that page for text "for the expected score of Player A is"
Ea = 1 / (1 + 10^((RB - RA) / 400))
where Ea is the expected score for Player A
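A minimal sketch of that formula in Python (the function name is mine):

```python
def expected_score(rating_a, rating_b):
    """Expected score for player A against player B under the Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

# Equal ratings give an even 0.5; a 400-point deficit gives roughly 1/11.
print(expected_score(1500, 1500))  # 0.5
```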

Calculation of the rating difference table as used by FIDE:
Let dp be the rating difference. Then expectation p equals (Excel):
=ROUND(ROUND(NORMDIST(dp * 7 / 2000;0;1;1);3);2)
There are two exceptions to this rule: dp = 343 and dp = 344

Related

Determine the "difficulty" of quiz with multiple weights?

I'm trying to determine the "difficulty" of a quiz object.
My ultimate goal is to be able to create a "difficulty score" (DS) for any quiz. This would allow me to compare one quiz to another accurately, despite being made up of different questions/answers.
When creating my quiz object, I assign each question a "difficulty index" (DI), which is number on a scale from 1-15.
15 = most difficult
1 = least difficult
Now a straightforward way to measure this "difficulty score" would be to add up each question's "difficulty index" and divide by the maximum possible "difficulty index" for the quiz. ( ex. 16/30 = 53.3% Difficulty )
However, I also have multiple "weighting" properties associated with each question. These weights are again on a scale of 1-5.
5 = most impact
1 = least impact
The reason I have two weights instead of the more common one is so I can accommodate a scenario as follows...
If presenting the student with a very difficult question (DI=15) and the student answers "incorrect", don't have it hurt their score so much BUT if they get it "correct" have it improve their score greatly. I call these my "positive" (PW) and "negative" (NW) weights.
Quiz Example A:
Question 1: DI = 1 | PW = 3 | NW = 3
Question 2: DI = 1 | PW = 3 | NW = 3
Question 3: DI = 1 | PW = 3 | NW = 3
Question 4: DI = 15 | PW = 5 | NW = 1
Quiz Example B:
Question 1: DI = 1 | PW = 3 | NW = 3
Question 2: DI = 1 | PW = 3 | NW = 3
Question 3: DI = 1 | PW = 3 | NW = 3
Question 4: DI = 15 | PW = 1 | NW = 5
Technically the above two quizzes are very similar BUT Quiz B should be more "difficult" because the hardest question will have the greatest impact on your score if you get it wrong.
My question now becomes how can I accurately determine the "difficulty score" when considering the complex weighting system?
Any help is greatly appreciated!
The challenge of course is to determine the difficulty score for each single question.
I suggest the following model:
Hardness (H): Define a hard question as one where the chance of answering correctly is low. The hardest question is one where (1) the chance of answering correctly equals that of random choice (because it is inherently very hard), and (2) it has the largest number of possible answers. We'll define such a question as H = 15. At the other end of the scale, we'll define H = 0 for a question where the chance of answering correctly is 100% (because it is trivial; I know such a question will never appear). Now define the hardness of each question by subjective interpolation (remember that one can always guess among the given options). For example, if an H = 15 question has 4 answers, a question with similar inherent hardness but only 2 answers would be H = 7.5. Another example: if you believe that an average student has a 62.5% chance of answering a question correctly, it would also be an H = 7.5 question (because an H = 15 question has a 25% chance of a correct answer, while H = 0 has 100%; the average is 62.5%).
Effect (E): Now we'll measure the effect of PW and NW. For questions with a 50% chance of being answered correctly, the effect is E = 0.5*PW - 0.5*NW. For questions with a 25% chance, the effect is E = 0.25*PW - 0.75*NW. For a trivial question, NW doesn't matter, so the effect is E = PW.
Difficulty (DI): The last step is to integrate the hardness and the effect - and call it difficulty. I suggest DI = H - c*E, where c is some positive constant. You may want to normalize again.
Edit: Alternatively, you may try the following formula: DI = H * (1 - c*E), where the effect magnitude is not absolute, but relative to the question's hardness.
Clarification:
The teacher needs to estimate only one parameter for each question: the probability that an average student would answer it correctly. This estimate, e, will be in the range [1/k, 1], where k is the number of answers.
The hardness, H, is a linear function of e such that 1/k is mapped to 15 and 1 is mapped to 0. The function is: H = 15 * k / (k-1) * (1-e)
The effect E depends on e, PW and NW. The formula is E = e*PW - (1-e)*NW
Example based on OP comments:
Question 1:
k = 4, e = 0.25 (hardest). Therefore H = 15
PW = 1, NW = 5, e = 0.25. Therefore E = 0.25*1 - 0.75*5 = -3.5
c = 5. DI = 15 - 5*(-3.5) = 32.5
Question 2:
k = 4, e = 0.95 (very easy). Therefore H = 1
PW = 1, NW = 5, e = 0.95. Therefore E = 0.95*1 - 0.05*5 = 0.7
c = 5. DI = 1 - 5*(0.7) = -2.5
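The model above can be sketched in Python (function names and the default c are mine, matching the worked examples):

```python
def hardness(k, e):
    """H: maps e in [1/k, 1] linearly so 1/k -> 15 and 1 -> 0."""
    return 15 * k / (k - 1) * (1 - e)

def effect(e, pw, nw):
    """E: expected score impact given success probability e."""
    return e * pw - (1 - e) * nw

def difficulty(k, e, pw, nw, c=5):
    """DI = H - c*E, the first formula suggested above."""
    return hardness(k, e) - c * effect(e, pw, nw)

# Question 1 from the example: k=4, e=0.25, PW=1, NW=5 -> DI = 32.5
print(difficulty(4, 0.25, 1, 5))
```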
I'd say the core of the problem is that mathematically your example quizzes A and B are identical, except that quiz A awards the student 4 gratuitous bonus points (or, equivalently, quiz B arbitrarily takes 4 points away from them). If the same students take both of them, the score distribution will be the same, except shifted by 4 points. So while the two quizzes may feel different psychologically (because, let's face it, getting extra points feels good, and losing points feels bad, even if you technically did nothing to deserve it), finding an objective way to distinguish them seems tricky.
That said, one reasonable measure of "psychological difficulty" could simply be the average score (per question) that a randomly chosen student would be expected to get from the quiz. Of course, that's not something you can reliably calculate in advance, although you could estimate it from actual quiz results after the fact.
However, if you could somehow relate your (presumably arbitrary) difficulty ratings to the fraction of students likely to answer the question correctly, then you could use that to estimate the expected average score. So, for example, we might simply assume a linear relationship between the question difficulty and the success rate, with difficulty 1 corresponding to a 100% expected success rate, and difficulty 15 corresponding to a 0% expected success rate. Then the expected average score S per question for the quiz could be calculated as:
S = avg(PW × X − NW × (1 − X))
where the average is taken over all questions in the quiz, and where PW and NW are the point weights for a correct and an incorrect answer respectively, DI below is the difficulty rating for the question, and X = (15 − DI) / 14 is the estimated success rate.
Of course, we might want to also account for the fact that, even if a student doesn't know the answer to a question, they can still guess. Basically this means that the estimated success rate X should not range from 0 to 1, but from 1/N to 1, where N is the number of options for the question. So, taking that into account, we can adjust the formula for X to be:
X = (1 + (N − 1) × (15 − DI) / 14) / N
One problem with this estimated average score S as a difficulty measure is that it isn't bounded in either direction, and provides no obvious scale to indicate what counts as an "easy" quiz or a "hard" one. The fundamental problem here is that you haven't specified any limits for the question weights, so there's technically nothing to stop someone from making a question with, say, a positive or negative weight of one million points.
That said, if you do impose some reasonable limits on the weights (even if they're only recommendations), then you should be able to also establish reasonable thresholds on S for a quiz to be considered e.g. easy, moderate or hard. And even if you don't, you can still at least use it to rank quizzes relative to each other by difficulty.
Ps. One way to present the expected score in a UI might be to multiply it by the number of questions in the quiz, and display the result as "par" for the quiz. That way, students could roughly judge their own performance against the difficulty of the quiz by seeing whether they scored above or below par.
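The estimate above can be sketched in Python under the stated linear-plus-guessing assumption (names and the tuple layout are mine); each question is a (DI, PW, NW, N) tuple:

```python
def success_rate(di, n):
    """X: estimated success rate, with a guessing floor of 1/n."""
    return (1 + (n - 1) * (15 - di) / 14) / n

def expected_average_score(questions):
    """S: average of PW*X - NW*(1-X) over (di, pw, nw, n) tuples."""
    terms = [pw * success_rate(di, n) - nw * (1 - success_rate(di, n))
             for di, pw, nw, n in questions]
    return sum(terms) / len(terms)

# The two quizzes from the question, assuming 4 options per question:
quiz_a = [(1, 3, 3, 4)] * 3 + [(15, 5, 1, 4)]
quiz_b = [(1, 3, 3, 4)] * 3 + [(15, 1, 5, 4)]
print(expected_average_score(quiz_a))  # 2.375 -- higher expected score,
print(expected_average_score(quiz_b))  # 1.375 -- so quiz B is "harder"
```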

How to design an algorithm to find covariance for multiple variables?

Below is the (sample) covariance equation for two variables, ABC and XYZ: cov = sum over i of (return_abc[i] - avg_abc) * (return_xyz[i] - avg_xyz), divided by (size - 1).
Below is the algorithm to find the covariance between ABC and XYZ.
Algorithm 1: Calculate Covariance
1. cov = 0
2. for i = 0 to size - 1
3. cov = cov + (return_abc[i] - avg_abc) * (return_xyz[i] - avg_xyz)
4. end for
5. cov = cov / (size - 1)
6. end
How do I design an algorithm to find the covariance between one variable to the rest of the variables in a sample?
For example, how do I find the covariance between APPLE's stock and the rest of the stocks in NASDAQ (about 3,100 companies) ?
http://www.investopedia.com/articles/financial-theory/11/calculating-covariance.asp
If you're looking at only end of day prices (for simplicity), and you would like to compare a single stock to all of the rest of the stocks, try treating the rest of them all as one element, ie:
n=number of days to analyze
marketReturn = average(return of each stock that isn't APPLE)
marketAverage = (sum of marketReturn's)/n
appleAverage = (sum of APPLE returns)/n
then apply the summation across a sample size of the number of days you are analyzing.
Answer = (((day 1 APPLE return %)-(appleAverage))*((day 1 marketReturn %)-(marketAverage)) + ((day 2 APPLE return %)-(appleAverage))*((day 2 marketReturn %)-(marketAverage)) + ...)/(n-1)
I suppose if you want to check each stock individually:
for(x=0;x<numStocksNotAPPLE;++x)//increment through each stock
{
covariantList[x] = (((day 1 APPLE return %)-(appleAverage))*((day 1 stock[x] %)-(stock[x] average)) + ... + ((day n APPLE return %)-(appleAverage))*((day n stock[x] %)-(stock[x] average))) / (n-1)
}//that should give you a covariant APPLE vs each stock
and then if you want just one number, you could take the average of the covariantList:
foreach(covariantList c)
{
sum+=c
}
averageCovariant = sum/covariantList.size
Keep in mind this is all pseudo-code; I'm not claiming any of it is functional.
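Here is a runnable sketch in Python of the treat-the-rest-as-one-element approach (names and the data layout are assumptions; returns are per-day lists keyed by ticker):

```python
def sample_covariance(xs, ys):
    """Sample covariance of two equal-length return series."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def market_returns(returns_by_stock, exclude):
    """Per-day average return of every stock except `exclude`."""
    others = [r for name, r in returns_by_stock.items() if name != exclude]
    return [sum(day) / len(day) for day in zip(*others)]

# Toy data: three stocks, two days of returns.
data = {"AAPL": [1.0, 2.0], "B": [3.0, 4.0], "C": [5.0, 6.0]}
market = market_returns(data, "AAPL")          # [4.0, 5.0]
cov = sample_covariance(data["AAPL"], market)  # AAPL vs. the "market"
```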

Comparative rating algorithm

Use case:
Assume that the rating of an object is from 1 to 5 stars.
There are already 50 votes with an average rating of 4.1.
When a user casts a vote (1-5 stars), we need to recalculate the rating.
How do I implement that logic? The problem is that we don't know the value of every vote, only the current rating and the total number of votes.
newRating = (oldRating * oldCount + currentVote) / (oldCount + 1)
Well, basic arithmetic tells us that (50 * 4.1 + newvote) / 51 is the same as the average of all votes. You'll end up with roundoff errors if you do it repeatedly, and after a certain number of votes it's not even worth bothering with averaging in a single vote, but the basic formula is sound.
Assuming:
f = (1+2+3)/3 = 6/3 = 2
g = (2+2+2)/3 = 6/3 = 2
We know that f = g. If we know the current rating is 2, and the total number of votes is 3, then 2*3 = 6 is the sum of all votes. Let's add a new vote of 4 to our average like so:
f = 6/3
f' = (6+4)/4 = 2.5
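In code, under the assumption that only the current average and count are stored:

```python
def update_rating(old_rating, old_count, vote):
    """Recover the vote sum from the average, add the new vote, re-average."""
    return (old_rating * old_count + vote) / (old_count + 1)

# The worked example above: average 2 over 3 votes, then a new vote of 4.
print(update_rating(2, 3, 4))  # 2.5
```

As the answer notes, repeated floating-point updates accumulate roundoff, so storing the vote sum and count as integers is the more robust design.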

Need help on like/dislike voting system

I'd like to get some help to build a like/dislike sorting algorithm to find the best entries. I thought about a way to do it, but there are two major flaws with this method and I'd like to know if there's any better way.
Here's how I thought about doing it:
The entries would be sorted by the ratio given by l/d where l = number of likes and d = number of dislikes, so that those with a higher ratio have a bigger likes count and deserve a higher up place than those with a low ratio.
There are two issues with this method:
1: if the number of dislikes is 0, l/d is undefined (division by zero). So even if an entry has a thousand likes and 0 dislikes, it still won't get a place on the scoreboard.
2: entries with a low number of likes and dislikes have an advantage over those with many ratings, since only a few ratings are needed to swing the ratio and give the entry a good score.
What do you think?
EDIT: Here's a possible alternative that fixes the 1st issue: (l + 1) / (d + 1). Any feedback on this one?
This might be relevant: How Not To Sort By Average Rating.
To remove the division by zero, you might add 1 to the numerator and denominator to obtain (l+1)/(d+1). If you want to more highly rank entries with more likes, then you might multiply your ranking formula by log(number of likes + 1). Here the one is added to remove the mathematical error that results if the entry has zero likes. For the discussion that follows, assume that the log has a base of 10. So a ranking formula that meets the requirements would be (likes + 1)/(dislikes + 1) * log(likes + 1).
Observe that this formula provides a rank of 0 if there are no likes because log(1) = 0. Suppose that the votes are tied with one like vote and one dislike vote. Then the rank is 2/2*log(2) = 0.3 because log(2) = 0.3. Now consider another tie with 9 likes and 9 dislikes. Then the rank is 10/10*log(10) = 1, because log(10) = 1. That is, the log(likes) term ranks ties with more likes more highly than ties with fewer likes.
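That ranking formula, sketched in Python (using base-10 log as in the discussion; the function name is mine):

```python
import math

def rank(likes, dislikes):
    """(l+1)/(d+1), scaled by log10(l+1) so ties with more likes rank higher."""
    return (likes + 1) / (dislikes + 1) * math.log10(likes + 1)

print(rank(0, 0))  # 0.0 -- no likes means rank 0
print(rank(9, 9))  # 1.0 -- a 9-9 tie outranks a 1-1 tie (~0.3)
```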
This has worked the best for me.
rank = likes * 100 / (likes + dislikes)
It orders by higher likes, then any like and/or dislike activity, then no activity.
examples:
likes, dislikes => rank
0, 0 => 0 //avoid /0 error
3, 3 => 50
3, 0 => 100
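As a sketch in Python, with the 0,0 case handled explicitly:

```python
def rank(likes, dislikes):
    """Percentage of likes among all votes; 0 when there is no activity."""
    total = likes + dislikes
    return likes * 100 / total if total else 0  # avoid division by zero

print(rank(3, 3))  # 50.0
print(rank(3, 0))  # 100.0
```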

What is a better way to sort by a 5 star rating?

I'm trying to sort a bunch of products by customer ratings using a 5-star system. The site I'm setting this up for does not have a lot of ratings and continues to add new products, so it will usually have a few products with a low number of ratings.
I tried using average star rating but that algorithm fails when there is a small number of ratings.
Example: a product that has 3x 5-star ratings would show up better than a product that has 100x 5-star ratings and 2x 2-star ratings.
Shouldn't the second product show up higher because it is statistically more trustworthy because of the larger number of ratings?
Prior to 2015, the Internet Movie Database (IMDb) publicly listed the formula used to rank their Top 250 movies list. To quote:
The formula for calculating the Top Rated 250 Titles gives a true Bayesian estimate:
weighted rating (WR) = (v ÷ (v+m)) × R + (m ÷ (v+m)) × C
where:
R = average for the movie (mean)
v = number of votes for the movie
m = minimum votes required to be listed in the Top 250 (currently 25000)
C = the mean vote across the whole report (currently 7.0)
For the Top 250, only votes from regular voters are considered.
It's not so hard to understand. The formula is:
rating = (v / (v + m)) * R +
(m / (v + m)) * C;
Which can be mathematically simplified to:
rating = (R * v + C * m) / (v + m);
The variables are:
R – The item's own rating. R is the average of the item's votes. (For example, if an item has no votes, its R is 0. If someone gives it 5 stars, R becomes 5. If someone else gives it 1 star, R becomes 3, the average of [1, 5]. And so on.)
C – The average item's rating. Find the R of every single item in the database, including the current one, and take the average of them; that is C. (Suppose there are 4 items in the database, and their ratings are [2, 3, 5, 5]. C is 3.75, the average of those numbers.)
v – The number of votes for an item. (To give another example, if 5 people have cast votes on an item, v is 5.)
m – The tuneable parameter. The amount of "smoothing" applied to the rating is based on the number of votes (v) in relation to m. Adjust m until the results satisfy you. And don't misinterpret IMDb's description of m as "minimum votes required to be listed" – this system is perfectly capable of ranking items with fewer votes than m.
All the formula does is: add m imaginary votes, each with a value of C, before calculating the average. In the beginning, when there isn't enough data (i.e. the number of votes is dramatically less than m), this causes the blanks to be filled in with average data. However, as votes accumulates, eventually the imaginary votes will be drowned out by real ones.
In this system, votes don't cause the rating to fluctuate wildly. Instead, they merely perturb it a bit in some direction.
When there are zero votes, only imaginary votes exist, and all of them are C. Thus, each item begins with a rating of C.
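The simplified formula in Python; the m and C defaults here are illustrative, not IMDb's actual values:

```python
def weighted_rating(R, v, m=25, C=3.0):
    """Bayesian-style average: v real votes averaging R, plus m
    imaginary votes of value C. m and C are tuneable assumptions."""
    return (R * v + C * m) / (v + m)

print(weighted_rating(0, 0))      # 3.0 -- with zero votes, the rating is C
print(weighted_rating(4.0, 100))  # 3.8 -- 100 real votes mostly drown out the prior
```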
See also:
A demo. Click "Solve".
Another explanation of IMDb's system.
An explanation of a similar Bayesian star-rating system.
Evan Miller shows a Bayesian approach to ranking 5-star ratings:
score = sum_k sk*(nk+1)/(N+K) − z * sqrt( (sum_k sk^2*(nk+1)/(N+K) − (sum_k sk*(nk+1)/(N+K))^2) / (N+K+1) )
where
nk is the number of k-star ratings,
sk is the "worth" (in points) of k stars,
N is the total number of votes
K is the maximum number of stars (e.g. K=5, in a 5-star rating system)
z_alpha/2 is the 1 - alpha/2 quantile of a normal distribution. If you want 95% confidence (based on the Bayesian posterior distribution) that the actual sort criterion is at least as big as the computed sort criterion, choose z_alpha/2 = 1.65.
In Python, the sorting criterion can be calculated with
import math

def starsort(ns):
    """
    http://www.evanmiller.org/ranking-items-with-star-ratings.html
    """
    N = sum(ns)
    K = len(ns)
    s = list(range(K, 0, -1))
    s2 = [sk ** 2 for sk in s]
    z = 1.65

    def f(s, ns):
        N = sum(ns)
        K = len(ns)
        return sum(sk * (nk + 1) for sk, nk in zip(s, ns)) / (N + K)

    fsns = f(s, ns)
    return fsns - z * math.sqrt((f(s2, ns) - fsns ** 2) / (N + K + 1))
For example, if an item has 60 five-stars, 80 four-stars, 75 three-stars, 20 two-stars and 25 one-stars, then its overall star rating would be about 3.4:
x = (60, 80, 75, 20, 25)
starsort(x)
# 3.3686975120774694
and you can sort a list of 5-star ratings with
sorted([(60, 80, 75, 20, 25), (10,0,0,0,0), (5,0,0,0,0)], key=starsort, reverse=True)
# [(10, 0, 0, 0, 0), (60, 80, 75, 20, 25), (5, 0, 0, 0, 0)]
This shows the effect that more ratings can have upon the overall star value.
You'll find that this formula tends to give an overall rating which is a bit
lower than the overall rating reported by sites such as Amazon, Ebay or Wal-mart,
particularly when there are few votes (say, less than 300). This reflects the
higher uncertainty that comes with fewer votes. As the number of votes increases
(into the thousands), all of these rating formulas tend toward the
(weighted) average rating.
Since the formula only depends on the frequency distribution of 5-star ratings
for the item itself, it is easy to combine reviews from multiple sources (or,
update the overall rating in light of new votes) by simply adding the frequency
distributions together.
Unlike the IMDb formula, this formula does not depend on the average score
across all items, nor an artificial minimum number of votes cutoff value.
Moreover, this formula makes use of the full frequency distribution -- not just
the average number of stars and the number of votes. And it makes sense that it
should since an item with ten 5-stars and ten 1-stars should be treated as
having more uncertainty than (and therefore not rated as highly as) an item with
twenty 3-star ratings:
In [78]: starsort((10,0,0,0,10))
Out[78]: 2.386028063783418
In [79]: starsort((0,0,20,0,0))
Out[79]: 2.795342687927806
The IMDb formula does not take this into account.
See this page for a good analysis of star-based rating systems, and this one for a good analysis of upvote-/downvote- based systems.
For up and down voting you want to estimate the probability that, given the ratings you have, the "real" score (if you had infinite ratings) is greater than some quantity (like, say, the similar number for some other item you're sorting against).
See the second article for the answer, but the conclusion is you want to use the Wilson confidence. The article gives the equation and sample Ruby code (easily translated to another language).
Well, depending on how complex you want to make it, you could have ratings additionally be weighted based on how many ratings the person has made, and what those ratings are. If the person has only made one rating, it could be a shill rating, and might count for less. Or if the person has rated many things in category a, but few in category b, and has an average rating of 1.3 out of 5 stars, it sounds like category a may be artificially weighed down by the low average score of this user, and should be adjusted.
But enough of making it complex. Let’s make it simple.
Assuming we’re working with just two values, ReviewCount and AverageRating, for a particular item, it would make sense to me to look at ReviewCount as essentially being the “reliability” value. But we don’t just want to bring scores down for low ReviewCount items: a single one-star rating is probably as unreliable as a single 5-star rating. So what we want to do is probably average towards the middle: 3.
So, basically, I’m thinking of an equation something like X * AverageRating + Y * 3 = the-rating-we-want. In order to make this value come out right we need X+Y to equal 1. Also we need X to increase in value as ReviewCount increases...with a review count of 0, x should be 0 (giving us an equation of “3”), and with an infinite review count X should be 1 (which makes the equation = AverageRating).
So what are the X and Y equations? For the X equation, we want the dependent variable to asymptotically approach 1 as the independent variable approaches infinity. A good pair of equations is something like:
Y = 1/(factor^RatingCount)
and (utilizing the fact that X must be equal to 1-Y)
X = 1 − (1/(factor^RatingCount))
Then we can adjust "factor" to fit the range that we're looking for.
I used this simple C# program to try a few factors:
// We can adjust this factor to adjust our curve.
double factor = 1.5;
// Here's some sample data
double RatingAverage1 = 5;
double RatingCount1 = 1;
double RatingAverage2 = 4.5;
double RatingCount2 = 5;
double RatingAverage3 = 3.5;
double RatingCount3 = 50000; // 50000 is not infinite, but it's probably plenty to closely simulate it.
// Do the calculations
double modfactor = Math.Pow(factor, RatingCount1);
double modRating1 = (3 / modfactor)
+ (RatingAverage1 * (1 - 1 / modfactor));
double modfactor2 = Math.Pow(factor, RatingCount2);
double modRating2 = (3 / modfactor2)
+ (RatingAverage2 * (1 - 1 / modfactor2));
double modfactor3 = Math.Pow(factor, RatingCount3);
double modRating3 = (3 / modfactor3)
+ (RatingAverage3 * (1 - 1 / modfactor3));
Console.WriteLine(String.Format("RatingAverage: {0}, RatingCount: {1}, Adjusted Rating: {2:0.00}",
RatingAverage1, RatingCount1, modRating1));
Console.WriteLine(String.Format("RatingAverage: {0}, RatingCount: {1}, Adjusted Rating: {2:0.00}",
RatingAverage2, RatingCount2, modRating2));
Console.WriteLine(String.Format("RatingAverage: {0}, RatingCount: {1}, Adjusted Rating: {2:0.00}",
RatingAverage3, RatingCount3, modRating3));
// Hold up for the user to read the data.
Console.ReadLine();
So you don’t have to bother copying it in, it gives this output:
RatingAverage: 5, RatingCount: 1, Adjusted Rating: 3.67
RatingAverage: 4.5, RatingCount: 5, Adjusted Rating: 4.30
RatingAverage: 3.5, RatingCount: 50000, Adjusted Rating: 3.50
Something like that? You could obviously adjust the "factor" value as needed to get the kind of weighting you want.
You could sort by median instead of arithmetic mean. In this case both examples have a median of 5, so both would have the same weight in a sorting algorithm.
You could use a mode to the same effect, but median is probably a better idea.
If you want to assign additional weight to the product with 100 5-star ratings, you'll probably want to go with some kind of weighted mode, assigning more weight to ratings with the same median, but with more overall votes.
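A sketch of the median idea in Python, using the two products from the original question; the vote-count tie-break is my illustration of a crude "weighted" variant:

```python
from statistics import median

products = {
    "A": [5, 5, 5],            # 3x five-star ratings
    "B": [5] * 100 + [2, 2],   # 100x five-star, 2x two-star
}

# Median alone ties both products at 5; breaking ties by vote count
# ranks the better-attested product first.
ranked = sorted(products,
                key=lambda p: (median(products[p]), len(products[p])),
                reverse=True)
print(ranked)  # ['B', 'A']
```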
If you just need a fast and cheap solution that will mostly work without using a lot of computation here's one option (assuming a 1-5 rating scale)
SELECT Products.id, Products.title, avg(Ratings.score), etc
FROM
Products INNER JOIN Ratings ON Products.id=Ratings.product_id
GROUP BY
Products.id, Products.title
ORDER BY (SUM(Ratings.score)+25.0)/(COUNT(Ratings.id)+20.0) DESC, COUNT(Ratings.id) DESC
By adding 25 to the sum and 20 to the count, you're effectively seeding each product with 20 phantom ratings totaling 25 points (an average of 1.25) and then sorting accordingly.
This does have known issues. For example, it unfairly rewards low-scoring products with few ratings (as this graph demonstrates, products with an average score of 1 and just one rating score a 1.2 while products with an average score of 1 and 1k+ ratings score closer to 1.05). You could also argue it unfairly punishes high-quality products with few ratings.
This chart shows what happens for all 5 ratings over 1-1000 ratings:
http://www.wolframalpha.com/input/?i=Plot3D%5B%2825%2Bxy%29/%2820%2Bx%29%2C%7Bx%2C1%2C1000%7D%2C%7By%2C0%2C6%7D%5D
You can see the dip upwards at the very bottom ratings, but overall it's a fair ranking, I think. You can also look at it this way:
http://www.wolframalpha.com/input/?i=Plot3D%5B6-%28%2825%2Bxy%29/%2820%2Bx%29%29%2C%7Bx%2C1%2C1000%7D%2C%7By%2C0%2C6%7D%5D
If you drop a marble on most places in this graph, it will automatically roll towards products with both higher scores and higher ratings.
Obviously, the low number of ratings puts this problem at a statistical handicap. Nevertheless...
A key element in improving the quality of an aggregate rating is to "rate the rater", i.e. to keep tabs on the ratings each particular "rater" has supplied (relative to others). This allows weighting their votes during the aggregation process.
Another solution, more of a cop-out, is to supply end users with a count (or a range indication thereof) of the votes for the underlying item.
One option is something like Microsoft's TrueSkill system, where the score is given by mean - 3*stddev, where the constants can be tweaked.
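Applied naively to a list of raw star ratings (TrueSkill proper uses the mean and standard deviation of an inferred skill distribution, not raw votes), the shape of that formula looks like:

```python
from statistics import mean, pstdev

def conservative_score(ratings, k=3):
    """mean - k*stddev: rank by a value the 'true' score likely exceeds."""
    return mean(ratings) - k * pstdev(ratings)

# Unanimous ratings are trusted at face value; split ratings are penalized.
print(conservative_score([4, 4, 4]))  # 4.0
print(conservative_score([1, 5]))     # -3.0
```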
After looking for a while, I chose the Bayesian system.
If someone is using Ruby, here is a gem for it:
https://github.com/wbotelhos/rating
I'd highly recommend the book Programming Collective Intelligence by Toby Segaran (O'Reilly, ISBN 978-0-596-52932-1), which discusses how to extract meaningful data from crowd behaviour. The examples are in Python, but it's easy enough to convert.
