Is there an algorithm to compute the current rating from just the new one? - algorithm

I have an old rating from the DB and a new rating from the user.
I tried searching for "rating algorithm", but those approaches save the ratings per user. In my case I don't save the previous ratings. My rating bar goes up to a maximum of 5.
Currently my solution is (oldR + newR) / 2. Does this make sense?

Not really. When you think about it, such a formula would mean that newer votes are weighted much more than old ones. Imagine a sequence of votes like this:
Vote: 1 1 1 1 1 1 1 1 1 1 1 5
Rating: 1 1 1 1 1 1 1 1 1 1 1 3
Clearly the rating should not be 3 in this case, it should still be 1 (or at worst 2), but with your formula it will be.
At the very least you should store the number of votes as well as the average rating, allowing you to calculate newR = ((oldR*votesCast)+newVote)/(votesCast+1). This also requires storing the rating with higher precision, not just as an integer. (You can round it off when you display it, but internally you should keep track of the fractions too.)
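In code, the running-average update could look something like this (just a sketch; the names are made up):
# Running-average update: store the average as a float plus the number of votes.
def add_vote(old_avg, votes_cast, new_vote):
    new_avg = (old_avg * votes_cast + new_vote) / (votes_cast + 1)
    return new_avg, votes_cast + 1

# The 1 1 1 ... 5 sequence from above stays close to 1:
avg, n = 0.0, 0
for vote in [1] * 11 + [5]:
    avg, n = add_vote(avg, n, vote)
print(round(avg, 2))  # 1.33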
A slightly better solution is to separately store how many votes have been cast for the 5 different ratings so far, allowing you to calculate different kinds of means (geometric for example).
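With a per-star counter you can then compute whichever mean you like; a quick sketch (the dictionary layout is only an example):
import math

# counts[r] = number of votes cast for star value r (1..5)
counts = {1: 11, 2: 0, 3: 0, 4: 0, 5: 1}
total = sum(counts.values())

arithmetic_mean = sum(r * c for r, c in counts.items()) / total
geometric_mean = math.exp(sum(c * math.log(r) for r, c in counts.items() if c) / total)
print(round(arithmetic_mean, 2), round(geometric_mean, 2))  # 1.33 1.14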
And obviously the most flexible (but most storage and computation intensive) is to store each individual vote with user id and timestamp, allowing you to use any algorithm you can think of.

Related

How to measure flexibility using entropy or other methods?

E.g. I want to measure how flexible customers are across different brands of goods by looking at their search behavior over the past months.
Users could have the search logs below:
User 1: Brand A
User 2: Brand A A A A
User 3: Brand A A B A
User 4: Brand A A B B
So it is clear that users 1 and 2 are less flexible than 3 and 4, since they tend to buy only from one brand.
User 2 is less flexible than user 1, because user 2's multiple searches demonstrate the inflexibility, while for user 1 we don't have the same amount of confidence.
So intuitively the degrees of flexibility are 2 < 1 < 3 < 4.
Initially I wanted to use entropy, but it can't distinguish users 1 and 2, as the entropy will be 0 for both.
Do you know any method to calculate the flexibility of each user as a single numeric value that also takes the frequency of each unique value into account? Really appreciate it.
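To make the entropy problem concrete, here is a small sketch computing plain Shannon entropy over each user's brand counts; users 1 and 2 both come out at 0, which is exactly the issue:
import math
from collections import Counter

searches = {
    1: ["A"],
    2: ["A", "A", "A", "A"],
    3: ["A", "A", "B", "A"],
    4: ["A", "A", "B", "B"],
}

def entropy(items):
    counts = Counter(items)
    n = len(items)
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return h + 0.0  # normalise -0.0 to 0.0

for user, brands in searches.items():
    print(user, round(entropy(brands), 3))
# 1 -> 0.0, 2 -> 0.0, 3 -> 0.811, 4 -> 1.0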

Rating System with Elo, better alternatives?

I'm working on a rating algorithm. I have a set of exercises. They are all categorized in levels (1 = easiest, 5 = hardest).
Users get shown two exercises and should decide which one is harder or if both are equal. Based on user ratings, the levels should get adjusted.
What I've done:
I experimented with the Elo rating.
My Questions:
Are there any better algorithms for this use case? (I've found nothing so far.)
Thanks in advance and cheers.
Toby
I would try to solve the problem in a simple yet (I hope) effective way.
First, you only update an exercise's rating when the vote differs from what the system expects. From now on, I will only consider the cases where the user's answer differs from what the system expects.
Second, I would give more weight to votes where the two levels differ a lot. A wrong expectation on two exercises with ratings 2 and 3 should have less impact than a wrong expectation on two exercises with ratings 1 and 5.
That said, my algorithm would be along the lines of:
1- A constant percentage is set, let's call it increment. It establishes the percentage of impact that a vote has, and can be modified along the way based on the number of users.
2- For an "unexpected" vote, I would calculate the difference between the original levels (minimum of 1).
diff = max(1, abs(ex1.level - ex2.level))
3- I would update each exercise's rating by a percentage based on the product of increment and diff, raising the exercise the user judged harder and lowering the other one:
if (user voted ex1 as the harder one)
    ex1.rating = ex1.rating + diff*increment;
else
    ex1.rating = ex1.rating - diff*increment;
Rating would be a float, and level would be the rounding of rating:
ex1.level = round(ex1.rating)
Example:
let's set increment = 0.1. exA, with a rating of 2.0 and level 2, is compared with exB, with a rating of 3.0 and level 3.
The first user selects exB as the harder one. Nothing changes, because it is the result expected by the system.
The second user selects exA. It is not the expected result. The difference between the two levels is 1, so each rating is modified by a factor of 1*0.1 = 0.1,
resulting in exA.rating = 2.1 and exB.rating = 2.9.
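In code the whole thing could look roughly like this (just a sketch; the Exercise class and apply_vote are made-up names, and apply_vote is only called for unexpected votes):
INCREMENT = 0.1  # impact of a single unexpected vote

class Exercise:
    def __init__(self, rating):
        self.rating = rating       # float, internal score

    @property
    def level(self):
        return round(self.rating)  # integer level shown to users

def apply_vote(harder, easier):
    """Apply an unexpected vote: `harder` is the exercise the user picked."""
    diff = max(1, abs(harder.level - easier.level))
    harder.rating += diff * INCREMENT   # voted harder, rating goes up
    easier.rating -= diff * INCREMENT   # the other one goes down

ex_a, ex_b = Exercise(2.0), Exercise(3.0)
apply_vote(harder=ex_a, easier=ex_b)    # user says exA is harder (unexpected)
print(ex_a.rating, ex_b.rating)         # 2.1 2.9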

Recommendation system and baseline predictors

I have a bunch of data where the first column represents users, the second column is movies, and the third is a rating on a ten-point scale.
0 0 9
0 1 8
1 1 4
1 2 6
2 2 7
And I have to predict the third number for another set of data (user, movie, ?):
0 2
1 0
2 0
2 1
I use this video for finding the bias values: https://youtube.com/watch?v=dGM4bNQcVKI and this one for predicting: https://www.youtube.com/watch?v=4RSigTais8o.
Bias value for user number 0: (9 + 8) / 2 = 8.5; 8.5 - 1.5 = 7.
Bias value for movie number 2: (6 + 7) / 2 = 6.5; 6.5 - 1.5 = 5.
And the baseline predictor:
1.5 + 7 + 5 = 13.5, but the contest result is 7.052009.
But the problem description says the result of my Recommendation system should be:
0 2 7.052009
1 0 6.687943
2 0 6.995272
2 1 6.687943
Where is my mistake?
The raw average is the average of ALL the present scores: (9+8+4+6+7) / 5 = 6.8. I don't see that number anywhere, so I guess that's your error.
In the video the professor used a raw average of 3.5 in all the calculations, including the bias calculations; he skipped how he got that number, but if you add up all the numbers in the video's table and divide, you get 3.5.
0 2 8.2 is the answer for the first one, using your videos as a guide. The videos claim to have avoided calculus; the different final answers of the contest probably come from using the "full" method.
0 2 ?, user 0 (row 0: 9 8 x), movie 2 (column 2: x 6 7)
raw average = 6.8
bias user 0: (9+8) / 2 - 6.8 = 1.7
bias movie 2: (6+7) / 2 - 6.8 = -0.3
prediction: 6.8+1.7-0.3 = 8.2
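The same baseline calculation as a small sketch (names are illustrative), reproducing the 8.2 above:
ratings = [(0, 0, 9), (0, 1, 8), (1, 1, 4), (1, 2, 6), (2, 2, 7)]  # (user, movie, score)

mu = sum(score for _, _, score in ratings) / len(ratings)  # raw average = 6.8

def bias(column, key):
    scores = [row[2] for row in ratings if row[column] == key]
    return sum(scores) / len(scores) - mu

def predict(user, movie):
    return mu + bias(0, user) + bias(1, movie)

print(round(predict(0, 2), 1))  # 6.8 + 1.7 - 0.3 = 8.2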
The problem looks like a variation of the Netflix contest. The contest's host knows the actual answers (the ratings) but doesn't give them to you; you are expected to guess/predict them, and the winner of the contest is whoever gets closest to the actual answers.
The winner of your contest got the closest, but he got there using an unknown method, or his own variation of a known method. If your goal is to match his answer exactly, you are better off asking him what method he used and how he modified it, and trying to replicate his results.
If this were homework and not a contest, the teacher would expect you to use the "correct" method he taught you (there's no single set method, just many methods that work with different accuracy), and you'd have to use it exactly as he taught it. But it is a contest: your goal is to find a base method that approximates the answers well (the one you used is very low on accuracy) and tinker with it a bit to get even better results.
If you want to understand the link, I suggest you do some research and later ask a statistics question, because it's just plain statistics. You can also research matrix factorization on your own. Remember that to get contest-winning results (or close to them) you won't be able to use a simple method like the one you found in the YouTube video; you'll need a method with a lot more math.

Voting algorithm: how to calculate rank?

I am trying to figure out a way to calculate rank. Right now it simply takes the ratio of wins to losses for each individual entry, so e.g. if one won 99 times out of 100, it has a 99% winning rank. BUT if an entry won 1 vote out of a total of 1, it will have a 100% winning rank, though it definitely shouldn't rank higher than the one that won 99 times. What would be a better way to do this?
Try something like this:
votes = wins + losses
score = votes * ( wins / votes )
That way, something with 50% wins, but a million votes would still be ahead of something with 100% wins but only one vote.
You can add in an extra weight based on age (in days in this example), too, something like
if age < 5:
    score = score + ((highest real score on site) * ((5 - age) / 5))
This will put brand new entries right at the top of the first page, and then they will move slowly down the list over the course of the next 5 days (I'm assuming age is a fractional number, not just an integer). After the 5 days are up, they will be put in the list based solely on the score from the previous bit of pseudo-code.
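Putting both pieces together, a rough sketch (the 5-day window follows the pseudo-code above; everything else is an assumption):
def score(wins, losses, age_days, top_score=0.0):
    votes = wins + losses
    base = votes * (wins / votes) if votes else 0.0   # volume-weighted win ratio
    if age_days < 5:                                  # temporary boost for brand-new entries
        base += top_score * ((5 - age_days) / 5)
    return base

# 50% wins over a million votes beats 100% wins over a single vote:
print(score(500_000, 500_000, age_days=30))            # 500000.0
print(score(1, 0, age_days=30))                         # 1.0
# A brand-new entry is pushed near the top for its first 5 days:
print(score(0, 0, age_days=0.5, top_score=500_000.0))   # 450000.0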
Depending on how complicated you want to make it, the Elo system chess uses (or something similar) may be what you want: http://en.wikipedia.org/wiki/Elo_rating_system
Even if a person has won 1/1 matches, his rating would be far below someone who has won/lost hundreds of matches against tough opponents, for instance.
You could always use a point system rather than a win/loss ratio. Winning would always give points, and then you could play around with either removing points for losing, not awarding points at all for losing, or awarding fewer points for losing. It all depends on exactly how you want people to be ranked. For example, you may want to give 2 points for winning and 1 point for losing if you want to favor people who participate over those who do not (which sounds kind of like what you were talking about in your example of the person playing 100 games vs 1 game). The NHL uses a similar technique for rankings (2 points for a win, 1 point for an overtime loss, 0 points for a regular loss). That might give you some more flexibility.
If I understand the question correctly, then whoever gets more votes has the higher rank.
Would it make sense to give more rank to the winning entry if the losing entry originally had a much higher rank, i.e. was a much stronger competitor?

Finding similarities in a multidimensional array

Consider a sales department that sets a sales goal for each day. The total goal isn't important, but the overage or underage is. For example, if Monday of week 1 has a goal of 50 and we sell 60, that day gets a score of +10. On Tuesday, our goal is 48 and we sell 46 for a score of -2. At the end of the week, we score the week like this:
[0,0]=10,[0,1]=-2,[0,2]=1,[0,3]=7,[0,4]=6
In this example, both Monday (0,0) and Thursday and Friday (0,3 and 0,4) are "hot"
If we look at the results from week 2, we see:
[1,0]=-4,[1,1]=2,[1,2]=-1,[1,3]=4,[1,4]=5
For week 2, the end of the week is hot, and Tuesday is warm.
Next, if we compare weeks one and two, we see that the end of the week tends to be better than the first part of the week. So, now let's add weeks 3 and 4:
[0,0]=10,[0,1]=-2,[0,2]=1,[0,3]=7,[0,4]=6
[1,0]=-4,[1,1]=2,[1,2]=-1,[1,3]=4,[1,4]=5
[2,0]=-8,[2,1]=-2,[2,2]=-1,[2,3]=2,[2,4]=3
[3,0]=2,[3,1]=3,[3,2]=4,[3,3]=7,[3,4]=9
From this, we see that the "end of the week is better" theory holds true. But we also see that the end of the month is better than the start. Of course, we would next want to compare this month with the next month, or compare a group of months for quarterly or annual results.
I'm not a math or stats guy, but I'm pretty sure there are algorithms designed for this type of problem. Since I don't have a math background (and don't remember any algebra from my earlier days), where would I look for help? Does this type of "hotspot" logic have a name? Are there formulas or algorithms that can slice and dice and compare multidimensional arrays?
Any help, pointers or advice is appreciated!
This data isn't really multidimensional; it's just a simple time series, and there are many ways to analyse it. I'd suggest you start with the Fourier transform: it detects "rhythms" in a series, so this data would show a spike at 7 days and another around thirty, and if you extended the data set to a few years it would show a one-year spike for seasons and holidays. That should keep you busy for a while, until you're ready to use real multidimensional data, say by adding in weather information, stock market data, results of recent sports events and so on.
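A rough numpy sketch of that idea, using the 20 scores from the question; peaks in the spectrum point at recurring rhythms, although with only four weeks of data this is merely illustrative:
import numpy as np

daily = np.array([10, -2, 1, 7, 6,     # week 1
                  -4, 2, -1, 4, 5,     # week 2
                  -8, -2, -1, 2, 3,    # week 3
                  2, 3, 4, 7, 9])      # week 4

spectrum = np.abs(np.fft.rfft(daily - daily.mean()))
freqs = np.fft.rfftfreq(len(daily), d=1.0)      # cycles per (working) day
for f, power in zip(freqs[1:], spectrum[1:]):   # skip the zero-frequency bin
    print(f"period {1 / f:5.1f} days: {power:6.1f}")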
The following might be relevant to you: Stochastic oscillators in technical analysis, which are used to determine whether a stock has been overbought or oversold.
I'm oversimplifying here, but essentially you have two moving calculations:
14-day stochastic: 100 * (today's closing price - low of last 14 days) / (high of last 14 days - low of last 14 days)
3-day stochastic: same calculation, but relative to 3 days.
The 14-day and 3-day stochastics will have a tendency to follow the same curve. With the factor of 100, your stochastics will fall somewhere between 0 and 100; stochastics above 80 are considered overbought or bearish, while values below 20 indicate oversold or bullish conditions. More specifically, when your 3-day stochastic "crosses" the 14-day stochastic in one of those regions, you have a predictor of the momentum of the prices.
Although some people consider technical analysis to be voodoo, empirical evidence indicates that it has some predictive power. For what it's worth, a stochastic is a very easy and efficient way to visualize the momentum of prices over time.
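A small sketch of the two oscillators; pandas is assumed here purely for the rolling min/max, and the price series is made up:
import pandas as pd

def stochastic(close: pd.Series, window: int) -> pd.Series:
    low = close.rolling(window).min()
    high = close.rolling(window).max()
    return 100 * (close - low) / (high - low)

close = pd.Series([44, 45, 47, 46, 48, 50, 49, 51, 53, 52, 54, 55, 53, 56, 57, 58])
fast = stochastic(close, 3)    # 3-day stochastic
slow = stochastic(close, 14)   # 14-day stochastic
print(pd.DataFrame({"%K_3": fast, "%K_14": slow}).tail())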
It seems to me that an OLAP approach (like pivot tables in MS Excel) fits the problem perfectly.
What you want to do is quite simple - you just have to calculate the autocorrelation of your data and look at the correlogram. From the correlogram you can see 'hidden' periods of your data and then you can use this information to analyze the periods.
Here is the result - your numbers and their normalized autocorrelation.
10 1,000
-2 0,097
1 -0,121
7 0,084
6 0,098
-4 0,154
2 -0,082
-1 -0,550
4 -0,341
5 -0,027
-8 -0,165
-2 -0,212
-1 -0,555
2 -0,426
3 -0,279
2 0,195
3 0,000
4 -0,795
7 -1,000
9
I used Excel to get the values: put the sequence in column A, add the formula =CORREL($A$1:$A$20;$A1:$A20) to cell B1, and copy it down to B19. If you then add a line diagram, you can nicely see the structure of the data.
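If you prefer code over Excel, the same per-lag correlation is a few lines of numpy (it reproduces the 1.000 at lag 0 and the -1.000 at the largest lag from the table above):
import numpy as np

scores = np.array([10, -2, 1, 7, 6, -4, 2, -1, 4, 5,
                   -8, -2, -1, 2, 3, 2, 3, 4, 7, 9], dtype=float)

# Pearson correlation between the series and its lag-shifted copy, one value per lag
for lag in range(len(scores) - 1):
    head, tail = scores[: len(scores) - lag], scores[lag:]
    print(lag, round(float(np.corrcoef(head, tail)[0, 1]), 3))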
You can already make reasonable guesses about the periods of the patterns - you're looking at things like weekly and monthly cycles. To look for weekly patterns, for example, just average all the Mondays together, and so on; the same goes for days of the month and for months of the year.
Sure, you could use a complex algorithm to find out that there's a weekly pattern, but you already know to expect that. If you think there really may be patterns buried in there that you'd never suspect (say, a strange community of people who use a 5-day week and frequent your business), by all means use a strong tool - but if you know what kinds of things to look for, there's really no need.
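The "average all the Mondays together" idea in a few lines, over the four weeks from the question:
weeks = [
    [10, -2, 1, 7, 6],
    [-4, 2, -1, 4, 5],
    [-8, -2, -1, 2, 3],
    [2, 3, 4, 7, 9],
]

for i, day in enumerate(["Mon", "Tue", "Wed", "Thu", "Fri"]):
    avg = sum(week[i] for week in weeks) / len(weeks)
    print(day, avg)   # Mon 0.0, Tue 0.25, Wed 0.75, Thu 5.0, Fri 5.75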
Daniel has the right idea when he suggests correlation, but I don't think autocorrelation is what you want. Instead I would suggest correlating each week with every other week. Peaks in your correlation, that is values close to 1, suggest that the values of the weeks resemble each other (i.e. are periodic) for that particular shift.
For example when you cross correlate
0 0 1 2 0 0
with
0 0 0 1 1 0
the result would be
2 0 0 0 1 3
the highest value is 3, which corresponds to circularly shifting the second array right by 5
0 0 0 1 1 0 --> 0 0 1 1 0 0
and then multiplying component-wise
0 0 1 2 0 0
0 0 1 1 0 0
----------------------
0 + 0 + 1 + 2 + 0 + 0 = 3
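A tiny numpy sketch of that circular cross-correlation, where entry k is the product-sum after circularly shifting the second array right by k positions:
import numpy as np

a = np.array([0, 0, 1, 2, 0, 0])
b = np.array([0, 0, 0, 1, 1, 0])

corr = [int(np.dot(a, np.roll(b, k))) for k in range(len(a))]
print(corr)                        # [2, 0, 0, 0, 1, 3]
best = int(np.argmax(corr))
print(best, np.roll(b, best))      # 5 [0 0 1 1 0 0]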
Note that you can also create your own "fake" week and cross-correlate all your real weeks against it; the idea is that you are looking for "shapes" in your weekly values that correspond to the shape of the fake week, by looking for peaks in the correlation result.
So if you are interested in finding weeks that are hot near the end of the week, you could use the "fake" week
-1 -1 -1 -1 1 1
and if you get a high response in the first value of the correlation, it means that the real week you correlated it with has roughly this shape.
This is probably beyond the scope of what you're looking for, but one technical approach that would give you the ability to do forecasting, look at things like statistical significance, etc., would be ARIMA or similar Box-Jenkins models.

Resources