Rating Algorithm - performance

I'm trying to develop a rating system for an application I'm working on. Basically app allows you to rate an object from 1 to 5(represented by stars). But I of course know that keeping a rating count and adding the rating the number itself is not feasible.
So the first thing that came up in my mind was dividing the received rating by the total ratings given. Like if the object has received the rating 2 from a user and if the number of times that object has been rated is 100 maybe adding the 2/100. However I believe this method is not good enough since 1)A naive approach 2) In order for me to get the number of times that object has been rated I have to do a look up on db which might end up having time complexity O(n)
So I was wondering what alternative and possibly better ways to approach this problem?

You can keep in DB 2 additional values - number of times it was rated and total sum of all ratings. This way to update object's rating you need only to:
Add new rating to total sum.
Divide total sum by total times it was rated.

There are many approaches to this but before that check
If all feedback givers treated at equal or some have more weight than others (like panel review, etc)
If the objective is to provide only an average or any score band or such. Consider scenario like this website - showing total reputation score
And yes - if average is to be omputed, you need to have total and count of feedback and then have to compute it - that's plain maths. But if you need any other method, be prepared for more compute cycles. balance between database hits and compute cycle but that's next stage of design. First get your requirement and approach to solution in place.

I think you should keep separate counters for 1 stars, 2 stars, ... to calcuate the rating, you'd have to compute rating = (1*numOneStars+2*numTwoStars+3*numThreeStars+4*numFourStars+5*numFiveStars)/numOneStars+numTwoStars+numThreeStars+numFourStars+numFiveStars)
This way you can, like amazon also show how many ppl voted 1 stars and how many voted 5 stars...

Have you considered a vote up/down mechanism over numbers of stars? It doesn't directly solve your problem but it's worth noting that other sites such as YouTube, Facebook, StackOverflow etc all use +/- voting as it is often much more effective than star based ratings.

Related

Algorithm development and optimization

I have this problem:
You need to develop a meal regime based on the data entered by the user. We have a database with the meals and their prices (meals also have a mark whether they are breakfast, cause on breakfast we eat smth different from lunch and dinner most often). The input receives the amount of money (Currency is not important) and the number of days. At the output, we must get a meal regime for a given number of days. Conditions:
Final price does not differ from the given one by more than 3%.
meals mustn't repeat more than once every 5 days.
I found this not effective solution: We are looking for an average price per day = amount of money / number of days. Then, until we reach the given number of days, we iterate throught each breakfast, then lunch and dinner (3 for loops, 2 are nested) and if price is not too different, then we end the search and add this day to the result list. So the design now looks like this:
while(daysCounter < days){
for(){
for(){
for(){
}
}
It looks scary, although there is not a lot of data (number of meals is about 150). There are thoughts that it is possible to find a more effective solution. Also i think about dynamic programming, but so far there are no ideas how to implement it.
Dynamic programming won't work because a necessary part of your state is the meals from the last 5 days. And the number of possibilities for that are astronomical.
However it is likely that there are many solutions, not just a few. And that it is easy to find a working solution by being greedy. Also that an existing solution can be improved fairly easily.
So I'd solve it like this. Replace days with an array of meals. Take the total you want to spend. Divide it up among your meals, in proportion to the median price of the options for that meal. (Breakfast is generally cheaper than dinner.) And now add up that per meal cost to get running totals.
And now for each meal choose the meal you have not had in the last 5 days that brings the running total of what has been spent as close as possible to the ideal total. Choose all of your meals, one at a time.
This is the greedy approach. Normally, but not always, it will come fairly close to the target.
And now, to a maximum of n tries or until you have the target within 3%, pick a random meal, consider all options that are not eaten within the last or next 5 days, and randomly pick one (if any such options exist) that brings the overall amount spent closer to the target.
(For a meal plan that is likely to vary more over a long period, I'd suggest trying simulated annealing. That will produce more interesting results, but it will also take longer.)

How to create a rating based off social scores

I have a thousand recipes each having a tweet and facebook like counts. What i want to do is to create an overall rating out of 100 based off these two scores (and perhaps other social network counts too).
Assuming both facebook and twitter are equally weighted, how can i go about this.
one way to do this for any given network would be somethign like this
this_recipes_facebook_count / max_facebook_count_in_db * 100.0
and average it with the twitter result.
However what happens if there is a recipe with a freakish high score? It unfairly punishes other recipes with lower yet still relatively high scores.
I feel i need to take standard deviation into acccount, perhaps some dampening function...but its been 14 years since i took stats in highschool.
Can anyone help? Id prefer simple over complex as it is only recipe ratings after all.
Instead of linearly increasing the popularity count you might do something like this: (1-p^x)
Where p is a pre-selected value (say 0.99) and x is the number of mentions.
Initially increase in mentions is going to speed up the score a lot. But after sometime the effect becomes smaller and smaller.

How to calculate popularity based on some known factors

I've got a list of movies for each of which the following factors are known:
Number of people that wish to watch the movie in future
Number of people that have watched the movie
Number of people that have enjoyed the movie
Number of people that watched and disliked the movie
Number of comments on the movie
Number of page hits (directly or from search engines) for the movie page
So based on the above factors, I am looking for a way to calculate popularity for each of the movies. Is there any known formula or algorithm to calculate the popularity value in such case? Preferred algorithms are those which provide a more efficient way to update the previously calculated popularity value for each item.
There are basically infinite ways to do what you are after, depending on how important each factor is.
First, you will need to normalize the data. One way to do it is assume each feature is distributed normally, and find the standard deviation and mean of each feature. (your features are number of people watched the movie, number of people enjoyed the movie,...).
Once you have the sd (standard deviation) and mu (mean), you can easily transform the features for each movie to the standard form using norm = (value-mu)/sd.
The estimator for the mean (mu) is the simple average: sum(x_i) / n
The estimator for the standard deviation (sd) is sd = sqrt(Sum((x_i - mu)^2) / (n-1))
Once you have normalized your data, you can simply define the rating as a weighted sum, where each feature will get a boost according to how significant it is:
a1 * #watched + a2 * #liked + ....
If you don't know what the weight is, but willing to manually give grade to a set of movies, you might use supervised learning to find (a1,a2,...,an) for you using linear regression.
There is no correct answer, but I think we should try to model it as close to reality as possible.
Let's consider the following:
P1=Proportion of people who watched and enjoyed it
P2=Proportion of people who disliked the movie
P3=Proportion of people who watched and would like to see again
P4=People who will watch it later but haven't seen it yet
The number of comments simply can't tell how good a movie is, though it can tell how popular it is.Sure.You could leverage the amount of positive and negative comments if its possible to segregate so(possibly by up-votes and down-votes), or you could just use the number of comments as such(C).
Number of page hits should usually give a good indication of the popularity of the movie, so we should give it a good weight in our algorithm.Moreover we should give recent page hits more weight than say page hits of over a year ago.So try and keep the count of page hits in the last three days(N3), in the last week(N7), in the last month(N30) and in the last year(N365), and everything else(Nrest).
You come up with an algorithm using the factors I mentioned.
[Try to use weighted average and variations of Horner's rule for quick updates.Good Luck.]

Feedback on ranking algorithm options for my website

I am currently working on writing an algorithm for my new site I plan to launch soon. The index page will display the "hottest" posts at the moment.
Variables to consider are:
Number of votes
How controversial the post is (# between 0-1)
Time since post
I have come up with two possible algorithms, the first and most simple is:
controversial * (numVotesThisHour / (numVotesTotal - numVotesThisHour)
Denom = numVotesTuisHour if numVotesTotal - numVotesThisHour == 0
Highest number is hottest
My other option is to use an algorithm similar to Reddit's (except that the score decreases as time goes by):
[controversial * log(x)] - (TimePassed / interval)
x = { numVotesTotal if numVotesTotal >= 10, 10 if numVotesTotal < 10
Highest number is hottest
The first algorithm would allow older posts to become "hot" again in the future while the second one wouldn't.
So my question is, which one of these two algorithms do you think is more effective? Which one do you think will display the truly "hot" topics at the moment? Can you think of any advantages or disadvantages to using one over the other? I just want to make sure I don't overlook anything so that I can ensure the content is as relevant as possible. Any feedback would be great! Thanks!
Am I missing something. In the first formula you have numVotesTotal in the denominator. So higher number of votes all time will mean it will never be so hot even if it is not so old.
For example if I have two posts - P1 and P2 (both equally controversial). Say P1 has numVotesTotal = 20, and P2 has numVotesTotal = 1000. Now in the last one hour P1 gets numVotesThisHour = 10 and P2 gets numVotesThisHour = 200.
According to the algorithm, P1 is more famous than P2. It doesn't make sense to me.
I think the first algorithm relies too heavily on instantaneous trend. Think of NASCAR, the current leader could be going 0 m.p.h. because he's at a pit stop. The second one uses the notion of average trend. I think both have their uses.
So for two posts with the same total votes and controversial rating, but where posts one receives 20 votes in the first hour and zero in the second, while the other receives 10 in each hour. The first post will be buried by the first algorithm but the second algorithm will rank them equally.
YMMV, but I think the 'hotness' is entirely dependent on the time frame, and not at all on the total votes unless your time frame is 'all time'. Also, it seems to me that the proportion of all votes in the relevant time frame, rather than the absolute number of them, is the important figure.
You might have several categories of hot:
Hottest this hour
Hottest this week
Hottest since your last visit
Hottest all time
So, 'Hottest in the last [whatever]' could be calculated like this:
votes_for_topic_in_timeframe / all_votes_in_timeframe
if you especially want a number between 0 and 1, (useful for comparing across categories) or, if you only want the ones in a specific timeframe, just take the votes_for_topic_in_timeframe values and sort into descending order.
If you don't want the user explicitly choosing the time frame, you may want to calculate all (say) four versions (or perhaps just the top 3), assign a multiplier to each category to give each category a relative importance, and calculate total values for each topic to take the top n. This has the advantage of potentially hiding from the user that no-one at all has voted in the last hour ;)

Algorithm for Rating Objects Based on Amount of Votes and 5 Star Rating

I'm creating a site whereby people can rate an object of their choice by allotting a star rating (say 5 star rating). Objects are arranged in a series of tags and categories eg. electronics>graphics cards>pci express>... or maintenance>contractor>plumber.
If another user searches for a specific category or tag, the hits must return the highest "rated" object in that category. However, the system would be flawed if 1 person only votes 5 stars for an object whilst 1000 users vote an average of 4.5 stars for another object. Obviously, logic dictates that credibility would be given to the 1000 user rated object as opposed to the object that is evaluated by 1 user even though it has a "lower" score.
Conversely, it's reliable to trust an object with 500 user rating with score of 4.8 than it is to trust an object with 1000 user ratings of 4.5 for example.
What algorithm can achieve this weighting?
A great answer to this question is here:
http://www.evanmiller.org/how-not-to-sort-by-average-rating.html
You can use the Bayesian average when sorting by recommendation.
I'd be tempted to have a cutoff (say, fifty votes though this is obviously traffic dependent) before which you consider the item as unranked. That would significantly reduce the motivation for spam/idiot rankings (especially if each vote is tied to a user account), and also gets you a simple, quick to implement, and reasonably reliable system.
simboid_function(value) = 1/(1+e^(-value));
rating = simboid_function(number_of_voters) + simboid_function(average_rating);

Resources