500px.com Ranking Algorithm - algorithm

I was recently wondering how http://500px.com calculates their "Pulse" rating.
The "Pulse" is a score from 1..100 based on the popularity of the photo.
I think it might use some of the following criteria:
Number of likes
Number of "favorites"
Number of comments
Total views
maybe the time since the photo has been uploaded
maybe some other non-obvious criteria like the user's follower count, user rank, camera model or similar
How would I achieve some sort of algorithm like this?
Any advice on how to implement an algorithm with these criteria (and maybe some code) would be appreciated too.

I don't know too much about the site but systems like this generally work the same way. Normalize a set of weighted values to produce a single comparable value.
Define your list of rules, weight them based on importance, then run them all together to get your final value.
In this case it would be something like:
Total number of visits = 10%
Total number of likes = 10%
Number of likes / number of visits = 40% (popularity = percentage of visitors that liked it)
Number of likes in last 30 days = 20% (current popularity)
Author rating = 20%
Now we need to normalize the values for those rules. Depending on what your data is, scale of numbers etc this will be different for each rule so we need a workable value, say between 1 and 100.
Example normalizations for the above:
1 = percentage of visits out of 50,000 visits (a good number of visits)
(visits / 50000) * 100
2 = percentage of likes out of 10,000 likes (a good number of likes)
(likes / 10000) * 100
3 = percentage of visitors that liked it
(likes / visits) * 100
4 = percentage of likes in the last 30 days out of 1,000 likes (a good number of likes for a 30 day period)
(likesIn30Days / 1000) * 100
5 = arbitrary rating of the author
Make sure all of these have a maximum value of 100 (if a value is over, bring it back down to 100). Then we need to combine them according to their weighting, where value1 to value5 are the normalized values numbered above:
Popularity = (value1 * 0.1) + (value2 * 0.1) + (value3 * 0.4) + (value4 * 0.2) + (value5 * 0.2)
This is all off the top of my head and rough. There are obviously also much more efficient ways to do this, as you don't need to normalize to a percentage at every stage, but I hope it helps you get the gist.
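A minimal Python sketch of the weighted approach above, in case it helps. The function names (clamp_pct, popularity) are made up for illustration, and the "good number" targets and weights are just the rough ones from this answer, not anything 500px is known to use.

```python
def clamp_pct(value):
    """Clamp a normalized value into the 0..100 range."""
    return max(0.0, min(100.0, value))


def popularity(visits, likes, likes_30d, author_rating):
    """Weighted popularity score in 0..100 (illustrative targets and weights)."""
    # Normalize each rule against its assumed "good number".
    visit_score = clamp_pct(visits / 50_000 * 100)
    like_score = clamp_pct(likes / 10_000 * 100)
    # Share of visitors that liked the photo; 0 when nobody has visited yet.
    like_rate = clamp_pct(likes / visits * 100) if visits else 0.0
    recent_score = clamp_pct(likes_30d / 1_000 * 100)
    author_score = clamp_pct(author_rating)

    # The weights add up to 1.0, so the result stays within 0..100.
    return (visit_score * 0.1 + like_score * 0.1 + like_rate * 0.4
            + recent_score * 0.2 + author_score * 0.2)


print(popularity(visits=25_000, likes=5_000, likes_30d=500, author_rating=80))
```

With those example inputs the five normalized values are 50, 50, 20, 50 and 80, so the score comes out at about 44.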
Update
I've not really got any references or extra reading. I've never really worked with it as a larger concept only in small implementations.
I think most of what you read, though, is going to be about ranking methodology and theory in general, because your implementation will be very different depending on your rules and data format. It seems such a huge concept when actually it will probably come down to around 10 lines of code, not counting aggregating your data.

You may want to also refer to the following
How Reddit ranking algorithms work
How Hacker News ranking algorithm works
How to Build a Popularity Algorithm You can be Proud of

500px explains their (by now outdated) Pulse ranking algorithm to some extent in their blog:
https://500px.com/blog/52/how-rating-works-and-why-there-s-a-lot-more-to-a-rating-than-just-a-number
Pretty interesting and different than what the solutions here on SO suggest so far.

Related

How to create a rating based off social scores

I have a thousand recipes, each having tweet and Facebook like counts. What I want to do is create an overall rating out of 100 based on these two scores (and perhaps other social network counts too).
Assuming both Facebook and Twitter are equally weighted, how can I go about this?
One way to do this for any given network would be something like this:
this_recipes_facebook_count / max_facebook_count_in_db * 100.0
and average it with the twitter result.
However, what happens if there is a recipe with a freakishly high score? It unfairly punishes other recipes with lower yet still relatively high scores.
I feel I need to take standard deviation into account, perhaps some dampening function... but it's been 14 years since I took stats in high school.
Can anyone help? I'd prefer simple over complex, as it is only recipe ratings after all.
Instead of increasing the popularity score linearly, you might do something like this: (1 - p^x)
where p is a pre-selected value (say 0.99) and x is the number of mentions.
Initially, each additional mention raises the score a lot, but after some time the effect becomes smaller and smaller.
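Here is a rough Python sketch of that dampening idea. The names damped_score and recipe_rating are hypothetical, p = 0.99 is the example value from above, and scaling to 0..100 and averaging the two networks with equal weight follow the question's setup.

```python
def damped_score(mentions, p=0.99):
    """Map a raw count to 0..100 with diminishing returns: 100 * (1 - p**x)."""
    return 100.0 * (1.0 - p ** mentions)


def recipe_rating(facebook_likes, tweets, p=0.99):
    """Equal-weight average of the damped scores for both networks."""
    return (damped_score(facebook_likes, p) + damped_score(tweets, p)) / 2


# A freakishly popular recipe no longer changes anyone else's score,
# because each recipe is scored on its own counts only.
for count in (10, 100, 1000):
    print(count, round(damped_score(count), 1))
```

Choosing p closer to 1 makes the curve saturate more slowly, so it is worth picking it by looking at the typical counts in your data.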

Feedback on ranking algorithm options for my website

I am currently working on writing an algorithm for my new site I plan to launch soon. The index page will display the "hottest" posts at the moment.
Variables to consider are:
Number of votes
How controversial the post is (# between 0-1)
Time since post
I have come up with two possible algorithms, the first and most simple is:
controversial * (numVotesThisHour / (numVotesTotal - numVotesThisHour))
Denominator = numVotesThisHour if numVotesTotal - numVotesThisHour == 0
Highest number is hottest
My other option is to use an algorithm similar to Reddit's (except that the score decreases as time goes by):
[controversial * log(x)] - (TimePassed / interval)
x = { numVotesTotal if numVotesTotal >= 10, 10 if numVotesTotal < 10
Highest number is hottest
The first algorithm would allow older posts to become "hot" again in the future while the second one wouldn't.
So my question is, which one of these two algorithms do you think is more effective? Which one do you think will display the truly "hot" topics at the moment? Can you think of any advantages or disadvantages to using one over the other? I just want to make sure I don't overlook anything so that I can ensure the content is as relevant as possible. Any feedback would be great! Thanks!
Am I missing something? In the first formula you have numVotesTotal in the denominator, so a higher all-time vote count means a post will never be as hot, even if it is not that old.
For example if I have two posts - P1 and P2 (both equally controversial). Say P1 has numVotesTotal = 20, and P2 has numVotesTotal = 1000. Now in the last one hour P1 gets numVotesThisHour = 10 and P2 gets numVotesThisHour = 200.
According to the algorithm, P1 is hotter than P2 (10 / 10 = 1 versus 200 / 800 = 0.25, before the controversy factor). It doesn't make sense to me.
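To make the P1/P2 comparison concrete, here is a small Python sketch of the first formula from the question. The function name hotness_v1 is made up, and returning 0 for a post with no votes at all is my own assumption, since the question does not cover that case.

```python
def hotness_v1(controversial, votes_this_hour, votes_total):
    """First formula from the question: recent votes relative to earlier votes."""
    earlier_votes = votes_total - votes_this_hour
    # Fallback from the question: use this hour's votes when there are no earlier ones.
    denominator = earlier_votes if earlier_votes != 0 else votes_this_hour
    if denominator == 0:
        return 0.0  # assumption: a post without any votes is not hot
    return controversial * votes_this_hour / denominator


# Both posts are assumed equally controversial (factor 1.0).
p1 = hotness_v1(1.0, votes_this_hour=10, votes_total=20)
p2 = hotness_v1(1.0, votes_this_hour=200, votes_total=1000)
print(p1, p2)
```

P1 scores higher than P2 even though P2 received twenty times as many votes in the last hour.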
I think the first algorithm relies too heavily on instantaneous trend. Think of NASCAR, the current leader could be going 0 m.p.h. because he's at a pit stop. The second one uses the notion of average trend. I think both have their uses.
So take two posts with the same total votes and controversial rating, where the first receives 20 votes in the first hour and zero in the second, while the other receives 10 in each hour. The first post will be buried by the first algorithm, but the second algorithm will rank them equally.
YMMV, but I think the 'hotness' is entirely dependent on the time frame, and not at all on the total votes unless your time frame is 'all time'. Also, it seems to me that the proportion of all votes in the relevant time frame, rather than the absolute number of them, is the important figure.
You might have several categories of hot:
Hottest this hour
Hottest this week
Hottest since your last visit
Hottest all time
So, 'Hottest in the last [whatever]' could be calculated like this:
votes_for_topic_in_timeframe / all_votes_in_timeframe
if you especially want a number between 0 and 1, (useful for comparing across categories) or, if you only want the ones in a specific timeframe, just take the votes_for_topic_in_timeframe values and sort into descending order.
If you don't want the user explicitly choosing the time frame, you may want to calculate all (say) four versions (or perhaps just the top 3), assign a multiplier to each category to give each category a relative importance, and calculate total values for each topic to take the top n. This has the advantage of potentially hiding from the user that no-one at all has voted in the last hour ;)
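A rough Python sketch of the multi-timeframe idea, assuming vote counts are already aggregated per timeframe. The function name combined_hotness, the dictionary layout and the multiplier values are all hypothetical.

```python
def combined_hotness(topic_votes, all_votes, multipliers):
    """Weighted sum of a topic's share of all votes in each timeframe.

    topic_votes and all_votes map a timeframe name to a vote count;
    multipliers maps a timeframe name to its relative importance.
    """
    total = 0.0
    for timeframe, weight in multipliers.items():
        votes_in_frame = all_votes.get(timeframe, 0)
        if votes_in_frame > 0:  # a timeframe without any votes contributes nothing
            total += weight * topic_votes.get(timeframe, 0) / votes_in_frame
    return total


multipliers = {"hour": 2.0, "week": 1.0}  # example weights only
score = combined_hotness({"hour": 5, "week": 20}, {"hour": 50, "week": 400}, multipliers)
print(score)
```

Skipping empty timeframes is what lets the ranking fall back to the longer windows when nobody has voted in the last hour.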

Formula for popularity? (based on "like it", "comments", "views")

I have some pages on a website and I have to create an ordering based on "popularity"/"activity"
The parameters that I have to use are:
views to the page
comments made on the page (there is a form at the bottom where uses can make comments)
clicks made to the "like it" icon
Are there any standards for what a formula for popularity would be? (if not opinions are good too)
(initially I thought of views + 10*comments + 10*likeit)
Actually there is an accepted best way to calculate this:
http://www.evanmiller.org/how-not-to-sort-by-average-rating.html
You may need to combine 'likes' and 'comments' into a single score, assigning your own weighting factor to each, before plugging it into the formula as the 'positive vote' value.
from the link above:
Score = Lower bound of Wilson score confidence interval for a
Bernoulli parameter
We need to balance the proportion of positive ratings with
the uncertainty of a small number of observations. Fortunately, the
math for this was worked out in 1927 by Edwin B. Wilson. What we want
to ask is: Given the ratings I have, there is a 95% chance that the
"real" fraction of positive ratings is at least what? Wilson gives the
answer. Considering only positive and negative ratings (i.e. not a
5-star scale), the lower bound on the proportion of positive ratings
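is given by:
$$\frac{\hat{p} + \frac{z_{\alpha/2}^2}{2n} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p}) + \frac{z_{\alpha/2}^2}{4n}}{n}}}{1 + \frac{z_{\alpha/2}^2}{n}}$$
(Use minus where it says plus/minus to calculate the lower bound.)
Here p̂ is the observed fraction of positive ratings, z_{α/2} is the
(1-α/2) quantile of the standard normal distribution, and n is the
total number of ratings. The same formula implemented in Ruby: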
require 'statistics2'

def ci_lower_bound(pos, n, confidence)
  if n == 0
    return 0
  end
  z = Statistics2.pnormaldist(1-(1-confidence)/2)
  phat = 1.0*pos/n
  (phat + z*z/(2*n) - z * Math.sqrt((phat*(1-phat)+z*z/(4*n))/n))/(1+z*z/n)
end
pos is the number of positive ratings, n is the total number of
ratings, and confidence refers to the statistical confidence level:
pick 0.95 to have a 95% chance that your lower bound is correct, 0.975
to have a 97.5% chance, etc. The z-score in this function never
changes, so if you don't have a statistics package handy or if
performance is an issue you can always hard-code a value here for z.
(Use 1.96 for a confidence level of 0.95.)
The same formula as an SQL query:
SELECT widget_id, ((positive + 1.9208) / (positive + negative) -
1.96 * SQRT((positive * negative) / (positive + negative) + 0.9604) /
(positive + negative)) / (1 + 3.8416 / (positive + negative))
AS ci_lower_bound FROM widgets WHERE positive + negative > 0
ORDER BY ci_lower_bound DESC;
There is no standard formula for this (how could there be?)
What you have looks like a fairly normal solution, and would probably work well. Of course, you should play around with the 10's to find values that suit your needs.
Depending on your requirements, you might also want to add in a time factor (i.e. -X points per week) so that old pages become less popular. Alternatively, you could change your "page views" to "page views in the last month". Again, this depends on your needs, it may not be relevant.
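As a sketch of the time factor, here is the formula from the question with a fixed weekly penalty in Python. The name decayed_popularity and the default of 5 points per week are purely illustrative; X has to be tuned to your own numbers.

```python
def decayed_popularity(views, comments, likes, age_weeks, penalty_per_week=5.0):
    """views + 10*comments + 10*likes, minus X points for every week of age."""
    base = views + 10 * comments + 10 * likes
    return base - penalty_per_week * age_weeks


fresh = decayed_popularity(views=100, comments=3, likes=2, age_weeks=0)
month_old = decayed_popularity(views=100, comments=3, likes=2, age_weeks=4)
print(fresh, month_old)
```

A linear penalty like this can push old pages below zero, which is harmless if the value is only used for sorting.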
You could do something like what YouTube does - just have it sorted by largest count per category. For example - most viewed, most commented, most liked. In each category a different page could come first, though the rankings might likely be correlated. If you only need a single ranking, then you would have to come up with a formula of some sort, preferably derived empirically by analyzing a bunch of data you already have and deciding what should be calculated as good/bad, and working backwards to fit an equation that fits your decision.
You could even attempt a machine learning approach to "learn" what a good weighting is for combining each of these numbers as in your example formula. Doing it manually might also not be too hard.
I use,
(C*comments + L*likeit)*100/views
where you must use C and L depending on how much you value each attribute.
I use C=1 and L=1.
This gives you the percentage of views that generated a positive action, making the items with
higher percentage the most "popular".
I like this because it makes it possible for newer items to be very popular at first, showing up first and getting more views and thus becoming less popular (or more) until stabilizing.
Anyway,
I hope it helps.
PS: Of course it would work just the same without the "*100", but I like percentages.
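For what it's worth, that formula in Python might look like the sketch below. The function name engagement_pct is made up, and returning 0 for a page without views is an assumption added to avoid dividing by zero.

```python
def engagement_pct(views, comments, likes, c=1.0, l=1.0):
    """(C*comments + L*likeit) * 100 / views, with C and L as tunable weights."""
    if views == 0:
        return 0.0  # assumption: a page nobody has viewed has no engagement yet
    return (c * comments + l * likes) * 100 / views


pages = {"old": (2000, 20, 40), "new": (10, 0, 2)}  # (views, comments, likes)
ranked = sorted(pages, key=lambda name: engagement_pct(*pages[name]), reverse=True)
print(ranked)
```

The new page ranks first here with 20% against 3%, which is the "popular at first, then stabilizing" effect described above.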
I would value comments more than "like it"s if the content invites a discussion. If it's just stating facts, an equal ratio for comments and the like count seems OK (though 10 is a bit too much, I think...)
Does visit take into account the time the user spent somehow? You might use that, as well, as a 2 second view means less than a 3 minute one.
Java code for Anentropic's answer:
public static double getRank(double thumbsUp, double thumbsDown) {
    double totalVotes = thumbsUp + thumbsDown;
    if (totalVotes > 0) {
        return ((thumbsUp + 1.9208) / totalVotes -
                1.96 * Math.sqrt((thumbsUp * thumbsDown) / totalVotes + 0.9604) /
                totalVotes) / (1 + (3.8416 / totalVotes));
    } else {
        return 0;
    }
}

How to calculate scores?

This question is more related to logic than any programming language. If the question is not apt for the forum please do let me know and I will delete this.
I have to write a logic to calculate scores for blogs for a Blog Award website. A blog may be nominated for multiple award categories and is peer-reviewed or rated by a Jury on a -1 to 5 scale (-1 to indicate a blog they utterly dislike). Now, a blog can be rated by one or more Jurors. One criterion while calculating final score for a blog is that if a blog is rated positively by more people it should get more weightage (and vice-versa). Similarly a blog rated -1 even by one Juror should have its score affected (-1 is sort of a Veto here). Lastly, I also want to have an additional score based on the Technorati rank of the blog (so that the final score is based on a mix of Juror rating + Technorati ranking).
Example: A blog is rated in category A by a total of 6 Jurors. 2 rate it at 3, 3 rate it at 2 and 1 rates it at 4. (I used to calculate the score as (2*3 + 3*2 + 1*4)/6 = 16/6 = 2.67 to get a weighted average, but I am not satisfied with this, primarily because it doesn't work well when a Juror rating is -1. Moreover, I need to add the Technorati ranking criterion too.)
Could you help me decide the best way to calculate the final scores (keeping the rating method same as above as that cannot be changed now)?
If you want to weight the effect of a -1 rating more strongly, use the same average score calculation but substitute -10 whenever you see -1. You can choose a value other than -10 if you don't want a negative rating to weight as strongly.
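A minimal Python sketch of that substitution, using the six ratings from the question's example. The function name jury_score is hypothetical, and -10 is just the example veto weight from this answer.

```python
def jury_score(ratings, veto_value=-10):
    """Average the juror ratings after replacing each -1 (veto) with veto_value."""
    if not ratings:
        return None  # no jurors, no score
    remapped = [veto_value if rating == -1 else rating for rating in ratings]
    return sum(remapped) / len(remapped)


print(jury_score([3, 3, 2, 2, 2, 4]))   # the question's example, no veto
print(jury_score([3, 3, 2, 2, 2, -1]))  # the juror who gave 4 now vetoes instead
```

In the second case the plain average would still be 11/6, roughly 1.83, whereas the remapped average drops to 2/6, roughly 0.33, so a single veto clearly shows.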
You might look at using the lower bound of the Wilson score interval for your ratings.
See http://www.evanmiller.org/how-not-to-sort-by-average-rating.html for more details. Although, there, it is used for the simpler Bernoulli case.
The gist is if you have a lot of ratings you have a higher degree of confidence in your scoring. You can then combine the scores from your local ratings and the Technorati ratings, by weighting the scores by the number of voters locally and on Technorati.
As for wanting a single -1 vote to have high impact, just remap it to a large negative value proportional to your desired impact before feeding it into your scoring formula.
Calculating a score based on votes will be pretty easy. Adding the technorati rank will be the tricky part.
I made a quick script that calculates some scores based on this algorithm
score = ( vote_sum - ( vetos * veto_weight ) ) / number_of_votes
You can change the URL parameters to get different values.
There are a lot of ties, so maybe you could use technorati blog rank as a tie breaker
You could internally work with scores from 0 to 6. Just shift by one, calculate the score and shift back. I guess the -1 has some disrupting effect on your calculation.

How to rank stories based on "controversy"?

I'd like to rank my stories based on "controversy" quotient. For example, reddit.com currently has "controversial" section: http://www.reddit.com/controversial/
When a story has a lot of up and a lot of down votes, it's controversial even though the total score is 0 (for example). How should I calculate this quotient score so that when there's a lot of people voting up and down, I can capture this somehow.
Thanks!!!
Nick
I would recommend using the standard deviation of the votes.
A controversial vote that's 100% polarised would have equal numbers of -1 and +1 votes, so the mean would be 0 and the stddev would be around 1.0
Conversely a completely consistent set of votes (with no votes in the opposite direction) would have a mean of 1 or -1 and a stddev of 0.0.
Votes that aren't either completely consistent or completely polarised will produce a standard deviation figure between 0 and ~1.0 where that value will indicate the degree of controversy in the vote.
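A short Python sketch of the standard deviation approach, treating every upvote as +1 and every downvote as -1. The function name controversy is made up, and using the population standard deviation (statistics.pstdev) and returning 0 for a story without votes are assumptions on my part.

```python
from statistics import pstdev


def controversy(up_votes, down_votes):
    """Population standard deviation of the votes, with +1 per up and -1 per down."""
    votes = [1] * up_votes + [-1] * down_votes
    if not votes:
        return 0.0  # assumption: a story nobody voted on is not controversial
    return pstdev(votes)


print(controversy(50, 50))   # fully polarised
print(controversy(75, 25))   # somewhat split
print(controversy(100, 0))   # completely consistent
```

Note that this value only depends on the up/down ratio, not on how many votes were cast, so 5 against 5 scores the same as 500 against 500.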
The easiest method is to count the number of upvote/downvote pairings for a given comment within the timeframe (e.g. 1 week, 48 hours etc), and have comments with the most pairings appear first. Anything more complex requires trial and error or experimentation to find the best algorithm - as always, it depends on the content of the site and how you want it weighted.
Overall, it's not much different than a hotness algorithm, which works by detecting the most upvotes or views within a timeframe.
What about simply taking the smaller of the two values (up or down) at a point in time? If it goes up a lot and down a little, or the other way around, it is not controversial.
If, for example, an item has 10 ups and 5 downs, the "controversiality level" is 5, since there are 5 people disagreeing about liking it or not. On the other hand, if it has either 10 ups or 10 downs and nothing else, the "controversiality level" is 0, since no one is disagreeing.
So in the end the smaller of the two values defines the "hotness" or the "controversiality". Does this make sense?
// figure out if up or down is winning - doesn't matter which
if (up_votes > down_votes)
{
    win_votes = up_votes;
    lose_votes = down_votes;
}
else
{
    win_votes = down_votes;
    lose_votes = up_votes;
}
// losewin_ratio is always <= 1, near 0 if win_votes >> lose_votes
// (force floating-point division and avoid dividing by zero when there are no votes)
losewin_ratio = (win_votes > 0) ? (double) lose_votes / win_votes : 0.0;
total_votes = up_votes + down_votes;
controversy_score = total_votes * losewin_ratio; // large means controversial
This formula will produce high scores for stories that have a lot of votes and a near 50/50 voting split, and low scores for stories that have either few votes or many votes for one choice.
