Algorithm to determine most popular article last week, month and year? - algorithm

I'm working on a project where I need to sort a list of user-submitted articles by their popularity (last week, last month and last year).
I've been mulling on this for a while, but I'm not a great statitician so I figured I could maybe get some input here.
Here are the variables available:
Time [date] the article was originally published
Time [date] the article was recommended by editors (if it has been)
Amount of votes the article has received from users (total, in the last week, in the last month, in the last year)
Number of times the article has been viewed (total, in the last week, in the last month, in the last year)
Number of times the article has been downloaded by users (total, in the last week, in the last month, in the last year)
Comments on the article (total, in the last week, in the last month, in the last year)
Number of times a user has saved the article to their reading-list (Total, in the last week, in the last month, in the last year)
Number of times the article has been featured on a kind of "best we've got to offer" (editorial) list (Total, in the last week, in the last month, in the last year)
Time [date] the article was dubbed 'article of the week' (if it has been)
Right now I'm doing some weighting on each variable, and dividing by the times it has been read. That's pretty much all I could come up with after reading up on Weighted Means. My biggest problem is that there are some user-articles that are always on the top of the popular-list. Probably because the author is "cheating".
I'm thinking of emphasizing the importance of the article being relatively new, but I don't want to "punish" articles that are genuinely popular just because they're a bit old.
Anyone with a more statistically adept mind than mine willing to help me out?
Thanks!

I think the weighted means approach is a good one. But I think there are two things you need to work out.
How to weigh the criteria.
How to prevent "gaming" of the system
How to weigh the criteria
This question falls under the domain of Multi-Criteria Decision Analysis. Your approach is the Weighted Sum Model. In any computational decision making process, ranking the criteria is often the most difficult part of the process. I suggest you take the route of pairwise comparisons: how important do you think each criterion is compared to the others? Build yourself a table like this:
c1 c2 c3 ...
c1 1 4 2
c2 1/4 1 1/2
c3 1/2 2 1
...
This shows that C1 is 4 times as important as C2 which is half as important as C3. Use a finite pool of weightings, say 1.0 since that's easy. Distributing it over the criteria we have 4 * C1 + 2 * C3 + C2 = 1 or roughly C1 = 4/7, C3 = 2/7, C2 = 1/7. Where discrepencies arise (for instance if you think C1 = 2*C2 = 3*C3, but C3 = 2*C2), that's a good error indication: it means that you're inconsistent with your relative rankings so go back and reexamine them. I forget the name of this procedure, comments would be helpful here. This is all well documented.
Now, this all probably seems a bit arbitrary to you at this point. They're for the most part numbers you pulled out of your own head. So I'd suggest taking a sample of maybe 30 articles and ranking them in the way "your gut" says they should be ordered (often you're more intuitive than you can express in numbers). Finagle the numbers until they produce something close to that ordering.
Preventing gaming
This is the second important aspect. No matter what system you use, if you can't prevent "cheating" it will ultimately fail. You need to be able to limit voting (should an IP be able to recommend a story twice?). You need to be able to prevent spam comments. The more important the criterion, the more you need to prevent it from being gamed.

You can use outlier theory for detecting anomalies. A very naive way of looking for outliers is using the mahalanobis distance. This is a measure that takes into account the spread of your data, and calculates the relative distance from the center. It can be interpreted as how many standard deviations the article is from the center. This will however include also genuinely very popular articles, but it gives you a first indication that something is odd.
A second, more general approach is building a model. You could regress the variables that can be manipulated by users against those related to editors. One would expect that users and editors would agree to some extent. If they don't, then it's again an indication something is odd.
In both cases, you'll need to define some treshold and try to find some weighting based on that. A possible approach is to use the square rooted mahalanobis distance as an inverse weight. If you're far away from the center, your score will be pulled down. Same can be done using the residuals from the model. Here you could even take the sign into account. If the editor score is lower than what would be expected based on the user score, the residual will be negative. if the editor score is higher than what would be expected based on the user score, the residual is positive and it's very unlikely that the article is gamed. This allows you to define some rules to reweigh the given scores.
An example in R:
Code :
#Test data frame generated at random
test <- data.frame(
quoted = rpois(100,12),
seen = rbinom(100,60,0.3),
download = rbinom(100,30,0.3)
)
#Create some link between user-vars and editorial
test <- within(test,{
editorial = round((quoted+seen+download)/10+rpois(100,1))
})
#add two test cases
test[101,]<-c(20,18,13,0) #bad article, hyped by few spammers
test[102,]<-c(20,18,13,8) # genuinely good article
# mahalanobis distances
mah <- mahalanobis(test,colMeans(test),cov(test))
# simple linear modelling
mod <- lm(editorial~quoted*seen*download,data=test)
# the plots
op <- par(mfrow=c(1,2))
hist(mah,breaks=20,col="grey",main="Mahalanobis distance")
points(mah[101],0,col="red",pch=19)
points(mah[102],0,,col="darkgreen",pch=19)
legend("topright",legend=c("high rated by editors","gamed"),
pch=19,col=c("darkgreen","red"))
hist(resid(mod),breaks=20,col="grey",main="Residuals model",xlim=c(-6,4))
points(resid(mod)[101],0,col="red",pch=19)
points(resid(mod)[102],0,,col="darkgreen",pch=19)
par(op)

There are any number of ways to do this, and what works for you will depend on your actual dataset and what outcomes you desire for specific articles. As a rough reworking though, I would suggest moving the times it has been read to the weighted numbers and dividing by age of the article, since the older an article is, the more likely it is to have higher numbers in each category.
For example
// x[i] = any given variable above
// w[i] = weighting for that variable
// age = days since published OR
// days since editor recommendation OR
// average of both OR
// ...
score = (x[1]w[1] + ... + x[n]w[n])/age
Your problem of wanting to promote new articles more but not wanting to punish genuinely popular old articles requires consideration of how you can tell whether or not an article is genuinely popular. Then just use the "genuine-ness" algorithm to weight the votes or views rather than a static weighting. You can also change any of the other weightings to be functions rather than constants, and then have non-linear weightings for any variables you wish.
// Fw = some non-linear function
// (possibly multi-variable) that calculates
// a sub-score for the given variable(s)
score = (Fw1(x[1]) + ... + FwN(x[n]))/FwAge(age)

Related

How to calculate popularity based on some known factors

I've got a list of movies for each of which the following factors are known:
Number of people that wish to watch the movie in future
Number of people that have watched the movie
Number of people that have enjoyed the movie
Number of people that watched and disliked the movie
Number of comments on the movie
Number of page hits (directly or from search engines) for the movie page
So based on the above factors, I am looking for a way to calculate popularity for each of the movies. Is there any known formula or algorithm to calculate the popularity value in such case? Preferred algorithms are those which provide a more efficient way to update the previously calculated popularity value for each item.
There are basically infinite ways to do what you are after, depending on how important each factor is.
First, you will need to normalize the data. One way to do it is assume each feature is distributed normally, and find the standard deviation and mean of each feature. (your features are number of people watched the movie, number of people enjoyed the movie,...).
Once you have the sd (standard deviation) and mu (mean), you can easily transform the features for each movie to the standard form using norm = (value-mu)/sd.
The estimator for the mean (mu) is the simple average: sum(x_i) / n
The estimator for the standard deviation (sd) is sd = sqrt(Sum((x_i - mu)^2) / (n-1))
Once you have normalized your data, you can simply define the rating as a weighted sum, where each feature will get a boost according to how significant it is:
a1 * #watched + a2 * #liked + ....
If you don't know what the weight is, but willing to manually give grade to a set of movies, you might use supervised learning to find (a1,a2,...,an) for you using linear regression.
There is no correct answer, but I think we should try to model it as close to reality as possible.
Let's consider the following:
P1=Proportion of people who watched and enjoyed it
P2=Proportion of people who disliked the movie
P3=Proportion of people who watched and would like to see again
P4=People who will watch it later but haven't seen it yet
The number of comments simply can't tell how good a movie is, though it can tell how popular it is.Sure.You could leverage the amount of positive and negative comments if its possible to segregate so(possibly by up-votes and down-votes), or you could just use the number of comments as such(C).
Number of page hits should usually give a good indication of the popularity of the movie, so we should give it a good weight in our algorithm.Moreover we should give recent page hits more weight than say page hits of over a year ago.So try and keep the count of page hits in the last three days(N3), in the last week(N7), in the last month(N30) and in the last year(N365), and everything else(Nrest).
You come up with an algorithm using the factors I mentioned.
[Try to use weighted average and variations of Horner's rule for quick updates.Good Luck.]

Feedback on ranking algorithm options for my website

I am currently working on writing an algorithm for my new site I plan to launch soon. The index page will display the "hottest" posts at the moment.
Variables to consider are:
Number of votes
How controversial the post is (# between 0-1)
Time since post
I have come up with two possible algorithms, the first and most simple is:
controversial * (numVotesThisHour / (numVotesTotal - numVotesThisHour)
Denom = numVotesTuisHour if numVotesTotal - numVotesThisHour == 0
Highest number is hottest
My other option is to use an algorithm similar to Reddit's (except that the score decreases as time goes by):
[controversial * log(x)] - (TimePassed / interval)
x = { numVotesTotal if numVotesTotal >= 10, 10 if numVotesTotal < 10
Highest number is hottest
The first algorithm would allow older posts to become "hot" again in the future while the second one wouldn't.
So my question is, which one of these two algorithms do you think is more effective? Which one do you think will display the truly "hot" topics at the moment? Can you think of any advantages or disadvantages to using one over the other? I just want to make sure I don't overlook anything so that I can ensure the content is as relevant as possible. Any feedback would be great! Thanks!
Am I missing something. In the first formula you have numVotesTotal in the denominator. So higher number of votes all time will mean it will never be so hot even if it is not so old.
For example if I have two posts - P1 and P2 (both equally controversial). Say P1 has numVotesTotal = 20, and P2 has numVotesTotal = 1000. Now in the last one hour P1 gets numVotesThisHour = 10 and P2 gets numVotesThisHour = 200.
According to the algorithm, P1 is more famous than P2. It doesn't make sense to me.
I think the first algorithm relies too heavily on instantaneous trend. Think of NASCAR, the current leader could be going 0 m.p.h. because he's at a pit stop. The second one uses the notion of average trend. I think both have their uses.
So for two posts with the same total votes and controversial rating, but where posts one receives 20 votes in the first hour and zero in the second, while the other receives 10 in each hour. The first post will be buried by the first algorithm but the second algorithm will rank them equally.
YMMV, but I think the 'hotness' is entirely dependent on the time frame, and not at all on the total votes unless your time frame is 'all time'. Also, it seems to me that the proportion of all votes in the relevant time frame, rather than the absolute number of them, is the important figure.
You might have several categories of hot:
Hottest this hour
Hottest this week
Hottest since your last visit
Hottest all time
So, 'Hottest in the last [whatever]' could be calculated like this:
votes_for_topic_in_timeframe / all_votes_in_timeframe
if you especially want a number between 0 and 1, (useful for comparing across categories) or, if you only want the ones in a specific timeframe, just take the votes_for_topic_in_timeframe values and sort into descending order.
If you don't want the user explicitly choosing the time frame, you may want to calculate all (say) four versions (or perhaps just the top 3), assign a multiplier to each category to give each category a relative importance, and calculate total values for each topic to take the top n. This has the advantage of potentially hiding from the user that no-one at all has voted in the last hour ;)

Algorithm to calculate a page importance based on its views / comments

I need an algorithm that allows me to determine an appropriate <priority> field for my website's sitemap based on the page's views and comments count.
For those of you unfamiliar with sitemaps, the priority field is used to signal the importance of a page relative to the others on the same website. It must be a decimal number between 0 and 1.
The algorithm will accept two parameters, viewCount and commentCount, and will return the priority value. For example:
GetPriority(100000, 100000); // Damn, a lot of views/comments! The returned value will be very close to 1, for example 0.995
GetPriority(3, 2); // Ok not many users are interested in this page, so for example it will return 0.082
You mentioned doing this in an SQL query, so I'll give samples in that.
If you have a table/view Pages, something like this
Pages
-----
page_id:int
views:int - indexed
comments:int - indexed
Then you can order them by writing
SELECT * FROM Pages
ORDER BY
(0.3+LOG10(10+views)/LOG10(10+(SELECT MAX(views) FROM Pages))) +
(0.7+LOG10(10+comments)/LOG10(10+(SELECT MAX(comments) FROM Pages)))
I've deliberately chosen unequal weighting between views and comments. A problem that can arise with keeping an equal weighting with views/comments is that the ranking becomes a self-fulfilling prophecy - a page is returned at the top of the list, so it's visited more often, and thus gets more points, so it's shown at the stop of the list, and it's visited more often, and it gets more points.... Putting more weight on on the comments reflects that these take real effort and show real interest.
The above formula will give you ranking based on all-time statistics. So an article that amassed the same number of views/comments in the last week as another article amassed in the last year will be given the same priority. It may make sense to repeat the formula, each time specifying a range of dates, and favoring pages with higher activity, e.g.
0.3*(score for views/comments today) - live data
0.3*(score for views/comments in the last week)
0.25*(score for views/comments in the last month)
0.15*(score for all views/comments, all time)
This will ensure that "hot" pages are given higher priority than similarly scored pages that haven't seen much action lately. All values apart from today's scores can be persisted in tables by scheduled stored procedures so that the database isn't having to aggregate many many comments/view stats. Only today's stats are computed "live". Taking it one step further, the ranking formula itself can be computed and stored for historical data by a stored procedure run daily.
EDIT: To get a strict range from 0.1 to 1.0, you would motify the formula like this. But I stress - this will only add overhead and is unecessary - the absolute values of priority are not important - only their relative values to other urls. The search engine uses these to answer the question, is URL A more important/relevant than URL B? It does this by comparing their priorities - which one is greatest - not their absolute values.
// unnormalized - x is some page id
un(x) = 0.3*log(views(x)+10)/log(10+maxViews()) +
0.7*log(comments(x)+10)/log(10+maxComments())
// the original formula (now in pseudo code)
The maximum will be 1.0, the minimum will start at 1.0 and move downwards as more views/comments are made.
we define un(0) as the minimum value, i.e. (where views(x) and comments(x) are both 0 in the above formula)
To get a normalized formula from 0.1 to 1.0, you then compute n(x), the normalized priority for page x
(1.0-un(x)) * (un(0)-0.1)
n(x) = un(x) - ------------------------- when un(0) != 1.0
1.0-un(0)
= 0.1 otherwise.
Priority = W1 * views / maxViewsOfAllArticles + W2 * comments / maxCommentsOfAllArticles
with W1+W2=1
Although IMHO, just use 0.5*log_10(10+views)/log_10(10+maxViews) + 0.5*log_10(10+comments)/log_10(10+maxComments)
What you're looking for here is not an algorithm, but a formula.
Unfortunately, you haven't really specified the details of what you want, so there's no way we can provide the formula to you.
Instead, let's try to walk through the problem together.
You've got two incoming parameters, the viewCount and the commentCount. You want to return a single number, Priority. So far, so good.
You say that Priority should range between 0 and 1, but this isn't really important. If we were to come up with a formula we liked, but resulted in values between 0 and N, we could just divide the results by N-- so this constraint isn't really relevant.
Now, the first thing we need to decide is the relative weight of Comments vs Views.
If page A has 100 comments and 10 views, and page B has 10 comments and 100 views, which should have a higher priority? Or, should it be the same priority? You need to decide what's right for your definition of Priority.
If you decide, for example, that comments are 5 times more valuable than views, then we can begin with a formula like
Priority = 5 * Comments + Views
Obviously, this can be generalized to
Priority = A * Comments + B * Views
Where A and B are relative weights.
But, sometimes we want our weights to be exponential instead of linear, like
Priority = Comment ^ A + Views ^ B
which will give a very different curve than the earlier formula.
Similarly,
Priority = Comment ^ A * Views ^ B
will give higher value to a page with 20 comments and 20 views than one with 1 comment and 40 views, if the weights are equal.
So, to summarize:
You really ought to make a spreadsheet with sample values for Views and Comments, and then play around with various formulas until you get one that has the distribution that you are hoping for.
We can't do it for you, because we don't know how you want to value things.
I know it has been a while since this was asked, but I encountered a similar problem and had a different solution.
When you want to have a way to rank something, and there are multiple factors that you're using to perform that ranking, you're doing something called multi-criteria decision analysis. (MCDA). See: http://en.wikipedia.org/wiki/Multi-criteria_decision_analysis
There are several ways to handle this. In your case, your criteria have different "units". One is in units of comments, the other is in units of views. Futhermore, you may want to give different weight to these criteria based on whatever business rules you come up with.
In that case, the best solution is something called a weighted product model. See: http://en.wikipedia.org/wiki/Weighted_product_model
The gist is that you take each of your criteria and turn it into a percentage (as was previously suggested), then you take that percentage and raise it to the power of X, where X is a number between 0 and 1. This number represents your weight. Your total weights should add up to one.
Lastly, you multiple each of the results together to come up with a rank. If the rank is greater than 1, than the numerator page has a higher rank than the denominator page.
Each page would be compared against every other page by doing something like:
p1C = page 1 comments
p1V = page 1 view
p2C = page 2 comments
p2V = page 2 views
wC = comment weight
wV = view weight
rank = (p1C/p2C)^(wC) * (p1V/p2V)^(wV)
The end result is a sorted list of pages according to their rank.
I've implemented this in C# by performing a sort on a collection of objects implementing IComparable.
What several posters have essentially advocated without conceptual clarification is that you use linear regression to determine a weighting function of webpage view and comment counts to establish priority.
This technique is pretty easy to implement for your problem, and the basic concept is described well in this Wikipedia article on linear regression models.
A quick summary of how to apply it to your problem is:
Determine the parameters of the line which best fits the view and comment count data for all your site's webpages, i.e., use linear regression.
Use the line parameters to derive your priority function for the view/count parameters.
Code examples for basic linear regression should not be hard to track down if you don't want to implement it from scratch from basic math formulas (use the web, Numerical Recipes, etc.). Also, any general math software package like Matlab, R, etc., comes with linear regression functions.
The most naive approach would be the following:
Let v[i] the views of page i, c[i] the number of comments for page i, then define the relative view weight for page i to be
r_v(i) = v[i]/(sum_j v[j])
where sum_j v[j] is the total of the v[.] over all pages. Similarly define the relative comment weight for page i to be
r_c(i) = c[i]/(sum_j c[j]).
Now you want some constant parameter p: 0 < p < 1 which indicates the importance of views over comments: p = 0 means only comments are significant, p = 1 means only views are significant, and p = 0.5 gives equal weight.
Then set the priority to be
p*r_v(i) + (1-p)*r_c(i)
This might be over-simplistic but its probably the best starting point.

Understanding algorithms for measuring trends

What's the rationale behind the formula used in the hive_trend_mapper.py program of this Hadoop tutorial on calculating Wikipedia trends?
There are actually two components: a monthly trend and a daily trend. I'm going to focus on the daily trend, but similar questions apply to the monthly one.
In the daily trend, pageviews is an array of number of page views per day for this topic, one element per day, and total_pageviews is the sum of this array:
# pageviews for most recent day
y2 = pageviews[-1]
# pageviews for previous day
y1 = pageviews[-2]
# Simple baseline trend algorithm
slope = y2 - y1
trend = slope * log(1.0 +int(total_pageviews))
error = 1.0/sqrt(int(total_pageviews))
return trend, error
I know what it's doing superficially: it just looks at the change over the past day (slope), and scales this up to the log of 1+total_pageviews (log(1)==0, so this scaling factor is non-negative). It can be seen as treating the month's total pageviews as a weight, but tempered as it grows - this way, the total pageviews stop making a difference for things that are "popular enough," but at the same time big changes on insignificant don't get weighed as much.
But why do this? Why do we want to discount things that were initially unpopular? Shouldn't big deltas matter more for items that have a low constant popularity, and less for items that are already popular (for which the big deltas might fall well within a fraction of a standard deviation)? As a strawman, why not simply take y2-y1 and be done with it?
And what would the error be useful for? The tutorial doesn't really use it meaningfully again. Then again, it doesn't tell us how trend is used either - this is what's plotted in the end product, correct?
Where can I read up for a (preferably introductory) background on the theory here? Is there a name for this madness? Is this a textbook formula somewhere?
Thanks in advance for any answers (or discussion!).
As the in-line comment goes, this is a simple "baseline trend algorithm",
which basically means before you compare the trends of two different pages, you have to establish
a baseline. In many cases, the mean value is used, it's straightforward if you
plot the pageviews against the time axis. This method is widely used in monitoring
water quality, air pollutants, etc. to detect any significant changes w.r.t the baseline.
In OP's case, the slope of pageviews is weighted by the log of totalpageviews.
This sorta uses the totalpageviews as a baseline correction for the slope. As Simon put it, this puts a balance
between two pages with very different totalpageviews.
For exmaple, A has a slope 500 over 1000,000 total pageviews, B is 1000 over 1,000.
A log basically means 1000,000 is ONLY twice more important than 1,000 (rather than 1000 times).
If you only consider the slope, A is less popular than B.
But with a weight, now the measure of popularity of A is the same as B. I think it is quite intuitive:
though A's pageviews is only 500 pageviews, but that's because it's saturating, you still gotta give it enough credit.
As for the error, I believe it comes from the (relative) standard error, which has a factor 1/sqrt(n), where
n is the number of data points. In the code, the error is equal to (1/sqrt(n))*(1/sqrt(mean)).
It roughly translates into : the more data points, the more accurate the trend. I don't see
it is an exact math formula, just a brute trend analysis algorithm, anyway the relative
value is more important in this context.
In summary, I believe it's just an empirical formula. More advanced topics can be found in some biostatistics textbooks (very similar to monitoring the breakout of a flu or the like.)
The code implements statistics (in this case the "baseline trend"), you should educate yourself on that and everything becomes clearer. Wikibooks has a good instroduction.
The algorithm takes into account that new pages are by definition more unpopular than existing ones (because - for example - they are linked from relatively few other places) and suggests that those new pages will grow in popularity over time.
error is the error margin the system expects for its prognoses. The higher error is, the more unlikely the trend will continue as expected.
The reason for moderating the measure by the volume of clicks is not to penalise popular pages but to make sure that you can compare large and small changes with a single measure. If you just use y2 - y1 you will only ever see the click changes on large volume pages. What this is trying to express is "significant" change. 1000 clicks change if you attract 100 clicks is really significant. 1000 click change if you attract 100,000 is less so. What this formula is trying to do is make both of these visible.
Try it out at a few different scales in Excel, you'll get a good view of how it operates.
Hope that helps.
another way to look at it is this:
suppose your page and my page are made at same day, and ur page gets total views about ten million, and mine about 1 million till some point. then suppose the slope at some point is a million for me, and 0.5 million for you. if u just use slope, then i win, but ur page already had more views per day at that point, urs were having 5 million, and mine 1 million, so that a million on mine still makes it 2 million, and urs is 5.5 million for that day. so may be this scaling concept is to try to adjust the results to show that ur page is also good as a trend setter, and its slope is less but it already was more popular, but the scaling is only a log factor, so doesnt seem too problematic to me.

How to rank stories based on "controversy"?

I'd like to rank my stories based on "controversy" quotient. For example, reddit.com currently has "controversial" section: http://www.reddit.com/controversial/
When a story has a lot of up and a lot of down votes, it's controversial even though the total score is 0 (for example). How should I calculate this quotient score so that when there's a lot of people voting up and down, I can capture this somehow.
Thanks!!!
Nick
I would recommend using the standard deviation of the votes.
A controversial vote that's 100% polarised would have equal numbers of -1 and +1 votes, so the mean would be 0 and the stddev would be around 1.0
Conversely a completely consistent set of votes (with no votes in the opposite direction) would have a mean of 1 or -1 and a stddev of 0.0.
Votes that aren't either completely consistent or completely polarised will produce a standard deviation figure between 0 and ~1.0 where that value will indicate the degree of controversy in the vote.
The easiest method is to count the number of upvote/downvote pairings for a given comment within the timeframe (e.g. 1 week, 48 hours etc), and have comments with the most parings appear first. Anything more complex requires trial-and-error or experimentation on the best algorithm - as always, it varies on the content of the site and how you want it weighted.
Overall, it's not much different than a hotness algorithm, which works by detecting the most upvotes or views within a timeframe.
What about simply getting the smaller of the two values (up or down) of a point in time? If it goes up a lot and goes down a little, or the other way around it, is not controversial.
If for example the items has 10 ups and 5 downs, the "controversiality level" is 5, since there is 5 people disagreeing about liking it or not. On the other hand if it has either 10 ups or 10 downs, the "controversiality level" is 0, since no one is disagreeing.
So in the end the smaller of both items in this case defines the "hotness" or the "controversiality". Does this make sense?
// figure out if up or down is winning - doesn't matter which
if (up_votes > down_votes)
{
win_votes = up_votes;
lose_votes = down_votes;
}
else
{
win_votes = down_votes;
lose_votes = up_votes;
}
// losewin_ratio is always <= 1, near 0 if win_votes >> lose_votes
losewin_ratio = lose_votes / win_votes;
total_votes = up_votes + down_votes;
controversy_score = total_votes * losewin_ratio; // large means controversial
This formula will produce high scores for stories that have a lot of votes and a near 50/50 voting split, and low scores for stories that have either few votes or many votes for one choice.

Resources