Ranking algorithm with multiples factors - algorithm

So I'd like to rank items depending on multiples factors, however some are more important than others.
Concretely, I've a list of products which all have the following properties:
A price
A weight (in kilogrammes)
A time to build (in minutes)
A size (in centimetres)
Each property has a different scale, and I know the min & max range of them.
For example the price are between 10 and 200, while the weight are between 1.2 and 3.4, etc.
I'd like to apply a priority to the size, then to the time to build, weight and finally the price.
However, I'd like to ensure that no matter the time to build, the weight or the price values are, the size should be the first things that should matters.
For example:
[{
price: 320,
size: 10,
weight: 0.4
time: 4
},
{
price: 230,
size: 5,
weight: 1.2
time: 23
},
{
price: 230,
size: 10,
weight: 1.2
time: 23
}]
should results in:
[{
price: 230,
size: 5, // the lowest the better
weight: 1.2
time: 23
},
{
price: 320, // the higher the better
size: 10,
weight: 0.4
time: 4
},
{
price: 230,
size: 10,
weight: 1.2
time: 23
}]
I'm not very good at math and I don't really know where to start.
I'm thinking of something like scale each values on the same range (for example from 0 to 100) and them apply a factor the resulting range value and them add them all before to sort or something like that.
Any ideas.

It looks like you want to sort by size in increasing order, then if two objects have the same size then by time in increasing order, then by weight in increasing order, then by price in decreasing order (according to the comment in your example).
If you are using a language like Python, just put the four values into a tuple, with the items to be in decreasing order with their values replaced with their negatives. For each item your key is
(size, time, weight, -price)
Then Python itself will sort those tuples appropriately--it is built into the language. This is the easiest thing to do.
If you are using a language without sorted tuples, or for some reason you really want the key to be a floating point value, you can do this. For each factor, look at the known minimum and maximum. Use those to scale that factor to a number between 1 and 10, perhaps including 1 but not including 10. Make sure 1 goes with the value to be sorted first. This can be done with
scale1to10 = (value - min) / (max + 1 - min) * 9 + 1
for increasing factors, and with
scale1to10 = (max - value) / (max + 1 - min) * 9 + 1
for decreasing factors. Then combine all the factors into one "4-digit" number, as in
scale = scalesize * 1000 + scaletime * 100 + scaleweight * 10 + scaleprice
Note that the most important factor is multiplied by the highest power of 10, while the lowest has no multiplier.

Your sorting function should have the ability to pass a comparison function. Pass a comparison function that tests each field in order, only moving to the next if everything so far is equal. In most languages this is not too hard to do. My guess (based on your posting history) is that you're using PHP so http://www.the-art-of-web.com/php/sortarray/ is likely to be the best guide to how to do that.

Related

d3.ticks() in d3 v4 often gives a top tick value lower than the highest value in the range

In D3 version 4's D3-array module, d3.ticks() takes an array and outputs an array of nicely-rounded values suitable for chart axis labels:
d3.ticks(start, stop, count)
Returns an array of approximately count + 1 uniformly-spaced, nicely-rounded values between start and stop (inclusive). Each value is a power of ten multiplied by 1, 2 or 5. See also tickStep and linear.ticks...
Ticks are inclusive in the sense that they may include the specified start and stop values if (and only if) they are exact, nicely-rounded values consistent with the inferred step. More formally, each returned tick t satisfies start ≤ t and t ≤ stop.
However, I don't get the results I expect. For example, for this:
d3.ticks( 0, 63500, 7 );
I'd expect output like:
[ 0, 10000, 20000, 30000, 40000, 50000, 60000, 70000 ]
Instead, what I get is:
[ 0, 10000, 20000, 30000, 40000, 50000, 60000 ]
... where my highest value (63500) is greater than my chart's highest tick, meaning I'd expect a value to extend off the chart.
Requesting a different number of ticks doesn't solve it:
( 0, 63500, 8 ) gives the same thing
( 0, 63500, 9 ) also gives the same
( 0, 63500, 10 ) gives [0, 5000, 10000, 15000, 20000, 25000, 30000, 35000, 40000, 45000, 50000, 55000, 60000] which is no better
From random trial and error, the only input I could find that actually exceeded my maximum value was 24, which gave [0, 2000, 4000, ... 62000, 64000]
This is for a function that can be passed any data set, so it can't require a hand-picked number of ticks for each data set
Why is this happening (have I misunderstood something, or doesn't this violate "t ≤ stop") from the docs?), and how do I ensure that my ticks do cover my range?
I've seen the question d3.js scale doesn't place tick at the top which describes a similar problem in D3 version 3 with an illustration:
...but the functions are different in version 4 and the documentation links in the answer don't apply to D3 version 4.
I've looked at D3-Scale - the link to linear.ticks() appears to be broken - but I'm not really understanding why what I'm trying to do would need a whole new D3 submodule and scaling algorithim.
Try this
getRoundedYMaxValue(number) {
var numLen = number.toString().length;
var dividerValue = "1"
for (var i = 0; i < numLen - 1; i++) {
dividerValue += 0
}
return Math.ceil(number / dividerValue) * dividerValue
}

How to create a scoring system using two variables

I have an application (Node/Angular) that I'm creating where I'm trying to rank users based on overall performance across two metrics. There are two metrics used to track the users we are using are the following:
Units Produced (ranges between 0 - 6000)
Rate of production = [ Units Produced ] / [ Labor Hours ] (ranges between 0 - 100)
However, ranking users explicitly by either of these variables doesn't make sense, because it creates some strange incentives/behaviors.
For instance, it is possible to have a really high Rate of Production, but a super low number of total number of units produced by working really hard over a short period of time. Alternatively, you can have a very high number of Units Produced, but it may be due to the fact that they worked overtime, and thus were able to produce more units than anyone else just due to the fact that they had longer to work, and they could have a low Rate of Production.
Does anyone have experience designing these types of scoring systems? How have you handled it?
First, I would recommend to bring them on the same scale. E.g. divide Units produced by 60.
Then, if you are fine with equal weights, there are three common simple choices:
Add the scores
Multiply the scores (equal to adding logs of each)
Take the minimum of the two scores
Which of the ones is best, depends on to what extent you want it to be a measure of combined good results. In your case, I would recommend you to multiply and put a scale on the resulting product.
If you want to go a little more complex and weigh or play around with how much to reward separate vs joint scores, you can use the following formula:
V = alpha * log_b[Units Produced / 60] + (1-alpha) * log_b[Rate of Production],
where alpha determines the weighting of one vs the other and the base of the logarithmic function determines to what extent a joint success is rewarded.
I did something very similar I found it valuable to break them into leagues or tiers, for example using Units Produced as a base.
Novice = 100 Units Produced
Beginner = 500 Units Produced
Advanced = 2000 Units Produced
Expert = 4000 Units Produced
Putting this into a useable object
var levels = [
{id: 1, name: "Novice", minUnits: 100, maxUnits: 599 },
{id: 2, name: "Beginner", minUnits: 500, maxUnits: 1999 },
{id: 3, name: "Advanced", minUnits: 2000, maxUnits: 3999 },
{id: 4, name: "Expert", minUnits: 4000, maxUnits: 6000 }
]
You can then use your Rate of production to multiply by a weighted value inside the levels, you can determine what this is. You can play with the values to make it as hard or as easy as you want.
You can do a combination with
SCORE = 200/( K_1/x_1 + K_2/x_2 )
// x_1 : Score 1
// x_2 : Score 2
// K_1 : Maximum of Score 1
// K_2 : Maximum of Score 2
Of course be carefull when dividing by zero. If either x_1 or x_2 are zero then SCORE=0. If x_1=K_1 and x_2=K_2 then SCORE=100 (maximum)
Otherwise the score is somewhere in between. If x_1/K_1 = x_2/K_2 = z then SCORE = 100*z
This weighs the lower score more such that you get rewarded when raising one of the two scores (unlike a minimum of the two scenarios) but not as much as raising both.

Finding the interval with the highest summed count?

Given a set of entries, each containing a time index and a int count value,
ie
class Entry
{
time:int
count:int
}
write a function that will give the time interval with the highest count together,
ie,
if we had entries
100, 2
100, 1
110, 10
200, 4
1000, 3
1200, 8
and we ran something like
int highestInterval(int interval_range)
highestInterval( 50 )
it would return 100, because in 100-150, you have counts 2, 1, and 10.
I managed to get a O(n^2) solution for it, but I think theres a better solution. I think it might have to do with some preprocessing of the interval buckets, but I can't figure out the solution.
It seems that you made it already using two for loops so it is a mere question of improvement.
Here goes one possible solution:
CODE:
raw_data=[100,2;
100,1;
110,10;
200,4;
1000,3;
1200,8];
[max_val,indx]=max(cell2mat(arrayfun(#(A) sum(raw_data(abs(raw_data(A,1)-raw_data(:,1))<50,2)),1:size(raw_data,1),'UniformOutput',false)));
raw_data(indx,1)
OUTPUT:
ans =
100

What is a better way to sort by a 5 star rating?

I'm trying to sort a bunch of products by customer ratings using a 5 star system. The site I'm setting this up for does not have a lot of ratings and continue to add new products so it will usually have a few products with a low number of ratings.
I tried using average star rating but that algorithm fails when there is a small number of ratings.
Example a product that has 3x 5 star ratings would show up better than a product that has 100x 5 star ratings and 2x 2 star ratings.
Shouldn't the second product show up higher because it is statistically more trustworthy because of the larger number of ratings?
Prior to 2015, the Internet Movie Database (IMDb) publicly listed the formula used to rank their Top 250 movies list. To quote:
The formula for calculating the Top Rated 250 Titles gives a true Bayesian estimate:
weighted rating (WR) = (v ÷ (v+m)) × R + (m ÷ (v+m)) × C
where:
R = average for the movie (mean)
v = number of votes for the movie
m = minimum votes required to be listed in the Top 250 (currently 25000)
C = the mean vote across the whole report (currently 7.0)
For the Top 250, only votes from regular voters are considered.
It's not so hard to understand. The formula is:
rating = (v / (v + m)) * R +
(m / (v + m)) * C;
Which can be mathematically simplified to:
rating = (R * v + C * m) / (v + m);
The variables are:
R – The item's own rating. R is the average of the item's votes. (For example, if an item has no votes, its R is 0. If someone gives it 5 stars, R becomes 5. If someone else gives it 1 star, R becomes 3, the average of [1, 5]. And so on.)
C – The average item's rating. Find the R of every single item in the database, including the current one, and take the average of them; that is C. (Suppose there are 4 items in the database, and their ratings are [2, 3, 5, 5]. C is 3.75, the average of those numbers.)
v – The number of votes for an item. (To given another example, if 5 people have cast votes on an item, v is 5.)
m – The tuneable parameter. The amount of "smoothing" applied to the rating is based on the number of votes (v) in relation to m. Adjust m until the results satisfy you. And don't misinterpret IMDb's description of m as "minimum votes required to be listed" – this system is perfectly capable of ranking items with less votes than m.
All the formula does is: add m imaginary votes, each with a value of C, before calculating the average. In the beginning, when there isn't enough data (i.e. the number of votes is dramatically less than m), this causes the blanks to be filled in with average data. However, as votes accumulates, eventually the imaginary votes will be drowned out by real ones.
In this system, votes don't cause the rating to fluctuate wildly. Instead, they merely perturb it a bit in some direction.
When there are zero votes, only imaginary votes exist, and all of them are C. Thus, each item begins with a rating of C.
See also:
A demo. Click "Solve".
Another explanation of IMDb's system.
An explanation of a similar Bayesian star-rating system.
Evan Miller shows a Bayesian approach to ranking 5-star ratings:
where
nk is the number of k-star ratings,
sk is the "worth" (in points) of k stars,
N is the total number of votes
K is the maximum number of stars (e.g. K=5, in a 5-star rating system)
z_alpha/2 is the 1 - alpha/2 quantile of a normal distribution. If you want 95% confidence (based on the Bayesian posterior distribution) that the actual sort criterion is at least as big as the computed sort criterion, choose z_alpha/2 = 1.65.
In Python, the sorting criterion can be calculated with
def starsort(ns):
"""
http://www.evanmiller.org/ranking-items-with-star-ratings.html
"""
N = sum(ns)
K = len(ns)
s = list(range(K,0,-1))
s2 = [sk**2 for sk in s]
z = 1.65
def f(s, ns):
N = sum(ns)
K = len(ns)
return sum(sk*(nk+1) for sk, nk in zip(s,ns)) / (N+K)
fsns = f(s, ns)
return fsns - z*math.sqrt((f(s2, ns)- fsns**2)/(N+K+1))
For example, if an item has 60 five-stars, 80 four-stars, 75 three-stars, 20 two-stars and 25 one-stars, then its overall star rating would be about 3.4:
x = (60, 80, 75, 20, 25)
starsort(x)
# 3.3686975120774694
and you can sort a list of 5-star ratings with
sorted([(60, 80, 75, 20, 25), (10,0,0,0,0), (5,0,0,0,0)], key=starsort, reverse=True)
# [(10, 0, 0, 0, 0), (60, 80, 75, 20, 25), (5, 0, 0, 0, 0)]
This shows the effect that more ratings can have upon the overall star value.
You'll find that this formula tends to give an overall rating which is a bit
lower than the overall rating reported by sites such as Amazon, Ebay or Wal-mart
particularly when there are few votes (say, less than 300). This reflects the
higher uncertainy that comes with fewer votes. As the number of votes increases
(into the thousands) all overall these rating formulas should tend to the
(weighted) average rating.
Since the formula only depends on the frequency distribution of 5-star ratings
for the item itself, it is easy to combine reviews from multiple sources (or,
update the overall rating in light of new votes) by simply adding the frequency
distributions together.
Unlike the IMDb formula, this formula does not depend on the average score
across all items, nor an artificial minimum number of votes cutoff value.
Moreover, this formula makes use of the full frequency distribution -- not just
the average number of stars and the number of votes. And it makes sense that it
should since an item with ten 5-stars and ten 1-stars should be treated as
having more uncertainty than (and therefore not rated as highly as) an item with
twenty 3-star ratings:
In [78]: starsort((10,0,0,0,10))
Out[78]: 2.386028063783418
In [79]: starsort((0,0,20,0,0))
Out[79]: 2.795342687927806
The IMDb formula does not take this into account.
See this page for a good analysis of star-based rating systems, and this one for a good analysis of upvote-/downvote- based systems.
For up and down voting you want to estimate the probability that, given the ratings you have, the "real" score (if you had infinite ratings) is greater than some quantity (like, say, the similar number for some other item you're sorting against).
See the second article for the answer, but the conclusion is you want to use the Wilson confidence. The article gives the equation and sample Ruby code (easily translated to another language).
Well, depending on how complex you want to make it, you could have ratings additionally be weighted based on how many ratings the person has made, and what those ratings are. If the person has only made one rating, it could be a shill rating, and might count for less. Or if the person has rated many things in category a, but few in category b, and has an average rating of 1.3 out of 5 stars, it sounds like category a may be artificially weighed down by the low average score of this user, and should be adjusted.
But enough of making it complex. Let’s make it simple.
Assuming we’re working with just two values, ReviewCount and AverageRating, for a particular item, it would make sense to me to look ReviewCount as essentially being the “reliability” value. But we don’t just want to bring scores down for low ReviewCount items: a single one-star rating is probably as unreliable as a single 5 star rating. So what we want to do is probably average towards the middle: 3.
So, basically, I’m thinking of an equation something like X * AverageRating + Y * 3 = the-rating-we-want. In order to make this value come out right we need X+Y to equal 1. Also we need X to increase in value as ReviewCount increases...with a review count of 0, x should be 0 (giving us an equation of “3”), and with an infinite review count X should be 1 (which makes the equation = AverageRating).
So what are X and Y equations? For the X equation want the dependent variable to asymptotically approach 1 as the independent variable approaches infinity. A good set of equations is something like:
Y = 1/(factor^RatingCount)
and (utilizing the fact that X must be equal to 1-Y)
X = 1 – (1/(factor^RatingCount)
Then we can adjust "factor" to fit the range that we're looking for.
I used this simple C# program to try a few factors:
// We can adjust this factor to adjust our curve.
double factor = 1.5;
// Here's some sample data
double RatingAverage1 = 5;
double RatingCount1 = 1;
double RatingAverage2 = 4.5;
double RatingCount2 = 5;
double RatingAverage3 = 3.5;
double RatingCount3 = 50000; // 50000 is not infinite, but it's probably plenty to closely simulate it.
// Do the calculations
double modfactor = Math.Pow(factor, RatingCount1);
double modRating1 = (3 / modfactor)
+ (RatingAverage1 * (1 - 1 / modfactor));
double modfactor2 = Math.Pow(factor, RatingCount2);
double modRating2 = (3 / modfactor2)
+ (RatingAverage2 * (1 - 1 / modfactor2));
double modfactor3 = Math.Pow(factor, RatingCount3);
double modRating3 = (3 / modfactor3)
+ (RatingAverage3 * (1 - 1 / modfactor3));
Console.WriteLine(String.Format("RatingAverage: {0}, RatingCount: {1}, Adjusted Rating: {2:0.00}",
RatingAverage1, RatingCount1, modRating1));
Console.WriteLine(String.Format("RatingAverage: {0}, RatingCount: {1}, Adjusted Rating: {2:0.00}",
RatingAverage2, RatingCount2, modRating2));
Console.WriteLine(String.Format("RatingAverage: {0}, RatingCount: {1}, Adjusted Rating: {2:0.00}",
RatingAverage3, RatingCount3, modRating3));
// Hold up for the user to read the data.
Console.ReadLine();
So you don’t bother copying it in, it gives this output:
RatingAverage: 5, RatingCount: 1, Adjusted Rating: 3.67
RatingAverage: 4.5, RatingCount: 5, Adjusted Rating: 4.30
RatingAverage: 3.5, RatingCount: 50000, Adjusted Rating: 3.50
Something like that? You could obviously adjust the "factor" value as needed to get the kind of weighting you want.
You could sort by median instead of arithmetic mean. In this case both examples have a median of 5, so both would have the same weight in a sorting algorithm.
You could use a mode to the same effect, but median is probably a better idea.
If you want to assign additional weight to the product with 100 5-star ratings, you'll probably want to go with some kind of weighted mode, assigning more weight to ratings with the same median, but with more overall votes.
If you just need a fast and cheap solution that will mostly work without using a lot of computation here's one option (assuming a 1-5 rating scale)
SELECT Products.id, Products.title, avg(Ratings.score), etc
FROM
Products INNER JOIN Ratings ON Products.id=Ratings.product_id
GROUP BY
Products.id, Products.title
ORDER BY (SUM(Ratings.score)+25.0)/(COUNT(Ratings.id)+20.0) DESC, COUNT(Ratings.id) DESC
By adding in 25 and dividing by the total ratings + 20 you're basically adding 10 worst scores and 10 best scores to the total ratings and then sorting accordingly.
This does have known issues. For example, it unfairly rewards low-scoring products with few ratings (as this graph demonstrates, products with an average score of 1 and just one rating score a 1.2 while products with an average score of 1 and 1k+ ratings score closer to 1.05). You could also argue it unfairly punishes high-quality products with few ratings.
This chart shows what happens for all 5 ratings over 1-1000 ratings:
http://www.wolframalpha.com/input/?i=Plot3D%5B%2825%2Bxy%29/%2820%2Bx%29%2C%7Bx%2C1%2C1000%7D%2C%7By%2C0%2C6%7D%5D
You can see the dip upwards at the very bottom ratings, but overall it's a fair ranking, I think. You can also look at it this way:
http://www.wolframalpha.com/input/?i=Plot3D%5B6-%28%2825%2Bxy%29/%2820%2Bx%29%29%2C%7Bx%2C1%2C1000%7D%2C%7By%2C0%2C6%7D%5D
If you drop a marble on most places in this graph, it will automatically roll towards products with both higher scores and higher ratings.
Obviously, the low number of ratings puts this problem at a statistical handicap. Never the less...
A key element to improving the quality of an aggregate rating is to "rate the rater", i.e. to keep tabs of the ratings each particular "rater" has supplied (relative to others). This allows weighing their votes during the aggregation process.
Another solution, more of a cope out, is to supply the end-users with a count (or a range indication thereof) of votes for the underlying item.
One option is something like Microsoft's TrueSkill system, where the score is given by mean - 3*stddev, where the constants can be tweaked.
After look for a while, I choose the Bayesian system.
If someone is using Ruby, here a gem for it:
https://github.com/wbotelhos/rating
I'd highly recommend the book Programming Collective Intelligence by Toby Segaran (OReilly) ISBN 978-0-596-52932-1 which discusses how to extract meaningful data from crowd behaviour. The examples are in Python, but its easy enough to convert.

What is the best way to compute trending topics or tags?

Many sites offer some statistics like "The hottest topics in the last 24h". For example, Topix.com shows this in its section "News Trends". There, you can see the topics which have the fastest growing number of mentions.
I want to compute such a "buzz" for a topic, too. How could I do this? The algorithm should weight the topics which are always hot less. The topics which normally (almost) no one mentions should be the hottest ones.
Google offers "Hot Trends", topix.com shows "Hot Topics", fav.or.it shows "Keyword Trends" - all these services have one thing in common: They only show you upcoming trends which are abnormally hot at the moment.
Terms like "Britney Spears", "weather" or "Paris Hilton" won't appear in these lists because they're always hot and frequent. This article calls this "The Britney Spears Problem".
My question: How can you code an algorithm or use an existing one to solve this problem? Having a list with the keywords searched in the last 24h, the algorithm should show you the 10 (for example) hottest ones.
I know, in the article above, there is some kind of algorithm mentioned. I've tried to code it in PHP but I don't think that it'll work. It just finds the majority, doesn't it?
I hope you can help me (coding examples would be great).
This problem calls for a z-score or standard score, which will take into account the historical average, as other people have mentioned, but also the standard deviation of this historical data, making it more robust than just using the average.
In your case a z-score is calculated by the following formula, where the trend would be a rate such as views / day.
z-score = ([current trend] - [average historic trends]) / [standard deviation of historic trends]
When a z-score is used, the higher or lower the z-score the more abnormal the trend, so for example if the z-score is highly positive then the trend is abnormally rising, while if it is highly negative it is abnormally falling. So once you calculate the z-score for all the candidate trends the highest 10 z-scores will relate to the most abnormally increasing z-scores.
Please see Wikipedia for more information, about z-scores.
Code
from math import sqrt
def zscore(obs, pop):
# Size of population.
number = float(len(pop))
# Average population value.
avg = sum(pop) / number
# Standard deviation of population.
std = sqrt(sum(((c - avg) ** 2) for c in pop) / number)
# Zscore Calculation.
return (obs - avg) / std
Sample Output
>>> zscore(12, [2, 4, 4, 4, 5, 5, 7, 9])
3.5
>>> zscore(20, [21, 22, 19, 18, 17, 22, 20, 20])
0.0739221270955
>>> zscore(20, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1])
1.00303599234
>>> zscore(2, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1])
-0.922793112954
>>> zscore(9, [1, 2, 0, 3, 1, 3, 1, 2, 9, 8, 7, 10, 9, 5, 2, 4, 1, 1, 0])
1.65291949506
Notes
You can use this method with a sliding window (i.e. last 30 days) if you wish not to take to much history into account, which will make short term trends more pronounced and can cut down on the processing time.
You could also use a z-score for values such as change in views from one day to next day to locate the abnormal values for increasing/decreasing views per day. This is like using the slope or derivative of the views per day graph.
If you keep track of the current size of the population, the current total of the population, and the current total of x^2 of the population, you don't need to recalculate these values, only update them and hence you only need to keep these values for the history, not each data value. The following code demonstrates this.
from math import sqrt
class zscore:
def __init__(self, pop = []):
self.number = float(len(pop))
self.total = sum(pop)
self.sqrTotal = sum(x ** 2 for x in pop)
def update(self, value):
self.number += 1.0
self.total += value
self.sqrTotal += value ** 2
def avg(self):
return self.total / self.number
def std(self):
return sqrt((self.sqrTotal / self.number) - self.avg() ** 2)
def score(self, obs):
return (obs - self.avg()) / self.std()
Using this method your work flow would be as follows. For each topic, tag, or page create a floating point field, for the total number of days, sum of views, and sum of views squared in your database. If you have historic data, initialize these fields using that data, otherwise initialize to zero. At the end of each day, calculate the z-score using the day's number of views against the historic data stored in the three database fields. The topics, tags, or pages, with the highest X z-scores are your X "hotest trends" of the day. Finally update each of the 3 fields with the day's value and repeat the process next day.
New Addition
Normal z-scores as discussed above do not take into account the order of the data and hence the z-score for an observation of '1' or '9' would have the same magnitude against the sequence [1, 1, 1, 1, 9, 9, 9, 9]. Obviously for trend finding, the most current data should have more weight than older data and hence we want the '1' observation to have a larger magnitude score than the '9' observation. In order to achieve this I propose a floating average z-score. It should be clear that this method is NOT guaranteed to be statistically sound but should be useful for trend finding or similar. The main difference between the standard z-score and the floating average z-score is the use of a floating average to calculate the average population value and the average population value squared. See code for details:
Code
class fazscore:
def __init__(self, decay, pop = []):
self.sqrAvg = self.avg = 0
# The rate at which the historic data's effect will diminish.
self.decay = decay
for x in pop: self.update(x)
def update(self, value):
# Set initial averages to the first value in the sequence.
if self.avg == 0 and self.sqrAvg == 0:
self.avg = float(value)
self.sqrAvg = float((value ** 2))
# Calculate the average of the rest of the values using a
# floating average.
else:
self.avg = self.avg * self.decay + value * (1 - self.decay)
self.sqrAvg = self.sqrAvg * self.decay + (value ** 2) * (1 - self.decay)
return self
def std(self):
# Somewhat ad-hoc standard deviation calculation.
return sqrt(self.sqrAvg - self.avg ** 2)
def score(self, obs):
if self.std() == 0: return (obs - self.avg) * float("infinity")
else: return (obs - self.avg) / self.std()
Sample IO
>>> fazscore(0.8, [1, 1, 1, 1, 1, 1, 9, 9, 9, 9, 9, 9]).score(1)
-1.67770595327
>>> fazscore(0.8, [1, 1, 1, 1, 1, 1, 9, 9, 9, 9, 9, 9]).score(9)
0.596052006642
>>> fazscore(0.9, [2, 4, 4, 4, 5, 5, 7, 9]).score(12)
3.46442230724
>>> fazscore(0.9, [2, 4, 4, 4, 5, 5, 7, 9]).score(22)
7.7773245459
>>> fazscore(0.9, [21, 22, 19, 18, 17, 22, 20, 20]).score(20)
-0.24633160155
>>> fazscore(0.9, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1]).score(20)
1.1069362749
>>> fazscore(0.9, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1]).score(2)
-0.786764452966
>>> fazscore(0.9, [1, 2, 0, 3, 1, 3, 1, 2, 9, 8, 7, 10, 9, 5, 2, 4, 1, 1, 0]).score(9)
1.82262469243
>>> fazscore(0.8, [40] * 200).score(1)
-inf
Update
As David Kemp correctly pointed out, if given a series of constant values and then a zscore for an observed value which differs from the other values is requested the result should probably be non-zero. In fact the value returned should be infinity. So I changed this line,
if self.std() == 0: return 0
to:
if self.std() == 0: return (obs - self.avg) * float("infinity")
This change is reflected in the fazscore solution code. If one does not want to deal with infinite values an acceptable solution could be to instead change the line to:
if self.std() == 0: return obs - self.avg
You need an algorithm that measures the velocity of a topic - or in other words, if you graph it you want to show those that are going up at an incredible rate.
This is the first derivative of the trend line, and it is not difficult to incorporate as a weighted factor of your overall calculation.
Normalize
One technique you'll need to do is to normalize all your data. For each topic you are following, keep a very low pass filter that defines that topic's baseline. Now every data point that comes in about that topic should be normalized - subtract its baseline and you'll get ALL of your topics near 0, with spikes above and below the line. You may instead want to divide the signal by its baseline magnitude, which will bring the signal to around 1.0 - this not only brings all signals in line with each other (normalizes the baseline), but also normalizes the spikes. A britney spike is going to be magnitudes larger than someone else's spike, but that doesn't mean you should pay attention to it - the spike may be very small relative to her baseline.
Derive
Once you've normalized everything, figure out the slope of each topic. Take two consecutive points, and measure the difference. A positive difference is trending up, a negative difference is trending down. Then you can compare the normalized differences, and find out what topics are shooting upward in popularity compared to other topics - with each topic scaled appropriate to it's own 'normal' which may be magnitudes of order different from other topics.
This is really a first-pass at the problem. There are more advanced techniques which you'll need to use (mostly a combination of the above with other algorithms, weighted to suit your needs) but it should be enough to get you started.
Regarding the article
The article is about topic trending, but it's not about how to calculate what's hot and what's not, it's about how to process the huge amount of information that such an algorithm must process at places like Lycos and Google. The space and time required to give each topic a counter, and find each topic's counter when a search on it goes through is huge. This article is about the challenges one faces when attempting such a task. It does mention the Brittney effect, but it doesn't talk about how to overcome it.
As Nixuz points out this is also referred to as a Z or Standard Score.
Chad Birch and Adam Davis are correct in that you will have to look backward to establish a baseline. Your question, as phrased, suggests that you only want to view data from the past 24 hours, and that won't quite fly.
One way to give your data some memory without having to query for a large body of historical data is to use an exponential moving average. The advantage of this is that you can update this once per period and then flush all old data, so you only need to remember a single value. So if your period is a day, you have to maintain a "daily average" attribute for each topic, which you can do by:
a_n = a_(n-1)*b + c_n*(1-b)
Where a_n is the moving average as of day n, b is some constant between 0 and 1 (the closer to 1, the longer the memory) and c_n is the number of hits on day n. The beauty is if you perform this update at the end of day n, you can flush c_n and a_(n-1).
The one caveat is that it will be initially sensitive to whatever you pick for your initial value of a.
EDIT
If it helps to visualize this approach, take n = 5, a_0 = 1, and b = .9.
Let's say the new values are 5,0,0,1,4:
a_0 = 1
c_1 = 5 : a_1 = .9*1 + .1*5 = 1.4
c_2 = 0 : a_2 = .9*1.4 + .1*0 = 1.26
c_3 = 0 : a_3 = .9*1.26 + .1*0 = 1.134
c_4 = 1 : a_4 = .9*1.134 + .1*1 = 1.1206
c_5 = 4 : a_5 = .9*1.1206 + .1*5 = 1.40854
Doesn't look very much like an average does it? Note how the value stayed close to 1, even though our next input was 5. What's going on? If you expand out the math, what you get that:
a_n = (1-b)*c_n + (1-b)*b*c_(n-1) + (1-b)*b^2*c_(n-2) + ... + (leftover weight)*a_0
What do I mean by leftover weight? Well, in any average, all weights must add to 1. If n were infinity and the ... could go on forever, then all weights would sum to 1. But if n is relatively small, you get a good amount of weight left on the original input.
If you study the above formula, you should realize a few things about this usage:
All data contributes something to the average forever. Practically speaking, there is a point where the contribution is really, really small.
Recent values contribute more than older values.
The higher b is, the less important new values are and the longer old values matter. However, the higher b is, the more data you need to water down the initial value of a.
I think the first two characteristics are exactly what you are looking for. To give you an idea of simple this can be to implement, here is a python implementation (minus all the database interaction):
>>> class EMA(object):
... def __init__(self, base, decay):
... self.val = base
... self.decay = decay
... print self.val
... def update(self, value):
... self.val = self.val*self.decay + (1-self.decay)*value
... print self.val
...
>>> a = EMA(1, .9)
1
>>> a.update(10)
1.9
>>> a.update(10)
2.71
>>> a.update(10)
3.439
>>> a.update(10)
4.0951
>>> a.update(10)
4.68559
>>> a.update(10)
5.217031
>>> a.update(10)
5.6953279
>>> a.update(10)
6.12579511
>>> a.update(10)
6.513215599
>>> a.update(10)
6.8618940391
>>> a.update(10)
7.17570463519
Typically "buzz" is figured out using some form of exponential/log decay mechanism. For an overview of how Hacker News, Reddit, and others handle this in a simple way, see this post.
This doesn't fully address the things that are always popular. What you're looking for seems to be something like Google's "Hot Trends" feature. For that, you could divide the current value by a historical value and then subtract out ones that are below some noise threshold.
I think they key word you need to notice is "abnormally". In order to determine when something is "abnormal", you have to know what is normal. That is, you're going to need historical data, which you can average to find out the normal rate of a particular query. You may want to exclude abnormal days from the averaging calculation, but again that'll require having enough data already, so that you know which days to exclude.
From there, you'll have to set a threshold (which would require experimentation, I'm sure), and if something goes outside the threshold, say 50% more searches than normal, you can consider it a "trend". Or, if you want to be able to find the "Top X Trendiest" like you mentioned, you just need to order things by how far (percentage-wise) they are away from their normal rate.
For example, let's say that your historical data has told you that Britney Spears usually gets 100,000 searches, and Paris Hilton usually gets 50,000. If you have a day where they both get 10,000 more searches than normal, you should be considering Paris "hotter" than Britney, because her searches increased 20% more than normal, while Britney's were only 10%.
God, I can't believe I just wrote a paragraph comparing "hotness" of Britney Spears and Paris Hilton. What have you done to me?
I was wondering if it is at all possible to use regular physics acceleration formula in such a case?
v2-v1/t or dv/dt
We can consider v1 to be initial likes/votes/count-of-comments per hour and v2 to be current "velocity" per hour in last 24 hours?
This is more like a question than an answer, but seems it may just work. Any content with highest acceleration will be the trending topic...
I am sure this may not solve Britney Spears problem :-)
probably a simple gradient of topic frequency would work -- large positive gradient = growing quickly in popularity.
the easiest way would be to bin the number of searched each day, so you have something like
searches = [ 10, 7, 14, 8, 9, 12, 55, 104, 100 ]
and then find out how much it changed from day to day:
hot_factor = [ b-a for a, b in zip(searches[:-1], searches[1:]) ]
# hot_factor is [ -3, 7, -6, 1, 3, 43, 49, -4 ]
and just apply some sort of threshold so that days where the increase was > 50 are considered 'hot'. you could make this far more complicated if you'd like, too. rather than absolute difference you can take the relative difference so that going from 100 to 150 is considered hot, but 1000 to 1050 isn't. or a more complicated gradient that takes into account trends over more than just one day to the next.
I had worked on a project, where my aim was finding Trending Topics from Live Twitter Stream and also doing sentimental analysis on the trending topics (finding if Trending Topic positively/negatively talked about). I've used Storm for handling twitter stream.
I've published my report as a blog: http://sayrohan.blogspot.com/2013/06/finding-trending-topics-and-trending.html
I've used Total Count and Z-Score for the ranking.
The approach that I've used is bit generic, and in the discussion section, I've mentioned that how we can extend the system for non-Twitter Application.
Hope the information helps.
You could use log-likelihood-ratios to compare the current date with the last month or year. This is statistically sound (given that your events are not normally distributed, which is to be assumed from your question).
Just sort all your terms by logLR and pick the top ten.
public static void main(String... args) {
TermBag today = ...
TermBag lastYear = ...
for (String each: today.allTerms()) {
System.out.println(logLikelihoodRatio(today, lastYear, each) + "\t" + each);
}
}
public static double logLikelihoodRatio(TermBag t1, TermBag t2, String term) {
double k1 = t1.occurrences(term);
double k2 = t2.occurrences(term);
double n1 = t1.size();
double n2 = t2.size();
double p1 = k1 / n1;
double p2 = k2 / n2;
double p = (k1 + k2) / (n1 + n2);
double logLR = 2*(logL(p1,k1,n1) + logL(p2,k2,n2) - logL(p,k1,n1) - logL(p,k2,n2));
if (p1 < p2) logLR *= -1;
return logLR;
}
private static double logL(double p, double k, double n) {
return (k == 0 ? 0 : k * Math.log(p)) + ((n - k) == 0 ? 0 : (n - k) * Math.log(1 - p));
}
PS, a TermBag is an unordered collection of words. For each document you create one bag of terms. Just count the occurrences of words. Then the method occurrences returns the number of occurrences of a given word, and the method size returns the total number of words. It is best to normalize the words somehow, typically toLowerCase is good enough. Of course, in the above examples you would create one document with all queries of today, and one with all queries of the last year.
If you simply look at tweets, or status messages to get your topics, you're going to encounter a lot of noise. Even if you remove all stop words. One way to get a better subset of topic candidates is to focus only on tweets/messages that share a URL, and get the keywords from the title of those web pages. And make sure you apply POS tagging to get nouns + noun phrases as well.
Titles of web pages usually are more descriptive and contain words that describe what the page is about. In addition, sharing a web page usually is correlated with sharing news that is breaking (ie if a celebrity like Michael Jackson died, you're going to get a lot of people sharing an article about his death).
I've ran experiments where I only take popular keywords from titles, AND then get the total counts of those keywords across all status messages, and they definitely remove a lot of noise. If you do it this way, you don't need a complex algorith, just do a simple ordering of the keyword frequencies, and you're halfway there.
The idea is to keep track of such things and notice when they jump significantly as compared to their own baseline.
So, for queries that have more than a certain threshhold, track each one and when it changes to some value (say almost double) of its historical value, then it is a new hot trend.

Resources