Algorithm for tracking values through time

There is likely a known algorithm for doing this, but I wasn't able to find it using my Google skills, so I will try to describe what I have to do and what I did so far.
I have a source of characteristic values of a system which I would like to plot as a trend. The values are being returned from an algorithm in real time, and each value has a set of properties (magnitude, phase, quality).
However, these values can appear and disappear in time, and I can also get some intermittent values which I will disregard if they don't repeat during a longer period (several samples).
For example, I might be getting these values:
Time (Mag, Phase, Quality)
t = 1 (10.10, 0.90, 0.90); (17.00, 0.02, 0.12)
t = 2 (10.15, 0.91, 0.89); (17.10, 0.12, 0.12)
t = 3 (17.10, 0.12, 0.12)
t = 4 (10.25, 0.91, 0.89); (17.12, 0.12, 0.12)
t = 5 ( 6.15, 0.41, 0.39); (10.35, 0.91, 0.89); (17.12, 0.12, 0.12)
t = 6 (10.20, 0.90, 0.85); (17.02, 0.13, 0.11)
t = 7 ( 9.20, 0.90, 0.85); (11.20, 0.90, 0.85); (17.02, 0.13, 0.11)
t = 8 ( 9.80, 0.90, 0.85); (11.80, 0.90, 0.85); (17.02, 0.13, 0.11)
I'd like to track these sets of values through time according to the similarity with previous values. I.e. in the example above, I have two main trends (Mag 10 and Mag 17), with several specific situations:
moments where I will shortly lose one of the values (Mag 10 is lost in t = 3),
moments where I shortly get a new temporary/invalid reading (Mag 6 in t = 5) for a single sample,
moments where it's not completely clear which set corresponds to the previous sample (Mag 9.2 and Mag 11.2 could both be a continuation of Mag 10.2 from the previous sample), and in t = 8 it becomes apparent that there are now two different sets (Mag 9.8 and Mag 11.8).
If I just grouped the values in the order they arrive from the system, I would not get their correct trends, i.e. without tracking, the magnitudes would be grouped incorrectly (plot not shown).
However, properly matching each new value against the old magnitudes should result in the correct trends (plot not shown).
I've written an algorithm which tracks the values through time by effectively trying all permutations of the new sets against the previous "active" sets. It calculates the distances between all new values and the previously known values (an N^2 step), and then checks all permutations to find the one with the smallest total distance (roughly N! complexity):
for each X in new_sets
    for each Y in existing_sets
        distance(X, Y) = calculate_distance(X, Y);

for each P in permutations(new_sets)
    total_distance = sum(distance(X, Y)) for all (X, Y) in permutation

permutation P with min total_distance is the best match
As I go through time, I also remove measurements from existing_sets if they are not matched within several samples.
This works reasonably well as long as I don't have too many values, but the time complexity becomes problematic once I begin tracking more than about 10 items. It also feels like reinventing the wheel.
Is there a known/better (in terms of time complexity) algorithm for doing this?

Without constraints on the behaviour of the sources, there is obviously no solution. But if we may assume that the magnitudes from different sources are reasonably separated and that changes between samples are reasonably small, the solution is to keep the trends in sorted order, and then binary-search them to find the trend closest to each new reading.
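A minimal sketch of that idea in Python: the Trend class and the tolerance/max_misses parameters below are made-up names for illustration, the sketch tracks magnitudes only for simplicity, and bisect does the binary search over the trends kept in magnitude order.

import bisect

class Trend:
    """One tracked source (hypothetical bookkeeping for this sketch)."""
    def __init__(self, mag):
        self.values = [mag]
        self.misses = 0                      # consecutive samples without a match

    @property
    def mag(self):
        return self.values[-1]

def process_sample(trends, new_mags, tolerance=1.0, max_misses=3):
    """Match one sample's magnitudes against the existing trends, kept in sorted order."""
    trends.sort(key=lambda t: t.mag)
    mags = [t.mag for t in trends]           # snapshot of the previous sample's values
    taken = set()
    unmatched = []
    for m in sorted(new_mags):
        i = bisect.bisect_left(mags, m)      # nearest trends sit at positions i-1 and i
        cands = [j for j in (i - 1, i)
                 if 0 <= j < len(mags) and j not in taken
                 and abs(mags[j] - m) <= tolerance]
        if cands:
            j = min(cands, key=lambda j: abs(mags[j] - m))
            trends[j].values.append(m)
            trends[j].misses = 0
            taken.add(j)
        else:
            unmatched.append(m)              # no close trend: start a new candidate
    for j, t in enumerate(trends):           # age out trends that keep missing
        if j not in taken:
            t.misses += 1
    trends[:] = [t for t in trends if t.misses <= max_misses]
    trends.extend(Trend(m) for m in unmatched)

trends = []
process_sample(trends, [10.10, 17.00])
process_sample(trends, [17.10])
process_sample(trends, [10.25, 17.12])
print([t.values for t in trends])

Each reading is compared against at most its two neighbours in the sorted list, so a sample costs roughly the sort plus one binary search per reading, instead of the factorial search over permutations.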

Related

What techniques are effective to find periodicity in arbitrary data points?

By "arbitrary" I mean that I don't have a signal sampled on a grid that is amenable to taking an FFT. I just have points (e.g. in time) where events happened, and I'd like an estimate of the rate, for example:
p = [0, 1.1, 1.9, 3, 3.9, 6.1 ...]
...could be hits from a process with a nominal periodicity (repetition interval) of 1.0, but with noise and some missed detections.
Are there well known methods for processing such data?
A least-squares fit may do the trick, if correctly initialized; a clustering method can be applied to obtain that initialization.
When an FFT is performed, the signal is represented as a sum of sine waves, and the amplitude of each frequency can be viewed as the result of a least-squares fit to the signal. Hence, if the signal is unevenly sampled, solving the same least-squares problem still makes sense when the Fourier transform is to be estimated; applied to an evenly sampled signal, it boils down to the same result.
Since your signal is discrete, you may want to fit it as a sum of Dirac combs. It seems more sound to minimize the sum of squared distances from each event to the nearest tooth of the Dirac comb. This is a non-linear optimization problem in which each Dirac comb is described by its period and offset. Such a non-linear least-squares problem can be solved by means of the Levenberg-Marquardt algorithm. Below is a Python example making use of the scipy.optimize.leastsq() function. Moreover, the error on the estimated period and offset can be estimated as shown in "How to compute standard deviation errors with scipy.optimize.least_squares"; it is also documented in the documentation of curve_fit() and in "Getting standard errors on fitted parameters using the optimize.leastsq method in Python".
Nevertheless, half the period, a third of the period, and so on, would also fit: these shorter periods are local minima that are to be avoided by refining the initialization of the Levenberg-Marquardt algorithm. To this end, the differences between event times can be clustered, the cluster featuring the smallest value being the expected period. As proposed in "Clustering values by their proximity in python (machine learning?)", the clustering function sklearn.cluster.MeanShift() is applied.
Notice that the procedure can be extended to multidimensional data to look for periodic patterns, or for mixed periodic patterns featuring different fundamental periods.
import numpy as np
import scipy
from scipy.optimize import leastsq
from sklearn.cluster import MeanShift, estimate_bandwidth

ticks = [0, 1.1, 1.9, 3, 3.9, 6.1]
print(scipy.__version__)

def crudeEstimate():
    # Look for the period by clustering the pairwise differences between event times.
    diffs = np.zeros((len(ticks) * (len(ticks) - 1)) // 2)
    k = 0
    for i in range(len(ticks)):
        for j in range(i):
            diffs[k] = ticks[i] - ticks[j]
            k = k + 1
    # See https://stackoverflow.com/questions/18364026/clustering-values-by-their-proximity-in-python-machine-learning
    X = np.array(list(zip(diffs, np.zeros(len(diffs)))), dtype=float)
    bandwidth = estimate_bandwidth(X, quantile=1.0 / len(ticks))
    ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
    ms.fit(X)
    labels = ms.labels_
    cluster_centers = ms.cluster_centers_
    print(cluster_centers)
    labels_unique = np.unique(labels)
    n_clusters_ = len(labels_unique)
    for k in range(n_clusters_):
        my_members = labels == k
        print("cluster {0}: {1}".format(k, X[my_members, 0]))
    # The smallest cluster centre is the estimate of the fundamental period.
    estimated_period = np.min(cluster_centers[:, 0])
    return estimated_period

def disttoDiracComb(x):
    # Residual: distance from each event to the nearest tooth of the Dirac comb(s)
    # described by x = [period_1, offset_1, period_2, offset_2, ...].
    residual = np.zeros(len(ticks))
    for i in range(len(ticks)):
        mindist = np.inf
        for j in range(len(x) // 2):
            period = x[2 * j]
            offset = x[2 * j + 1]
            index = np.floor((ticks[i] - offset) / period)
            currdist = ticks[i] - (index * period + offset)
            if currdist > 0.5 * period:
                currdist = period - currdist
                index = index + 1
            if currdist < mindist:
                mindist = currdist
        residual[i] = mindist
    return residual

estimated_period = crudeEstimate()
print('crude estimate by clustering :', estimated_period)

# Refine the estimate with Levenberg-Marquardt, starting from the crude period and zero offset.
xp = np.array([estimated_period, 0.0])
p, pcov, infodict, mesg, ier = leastsq(disttoDiracComb, x0=xp, ftol=1e-18, full_output=True)

# See https://stackoverflow.com/questions/14581358/getting-standard-errors-on-fitted-parameters-using-the-optimize-leastsq-method-i
s_sq = (disttoDiracComb(p) ** 2).sum() / (len(ticks) - len(p))
pcov = pcov * s_sq
perr = np.sqrt(np.diag(pcov))
print('estimated period is', p[0], '+/-', 1.96 * perr[0])
print('estimated offset is', p[1], '+/-', 1.96 * perr[1])
Applied to your sample, it prints:
crude estimate by clustering : 0.975
estimated period is 1.0042857141346768 +/- 0.04035792507868619
estimated offset is -0.011428571139828817 +/- 0.13385206912205957
It sounds like you need to decide what exactly you want to determine. If you want to know the average interval in a set of timestamps, then that's easy (just take the mean or median).
If you expect that the interval could be changing, then you need to have some idea about how fast it is changing. Then you can find a windowed moving average. You need to have an idea of how fast it is changing so that you can select your window size appropriately - a larger window will give you a smoother result, but a smaller window will be more responsive to a faster-changing rate.
If you have no idea whether the data is following any sort of pattern, then you are probably in the territory of data exploration. In that case, I would start by plotting the intervals, to see if a pattern appears to the eye. This might also benefit from applying a moving average if the data is quite noisy.
Essentially, whether or not there is something in the data and what it means is up to you and your knowledge of the domain. That is, in any set of timestamps there will be an average (and you can also easily calculate the variance to give an indication of variability in the data), but it is up to you whether that average carries any meaning.
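For the windowed-moving-average suggestion, here is a small NumPy sketch (the window size of 3 is an arbitrary choice for this example):

import numpy as np

p = np.array([0, 1.1, 1.9, 3, 3.9, 6.1])

intervals = np.diff(p)                        # time between consecutive events
print("mean interval:  ", intervals.mean())
print("median interval:", np.median(intervals))

# Windowed moving average of the intervals: a larger window smooths more,
# a smaller window reacts faster to a changing rate.
window = 3
smoothed = np.convolve(intervals, np.ones(window) / window, mode="valid")
print("moving average: ", smoothed)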

How can I acquire a certain "random range" in a higher frequency?

My question is basically, "how can I obtain certain random values within a specific range more than random values outside the range?"
Allow me to demonstrate what I mean:
If I were to, on a good amount of trials, start picking a variety of
random numbers from 1-10, I should be seeing more numbers in the 7-10
range than in the 1-6 range.
I tried a couple of ways, but I am not getting desirable results.
First Function:
function getAverage(i)
    math.randomseed(os.time())
    local sum = 0;
    for j = 1, i do
        sum = sum + (1 - math.random()^3) * 10
    end
    print(sum/i)
end

getAverage(500)
I was constantly getting numbers only around 7.5, such as 7.48 and 7.52. Although this does indeed get me a number within my range, I don't want such strict consistency.
Second Function:
function getAverage(i)
    math.randomseed(os.time())
    local sum = 0;
    for j = 1, i do
        sum = sum + (math.random() > .3 and math.random(7,10) or math.random(1,6))
    end
    print(sum/i)
end

getAverage(500)
This function didn't work as I wanted it to either. I was primarily getting numbers such as 6.8 and 7.2, but nothing even close to 8.
Third Function:
function getAverage(i)
    math.randomseed(os.time())
    local sum = 0;
    for j = 1, i do
        sum = sum + (((math.random(10) * 2)/1.2)^1.05) - math.random(1,3)
    end
    print(sum/i)
end

getAverage(500)
This function was giving me slightly more favorable results, with the function consistently returning 8, but that is the issue - consistency.
What type of paradigms or practical solutions can I use to generate more random numbers within a specific range over another range?
I have labeled this as Lua, but a solution in any language that is understandable is acceptable.
I don't want such strict consistency.
What does that mean?
If you average a very large number of values in a given range from any RNG, you should expect that to produce the same number. That means each of the numbers in the range was equally likely to appear.
This function didn't work as I wanted it to either. I was primarily getting numbers such as 6.8 and 7.2, but nothing even close to 8.
You have to clarify what "didn't work" means. Why would you expect it to give you 8? You can see it won't just by looking at the formula you used.
For instance, if you'd used math.random(1,10), assuming all numbers in the range have an equal chance of appearing, you should expect the average to be 5.5, dead in the middle of 1 and 10 (because (1+2+3+4+5+6+7+8+9+10)/10 = 5.5).
You used math.random() > .3 and math.random(7,10) or math.random(1,6), which says that 70% of the time it should give 7, 8, 9, or 10 (average = 8.5) and 30% of the time it should give you 1, 2, 3, 4, 5, or 6 (average = 3.5). That should give you an overall average of 7 (because 3.5 * .3 + 8.5 * .7 = 7). If you bump up your sample size, that's exactly what you'll see. You're seeing values on either side because your sample size is so small (try bumping it up to 100000).
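If you want to check that, a quick simulation of the same mixture in Python (70/30 split, same ranges) converges to 7 as the sample grows:

import random

n = 100000
total = sum(random.randint(7, 10) if random.random() > .3 else random.randint(1, 6)
            for _ in range(n))
print(total / n)   # approaches 7.0 as n grows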
I've made skewed random values before by simply generating two random numbers in the range, and then picking the largest (or smallest). This skews the probability towards the high (or low) endpoint.
Picking the smallest of two gives you a linear probability distribution.
Picking the smallest of three gives you a parabolic distribution (more selectivity, less probability at "the other end"). For my needs, a linear distribution was fine.
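A quick Python sketch of the max-of-two trick (the sample count is arbitrary); the same idea works in Lua with math.random:

import random
from collections import Counter

def skewed_high(lo=1, hi=10):
    # Keep the larger of two uniform draws: values near hi become more likely.
    return max(random.randint(lo, hi), random.randint(lo, hi))

counts = Counter(skewed_high() for _ in range(100000))
for value in sorted(counts):
    print(value, counts[value])
# Counts rise roughly linearly towards 10. Using min() instead skews towards 1,
# and taking the max of three draws skews even more strongly.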
Not exactly what you wanted, but maybe it's good enough.
Have fun!

How to create a scoring system using two variables

I have an application (Node/Angular) that I'm creating where I'm trying to rank users based on overall performance across two metrics. The two metrics we use to track users are the following:
Units Produced (ranges between 0 - 6000)
Rate of production = [ Units Produced ] / [ Labor Hours ] (ranges between 0 - 100)
However, ranking users explicitly by either of these variables doesn't make sense, because it creates some strange incentives/behaviors.
For instance, it is possible to have a really high Rate of Production but a very low total number of Units Produced, by working really hard over a short period of time. Alternatively, someone can have a very high number of Units Produced simply because they worked overtime and had longer to work than anyone else, while still having a low Rate of Production.
Does anyone have experience designing these types of scoring systems? How have you handled it?
First, I would recommend bringing them onto the same scale, e.g. divide Units Produced by 60.
Then, if you are fine with equal weights, there are three common simple choices:
Add the scores
Multiply the scores (equal to adding logs of each)
Take the minimum of the two scores
Which of these is best depends on to what extent you want the score to measure combined good results. In your case, I would recommend multiplying and putting a scale on the resulting product.
If you want to go a little more complex and weigh or play around with how much to reward separate vs joint scores, you can use the following formula:
V = alpha * log_b[Units Produced / 60] + (1-alpha) * log_b[Rate of Production],
where alpha determines the weighting of one vs the other and the base of the logarithmic function determines to what extent a joint success is rewarded.
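A small Python sketch of that formula (alpha = 0.5 and base 2 are just example choices, and both metrics are assumed to be positive so the logarithms are defined):

import math

def combined_score(units_produced, rate_of_production, alpha=0.5, base=2):
    # V = alpha * log_b(Units Produced / 60) + (1 - alpha) * log_b(Rate of Production)
    scaled_units = units_produced / 60          # bring both metrics onto a 0-100 scale
    return (alpha * math.log(scaled_units, base)
            + (1 - alpha) * math.log(rate_of_production, base))

print(combined_score(3000, 40))    # balanced performer
print(combined_score(6000, 5))     # many units but a low rate scores worse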
I did something very similar. I found it valuable to break them into leagues or tiers, for example using Units Produced as a base.
Novice = 100 Units Produced
Beginner = 500 Units Produced
Advanced = 2000 Units Produced
Expert = 4000 Units Produced
Putting this into a usable object:
var levels = [
    {id: 1, name: "Novice",   minUnits: 100,  maxUnits: 599 },
    {id: 2, name: "Beginner", minUnits: 500,  maxUnits: 1999 },
    {id: 3, name: "Advanced", minUnits: 2000, maxUnits: 3999 },
    {id: 4, name: "Expert",   minUnits: 4000, maxUnits: 6000 }
]
You can then multiply your Rate of Production by a weighted value inside the levels; you can determine what this weight is, and play with the values to make it as hard or as easy as you want.
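A possible way to combine the tiers with the rate, sketched here in Python (the weight values per level are made up for illustration):

# Hypothetical weights per level; tune them to make ranking harder or easier.
LEVELS = [
    {"id": 1, "name": "Novice",   "minUnits": 100,  "weight": 1.0},
    {"id": 2, "name": "Beginner", "minUnits": 500,  "weight": 1.5},
    {"id": 3, "name": "Advanced", "minUnits": 2000, "weight": 2.0},
    {"id": 4, "name": "Expert",   "minUnits": 4000, "weight": 3.0},
]

def tier_score(units_produced, rate_of_production):
    # Use the highest tier whose minimum the user has reached (default to the first tier).
    level = LEVELS[0]
    for candidate in LEVELS:
        if units_produced >= candidate["minUnits"]:
            level = candidate
    return rate_of_production * level["weight"]

print(tier_score(2500, 40))   # Advanced tier: 40 * 2.0 = 80.0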
You can do a combination with
SCORE = 200/( K_1/x_1 + K_2/x_2 )
// x_1 : Score 1
// x_2 : Score 2
// K_1 : Maximum of Score 1
// K_2 : Maximum of Score 2
Of course, be careful when dividing by zero. If either x_1 or x_2 is zero then SCORE=0. If x_1=K_1 and x_2=K_2 then SCORE=100 (the maximum).
Otherwise the score is somewhere in between. If x_1/K_1 = x_2/K_2 = z then SCORE = 100*z
This weighs the lower score more heavily, so you get rewarded for raising one of the two scores (unlike the minimum-of-the-two approach), but not as much as for raising both.
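A quick Python sketch of that score, using K_1 = 6000 and K_2 = 100 from the ranges given in the question:

def combined_score(x1, x2, k1=6000.0, k2=100.0):
    # SCORE = 200 / (K_1/x_1 + K_2/x_2), with the zero case handled explicitly.
    if x1 == 0 or x2 == 0:
        return 0.0
    return 200.0 / (k1 / x1 + k2 / x2)

print(combined_score(6000, 100))   # both at their maximum       -> 100.0
print(combined_score(3000, 50))    # both at half their maximum  -> 50.0
print(combined_score(6000, 10))    # one low score drags it down -> ~18.2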

Algorithm for nice graph labels for time/date axis?

I'm looking for a "nice numbers" algorithm for determining the labels on a date/time value axis. I'm familiar with Paul Heckbert's Nice Numbers algorithm.
I have a plot that displays time/date on the X axis and the user can zoom in and look at a smaller time frame. I'm looking for an algorithm that picks nice dates to display on the ticks.
For example:
Looking at a day or so: 1/1 12:00, 1/1 4:00, 1/1 8:00...
Looking at a week: 1/1, 1/2, 1/3...
Looking at a month: 1/09, 2/09, 3/09...
The nice label ticks don't need to correspond to the first visible point, but they should be close to it.
Is anybody familiar with such an algorithm?
The 'nice numbers' article you linked to mentioned that
the nicest numbers in decimal are 1, 2, 5 and all power-of-10 multiples of these numbers
So I think for doing something similar with date/time you need to start by similarly breaking down the component pieces. So take the nice factors of each type of interval:
If you're showing seconds or minutes use 1, 2, 3, 5, 10, 15, 30
(I skipped 6, 12, 15, 20 because they don't "feel" right).
If you're showing hours use 1, 2, 3, 4, 6, 8, 12
for days use 1, 2, 7
for weeks use 1, 2, 4 (13 and 26 fit the model but seem too odd to me)
for months use 1, 2, 3, 4, 6
for years use 1, 2, 5 and power-of-10 multiples
Now obviously this starts to break down as you get into larger amounts. Certainly you don't want to show 5 weeks' worth of minutes, even in "pretty" intervals of 30 minutes or something. On the other hand, when you only have 48 hours' worth, you don't want to show 1-day intervals. The trick, as you have already pointed out, is finding decent transition points.
Just on a hunch, I would say a reasonable crossover point would be about twice as much as the next interval. That would give you the following (min and max number of intervals shown afterwards)
use seconds if you have less than 2 minutes worth (1-120)
use minutes if you have less than 2 hours worth (2-120)
use hours if you have less than 2 days worth (2-48)
use days if you have less than 2 weeks worth (2-14)
use weeks if you have less than 2 months worth (2-8/9)
use months if you have less than 2 years worth (2-24)
otherwise use years (although you could continue with decades, centuries, etc if your ranges can be that long)
Unfortunately, our inconsistent time intervals mean that you end up with some cases that can have over a hundred intervals while others have at most 8 or 9. So you'll want to pick the size of your intervals such that you don't have more than 10-15 intervals at most (or fewer than 5, for that matter). Also, you could break from a strict definition of 2 times the next biggest interval if you think it's easy to keep track of. For instance, you could use hours up to 3 days (72 hours) and weeks up to 4 months. A little trial and error might be necessary.
So to go back over, choose the interval type based on the size of your range, then choose the interval size by picking one of the "nice" numbers that will leave you with between 5 and about 15 tick marks. Or if you know and/or can control the actual number of pixels between tick marks you could put upper and lower bounds on how many pixels are acceptable between ticks (if they are spaced too far apart the graph may be hard to read, but if there are too many ticks the graph will be cluttered and your labels may overlap).
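A sketch of that selection logic in Python (the threshold and step tables below just follow the factors listed above and are not a canonical list; months are approximated as 30 days):

from datetime import timedelta

# "Nice" step sizes per unit, with a threshold on the visible range for each unit.
NICE_STEPS = [
    (timedelta(minutes=2), [timedelta(seconds=s) for s in (1, 2, 3, 5, 10, 15, 30)]),
    (timedelta(hours=2),   [timedelta(minutes=m) for m in (1, 2, 3, 5, 10, 15, 30)]),
    (timedelta(days=2),    [timedelta(hours=h) for h in (1, 2, 3, 4, 6, 8, 12)]),
    (timedelta(weeks=2),   [timedelta(days=d) for d in (1, 2, 7)]),
    (timedelta(days=60),   [timedelta(weeks=w) for w in (1, 2, 4)]),            # ~2 months
    (timedelta(days=730),  [timedelta(days=30 * m) for m in (1, 2, 3, 4, 6)]),  # months ~ 30 days
]

def pick_tick_step(visible_range, max_ticks=15):
    """Pick the smallest 'nice' step that keeps the tick count under max_ticks."""
    for threshold, steps in NICE_STEPS:
        if visible_range < threshold:
            candidates = steps
            break
    else:
        candidates = [timedelta(days=365 * y) for y in (1, 2, 5, 10)]           # years
    for step in candidates:
        if visible_range / step <= max_ticks:
            return step
    return candidates[-1]

print(pick_tick_step(timedelta(hours=36)))   # -> 3:00:00 (3-hour ticks for a 36-hour window)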
Have a look at
http://tools.netsa.cert.org/netsa-python/doc/index.html
It has a nice.py (python/netsa/data/nice.py) which I think is stand-alone, and should work fine.
Still no answer to this question... I'll throw my first idea in then! I assume you have the range of the visible axis.
This is probably how I would do it.
Rough pseudo:
// quantify range
rangeLength = endOfVisiblePart - startOfVisiblePart;

// qualify range resolution
if (range < "1.5 day") {
    resolution = "day"; // it can be a number, e.g.: ..., 3 for day, 4 for week, ...
} else if (range < "9 days") {
    resolution = "week";
} else if (range < "35 days") {
    resolution = "month";
}
// you can expand this in both ways to get from nanoseconds to geological eras if you wish
After that, it should (depending on what you have easy access to) be quite easy to determine the value to each nice label tick. Depending on the 'resolution', you format it differently. E.g.: MM/DD for "week", MM:SS for "minute", etc., just like you said.
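For the formatting step, a minimal resolution-to-format mapping might look like this (the strftime patterns are examples, not a fixed convention):

from datetime import datetime

# Example label formats per resolution; adjust to taste.
LABEL_FORMATS = {
    "minute": "%M:%S",        # MM:SS, as in the example above
    "hour":   "%H:%M",
    "day":    "%m/%d %H:%M",
    "week":   "%m/%d",        # MM/DD, as in the example above
    "month":  "%m/%Y",
}

def format_tick(tick, resolution):
    return tick.strftime(LABEL_FORMATS[resolution])

print(format_tick(datetime(2009, 1, 2, 12, 30), "week"))   # -> 01/02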
[Edit - I expanded this a little more at http://www.acooke.org/cute/AutoScalin0.html ]
A naive extension of the "nice numbers" algorithm seems to work for base 12 and 60, which gives good intervals for hours and minutes. This is code I just hacked together:
from math import ceil, floor, log, log10

LIM10 = (10, [(1.5, 1), (3, 2), (7, 5)], [1, 2, 5])
LIM12 = (12, [(1.5, 1), (3, 2), (8, 6)], [1, 2, 6])
LIM60 = (60, [(1.5, 1), (20, 15), (40, 30)], [1, 15, 40])

def heckbert_d(lo, hi, ntick=5, limits=None):
    '''
    Heckbert's "nice numbers" algorithm for graph ranges, from "Graphics Gems".
    '''
    if limits is None:
        limits = LIM10
    (base, rfs, fs) = limits
    def nicenum(x, round):
        step = base ** floor(log(x) / log(base))
        f = float(x) / step
        nf = base
        if round:
            for (a, b) in rfs:
                if f < a:
                    nf = b
                    break
        else:
            for a in fs:
                if f <= a:
                    nf = a
                    break
        return nf * step
    delta = nicenum(hi - lo, False)
    return nicenum(delta / (ntick - 1), True)

def heckbert(lo, hi, ntick=5, limits=None):
    '''
    Heckbert's "nice numbers" algorithm for graph ranges, from "Graphics Gems".
    '''
    def _heckbert():
        d = heckbert_d(lo, hi, ntick=ntick, limits=limits)
        graphlo = floor(lo / d) * d
        graphhi = ceil(hi / d) * d
        fmt = '%' + '.%df' % max(-floor(log10(d)), 0)
        value = graphlo
        while value < graphhi + 0.5 * d:
            yield fmt % value
            value += d
    return list(_heckbert())
So, for example, if you want to display seconds from 0 to 60,
>>> heckbert(0, 60, limits=LIM60)
['0', '15', '30', '45', '60']
or hours from 0 to 5:
>>> heckbert(0, 5, limits=LIM12)
['0', '2', '4', '6']
I'd suggest you grab the source code to gnuplot or RRDTool (or even Flot) and examine how they approach this problem. The general case is likely to be N labels applied based on the width of your plot, with some kind of 'snapping' to the nearest 'nice' number.
Every time I've written such an algorithm (too many times really), I've used a table of 'preferences': i.e. based on the time range on the plot, decide if I'm using weeks, days, hours, minutes etc. as the main axis point. I usually included some preferred formatting, as I rarely want to see the date for each minute I plot on the graph.
I'd be happy but surprised to find someone using a formula (like Heckbert does) to find 'nice', as the variation in time units between minutes, hours, days, and weeks is not that linear.
In theory you can also invert the concept: instead of putting your data at the center of the visualization, put the scale at the center.
When you know the start and end dates of your data, you can create a scale with all the dates and distribute your data onto that scale, like a fixed scale.
You can have scales of type year, month, day, hour, and so on, and limit the scaling to just these, which means you give up the concept of free scaling.
The advantage is that you can easily show gaps in the dates. But if you have a lot of gaps, that can also become useless.

What is a better way to sort by a 5 star rating?

I'm trying to sort a bunch of products by customer ratings using a 5-star system. The site I'm setting this up for does not have a lot of ratings and continues to add new products, so it will usually have a few products with a low number of ratings.
I tried using average star rating but that algorithm fails when there is a small number of ratings.
For example, a product that has 3x 5-star ratings would show up better than a product that has 100x 5-star ratings and 2x 2-star ratings.
Shouldn't the second product show up higher because it is statistically more trustworthy because of the larger number of ratings?
Prior to 2015, the Internet Movie Database (IMDb) publicly listed the formula used to rank their Top 250 movies list. To quote:
The formula for calculating the Top Rated 250 Titles gives a true Bayesian estimate:
weighted rating (WR) = (v ÷ (v+m)) × R + (m ÷ (v+m)) × C
where:
R = average for the movie (mean)
v = number of votes for the movie
m = minimum votes required to be listed in the Top 250 (currently 25000)
C = the mean vote across the whole report (currently 7.0)
For the Top 250, only votes from regular voters are considered.
It's not so hard to understand. The formula is:
rating = (v / (v + m)) * R +
(m / (v + m)) * C;
Which can be mathematically simplified to:
rating = (R * v + C * m) / (v + m);
The variables are:
R – The item's own rating. R is the average of the item's votes. (For example, if an item has no votes, its R is 0. If someone gives it 5 stars, R becomes 5. If someone else gives it 1 star, R becomes 3, the average of [1, 5]. And so on.)
C – The average item's rating. Find the R of every single item in the database, including the current one, and take the average of them; that is C. (Suppose there are 4 items in the database, and their ratings are [2, 3, 5, 5]. C is 3.75, the average of those numbers.)
v – The number of votes for an item. (To give another example, if 5 people have cast votes on an item, v is 5.)
m – The tuneable parameter. The amount of "smoothing" applied to the rating is based on the number of votes (v) in relation to m. Adjust m until the results satisfy you. And don't misinterpret IMDb's description of m as "minimum votes required to be listed" – this system is perfectly capable of ranking items with less votes than m.
All the formula does is: add m imaginary votes, each with a value of C, before calculating the average. In the beginning, when there isn't enough data (i.e. the number of votes is dramatically less than m), this causes the blanks to be filled in with average data. However, as votes accumulate, eventually the imaginary votes will be drowned out by real ones.
In this system, votes don't cause the rating to fluctuate wildly. Instead, they merely perturb it a bit in some direction.
When there are zero votes, only imaginary votes exist, and all of them are C. Thus, each item begins with a rating of C.
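A direct transcription of the simplified formula above in Python (m = 25000 and C = 7.0 are the IMDb figures quoted earlier; in your own system you would tune both, and the example ratings are made up):

def weighted_rating(R, v, m=25000, C=7.0):
    # Mix m imaginary votes of value C with the item's v real votes averaging R.
    return (R * v + C * m) / (v + m)

print(weighted_rating(R=9.2, v=500))      # few votes  -> pulled towards C (about 7.04)
print(weighted_rating(R=9.2, v=500000))   # many votes -> close to R (about 9.10)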
See also:
A demo. Click "Solve".
Another explanation of IMDb's system.
An explanation of a similar Bayesian star-rating system.
Evan Miller shows a Bayesian approach to ranking 5-star ratings. The sort criterion, as implemented in the code below, is
score = S1 - z_alpha/2 * sqrt( (S2 - S1^2) / (N + K + 1) )
with S1 = sum_k( sk * (nk + 1) ) / (N + K) and S2 = sum_k( sk^2 * (nk + 1) ) / (N + K),
where
nk is the number of k-star ratings,
sk is the "worth" (in points) of k stars,
N is the total number of votes
K is the maximum number of stars (e.g. K=5, in a 5-star rating system)
z_alpha/2 is the 1 - alpha/2 quantile of a normal distribution. If you want 95% confidence (based on the Bayesian posterior distribution) that the actual sort criterion is at least as big as the computed sort criterion, choose z_alpha/2 = 1.65.
In Python, the sorting criterion can be calculated with
import math

def starsort(ns):
    """
    http://www.evanmiller.org/ranking-items-with-star-ratings.html
    """
    N = sum(ns)
    K = len(ns)
    s = list(range(K, 0, -1))     # point value of each star level (K, K-1, ..., 1)
    s2 = [sk**2 for sk in s]
    z = 1.65                      # 95% one-sided confidence
    def f(s, ns):
        N = sum(ns)
        K = len(ns)
        return sum(sk * (nk + 1) for sk, nk in zip(s, ns)) / (N + K)
    fsns = f(s, ns)
    return fsns - z * math.sqrt((f(s2, ns) - fsns**2) / (N + K + 1))
For example, if an item has 60 five-stars, 80 four-stars, 75 three-stars, 20 two-stars and 25 one-stars, then its overall star rating would be about 3.4:
x = (60, 80, 75, 20, 25)
starsort(x)
# 3.3686975120774694
and you can sort a list of 5-star ratings with
sorted([(60, 80, 75, 20, 25), (10,0,0,0,0), (5,0,0,0,0)], key=starsort, reverse=True)
# [(10, 0, 0, 0, 0), (60, 80, 75, 20, 25), (5, 0, 0, 0, 0)]
This shows the effect that more ratings can have upon the overall star value.
You'll find that this formula tends to give an overall rating which is a bit lower than the overall rating reported by sites such as Amazon, Ebay or Wal-mart, particularly when there are few votes (say, less than 300). This reflects the higher uncertainty that comes with fewer votes. As the number of votes increases (into the thousands), all of these overall rating formulas should tend to the (weighted) average rating.
Since the formula only depends on the frequency distribution of 5-star ratings for the item itself, it is easy to combine reviews from multiple sources (or update the overall rating in light of new votes) by simply adding the frequency distributions together.
Unlike the IMDb formula, this formula does not depend on the average score across all items, nor on an artificial minimum-number-of-votes cutoff value.
Moreover, this formula makes use of the full frequency distribution, not just the average number of stars and the number of votes. And it makes sense that it should, since an item with ten 5-star and ten 1-star ratings should be treated as having more uncertainty than (and therefore not be rated as highly as) an item with twenty 3-star ratings:
In [78]: starsort((10,0,0,0,10))
Out[78]: 2.386028063783418
In [79]: starsort((0,0,20,0,0))
Out[79]: 2.795342687927806
The IMDb formula does not take this into account.
See this page for a good analysis of star-based rating systems, and this one for a good analysis of upvote-/downvote- based systems.
For up and down voting you want to estimate the probability that, given the ratings you have, the "real" score (if you had infinite ratings) is greater than some quantity (like, say, the similar number for some other item you're sorting against).
See the second article for the answer, but the conclusion is you want to use the Wilson confidence. The article gives the equation and sample Ruby code (easily translated to another language).
Well, depending on how complex you want to make it, you could have ratings additionally be weighted based on how many ratings the person has made, and what those ratings are. If the person has only made one rating, it could be a shill rating, and might count for less. Or if the person has rated many things in category a, but few in category b, and has an average rating of 1.3 out of 5 stars, it sounds like category a may be artificially weighed down by the low average score of this user, and should be adjusted.
But enough of making it complex. Let’s make it simple.
Assuming we're working with just two values, ReviewCount and AverageRating, for a particular item, it would make sense to me to look at ReviewCount as essentially being the "reliability" value. But we don't just want to bring scores down for low-ReviewCount items: a single one-star rating is probably as unreliable as a single 5-star rating. So what we want to do is probably average towards the middle: 3.
So, basically, I'm thinking of an equation something like X * AverageRating + Y * 3 = the-rating-we-want. In order to make this value come out right we need X + Y to equal 1. We also need X to increase as ReviewCount increases: with a review count of 0, X should be 0 (giving us an equation of "3"), and with an infinite review count X should be 1 (which makes the equation = AverageRating).
So what are the X and Y equations? For the X equation, we want the dependent variable to asymptotically approach 1 as the independent variable approaches infinity. A good pair of equations is something like:
Y = 1/(factor^RatingCount)
and (utilizing the fact that X must be equal to 1 - Y)
X = 1 - (1/(factor^RatingCount))
Then we can adjust "factor" to fit the range that we're looking for.
I used this simple C# program to try a few factors:
// We can adjust this factor to adjust our curve.
double factor = 1.5;
// Here's some sample data
double RatingAverage1 = 5;
double RatingCount1 = 1;
double RatingAverage2 = 4.5;
double RatingCount2 = 5;
double RatingAverage3 = 3.5;
double RatingCount3 = 50000; // 50000 is not infinite, but it's probably plenty to closely simulate it.
// Do the calculations
double modfactor = Math.Pow(factor, RatingCount1);
double modRating1 = (3 / modfactor)
                    + (RatingAverage1 * (1 - 1 / modfactor));

double modfactor2 = Math.Pow(factor, RatingCount2);
double modRating2 = (3 / modfactor2)
                    + (RatingAverage2 * (1 - 1 / modfactor2));

double modfactor3 = Math.Pow(factor, RatingCount3);
double modRating3 = (3 / modfactor3)
                    + (RatingAverage3 * (1 - 1 / modfactor3));

Console.WriteLine(String.Format("RatingAverage: {0}, RatingCount: {1}, Adjusted Rating: {2:0.00}",
                                RatingAverage1, RatingCount1, modRating1));
Console.WriteLine(String.Format("RatingAverage: {0}, RatingCount: {1}, Adjusted Rating: {2:0.00}",
                                RatingAverage2, RatingCount2, modRating2));
Console.WriteLine(String.Format("RatingAverage: {0}, RatingCount: {1}, Adjusted Rating: {2:0.00}",
                                RatingAverage3, RatingCount3, modRating3));
// Hold up for the user to read the data.
Console.ReadLine();
So that you don't have to bother copying it in, it gives this output:
RatingAverage: 5, RatingCount: 1, Adjusted Rating: 3.67
RatingAverage: 4.5, RatingCount: 5, Adjusted Rating: 4.30
RatingAverage: 3.5, RatingCount: 50000, Adjusted Rating: 3.50
Something like that? You could obviously adjust the "factor" value as needed to get the kind of weighting you want.
You could sort by median instead of arithmetic mean. In this case both examples have a median of 5, so both would have the same weight in a sorting algorithm.
You could use a mode to the same effect, but median is probably a better idea.
If you want to assign additional weight to the product with 100 5-star ratings, you'll probably want to go with some kind of weighted mode, assigning more weight to ratings with the same median, but with more overall votes.
If you just need a fast and cheap solution that will mostly work without using a lot of computation, here's one option (assuming a 1-5 rating scale):
SELECT Products.id, Products.title, avg(Ratings.score), etc
FROM
Products INNER JOIN Ratings ON Products.id=Ratings.product_id
GROUP BY
Products.id, Products.title
ORDER BY (SUM(Ratings.score)+25.0)/(COUNT(Ratings.id)+20.0) DESC, COUNT(Ratings.id) DESC
By adding 25.0 to the sum and 20.0 to the count, you are effectively mixing 20 phantom ratings with an average value of 1.25 into every product's score before sorting, which pulls products with few ratings towards that value.
This does have known issues. For example, it unfairly rewards low-scoring products with few ratings (as this graph demonstrates, products with an average score of 1 and just one rating score a 1.2 while products with an average score of 1 and 1k+ ratings score closer to 1.05). You could also argue it unfairly punishes high-quality products with few ratings.
This chart shows what happens for all 5 ratings over 1-1000 ratings:
http://www.wolframalpha.com/input/?i=Plot3D%5B%2825%2Bxy%29/%2820%2Bx%29%2C%7Bx%2C1%2C1000%7D%2C%7By%2C0%2C6%7D%5D
You can see the dip upwards at the very bottom ratings, but overall it's a fair ranking, I think. You can also look at it this way:
http://www.wolframalpha.com/input/?i=Plot3D%5B6-%28%2825%2Bxy%29/%2820%2Bx%29%29%2C%7Bx%2C1%2C1000%7D%2C%7By%2C0%2C6%7D%5D
If you drop a marble on most places in this graph, it will automatically roll towards products with both higher scores and higher ratings.
Obviously, the low number of ratings puts this problem at a statistical handicap. Nevertheless...
A key element to improving the quality of an aggregate rating is to "rate the rater", i.e. to keep tabs on the ratings each particular "rater" has supplied (relative to others). This allows weighing their votes during the aggregation process.
Another solution, more of a cop-out, is to supply the end users with a count (or a range indication thereof) of votes for the underlying item.
One option is something like Microsoft's TrueSkill system, where the score is given by mean - 3*stddev, and the constants can be tweaked.
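A tiny Python sketch of that idea; this is not TrueSkill itself, just the conservative mean-minus-k-sigma estimate applied to raw star ratings, with k = 3 as the tweakable constant:

import statistics

def conservative_score(ratings, k=3):
    # Conservative estimate: mean minus k standard deviations of the observed ratings.
    if len(ratings) < 2:
        return 0.0                      # not enough data to estimate the spread
    return statistics.mean(ratings) - k * statistics.stdev(ratings)

print(conservative_score([5, 5, 5]))          # identical ratings -> 5.0
print(conservative_score([5, 5, 5, 2, 2]))    # mixed ratings are penalised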
After looking for a while, I chose the Bayesian system.
If someone is using Ruby, here is a gem for it:
https://github.com/wbotelhos/rating
I'd highly recommend the book Programming Collective Intelligence by Toby Segaran (O'Reilly), ISBN 978-0-596-52932-1, which discusses how to extract meaningful data from crowd behaviour. The examples are in Python, but it's easy enough to convert.

Resources