statsmodels SARIMAX predictions - statsmodels

I'm trying to understand how to verify an ARIMAX model for > 1 step ahead using statsmodels.
My understanding is the results.get_prediction(start=, dynamic=) api does this but I'm having trouble getting my head around how it works. My training data is indexed by a localised DateTimeIndex (tz='Australia/Sydney') at 15T freq. I want to predict a full day for '2019-02-04 00:00:00+1100', using one-step-ahead prediction up to '2019-02-04 06:00:00+1100' and then the previous predicted endogenous values for the rest of the day.
Is the code below correct? It seems statsmodels converts start to a Timestamp and treats dynamic as a multiple of the freq, so this should run the simulation one step ahead until 06:00 and then use the previously predicted endogenous values. The results don't look great, so I want to confirm it's a model issue rather than a mistake in my setup.
dt = '2019-02-04'
predict = res.get_prediction(start='2019-02-04 00:00:00+11:00')
predict_dy = res.get_prediction(start='2019-02-04 00:00:00+11:00', dynamic=4*6)
fig = plt.figure(figsize=(10,10))
ax = fig.gca()
y_train[dt].plot(ax=ax, style='o', label='Observed')
predict.predicted_mean[dt].plot(ax=ax, style='r--', label='One-step-ahead forecast')
predict_dy.predicted_mean[dt].plot(ax=ax, style='g', label='Dynamic forecast')

It seems statsmodels converts start to a Timestamp
Yes, if you give it a string value, then it will attempt to map it to an index in your dataset (like a timestamp).
and treats dynamic as a multiple of the freq
But this is not correct. dynamic is an integer offset to start. So if dynamic=0, that means that dynamic prediction begins at start, whereas if dynamic=1, that means that dynamic prediction begins at start+1.
It's not quite clear to me what's going on in your example (or what you think is not great about the predictions you generated), so here is an example that shows how dynamic works. A couple of key points for this exercise:
I set all elements of endog to be equal to 1
This is an AR(1) model with parameter 0.5. That means that if we know y_t, then the prediction of y_{t+1} is equal to 0.5 * y_t.
Now, the example code is:
ix = pd.date_range(start='2018-12-01', end='2019-01-31', freq='D')
endog = pd.Series(np.ones(len(ix)), index=ix)
mod = sm.tsa.SARIMAX(endog, order=(1, 0, 0), concentrate_scale=True)
res = mod.smooth([0.5])
p1 = res.predict(start='January 1, 2019', end='January 5, 2019').rename('d=False')
p2 = res.predict(start='January 1, 2019', end='January 5, 2019', dynamic=0).rename('d=0')
p3 = res.predict(start='January 1, 2019', end='January 5, 2019', dynamic=1).rename('d=1')
print(pd.concat([p1, p2, p3], axis=1))
this gives:
d=False d=0 d=1
2019-01-01 0.5 0.50000 0.5000
2019-01-02 0.5 0.25000 0.5000
2019-01-03 0.5 0.12500 0.2500
2019-01-04 0.5 0.06250 0.1250
2019-01-05 0.5 0.03125 0.0625
The first column (d=False) is the default case, where dynamic=False. Here, all predictions are one-step-ahead predictions. Since I set every element of endog to 1 and we have an AR(1) model with parameter 0.5, all one-step-ahead predictions will be equal to 0.5 * 1 = 0.5.
In the second column (d=0), we specify that dynamic=0 so that dynamic prediction begins at the first prediction. This means that we do not use any endog data past start - 1 in forming our predictions, which in this case means we do not use any data past December 31, 2018 in making predictions. The first prediction will be equal to 0.5 times the observation on December 31, 2018, i.e. 0.5 * 1 = 0.5. Each subsequent prediction will be equal to 0.5 * the previous prediction, so the second prediction is 0.5 * 0.5 = 0.25, etc.
The third column (d=1) is like the second column, except that here dynamic=1 so that dynamic prediction begins at the second prediction. This means we do not use any endog data past start (i.e. past January 1, 2019).
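Applied to the 15-minute data in the question, here is a minimal sketch (assuming the fitted results object res and the Sydney-localised index from the question) of turning the 06:00 switch-over into the integer offset that dynamic expects:
import pandas as pd

start = pd.Timestamp('2019-02-04 00:00', tz='Australia/Sydney')
switch = pd.Timestamp('2019-02-04 06:00', tz='Australia/Sydney')
# dynamic is an offset from start, so count the 15-minute steps in between
offset = int((switch - start) / pd.Timedelta(minutes=15))  # 24 steps
# one-step-ahead predictions up to 06:00, dynamic prediction from there on
predict_dy = res.get_prediction(start=start, dynamic=offset)
Note that on a 15-minute index the question's dynamic=4*6 is already a 24-step offset, so dynamic prediction does begin at 06:00 there as well.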

Related

Calculate Marginal effect by hand (without using packages or Stata or R) with logit and dummy variables

I have the following dilemma:
I understand-ish what marginal effects are: the calculation, the derivation of the sigmoid function, and how to interpret them (as the change in probability from increasing your variable of interest by "a little bit", this little bit being 1 for discrete variables or something like std(x)/1000 for continuous ones). The part I find tricky is corroborating the marginal effects by hand: recalculating the probabilities for x=0 and then x=1 (for example) and getting a difference in probability equal to the marginal effect I got earlier. I am particularly stuck with dummy variables, since if I increase one I have to decrease the other one, so I am not so sure how to work around that and interpret it. (This question also applies to highly correlated variables.)
To make it more clear, let's say I have the following dataset:
#Python
[1. , 0. , 0. , 4.6, 3.1, 1.5, 0.2],
[1. , 0. , 1. , 5. , 3.6, 1.4, 0.2],
[1. , 1. , 0. , 5.4, 3.9, 1.7, 0.4],
[1. , 0. , 1. , 4.6, 3.4, 1.4, 0.3],
[1. , 1. , 0. , 5. , 3.4, 1.5, 0.2],
[1. , 0. , 0. , 4.4, 2.9, 1.4, 0.2],
[1. , 0. , 1. , 4.9, 3.1, 1.5, 0.1],
[1. , 1. , 0. , 5.4, 3.7, 1.5, 0.2],
...
Var_0 = the intercept column.
Var_1, Var_2 = one-hot encoded variables (2 of 3 dummies, one dropped to avoid collinearity).
Var_3+ = normal continuous variables.
Coefficients:
[ 7.56986405, 0.75703164, 0.27158741, -0.37447474, -2.79926022, 1.43890492, -2.95286947]
logit
[-3.34739217,
-2.27001103,
-1.49517926,
-0.77178644,
-0.808111,
-2.48474722,
-1.76183804,
-0.90621541
...]
Probabilities
[0.03398066,
0.09363728,
0.18314562,
0.31609279,
0.30829318,
0.0769344 ,
0.14656029,
0.28777491,
...]
Marginal effect = p*(1-p) * B_j
Now let's say that I am interested in the marginal effect of var_1 (one of the dummies), I will simply do: p*(1-p) * 0.7570
This will result in an array of length n (# of obs) with different marginal effects (which is fine, because I understand that the effects are non-constant and non-linear). Let's say this array goes from [0.0008 to 0.0495].
Now the problem is, how can you verify these results? How can I measure the marginal effect when the dummy goes from 0 to 1?
You could argue that I could do two things, the MEM and AME methods:
MEM: Leave all the variables at their means and then recalculate everything for var_1 = 0 and then for var_1 = 1 (MEM method).
(You can't really do this, because you would be assuming observations where var_1 and var_2 are equal to 1 at the same time,
which is incorrect, since the mean of a dummy is essentially the proportion of "1s"
in that column.)
AME: Leave the data as observed, but change all the values of var_1 to 0 (making all the values of var_2 = 1) and then do the opposite (var_1 = 1, var_2 = 0; you have to do this since an observation can't belong to two categories at the same time), and then take the average of the results (AME method). (Side comment: one thing I am not sure about is whether to average the difference in marginal effects when var_1 = 0 and then 1, or the difference in probabilities when var_1 = 0 and then 1; I used both, but probability makes more sense to me.)
Now, if I try the 2nd approach I get very different results from what I originally got (which were values between [0.0008 and 0.0495]); it gives me values between [0.0022 and 0.1207], which is a massive difference.
To summarise:
How can I do a mathematical corroboration to get the same values I got initially ([0.0008 to 0.0495])?
How can I interpret these original values in the first place? Because if I take 0.0495, I am basically saying that if I increase var_1 by 1 unit (from 0 to 1), the probability of my event happening increases by 4.95%. The problem is that this doesn't consider that, to make the 1-unit increase, I must by default decrease the other dummy variable (var_2), so I am effectively making a double change in the variables, or something like a double marginal effect, at the same time.
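For concreteness, here is a minimal sketch (not part of the original question; X and beta are simply the design-matrix rows and coefficients listed above) of the two quantities being compared, the analytic p*(1-p)*B_j effect and the discrete-change AME that swaps the two dummies:
import numpy as np

X = np.array([
    [1., 0., 0., 4.6, 3.1, 1.5, 0.2],
    [1., 0., 1., 5.0, 3.6, 1.4, 0.2],
    [1., 1., 0., 5.4, 3.9, 1.7, 0.4],
    [1., 0., 1., 4.6, 3.4, 1.4, 0.3],
    [1., 1., 0., 5.0, 3.4, 1.5, 0.2],
    [1., 0., 0., 4.4, 2.9, 1.4, 0.2],
    [1., 0., 1., 4.9, 3.1, 1.5, 0.1],
    [1., 1., 0., 5.4, 3.7, 1.5, 0.2],
])
beta = np.array([7.56986405, 0.75703164, 0.27158741, -0.37447474,
                 -2.79926022, 1.43890492, -2.95286947])

def probs(X, beta):
    # logistic probabilities for design matrix X and coefficient vector beta
    return 1.0 / (1.0 + np.exp(-X @ beta))

p = probs(X, beta)

# analytic marginal effect of var_1 (column 1): p*(1-p)*B_1, one value per observation
me_analytic = p * (1 - p) * beta[1]

# discrete-change AME: force var_1=1/var_2=0, then var_1=0/var_2=1, average the difference
X1 = X.copy(); X1[:, 1] = 1; X1[:, 2] = 0
X0 = X.copy(); X0[:, 1] = 0; X0[:, 2] = 1
ame_discrete = (probs(X1, beta) - probs(X0, beta)).mean()

print(me_analytic)
print(ame_discrete)
The two computations answer different questions (the slope of the probability curve at each observation vs. the full jump from the var_2 category to the var_1 category), which is one reason the ranges differ.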

algorithm to map values to unit scale with logarithmic spacing

I'm looking for an algorithm to map values to a unit scale with logarithmic spacing. The scale ranges from 0 to 1. Incoming values would be in the range of 0 to 10000.
0 maps to 0, 1 maps to .2, 10 maps to .4, 100 maps to .6
1000 maps to .8, 10000 maps to 1.0
Help/pointers would be appreciated.
If you are literally looking "to map values to a unit scale with logarithmic spacing", and with f(0)=0, then your example values are wrong.
However, you can do this with f(x) = log(1+x)/log(1+max)
So with max=10000, we have :
f(0)=0
f(1)=0.0753
f(2)=0.1193
f(10)=0.2603
f(100)=0.5010
f(1000)=0.7501
f(10000)=1
which on a log scale makes sense: if 1 is near 0 and 10000 is 1, then 100, which has the average number of zeroes of the previous numbers, should be around 0.5. You really don't want to start considering log(0) as an option.
However, as soon as your minimum value is not 0 anymore (even if the min value is very, very small, as long as it's non-zero), you can do a more reasonable interpolation:
f(x) = (log(x) - log(min)) / (log(max) - log(min))
which is the same as user3246191's comment under his answer:
f(x) = log(x/min) / log(max/min)
Since all values returned by f in this post are fractions of logarithms, you can take the logarithm in any base you please. I would recommend the native one for your programming language (ie if log10(x) is defined as ln(x)/ln(10), take ln(x) instead).
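Here is a small sketch of both formulas from this answer, using the natural log as suggested (plain Python, function names are mine):
import math

def unit_log1p(x, max_val=10000):
    # f(x) = log(1 + x) / log(1 + max): maps 0 -> 0 and max -> 1
    return math.log1p(x) / math.log1p(max_val)

def unit_log_interp(x, min_val, max_val):
    # f(x) = log(x / min) / log(max / min): only valid when min > 0
    return math.log(x / min_val) / math.log(max_val / min_val)

print(unit_log1p(100))                  # ~0.5010, as in the table above
print(unit_log_interp(100, 1, 10000))   # exactly 0.5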
It is not really clear what transform you are trying to apply. From what you describe, it seems that a potential function would be
f(x) = 0.2(1+ log(x)/log(10))
which satisfies f(1) = 0.2, f(10) = 0.4, f(100) = 0.6, f(1000) = 0.8, f(10000) = 1
but on the other hand f(0.1) = 0 and f(0) = -infty.
Of course it is possible to modify f so that f(0) = 0 but this will be somewhat arbitrary and your question is not really well formulated then.

Microsoft Excel cumulative calculation performance

I know 2 ways to calculate cumulative values in Excel.
1st method:
A B
Value Cumulative total
9 =A1
8 =B1+A2
7 =B2+A3
6 =B3+A4
2nd method:
A B
Value Cumulative total
9 =SUM($A$1:A1)
8 =SUM($A$1:A2)
7 =SUM($A$1:A3)
6 =SUM($A$1:A4)
2 questions:
Which method has better performance when the data set gets really big (say 100k rows)? The 1st method seems to have less overhead, because when adding a new value in column A (Value), the new cell in column B only needs to do "B(n-1)+A(n)". Whereas in the 2nd method, is Excel smart enough to do something similar, or will it add up 100k rows from A1:A(n)?
What's the best way to calculate the cumulative values? I find the 2nd method is more popular, though I doubt its performance. The only upside of the 2nd method I can see is that the formula in the column B cells is more consistent; in the 1st method, the 1st cell in column B has to be determined in advance.
The number sequence used was 9, 8, 7, 6, -9, -8, -7, -6; the workbook was set to manual calculation, triggered by the following code:
Sub ManualCalc()
    Dim R As Range
    Set R = Selection
    [F1] = Now()
    R.Worksheet.Calculate
    [F2] = Now()
    [F3] = ([F2] - [F1]) * 86400
End Sub
At 4096 rows calculation time is not measurable for both variants (0 seconds); at 65536 rows your 1st method is still not measurable, while your 2nd method takes a bit less than 8 seconds on my laptop (Dell Latitude E6420, Win7, Office2010 - average of 3 measurements each). So for a high number of rows I would prefer method 1.
Regarding your Q1 ... yes, it would add 100k sums of ever-growing ranges ... Excel is not supposed to be smart, it's supposed to calculate whatever you ask it to calculate. If it did, it would be interpreting the intention of a set of formulas at runtime, which I'd regard as very dangerous!

How to decide on weights?

For my work, I need some kind of algorithm with the following input and output:
Input: a set of dates (from the past). Output: a set of weights - one weight per given date (the sum of all weights = 1).
The basic idea is that the closest date to today's date should receive the highest weight, the second closest date will get the second highest weight, and so on...
Any ideas?
Thanks in advance!
First, for each date in your input set assign the amount of time between the date and today.
For example: the following date set {today, tomorrow, yesterday, a week from today} becomes {0, 1, 1, 7}. Formally: val[i] = abs(today - date[i]).
Second, invert the values so that their relative weights are reversed. The simplest way of doing so would be: val[i] = 1/val[i].
Other suggestions:
val[i] = 1/val[i]^2
val[i] = 1/sqrt(val[i])
val[i] = 1/log(val[i])
The hardest and most important part is deciding how to invert the values. Think about what the nature of the weights should be: do you want noticeable differences between two far-away dates, or should two far-away dates have pretty equal weights? Do you want a date which is very close to today to have a much bigger weight or only a moderately bigger weight?
Note that you should come up with an inverting procedure where you cannot divide by zero. In the example above, dividing by val[i] results in division by zero. One method to avoid division by zero is called smoothing. The most trivial way to "smooth" your data is add-one smoothing, where you just add one to each value (so today becomes 1, tomorrow becomes 2, next week becomes 8, etc.).
Now the easiest part is to normalize the values so that they'll sum up to one.
sum = val[1] + val[2] + ... + val[n]
weight[i] = val[i]/sum for each i
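Putting those three steps together, a minimal sketch (names are made up) with add-one smoothing and a simple 1/val inversion:
from datetime import date, timedelta

def date_weights(dates, today):
    # distance in days, add-one smoothed to avoid 1/0, then inverted and normalized
    vals = [1.0 / (abs((today - d).days) + 1) for d in dates]
    total = sum(vals)
    return [v / total for v in vals]   # weights sum to 1

today = date.today()
dates = [today, today - timedelta(days=1), today - timedelta(days=7)]  # today, yesterday, a week ago
print(date_weights(dates, today))      # the closest date gets the largest weight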
Sort dates and remove dups
Assign values (maybe starting from the farthest date in steps of 10, or whatever you need - these values can be arbitrary, they just reflect order and distance)
Normalize weights to add up to 1
Executable pseudocode (tweakable):
#!/usr/bin/env python
import random, pprint
from operator import itemgetter
# for simplicity's sake dates are integers here ...
pivot_date = 1000
past_dates = set(random.sample(range(1, pivot_date), 5))
weights, stepping = [], 10
for date in sorted(past_dates):
    weights.append( (date, stepping) )
    stepping += 10
sum_of_steppings = sum([ itemgetter(1)(x) for x in weights ])
normalized = [ (d, (w / float(sum_of_steppings)) ) for d, w in weights ]
pprint.pprint(normalized)
# Example output
# The 'date' closest to 1000 (here: 889) has the highest weight,
# 703 the second highest, and so forth ...
# [(151, 0.06666666666666667),
# (425, 0.13333333333333333),
# (571, 0.2),
# (703, 0.26666666666666666),
# (889, 0.3333333333333333)]
How to weight: just compute the difference of all dates and the current date
x(i) = abs(date(i) - current_date)
you can then use different expressions to assign weights:
w(i) = 1/x(i)
w(i) = exp(-x(i))
w(i) = exp(-x(i)^2)
use a Gaussian distribution - more complicated, not recommended
Then use normalized weights: w(i)/sum(w(i)) so that the sum is 1.
(Note that the exponential function is commonly used by statisticians in survival analysis)
The first thing that comes to my mind is to use a geometric series:
http://en.wikipedia.org/wiki/Geometric_series
(1/2)+(1/4)+(1/8)+(1/16)+(1/32)+(1/64)+(1/128)+(1/256)..... sums to one.
Yesterday would be 1/2
2 days ago would be 1/4
and so on
Let i be the index of the i-th date and D0 the first date. Ni is the difference in days between the i-th date and the first date D0, and D is the normalization factor. Assign weights equal to Ni / D.
Convert the dates to yyyymmddhhmiss format (24-hour), add up all these values to get a total, divide each value by that total, and sort by this value.
declare @Data table
(
    Date bigint,
    Weight float
)
declare @sumTotal decimal(18,2)

insert into @Data (Date)
select top 100
    replace(replace(replace(convert(varchar,Datetime,20),'-',''),':',''),' ','')
from Dates

select @sumTotal=sum(Date)
from @Data

update @Data set
    Weight=Date/@sumTotal

select * from @Data order by 2 desc

What is the best way to compute trending topics or tags?

Many sites offer some statistics like "The hottest topics in the last 24h". For example, Topix.com shows this in its section "News Trends". There, you can see the topics which have the fastest growing number of mentions.
I want to compute such a "buzz" for a topic, too. How could I do this? The algorithm should give less weight to topics which are always hot. Topics which normally (almost) no one mentions should be the hottest ones.
Google offers "Hot Trends", topix.com shows "Hot Topics", fav.or.it shows "Keyword Trends" - all these services have one thing in common: They only show you upcoming trends which are abnormally hot at the moment.
Terms like "Britney Spears", "weather" or "Paris Hilton" won't appear in these lists because they're always hot and frequent. This article calls this "The Britney Spears Problem".
My question: How can you code an algorithm or use an existing one to solve this problem? Having a list with the keywords searched in the last 24h, the algorithm should show you the 10 (for example) hottest ones.
I know, in the article above, there is some kind of algorithm mentioned. I've tried to code it in PHP but I don't think that it'll work. It just finds the majority, doesn't it?
I hope you can help me (coding examples would be great).
This problem calls for a z-score or standard score, which will take into account the historical average, as other people have mentioned, but also the standard deviation of this historical data, making it more robust than just using the average.
In your case a z-score is calculated by the following formula, where the trend would be a rate such as views / day.
z-score = ([current trend] - [average historic trends]) / [standard deviation of historic trends]
When a z-score is used, the higher or lower the z-score the more abnormal the trend; for example, if the z-score is highly positive then the trend is abnormally rising, while if it is highly negative it is abnormally falling. So once you calculate the z-score for all the candidate trends, the 10 highest z-scores will correspond to the most abnormally increasing trends.
Please see Wikipedia for more information about z-scores.
Code
from math import sqrt
def zscore(obs, pop):
    # Size of population.
    number = float(len(pop))
    # Average population value.
    avg = sum(pop) / number
    # Standard deviation of population.
    std = sqrt(sum(((c - avg) ** 2) for c in pop) / number)
    # Zscore Calculation.
    return (obs - avg) / std
Sample Output
>>> zscore(12, [2, 4, 4, 4, 5, 5, 7, 9])
3.5
>>> zscore(20, [21, 22, 19, 18, 17, 22, 20, 20])
0.0739221270955
>>> zscore(20, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1])
1.00303599234
>>> zscore(2, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1])
-0.922793112954
>>> zscore(9, [1, 2, 0, 3, 1, 3, 1, 2, 9, 8, 7, 10, 9, 5, 2, 4, 1, 1, 0])
1.65291949506
Notes
You can use this method with a sliding window (i.e. last 30 days) if you wish not to take too much history into account, which will make short-term trends more pronounced and can cut down on the processing time.
You could also use a z-score for values such as change in views from one day to next day to locate the abnormal values for increasing/decreasing views per day. This is like using the slope or derivative of the views per day graph.
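For example, a tiny (made-up) illustration of feeding day-over-day changes into the same zscore() function defined above:
views = [40, 42, 41, 45, 44, 90]                         # made-up daily view counts
deltas = [b - a for a, b in zip(views[:-1], views[1:])]  # day-over-day changes
print(zscore(deltas[-1], deltas[:-1]))                   # how unusual is the latest jump?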
If you keep track of the current size of the population, the current total of the population, and the current total of x^2 of the population, you don't need to recalculate these values, only update them and hence you only need to keep these values for the history, not each data value. The following code demonstrates this.
from math import sqrt
class zscore:
    def __init__(self, pop = []):
        self.number = float(len(pop))
        self.total = sum(pop)
        self.sqrTotal = sum(x ** 2 for x in pop)
    def update(self, value):
        self.number += 1.0
        self.total += value
        self.sqrTotal += value ** 2
    def avg(self):
        return self.total / self.number
    def std(self):
        return sqrt((self.sqrTotal / self.number) - self.avg() ** 2)
    def score(self, obs):
        return (obs - self.avg()) / self.std()
Using this method your workflow would be as follows. For each topic, tag, or page create floating point fields for the total number of days, the sum of views, and the sum of views squared in your database. If you have historic data, initialize these fields using that data; otherwise initialize them to zero. At the end of each day, calculate the z-score using the day's number of views against the historic data stored in the three database fields. The topics, tags, or pages with the highest X z-scores are your X "hottest trends" of the day. Finally, update each of the 3 fields with the day's value and repeat the process the next day.
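As a (hypothetical) illustration of that daily workflow with the class above, using made-up view counts:
history = zscore([450, 500, 480, 520, 510])   # number/total/sqrTotal map to the three stored fields
todays_views = 900
print(history.score(todays_views))            # today's z-score against the stored history
history.update(todays_views)                  # then fold today into the running totals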
New Addition
Normal z-scores as discussed above do not take into account the order of the data and hence the z-score for an observation of '1' or '9' would have the same magnitude against the sequence [1, 1, 1, 1, 9, 9, 9, 9]. Obviously for trend finding, the most current data should have more weight than older data and hence we want the '1' observation to have a larger magnitude score than the '9' observation. In order to achieve this I propose a floating average z-score. It should be clear that this method is NOT guaranteed to be statistically sound but should be useful for trend finding or similar. The main difference between the standard z-score and the floating average z-score is the use of a floating average to calculate the average population value and the average population value squared. See code for details:
Code
class fazscore:
    def __init__(self, decay, pop = []):
        self.sqrAvg = self.avg = 0
        # The rate at which the historic data's effect will diminish.
        self.decay = decay
        for x in pop: self.update(x)
    def update(self, value):
        # Set initial averages to the first value in the sequence.
        if self.avg == 0 and self.sqrAvg == 0:
            self.avg = float(value)
            self.sqrAvg = float((value ** 2))
        # Calculate the average of the rest of the values using a
        # floating average.
        else:
            self.avg = self.avg * self.decay + value * (1 - self.decay)
            self.sqrAvg = self.sqrAvg * self.decay + (value ** 2) * (1 - self.decay)
        return self
    def std(self):
        # Somewhat ad-hoc standard deviation calculation.
        return sqrt(self.sqrAvg - self.avg ** 2)
    def score(self, obs):
        if self.std() == 0: return (obs - self.avg) * float("infinity")
        else: return (obs - self.avg) / self.std()
Sample IO
>>> fazscore(0.8, [1, 1, 1, 1, 1, 1, 9, 9, 9, 9, 9, 9]).score(1)
-1.67770595327
>>> fazscore(0.8, [1, 1, 1, 1, 1, 1, 9, 9, 9, 9, 9, 9]).score(9)
0.596052006642
>>> fazscore(0.9, [2, 4, 4, 4, 5, 5, 7, 9]).score(12)
3.46442230724
>>> fazscore(0.9, [2, 4, 4, 4, 5, 5, 7, 9]).score(22)
7.7773245459
>>> fazscore(0.9, [21, 22, 19, 18, 17, 22, 20, 20]).score(20)
-0.24633160155
>>> fazscore(0.9, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1]).score(20)
1.1069362749
>>> fazscore(0.9, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1]).score(2)
-0.786764452966
>>> fazscore(0.9, [1, 2, 0, 3, 1, 3, 1, 2, 9, 8, 7, 10, 9, 5, 2, 4, 1, 1, 0]).score(9)
1.82262469243
>>> fazscore(0.8, [40] * 200).score(1)
-inf
Update
As David Kemp correctly pointed out, if you are given a series of constant values and then request a z-score for an observed value which differs from the other values, the result should probably be non-zero. In fact, the value returned should be infinity. So I changed this line,
if self.std() == 0: return 0
to:
if self.std() == 0: return (obs - self.avg) * float("infinity")
This change is reflected in the fazscore solution code. If one does not want to deal with infinite values an acceptable solution could be to instead change the line to:
if self.std() == 0: return obs - self.avg
You need an algorithm that measures the velocity of a topic - or in other words, if you graph it you want to show those that are going up at an incredible rate.
This is the first derivative of the trend line, and it is not difficult to incorporate as a weighted factor of your overall calculation.
Normalize
One technique you'll need to do is to normalize all your data. For each topic you are following, keep a very low pass filter that defines that topic's baseline. Now every data point that comes in about that topic should be normalized - subtract its baseline and you'll get ALL of your topics near 0, with spikes above and below the line. You may instead want to divide the signal by its baseline magnitude, which will bring the signal to around 1.0 - this not only brings all signals in line with each other (normalizes the baseline), but also normalizes the spikes. A britney spike is going to be magnitudes larger than someone else's spike, but that doesn't mean you should pay attention to it - the spike may be very small relative to her baseline.
Derive
Once you've normalized everything, figure out the slope of each topic. Take two consecutive points and measure the difference. A positive difference is trending up, a negative difference is trending down. Then you can compare the normalized differences and find out which topics are shooting upward in popularity compared to other topics - with each topic scaled appropriately to its own 'normal', which may be orders of magnitude different from other topics.
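A rough sketch (made-up numbers, a simple exponential smoother standing in for the low-pass filter) of the normalize-then-derive idea from the two steps above:
alpha = 0.05                              # low-pass constant: the baseline moves slowly
baseline = 100.0                          # assume a baseline already learned from history
topic_hits = [100, 102, 99, 180, 320]     # made-up hit counts for one topic

normalized = []
for hits in topic_hits:
    normalized.append(hits / baseline)                 # normalize by the baseline magnitude
    baseline = baseline * (1 - alpha) + hits * alpha   # slowly update the low-pass baseline

# slope of the normalized signal: large positive differences = shooting upward
slopes = [b - a for a, b in zip(normalized[:-1], normalized[1:])]
print(slopes)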
This is really a first-pass at the problem. There are more advanced techniques which you'll need to use (mostly a combination of the above with other algorithms, weighted to suit your needs) but it should be enough to get you started.
Regarding the article
The article is about topic trending, but it's not about how to calculate what's hot and what's not; it's about how to process the huge amount of information that such an algorithm must process at places like Lycos and Google. The space and time required to give each topic a counter, and to find each topic's counter when a search on it goes through, is huge. This article is about the challenges one faces when attempting such a task. It does mention the Britney effect, but it doesn't talk about how to overcome it.
As Nixuz points out this is also referred to as a Z or Standard Score.
Chad Birch and Adam Davis are correct in that you will have to look backward to establish a baseline. Your question, as phrased, suggests that you only want to view data from the past 24 hours, and that won't quite fly.
One way to give your data some memory without having to query for a large body of historical data is to use an exponential moving average. The advantage of this is that you can update this once per period and then flush all old data, so you only need to remember a single value. So if your period is a day, you have to maintain a "daily average" attribute for each topic, which you can do by:
a_n = a_(n-1)*b + c_n*(1-b)
Where a_n is the moving average as of day n, b is some constant between 0 and 1 (the closer to 1, the longer the memory) and c_n is the number of hits on day n. The beauty is if you perform this update at the end of day n, you can flush c_n and a_(n-1).
The one caveat is that it will be initially sensitive to whatever you pick for your initial value of a.
EDIT
If it helps to visualize this approach, take n = 5, a_0 = 1, and b = .9.
Let's say the new values are 5,0,0,1,4:
a_0 = 1
c_1 = 5 : a_1 = .9*1 + .1*5 = 1.4
c_2 = 0 : a_2 = .9*1.4 + .1*0 = 1.26
c_3 = 0 : a_3 = .9*1.26 + .1*0 = 1.134
c_4 = 1 : a_4 = .9*1.134 + .1*1 = 1.1206
c_5 = 4 : a_5 = .9*1.1206 + .1*4 = 1.40854
Doesn't look very much like an average, does it? Note how the value stayed close to 1, even though our next input was 5. What's going on? If you expand out the math, what you get is:
a_n = (1-b)*c_n + (1-b)*b*c_(n-1) + (1-b)*b^2*c_(n-2) + ... + (leftover weight)*a_0
What do I mean by leftover weight? Well, in any average, all weights must add to 1. If n were infinity and the ... could go on forever, then all weights would sum to 1. But if n is relatively small, you get a good amount of weight left on the original input.
If you study the above formula, you should realize a few things about this usage:
All data contributes something to the average forever. Practically speaking, there is a point where the contribution is really, really small.
Recent values contribute more than older values.
The higher b is, the less important new values are and the longer old values matter. However, the higher b is, the more data you need to water down the initial value of a.
I think the first two characteristics are exactly what you are looking for. To give you an idea of how simple this can be to implement, here is a python implementation (minus all the database interaction):
>>> class EMA(object):
...     def __init__(self, base, decay):
...         self.val = base
...         self.decay = decay
...         print self.val
...     def update(self, value):
...         self.val = self.val*self.decay + (1-self.decay)*value
...         print self.val
...
>>> a = EMA(1, .9)
1
>>> a.update(10)
1.9
>>> a.update(10)
2.71
>>> a.update(10)
3.439
>>> a.update(10)
4.0951
>>> a.update(10)
4.68559
>>> a.update(10)
5.217031
>>> a.update(10)
5.6953279
>>> a.update(10)
6.12579511
>>> a.update(10)
6.513215599
>>> a.update(10)
6.8618940391
>>> a.update(10)
7.17570463519
Typically "buzz" is figured out using some form of exponential/log decay mechanism. For an overview of how Hacker News, Reddit, and others handle this in a simple way, see this post.
This doesn't fully address the things that are always popular. What you're looking for seems to be something like Google's "Hot Trends" feature. For that, you could divide the current value by a historical value and then subtract out ones that are below some noise threshold.
I think the key word you need to notice is "abnormally". In order to determine when something is "abnormal", you have to know what is normal. That is, you're going to need historical data, which you can average to find out the normal rate of a particular query. You may want to exclude abnormal days from the averaging calculation, but again that'll require having enough data already, so that you know which days to exclude.
From there, you'll have to set a threshold (which would require experimentation, I'm sure), and if something goes outside the threshold, say 50% more searches than normal, you can consider it a "trend". Or, if you want to be able to find the "Top X Trendiest" like you mentioned, you just need to order things by how far (percentage-wise) they are away from their normal rate.
For example, let's say that your historical data has told you that Britney Spears usually gets 100,000 searches, and Paris Hilton usually gets 50,000. If you have a day where they both get 10,000 more searches than normal, you should be considering Paris "hotter" than Britney, because her searches increased 20% more than normal, while Britney's were only 10%.
God, I can't believe I just wrote a paragraph comparing "hotness" of Britney Spears and Paris Hilton. What have you done to me?
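A toy sketch of that ranking, using the made-up baselines and counts from the example above:
baseline = {'britney spears': 100000, 'paris hilton': 50000}   # usual daily searches
today = {'britney spears': 110000, 'paris hilton': 60000}      # both 10,000 above normal

# rank by percentage deviation from each term's own normal rate
deviation = {t: (today[t] - baseline[t]) / baseline[t] for t in baseline}
for term, dev in sorted(deviation.items(), key=lambda kv: kv[1], reverse=True):
    print(term, round(100 * dev, 1), '%')   # Paris (+20%) ranks above Britney (+10%)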
I was wondering if it is at all possible to use the regular physics acceleration formula in such a case?
(v2 - v1)/t, or dv/dt
We can consider v1 to be the initial likes/votes/count-of-comments per hour and v2 to be the current "velocity" per hour over the last 24 hours.
This is more like a question than an answer, but seems it may just work. Any content with highest acceleration will be the trending topic...
I am sure this may not solve Britney Spears problem :-)
probably a simple gradient of topic frequency would work -- large positive gradient = growing quickly in popularity.
the easiest way would be to bin the number of searches each day, so you have something like
searches = [ 10, 7, 14, 8, 9, 12, 55, 104, 100 ]
and then find out how much it changed from day to day:
hot_factor = [ b-a for a, b in zip(searches[:-1], searches[1:]) ]
# hot_factor is [ -3, 7, -6, 1, 3, 43, 49, -4 ]
and just apply some sort of threshold so that days where the increase was > 50 are considered 'hot'. you could make this far more complicated if you'd like, too. rather than absolute difference you can take the relative difference so that going from 100 to 150 is considered hot, but 1000 to 1050 isn't. or a more complicated gradient that takes into account trends over more than just one day to the next.
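For instance, the relative-difference variant mentioned at the end could look like this (same searches list as above):
searches = [10, 7, 14, 8, 9, 12, 55, 104, 100]
# relative day-over-day change: 100 -> 150 registers strongly, 1000 -> 1050 barely does
rel_hot_factor = [(b - a) / float(a) for a, b in zip(searches[:-1], searches[1:])]
hot_days = [i + 1 for i, r in enumerate(rel_hot_factor) if r > 0.5]   # days with a >50% jump
print(rel_hot_factor)
print(hot_days)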
I worked on a project where my aim was finding Trending Topics from the Live Twitter Stream and also doing sentiment analysis on the trending topics (finding whether a Trending Topic is talked about positively or negatively). I've used Storm for handling the twitter stream.
I've published my report as a blog: http://sayrohan.blogspot.com/2013/06/finding-trending-topics-and-trending.html
I've used Total Count and Z-Score for the ranking.
The approach that I've used is a bit generic, and in the discussion section I've mentioned how the system can be extended for non-Twitter applications.
Hope the information helps.
You could use log-likelihood-ratios to compare the current date with the last month or year. This is statistically sound (given that your events are not normally distributed, which is to be assumed from your question).
Just sort all your terms by logLR and pick the top ten.
public static void main(String... args) {
TermBag today = ...
TermBag lastYear = ...
for (String each: today.allTerms()) {
System.out.println(logLikelihoodRatio(today, lastYear, each) + "\t" + each);
}
}
public static double logLikelihoodRatio(TermBag t1, TermBag t2, String term) {
double k1 = t1.occurrences(term);
double k2 = t2.occurrences(term);
double n1 = t1.size();
double n2 = t2.size();
double p1 = k1 / n1;
double p2 = k2 / n2;
double p = (k1 + k2) / (n1 + n2);
double logLR = 2*(logL(p1,k1,n1) + logL(p2,k2,n2) - logL(p,k1,n1) - logL(p,k2,n2));
if (p1 < p2) logLR *= -1;
return logLR;
}
private static double logL(double p, double k, double n) {
return (k == 0 ? 0 : k * Math.log(p)) + ((n - k) == 0 ? 0 : (n - k) * Math.log(1 - p));
}
PS, a TermBag is an unordered collection of words. For each document you create one bag of terms. Just count the occurrences of words. Then the method occurrences returns the number of occurrences of a given word, and the method size returns the total number of words. It is best to normalize the words somehow, typically toLowerCase is good enough. Of course, in the above examples you would create one document with all queries of today, and one with all queries of the last year.
If you simply look at tweets, or status messages to get your topics, you're going to encounter a lot of noise. Even if you remove all stop words. One way to get a better subset of topic candidates is to focus only on tweets/messages that share a URL, and get the keywords from the title of those web pages. And make sure you apply POS tagging to get nouns + noun phrases as well.
Titles of web pages usually are more descriptive and contain words that describe what the page is about. In addition, sharing a web page usually is correlated with sharing news that is breaking (ie if a celebrity like Michael Jackson died, you're going to get a lot of people sharing an article about his death).
I've run experiments where I only take popular keywords from titles, AND then get the total counts of those keywords across all status messages, and they definitely remove a lot of noise. If you do it this way, you don't need a complex algorithm; just do a simple ordering of the keyword frequencies and you're halfway there.
The idea is to keep track of such things and notice when they jump significantly as compared to their own baseline.
So, for queries that have more than a certain threshold, track each one, and when a query changes to some value (say almost double) of its historical value, then it is a new hot trend.
