Why does this hash table lookup probe like it does?

This code (extracted from an LZW compression program of unknown origin) finds an empty slot in a hash table of size 5021, indexed from 0 to 5020:
probe := <random 12-bit hash key>
// probe is initially 0 to 4095
repeat
{
    if table[probe] is empty then return(probe);
    if probe == 0 then probe := -1 else dec(probe, 5021-probe);
    if probe < 0 then inc(probe, 5021);
}
This isn't quite the typical linear or quadratic probing. Why probe like that? Is this a known probing algorithm, and where can I find out more about it?

The algorithm for calculating the new probe is simple despite its ugly looks:
if probe == 0
    probe := 5020
else
    probe := (2*probe) % 5021
It is not quite clear why this function was picked, but it really does go through all possible positions 1..5020 in a cyclic and seemingly random way (and 0 is mapped to 5020, so it joins the cycle). (No, it doesn't; see the oops below!) It should be noted that the number 5021 is slightly magical in this context.
This algorithm is actually a linear congruential generator (LCG), here a purely multiplicative one (the update is probe := 2*probe mod 5021). See http://en.wikipedia.org/wiki/Linear_congruential_generator
OOPS: It is an LCG, but not one with the full period, because 2^1004 % 5021 == 1. There are five different cycles with the following members: (1, 2, 4, 8, ...), (3, 6, 12, 24, ...), (7, 14, 28, 56, ...), (9, 18, 36, 72, ...), and (11, 22, 44, 88, ...). It is even odder that someone chose to use this algorithm. (Or perhaps I analysed it wrong.)
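A throwaway sketch to check that cycle structure (the modulus 5021 comes from the question, and 1, 3, 7, 9 and 11 are the cycle representatives listed above):

def cycle_containing(start, m=5021):
    # Follow probe -> 2*probe % m until we come back to where we started.
    cycle = [start]
    p = (2 * start) % m
    while p != start:
        cycle.append(p)
        p = (2 * p) % m
    return cycle

lengths = [len(cycle_containing(s)) for s in (1, 3, 7, 9, 11)]
# If the analysis above is right, each length is 1004 and 5 * 1004 = 5020,
# so the five cycles together cover every position 1..5020.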


Recovering element of an array, given sums of items at indexes matching bitmasks

Suppose there was an array E of 2^n elements. For example:
E = [2, 3, 5, 7, 11, 13, 17, 19]
Unfortunately, someone has come along and scrambled the array. They took all elements whose index in binary is of the form 1XX, and added them into the elements at index 0XX (i.e. they did E[0] += E[1], E[2] += E[3], etc.). Then they did the same thing for indexes like X1X into X0X, and for XX1 into XX0.
More specifically, they ran this pseudo-code over the array:
def scramble(e):
    n = len(e).bit_length() - 1          # lg_2(len(e)); len(e) is assumed to be a power of two
    for p in range(n):
        m = 1 << p
        for i in range(len(e)):
            if (i & m) != 0:
                e[i - m] += e[i]
In terms of our example, this causes:
E_1 = [2+3, 3, 5+7, 7, 11+13, 13, 17+19, 19]
E_1 = [5, 3, 12, 7, 24, 13, 36, 19]
E_2 = [5+12, 3+7, 12, 7, 24+36, 13+19, 36, 19]
E_2 = [17, 10, 12, 7, 60, 32, 36, 19]
E_3 = [17+60, 10+32, 12+36, 7+19, 60, 32, 36, 19]
E_3 = [77, 42, 48, 26, 60, 32, 36, 19]
You're given the array after it's been scrambled (i.e. your input is E_3). Your goal is to recover the original first element of E (i.e. the number 2).
One way to get the 2 back is to undo all the scrambling: run the scrambling code, but with the += replaced by a -=. However, doing that is very expensive; it takes O(n * 2^n) time. Is there a faster way?
Alternate Form
Stated another way, I give you an array S where the element at index i is the sum of all elements from a list E with an index j satisfying (j & i) == i. For example, S[101110] is E[101110] + E[111110] + E[101111] + E[111111]. How expensive is it to recover an element of E, given S?
The item at 111111... is easy, because S[111111...] = E[111111...], but S[000000...] depends on all the elements from E in a non-uniform way, so it seems to be harder to get back.
Extended
What if we don't just want to recover the original items, but want to recover sums of the original items that match a mask that can specify must-be-1, no-constraint, and must-be-0? Is this harder?
Call the number of items in the array N, and the size of the bitmasks being used B so N = 2^B.
You can't do better than O(N).
The example solution in the question, which just runs the scrambling in reverse, takes O(N B) time. We can reduce that to O(N) by discarding items that won't contribute to the actual value we read at the end. This makes the unscrambling much simpler, actually: just iteratively subtract the last half of the array from the first half, then discard the last half, until you have one item left.
def unscrambleFirst(S):
    S = list(S)
    while len(S) > 1:
        h = len(S) // 2
        S = [S[i] - S[i + h] for i in range(h)]  # item-by-item subtraction of the two halves
    return S[0]
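A quick check against the example above (E_3 is the scrambled array from the question):

unscrambleFirst([77, 42, 48, 26, 60, 32, 36, 19])   # -> 2, the original first element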
It's not possible to go faster than O(N). We can prove it with linear algebra.
The original array has N independent items, i.e. it is a vector with N degrees of freedom.
The scrambling operation only uses linear operations, and so is equivalent to multiplying that vector by a matrix. (The matrix is [[1, 1], [0, 1]] tiled inside of itself B times; it ends up looking like a Sierpinski triangle).
The scrambling operation matrix is invertible (that's why we can undo the scrambling).
Therefore the scrambled vector must still have N degrees of freedom.
But our O(N) solution is a linear combination of every element of the scrambled vector.
And since the elements of the scrambled vector must all be linearly independent for there to be N degrees of freedom in it, we can't rewrite the usage of any one element with usage of the others.
Therefore we can't change which items we rely on, and we know that we rely on all of them in one case so it must be all of them in all cases.
Hopefully that's clear enough. The scrambling distributes the first item in a way that requires you to look at every item to get it back.

Unique pair of two integers [duplicate]

Possible Duplicate:
Mapping two integers to one, in a unique and deterministic way
I'm trying to create a unique identifier for a pair of two integers (Ruby):
f(i1,i2) = f(i2, i1) = some_unique_value
So i1+i2, i1*i2, and i1^i2 are not unique, and neither is (i1>i2) ? "i1" + "i2" : "i2" + "i1".
I think the following solution will be OK:
(i1>i2) ? "i1" + "_" + "i2" : "i2" + "_" + "i1"
but:
I have to save the result in a DB and index it, so I'd prefer it to be an integer and as small as possible.
Can Zlib.crc32(f(i1,i2)) guarantee uniqueness?
Thanks.
UPD:
Actually, I'm not sure the result MUST be an integer. Maybe I can convert it to a decimal:
(i1>i2) ? i1.i2 : i2.i1
?
What you're looking for is called a Pairing function.
The illustration on the German Wikipedia page shows how it works: the pairing enumerates the pairs along the diagonals of the (n, m) grid.
Implemented in Ruby:
def cantor_pairing(n, m)
  (n + m) * (n + m + 1) / 2 + m
end

(0..5).map do |n|
  (0..5).map do |m|
    cantor_pairing(n, m)
  end
end
=> [[ 0, 2, 5, 9, 14, 20],
[ 1, 4, 8, 13, 19, 26],
[ 3, 7, 12, 18, 25, 33],
[ 6, 11, 17, 24, 32, 41],
[10, 16, 23, 31, 40, 50],
[15, 22, 30, 39, 49, 60]]
Note that you will need to store the result of this pairing in a datatype with as many bits as both your input numbers put together. (If both input numbers are 32-bit, you will need a 64-bit datatype to be able to store all possible combinations, obviously.)
No, Zlib.crc32(f(i1,i2)) is not unique for all integer values of i1 and i2.
If i1 and i2 are also 32-bit numbers, then there are many more combinations of them than can be stored in the 32-bit number returned by CRC32.
CRC32 is not unique, and wouldn't be good to use as a key. Assuming you know the maximum value of your integers i1 and i2:
unique_id = (max_i2+1)*i1 + i2
If your integers can be negative, or will never be below a certain positive integer, you'll need the max and min values:
(max_i2-min_i2+1) * (i1-min_i1) + (i2-min_i2)
This will give you the absolute smallest number possible to identify both integers.
Well, no 4-byte hash will be unique when its input is an arbitrary binary string of more than 4 bytes. Your strings are from a highly restricted symbol set, so collisions will be fewer, but "no, not unique".
There are two ways to use a smaller integer than the possible range of values for both of your integers:
Have a system that works despite occasional collisions
Check for collisions and use some sort of rehash
The obvious way to solve your problem with a 1:1 mapping requires that you know the maximum value of one of the integers. Just multiply one by the maximum value and add the other, or determine a power of two ceiling, shift one value accordingly, then OR in the other. Either way, every bit is reserved for one or the other of the integers. This may or may not meet your "as small as possible" requirement.
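A minimal sketch of that shift-and-OR packing, combined with the sort-first trick from the question so the result is order-insensitive (Python for consistency with the other snippets here; the Ruby version is analogous, and the 32-bit width is an assumption about the inputs):

def pack_unordered_pair(i1, i2, bits=32):
    # Sort first so that f(i1, i2) == f(i2, i1), then reserve `bits` bits for each value.
    a, b = sorted((i1, i2))
    return (a << bits) | b

def unpack_pair(packed, bits=32):
    return packed >> bits, packed & ((1 << bits) - 1)

# pack_unordered_pair(3, 7) == pack_unordered_pair(7, 3)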
Your ###_### string is unique per pair; if you could just store that as a string you win.
Here's a better, more space-efficient solution: my answer on it here.

algorithm to find number of integers with given digits within a given range

If I am given the full set of digits in the form of a list and I want to know how many (valid) integers they can form within a given range [A, B], what algorithm can I use to do it efficiently?
For example, given a list of digits (containing duplicates and zeros) list={5, 3, 3, 2, 0, 0}, I want to know how many integers can be formed in the range [A, B]=[20, 400] inclusive. For example, in this case, 20, 23, 25, 30, 32, 33, 35, 50, 52, 53, 200, 203, 205, 230, 233, 235, 250, 253, 300, 302, 303, 305, 320, 323, 325, 330, 332, 335, 350, 352, 353 are all valid.
Step 1: Find the number of digits your answers are likely to fall in. In your example it is 2 or 3.
Step 2: For a given number size (number of digits):
Step 2a: Pick the possibilities for the first (most significant) digit. Find the min and max number starting with that digit (ascending or descending order of the rest of the digits). If both of them fall into the range:
Step 2ai: Count the numbers starting with that first digit and update the count.
Step 2b: Else if both max and min are out of range, ignore.
Step 2c: Otherwise, add each possible digit as the second most significant digit and repeat the same step.
Solving by example of your case:
For number size of 2 i.e. __:
0_ : Ignore since it starts with 0
2_ : Minimum=20, Max=25. Both are in range. So update count by 3 (second digit might be 0,3,5)
3_ : Minimum=30, Max=35. Both are in range. So update count by 4 (second digit might be 0,2,3,5)
5_ : Minimum=50, Max=53. Both are in range. So update count by 3 (second digit might be 0,2,3)
For size 3:
0__ : Ignore since it starts with 0
2__ : Minimum=200, max=253. Both are in range. Find the number of ways you can choose 2 numbers from a set of {0,0,3,3,5}, and update the count.
3__ : Minimum=300, max=353. Both are in range. Find the number of ways you can choose 2 numbers from a set of {0,0,2,3,5}, and update the count.
5__ : Minimum=500, max=532. Both are out of range. Ignore.
A more interesting case is when max limit is 522 (instead of 400):
5__ : Minimum=500, max=532. Max out of range.
50_: Minimum=500, Max=503. Both in range. Add the number of ways you can choose one digit from {0,2,3} (the digits remaining once the 5 and a 0 are used).
52_: Minimum=520, Max=523. Max out of range.
520: In range. Add 1 to count.
523: Out of range. Ignore.
(522 itself cannot be formed, since the digit list contains only one 2.)
53_: Minimum=530, Max=532. Both are out of range. Ignore.
def countComb(currentVal, digSize, maxVal, minVal, remSet):
    minPosVal, maxPosVal = calculateMinMax(currentVal, digSize, remSet)
    if minVal <= minPosVal <= maxVal and minVal <= maxPosVal <= maxVal:
        return numberPermutations(remSet, digSize, currentVal)
    elif (minPosVal < minVal and maxPosVal < minVal) or (minPosVal > maxVal and maxPosVal > maxVal):
        return 0
    else:
        count = 0
        for k in set(remSet):                   # the unique remaining digits
            if currentVal == '' and k == '0':   # numbers may not start with 0
                continue
            tmpRemSet = [i for i in remSet]
            tmpRemSet.remove(k)
            count += countComb(currentVal + k, digSize, maxVal, minVal, tmpRemSet)
        return count
In your case: countComb('',2,400,20,['0','0','2','3','3','5']) +
countComb('',3,400,20,['0','0','2','3','3','5']) will give the answer.
def calculateMinMax(currentVal, digSize, remSet):
    numRemain = digSize - len(currentVal)
    # Smallest / largest completion: the currentVal prefix followed by the
    # remaining digits in ascending / descending order.
    minPosVal = int(currentVal + ''.join(sorted(remSet)[:numRemain]))
    maxPosVal = int(currentVal + ''.join(sorted(remSet, reverse=True)[:numRemain]))
    return minPosVal, maxPosVal
numberPermutations(remSet, digSize, currentVal): basically the number of ways you can choose (digSize - len(currentVal)) values from remSet; see permutations with repeats.
If the range is small but the list is big, the easy solution is to just loop over the range and check whether every number can be generated from the list. The checking can be made fast by using a hash table or an array with a count of how many times each digit in the list can still be used.
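A minimal sketch of that brute-force check (the function name and the use of collections.Counter are my own choices):

from collections import Counter

def count_in_range(digits, lo, hi):
    # Count the numbers in [lo, hi] whose digits form a sub-multiset of `digits`.
    avail = Counter(str(d) for d in digits)
    total = 0
    for n in range(lo, hi + 1):
        need = Counter(str(n))
        if all(need[d] <= avail[d] for d in need):
            total += 1
    return total

# count_in_range([5, 3, 3, 2, 0, 0], 20, 400) returns 31, matching the list in the question.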
For a list of n digits, z of which are zero, a lower bound l, and an upper bound u...
Step 1: The Easy Stuff
Consider a situation in which you have a 2-digit lower bound and a 4-digit upper bound. While it might be tricky to determine how many 2- and 4-digit numbers are within the bounds, we at least know that all 3-digit numbers are. And if the bounds were a 2-digit number and a 5-digit number, you know that all 3- and 4-digit numbers are fair game.
So let's generalize this to a lower bound with a digits and an upper bound with b digits. For every k between a and b (not including a and b themselves), all k-digit numbers are within the range.
How many such numbers are there? Consider how you'd pick them: the first digit must be one of the non-zero digits (so one of (n - z) choices), and the rest are picked from the yet-unpicked list, i.e. (n - 1) choices for the second digit, (n - 2) for the third, etc. So this is looking like a factorial, but with a weird first term. How many of the n digits are picked? Why, k of them, which means we have to divide by (n - k)! to ensure we only pick k digits in total. So the count for each k looks something like: (n - z)(n - 1)!/(n - k)!. Plug in every k in the range (a, b), and you have the number of (a+1)- to (b-1)-digit numbers possible, all of which must be valid.
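For what it's worth, here is a direct transcription of that Step 1 count as a small sketch (it treats all n digits as distinguishable, exactly as the formula above does, so repeated digits are counted as if they were distinct):

from math import factorial

def count_middle_lengths(n, z, a, b):
    # Numbers with k digits, a < k < b: (n - z) choices for the leading digit,
    # then (n - 1)!/(n - k)! ordered choices for the remaining k - 1 digits.
    return sum((n - z) * factorial(n - 1) // factorial(n - k)
               for k in range(a + 1, b))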
Step 2: The Edge Cases
Things are a little bit trickier when you consider a- and b-digit numbers. I don't think you can avoid starting a depth-first search through all possible combinations of digits, but you can at least abort on an entire branch if it exceeds the boundary.
For example, if your list contained { 7, 5, 2, 3, 0 } and you had an upper bound of 520, your search might go something like the following:
Pick the 7: does 7 work in the hundreds place? No, because 700 > 520;
  abort this branch entirely (i.e. don't consider 752, 753, 750, 725, etc.)
Pick the 5: does 5 work in the hundreds place? Yes, because 500 <= 520.
  Pick the 7: does 7 work in the tens place? No, because 570 > 520.
    Abort this branch (i.e. don't consider 573, 570, etc.)
  Pick the 2: does 2 work in the tens place? Yes, because 520 <= 520.
    Pick the 7: does 7 work in the ones place? No, because 527 > 520.
    Pick the 3: does 3 work in the ones place? No, because 523 > 520.
    Pick the 0: does 0 work in the ones place? Yes, because 520 <= 520.
      Oh hey, we found a number. Make sure to count it.
  Pick the 3: does 3 work in the tens place? No; abort this branch.
  Pick the 0: does 0 work in the tens place? Yes.
    ...and so on.
...and then you'd do the same for the lower bound, but flipping the comparators. It's not nearly as efficient as the k-digit combinations in the (a, b) interval (i.e. O(1)), but at least you can avoid a good deal by pruning branches that must be impossible early on. In any case, this strategy ensures you only have to actually enumerate the two edge cases that are the boundaries, regardless of how wide your (a, b) interval is (or if you have 0 as your lower bound, only one edge case).
EDIT:
Something I forgot to mention (sorry, I typed all of the above on the bus home):
When doing the depth-first search, you actually only have to recurse when the digit you pick equals the corresponding digit of the bound. That is, if your bound is 520 and you've just picked 3 as your first digit, you can just add (n-1)!/(n-3)! immediately and skip the entire branch, because all 3-digit numbers beginning with 3 are certainly below 500.
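A sketch of that bounded depth-first count with the combinatorial shortcut (the function names are mine; it handles only the upper bound, the lower bound being the same with the comparisons flipped, and unlike the closed-form count above it counts distinct numbers even when the digit list contains repeats):

from collections import Counter

def perms_of_multiset(counter, k):
    # Number of distinct ordered arrangements of length k drawn from the multiset.
    if k == 0:
        return 1
    total = 0
    for d in list(counter):
        if counter[d] > 0:
            counter[d] -= 1
            total += perms_of_multiset(counter, k - 1)
            counter[d] += 1
    return total

def count_k_digit_at_most(digits, k, bound):
    # Count distinct k-digit numbers (no leading zero) formed from `digits` that are <= bound.
    # This sketch assumes bound itself has exactly k digits.
    bound_str = str(bound)
    avail = Counter(str(d) for d in digits)
    total = 0

    def walk(pos):
        nonlocal total
        if pos == k:
            total += 1          # we matched the bound digit-for-digit
            return
        for d in sorted(avail):
            if avail[d] == 0:
                continue
            if pos == 0 and d == '0':
                continue        # no leading zeros
            if d > bound_str[pos]:
                continue        # prune: this whole branch exceeds the bound
            avail[d] -= 1
            if d == bound_str[pos]:
                walk(pos + 1)   # still tied to the bound: keep comparing digit by digit
            else:
                # strictly below the bound at this position: every completion works,
                # so count the remaining arrangements combinatorially (the shortcut above)
                total += perms_of_multiset(avail, k - pos - 1)
            avail[d] += 1

    walk(0)
    return total

# count_k_digit_at_most([5, 3, 3, 2, 0, 0], 3, 522) -> 25 three-digit numbers <= 522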

Algorithm to find the next number in a sequence

Ever since I started programming this has been something I have been curious about. But seems too complicated for me to even attempt.
I'd love to see a solution.
1, 2, 3, 4, 5 // returns 6 (n + 1)
10, 20, 30, 40, 50 //returns 60 (n + 10)
10, 17, 31, 59, 115 //returns 227 ((n * 2) - 3)
What you want to do is called polynomial interpolation. There are many methods (see http://en.wikipedia.org/wiki/Polynomial_interpolation ), but you have to have an upper bound U on the degree of the polynomial and at least U + 1 values.
If you have sequential values, then there is a simple algorithm.
Given a sequence x1, x2, x3, ..., let Delta(x) be the sequence of differences x2 - x1, x3 - x2, x4 - x3, ... . If you have consecutive values of a degree n polynomial, then the nth iterate of Delta is a constant sequence.
For example, the polynomial n^3:
1, 8, 27, 64, 125, 216, ...
7, 19, 37, 61, 91, ...
12, 18, 24, 30, ...
6, 6, 6, ...
To get the next value, fill in another 6 and then work backward.
6, 6, 6, 6 = 6, ...
12, 18, 24, 30, 36 = 30 + 6, ...
7, 19, 37, 61, 91, 127 = 91 + 36, ...
1, 8, 27, 64, 125, 216, 343 = 216 + 127, ...
The restriction on the number of values above ensures that your sequence never becomes empty while performing the differences.
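A small sketch of that difference-table method (the function name is mine; it assumes the input really is sampled from a polynomial, so some difference row eventually becomes constant):

def next_value(seq):
    # Extend a sequence by one term using repeated finite differences.
    rows = [list(seq)]
    # Build difference rows until a row is constant (or has a single element).
    while len(rows[-1]) > 1 and any(x != rows[-1][0] for x in rows[-1]):
        prev = rows[-1]
        rows.append([b - a for a, b in zip(prev, prev[1:])])
    # Work back up: each row gains its last element plus the new element below it.
    value = rows[-1][-1]
    for row in reversed(rows[:-1]):
        value = row[-1] + value
    return value

# next_value([1, 8, 27, 64, 125, 216]) -> 343
# next_value([1, 2, 3, 4, 5])          -> 6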
Sorry to disappoint, but this isn't quite possible (in general), as there are an infinite number of sequences for any given k values. Maybe with certain constraints..
You can take a look at this Everything2 post, which points to Lagrange polynomial.
Formally there is no unique next value to a partial sequence. The problem as usually understood can be clearly stated as:
Assume that the partial sequence exhibited is just sufficient to constrain some generating rule, deduce the simplest possible rule and exhibit the next value generated.
The problem turns on the meaning of "simplest", and is thus not really good for algorithmatic solutions. It can be done if you confine the problem to a certain class of functional forms for the generating rule, but the details depend on what forms you are willing to accept.
The book Numerical Recipes has pages and pages of real practical algorithms to do this kind of stuff. It's well worth the read!
The first two cases are easy:
>>> seq1 = [1, 2, 3, 4, 5]
>>> seq2 = [10, 20, 30, 40, 50]
>>> def next(seq):
...     m = (seq[1] - seq[0])/(1-0)
...     b = seq[0] - m * 0
...     return m*len(seq) + b
>>> next(seq1)
6
>>> next(seq2)
60
The third case would require solving for a non-linear function.
You can try to use extrapolation. It will help you find formulas that describe a given sequence.
I'm sorry I can't tell you much more, since my mathematics education was quite a while ago, but you should find more information in good books.
That kind of number series is often part of "intelligence tests", which leads me to think that such an algorithm would be passing (at least part of) a Turing Test, which is something quite hard to accomplish.
I like the idea, and for sequences one and two it seems possible, but then again you cannot generalize, as the sequence could go completely off base. The answer is probably that you cannot generalize; what you can do is write an algorithm to handle a specific sequence, knowing the rule (n+1), (2n+2), etc.
One thing you may be able to do is take a difference between element i and element i+1 and element i+2.
for example, in your third example:
10 17 31 59 115
The difference between 17 and 10 is 7, the difference between 31 and 17 is 14, the difference between 59 and 31 is 28, and the difference between 115 and 59 is 56.
So you note the pattern: each element is the previous one plus 7*2^n, with n counting up from 0.
So 17 = 10 + (7*2^0)
And 31 = 17 + (7*2^1)
And so on...
For an arbitrary function it can't be done, but for a linear function like in each of your examples it's simple enough.
You have f(n+1) = a*f(n) + b, and the problem amounts to finding a and b.
Given at least three terms of the sequence, you can do this (you need three because you have three unknowns -- the starting point, a, and b). For instance, suppose you have f(0), f(1) and f(2).
We can solve the equations:
f(1) = a*f(0) + b
f(2) = a*f(1) + b
The solution is:
a = (f(2)-f(1))/(f(1)-f(0))
b = f(1) - f(0)*(f(2)-f(1))/(f(1)-f(0))
(You'll want to separately solve the case where f(0) = f(1) to avoid division by zero.)
Once you have a and b, you can repeatedly apply the formula to your starting value to generate any term in the sequence.
One could also write a more general procedure that works when given any three points in the sequence (e.g. 4th, 7th, 23rd, or whatever)... this is just a simple example.
Again, though, we had to make some assumptions about what form our solution would have... in this case taking it to be linear as in your example. One could take it to be a more general polynomial, for instance, but in that case you need more terms of the sequence to find the solution, depending on the degree of the polynomial.
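A short sketch of that fit (the function name is mine; it assumes the sequence really follows f(n+1) = a*f(n) + b and that f(1) != f(0)):

def next_linear(seq):
    # Fit f(n+1) = a*f(n) + b from the first three terms, then extend by one.
    f0, f1, f2 = seq[0], seq[1], seq[2]
    a = (f2 - f1) / float(f1 - f0)
    b = f1 - a * f0
    return a * seq[-1] + b

# next_linear([10, 17, 31, 59, 115]) -> 227.0  (a = 2, b = -3)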
See also the chapter "To Seek Whence Comes a Sequence" from the book "Fluid concepts and creative analogies: computer models of the fundamental mechanisms of thought" by Douglas Hofstadter
http://portal.acm.org/citation.cfm?id=218753.218755

What is the best way to compute trending topics or tags?

Many sites offer some statistics like "The hottest topics in the last 24h". For example, Topix.com shows this in its section "News Trends". There, you can see the topics which have the fastest growing number of mentions.
I want to compute such a "buzz" for a topic, too. How could I do this? The algorithm should give less weight to topics which are always hot. Topics which normally (almost) no one mentions should be the hottest ones.
Google offers "Hot Trends", topix.com shows "Hot Topics", fav.or.it shows "Keyword Trends" - all these services have one thing in common: They only show you upcoming trends which are abnormally hot at the moment.
Terms like "Britney Spears", "weather" or "Paris Hilton" won't appear in these lists because they're always hot and frequent. This article calls this "The Britney Spears Problem".
My question: How can you code an algorithm or use an existing one to solve this problem? Having a list with the keywords searched in the last 24h, the algorithm should show you the 10 (for example) hottest ones.
I know, in the article above, there is some kind of algorithm mentioned. I've tried to code it in PHP but I don't think that it'll work. It just finds the majority, doesn't it?
I hope you can help me (coding examples would be great).
This problem calls for a z-score or standard score, which will take into account the historical average, as other people have mentioned, but also the standard deviation of this historical data, making it more robust than just using the average.
In your case a z-score is calculated by the following formula, where the trend would be a rate such as views / day.
z-score = ([current trend] - [average historic trends]) / [standard deviation of historic trends]
When a z-score is used, the higher or lower the z-score the more abnormal the trend, so for example if the z-score is highly positive then the trend is abnormally rising, while if it is highly negative it is abnormally falling. So once you calculate the z-score for all the candidate trends, the highest 10 z-scores will correspond to the most abnormally increasing trends.
Please see Wikipedia for more information about z-scores.
Code
from math import sqrt
def zscore(obs, pop):
    # Size of population.
    number = float(len(pop))
    # Average population value.
    avg = sum(pop) / number
    # Standard deviation of population.
    std = sqrt(sum(((c - avg) ** 2) for c in pop) / number)
    # Zscore Calculation.
    return (obs - avg) / std
Sample Output
>>> zscore(12, [2, 4, 4, 4, 5, 5, 7, 9])
3.5
>>> zscore(20, [21, 22, 19, 18, 17, 22, 20, 20])
0.0739221270955
>>> zscore(20, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1])
1.00303599234
>>> zscore(2, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1])
-0.922793112954
>>> zscore(9, [1, 2, 0, 3, 1, 3, 1, 2, 9, 8, 7, 10, 9, 5, 2, 4, 1, 1, 0])
1.65291949506
Notes
You can use this method with a sliding window (i.e. last 30 days) if you wish not to take too much history into account, which will make short-term trends more pronounced and can cut down on the processing time.
You could also use a z-score for values such as change in views from one day to next day to locate the abnormal values for increasing/decreasing views per day. This is like using the slope or derivative of the views per day graph.
If you keep track of the current size of the population, the current total of the population, and the current total of x^2 of the population, you don't need to recalculate these values, only update them and hence you only need to keep these values for the history, not each data value. The following code demonstrates this.
from math import sqrt
class zscore:
    def __init__(self, pop = []):
        self.number = float(len(pop))
        self.total = sum(pop)
        self.sqrTotal = sum(x ** 2 for x in pop)
    def update(self, value):
        self.number += 1.0
        self.total += value
        self.sqrTotal += value ** 2
    def avg(self):
        return self.total / self.number
    def std(self):
        return sqrt((self.sqrTotal / self.number) - self.avg() ** 2)
    def score(self, obs):
        return (obs - self.avg()) / self.std()
Using this method your work flow would be as follows. For each topic, tag, or page, create three floating point fields in your database: the total number of days, the sum of views, and the sum of views squared. If you have historic data, initialize these fields using that data; otherwise initialize to zero. At the end of each day, calculate the z-score using the day's number of views against the historic data stored in the three database fields. The topics, tags, or pages with the highest X z-scores are your X "hottest trends" of the day. Finally update each of the 3 fields with the day's value and repeat the process next day.
New Addition
Normal z-scores as discussed above do not take into account the order of the data and hence the z-score for an observation of '1' or '9' would have the same magnitude against the sequence [1, 1, 1, 1, 9, 9, 9, 9]. Obviously for trend finding, the most current data should have more weight than older data and hence we want the '1' observation to have a larger magnitude score than the '9' observation. In order to achieve this I propose a floating average z-score. It should be clear that this method is NOT guaranteed to be statistically sound but should be useful for trend finding or similar. The main difference between the standard z-score and the floating average z-score is the use of a floating average to calculate the average population value and the average population value squared. See code for details:
Code
from math import sqrt

class fazscore:
    def __init__(self, decay, pop = []):
        self.sqrAvg = self.avg = 0
        # The rate at which the historic data's effect will diminish.
        self.decay = decay
        for x in pop: self.update(x)
    def update(self, value):
        # Set initial averages to the first value in the sequence.
        if self.avg == 0 and self.sqrAvg == 0:
            self.avg = float(value)
            self.sqrAvg = float((value ** 2))
        # Calculate the average of the rest of the values using a
        # floating average.
        else:
            self.avg = self.avg * self.decay + value * (1 - self.decay)
            self.sqrAvg = self.sqrAvg * self.decay + (value ** 2) * (1 - self.decay)
        return self
    def std(self):
        # Somewhat ad-hoc standard deviation calculation.
        return sqrt(self.sqrAvg - self.avg ** 2)
    def score(self, obs):
        if self.std() == 0: return (obs - self.avg) * float("infinity")
        else: return (obs - self.avg) / self.std()
Sample IO
>>> fazscore(0.8, [1, 1, 1, 1, 1, 1, 9, 9, 9, 9, 9, 9]).score(1)
-1.67770595327
>>> fazscore(0.8, [1, 1, 1, 1, 1, 1, 9, 9, 9, 9, 9, 9]).score(9)
0.596052006642
>>> fazscore(0.9, [2, 4, 4, 4, 5, 5, 7, 9]).score(12)
3.46442230724
>>> fazscore(0.9, [2, 4, 4, 4, 5, 5, 7, 9]).score(22)
7.7773245459
>>> fazscore(0.9, [21, 22, 19, 18, 17, 22, 20, 20]).score(20)
-0.24633160155
>>> fazscore(0.9, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1]).score(20)
1.1069362749
>>> fazscore(0.9, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1]).score(2)
-0.786764452966
>>> fazscore(0.9, [1, 2, 0, 3, 1, 3, 1, 2, 9, 8, 7, 10, 9, 5, 2, 4, 1, 1, 0]).score(9)
1.82262469243
>>> fazscore(0.8, [40] * 200).score(1)
-inf
Update
As David Kemp correctly pointed out, if a z-score is requested for an observed value that differs from a series of otherwise constant values, the result should probably be non-zero. In fact the value returned should be infinity. So I changed this line,
if self.std() == 0: return 0
to:
if self.std() == 0: return (obs - self.avg) * float("infinity")
This change is reflected in the fazscore solution code. If one does not want to deal with infinite values an acceptable solution could be to instead change the line to:
if self.std() == 0: return obs - self.avg
You need an algorithm that measures the velocity of a topic - or in other words, if you graph it you want to show those that are going up at an incredible rate.
This is the first derivative of the trend line, and it is not difficult to incorporate as a weighted factor of your overall calculation.
Normalize
One thing you'll need to do is to normalize all your data. For each topic you are following, keep a very low-pass filter that defines that topic's baseline. Now every data point that comes in about that topic should be normalized: subtract its baseline and you'll get ALL of your topics near 0, with spikes above and below the line. You may instead want to divide the signal by its baseline magnitude, which will bring the signal to around 1.0. This not only brings all signals in line with each other (normalizes the baseline), but also normalizes the spikes. A Britney spike is going to be orders of magnitude larger than someone else's spike, but that doesn't mean you should pay attention to it; the spike may be very small relative to her baseline.
Derive
Once you've normalized everything, figure out the slope of each topic. Take two consecutive points and measure the difference. A positive difference is trending up, a negative difference is trending down. Then you can compare the normalized differences and find out which topics are shooting upward in popularity compared to other topics, with each topic scaled appropriately to its own 'normal', which may be orders of magnitude different from other topics.
This is really a first-pass at the problem. There are more advanced techniques which you'll need to use (mostly a combination of the above with other algorithms, weighted to suit your needs) but it should be enough to get you started.
Regarding the article
The article is about topic trending, but it's not about how to calculate what's hot and what's not; it's about how to process the huge amount of information that such an algorithm must process at places like Lycos and Google. The space and time required to give each topic a counter, and to find each topic's counter when a search on it goes through, is huge. This article is about the challenges one faces when attempting such a task. It does mention the Britney effect, but it doesn't talk about how to overcome it.
As Nixuz points out this is also referred to as a Z or Standard Score.
Chad Birch and Adam Davis are correct in that you will have to look backward to establish a baseline. Your question, as phrased, suggests that you only want to view data from the past 24 hours, and that won't quite fly.
One way to give your data some memory without having to query for a large body of historical data is to use an exponential moving average. The advantage of this is that you can update this once per period and then flush all old data, so you only need to remember a single value. So if your period is a day, you have to maintain a "daily average" attribute for each topic, which you can do by:
a_n = a_(n-1)*b + c_n*(1-b)
Where a_n is the moving average as of day n, b is some constant between 0 and 1 (the closer to 1, the longer the memory) and c_n is the number of hits on day n. The beauty is if you perform this update at the end of day n, you can flush c_n and a_(n-1).
The one caveat is that it will be initially sensitive to whatever you pick for your initial value of a.
EDIT
If it helps to visualize this approach, take n = 5, a_0 = 1, and b = .9.
Let's say the new values are 5,0,0,1,4:
a_0 = 1
c_1 = 5 : a_1 = .9*1 + .1*5 = 1.4
c_2 = 0 : a_2 = .9*1.4 + .1*0 = 1.26
c_3 = 0 : a_3 = .9*1.26 + .1*0 = 1.134
c_4 = 1 : a_4 = .9*1.134 + .1*1 = 1.1206
c_5 = 4 : a_5 = .9*1.1206 + .1*4 = 1.40854
Doesn't look very much like an average, does it? Note how the value stayed close to 1, even though our first input was 5. What's going on? If you expand out the math, what you get is:
a_n = (1-b)*c_n + (1-b)*b*c_(n-1) + (1-b)*b^2*c_(n-2) + ... + (leftover weight)*a_0
What do I mean by leftover weight? Well, in any average, all weights must add to 1. If n were infinity and the ... could go on forever, then all weights would sum to 1. But if n is relatively small, you get a good amount of weight left on the original input.
If you study the above formula, you should realize a few things about this usage:
All data contributes something to the average forever. Practically speaking, there is a point where the contribution is really, really small.
Recent values contribute more than older values.
The higher b is, the less important new values are and the longer old values matter. However, the higher b is, the more data you need to water down the initial value of a.
I think the first two characteristics are exactly what you are looking for. To give you an idea of how simple this can be to implement, here is a Python implementation (minus all the database interaction):
>>> class EMA(object):
...     def __init__(self, base, decay):
...         self.val = base
...         self.decay = decay
...         print self.val
...     def update(self, value):
...         self.val = self.val*self.decay + (1-self.decay)*value
...         print self.val
...
>>> a = EMA(1, .9)
1
>>> a.update(10)
1.9
>>> a.update(10)
2.71
>>> a.update(10)
3.439
>>> a.update(10)
4.0951
>>> a.update(10)
4.68559
>>> a.update(10)
5.217031
>>> a.update(10)
5.6953279
>>> a.update(10)
6.12579511
>>> a.update(10)
6.513215599
>>> a.update(10)
6.8618940391
>>> a.update(10)
7.17570463519
Typically "buzz" is figured out using some form of exponential/log decay mechanism. For an overview of how Hacker News, Reddit, and others handle this in a simple way, see this post.
This doesn't fully address the things that are always popular. What you're looking for seems to be something like Google's "Hot Trends" feature. For that, you could divide the current value by a historical value and then subtract out ones that are below some noise threshold.
I think the key word you need to notice is "abnormally". In order to determine when something is "abnormal", you have to know what is normal. That is, you're going to need historical data, which you can average to find out the normal rate of a particular query. You may want to exclude abnormal days from the averaging calculation, but again that'll require having enough data already, so that you know which days to exclude.
From there, you'll have to set a threshold (which would require experimentation, I'm sure), and if something goes outside the threshold, say 50% more searches than normal, you can consider it a "trend". Or, if you want to be able to find the "Top X Trendiest" like you mentioned, you just need to order things by how far (percentage-wise) they are away from their normal rate.
For example, let's say that your historical data has told you that Britney Spears usually gets 100,000 searches, and Paris Hilton usually gets 50,000. If you have a day where they both get 10,000 more searches than normal, you should be considering Paris "hotter" than Britney, because her searches increased 20% more than normal, while Britney's were only 10%.
God, I can't believe I just wrote a paragraph comparing "hotness" of Britney Spears and Paris Hilton. What have you done to me?
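A minimal sketch of that ranking by percentage over baseline (the numbers are the ones from the example above; the layout of the dicts is my own):

baseline = {"britney spears": 100000, "paris hilton": 50000}
today = {"britney spears": 110000, "paris hilton": 60000}

# Fractional increase over each topic's own normal rate.
trendiness = {t: (today[t] - baseline[t]) / float(baseline[t]) for t in today}
ranked = sorted(trendiness, key=trendiness.get, reverse=True)
# ranked == ['paris hilton', 'britney spears']  (0.20 vs 0.10)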
I was wondering if it is at all possible to use the regular physics acceleration formula in such a case?
(v2 - v1)/t, or dv/dt
We can consider v1 to be the initial likes/votes/count-of-comments per hour and v2 to be the current "velocity" per hour over the last 24 hours.
This is more like a question than an answer, but it seems it may just work. Any content with the highest acceleration will be the trending topic...
I am sure this may not solve the Britney Spears problem :-)
Probably a simple gradient of topic frequency would work: a large positive gradient means it is growing quickly in popularity.
The easiest way would be to bin the number of searches each day, so you have something like
searches = [ 10, 7, 14, 8, 9, 12, 55, 104, 100 ]
and then find out how much it changed from day to day:
hot_factor = [ b-a for a, b in zip(searches[:-1], searches[1:]) ]
# hot_factor is [ -3, 7, -6, 1, 3, 43, 49, -4 ]
and just apply some sort of threshold so that days where the increase was > 50 are considered 'hot'. You could make this far more complicated if you'd like, too. Rather than the absolute difference you can take the relative difference, so that going from 100 to 150 is considered hot, but 1000 to 1050 isn't. Or use a more complicated gradient that takes into account trends over more than just one day to the next.
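A relative-difference version of the same snippet (a small sketch; it reuses the searches list from above and guards against a zero denominator):

rel_factor = [ (b - a) / float(a) if a else float('inf')
               for a, b in zip(searches[:-1], searches[1:]) ]
# Going from 100 to 150 gives 0.5, while 1000 to 1050 gives only 0.05.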
I worked on a project where my aim was finding trending topics from a live Twitter stream and also doing sentiment analysis on the trending topics (finding whether a trending topic is talked about positively or negatively). I've used Storm for handling the Twitter stream.
I've published my report as a blog: http://sayrohan.blogspot.com/2013/06/finding-trending-topics-and-trending.html
I've used Total Count and Z-Score for the ranking.
The approach that I've used is a bit generic, and in the discussion section I've mentioned how we can extend the system for non-Twitter applications.
Hope the information helps.
You could use log-likelihood-ratios to compare the current date with the last month or year. This is statistically sound (given that your events are not normally distributed, which is to be assumed from your question).
Just sort all your terms by logLR and pick the top ten.
public static void main(String... args) {
    TermBag today = ...
    TermBag lastYear = ...
    for (String each: today.allTerms()) {
        System.out.println(logLikelihoodRatio(today, lastYear, each) + "\t" + each);
    }
}

public static double logLikelihoodRatio(TermBag t1, TermBag t2, String term) {
    double k1 = t1.occurrences(term);
    double k2 = t2.occurrences(term);
    double n1 = t1.size();
    double n2 = t2.size();
    double p1 = k1 / n1;
    double p2 = k2 / n2;
    double p = (k1 + k2) / (n1 + n2);
    double logLR = 2*(logL(p1,k1,n1) + logL(p2,k2,n2) - logL(p,k1,n1) - logL(p,k2,n2));
    if (p1 < p2) logLR *= -1;
    return logLR;
}

private static double logL(double p, double k, double n) {
    return (k == 0 ? 0 : k * Math.log(p)) + ((n - k) == 0 ? 0 : (n - k) * Math.log(1 - p));
}
PS, a TermBag is an unordered collection of words. For each document you create one bag of terms. Just count the occurrences of words. Then the method occurrences returns the number of occurrences of a given word, and the method size returns the total number of words. It is best to normalize the words somehow, typically toLowerCase is good enough. Of course, in the above examples you would create one document with all queries of today, and one with all queries of the last year.
If you simply look at tweets or status messages to get your topics, you're going to encounter a lot of noise, even if you remove all stop words. One way to get a better subset of topic candidates is to focus only on tweets/messages that share a URL, and get the keywords from the titles of those web pages. And make sure you apply POS tagging to get nouns + noun phrases as well.
Titles of web pages usually are more descriptive and contain words that describe what the page is about. In addition, sharing a web page usually is correlated with sharing news that is breaking (i.e. if a celebrity like Michael Jackson died, you're going to get a lot of people sharing an article about his death).
I've run experiments where I only take popular keywords from titles, AND then get the total counts of those keywords across all status messages, and they definitely remove a lot of noise. If you do it this way, you don't need a complex algorithm; just do a simple ordering of the keyword frequencies, and you're halfway there.
The idea is to keep track of such things and notice when they jump significantly as compared to their own baseline.
So, for queries that have more than a certain threshold, track each one, and when it changes to some value (say almost double) of its historical value, then it is a new hot trend.
