Algorithm for nice graph labels for time/date axis? - algorithm

I'm looking for a "nice numbers" algorithm for determining the labels on a date/time value axis. I'm familiar with Paul Heckbert's Nice Numbers algorithm.
I have a plot that displays time/date on the X axis and the user can zoom in and look at a smaller time frame. I'm looking for an algorithm that picks nice dates to display on the ticks.
For example:
Looking at a day or so: 1/1 12:00, 1/1 4:00, 1/1 8:00...
Looking at a week: 1/1, 1/2, 1/3...
Looking at a month: 1/09, 2/09, 3/09...
The nice label ticks don't need to correspond to the first visible point, but close to it.
Is anybody familiar with such an algorithm?

The 'nice numbers' article you linked to mentioned that
the nicest numbers in decimal are 1, 2, 5 and all power-of-10 multiples of these numbers
So I think for doing something similar with date/time you need to start by similarly breaking down the component pieces. So take the nice factors of each type of interval:
If you're showing seconds or minutes use 1, 2, 3, 5, 10, 15, 30
(I skipped 6, 12, and 20 because they don't "feel" right).
If you're showing hours use 1, 2, 3, 4, 6, 8, 12
for days use 1, 2, 7
for weeks use 1, 2, 4 (13 and 26 fit the model but seem too odd to me)
for months use 1, 2, 3, 4, 6
for years use 1, 2, 5 and power-of-10 multiples
Now obviously this starts to break down as you get into larger amounts. Certainly you don't want to show 5 weeks' worth of minutes, even in "pretty" intervals of 30 minutes or something. On the other hand, when you only have 48 hours' worth, you don't want to show 1-day intervals. The trick, as you have already pointed out, is finding decent transition points.
Just on a hunch, I would say a reasonable crossover point would be about twice as much as the next interval. That would give you the following (min and max number of intervals shown afterwards)
use seconds if you have less than 2 minutes worth (1-120)
use minutes if you have less than 2 hours worth (2-120)
use hours if you have less than 2 days worth (2-48)
use days if you have less than 2 weeks worth (2-14)
use weeks if you have less than 2 months worth (2-8/9)
use months if you have less than 2 years worth (2-24)
otherwise use years (although you could continue with decades, centuries, etc if your ranges can be that long)
Unfortunately, our inconsistent time intervals mean that you end up with some cases that can have over a hundred intervals while others have at most 8 or 9. So you'll want to pick the size of your intervals such that you don't have more than 10-15 intervals at most (or fewer than 5, for that matter). Also, you could break from a strict definition of 2 times the next biggest interval if you think it's easy to keep track of. For instance, you could use hours up to 3 days (72 hours) and weeks up to 4 months. A little trial and error might be necessary.
So to go back over, choose the interval type based on the size of your range, then choose the interval size by picking one of the "nice" numbers that will leave you with between 5 and about 15 tick marks. Or if you know and/or can control the actual number of pixels between tick marks you could put upper and lower bounds on how many pixels are acceptable between ticks (if they are spaced too far apart the graph may be hard to read, but if there are too many ticks the graph will be cluttered and your labels may overlap).
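The two-step choice described above can be sketched in Python. This is only an illustration under stated assumptions: the names `UNITS` and `pick_tick_interval` are made up here, months and years are approximated by average lengths in seconds, the year factors stop at 100, and the crossover rule is the "twice the next interval" hunch with a cap of 15 intervals.

```python
# (name, seconds per unit, nice factors) taken from the lists above.
# Month and year lengths are averages, which is an approximation.
UNITS = [
    ("second", 1, (1, 2, 3, 5, 10, 15, 30)),
    ("minute", 60, (1, 2, 3, 5, 10, 15, 30)),
    ("hour", 3600, (1, 2, 3, 4, 6, 8, 12)),
    ("day", 86400, (1, 2, 7)),
    ("week", 604800, (1, 2, 4)),
    ("month", 2629800, (1, 2, 3, 4, 6)),
    ("year", 31557600, (1, 2, 5, 10, 20, 50, 100)),
]


def pick_tick_interval(range_seconds, max_ticks=15):
    """Return (factor, unit_name) for a visible range given in seconds."""
    # Step 1: use a unit while the range is less than twice the next unit.
    for i, (name, size, factors) in enumerate(UNITS):
        is_last = i == len(UNITS) - 1
        if is_last or range_seconds < 2 * UNITS[i + 1][1]:
            break
    # Step 2: smallest nice factor that keeps the interval count in bounds.
    for factor in factors:
        if range_seconds / (factor * size) <= max_ticks:
            return factor, name
    return factors[-1], name  # fall back to the coarsest factor


print(pick_tick_interval(100))          # (10, 'second')
print(pick_tick_interval(36 * 3600))    # (3, 'hour')
print(pick_tick_interval(10 * 86400))   # (1, 'day')
```

This only picks the spacing. Snapping the first tick to a calendar boundary (midnight, the first of the month, and so on) would still have to be done separately.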

Have a look at
http://tools.netsa.cert.org/netsa-python/doc/index.html
It has a nice.py ( python/netsa/data/nice.py ) which i think is stand-alone, and should work fine.

Still no answer to this question... I'll throw my first idea in then! I assume you have the range of the visible axis.
This is probably how I would do it.
Rough pseudo:
// quantify range
rangeLength = endOfVisiblePart - startOfVisiblePart;
// qualify range resolution
if (rangeLength < "1.5 day") {
    resolution = "day"; // it can be a number, e.g.: ..., 3 for day, 4 for week, ...
} else if (rangeLength < "9 days") {
    resolution = "week";
} else if (rangeLength < "35 days") {
    resolution = "month";
} // you can expand this in both ways to get from nanoseconds to geological eras if you wish
After that, it should (depending on what you have easy access to) be quite easy to determine the value of each nice label tick. Depending on the 'resolution', you format it differently, e.g. MM/DD for "week", MM:SS for "minute", etc., just like you said.

[Edit - I expanded this a little more at http://www.acooke.org/cute/AutoScalin0.html ]
A naive extension of the "nice numbers" algorithm seems to work for base 12 and 60, which gives good intervals for hours and minutes. This is code I just hacked together:
from math import ceil, floor, log, log10

LIM10 = (10, [(1.5, 1), (3, 2), (7, 5)], [1, 2, 5])
LIM12 = (12, [(1.5, 1), (3, 2), (8, 6)], [1, 2, 6])
LIM60 = (60, [(1.5, 1), (20, 15), (40, 30)], [1, 15, 40])

def heckbert_d(lo, hi, ntick=5, limits=None):
    '''
    Heckbert's "nice numbers" algorithm for graph ranges, from "Graphics Gems".
    '''
    if limits is None:
        limits = LIM10
    (base, rfs, fs) = limits
    def nicenum(x, round):
        step = base ** floor(log(x)/log(base))
        f = float(x) / step
        nf = base
        if round:
            for (a, b) in rfs:
                if f < a:
                    nf = b
                    break
        else:
            for a in fs:
                if f <= a:
                    nf = a
                    break
        return nf * step
    delta = nicenum(hi-lo, False)
    return nicenum(delta / (ntick-1), True)

def heckbert(lo, hi, ntick=5, limits=None):
    '''
    Heckbert's "nice numbers" algorithm for graph ranges, from "Graphics Gems".
    '''
    def _heckbert():
        d = heckbert_d(lo, hi, ntick=ntick, limits=limits)
        graphlo = floor(lo / d) * d
        graphhi = ceil(hi / d) * d
        fmt = '%' + '.%df' % max(-floor(log10(d)), 0)
        value = graphlo
        while value < graphhi + 0.5*d:
            yield fmt % value
            value += d
    return list(_heckbert())
So, for example, if you want to display seconds from 0 to 60,
>>> heckbert(0, 60, limits=LIM60)
['0', '15', '30', '45', '60']
or hours from 0 to 5:
>>> heckbert(0, 5, limits=LIM12)
['0', '2', '4', '6']

I'd suggest you grab the source code to gnuplot or RRDTool (or even Flot) and examine how they approach this problem. The general case is likely to be N labels applied based on the width of your plot, with some kind of 'snapping' to the nearest 'nice' number.
Every time I've written such an algorithm (too many times really), I've used a table of 'preferences'... ie: based on the time range on the plot, decide if I'm using Weeks, Days, Hours, Minutes etc as the main axis point. I usually included some preferred formatting, as I rarely want to see the date for each minute I plot on the graph.
I'd be happy but surprised to find someone using a formula (like Heckbert does) to find 'nice', as the variation in time units between minutes, hours, days, and weeks are not that linear.

In theory you can also change your concept, so that the scale rather than your data is at the center of the visualization.
When you know the start and end dates of your data, you can create a scale containing all dates and place your data on that scale, like a fixed scale.
You can have scales of type year, month, day, hour, ... and limit the scaling to just these, which means you give up the concept of free scaling.
The advantage is that you can easily show gaps in dates. But if you have a lot of gaps, that can become useless too.

Related

Algorithm for tracking values through time

There is likely a known algorithm for doing this, but I wasn't able to find it using my Google skills, so I will try to describe what I have to do and what I did so far.
I have a source of characteristic values of a system which I would like to plot as a trend. The values are being returned from an algorithm in real time, and each value has a set of properties (magnitude, phase, quality).
However, these values can appear and disappear in time, and I can also get some intermittent values which I will disregard if they don't repeat during a longer period (several samples).
For example, I might be getting these values:
Time (Mag, Phase, Quality)
t = 1 (10.10, 0.90, 0.90); (17.00, 0.02, 0.12)
t = 2 (10.15, 0.91, 0.89); (17.10, 0.12, 0.12)
t = 3 (17.10, 0.12, 0.12)
t = 4 (10.25, 0.91, 0.89); (17.12, 0.12, 0.12)
t = 5 ( 6.15, 0.41, 0.39); (10.35, 0.91, 0.89); (17.12, 0.12, 0.12)
t = 6 (10.20, 0.90, 0.85); (17.02, 0.13, 0.11)
t = 7 ( 9.20, 0.90, 0.85); (11.20, 0.90, 0.85); (17.02, 0.13, 0.11)
t = 8 ( 9.80, 0.90, 0.85); (11.80, 0.90, 0.85); (17.02, 0.13, 0.11)
I'd like to track these sets of values through time according to the similarity with previous values. I.e. in the example above, I have two main trends (Mag 10 and Mag 17), with several specific situations:
moments where I will shortly lose one of the values (Mag 10 is lost in t = 3),
moments where I shortly get a new temporary/invalid reading (Mag 6 in t = 5) for a single sample,
moments where it's not completely clear which set corresponds to the previous sample (Mag 9.2 and Mag 11.2 could both be a continuation of Mag 10.2 from the previous sample, and in t = 8 it becomes apparent that there are now two different sets (Mag 9.8 and Mag 11.8).
If I just grouped the values as they arrive from the system, I would not get their correct trends, i.e. without tracking, the magnitudes would appear like this:
However, properly matching these values against old magnitude should result in this trend:
I've written an algorithm which tracks the values through time by effectively trying all permutations of sets against the previous "active" sets. It calculates the differences between all new values and the previous known values, which is basically a N^2 algorithm, and then checks all permutations to find the smallest total distance (something like N! complexity):
for each X in new_sets
    for each Y in existing_sets
        distance(X, Y) = calculate_distance(X, Y);
for each P in permutations(new_sets)
    total_distance = sum(distance(X, Y)) for all (X, Y) in permutation
permutation P with min total_distance is the best match
As I go through time, I also remove measurements from existing_sets if they are not matched within several samples.
This works reasonably well as long as I don't have too many values, but the time complexity becomes problematic once I begin tracking more than 10 items. It also feels like reinventing the wheel.
Is there a known/better (in terms of time complexity) algorithm for doing this?
Without constraints on the behaviour of the sources, there is obviously no solution. If we may say that the magnitudes from different sources are reasonably separated, and changes are reasonably small, the solution is to keep the trends in the sorted order. Then binary search them to find the trend closest to the new reading.
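A minimal Python sketch of the sorted-trends lookup suggested above, assuming the magnitudes of the active trends are kept in an ascending list. The function name `closest_trend` and the `max_jump` cutoff (used to reject readings that are too far from every trend, such as the Mag 6 outlier) are hypothetical and not part of the original answer.

```python
from bisect import bisect_left


def closest_trend(sorted_mags, reading, max_jump):
    """Return the index of the trend nearest to `reading`, or None.

    sorted_mags: last known magnitude of each active trend, ascending.
    max_jump:    hypothetical limit on how far a trend may move per sample.
    """
    i = bisect_left(sorted_mags, reading)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = [j for j in (i - 1, i) if 0 <= j < len(sorted_mags)]
    if not candidates:
        return None
    best = min(candidates, key=lambda j: abs(sorted_mags[j] - reading))
    if abs(sorted_mags[best] - reading) > max_jump:
        return None  # too far from every trend: treat as a new or spurious value
    return best


trends = [10.15, 17.10]
print(closest_trend(trends, 10.25, max_jump=1.0))  # 0
print(closest_trend(trends, 17.12, max_jump=1.0))  # 1
print(closest_trend(trends, 6.15, max_jump=1.0))   # None
```

Each lookup is O(log N), but the sketch does not resolve the ambiguous split case (9.2 and 11.2 both being close to 10.2). That case still needs a rule of its own.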

Algorithm: how to place n texts on p bands so that the global access time is minimal. Each text has its own length

The problem is as follows: we are given n texts, and they are going to be placed on p tapes/bands (I don't really know the equivalent term in English, but I think you understand what I'm talking about).
In order to read the text situated at position k on one of the bands, we have to read the texts at positions 1, 2, ..., k on that band. Each text has its own length.
Now, we have to figure out a way of placing the texts on the p bands so that the global access time is minimal. The global access time is calculated by adding up the total access times of all bands.
The formula for calculating the total access time of a band is:
$$\sum_{i=1}^{n} \bigl[ L(T_1) + L(T_2) + \dots + L(T_i) \bigr]$$
where the sum runs over i from 1 to n;
L(T_i) is the length of T_i;
T_i is the text situated at position i on the respective band.
Here is an equivalent in "pseudocode" in case it helps:
n-number of texts;
Band[n]-array of texts
sum=0, sum2=0;
for(int i=0;i<n;i++)
{
    sum=0;
    for(int j=0;j<=i;j++ )
        sum=sum+Band[j].length;
    sum2=sum2+sum;
}
return sum2;
Here's an example to clarify the problem:
say p is 3, so we get 3 bands
say n is 9, so we get 9 texts and the lengths are : 2, 3, 4, 5, 6, 7, 8, 9, 10
and they are placed on the bands in the following way:
band-1: 2, 5, 8 -> total access time of band-1: 24
band-2: 3, 6, 9 -> total access time of band-2: 30
band-3: 4, 7, 10 -> total access time of band-3: 36
the global access time: 24 + 30 + 36 = 90
I'll use "position" to mean the number of texts that appear after a specific text on its tape; it also equals how many additional times that text will be read.
Since you are only interested in the sum of access times, how the texts are grouped into tapes does not matter, only the position of each text. For example, swapping two texts that are in the same position but on different tapes won't change the global access time.
Swapping two texts of different lengths in different positions will change the time, though. In general, longer texts should be placed in lower positions (closer to the end of the tape).
The algorithm can be greedy: go over the texts from longest to shortest and place each text in the last available spot on one of the tapes with the fewest texts on it. For example, with 10 texts and 5 tapes, the 5 longest texts will be at the end of each tape and the 5 shortest texts at the beginning.
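The greedy placement can be sketched in Python as follows. The function names are made up for this illustration. Dealing the texts out round-robin from longest to shortest and then reversing each band is one way to realise "the last available spot on a tape with the fewest texts".

```python
def band_access_time(band):
    """Total access time of one band: the sum of all prefix sums of lengths."""
    total = prefix = 0
    for length in band:
        prefix += length
        total += prefix
    return total


def place_texts(lengths, p):
    """Distribute text lengths over p bands, longest texts last on each band."""
    bands = [[] for _ in range(p)]
    # Longest first, round-robin, so every band fills up from its end.
    for i, length in enumerate(sorted(lengths, reverse=True)):
        bands[i % p].append(length)
    # Each band was built back to front, so reverse into reading order.
    return [band[::-1] for band in bands]


bands = place_texts([2, 3, 4, 5, 6, 7, 8, 9, 10], 3)
print(bands)                                     # [[4, 7, 10], [3, 6, 9], [2, 5, 8]]
print(sum(band_access_time(b) for b in bands))   # 90
```

For the example from the question this reproduces a global access time of 90. The bands come out in a different order, which does not affect the sum.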

Find a period of eventually periodic sequence

Short explanation.
I have a sequence of numbers [0, 1, 4, 0, 0, 1, 1, 2, 3, 7, 0, 0, 1, 1, 2, 3, 7, 0, 0, 1, 1, 2, 3, 7, 0, 0, 1, 1, 2, 3, 7]. As you see, from the 3-rd value the sequence is periodic with a period [0, 0, 1, 1, 2, 3, 7].
I am trying to automatically extract this period from this sequence. The problem is that neither I know the length of the period, nor do I know from which position the sequence becomes periodic.
Full explanation (might require some math)
I am learning combinatorial game theory and a cornerstone of this theory requires one to calculate Grundy values of a game graph. This produces infinite sequence, which in many cases becomes eventually periodic.
I found a way to efficiently calculate grundy values (it returns me a sequence). I would like to automatically extract offset and period of this sequence. I am aware that seeing a part of the sequence [1, 2, 3, 1, 2, 3] you can't be sure that [1, 2, 3] is a period (who knows may be the next number is 4, which breaks the assumption), but I am not interested in such intricacies (I assume that the sequence is enough to find the real period). Also the problem is the sequence can stop in the middle of the period: [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, ...] (the period is still 1, 2, 3).
I also need to find the smallest offset and period. For example for original sequence, the offset can be [0, 1, 4, 0, 0] and the period [1, 1, 2, 3, 7, 0, 0], but the smallest is [0, 1, 4] and [0, 0, 1, 1, 2, 3, 7].
My inefficient approach is to try every possible offset and every possible period, construct the sequence from them, and check whether it is the same as the original. I have not done any formal analysis, but it looks like it is at least quadratic in terms of time complexity.
Here is my quick python code (have not tested it properly):
def getPeriod(arr):
    min_offset, min_period, n = len(arr), len(arr), len(arr)
    best_offset, best_period = [], []
    for offset in range(n):
        start = arr[:offset]
        for period_len in range(1, (n - offset) // 2):
            period = arr[offset: offset+period_len]
            attempt = (start + period * (n // period_len + 1))[:n]
            if attempt == arr:
                if period_len < min_period:
                    best_offset, best_period = start[::], period[::]
                    min_offset, min_period = len(start), period_len
                elif period_len == min_period and len(start) < min_offset:
                    best_offset, best_period = start[::], period[::]
                    min_offset, min_period = len(start), period_len
    return best_offset, best_period
Which returns me what I want for my original sequence:
offset [0, 1, 4]
period [0, 0, 1, 1, 2, 3, 7]
Is there anything more efficient?
Remark: If there is a period P1 with length L, then there is also a period P2, with the same length, L, such that the input sequence ends exactly with P2 (i.e. we do not have a partial period involved at the end).
Indeed, a different period of the same length can always be obtained by changing the offset. The new period will be a rotation of the initial period.
For example the following sequence has a period of length 4 and offset 3:
0 0 0 (1 2 3 4) (1 2 3 4) (1 2 3 4) (1 2 3 4) (1 2 3 4) (1 2
but it also has a period with the same length 4 and offset 5, without a partial period at the end:
0 0 0 1 2 (3 4 1 2) (3 4 1 2) (3 4 1 2) (3 4 1 2) (3 4 1 2)
The implication is that we can find the minimum length of a period by processing the sequence in reverse order, and searching the minimum period using zero offset from the end. One possible approach is to simply use your current algorithm on the reversed list, without the need of the loop over offsets.
Now that we know the length of the desired period, we can also find its minimum offset. One possible approach is to try all various offsets (with the advantage of not needing the loop over lengths, since the length is known), however, further optimizations are possible if necessary, e.g. by advancing as much as possible when processing the list from the end, allowing the final repetition of the period (i.e. the one closest to the start of the un-reversed sequence) to be partial.
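One way to sketch this idea in Python: for each candidate period length, walk backwards from the end of the sequence while `seq[i] == seq[i + p]` holds. The point where the walk stops is the minimum offset for that length. The function name is hypothetical, and the acceptance rule (the periodic part must cover at least two full periods) is an assumption made for this sketch.

```python
def find_offset_and_period(seq):
    """Return (offset_part, period) using the smallest acceptable period."""
    n = len(seq)
    for p in range(1, n // 2 + 1):
        # Walk from the end towards the start while the sequence repeats
        # itself at distance p; a partial period at the end is fine.
        i = n - p - 1
        while i >= 0 and seq[i] == seq[i + p]:
            i -= 1
        offset = i + 1
        # Assumption: accept p only if at least two full periods are covered.
        if n - offset >= 2 * p:
            return seq[:offset], seq[offset:offset + p]
    return seq, []  # no period found


seq = [0, 1, 4] + [0, 0, 1, 1, 2, 3, 7] * 4
print(find_offset_and_period(seq))
# ([0, 1, 4], [0, 0, 1, 1, 2, 3, 7])
```

Wrong period lengths usually fail after a few comparisons, so in practice this tends to be much cheaper than trying every offset and every length, although the worst case is still quadratic.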
I would start with constructing histogram of the values in the sequence
So you just make a list of all numbers used in sequence (or significant part of it) and count their occurrence. This is O(n) where n is sequence size.
sort the histogram ascending
This is O(m.log(m)) where m is the number of distinct values. You can also ignore low-probability numbers (count<threshold), which are most likely in the offset or just irregularities, further lowering m. For periodic sequences m <<< n, so you can use it as a first indicator of whether the sequence is periodic or not.
find out the period
In the histogram the counts should be around multiples of the n/period. So approximate/find GCD of the histogram counts. The problem is that you need to take into account there are irregularities present in the counts and also in the n (offset part) so you need to compute GCD approximately. for example:
sequence = { 1,1,2,3,3,1,2,3,3,1,2,3,3 }
has ordered histogram:
item,count
2 3
1 4
3 6
the GCD(6,4)=2 and GCD(6,3)=3 you should check at least +/-1 around the GCD results so the possible periods are around:
T = ~n/2 = 13/2 = 6
T = ~n/3 = 13/3 = 4
So check T={3,4,5,6,7} just to be sure. Use always GCD between the highest counts vs. lowest counts. If the sequence has many distinct numbers you can also do a histogram of counts checking only the most common values.
To check period validity just take any item near end or middle of the sequence (just use probable periodic area). Then look for it in close area near probable period before (or after) its occurrence. If found few times you got the right period (or its multiple)
Get the exact period
Just check the found period fractions (T/2, T/3, ...) or do a histogram on the found period and the smallest count tells you how many real periods you got encapsulated so divide by it.
find offset
When you know the period this is easy. Just scan from the start: take the first item and see if it appears again one period later. If not, remember the position. Stop at the end, in the middle of the sequence, or after some threshold of consecutive successes. This is up to O(n), and the last remembered position is the last item in the offset.
[edit1] Was curious so I try to code it in C++
I simplified/skip few things (assuming at least half of the array is periodic) to test if I did not make some silly mistake in my algorithm and here the result (Works as expected):
const int p=10; // min periods for testing
const int n=500; // generated sequence size
int seq[n]; // generated sequence
int offset,period; // generated properties
int i,j,k,e,t0,T;
int hval[n],hcnt[n],hs; // histogram
// generate periodic sequence
Randomize();
offset=Random(n/5);
period=5+Random(n/5);
for (i=0;i<offset+period;i++) seq[i]=Random(n);
for (i=offset,j=i+period;j<n;i++,j++) seq[j]=seq[i];
if ((offset)&&(seq[offset-1]==seq[offset-1+period])) seq[offset-1]++;
// compute histogram O(n) on last half of it
for (hs=0,i=n>>1;i<n;i++)
{
    for (e=seq[i],j=0;j<hs;j++)
        if (hval[j]==e) { hcnt[j]++; j=-1; break; }
    if (j>=0) { hval[hs]=e; hcnt[hs]=1; hs++; }
}
// bubble sort histogram asc O(m^2)
for (e=1,j=hs;e;j--)
    for (e=0,i=1;i<j;i++)
        if (hcnt[i-1]>hcnt[i])
        {
            e=hval[i-1]; hval[i-1]=hval[i]; hval[i]=e;
            e=hcnt[i-1]; hcnt[i-1]=hcnt[i]; hcnt[i]=e; e=1;
        }
// test possible periods
for (j=0;j<hs;j++)
    if ((!j)||(hcnt[j]!=hcnt[j-1])) // distinct counts only
        if (hcnt[j]>1) // more than 1 occurrence
            for (T=(n>>1)/(hcnt[j]+1);T<=(n>>1)/(hcnt[j]-1);T++)
            {
                for (i=n-1,e=seq[i],i-=T,k=0;(i>=(n>>1))&&(k<p)&&(e==seq[i]);i-=T,k++);
                if ((k>=p)||(i<n>>1)) { j=hs; break; }
            }
// compute histogram O(T) on last multiple of period
for (hs=0,i=n-T;i<n;i++)
{
    for (e=seq[i],j=0;j<hs;j++)
        if (hval[j]==e) { hcnt[j]++; j=-1; break; }
    if (j>=0) { hval[hs]=e; hcnt[hs]=1; hs++; }
}
// least count is the period multiple O(m)
for (e=hcnt[0],i=0;i<hs;i++) if (e>hcnt[i]) e=hcnt[i];
if (e) T/=e;
// check/handle error
if (T!=period)
{
    return;
}
// search offset size O(n)
for (t0=-1,i=0;i<n-T;i++)
    if (seq[i]!=seq[i+T]) t0=i;
t0++;
// check/handle error
if (t0!=offset)
{
    return;
}
Code is still not optimized. For n=10000 it takes around 5ms on my setup. The result is in t0 (offset) and T (period). You may need to play with the threshold constants a bit.
I had to do something similar once. I used brute force and some common sense, the solution is not very elegant but it works. The solution always works, but you have to set the right parameters (k,j, con) in the function.
The sequence is saved as a list in the variable seq.
k is the size of the sequence array, if you think your sequence will take long to become periodic then set this k to a big number.
The variable found will tell us if the array passed the periodic test with period j
j is the period.
If you expect a huge period then you must set j to a big number.
We test the periodicity by checking the last j+30 numbers of the sequence.
The bigger the period (j) the more we must check.
As soon as one of the tests is passed, we exit the function and return the smallest period.
As you may notice the accuracy depends on the variables j and k but if you set them to very big numbers it will always be correct.
def some_sequence(s0, a, b, m):
    try:
        seq=[s0]
        snext=s0
        findseq=True
        k=0
        while findseq:
            snext= (a*snext+b)%m
            seq.append(snext)
            #UNTIL THIS PART IS JUST TO CREATE THE SEQUENCE (seq) SO IS NOT IMPORTANT
            k=k+1
            if k>20000:
                # I IS OUR LIST INDEX
                for i in range(1,len(seq)):
                    for j in range(1,1000):
                        found =True
                        for con in range(j+30):
                            #THE TRICK IS TO START FROM BEHIND
                            if not (seq[-i-con]==seq[-i-j-con]):
                                found = False
                        if found:
                            minT=j
                            findseq=False
                            return minT
    except:
        return None
simplified version
def get_min_period(sequence,max_period,test_numb):
    seq=sequence
    if max_period+test_numb > len(sequence):
        print("max_period+test_numb cannot be bigger than the seq length")
        return 1
    for i in range(1,len(seq)):
        for j in range(1,max_period):
            found =True
            for con in range(j+test_numb):
                if not (seq[-i-con]==seq[-i-j-con]):
                    found = False
            if found:
                minT=j
                return minT
Here max_period is the maximum period you want to look for, and test_numb is how many numbers of the sequence you want to test; the bigger the better, but you have to keep max_period+test_numb < len(sequence)

How can I acquire a certain "random range" in a higher frequency?

My question is basically, "how can I obtain certain random values within a specific range more than random values outside the range?"
Allow me to demonstrate what I mean:
If I were to, on a good amount of trials, start picking a variety of
random numbers from 1-10, I should be seeing more numbers in the 7-10
range than in the 1-6 range.
I tried a couple of ways, but I am not getting desirable results.
First Function:
function getAverage(i)
    math.randomseed(os.time())
    local sum = 0;
    for j = 1,i do
        sum = sum + (1-math.random()^3)*10
    end
    print(sum/i)
end
getAverage(500)
I was constantly getting numbers only around 7.5, such as 7.48 and 7.52. Although this does indeed get me a number within my range, I don't want such strict consistency.
Second Function:
function getAverage(i)
    math.randomseed(os.time())
    local sum = 0;
    for j = 1,i do
        sum = sum + (math.random() > .3 and math.random(7,10) or math.random(1,6))
    end
    print(sum/i)
end
getAverage(500)
This function didn't work as I wanted it to either. I was primarily getting numbers such as 6.8 and 7.2 but nothing even close to 8.
Third Function:
function getAverage(i)
    math.randomseed(os.time())
    local sum = 0;
    for j = 1,i do
        sum = sum + (((math.random(10) * 2)/1.2)^1.05) - math.random(1,3)
    end
    print(sum/i)
end
getAverage(500)
This function was giving me slightly more favorable results, with the function consistently returning 8, but that is the issue - consistency.
What type of paradigms or practical solutions can I use to generate more random numbers within a specific range over another range?
I have labeled this as Lua, but a solution in any language that is understandable is acceptable.
I don't want such strict consistency.
What does that mean?
If you average a very large number of values in a given range from any RNG, you should expect that to produce the same number. That means each of the numbers in the range was equally likely to appear.
This function didn't work as I wanted it to either. I was primarily getting numbers such as 6.8 and 7.2 but nothing even close to 8.
You have to clarify what "didn't work" means. Why would you expect it to give you 8? You can see it won't just by looking at the formula you used.
For instance, if you'd used math.random(1,10), assuming all numbers in the range have an equal chance of appearing, you should expect the average to be 5.5, dead in the middle of 1 and 10 (because (1+2+3+4+5+6+7+8+9+10)/10 = 5.5).
You used math.random() > .3 and math.random(7,10) or math.random(1,6), which says: 70% of the time give 7, 8, 9, or 10 (average = 8.5), and 30% of the time give 1, 2, 3, 4, 5, or 6 (average = 3.5). That should give you an overall average of 7 (because 3.5 * .3 + 8.5 * .7 = 7). If you bump up your sample size, that's exactly what you'll see. You're seeing values on either side because your sample size is so small (try bumping it up to 100000).
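The arithmetic above can be checked exactly with a few lines of Python. This is just a sketch of the expected-value calculation for a two-range mixture; `mixture_mean` is a name invented for this example.

```python
from fractions import Fraction


def mixture_mean(p_high, high_values, low_values):
    """Expected value when high_values are drawn with probability p_high
    and low_values otherwise, each range being uniform."""
    def mean(values):
        values = list(values)
        return Fraction(sum(values), len(values))
    return p_high * mean(high_values) + (1 - p_high) * mean(low_values)


# 70% of draws from 7..10 (mean 8.5), 30% from 1..6 (mean 3.5)
print(mixture_mean(Fraction(7, 10), range(7, 11), range(1, 7)))  # 7
# Plain uniform 1..10 for comparison
print(mixture_mean(Fraction(0), [], range(1, 11)) if False else Fraction(55, 10))  # 11/2
```

To push the average towards 8 with the same two ranges, the probability of the high range would have to go up to 0.9, since 0.9 * 8.5 + 0.1 * 3.5 = 8.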
I've made skewed random values before by simply generating two random numbers in the range, and then picking the largest (or smallest). This skews the probability towards the high (or low) endpoint.
Picking the smallest of two gives you a linear probability distribution.
Picking the smallest of three gives you a parabolic distribution (more selectivity, less probability at "the other end"). For my needs, a linear distribution was fine.
Not exactly what you wanted, but maybe it's good enough.
Have fun!
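A small Python sketch of the "pick the largest of two" trick described above, for integers in a range. The function names are invented for this illustration. `max_of_two_pmf` computes the exact probabilities, so the skew can be inspected without sampling.

```python
import random
from fractions import Fraction


def skewed_high(lo, hi, rng=random):
    """Draw two uniform integers in [lo, hi] and keep the larger one."""
    return max(rng.randint(lo, hi), rng.randint(lo, hi))


def max_of_two_pmf(lo, hi):
    """Exact distribution of skewed_high: the k-th smallest value has
    probability (k^2 - (k-1)^2) / n^2 = (2k - 1) / n^2."""
    n = hi - lo + 1
    return {lo + k - 1: Fraction(2 * k - 1, n * n) for k in range(1, n + 1)}


pmf = max_of_two_pmf(1, 10)
print(sum(p for v, p in pmf.items() if v >= 7))   # 16/25, i.e. 64% land in 7..10
print(sum(v * p for v, p in pmf.items()))         # 143/20, i.e. a mean of 7.15
```

Individual draws still cover the whole 1-10 range, so the values themselves stay varied. Only the long-run proportions are tilted towards the high end.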

What is the best way to compute trending topics or tags?

Many sites offer some statistics like "The hottest topics in the last 24h". For example, Topix.com shows this in its section "News Trends". There, you can see the topics which have the fastest growing number of mentions.
I want to compute such a "buzz" for a topic, too. How could I do this? The algorithm should weight the topics which are always hot less. The topics which normally (almost) no one mentions should be the hottest ones.
Google offers "Hot Trends", topix.com shows "Hot Topics", fav.or.it shows "Keyword Trends" - all these services have one thing in common: They only show you upcoming trends which are abnormally hot at the moment.
Terms like "Britney Spears", "weather" or "Paris Hilton" won't appear in these lists because they're always hot and frequent. This article calls this "The Britney Spears Problem".
My question: How can you code an algorithm or use an existing one to solve this problem? Having a list with the keywords searched in the last 24h, the algorithm should show you the 10 (for example) hottest ones.
I know, in the article above, there is some kind of algorithm mentioned. I've tried to code it in PHP but I don't think that it'll work. It just finds the majority, doesn't it?
I hope you can help me (coding examples would be great).
This problem calls for a z-score or standard score, which will take into account the historical average, as other people have mentioned, but also the standard deviation of this historical data, making it more robust than just using the average.
In your case a z-score is calculated by the following formula, where the trend would be a rate such as views / day.
z-score = ([current trend] - [average historic trends]) / [standard deviation of historic trends]
When a z-score is used, the higher or lower the z-score, the more abnormal the trend. For example, if the z-score is highly positive then the trend is abnormally rising, while if it is highly negative it is abnormally falling. So once you calculate the z-score for all the candidate trends, the highest 10 z-scores will correspond to the most abnormally increasing trends.
Please see Wikipedia for more information about z-scores.
Code
from math import sqrt

def zscore(obs, pop):
    # Size of population.
    number = float(len(pop))
    # Average population value.
    avg = sum(pop) / number
    # Standard deviation of population.
    std = sqrt(sum(((c - avg) ** 2) for c in pop) / number)
    # Zscore Calculation.
    return (obs - avg) / std
Sample Output
>>> zscore(12, [2, 4, 4, 4, 5, 5, 7, 9])
3.5
>>> zscore(20, [21, 22, 19, 18, 17, 22, 20, 20])
0.0739221270955
>>> zscore(20, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1])
1.00303599234
>>> zscore(2, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1])
-0.922793112954
>>> zscore(9, [1, 2, 0, 3, 1, 3, 1, 2, 9, 8, 7, 10, 9, 5, 2, 4, 1, 1, 0])
1.65291949506
Notes
You can use this method with a sliding window (e.g. the last 30 days) if you do not wish to take too much history into account, which will make short-term trends more pronounced and can cut down on the processing time.
You could also use a z-score for values such as change in views from one day to next day to locate the abnormal values for increasing/decreasing views per day. This is like using the slope or derivative of the views per day graph.
If you keep track of the current size of the population, the current total of the population, and the current total of x^2 of the population, you don't need to recalculate these values, only update them and hence you only need to keep these values for the history, not each data value. The following code demonstrates this.
from math import sqrt

class zscore:
    def __init__(self, pop = []):
        self.number = float(len(pop))
        self.total = sum(pop)
        self.sqrTotal = sum(x ** 2 for x in pop)
    def update(self, value):
        self.number += 1.0
        self.total += value
        self.sqrTotal += value ** 2
    def avg(self):
        return self.total / self.number
    def std(self):
        return sqrt((self.sqrTotal / self.number) - self.avg() ** 2)
    def score(self, obs):
        return (obs - self.avg()) / self.std()
Using this method your workflow would be as follows. For each topic, tag, or page, create floating-point fields in your database for the total number of days, the sum of views, and the sum of views squared. If you have historic data, initialize these fields using that data; otherwise initialize them to zero. At the end of each day, calculate the z-score using the day's number of views against the historic data stored in the three database fields. The topics, tags, or pages with the highest X z-scores are your X "hottest trends" of the day. Finally, update each of the 3 fields with the day's value and repeat the process the next day.
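The daily ranking step of this workflow could look roughly like the following Python sketch. The dictionaries stand in for the database fields, and the names `z_from_totals`, `hottest`, and `totals` are invented for this illustration. Returning 0 when the history has zero spread is an assumption made here to keep the ranking finite.

```python
from math import sqrt


def totals(pop):
    """Build the three stored fields from historic data."""
    return (float(len(pop)), float(sum(pop)), float(sum(x ** 2 for x in pop)))


def z_from_totals(obs, number, total, sqr_total):
    """z-score of obs from (count, sum, sum of squares)."""
    avg = total / number
    std = sqrt(max(sqr_total / number - avg ** 2, 0.0))
    if std == 0:
        return 0.0  # assumption: no spread in the history means no score
    return (obs - avg) / std


def hottest(history, today, x):
    """Return the x topics whose views today are most abnormally high."""
    scores = {t: z_from_totals(today.get(t, 0), *history[t]) for t in history}
    return sorted(scores, key=scores.get, reverse=True)[:x]


history = {
    "weather": totals([21, 22, 19, 18, 17, 22, 20, 20]),  # always busy
    "eclipse": totals([2, 4, 4, 4, 5, 5, 7, 9]),          # normally quiet
}
today = {"weather": 20, "eclipse": 12}
print(hottest(history, today, 1))  # ['eclipse']
```

Although "weather" has more views today in absolute terms, "eclipse" ranks first because 12 views is far outside its usual range.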
New Addition
Normal z-scores as discussed above do not take into account the order of the data and hence the z-score for an observation of '1' or '9' would have the same magnitude against the sequence [1, 1, 1, 1, 9, 9, 9, 9]. Obviously for trend finding, the most current data should have more weight than older data and hence we want the '1' observation to have a larger magnitude score than the '9' observation. In order to achieve this I propose a floating average z-score. It should be clear that this method is NOT guaranteed to be statistically sound but should be useful for trend finding or similar. The main difference between the standard z-score and the floating average z-score is the use of a floating average to calculate the average population value and the average population value squared. See code for details:
Code
class fazscore:
    def __init__(self, decay, pop = []):
        self.sqrAvg = self.avg = 0
        # The rate at which the historic data's effect will diminish.
        self.decay = decay
        for x in pop: self.update(x)
    def update(self, value):
        # Set initial averages to the first value in the sequence.
        if self.avg == 0 and self.sqrAvg == 0:
            self.avg = float(value)
            self.sqrAvg = float((value ** 2))
        # Calculate the average of the rest of the values using a
        # floating average.
        else:
            self.avg = self.avg * self.decay + value * (1 - self.decay)
            self.sqrAvg = self.sqrAvg * self.decay + (value ** 2) * (1 - self.decay)
        return self
    def std(self):
        # Somewhat ad-hoc standard deviation calculation.
        return sqrt(self.sqrAvg - self.avg ** 2)
    def score(self, obs):
        if self.std() == 0: return (obs - self.avg) * float("infinity")
        else: return (obs - self.avg) / self.std()
Sample IO
>>> fazscore(0.8, [1, 1, 1, 1, 1, 1, 9, 9, 9, 9, 9, 9]).score(1)
-1.67770595327
>>> fazscore(0.8, [1, 1, 1, 1, 1, 1, 9, 9, 9, 9, 9, 9]).score(9)
0.596052006642
>>> fazscore(0.9, [2, 4, 4, 4, 5, 5, 7, 9]).score(12)
3.46442230724
>>> fazscore(0.9, [2, 4, 4, 4, 5, 5, 7, 9]).score(22)
7.7773245459
>>> fazscore(0.9, [21, 22, 19, 18, 17, 22, 20, 20]).score(20)
-0.24633160155
>>> fazscore(0.9, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1]).score(20)
1.1069362749
>>> fazscore(0.9, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1]).score(2)
-0.786764452966
>>> fazscore(0.9, [1, 2, 0, 3, 1, 3, 1, 2, 9, 8, 7, 10, 9, 5, 2, 4, 1, 1, 0]).score(9)
1.82262469243
>>> fazscore(0.8, [40] * 200).score(1)
-inf
Update
As David Kemp correctly pointed out, if the history is a series of constant values and a z-score is requested for an observed value that differs from them, the result should probably be non-zero. In fact, the value returned should be infinity. So I changed this line,
if self.std() == 0: return 0
to:
if self.std() == 0: return (obs - self.avg) * float("infinity")
This change is reflected in the fazscore solution code. If one does not want to deal with infinite values an acceptable solution could be to instead change the line to:
if self.std() == 0: return obs - self.avg
You need an algorithm that measures the velocity of a topic - or in other words, if you graph it you want to show those that are going up at an incredible rate.
This is the first derivative of the trend line, and it is not difficult to incorporate as a weighted factor of your overall calculation.
Normalize
One technique you'll need is to normalize all your data. For each topic you are following, keep a very low pass filter that defines that topic's baseline. Now every data point that comes in about that topic should be normalized - subtract its baseline and you'll get ALL of your topics near 0, with spikes above and below the line. You may instead want to divide the signal by its baseline magnitude, which will bring the signal to around 1.0 - this not only brings all signals in line with each other (normalizes the baseline), but also normalizes the spikes. A Britney spike is going to be orders of magnitude larger than someone else's spike, but that doesn't mean you should pay attention to it - the spike may be very small relative to her baseline.
Derive
Once you've normalized everything, figure out the slope of each topic. Take two consecutive points, and measure the difference. A positive difference is trending up, a negative difference is trending down. Then you can compare the normalized differences, and find out what topics are shooting upward in popularity compared to other topics - with each topic scaled appropriately to its own 'normal', which may be orders of magnitude different from other topics.
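A minimal sketch of the normalize-then-derive idea, assuming a simple exponential smoother as the "very low pass filter" and using the divide-by-baseline variant. The function name `trend_slopes` and the default `alpha` are illustrative choices, not values from this answer.

```python
def trend_slopes(series, alpha=0.05):
    """Slope between consecutive points after dividing each by a slow baseline.

    `alpha` is the smoothing factor of the low pass filter (an assumed value);
    smaller values make the baseline move more slowly.
    """
    baseline = float(series[0])
    normalized = []
    for value in series:
        # Normalise against the baseline built from earlier points only.
        normalized.append(value / baseline if baseline else 0.0)
        # Low pass filter: move the baseline a small step towards the value.
        baseline += alpha * (value - baseline)
    # First difference of the normalised signal.
    return [b - a for a, b in zip(normalized, normalized[1:])]

big = trend_slopes([1000, 1000, 1000, 1100])   # +100 hits on a large baseline
small = trend_slopes([10, 10, 10, 20])         # +10 hits on a small baseline
print(small[-1] > big[-1])                     # True: the small topic rises faster
```

After normalization the small topic's jump shows a larger slope than the big topic's, even though its absolute increase is ten times smaller.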
This is really a first-pass at the problem. There are more advanced techniques which you'll need to use (mostly a combination of the above with other algorithms, weighted to suit your needs) but it should be enough to get you started.
Regarding the article
The article is about topic trending, but it's not about how to calculate what's hot and what's not; it's about how to process the huge amount of information that such an algorithm must process at places like Lycos and Google. The space and time required to give each topic a counter, and to find each topic's counter when a search on it goes through, is huge. This article is about the challenges one faces when attempting such a task. It does mention the Britney effect, but it doesn't talk about how to overcome it.
As Nixuz points out this is also referred to as a Z or Standard Score.
Chad Birch and Adam Davis are correct in that you will have to look backward to establish a baseline. Your question, as phrased, suggests that you only want to view data from the past 24 hours, and that won't quite fly.
One way to give your data some memory without having to query for a large body of historical data is to use an exponential moving average. The advantage of this is that you can update this once per period and then flush all old data, so you only need to remember a single value. So if your period is a day, you have to maintain a "daily average" attribute for each topic, which you can do by:
a_n = a_(n-1)*b + c_n*(1-b)
Where a_n is the moving average as of day n, b is some constant between 0 and 1 (the closer to 1, the longer the memory) and c_n is the number of hits on day n. The beauty is if you perform this update at the end of day n, you can flush c_n and a_(n-1).
The one caveat is that it will be initially sensitive to whatever you pick for your initial value of a.
EDIT
If it helps to visualize this approach, take n = 5, a_0 = 1, and b = .9.
Let's say the new values are 5,0,0,1,4:
a_0 = 1
c_1 = 5 : a_1 = .9*1 + .1*5 = 1.4
c_2 = 0 : a_2 = .9*1.4 + .1*0 = 1.26
c_3 = 0 : a_3 = .9*1.26 + .1*0 = 1.134
c_4 = 1 : a_4 = .9*1.134 + .1*1 = 1.1206
c_5 = 4 : a_5 = .9*1.1206 + .1*4 = 1.40854
Doesn't look very much like an average, does it? Note how the value stayed close to 1, even though our next input was 5. What's going on? If you expand out the math, what you get is:
a_n = (1-b)*c_n + (1-b)*b*c_(n-1) + (1-b)*b^2*c_(n-2) + ... + (leftover weight)*a_0
What do I mean by leftover weight? Well, in any average, all weights must add to 1. If n were infinity and the ... could go on forever, then all weights would sum to 1. But if n is relatively small, you get a good amount of weight left on the original input.
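The expansion can be checked numerically. In the sketch below (helper names `ema_weights` and `ema` are made up for illustration), the weight on the k-th most recent value is (1-b)*b^k, and the leftover weight on a_0 works out to b^n, which follows from summing the geometric series.

```python
def ema_weights(n, b):
    """Weights on c_n, c_(n-1), ..., c_1, plus the leftover weight on a_0."""
    data_weights = [(1 - b) * b ** k for k in range(n)]
    leftover = b ** n
    return data_weights, leftover

def ema(a0, values, b):
    """Apply a_n = a_(n-1)*b + c_n*(1-b) over `values` (oldest first)."""
    a = a0
    for c in values:
        a = a * b + c * (1 - b)
    return a

values = [5, 0, 0, 1, 4]                     # c_1 .. c_5 from the example
weights, leftover = ema_weights(len(values), 0.9)

# Newest value gets the first weight, so pair weights with reversed values.
expanded = sum(w * c for w, c in zip(weights, reversed(values))) + leftover * 1
print(abs(expanded - ema(1, values, 0.9)) < 1e-12)   # True: both forms agree
print(round(leftover, 5))                            # 0.59049
```

With n = 5 and b = .9, about 59% of the weight still sits on a_0, which is why the result stays so close to the initial value of 1.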
If you study the above formula, you should realize a few things about this usage:
All data contributes something to the average forever. Practically speaking, there is a point where the contribution is really, really small.
Recent values contribute more than older values.
The higher b is, the less important new values are and the longer old values matter. However, the higher b is, the more data you need to water down the initial value of a.
I think the first two characteristics are exactly what you are looking for. To give you an idea of how simple this can be to implement, here is a Python implementation (minus all the database interaction):
>>> class EMA(object):
...     def __init__(self, base, decay):
...         self.val = base
...         self.decay = decay
...         print(self.val)
...     def update(self, value):
...         self.val = self.val*self.decay + (1-self.decay)*value
...         print(self.val)
...
>>> a = EMA(1, .9)
1
>>> a.update(10)
1.9
>>> a.update(10)
2.71
>>> a.update(10)
3.439
>>> a.update(10)
4.0951
>>> a.update(10)
4.68559
>>> a.update(10)
5.217031
>>> a.update(10)
5.6953279
>>> a.update(10)
6.12579511
>>> a.update(10)
6.513215599
>>> a.update(10)
6.8618940391
>>> a.update(10)
7.17570463519
Typically "buzz" is figured out using some form of exponential/log decay mechanism. For an overview of how Hacker News, Reddit, and others handle this in a simple way, see this post.
This doesn't fully address the things that are always popular. What you're looking for seems to be something like Google's "Hot Trends" feature. For that, you could divide the current value by a historical value and then subtract out ones that are below some noise threshold.
I think the key word you need to notice is "abnormally". In order to determine when something is "abnormal", you have to know what is normal. That is, you're going to need historical data, which you can average to find out the normal rate of a particular query. You may want to exclude abnormal days from the averaging calculation, but again that'll require having enough data already, so that you know which days to exclude.
From there, you'll have to set a threshold (which would require experimentation, I'm sure), and if something goes outside the threshold, say 50% more searches than normal, you can consider it a "trend". Or, if you want to be able to find the "Top X Trendiest" like you mentioned, you just need to order things by how far (percentage-wise) they are away from their normal rate.
For example, let's say that your historical data has told you that Britney Spears usually gets 100,000 searches, and Paris Hilton usually gets 50,000. If you have a day where they both get 10,000 more searches than normal, you should be considering Paris "hotter" than Britney, because her searches increased 20% more than normal, while Britney's were only 10%.
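The percentage-wise ordering could be sketched like this. The function name `rank_by_relative_increase` and the `min_increase` parameter are hypothetical; the numbers are the ones from the example above.

```python
def rank_by_relative_increase(normal, today, min_increase=0.0):
    """Order topics by fractional increase over their normal rate, highest first.

    `normal` and `today` map topic -> search count. Topics at or below
    `min_increase` (e.g. 0.5 for "50% more than normal") are dropped.
    """
    increases = {
        topic: (today[topic] - base) / base
        for topic, base in normal.items()
        if base > 0 and topic in today
    }
    return sorted(
        ((topic, inc) for topic, inc in increases.items() if inc > min_increase),
        key=lambda pair: pair[1],
        reverse=True,
    )

normal = {"Britney Spears": 100000, "Paris Hilton": 50000}
today = {"Britney Spears": 110000, "Paris Hilton": 60000}
for topic, inc in rank_by_relative_increase(normal, today):
    print(topic, round(inc * 100), "%")
# Paris Hilton 20 %
# Britney Spears 10 %
```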
God, I can't believe I just wrote a paragraph comparing "hotness" of Britney Spears and Paris Hilton. What have you done to me?
I was wondering if it is at all possible to use regular physics acceleration formula in such a case?
(v2 - v1)/t or dv/dt
We can consider v1 to be initial likes/votes/count-of-comments per hour and v2 to be current "velocity" per hour in last 24 hours?
This is more like a question than an answer, but seems it may just work. Any content with highest acceleration will be the trending topic...
I am sure this may not solve Britney Spears problem :-)
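A small sketch of that idea, assuming each item already has an initial and a current rate in events per hour. The function names and the 24-hour default are illustrative only.

```python
def acceleration(v1, v2, hours):
    """Change in activity rate per hour: (v2 - v1) / t."""
    return (v2 - v1) / hours

def fastest_accelerating(rates, hours=24.0):
    """Pick the item with the highest acceleration.

    `rates` maps item -> (initial rate, current rate), both per hour.
    """
    def accel(item):
        v1, v2 = rates[item]
        return acceleration(v1, v2, hours)
    return max(rates, key=accel)

rates = {"new post": (10, 34), "evergreen post": (500, 512)}
print(fastest_accelerating(rates))  # new post
```

Because this uses the absolute change in rate, an item with a huge baseline can still win on a modest relative bump, which is the Britney Spears problem mentioned above.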
probably a simple gradient of topic frequency would work -- large positive gradient = growing quickly in popularity.
the easiest way would be to bin the number of searches each day, so you have something like
searches = [ 10, 7, 14, 8, 9, 12, 55, 104, 100 ]
and then find out how much it changed from day to day:
hot_factor = [ b-a for a, b in zip(searches[:-1], searches[1:]) ]
# hot_factor is [ -3, 7, -6, 1, 3, 43, 49, -4 ]
and just apply some sort of threshold so that days where the increase was > 50 are considered 'hot'. you could make this far more complicated if you'd like, too. rather than absolute difference you can take the relative difference so that going from 100 to 150 is considered hot, but 1000 to 1050 isn't. or a more complicated gradient that takes into account trends over more than just one day to the next.
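The relative-difference variant might look like the following sketch. The function names and the 0.4 threshold are assumptions chosen so that 100 to 150 counts as hot while 1000 to 1050 does not.

```python
def relative_changes(searches):
    """Fractional change from each day to the next; None where the earlier day is 0."""
    return [
        (b - a) / a if a else None
        for a, b in zip(searches[:-1], searches[1:])
    ]

def hot_days(searches, threshold=0.4):
    """Indices of days whose count rose by more than `threshold` over the day before."""
    changes = relative_changes(searches)
    return [
        i + 1
        for i, change in enumerate(changes)
        if change is not None and change > threshold
    ]

searches = [10, 7, 14, 8, 9, 12, 55, 104, 100]
print(hot_days(searches))       # [2, 6, 7]
print(hot_days([100, 150]))     # [1]
print(hot_days([1000, 1050]))   # []
```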
I worked on a project where my aim was to find Trending Topics from a live Twitter stream and also to do sentiment analysis on them (finding whether a Trending Topic is talked about positively or negatively). I used Storm to handle the Twitter stream.
I've published my report as a blog: http://sayrohan.blogspot.com/2013/06/finding-trending-topics-and-trending.html
I've used Total Count and Z-Score for the ranking.
The approach that I've used is a bit generic, and in the discussion section I've described how the system can be extended to non-Twitter applications.
Hope the information helps.
You could use log-likelihood-ratios to compare the current date with the last month or year. This is statistically sound (given that your events are not normally distributed, which is to be assumed from your question).
Just sort all your terms by logLR and pick the top ten.
public static void main(String... args) {
    TermBag today = ...
    TermBag lastYear = ...
    for (String each: today.allTerms()) {
        System.out.println(logLikelihoodRatio(today, lastYear, each) + "\t" + each);
    }
}

public static double logLikelihoodRatio(TermBag t1, TermBag t2, String term) {
    double k1 = t1.occurrences(term);
    double k2 = t2.occurrences(term);
    double n1 = t1.size();
    double n2 = t2.size();
    double p1 = k1 / n1;
    double p2 = k2 / n2;
    double p = (k1 + k2) / (n1 + n2);
    double logLR = 2*(logL(p1,k1,n1) + logL(p2,k2,n2) - logL(p,k1,n1) - logL(p,k2,n2));
    if (p1 < p2) logLR *= -1;
    return logLR;
}

private static double logL(double p, double k, double n) {
    return (k == 0 ? 0 : k * Math.log(p)) + ((n - k) == 0 ? 0 : (n - k) * Math.log(1 - p));
}
PS, a TermBag is an unordered collection of words. For each document you create one bag of terms. Just count the occurrences of words. Then the method occurrences returns the number of occurrences of a given word, and the method size returns the total number of words. It is best to normalize the words somehow, typically toLowerCase is good enough. Of course, in the above examples you would create one document with all queries of today, and one with all queries of the last year.
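For readers following the Python examples elsewhere in this thread, here is a rough port of the same calculation, with `collections.Counter` standing in for the TermBag. The names `term_bag`, `log_l`, and `log_likelihood_ratio` are my own, and splitting queries on whitespace is an assumption about tokenization.

```python
from collections import Counter
from math import log

def term_bag(queries):
    """Lower-cased word counts over a list of query strings."""
    return Counter(word for q in queries for word in q.lower().split())

def log_l(p, k, n):
    """Binomial log-likelihood term, with 0 * log(0) taken as 0."""
    left = k * log(p) if k else 0.0
    right = (n - k) * log(1 - p) if n - k else 0.0
    return left + right

def log_likelihood_ratio(bag1, bag2, term):
    """Signed log-likelihood ratio of `term` in bag1 versus bag2."""
    k1, k2 = bag1[term], bag2[term]            # Counter returns 0 if missing
    n1, n2 = sum(bag1.values()), sum(bag2.values())
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)
    llr = 2 * (log_l(p1, k1, n1) + log_l(p2, k2, n2)
               - log_l(p, k1, n1) - log_l(p, k2, n2))
    # Negative sign marks terms that are rarer today than in the reference.
    return -llr if p1 < p2 else llr

today = term_bag(["michael jackson", "michael jackson news", "weather"])
last_year = term_bag(["weather", "weather today", "news", "michael jordan"])
ranked = sorted(today, key=lambda t: log_likelihood_ratio(today, last_year, t),
                reverse=True)
print(ranked[:10])
```

Both bags must be non-empty, since the rates divide by the bag sizes.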
If you simply look at tweets, or status messages to get your topics, you're going to encounter a lot of noise. Even if you remove all stop words. One way to get a better subset of topic candidates is to focus only on tweets/messages that share a URL, and get the keywords from the title of those web pages. And make sure you apply POS tagging to get nouns + noun phrases as well.
Titles of web pages usually are more descriptive and contain words that describe what the page is about. In addition, sharing a web page usually is correlated with sharing news that is breaking (ie if a celebrity like Michael Jackson died, you're going to get a lot of people sharing an article about his death).
I've run experiments where I only take popular keywords from titles, AND then get the total counts of those keywords across all status messages, and they definitely remove a lot of noise. If you do it this way, you don't need a complex algorithm; just do a simple ordering of the keyword frequencies, and you're halfway there.
The idea is to keep track of such things and notice when they jump significantly as compared to their own baseline.
So, for queries that have more than a certain threshold, track each one, and when it changes to some multiple (say almost double) of its historical value, then it is a new hot trend.
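As a sketch of that rule, assuming a stored historical average per query: the function name, the `min_volume` of 100, and the `factor` of 1.9 (standing in for "almost double") are all illustrative values, not taken from the answer.

```python
def new_hot_trends(history_avg, current, min_volume=100, factor=1.9):
    """Queries above `min_volume` whose count is at least `factor` times baseline.

    `history_avg` maps query -> historical average count,
    `current` maps query -> count for the current period.
    """
    hot = []
    for query, count in current.items():
        baseline = history_avg.get(query)
        if baseline is None or baseline <= 0 or count < min_volume:
            continue  # no usable baseline, or too small to be meaningful
        if count >= factor * baseline:
            hot.append(query)
    return hot

history = {"a": 1000, "b": 200, "c": 5}
current = {"a": 1200, "b": 450, "c": 20, "d": 900}
print(new_hot_trends(history, current))  # ['b']
```

Queries with no history (like "d" here) are skipped in this sketch; a real system would need its own rule for brand-new queries.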
