How does d3.scale.quantile work? - d3.js

What is the meaning of this statement?
quantize = d3.scale.quantile().domain([0, 15]).range(d3.range(9));
I saw that the domain is:
0 - 0
1 - 15
The range is from 0 to 8, and quantize.quantiles() gives:
0 - 1.6
1 - 3.3
2 - 4.9
3 - 6.6
4 - 8.3
5 - 9.9
6 - 11.6
7 - 13.3
How are the values of quantize.quantiles calculated? I tried to call quantize(2) but the result was 1. How does quantile work?

The motivation of the quantile scale is to obtain classes which are representative of the actual distribution of the values in the dataset. Therefore, it is necessary to provide it during construction with the full list of values. The scale then splits the input domain (defined by these values) into intervals (quantiles) in such a way that about the same number of values falls into each of the intervals.
From the documentation:
To compute the quantiles, the input domain is sorted, and treated as a population of discrete values.
Hence, when specifying the domain we hand the scale the whole list of values:
var scale = d3.scale.quantile()
    .domain([1, 1, 2, 3, 2, 3, 16])
    .range(['blue', 'white', 'red']);
If we then run:
scale.quantiles()
It will output [2, 3] which means that our population of values was split into these three subsets represented by 'blue', 'white', and 'red' respectively:
[1, 1] [2, 2] [3, 3, 16]
Note that this scale should be avoided when there are outliers in the data which you want to show. In the above example 16 is an outlier falling into the upper quantile. It is assigned the same class as 3, which is probably not the desired behavior:
scale(3) // will output "red"
scale(16) // will output "red"

I would recommend reading over the quantile scale documentation, especially the part on quantize.quantiles().
But basically, d3 sees that there are 9 values in the output range for this scale, so it creates 9 quantiles based on the 2-value data set [0, 15].
This leads to the quantize.quantiles() values that you show in your question, [1.6, 3.3, ..., 13.3]. These represent the bounds of the quantiles: anything less than 1.6 will be mapped to the first element of the output range (in this case 0). Anything of at least 1.6 and less than 3.3 will be mapped to the second element of the output range (1). Hence quantize(2) returns 1, as expected.
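As a rough illustration of the arithmetic, here is a minimal Python sketch (not d3 itself). It assumes the cut points are found by linear interpolation between the sorted domain values, which is how the d3 documentation describes its quantile computation. The names `quantile_thresholds` and `quantile_scale` are made up for this example.

```python
from bisect import bisect_right
from math import floor

def quantile_thresholds(domain, n_classes):
    """Return the n_classes - 1 cut points for a quantile scale.

    Assumption: cut points are linearly interpolated between the sorted
    domain values (the "R-7" quantile method).
    """
    values = sorted(domain)
    thresholds = []
    for k in range(1, n_classes):
        # Fractional index of the k-th cut point in the sorted values.
        h = (len(values) - 1) * k / n_classes
        lo = floor(h)
        frac = h - lo
        cut = values[lo]
        if frac:
            cut += frac * (values[lo + 1] - values[lo])
        thresholds.append(cut)
    return thresholds

def quantile_scale(domain, output_range):
    """Build a function mapping a value to the class of its interval."""
    thresholds = quantile_thresholds(domain, len(output_range))
    return lambda x: output_range[bisect_right(thresholds, x)]

quantize = quantile_scale([0, 15], list(range(9)))
print(quantile_thresholds([0, 15], 9))  # eight cut points: 15/9, 30/9, ...
print(quantize(2))                      # 2 lies between 15/9 and 30/9
```

With only two domain values, the cut points are simply evenly spaced between 0 and 15, which is why the printed list matches the (truncated) numbers in the question.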

Related

Minimum number of additional weights required in order to weight items in range from 1 to 100

Given a set of weights S={w1,w2,w3} and a range of weights, we need to determine whether the weights in S can be used to balance every weight in the range. If not, we need to add the minimum additional weights to S so that all of the weights in the range can be balanced.
For example:
Range is 1 to 5
S = {4,8,9}
The item with weight 1 can be balanced by putting the item on the left pan along with the 8, and put the 9 on the right pan.
1 + 8 = 9
3 + 9 = 8 + 4
4 = 4
5 + 8 = 9 + 4
But 2 can't be balanced using the weights {4,8,9} so we need to add another weight. Adding a weight of 1 allows 2 to be balanced with
2 + 8 = 1 + 9
My question: is there a mathematical algorithm that can be used to solve this problem?
There certainly are algorithms that would solve this. For clarity's sake, I'm assuming your use of the term "set" is the mathematical set, where all set elements are distinct, though this should not affect the below code all that much.
Breaking down the problem into 2 parts:
(1) Determine if the provided set of weights can be arranged on the scale such that the required range of integer values are covered
A solution to part (1), in python:
(to run, call check_range(int, int, []), where the first two args are the integer bounds of the range, low/high respectively, and the 3rd arg is a list of the weights in set s)
def get_possible_offsets(s=[]):
    #the variable set "temp" will hold the possible offsets that we can create by arranging the weights on the scale
    temp=set()
    #optionally, we don't need to add any of the weights, ergo add value 0 by default
    temp.add(0)
    #per every weight in the given set of weights
    for weight in s:
        #take an iterable snapshot of our set of possible offsets
        l = list(temp)
        #for each value in that list, |i+/-w| the weight value
        for i in l:
            temp.add(i + weight)
            temp.add(abs(i - weight))
        #and also add the weight by itself
        temp.add(weight)
    return(temp)

def check_range(r_low=0, r_high=1, s=[]):
    #get the set of weight values available using the provided set of weights
    possible_offsets = get_possible_offsets(s)
    #list to store the required weight values not available using the provided set of weights
    missing_offsets = []
    #for each required weight in the range, check if that weight exists in our current possible offsets
    for i in range(r_low, r_high+1):
        if i not in possible_offsets:
            missing_offsets.append(i)
    #if we're not missing any values from the required range, then we are done!
    if len(missing_offsets) == 0:
        print ("Yes! The required range is covered by the provided weights.")
    else:
        print ("Tragically, the following weight offsets are not covered:",missing_offsets)
(2) If (1) is false, then determine the minimum required additional weights to complete the required range
I have not added a complete solution to part (2) yet. However, we just need to take the missing_offsets list in the above code and boil it down to the additional weight values that could be included in the set of possible_offsets, as performed in the lines of code:
    for i in l:
        temp.add(i + weight)
        temp.add(abs(i - weight))
This problem also sounds a lot like search-tree algos (though not binary), as well as combinatorics, so there are likely several efficient ways of calculating the desired output.
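Since part (2) is left open above, one possible way to finish it is a plain exhaustive search: try adding 0, 1, 2, ... extra weights and stop at the first combination that covers the range. The sketch below is only a brute-force illustration. The names `reachable` and `min_additional_weights`, and the `max_candidate` cap on candidate weight values, are assumptions of this example, and the result is only minimal among candidates up to that cap.

```python
from itertools import combinations

def reachable(weights):
    """All pan differences obtainable with a subset of the weights."""
    diffs = {0}
    for w in weights:
        # Each weight goes on the same pan (+), the other pan (-), or is unused.
        diffs |= {abs(d + sign * w) for d in diffs for sign in (1, -1)}
    return diffs

def min_additional_weights(low, high, weights, max_candidate=None):
    """Smallest tuple of extra weights (values 1..max_candidate) covering low..high."""
    if max_candidate is None:
        max_candidate = high  # assumed cap, not a proven bound
    needed = set(range(low, high + 1))
    candidates = [c for c in range(1, max_candidate + 1) if c not in weights]
    for k in range(len(candidates) + 1):
        for extra in combinations(candidates, k):
            if needed <= reachable(list(weights) + list(extra)):
                return extra
    return None  # nothing within the cap works

print(min_additional_weights(1, 5, [4, 8, 9]))
```

The search is exponential in the number of added weights, so it is only practical for small ranges; it is meant to pin down the expected answer, not to be efficient.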
The set of absolute differences between each side of the scale is our range. Let's enumerate them for S, aggregating each element in turn (add and subtract each element to each previously seen absolute difference, then add the element itself as a difference):
S: {4, 8, 9}
up to element S_0:
4
up to element S_1:
4, 12, 8
up to element S_2:
4, 12, 8, 13, 5, 21, 3, 1, 17, 9
Now let's order them:
1, 3, 4, 5, 8, 9, 12, 13, 17, 21
To cover our range, 1 to 5, we need to fill the gap between 1 and 3. Adding a 1 will add ±1 to every difference we can create.
Would it not be the case that to cover any range, we would need to add ceil(k / 2) 1's, where k is the maximum gap in our range, when considering our enumerated differences? In this case, ceil(1 / 2) = one 1?
As ruakh commented below, this is not the case. Any lower range we can build can in fact be used to fill in gaps anywhere, and the coverage of the filled-in range can be applied again to growing ranges. For example:
{1, 2} covers 1 to 3
Now add 7 and we've increased our range to 1 - 10 by virtue of applying ±3 to 7
Now we can add 21 and achieve the range 21 ± 10!
This points to the possibility of overlapping subproblems.
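To make the enumeration above concrete, here is a small Python sketch of the same aggregation step. The helper names `differences` and `covers` are invented for this illustration.

```python
def differences(weights):
    """Absolute pan differences reachable with the given weights."""
    seen = set()
    for w in weights:
        # Add and subtract w from every earlier difference, then add w itself.
        seen |= {d + w for d in seen} | {abs(d - w) for d in seen} | {w}
    seen.discard(0)  # a zero difference weighs nothing
    return seen

def covers(weights, low, high):
    """True if every integer weight in low..high can be balanced."""
    return set(range(low, high + 1)) <= differences(weights)

print(sorted(differences([4, 8, 9])))  # compare with the ordered list above
print(covers([4, 8, 9], 1, 5))         # 2 is missing
print(covers([1, 2], 1, 3), covers([1, 2, 7], 1, 10), covers([1, 2, 7, 21], 1, 31))
```

The last line checks the growing-range example: each new weight extends the covered range by reusing everything the smaller weights could already reach.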

Randomly select N unique elements from a list, given a probability for each

I've run into a problem: I have a list or array (IList) of elements that have a field (float Fitness). I need to efficiently choose N random unique elements depending on this variable: the bigger - the more likely it is to be chosen.
I searched on the internet, but the algorithms I found were rather unreliable.
The answer stated here seems to have a bigger probability at the beginning which I need to make sure to avoid.
-Edit-
For example I need to choose from objects with the values [-5, -3, 0, 1, 2.5] (negative values included).
The basic algorithm is to sum the values, fix an order for the items, draw a point between 0 and sum(values), and see which item's window it "intersects".
For the values [0.1, 0.2, 0.3] the "windows" [0-0.1, 0.1-0.3, 0.3-0.6] will look like this:
1 23 456
|-|--|---|
|-*--*---|
You then draw a point in [0, 0.6] and see which window it hits on the axis.
In Python, this looks like (assuming all values are non-negative):
import random

def pick(original_values):
    # list is to order them, order doesn't matter outside this context.
    values = list(original_values)
    # limit
    limit = sum(values)
    draw = random.random() * limit
    while True:
        candidate = values.pop()
        if candidate > draw:
            return candidate
        draw -= candidate
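The question asks for N unique elements and includes negative values, which the loop above does not handle on its own. One way to extend it, sketched here under assumptions that are not from the original answer: shift every fitness so the smallest becomes a small positive number (`epsilon`, a made-up parameter), then repeat the window draw N times, removing each winner. The function name `pick_unique` is hypothetical.

```python
import random

def pick_unique(fitnesses, n, epsilon=1e-6, rng=random):
    """Pick n distinct indices, favouring higher fitness.

    Assumption: shifting by the minimum is an acceptable way to turn
    fitness values (possibly negative) into positive weights.
    """
    if n == 0:
        return []
    if n > len(fitnesses):
        raise ValueError("cannot pick more unique items than exist")
    shift = epsilon - min(fitnesses)
    remaining = {i: f + shift for i, f in enumerate(fitnesses)}
    chosen = []
    for _ in range(n):
        draw = rng.random() * sum(remaining.values())
        for index, weight in remaining.items():
            if draw < weight:
                break
            draw -= weight
        # `index` is the window that was hit (or the last one, if
        # floating-point round-off pushed the draw past the end).
        chosen.append(index)
        del remaining[index]
    return chosen

print(pick_unique([-5, -3, 0, 1, 2.5], 3))
```

Each draw costs time proportional to the number of remaining items, so this is fine for small populations; larger ones may want a smarter structure.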
So what shall those numbers represent?
Does 2.5 mean that the probability of being chosen is twice as high as for 1.25? Well, the negative values don't fit into that scheme.
I guess fitness means something like -5: very ill, 2.5: very fit. We have a range of 7.5 and could randomly pick an element, if we know how many candidates there are and if we have access by index.
Then take a random number between -5 and 2.5 and see if it is lower than or equal to the candidate's fitness. If so, the candidate is picked; otherwise we repeat from step 1. I would say that we should generate a new threshold on every attempt, because if we drew a 2.5 but no candidate with that fitness remains, we would search forever.
The range of fitnesses has to be known for this, too.
fitness:       -5    -3     0     1   2.5
rand = -5       x     x     x     x     x
rand = -2.5     -     -     x     x     x
rand =  0       -     -     x     x     x
rand =  2.5     -     -     -     -     x
If every candidate shall be tested every round, and the -5 guy shall have a chance to survive, you have to stretch the interval of random numbers a bit, for instance from -6 to 3.
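The accept/reject procedure described above could look roughly like the following Python sketch. The function name `pick_by_threshold` and the `margin` parameter (how far the threshold interval is stretched below the lowest fitness) are assumptions made for illustration.

```python
import random

def pick_by_threshold(fitnesses, n, margin=1.0, rng=random):
    """Pick n distinct indices by repeated accept/reject rounds.

    Assumption: thresholds are drawn uniformly from the fitness range
    widened by `margin` (> 0) at the low end, so the weakest candidate
    can still pass.
    """
    if n == 0:
        return []
    if n > len(fitnesses):
        raise ValueError("cannot pick more unique items than exist")
    low, high = min(fitnesses) - margin, max(fitnesses)
    remaining = list(range(len(fitnesses)))
    chosen = []
    while len(chosen) < n:
        # Step 1: pick a random surviving candidate by index.
        index = rng.choice(remaining)
        # Step 2: draw a fresh threshold each time, so the search cannot stall.
        threshold = rng.uniform(low, high)
        if threshold <= fitnesses[index]:
            chosen.append(index)
            remaining.remove(index)
    return chosen

print(pick_by_threshold([-5, -3, 0, 1, 2.5], 3))
```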

ismember fails to find a number generated by bwlabel

Following this post by steve: http://blogs.mathworks.com/steve/2009/02/27/using-ismember-with-the-output-of-regionprops/
I wanted to apply it on a very simple case. Here is the logical image that I have, it clearly has three objects:
This is the code I used :
[L_t,n_t] = bwlabel(logical_image);
iii = find(L_t == 2);
bbb = ismember(L_t,iii);
imshow(bbb);
But all I am getting in bbb is an empty mask, i.e. a logical image the same size as the original but consisting entirely of 0.
n_t shows 3 objects found, and the max value of L_t is 3. Then how come ismember fails to find 2?
It doesn't work because iii is a list of indices (positions in L_t where L_t == 2), whereas L_t itself holds label values from 0 to 3. This is not the same as what they are doing in the original example:
idx = find((100 <= area_values) & (area_values <= 1000))
Here, area_values is a list taken from regionprops of the area of the different regions in your labelled image. It has the same length, n, as the number of regions (different values) in L. e.g. if there are 10 areas in the image and areas 1, 3, and 7 have areas in the specified range the output of idx is [1 3 7].
This then selects the parts of L where L is 1, 3, or 7:
bw2 = ismember(L, idx);
In your case, iii is a list of pixel indices, not label values. ismember(L_t, iii) would only match if one of those indices happened to equal 1, 2 or 3, which are the linear indices of the first three pixels of the image. Therefore none of the values in L_t match any of the values in iii.
If all you wanted was an image showing only the second object, then this is enough:
bbb = L_t==2;
imshow(bbb)

Selecting a surviving population in a "voter" Genetic Algorithm

I've been working on a genetic algorithm where there is a population consisting of individuals with a color, and a preference. Preference and color are from a small number of finite states, probably around 4 or 5. (example: 1|1, 5|2, 3|3 etc)
Every individual casts a "vote" for their preference, which assists those individuals with that vote as their color.
My current idea is to cycle through every individual, and calculate the chance that they should survive, based on number of votes, etc. and then roll a die to see if they live.
I'm currently doing it so that if v[x] represents the percent of votes for color x, individual k with color c has v[c] chance of surviving. However, this means that if there are equal numbers of all 5 types of (a|a) individuals, 4/5 of them perish, and that's not good.
Does anyone have any idea of a method of randomness I could use to determine the chance an individual has to survive? For instance, an algorithm that for v votes for c, v individuals with color c survive (on statistical average).
Assign your fitness (likelihood of survival in your case) to each individual as is, then sort them on descending fitness and use binary tournament selection or something similar to sample another population of your chosen size.
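For readers unfamiliar with binary tournament selection, here is a minimal Python sketch of the idea: draw two individuals at random and keep the fitter one, repeated until the new population is full. The `binary_tournament` name, the sample population, and the use of vote share as fitness are illustrative assumptions.

```python
import random

def binary_tournament(population, fitness, size, rng=random):
    """Sample `size` survivors; each is the fitter of two random individuals."""
    survivors = []
    for _ in range(size):
        a, b = rng.choice(population), rng.choice(population)
        survivors.append(a if fitness(a) >= fitness(b) else b)
    return survivors

# Individuals are (color, preference) pairs; fitness is the share of votes
# cast for the individual's color, as in the question.
population = [(1, 1), (1, 2), (2, 1), (3, 1), (2, 2)]
votes = {}
for _, preference in population:
    votes[preference] = votes.get(preference, 0) + 1

def share(individual):
    return votes.get(individual[0], 0) / len(population)

print(binary_tournament(population, share, size=len(population)))
```

Because every individual still has some chance of being drawn against a weaker opponent, equally popular colors do not wipe each other out the way the straight v[c] survival chance does.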
Well, you can weight the probabilities according to the value returned by passing each member of the population to the cost function. That seems to me the most straightforward way, consistent with the genetic meta-heuristic.
More common, though, is to divide the current population into segments, based on the value returned from passing them to the cost function.
So for instance, if each generation consists of 100 members, then the top N members (N is just a user-defined parameter, often something like 5-10% of the total), i.e. those with the lowest cost function result, are carried forward to the next generation just as they are (elitism).
Perhaps this is what you mean by 'survive.' If so, then again, these 'survivors' are determined by ranking the members of the population according to the cost function value and selecting those members above your defined elitism fraction constant.
The rest (the majority) of the next generation are created either by mutation or cross-over.
mutation:
# one member of the current population:
[4, 5, 1, 7, 4, 2, 8, 9]
# small random change in one member of prior generation, to create mutant that is
# a member of the next generation
[4, 9, 1, 7, 4, 2, 8, 9]
crossover:
# two of the 'top' members of the current generation
[4, 5, 1, 7, 4, 2, 8, 9]
[2, 3, 6, 9, 2, 1, 6, 4]
# offspring is a member of the next generation
[4, 5, 1, 7, 2, 1, 6, 4]
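Put together, the elitism, mutation and crossover steps could be sketched in Python as below. The function names, the 50/50 split between mutation and crossover, and the gene range 0-9 are assumptions made for this example; real implementations tune these.

```python
import random

def mutate(member, rng=random, low=0, high=9):
    """Copy `member` with one randomly chosen gene replaced."""
    child = list(member)
    child[rng.randrange(len(child))] = rng.randint(low, high)
    return child

def crossover(parent_a, parent_b, point):
    """Single-point crossover: head of parent_a, tail of parent_b."""
    return list(parent_a[:point]) + list(parent_b[point:])

def next_generation(population, cost, elite_fraction=0.1, rng=random):
    """Keep the lowest-cost elite unchanged, fill the rest with offspring."""
    ranked = sorted(population, key=cost)
    n_elite = max(1, int(len(ranked) * elite_fraction))
    elite = ranked[:n_elite]
    children = []
    while len(elite) + len(children) < len(population):
        if rng.random() < 0.5:
            children.append(mutate(rng.choice(elite), rng))
        else:
            a, b = rng.choice(ranked), rng.choice(ranked)
            children.append(crossover(a, b, rng.randrange(1, len(a))))
    return elite + children

# Reproduces the crossover example above (cut after the fourth gene).
print(crossover([4, 5, 1, 7, 4, 2, 8, 9], [2, 3, 6, 9, 2, 1, 6, 4], point=4))
```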

What is the best way to compute trending topics or tags?

Many sites offer some statistics like "The hottest topics in the last 24h". For example, Topix.com shows this in its section "News Trends". There, you can see the topics which have the fastest growing number of mentions.
I want to compute such a "buzz" for a topic, too. How could I do this? The algorithm should weight the topics which are always hot less. The topics which normally (almost) no one mentions should be the hottest ones.
Google offers "Hot Trends", topix.com shows "Hot Topics", fav.or.it shows "Keyword Trends" - all these services have one thing in common: They only show you upcoming trends which are abnormally hot at the moment.
Terms like "Britney Spears", "weather" or "Paris Hilton" won't appear in these lists because they're always hot and frequent. This article calls this "The Britney Spears Problem".
My question: How can you code an algorithm or use an existing one to solve this problem? Having a list with the keywords searched in the last 24h, the algorithm should show you the 10 (for example) hottest ones.
I know, in the article above, there is some kind of algorithm mentioned. I've tried to code it in PHP but I don't think that it'll work. It just finds the majority, doesn't it?
I hope you can help me (coding examples would be great).
This problem calls for a z-score or standard score, which will take into account the historical average, as other people have mentioned, but also the standard deviation of this historical data, making it more robust than just using the average.
In your case a z-score is calculated by the following formula, where the trend would be a rate such as views / day.
z-score = ([current trend] - [average historic trends]) / [standard deviation of historic trends]
When a z-score is used, the higher or lower the z-score the more abnormal the trend, so for example if the z-score is highly positive then the trend is abnormally rising, while if it is highly negative it is abnormally falling. So once you calculate the z-score for all the candidate trends, the highest 10 z-scores will correspond to the most abnormally increasing trends.
Please see Wikipedia for more information about z-scores.
Code
from math import sqrt

def zscore(obs, pop):
    # Size of population.
    number = float(len(pop))
    # Average population value.
    avg = sum(pop) / number
    # Standard deviation of population.
    std = sqrt(sum(((c - avg) ** 2) for c in pop) / number)
    # Zscore Calculation.
    return (obs - avg) / std
Sample Output
>>> zscore(12, [2, 4, 4, 4, 5, 5, 7, 9])
3.5
>>> zscore(20, [21, 22, 19, 18, 17, 22, 20, 20])
0.0739221270955
>>> zscore(20, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1])
1.00303599234
>>> zscore(2, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1])
-0.922793112954
>>> zscore(9, [1, 2, 0, 3, 1, 3, 1, 2, 9, 8, 7, 10, 9, 5, 2, 4, 1, 1, 0])
1.65291949506
Notes
You can use this method with a sliding window (e.g. the last 30 days) if you wish not to take too much history into account, which will make short-term trends more pronounced and can cut down on the processing time.
You could also use a z-score for values such as change in views from one day to next day to locate the abnormal values for increasing/decreasing views per day. This is like using the slope or derivative of the views per day graph.
If you keep track of the current size of the population, the current total of the population, and the current total of x^2 of the population, you don't need to recalculate these values, only update them and hence you only need to keep these values for the history, not each data value. The following code demonstrates this.
from math import sqrt

class zscore:
    def __init__(self, pop = []):
        self.number = float(len(pop))
        self.total = sum(pop)
        self.sqrTotal = sum(x ** 2 for x in pop)
    def update(self, value):
        self.number += 1.0
        self.total += value
        self.sqrTotal += value ** 2
    def avg(self):
        return self.total / self.number
    def std(self):
        return sqrt((self.sqrTotal / self.number) - self.avg() ** 2)
    def score(self, obs):
        return (obs - self.avg()) / self.std()
Using this method your workflow would be as follows. For each topic, tag, or page, create floating point fields in your database for the total number of days, the sum of views, and the sum of views squared. If you have historic data, initialize these fields using that data, otherwise initialize them to zero. At the end of each day, calculate the z-score using the day's number of views against the historic data stored in the three database fields. The topics, tags, or pages with the highest X z-scores are your X "hottest trends" of the day. Finally, update each of the 3 fields with the day's value and repeat the process the next day.
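The daily workflow can be sketched as follows, with a plain dictionary standing in for the three database fields. The function names, the `history` layout, and the choice to score a topic with no history (or a perfectly flat one) as 0 are assumptions of this example, not part of the answer.

```python
from math import sqrt

def daily_zscore(days, total, sq_total, todays_views):
    """z-score of today's views against the three stored history fields."""
    if days == 0:
        return 0.0  # assumption: no history means no signal
    avg = total / days
    variance = sq_total / days - avg ** 2
    if variance <= 0:
        return 0.0  # assumption: flat history is scored as neutral
    return (todays_views - avg) / sqrt(variance)

def end_of_day(history, todays_views, top_x=10):
    """Rank topics by z-score, then fold today's counts into the history.

    `history` maps topic -> [days, total, sq_total]. `todays_views` should
    list every tracked topic, including those with zero views today.
    """
    scores = {}
    for topic, views in todays_views.items():
        days, total, sq_total = history.get(topic, (0, 0, 0))
        scores[topic] = daily_zscore(days, total, sq_total, views)
    for topic, views in todays_views.items():
        fields = history.setdefault(topic, [0, 0, 0])
        fields[0] += 1
        fields[1] += views
        fields[2] += views ** 2
    return sorted(scores, key=scores.get, reverse=True)[:top_x]

# History equivalent to the sample population [2, 4, 4, 4, 5, 5, 7, 9].
history = {"topic-a": [8, 40, 232], "topic-b": [8, 40, 232]}
print(end_of_day(history, {"topic-a": 12, "topic-b": 5}, top_x=1))
```

Scoring before updating matters: today's value must be compared against the history that excludes it.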
New Addition
Normal z-scores as discussed above do not take into account the order of the data and hence the z-score for an observation of '1' or '9' would have the same magnitude against the sequence [1, 1, 1, 1, 9, 9, 9, 9]. Obviously for trend finding, the most current data should have more weight than older data and hence we want the '1' observation to have a larger magnitude score than the '9' observation. In order to achieve this I propose a floating average z-score. It should be clear that this method is NOT guaranteed to be statistically sound but should be useful for trend finding or similar. The main difference between the standard z-score and the floating average z-score is the use of a floating average to calculate the average population value and the average population value squared. See code for details:
Code
from math import sqrt

class fazscore:
    def __init__(self, decay, pop = []):
        self.sqrAvg = self.avg = 0
        # The rate at which the historic data's effect will diminish.
        self.decay = decay
        for x in pop: self.update(x)
    def update(self, value):
        # Set initial averages to the first value in the sequence.
        if self.avg == 0 and self.sqrAvg == 0:
            self.avg = float(value)
            self.sqrAvg = float((value ** 2))
        # Calculate the average of the rest of the values using a
        # floating average.
        else:
            self.avg = self.avg * self.decay + value * (1 - self.decay)
            self.sqrAvg = self.sqrAvg * self.decay + (value ** 2) * (1 - self.decay)
        return self
    def std(self):
        # Somewhat ad-hoc standard deviation calculation.
        return sqrt(self.sqrAvg - self.avg ** 2)
    def score(self, obs):
        if self.std() == 0: return (obs - self.avg) * float("infinity")
        else: return (obs - self.avg) / self.std()
Sample IO
>>> fazscore(0.8, [1, 1, 1, 1, 1, 1, 9, 9, 9, 9, 9, 9]).score(1)
-1.67770595327
>>> fazscore(0.8, [1, 1, 1, 1, 1, 1, 9, 9, 9, 9, 9, 9]).score(9)
0.596052006642
>>> fazscore(0.9, [2, 4, 4, 4, 5, 5, 7, 9]).score(12)
3.46442230724
>>> fazscore(0.9, [2, 4, 4, 4, 5, 5, 7, 9]).score(22)
7.7773245459
>>> fazscore(0.9, [21, 22, 19, 18, 17, 22, 20, 20]).score(20)
-0.24633160155
>>> fazscore(0.9, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1]).score(20)
1.1069362749
>>> fazscore(0.9, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1]).score(2)
-0.786764452966
>>> fazscore(0.9, [1, 2, 0, 3, 1, 3, 1, 2, 9, 8, 7, 10, 9, 5, 2, 4, 1, 1, 0]).score(9)
1.82262469243
>>> fazscore(0.8, [40] * 200).score(1)
-inf
Update
As David Kemp correctly pointed out, if the history is a series of constant values and a z-score is requested for an observed value that differs from them, the result should probably be non-zero. In fact the value returned should be infinity. So I changed this line,
if self.std() == 0: return 0
to:
if self.std() == 0: return (obs - self.avg) * float("infinity")
This change is reflected in the fazscore solution code. If one does not want to deal with infinite values an acceptable solution could be to instead change the line to:
if self.std() == 0: return obs - self.avg
You need an algorithm that measures the velocity of a topic - or in other words, if you graph it you want to show those that are going up at an incredible rate.
This is the first derivative of the trend line, and it is not difficult to incorporate as a weighted factor of your overall calculation.
Normalize
One thing you'll need to do is normalize all your data. For each topic you are following, keep a very low pass filter that defines that topic's baseline. Now every data point that comes in about that topic should be normalized - subtract its baseline and you'll get ALL of your topics near 0, with spikes above and below the line. You may instead want to divide the signal by its baseline magnitude, which will bring the signal to around 1.0 - this not only brings all signals in line with each other (normalizes the baseline), but also normalizes the spikes. A Britney spike is going to be magnitudes larger than someone else's spike, but that doesn't mean you should pay attention to it - the spike may be very small relative to her baseline.
Derive
Once you've normalized everything, figure out the slope of each topic. Take two consecutive points, and measure the difference. A positive difference is trending up, a negative difference is trending down. Then you can compare the normalized differences, and find out what topics are shooting upward in popularity compared to other topics - with each topic scaled appropriately to its own 'normal', which may be orders of magnitude different from other topics.
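A minimal Python sketch of the normalize-then-derive idea follows. It assumes the "very low pass filter" is an exponential moving average seeded with the first count, and that normalization is done by division; `normalized_slopes` and `smoothing` are names invented here.

```python
def normalized_slopes(counts, smoothing=0.05):
    """Divide each count by a slow-moving baseline, then difference.

    Assumptions: the baseline is an exponential moving average with
    weight `smoothing` on new data, seeded with the first count.
    """
    if not counts:
        return []
    baseline = float(counts[0])
    previous = None
    slopes = []
    for count in counts:
        # A zero baseline would divide by zero; treat the signal as flat there.
        level = count / baseline if baseline else 1.0
        if previous is not None:
            slopes.append(level - previous)
        previous = level
        # Update the baseline after use so a spike does not hide itself.
        baseline += smoothing * (count - baseline)
    return slopes

big = [100000, 100000, 110000]   # always-hot topic, +10,000 mentions
small = [50, 50, 200]            # quiet topic, +150 mentions
print(normalized_slopes(big)[-1], normalized_slopes(small)[-1])
```

Even though the busy topic gains far more mentions in absolute terms, its normalized slope is much smaller than that of the quiet topic.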
This is really a first-pass at the problem. There are more advanced techniques which you'll need to use (mostly a combination of the above with other algorithms, weighted to suit your needs) but it should be enough to get you started.
Regarding the article
The article is about topic trending, but it's not about how to calculate what's hot and what's not; it's about how to process the huge amount of information that such an algorithm must process at places like Lycos and Google. The space and time required to give each topic a counter, and to find each topic's counter when a search on it goes through, is huge. This article is about the challenges one faces when attempting such a task. It does mention the Britney effect, but it doesn't talk about how to overcome it.
As Nixuz points out this is also referred to as a Z or Standard Score.
Chad Birch and Adam Davis are correct in that you will have to look backward to establish a baseline. Your question, as phrased, suggests that you only want to view data from the past 24 hours, and that won't quite fly.
One way to give your data some memory without having to query for a large body of historical data is to use an exponential moving average. The advantage of this is that you can update this once per period and then flush all old data, so you only need to remember a single value. So if your period is a day, you have to maintain a "daily average" attribute for each topic, which you can do by:
a_n = a_(n-1)*b + c_n*(1-b)
Where a_n is the moving average as of day n, b is some constant between 0 and 1 (the closer to 1, the longer the memory) and c_n is the number of hits on day n. The beauty is if you perform this update at the end of day n, you can flush c_n and a_(n-1).
The one caveat is that it will be initially sensitive to whatever you pick for your initial value of a.
EDIT
If it helps to visualize this approach, take n = 5, a_0 = 1, and b = .9.
Let's say the new values are 5,0,0,1,4:
a_0 = 1
c_1 = 5 : a_1 = .9*1 + .1*5 = 1.4
c_2 = 0 : a_2 = .9*1.4 + .1*0 = 1.26
c_3 = 0 : a_3 = .9*1.26 + .1*0 = 1.134
c_4 = 1 : a_4 = .9*1.134 + .1*1 = 1.1206
c_5 = 4 : a_5 = .9*1.1206 + .1*4 = 1.40854
Doesn't look very much like an average, does it? Note how the value stayed close to 1, even though our first input was 5. What's going on? If you expand out the math, you get:
a_n = (1-b)*c_n + (1-b)*b*c_(n-1) + (1-b)*b^2*c_(n-2) + ... + (leftover weight)*a_0
What do I mean by leftover weight? Well, in any average, all weights must add to 1. If n were infinity and the ... could go on forever, then all weights would sum to 1. But if n is relatively small, you get a good amount of weight left on the original input.
If you study the above formula, you should realize a few things about this usage:
All data contributes something to the average forever. Practically speaking, there is a point where the contribution is really, really small.
Recent values contribute more than older values.
The higher b is, the less important new values are and the longer old values matter. However, the higher b is, the more data you need to water down the initial value of a.
I think the first two characteristics are exactly what you are looking for. To give you an idea of how simple this can be to implement, here is a Python implementation (minus all the database interaction):
>>> class EMA(object):
...     def __init__(self, base, decay):
...         self.val = base
...         self.decay = decay
...         print(self.val)
...     def update(self, value):
...         self.val = self.val*self.decay + (1-self.decay)*value
...         print(self.val)
...
>>> a = EMA(1, .9)
1
>>> a.update(10)
1.9
>>> a.update(10)
2.71
>>> a.update(10)
3.439
>>> a.update(10)
4.0951
>>> a.update(10)
4.68559
>>> a.update(10)
5.217031
>>> a.update(10)
5.6953279
>>> a.update(10)
6.12579511
>>> a.update(10)
6.513215599
>>> a.update(10)
6.8618940391
>>> a.update(10)
7.17570463519
Typically "buzz" is figured out using some form of exponential/log decay mechanism. For an overview of how Hacker News, Reddit, and others handle this in a simple way, see this post.
This doesn't fully address the things that are always popular. What you're looking for seems to be something like Google's "Hot Trends" feature. For that, you could divide the current value by a historical value and then subtract out ones that are below some noise threshold.
I think the key word you need to notice is "abnormally". In order to determine when something is "abnormal", you have to know what is normal. That is, you're going to need historical data, which you can average to find out the normal rate of a particular query. You may want to exclude abnormal days from the averaging calculation, but again that'll require having enough data already, so that you know which days to exclude.
From there, you'll have to set a threshold (which would require experimentation, I'm sure), and if something goes outside the threshold, say 50% more searches than normal, you can consider it a "trend". Or, if you want to be able to find the "Top X Trendiest" like you mentioned, you just need to order things by how far (percentage-wise) they are away from their normal rate.
For example, let's say that your historical data has told you that Britney Spears usually gets 100,000 searches, and Paris Hilton usually gets 50,000. If you have a day where they both get 10,000 more searches than normal, you should be considering Paris "hotter" than Britney, because her searches increased by 20% over normal, while Britney's increased by only 10%.
God, I can't believe I just wrote a paragraph comparing "hotness" of Britney Spears and Paris Hilton. What have you done to me?
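The percentage comparison above is easy to express in code. This Python sketch assumes you already have a "normal" count per topic; the function name and the `threshold` parameter are made up for the example.

```python
def rank_by_relative_increase(normal, today, threshold=0.5):
    """Return (topic, relative increase) pairs above `threshold`, hottest first."""
    changes = {
        topic: (today.get(topic, 0) - usual) / usual
        for topic, usual in normal.items()
        if usual > 0
    }
    hot = [(t, c) for t, c in changes.items() if c >= threshold]
    return sorted(hot, key=lambda pair: pair[1], reverse=True)

normal = {"Britney Spears": 100000, "Paris Hilton": 50000}
today = {"Britney Spears": 110000, "Paris Hilton": 60000}
print(rank_by_relative_increase(normal, today, threshold=0.0))
```

With the default 50% threshold neither topic would count as a trend; the threshold is set to 0 here only to show the ordering.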
I was wondering if it is at all possible to use the regular physics acceleration formula in such a case?
(v2 - v1)/t or dv/dt
We can consider v1 to be initial likes/votes/count-of-comments per hour and v2 to be current "velocity" per hour in last 24 hours?
This is more like a question than an answer, but seems it may just work. Any content with highest acceleration will be the trending topic...
I am sure this may not solve Britney Spears problem :-)
Probably a simple gradient of topic frequency would work -- large positive gradient = growing quickly in popularity.
The easiest way would be to bin the number of searches each day, so you have something like
searches = [ 10, 7, 14, 8, 9, 12, 55, 104, 100 ]
and then find out how much it changed from day to day:
hot_factor = [ b-a for a, b in zip(searches[:-1], searches[1:]) ]
# hot_factor is [ -3, 7, -6, 1, 3, 43, 49, -4 ]
and just apply some sort of threshold so that days where the increase was > 50 are considered 'hot'. You could make this far more complicated if you'd like, too. Rather than the absolute difference you can take the relative difference, so that going from 100 to 150 is considered hot, but 1000 to 1050 isn't. Or use a more complicated gradient that takes into account trends over more than just one day to the next.
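The relative-difference variant mentioned above might look like this in Python. The helper name `relative_changes` and the 50% cut-off are illustrative choices.

```python
def relative_changes(searches):
    """Day-to-day change as a fraction of the previous day's count."""
    changes = []
    for a, b in zip(searches[:-1], searches[1:]):
        if a:
            changes.append((b - a) / a)
        else:
            # Growth from zero has no finite ratio; flag it only if b > 0.
            changes.append(float("inf") if b else 0.0)
    return changes

searches = [10, 7, 14, 8, 9, 12, 55, 104, 100]
changes = relative_changes(searches)
# Index of each day whose count grew by more than 50% over the day before.
hot_days = [day for day, change in enumerate(changes, start=1) if change > 0.5]
print(hot_days)
```

Note that with small counts a jump from 7 to 14 already doubles the value, so in practice you would probably combine the relative test with a minimum absolute volume.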
I worked on a project where my aim was to find trending topics from a live Twitter stream and also to do sentiment analysis on them (finding whether a trending topic is talked about positively or negatively). I used Storm for handling the Twitter stream.
I've published my report as a blog: http://sayrohan.blogspot.com/2013/06/finding-trending-topics-and-trending.html
I've used Total Count and Z-Score for the ranking.
The approach that I've used is a bit generic, and in the discussion section I've mentioned how the system can be extended to non-Twitter applications.
Hope the information helps.
You could use log-likelihood-ratios to compare the current date with the last month or year. This is statistically sound (given that your events are not normally distributed, which is to be assumed from your question).
Just sort all your terms by logLR and pick the top ten.
public static void main(String... args) {
    TermBag today = ...
    TermBag lastYear = ...
    for (String each: today.allTerms()) {
        System.out.println(logLikelihoodRatio(today, lastYear, each) + "\t" + each);
    }
}

public static double logLikelihoodRatio(TermBag t1, TermBag t2, String term) {
    double k1 = t1.occurrences(term);
    double k2 = t2.occurrences(term);
    double n1 = t1.size();
    double n2 = t2.size();
    double p1 = k1 / n1;
    double p2 = k2 / n2;
    double p = (k1 + k2) / (n1 + n2);
    double logLR = 2*(logL(p1,k1,n1) + logL(p2,k2,n2) - logL(p,k1,n1) - logL(p,k2,n2));
    if (p1 < p2) logLR *= -1;
    return logLR;
}

private static double logL(double p, double k, double n) {
    return (k == 0 ? 0 : k * Math.log(p)) + ((n - k) == 0 ? 0 : (n - k) * Math.log(1 - p));
}
PS, a TermBag is an unordered collection of words. For each document you create one bag of terms. Just count the occurrences of words. Then the method occurrences returns the number of occurrences of a given word, and the method size returns the total number of words. It is best to normalize the words somehow, typically toLowerCase is good enough. Of course, in the above examples you would create one document with all queries of today, and one with all queries of the last year.
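Since TermBag is not a standard class, here is a hedged Python rendering of the same computation that uses `collections.Counter` as the bag of terms. The sample query strings are invented, and the function names simply mirror the Java above.

```python
from collections import Counter
from math import log

def log_l(p, k, n):
    """Binomial log-likelihood of k hits in n trials at rate p (0*log(0) := 0)."""
    hits = k * log(p) if k else 0.0
    misses = (n - k) * log(1 - p) if n - k else 0.0
    return hits + misses

def log_likelihood_ratio(bag1, bag2, term):
    """Signed log-likelihood ratio of `term` in bag1 versus bag2."""
    k1, k2 = bag1[term], bag2[term]
    n1, n2 = sum(bag1.values()), sum(bag2.values())
    p1, p2, p = k1 / n1, k2 / n2, (k1 + k2) / (n1 + n2)
    llr = 2 * (log_l(p1, k1, n1) + log_l(p2, k2, n2)
               - log_l(p, k1, n1) - log_l(p, k2, n2))
    return -llr if p1 < p2 else llr

# Hypothetical query logs, lower-cased and split into words.
today = Counter("storm storm storm weather news".split())
last_year = Counter("weather weather news news sports storm".split())
ranked = sorted(today, key=lambda t: log_likelihood_ratio(today, last_year, t),
                reverse=True)
print(ranked[:10])
```

Terms that are relatively more frequent today get a positive score and rise to the top; terms that were relatively more frequent last year get a negative one.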
If you simply look at tweets or status messages to get your topics, you're going to encounter a lot of noise, even if you remove all stop words. One way to get a better subset of topic candidates is to focus only on tweets/messages that share a URL, and get the keywords from the title of those web pages. And make sure you apply POS tagging to get nouns + noun phrases as well.
Titles of web pages usually are more descriptive and contain words that describe what the page is about. In addition, sharing a web page usually is correlated with sharing news that is breaking (i.e. if a celebrity like Michael Jackson died, you're going to get a lot of people sharing an article about his death).
I've run experiments where I only take popular keywords from titles, AND then get the total counts of those keywords across all status messages, and they definitely remove a lot of noise. If you do it this way, you don't need a complex algorithm; just do a simple ordering of the keyword frequencies, and you're halfway there.
The idea is to keep track of such things and notice when they jump significantly as compared to their own baseline.
So, for queries that occur more often than a certain threshold, track each one, and when its count reaches some multiple (say, almost double) of its historical value, it is a new hot trend.

Resources