How can I measure trends in certain words, like Twitter? - algorithm

I have a corpus of newspaper articles, organized by day. Each word in the corpus has a frequency count for each day it appears. I have been toying with finding an algorithm that captures the break-away words, similar to the way Twitter measures Trends in people's tweets.
For instance, say the word 'recession' appears with the following frequency in the same group of newspapers:
Day 1 | recession | 456
Day 2 | recession | 2134
Day 3 | recession | 3678
While 'europe' appears with:
Day 1 | europe | 67895
Day 2 | europe | 71999
Day 3 | europe | 73321
I was thinking of taking the % growth per day and multiplying it by the log of the sum of frequencies. Then I would take the average to score and compare various words.
In this case:
recession = (3.68*8.74+0.72*8.74)/2 = 19.23
europe = (0.06*12.27+0.02*12.27)/2 = 0.49
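For reference, the same calculation as a small Python sketch (natural log assumed, as in the figures above; the counts are just the example numbers):

import math

def trend_score(counts):
    # Average of (daily % growth * ln(total frequency)) over the window.
    total = sum(counts)
    growths = [(counts[i] - counts[i - 1]) / counts[i - 1] for i in range(1, len(counts))]
    return sum(g * math.log(total) for g in growths) / len(growths)

print(trend_score([456, 2134, 3678]))      # recession: ~19.2
print(trend_score([67895, 71999, 73321]))  # europe:    ~0.48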
Is there a better way to capture the explosive growth? I'm trying to mine the daily corpus to find terms whose mentions accelerate over a specific time period. PLEASE let me know if there is a better algorithm. I want to be able to find words with high, non-constant acceleration. Maybe taking the second derivative would be more effective. Or maybe I'm making this way too complex and have watched too much physics programming on the Discovery Channel. Let me know, with a math example if possible. Thanks!

First thing to notice is that this can be approximated by a local problem. That is to say, a "trending" word really depends only upon recent data. So immediately we can truncate our data to the most recent N days where N is some experimentally determined optimal value. This significantly cuts down on the amount of data we have to look at.
In fact, the NPR article suggests this.
Then you need to somehow look at growth. And this is precisely what the derivative captures. First thing to do is normalize the data. Divide all your data points by the value of the first data point. This makes it so that the large growth of an infrequent word isn't drowned out by the relatively small growth of a popular word.
For the first derivative, do something like this:
d[i] = (data[i] - data[i+k])/k
for some experimentally determined value of k (which, in this case, is a number of days). Similarly, the second derivative can be expressed as:
d2[i] = (data[i] - 2*data[i+k] + data[i+2k])/(k*k)
Higher derivatives can also be expressed like this. Then you need to assign some kind of weighting system for these derivatives. This is a purely experimental procedure which really depends on what you want to consider "trending." For example, you might want to give acceleration of growth half as much weight as the velocity. Another thing to note is that you should try your best to remove noise from your data because derivatives are very sensitive to noise. You do this by carefully choosing your value for k as well as discarding words with very low frequencies altogether.
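A rough sketch of the recipe so far (window truncation, normalization, finite differences, weighting). The window k, the weights, and the frequency cutoff are all values to determine experimentally, and "divide by the first data point" is read here as dividing by the earliest day in the window:

def trend_weight(data, k=1, w_velocity=1.0, w_acceleration=0.5, min_count=50):
    # data[0] is the most recent day's count, data[1] the day before, etc.
    # (matching the indexing of the formulas above). Returns a weighted mix
    # of first and second finite differences, or None for words too rare to bother with.
    if len(data) < 2 * k + 1 or data[0] < min_count or data[-1] == 0:
        return None
    base = data[-1]                                        # earliest day in the window
    norm = [x / base for x in data]                        # normalize against the starting level
    d1 = (norm[0] - norm[k]) / k                           # velocity of growth
    d2 = (norm[0] - 2 * norm[k] + norm[2 * k]) / (k * k)   # acceleration of growth
    return w_velocity * d1 + w_acceleration * d2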
I also notice that you multiply by the log sum of the frequencies. I presume this is to give the growth of popular words more weight (because more popular words are less likely to trend in the first place). The standard way of measuring how popular a word is is by looking at its inverse document frequency (IDF).
I would divide by the IDF of a word to give the growth of more popular words more weight.
IDF[word] = log(D / df[word])
where D is the total number of documents (e.g. for Twitter it would be the total number of tweets) and df[word] is the number of documents containing word (e.g. the number of tweets containing a word).
A high IDF corresponds to an unpopular word whereas a low IDF corresponds to a popular word.
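A minimal sketch of that weighting; the corpus sizes below are made-up numbers, purely for illustration:

import math

def idf(total_docs, docs_containing_word):
    # IDF[word] = log(D / df[word])
    return math.log(total_docs / docs_containing_word)

# Dividing a word's growth score by its IDF boosts popular (low-IDF) words,
# as suggested above. D = 100000 articles and df = 15000 are illustrative only.
growth_score = 3.68
weighted_score = growth_score / idf(100_000, 15_000)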

The problem with your approach (measuring daily growth in percentage) is that it disregards the usual "background level" of the word, as your example shows: 'europe' gains more mentions per day in absolute terms than 'recession', yet it gets a much lower score.
If the background level of words has a well-behaved distribution (Gaussian, or something else that doesn't wander too far from the mean) then I think a modification of CanSpice's suggestion would be a good idea. Work out the mean and standard deviation for each word, using days C-N+1-T to C-T, where C is the current date, N is the number of days to take into account, and T is the number of days that define a trend.
Say for instance N=90 and T=3, so we use about three months for the background, and say a trend is defined by three peaks in a row. In that case, for example, you can rank the words according to their chi-squared p-value, calculated like so:
(mu, sigma) = fitGaussian(word='europe', startday=C-N+1-3, endday=C-3)
X1 = count(word='europe', day=C-2)
X2 = count(word='europe', day=C-1)
X3 = count(word='europe', day=C)
S = ((X1-mu)/sigma)^2 + ((X2-mu)/sigma)^2 + ((X3-mu)/sigma)^2
p = pval.chisq(S, df=3)
Essentially then, you can get the words which over the last three days are the most extreme compared to their background level.
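A rough sketch of this ranking in Python; the counts array, window sizes, and the use of a sample standard deviation are assumptions layered on the recipe above. scipy's chi2.sf gives the upper-tail p-value, so a small p means the last T days are extreme relative to the background:

import numpy as np
from scipy.stats import chi2

def trend_pvalue(counts, N=90, T=3):
    # counts: 1-D sequence of daily counts for one word, oldest first (needs >= N + T days).
    background = np.asarray(counts[-(N + T):-T], dtype=float)
    recent = np.asarray(counts[-T:], dtype=float)
    mu, sigma = background.mean(), background.std(ddof=1)
    if sigma == 0:
        return 1.0                      # flat background, nothing to detect
    S = np.sum(((recent - mu) / sigma) ** 2)
    return chi2.sf(S, df=T)             # rank words by ascending p-value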

I would first try a simple solution. A simple weighted difference between adjacent days should probably work, maybe taking the log first. You might have to experiment with the weights. For example, (-2,-1,1,2) would give you points where the data is exploding.
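As a sketch, the weighted difference can be applied as a sliding filter over the daily counts; the (-2, -1, 1, 2) weights and the log transform are just the suggestions above and will need tuning:

import numpy as np

def explosion_score(daily_counts, weights=(-2, -1, 1, 2), use_log=True):
    # Apply the weighted-difference filter over a sliding window of days.
    x = np.asarray(daily_counts, dtype=float)
    if use_log:
        x = np.log1p(x)                       # take the log first, as suggested
    w = np.asarray(weights, dtype=float)
    # score[i] covers days i .. i+len(w)-1; large values mean exploding growth
    return np.array([np.dot(w, x[i:i + len(w)]) for i in range(len(x) - len(w) + 1)])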
If this is not enough, you can try slope filtering ( http://www.claysturner.com/dsp/fir_regression.pdf ). Since the algorithm is based on linear regression, it should be possible to modify it for other types of regression (for example quadratic).
Filtering techniques like these also have the advantage that they can be made to run very fast, and you should be able to find libraries that provide fast filtering.

Related

How to shuffle eight items to approximate maximum entropy?

I need to analyze 8 chemical samples repeatedly over 5 days (each sample is analyzed exactly once every day). I'd like to generate pseudo-random sample sequences for each day which achieve the following:
avoid bias in the daily sequence position (e.g., avoid some samples being processed mostly in the morning)
avoid repeating sample pairs over different days (e.g. 12345678 on day 1 and 87654321 on day 2)
generally randomize the distance between two given samples from one day to the other
I may have poorly phrased the conditions above, but the general idea is to minimize systematic effects like sample cross-contamination and/or analytical drift over each day. I could just shuffle each sequence randomly, but because the number of sequences generated is small (N=5 versus 40,320 possible combinations), I'm unlikely to approach something like maximum entropy.
Any ideas? I suspect this is a common problem in analytical science which has been solved, but I don't know where to look.
Just thinking out loud:
The base metric you might use is the Levenshtein distance, or some slight modification of it, maybe:
myDist(w1, w2) = min(levD(w1, w2), levD(w1.reversed(), w2))
Since you want to avoid small distances between any pair of days,
the overall metric can be the sum of myDist over all pairs of days:
Similarity = myDist(day1, day2)
+ myDist(day1, day3)
+ myDist(day1, day4)
+ myDist(day1, day5)
+ myDist(day2, day3)
+ myDist(day2, day4)
+ myDist(day2, day5)
+ myDist(day3, day4)
+ myDist(day3, day5)
+ myDist(day4, day5)
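A small sketch of this metric in Python, with day orders written as strings like '12345678' (the plain Levenshtein implementation below is just for illustration; any library version works):

from itertools import combinations

def lev(a, b):
    # Plain Levenshtein distance (two-row dynamic programming).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def my_dist(w1, w2):
    return min(lev(w1, w2), lev(w1[::-1], w2))

def total_distance(days):
    # The sum called "Similarity" above; larger means the orders are more spread out.
    return sum(my_dist(a, b) for a, b in combinations(days, 2))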
What is still missing is a heuristic for how to create the sample orders.
Your problem reminds me of a shortest-path problem, but with the added difficulty that each selected node influences the weights of the whole graph, so it is much harder.
Maybe a table of the myDist distances between every pair of the 8! permutations could be precomputed (the metric is symmetric, so only a triangular matrix without the diagonal is needed, requiring roughly 1 GB of memory). This may speed things up considerably.
Maybe take the maximum from this matrix and treat every pair with a value below some threshold as equally worthless, to reduce the search space.
Build a starting set.
Use 12345678 as the fixed day 1, since the first day does not matter. Never change this.
Then repeat until all N days are chosen:
add the order that is most distant from the current one.
If there are several equally good candidates, use the one that is also most distant from the previous days.
Now iteratively improve the solution, maybe with a ruin-and-recreate approach. Always keep a backup of the best solution found so far; then you can run as many iterations as you want (and have time for):
choose the (one or two) day(s) with the smallest distance sums to the other days,
maybe brute-force an optimal (in terms of overall distance) combination for these days,
and repeat.
If the optimization gets stuck (only the same two days keep being chosen, or the distance stops improving),
reset one or two days to random orders.
Alternatively, completely random starting sets (apart from day 1) could be used.
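A hedged sketch of the greedy construction described above, reusing my_dist from the previous snippet; brute-forcing over all 8! permutations at each step is slow but fine for a one-off experimental design:

from itertools import permutations

def greedy_days(n_days=5, items="12345678"):
    all_orders = ["".join(p) for p in permutations(items)]
    days = [items]                        # day 1 is fixed as '12345678' and never changed
    chosen = {items}
    for _ in range(n_days - 1):
        # primary criterion: distance to the most recently chosen day;
        # tie-break: total distance to all previously chosen days
        best = max((o for o in all_orders if o not in chosen),
                   key=lambda o: (my_dist(o, days[-1]),
                                  sum(my_dist(o, d) for d in days)))
        days.append(best)
        chosen.add(best)
    return days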

Sales Ranking Algorithm

I'm trying to work out the best way of ranking some products, based on both their overall sales to date and their sales in the last x days (to show trends and hot products).
What I'd like to do is use both, so that the biggest sellers rank highly, but if they have sold out, they are moved down (no recent sales) or if they have particularly high recent sales they get jumped up.
With it being sales, the figures could be 10x different by product, so I'm assuming we need a logarithmic scale for this. What's the best way to combine the two?
What I'd like to do is use both, so that the biggest sellers rank highly, but if they have sold out, they are moved down (no recent sales) or if they have particularly high recent sales they get jumped up.
One particularly simple way of doing this would be to maintain, for each product, two exponential moving averages, one with a short factor, and one with a long factor. Note that each such average is simply updated by multiplying the average up to the day by some factor, and adding the number for that day multiplied by a complement factor.
You'll need to set the two factors based on your problem, but see here for an explanation of the relationship between this factor and the effective time span being averaged.
The overall score for the product would be some total score taking into account both these averages.
With it being sales, the figures could be 10x different by product, so I'm assuming we need a logarithmic scale for this. What's the best way to combine the two?
There is no best way - you'll need to try different options, and tune them until you're happy with the results.
If the long and short averages are l and s, then a general way of averaging (not the only one!) is α f(l) + (1 - α) f(s), where α is some constant in [0, 1], and f is a damping function. You've mentioned logarithms as a damping function, but you might find that, say, square root works better for your case (it also has less problems with small or zero arguments).
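A possible sketch of the two-average idea combined with a damping function; the factors, alpha, and the square-root damping below are placeholders to tune, not recommended values:

import math

class ProductScore:
    def __init__(self, short_factor=0.3, long_factor=0.05, alpha=0.5):
        self.short_factor, self.long_factor, self.alpha = short_factor, long_factor, alpha
        self.short_avg = 0.0   # reacts quickly to recent sales
        self.long_avg = 0.0    # reflects the long-run sales level

    def update(self, todays_sales):
        # Exponential moving averages: old average times a factor, plus today's sales times the complement.
        self.short_avg = (1 - self.short_factor) * self.short_avg + self.short_factor * todays_sales
        self.long_avg = (1 - self.long_factor) * self.long_avg + self.long_factor * todays_sales

    def score(self):
        damp = math.sqrt          # or math.log1p; try both
        return self.alpha * damp(self.long_avg) + (1 - self.alpha) * damp(self.short_avg)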

What is a bad, decent, good, and excellent F1-measure range?

I understand the F1-measure is a harmonic mean of precision and recall. But what values define how good or bad an F1-measure is? I can't seem to find any references (Google or academic) answering my question.
Consider sklearn.dummy.DummyClassifier(strategy='uniform'), which is a classifier that makes random guesses (i.e. a deliberately bad classifier). We can view DummyClassifier as a benchmark to beat; now let's look at its f1-score.
In a binary classification problem with a balanced dataset (6,198 total samples: 3,099 labelled 0 and 3,099 labelled 1), the f1-score is 0.5 for both classes, and the weighted average is 0.5.
Second example: using DummyClassifier(strategy='constant'), i.e. guessing the same label every time (label 1 in this case), the average of the f1-scores is 0.33, while the f1 for label 0 is 0.00.
I consider these to be bad f1-scores, given the balanced dataset.
PS. summary generated using sklearn.metrics.classification_report
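Something along these lines reproduces that kind of summary (exact numbers for the 'uniform' strategy will vary with the random guesses):

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report

# Balanced toy dataset: 3099 samples of each class, single dummy feature.
y = np.array([0] * 3099 + [1] * 3099)
X = np.zeros((len(y), 1))

uniform = DummyClassifier(strategy='uniform', random_state=0).fit(X, y)
print(classification_report(y, uniform.predict(X)))     # f1 ~0.5 for both classes

constant = DummyClassifier(strategy='constant', constant=1).fit(X, y)
print(classification_report(y, constant.predict(X)))    # f1 = 0.0 for class 0, macro average ~0.33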
You did not find any reference for an f1-measure range because there is no such range. The F1 measure is a combined metric of precision and recall.
Let's say you have two algorithms: one has higher precision, the other higher recall. From this observation alone you cannot tell which algorithm is better, unless your goal is specifically to maximize precision.
So, given this ambiguity about how to choose the superior algorithm of the two (one with higher recall, the other with higher precision), we use the f1-measure to decide between them.
The f1-measure is a relative term; that's why there is no absolute range that defines how good your algorithm is.

Computing similarity between two lists

EDIT:
as everyone is getting confused, I want to simplify my question. I have two ordered lists. Now, I just want to compute how similar one list is to the other.
Eg,
1,7,4,5,8,9
1,7,5,4,9,6
What is a good measure of similarity between these two lists, such that order matters? For example, similarity should be penalized because 4 and 5 are swapped in the two lists.
I have 2 systems. One state of the art system and one system that I implemented. Given a query, both systems return a ranked list of documents. Now, I want to compare the similarity between my system and the "state of the art system" in order to measure the correctness of my system. Please note that the order of documents is important as we are talking about a ranked system.
Does anyone know of any measures that can help me find the similarity between these two lists.
The DCG [Discounted Cumulative Gain] and nDCG [normalized DCG] are usually a good measure for ranked lists.
It gives the full gain for a relevant document if it is ranked first, and the gain decreases as the rank decreases.
Using DCG/nDCG to evaluate your system against the state-of-the-art baseline:
Note: If you set all results returned by the "state of the art system" as relevant, then your system is identical to the state of the art if they received the same ranks under DCG/nDCG.
Thus, a possible evaluation could be: DCG(your_system)/DCG(state_of_the_art_system)
To further enhance it, you can give a relevance grade [relevance will not be binary], determined according to how each document was ranked in the state of the art. For example, rel_i = 1/log2(1+i) for the document at rank i in the state of the art system.
If the value received by this evaluation function is close to 1, your system is very similar to the baseline.
Example:
mySystem = [1,2,5,4,6,7]
stateOfTheArt = [1,2,4,5,6,9]
First you give a score to each document, according to the state of the art system [using the formula above]:
doc1 = 1.0
doc2 = 0.6309297535714574
doc3 = 0.0
doc4 = 0.5
doc5 = 0.43067655807339306
doc6 = 0.38685280723454163
doc7 = 0
doc8 = 0
doc9 = 0.3562071871080222
Now you calculate DCG(stateOfTheArt) using the relevance grades above [note that relevance is not binary here], and get DCG(stateOfTheArt) = 2.1100933062283396.
Next, calculate it for your system using the same relevance weights and get: DCG(mySystem) = 1.9784040064803783.
Thus, the evaluation is DCG(mySystem)/DCG(stateOfTheArt) = 1.9784040064803783 / 2.1100933062283396 = 0.9375907693942939
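A small sketch that reproduces these numbers, assigning rel = 1/log2(rank+1) from the baseline ranking and discounting by position:

import math

def graded_relevance(baseline):
    # rel = 1/log2(rank+1), derived from the baseline ('state of the art') ranking.
    return {doc: 1.0 / math.log2(rank + 1) for rank, doc in enumerate(baseline, start=1)}

def dcg(ranking, rel):
    # Documents missing from the baseline get relevance 0.
    return sum(rel.get(doc, 0.0) / math.log2(pos + 1)
               for pos, doc in enumerate(ranking, start=1))

state_of_the_art = [1, 2, 4, 5, 6, 9]
my_system = [1, 2, 5, 4, 6, 7]

rel = graded_relevance(state_of_the_art)
print(dcg(my_system, rel) / dcg(state_of_the_art, rel))   # ~0.9376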
Kendall's tau is the metric you want. It measures the number of pairwise inversions in the list. Spearman's footrule does the same job, but measures displacement rather than inversions. They are both designed for the task at hand: measuring the difference between two rank-ordered lists.
Is the list of documents exhaustive? That is, is every document rank ordered by system 1 also rank ordered by system 2? If so a Spearman's rho may serve your purposes. When they don't share the same documents, the big question is how to interpret that result. I don't think there is a measurement that answers that question, although there may be some that implement an implicit answer to it.
As you said, you want to compute how similar one list is to the other. I think, simplistically, you can start by counting the number of inversions. There is an O(N log N) divide-and-conquer approach to this. It is a very simple way to measure the "similarity" between two lists. For example, to compare how similar the music tastes of two people on a music website are, you take their rankings of a set of songs and count the number of inversions; the lower the count, the more similar their tastes.
Since you are already considering the "state of the art system" to be a benchmark of correctness, counting inversions should give you a basic measure of the "similarity" of your ranking.
Of course this is just a starting approach, but you can build on it depending on how strict you want to be with the "inversion gap", etc.
D1 D2 D3 D4 D5 D6
-----------------
R1: 1, 7, 4, 5, 8, 9 [Rankings from 'state of the art' system]
R2: 1, 7, 5, 4, 9, 6 [ your Rankings]
Since the rankings are in document order, you can write your own comparator function based on R1 (the ranking of the "state of the art" system) and count the inversions relative to that comparator.
You can "penalize" similarity for each inversion found: i < j but R2[i] >' R2[j]
(where >' is your own comparator).
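For example, a minimal sketch of this idea: map each document to its rank in R1, keep only documents that appear in both lists (how to handle the others is a separate choice), and count inversions with a merge sort:

def count_inversions(seq):
    # Merge-sort based inversion count, O(N log N).
    if len(seq) <= 1:
        return seq, 0
    mid = len(seq) // 2
    left, a = count_inversions(seq[:mid])
    right, b = count_inversions(seq[mid:])
    merged, inv, i, j = [], a + b, 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
            inv += len(left) - i          # everything remaining in `left` is inverted with right[j]
    merged.extend(left[i:]); merged.extend(right[j:])
    return merged, inv

R1 = [1, 7, 4, 5, 8, 9]   # 'state of the art' ranking
R2 = [1, 7, 5, 4, 9, 6]   # your ranking
rank_in_R1 = {doc: r for r, doc in enumerate(R1)}
common = [rank_in_R1[doc] for doc in R2 if doc in rank_in_R1]
print(count_inversions(common)[1])   # 1 inversion: 4 and 5 are swapped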
I actually know four different measures for that purpose.
Three have already been mentioned:
NDCG
Kendall's Tau
Spearman's Rho
But if you have more than two ranks that have to be compared, use Kendall's W.
In addition to what has already been said, I would like to point you to the following excellent paper: W. Webber et al, A Similarity Measure for Indefinite Rankings (2010). Besides containing a good review of existing measures (such as above-mentioned Kendall Tau and Spearman's footrule), the authors propose an intuitively appealing probabilistic measure that is applicable for varying length of result lists and when not all items occur in both lists. Roughly speaking, it is parameterized by a "persistence" probability p that a user scans item k+1 after having inspected item k (rather than abandoning). Rank-Biased Overlap (RBO) is the expected overlap ratio of results at the point the user stops reading.
The implementation of RBO is slightly more involved; you can take a peek at an implementation in Apache Pig here.
Another simple measure is cosine similarity, the cosine between two vectors with dimensions corresponding to items, and inverse ranks as weights. However, it doesn't handle items gracefully that only occur in one of the lists (see the implementation in the link above).
For each item i in list 1, let h_1(i) = 1/rank_1(i). For each item i in list 2 not occurring in list 1, let h_1(i) = 0. Do the same for h_2 with respect to list 2.
Compute v12 = sum_i h_1(i) * h_2(i); v11 = sum_i h_1(i) * h_1(i); v22 = sum_i h_2(i) * h_2(i)
Return v12 / sqrt(v11 * v22)
For your example, this gives a value of 0.7252747.
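A direct sketch of those three steps (items missing from one list simply contribute a weight of zero):

import math

def rank_cosine(list1, list2):
    # Cosine similarity with inverse ranks as weights, following the steps above.
    h1 = {doc: 1.0 / rank for rank, doc in enumerate(list1, start=1)}
    h2 = {doc: 1.0 / rank for rank, doc in enumerate(list2, start=1)}
    docs = set(h1) | set(h2)              # docs in only one list get weight 0 in the other
    v12 = sum(h1.get(d, 0.0) * h2.get(d, 0.0) for d in docs)
    v11 = sum(w * w for w in h1.values())
    v22 = sum(w * w for w in h2.values())
    return v12 / math.sqrt(v11 * v22)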
Please let me give you some practical advice beyond your immediate question. Unless your 'production system' baseline is perfect (or we are dealing with a gold set), it is almost always better to compare a quality measure (such as the above-mentioned nDCG) rather than similarity; a new ranking will sometimes be better, sometimes worse than the baseline, and you want to know whether the former case happens more often than the latter. Secondly, similarity measures are not trivial to interpret on an absolute scale. For example, if you get a similarity score of, say, 0.72, does that mean it is really similar or significantly different? Similarity measures are more helpful for saying that, e.g., a new ranking method 1 is closer to production than another new ranking method 2.
I suppose you are talking about comparing two information retrieval systems, which, trust me, is not trivial. It is a complex computer science problem.
For measuring relevance or doing this kind of A/B testing, you need a couple of things:
A competitor against which to measure relevance. As you have two systems, this prerequisite is met.
Manually rated results. You can ask your colleagues to rate query/url pairs for popular queries, and for the holes (i.e. query/url pairs that are not rated) you can use a dynamic ranking function based on a "Learning to Rank" algorithm (http://en.wikipedia.org/wiki/Learning_to_rank). Don't be surprised by that, but it's true (please read the Google/Bing example below).
Google and Bing are competitors in the horizontal search market. These search engines employ manual judges around the world and invest millions in them to rate their results for queries. For each query, generally the top 3 or top 5 results are rated. Based on these ratings they may use a metric like NDCG (Normalized Discounted Cumulative Gain), which is one of the finest and most popular metrics.
According to wikipedia:
Discounted cumulative gain (DCG) is a measure of effectiveness of a Web search engine algorithm or related applications, often used in information retrieval. Using a graded relevance scale of documents in a search engine result set, DCG measures the usefulness, or gain, of a document based on its position in the result list. The gain is accumulated from the top of the result list to the bottom with the gain of each result discounted at lower ranks.
Wikipedia explains NDCG in a great manner. It is a short article, please go through that.

Translating a score into a probabilty

People visit my website, and I have an algorithm that produces a score between 0 and 1. The higher the score, the greater the probability that this person will buy something, but the score isn't a probability, and it may not have a linear relationship with the purchase probability.
I have a bunch of data about what scores I gave people in the past, and whether or not those people actually made a purchase.
Using this data about what happened with scores in the past, I want to be able to take a score and translate it into the corresponding probability based on this past data.
Any ideas?
edit: A few people are suggesting bucketing, and I should have mentioned that I had considered this approach, but I'm sure there must be a way to do it "smoothly". A while ago I asked a question about a different but possibly related problem here, I have a feeling that something similar may be applicable but I'm not sure.
edit2: Let's say I told you that of the 100 customers with a score above 0.5, 12 of them purchased, and of the 25 customers with a score below 0.5, 2 of them purchased. What can I conclude, if anything, about the estimated purchase probability of someone with a score of 0.5?
Draw a chart: plot the ratio of buyers to non-buyers on the Y axis and the score on the X axis, fit a curve, and then for a given score you can read the probability off the height of the curve.
(You don't need to physically create a chart, but the algorithm should be evident from the exercise.)
Simples.
That is what logistic regression, probit regression, and company were invented for. Nowadays most people would use logistic regression, but fitting involves iterative algorithms; there are, of course, lots of implementations, but you might not want to write one yourself. Probit regression has an approximate explicit solution described at the link that might be good enough for your purposes.
A possible way to assess whether logistic regression would work for your data would be to plot each score against the logit of the probability of purchase (log(p/(1-p))) and see whether these form a straight line.
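A minimal sketch with scikit-learn; the score and purchase arrays below are placeholders for your historical data:

import numpy as np
from sklearn.linear_model import LogisticRegression

scores = np.array([0.12, 0.35, 0.47, 0.52, 0.61, 0.78, 0.83, 0.91])   # past scores (placeholder)
bought = np.array([0,    0,    0,    1,    0,    1,    1,    1   ])   # did they purchase? (placeholder)

model = LogisticRegression().fit(scores.reshape(-1, 1), bought)
print(model.predict_proba([[0.5]])[0, 1])    # estimated purchase probability at score 0.5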
I eventually found exactly what I was looking for: an algorithm called "pair-adjacent violators". I initially found it in this paper; however, be warned that there is a flaw in their description of the implementation.
I describe the algorithm, this flaw, and the solution to it on my blog.
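Pair-adjacent violators is the algorithm behind isotonic regression, so one way to get this smooth, monotone mapping without writing it yourself is scikit-learn's IsotonicRegression; again the data below is a placeholder:

import numpy as np
from sklearn.isotonic import IsotonicRegression

scores = np.array([0.12, 0.35, 0.47, 0.52, 0.61, 0.78, 0.83, 0.91])
bought = np.array([0,    0,    0,    1,    0,    1,    1,    1   ])

# Fits a non-decreasing step function mapping score -> estimated purchase probability.
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds='clip')
calibrator.fit(scores, bought)
print(calibrator.predict([0.5]))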
Well, the straightforward way to do this would be to calculate which percentage of people in a score interval purchased something and do this for all intervals (say, every .05 points).
Have you noticed an actual correlation between a higher score and an increased likelihood of purchases in your data?
I'm not an expert in statistics and there might be a better answer though.
You could divide the scores into a number of buckets, e.g. 0.0-0.1, 0.1-0.2,... and count the number of customers who purchased and did not purchase something for each bucket.
Alternatively, you may want to plot each score against the amount spent (as a scattergram) and see if there is any obvious relationship.
You could use exponential decay to produce a weighted average.
Take your users, arrange them in order of scores (break ties randomly).
Working from left to right, start with a running average of 0. For each user, update the average as average = (1-p) * average + p * (sale ? 1 : 0). Do the same thing from right to left, except start with 1.
The smaller you make p, the smoother your curve will become. Play around with your data until you have a value of p that gives you results that you like.
Incidentally this is the key idea behind how load averages get calculated by Unix systems.
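A rough sketch of that two-pass smoothing; the answer does not say how to combine the two passes, so averaging them here is my own assumption:

def smoothed_purchase_curve(score_sale_pairs, p=0.05):
    # score_sale_pairs: list of (score, bought) with bought in {0, 1}.
    data = sorted(score_sale_pairs)                 # order by score (ties broken arbitrarily here)
    sales = [1.0 if bought else 0.0 for _, bought in data]

    forward, avg = [], 0.0                          # left-to-right pass, start at 0
    for s in sales:
        avg = (1 - p) * avg + p * s
        forward.append(avg)

    backward, avg = [], 1.0                         # right-to-left pass, start at 1
    for s in reversed(sales):
        avg = (1 - p) * avg + p * s
        backward.append(avg)
    backward.reverse()

    scores = [score for score, _ in data]
    return scores, [(f + b) / 2 for f, b in zip(forward, backward)]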
Based upon your edit2 comment, you would not have enough data to make a statement. Your overall purchase rate is 11.2%. That is not statistically different from your two purchase rates above and below 0.5. Additionally, to validate your score, you would have to ensure that the purchase percentages increase monotonically as your score increases. You could bucket, but you would need to check your results against a probability calculator to make sure they did not occur by chance.
http://stattrek.com/Tables/Binomial.aspx
