Which data clustering algorithm is appropriate to detect an unknown number of clusters in a time series of events? - algorithm

Here's my scenario. Consider a set of events that happen at various places and times - as an example, consider someone high above recording the lightning strikes in a city during a storm. For my purpose, lightning strikes are instantaneous and can only hit certain locations (such as high buildings). Also imagine each lightning strike has a unique id so one can reference the strike later. There are about 100,000 such locations in this city (as you may guess, this is an analogy, as my current employer is sensitive about the actual problem).
For phase 1, my input is the set of (strike id, strike time, strike location) tuples. The desired output is the set of clusters of more than one event that hit the same location within a short time. The number of clusters is not known in advance (so k-means is not that useful here). What counts as 'short' can be predefined for a given clustering attempt. That is, I can set it to, say, 3 minutes, then run the algorithm; later try with 4 minutes or 10 minutes. Perhaps a nice touch would be for the algorithm to determine a 'strength' of clustering and recommend that, for a given input, the most compact clustering is achieved by using a particular value for 'short', but this is not required initially.
For phase 2, I'd like to take into consideration the amplitude of the strike (i.e., a real number) and look for clusters that are both within a short time and with similar amplitudes.
I googled and checked the answers here about data clustering. The information is a bit bewildering (below is the list of links I found useful). AFAIK, k-means and related algorithms would not be useful because they require the number of clusters to be specified a priori. I'm not asking for someone to solve my problem (I like solving it), but some orientation in the large world of data clustering algorithms would be useful in order to save some time. Specifically, what clustering algorithms are appropriate when the number of clusters is unknown?
Edit: I realized the location is irrelevant, in the sense that although events happen all the time, I only need to cluster them per location. So each location has its own time-series of events that can thus be analyzed independently.
Some technical details:
- as the dataset is not that large, it can fit all in memory.
- parallel processing is a nice to have, but not essential. I only have a 4-core machine and MapReduce and Hadoop would be too much.
- the language I'm mostly familiar with is Java. I haven't yet used R and the learning curve for it would probably be too much for what time I was given. I'll have a look at it anyway in my spare time.
- for the time being, using tools to run the analysis is ok, I don't have to produce just code. I'm mentioning this because probably Weka will be suggested.
- visualization would be useful. As the dataset is large enough so it doesn't fit in memory, the visualization should at least support zooming and panning. And to clarify: I don't need to build a visualization GUI, it's just a nice capability to use for checking the results produced with a tool.
Thank you. Questions that I found useful are: How to find center of clusters of numbers? statistics problem?, Clustering Algorithm for Paper Boys, Java Clustering Library, How to cluster objects (without coordinates), Algorithm for detecting "clusters" of dots

I would suggest looking into Mean Shift Clustering. The basic idea is to perform a kernel density estimation on the data and then find the modes of the density estimate; the regions in which data points converge towards the same mode define the clusters.
The nice thing about mean shift clustering is that the number of clusters does not have to be specified ahead of time.
I have not used Weka, so I am not sure if it has mean shift clustering. However if you are using MATLAB, here is a toolbox (KDE toolbox) to do it. Hope that helps.
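To make the idea concrete, here is a minimal sketch of mean shift on the strike times of a single location, assuming a Gaussian kernel. The function name `mean_shift_1d`, the `bandwidth` parameter (which plays the role of 'short') and the rule that merges modes closer than half a bandwidth are illustrative choices of mine, not taken from any toolbox.

```python
import math

def mean_shift_1d(points, bandwidth, tol=1e-6, max_iter=500):
    """Cluster 1-D points by moving each one uphill on a Gaussian KDE.

    Returns a list of (mode, members) pairs, in order of first appearance.
    """
    def shift(x):
        # Kernel-weighted mean of all points, as seen from position x.
        weights = [math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points]
        return sum(w * p for w, p in zip(weights, points)) / sum(weights)

    modes = []
    for p in points:
        x = p
        for _ in range(max_iter):
            nxt = shift(x)
            if abs(nxt - x) < tol:
                break
            x = nxt
        modes.append(x)

    # Assumption: modes closer than bandwidth / 2 are treated as the same mode.
    clusters = []
    for p, m in zip(points, modes):
        for cluster in clusters:
            if abs(cluster[0] - m) < bandwidth / 2:
                cluster[1].append(p)
                break
        else:
            clusters.append([m, [p]])
    return [(mode, members) for mode, members in clusters]
```

Calling `mean_shift_1d(times, bandwidth=180)` on times in seconds would group strikes that pile up around the same density peak; the number of groups falls out of the data and the bandwidth rather than being passed in.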

Couldn't you just use hierarchical clustering with the difference in times of strikes as part of the distance metric?
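For phase 1, where only the time matters within each location, single-linkage hierarchical clustering cut at a distance threshold reduces to something very simple: sort the times and start a new cluster whenever the gap to the previous strike exceeds the threshold. A rough sketch of that special case follows; `cluster_strikes` and `max_gap` are names I made up for illustration.

```python
from collections import defaultdict

def cluster_strikes(strikes, max_gap):
    """Group strikes per location by splitting on time gaps larger than max_gap.

    strikes: iterable of (strike_id, time, location) tuples.
    Returns a list of (location, [strike ids]) for clusters with more than one event.
    """
    by_location = defaultdict(list)
    for strike_id, time, location in strikes:
        by_location[location].append((time, strike_id))

    clusters = []
    for location, events in by_location.items():
        events.sort()  # chronological order within the location
        current = [events[0]]
        for prev, event in zip(events, events[1:]):
            if event[0] - prev[0] <= max_gap:
                current.append(event)
            else:
                if len(current) > 1:
                    clusters.append((location, [sid for _, sid in current]))
                current = [event]
        if len(current) > 1:
            clusters.append((location, [sid for _, sid in current]))
    return clusters
```

Note that single linkage chains events together, so a cluster as a whole can last longer than `max_gap` as long as no two consecutive strikes are further apart than that. If the total duration has to be bounded, complete linkage would be the variant to look at.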

It is late, but I would still add this:
In R, the package fpc has a method pamk() which provides you with the clusters. With pamk(), you do not need to specify the number of clusters initially; it estimates the number of clusters in the input data itself.


Evaluating a specific information retrieval system with P@1

I am working on an information retrieval system which aims to select the first result and link it to another database. Our system is based on a keyword description of a video and tries to interlink the video with the DBpedia entity that has the same meaning as the description. In the evaluation step, I noticed that the majority of evaluations set the minimum precision cut-off to 5, which is not suitable for our system. I am thinking of using the interval [1,5]: (P@1, ..., P@5). Would that be possible?
Please provide your suggestions and references to some notes. Thanks.
You can definitely calculate P@1 for a retrieval system, if you have truth labels. (In this case, it sounds like they would be [Video, DBpedia] matching pairs generated by humans).
People generally look at this measure for things like Question-Answering or recommendation systems. The only caveat is that you typically wouldn't use it to train a learning to rank system or any other learning system -- it's not "continuous enough": a near miss (best at rank 2) and a total miss (best at rank 4 million) get equivalent scores, so it can be hard to smoothly improve a system by tuning weights in such a case.
For those kinds of tasks, using Mean Reciprocal Rank is pretty common, if you need something tunable. NDCG tends to be okay, too, since it discounts each result's gain by the logarithm of its rank.
But there's nothing in the definition of precision that prevents you from calculating it at rank 1. It may be more correct to describe it as a "success@1" feature, since you're going to get 0/1 or 1/1 as your two options.
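As a small illustration of precision at rank k and Mean Reciprocal Rank, here is a sketch assuming each query comes with a ranked list of candidate entities and a set of human-approved matches. The function names and the data layout are hypothetical.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0 if there is none."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def evaluate(runs, k=1):
    """runs: list of (ranked_results, relevant_set) pairs, one per query.

    Returns (mean precision at k, mean reciprocal rank).
    """
    mean_p = sum(precision_at_k(r, rel, k) for r, rel in runs) / len(runs)
    mrr = sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
    return mean_p, mrr
```

With k=1 each query contributes either 0 or 1, which is the "success" reading described above, while the reciprocal rank still rewards a correct entity that lands at rank 2 or 3.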

Mahout - Naive Bayes Model Very Slow

I have about 44 Million training examples across about 6200 categories.
After training, the model comes out to be ~ 450MB
And while testing, with 5 parallel mappers (each given enough RAM), the classification proceeds at a rate of ~ 4 items a second which is WAY too slow.
How can I speed things up?
One way I can think of is to reduce the word corpus, but I fear losing accuracy. I had maxDFPercent set to 80.
Another way I thought of was to run the items through a clustering algorithm and empirically maximize the number of clusters while keeping the items within each category restricted to a single cluster. This would allow me to build separate models for each cluster and thereby (possibly) decrease training and testing time.
Any other thoughts?
Edit :
After some of the answers given below, I started contemplating some form of down-sampling: running a clustering algorithm, identifying groups of items that are "highly" close to one another, and then taking a union of a few samples from those "highly" close groups and other samples that are not that close to one another.
I also started thinking about using some form of data normalization techniques that involve incorporating edit distances while using n-grams (http://lucene.apache.org/core/4_1_0/suggest/org/apache/lucene/search/spell/NGramDistance.html)
I'm also considering using the hadoop streaming api to leverage some of the ML libraries available in Python, listed here http://pydata.org/downloads/ , and here http://scikit-learn.org/stable/modules/svm.html#svm (These I think use liblinear mentioned in one of the answers below)
Prune stopwords and otherwise useless words (too low support etc.) as early as possible.
Depending on how you use clustering, it may actually make the test phase in particular even more expensive.
Try other tools than Mahout. I found Mahout to be really slow in comparison. It seems to incur a really high overhead somewhere.
Using fewer training examples would be an option. You will see that beyond a certain number of training examples your classification accuracy on unseen examples won't increase. I would recommend training with 100, 500, 1000, 5000, ... examples per category and using 20% for cross-validating the accuracy. When it doesn't increase anymore, you have found the amount of data you need, which may be a lot less than you use now.
Another approach would be to use another library. For document classification I find liblinear very, very fast. It may be more low-level than Mahout.
"but i fear losing accuracy" Have you actually tried using fewer features or fewer documents? You may not lose as much accuracy as you fear. There may be a few things at play here:
Such a high number of documents is not likely to be from the same time period. Over time, the content of a stream will inevitably drift and words indicative of one class may become indicative of another. In a way, adding data from this year to a classifier trained on last year's data is just confusing it. You may get much better performance if you train on less data.
The majority of features are not helpful, as @Anony-Mousse said already. You might want to perform some form of feature selection before you train your classifier. This will also speed up training. I've had good results in the past with mutual information.
I've previously trained classifiers for a data set of similar scale and found the system worked best with only 200k features, and using any more than 10% of the data for training did not improve accuracy at all.
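For reference, feature selection by mutual information can be sketched in a few lines. This version assumes each document is a set of tokens and scores a term by the mutual information (in bits) between "term is present" and the class label; `mutual_information` and `top_terms` are illustrative names, and a real run over 44 million documents would compute the counts in a single pass rather than per term.

```python
import math
from collections import Counter

def mutual_information(docs, labels, term):
    """Mutual information (bits) between presence of `term` and the class label.

    docs: list of token sets; labels: list of class labels of the same length.
    """
    n = len(docs)
    joint = Counter((term in doc, label) for doc, label in zip(docs, labels))
    presence = Counter(term in doc for doc in docs)
    classes = Counter(labels)
    mi = 0.0
    for (present, label), count in joint.items():
        p_joint = count / n
        mi += p_joint * math.log2(p_joint / ((presence[present] / n) * (classes[label] / n)))
    return mi

def top_terms(docs, labels, k):
    """The k terms with the highest mutual information with the label."""
    vocabulary = set().union(*docs)
    ranked = sorted(vocabulary,
                    key=lambda t: mutual_information(docs, labels, t),
                    reverse=True)
    return ranked[:k]
```

A term that appears equally often in every class scores 0 and can be dropped without hurting the classifier, which is where most of the savings come from.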
PS Could you tell us a bit more about your problem and data set?
Edit after question was updated:
Clustering is a good way of selecting representative documents, but it will take a long time. You will also have to re-run it periodically as new data come in.
I don't think edit distance is the way to go. Typical algorithms are quadratic in the length of the input strings, and you might have to run for each pair of words in the corpus. That's a long time!
I would again suggest that you give random sampling a shot. You say you are concerned about accuracy, but are using Naive Bayes. If you wanted the best model money can buy, you would go for a non-linear SVM, and you probably wouldn't live to see it finish training. People resort to classifiers with known issues (there's a reason Naive Bayes is called Naive) because they are much faster than the alternative but performance will often be just a tiny bit worse. Let me give you an example from my experience:
RBF SVM- 85% F1 score - training time ~ month
Linear SVM- 83% F1 score - training time ~ day
Naive Bayes- 82% F1 score - training time ~ day
You find the same thing in the literature: paper . Out of curiosity, what kind of accuracy are you getting?

Efficient data structure for quality threshold clustering algorithm

I'm trying to implement the quality threshold clustering algorithm. The outline of it (taken from here) is listed below:
Initialize the threshold distance allowed for clusters and the minimum cluster size
Build a candidate cluster for each data point by including the closest point, the next closest, and so on, until the distance of the cluster surpasses the threshold
Save the candidate cluster with the most points as the first true cluster, and remove all points in the cluster from further consideration
Repeat with the reduced set of points until no more clusters can be formed having the minimum cluster size
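Before worrying about the index, it may help to have a brute-force reference version of the outline above to test against. This sketch assumes points are coordinate tuples, Euclidean distance, and that "the distance of the cluster" means its diameter (the largest pairwise distance); the function name `qt_cluster` is my own.

```python
import math

def qt_cluster(points, threshold, min_size):
    """Brute-force quality threshold clustering.

    Returns a list of clusters (lists of points), largest candidate first
    at each round.
    """
    remaining = list(range(len(points)))
    clusters = []
    while len(remaining) >= min_size:
        best = []
        for seed in remaining:
            # Visit the other points from nearest to farthest from the seed.
            by_distance = sorted(
                (i for i in remaining if i != seed),
                key=lambda i: math.dist(points[seed], points[i]),
            )
            candidate = [seed]
            for i in by_distance:
                # Stop as soon as adding a point would exceed the diameter.
                if any(math.dist(points[i], points[j]) > threshold for j in candidate):
                    break
                candidate.append(i)
            if len(candidate) > len(best):
                best = candidate
        if len(best) < min_size:
            break
        clusters.append([points[i] for i in best])
        taken = set(best)
        remaining = [i for i in remaining if i not in taken]
    return clusters
```

The `sorted(...)` call is the part a spatial index would replace: what is needed there is an incremental nearest-neighbour iterator that can be abandoned early, which is exactly the query that does not fit a fixed-k interface.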
I've been reading up on some nearest neighbor search algorithms and space partitioning data structures, as they seem to be the kind of thing I need, but I cannot determine which one to use or if I'm supposed to be looking at something else.
I want to implement the data structure myself for educational purposes, and I need one that can successively return the nearest points for some point. However, since I don't know the number of times I need to query (i.e. until the threshold is exceeded), I can't use k-nearest neighbor algorithms. I've been looking mostly at quadtrees and k-d trees.
Additionally, since the algorithm constantly builds new candidate clusters, it would be interesting to use a modified data structure that uses cached information to speed up subsequent queries (but also taking point removal into account).
This algorithm sounds like a predecessor of DBSCAN (Wikipedia), which is known to work very well with R*-Tree indexes (Wikipedia). But of course, kd-trees are also an option. The main difference between these two is that R*-trees are meant for database use - they support online insertions and deletions very well, and are block oriented - while kd-trees are more of an in-memory data structure based on binary splits. R*-trees perform rebalancing, while kd-trees will slowly become unbalanced and will need to be rebuilt.
I find nearest neighbor search in R*-trees much more understandable than in k-d-trees, because the bounding rectangles are very intuitive.
DBSCAN also "removes" points from further consideration, but simply by marking them as already assigned. That way you don't need to update the index; and it's sufficient to bulk-load it once in the beginning. You should be able to do this for QT, too. So unless I'm mistaken, you can get the QT clustering efficiently by running DBSCAN with epsilon set to the QT threshold distance and minPts=2 (although one would prefer higher values in proper DBSCAN).
There are a number of DBSCAN implementations around. The one in Weka is exceptionally crappy, so stay away from it. The fpc implementation in R is okay, but could still be a lot faster. ELKI seems to be the only one with full index support, and the speed difference is massive. Their benchmark shows a 12x speed gain from using an index on this data set, allowing them to cluster in 50 seconds instead of 603 (without index). Weka took an incredible 37917 seconds and R fpc 4339 on the same data. That aligns with my experience: Weka has the reputation of being quite slow, and R only kicks ass at vectorized operations; once the R interpreter has to do the work, it is significantly slower than anything native. But it is a good example of how differently the same algorithm can perform when it is implemented by different people. I would have expected this to be 2x-5x, but apparently the differences can easily be 50x from one programmer implementing the same algorithm to another.

What are good algorithms for detecting abnormality?

Here is the problem:
A black box outputs a new number each day.
Those numbers have been recorded for a period of time.
Detect when a new number from the black box falls outside the pattern of numbers established over the time period.
The numbers are integers, and the time period is a year.
What algorithm will identify a pattern in the numbers?
The pattern might be simple, like always ascending or always descending, or the numbers might fall within a narrow range, and so forth.
I have some ideas, but am uncertain as to the best approach, or what solutions already exist:
Machine learning algorithms?
Neural network?
Classify normal and abnormal numbers?
Statistical analysis?
Cluster your data.
If you don't know how many modes your data will have, use something like a Gaussian Mixture Model (GMM) along with a scoring function (e.g., Bayesian Information Criterion (BIC)) so you can automatically detect the likely number of clusters in your data. I recommend this instead of k-means if you have no idea what value k is likely to be. Once you've constructed a GMM for your data for the past year, given a new datapoint x, you can calculate the probability that it was generated by any one of the clusters (each modeled by a Gaussian in the GMM). If your new data point has low probability of being generated by any one of your clusters, it is very likely a true outlier.
If this sounds a little too involved, you will be happy to know that the entire GMM + BIC procedure for automatic cluster identification has been implemented for you in the excellent MCLUST package for R. I have used it several times to great success for such problems.
Not only will it allow you to identify outliers, you will have the ability to put a p-value on a point being an outlier if you need this capability (or want it) at some point.
You could try line-fitting prediction using linear regression and see how it goes; it would be fairly easy to implement in your language of choice.
After you have fitted a line to your data, you can calculate the standard deviation of the points about the line (that is, of the residuals).
If the new point lies within the trend line +- the standard deviation, it should not be regarded as an abnormality.
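A bare-bones sketch of this check, assuming the numbers arrive one per day so the x values are simply 0, 1, 2, ... The names `fit_line` and `is_abnormal` and the `n_sigmas` knob are hypothetical; the default of 1 matches the rule above, and widening it to 2 or 3 would flag fewer ordinary points.

```python
import statistics

def fit_line(ys):
    """Least-squares line through (0, ys[0]), (1, ys[1]), ...; needs >= 2 points."""
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def is_abnormal(history, new_value, n_sigmas=1.0):
    """True if new_value is further than n_sigmas residual std devs from the trend."""
    slope, intercept = fit_line(history)
    residuals = [y - (slope * x + intercept) for x, y in enumerate(history)]
    sigma = statistics.pstdev(residuals)
    predicted = slope * len(history) + intercept  # extrapolate one day ahead
    return abs(new_value - predicted) > n_sigmas * sigma
```

This only captures a straight-line trend; a series that stays in a narrow range is the special case with slope close to zero.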
PCA is another technique that comes to mind when dealing with this type of data.
You could also look into unsupervised learning. This is a machine learning technique that can be used to detect differences in larger data sets.
Sounds like a fun problem! Good luck
There is little magic in all the techniques you mention. I believe you should first try to narrow down the typical abnormalities you may encounter; it helps keep things simple.
Then, you may want to compute derived quantities relevant to those features. For instance: "I want to detect numbers abruptly changing direction" => compute u_{n+1} - u_n, and expect it to have constant sign, or to fall in some range. You may want to keep this flexible, and allow your code design to be extensible (the Strategy pattern may be worth looking at if you do OOP)
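To illustrate, here is one way the derived quantity u_{n+1} - u_n and a pluggable list of checks might look. The check functions, the `CHECKS` list and the `slack` parameter are invented for this sketch; each check plays the role of one strategy.

```python
def differences(values):
    """Successive differences u_{n+1} - u_n."""
    return [b - a for a, b in zip(values, values[1:])]

def sign_check(history, new_value):
    """Flag a change of direction if every past step had the same sign."""
    steps = differences(history)
    step = new_value - history[-1]
    if all(s > 0 for s in steps):
        return step <= 0
    if all(s < 0 for s in steps):
        return step >= 0
    return False  # no consistent direction to violate

def range_check(history, new_value, slack=0.0):
    """Flag a step outside the range of steps seen so far (plus some slack)."""
    steps = differences(history)
    step = new_value - history[-1]
    return not (min(steps) - slack <= step <= max(steps) + slack)

# New kinds of abnormality are handled by appending another function here.
CHECKS = [sign_check, range_check]

def violated_checks(history, new_value):
    """Names of all checks that consider new_value abnormal."""
    return [check.__name__ for check in CHECKS if check(history, new_value)]
```

Returning the names of the violated checks, rather than a single yes/no, makes it easier to see which assumption about the pattern was broken.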
Then, when you have some derived quantities of interest, you do statistical analysis on them. For instance, for a derived quantity A, you assume it should have some distribution P(a, b) (uniform([a, b]), or Beta(a, b), possibly more complex), you put a priori laws on a, b and you adjust them based on successive information. Then, the posterior likelihood of the info provided by the last point added should give you some insight about it being normal or not. Relative entropy between posterior and prior law at each step is a good thing to monitor too. Consult a book on Bayesian methods for more info.
I see little point in complex traditional machine learning stuff (perceptron layers or SVM to cite only them) if you want to detect outliers. These methods work great when classifying data which is known to be reasonably clean.

Modeling distribution of performance measurements

How would you mathematically model the distribution of repeated real life performance measurements - "Real life" meaning you are not just looping over the code in question, but it is just a short snippet within a large application running in a typical user scenario?
My experience shows that you usually have a peak around the average execution time that can be modeled adequately with a Gaussian distribution. In addition, there's a "long tail" containing outliers - often with a multiple of the average time. (The behavior is understandable considering the factors contributing to first execution penalty).
My goal is to model aggregate values that reasonably reflect this, and can be calculated from aggregate values (like for the Gaussian, calculate mu and sigma from N, sum of values and sum of squares). In other terms, number of repetitions is unlimited, but memory and calculation requirements should be minimized.
A normal Gaussian distribution can't model the long tail appropriately and will have the average biased strongly even by a very small percentage of outliers.
I am looking for ideas, especially if this has been attempted/analysed before. I've checked various distributions models, and I think I could work out something, but my statistics is rusty and I might end up with an overblown solution. Oh, a complete shrink-wrapped solution would be fine, too ;)
Other aspects / ideas: Sometimes you get "two humps" distributions, which would be acceptable in my scenario with a single mu/sigma covering both, but ideally would be identified separately.
Extrapolating this, another approach would be a "floating probability density calculation" that uses only a limited buffer and adjusts automatically to the range (due to the long tail, bins may not be spaced evenly) - haven't found anything, but with some assumptions about the distribution it should be possible in principle.
Why (since it was asked) -
For a complex process we need to make guarantees such as "only 0.1% of runs exceed a limit of 3 seconds, and the average processing time is 2.8 seconds". The performance of an isolated piece of code can be very different from a normal run-time environment involving varying levels of disk and network access, background services, scheduled events that occur within a day, etc.
This can be solved trivially by accumulating all data. However, to accumulate this data in production, the data produced needs to be limited. For analysis of isolated pieces of code, a gaussian deviation plus first run penalty is ok. That doesn't work anymore for the distributions found above.
[edit] I've already got very good answers (and finally - maybe - some time to work on this). I'm starting a bounty to look for more input / ideas.
Often when you have a random value that can only be positive, a log-normal distribution is a good way to model it. That is, you take the log of each measurement, and assume that is normally distributed.
If you want, you can consider that to have multiple humps, i.e. to be a mixture of two normals with different means. The parameters of those are a bit tricky to estimate, because you may have to estimate, for each measurement, its probability of belonging to each hump. That may be more than you want to bother with.
Log-normal distributions are very convenient and well-behaved. For example, you don't deal with its average, you deal with its geometric mean, which is the same as its median.
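Since the question asks for something that can be maintained from a few running totals, here is a sketch of how a log-normal fit could be accumulated from just N, the sum of logs and the sum of squared logs. The class name and its methods are hypothetical, and the quantile is only as good as the log-normal assumption itself.

```python
import math
from statistics import NormalDist

class LogNormalAccumulator:
    """Keeps three running totals; memory use is constant in the number of samples."""

    def __init__(self):
        self.n = 0
        self.sum_log = 0.0
        self.sum_log_sq = 0.0

    def add(self, value):
        log_v = math.log(value)  # measurements must be strictly positive
        self.n += 1
        self.sum_log += log_v
        self.sum_log_sq += log_v * log_v

    def mu(self):
        """Mean of the logs."""
        return self.sum_log / self.n

    def sigma(self):
        """Population standard deviation of the logs."""
        variance = self.sum_log_sq / self.n - self.mu() ** 2
        return math.sqrt(max(variance, 0.0))  # guard against rounding below zero

    def median(self):
        """Geometric mean of the measurements, equal to the log-normal median."""
        return math.exp(self.mu())

    def quantile(self, p):
        """Value below which a fraction p of runs should fall (needs sigma > 0)."""
        return math.exp(NormalDist(self.mu(), self.sigma()).inv_cdf(p))
```

Under this model a statement such as "only 0.1% of runs exceed the limit" would be checked against `quantile(0.999)`. The sum-of-squares form can lose precision when sigma is tiny compared to mu, so a production version might prefer a numerically safer update.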
BTW, in pharmacometric modeling, log-normal distributions are ubiquitous, modeling such things as blood volume, absorption and elimination rates, body mass, etc.
ADDED: If you want what you call a floating distribution, that's called an empirical or non-parametric distribution. To model that, typically you save the measurements in a sorted array. Then it's easy to pick off the percentiles. For example the median is the "middle number". If you have too many measurements to save, you can go to some kind of binning after you have enough measurements to get the general shape.
ADDED: There's an easy way to tell if a distribution is normal (or log-normal). Take the logs of the measurements and put them in a sorted array. Then generate a QQ plot (quantile-quantile). To do that, generate as many normal random numbers as you have samples, and sort them. Then just plot the points, where X is the normal distribution point, and Y is the log-sample point. The results should be a straight line. (A really simple way to generate a normal random number is to just add together 12 uniform random numbers in the range +/- 0.5.)
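Here is a small sketch of that QQ check. It deviates from the recipe above in one respect, which I want to state plainly: instead of sorting random normal numbers it uses the exact normal quantiles at the plotting positions (i + 0.5) / n, which gives a less noisy line. The function names are made up, and the correlation coefficient is just a convenient one-number summary of "how straight" the plot is.

```python
import math
from statistics import NormalDist

def qq_points(samples):
    """(normal quantile, sorted log-sample) pairs for a log-normal QQ plot."""
    logs = sorted(math.log(s) for s in samples)
    n = len(logs)
    standard_normal = NormalDist()
    theoretical = [standard_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, logs))

def straightness(points):
    """Pearson correlation of the QQ points; values near 1 mean a straight line."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    return sxy / math.sqrt(sxx * syy)
```

Plotting `qq_points(measurements)` with any charting tool shows where the fit breaks down; a tail that bends away from the line at the top end is the long tail the question describes.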
The problem you describe is called "Distribution Fitting" and has nothing to do with performance measurements specifically, i.e. this is the generic problem of fitting a suitable distribution to any gathered/measured data sample.
The standard process is something like this:
1. Guess the best distribution.
2. Run hypothesis tests to check how well it describes the gathered data.
3. Repeat steps 1-2 if the fit is not good enough.
You can find an interesting article describing how this can be done with the open-source R software system here. I think the function fitdistr may be especially useful to you.
In addition to the answers already given, consider Empirical Distributions. I have had success using empirical distributions for performance analysis of several distributed systems. The idea is very straightforward. You need to build a histogram of the performance measurements. Measurements should be discretized with a given accuracy. Once you have the histogram, you can do several useful things:
calculate the probability of any given value (you are bound by accuracy only);
build PDF and CDF functions for the performance measurements;
generate sequence of response times according to a distribution. This one is very useful for performance modeling.
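A fixed-memory histogram along these lines might look as follows. The logarithmically spaced bins are my assumption, picked up from the question's remark that bins may not be evenly spaced because of the long tail; the class name, the default range and `bins_per_decade` are all illustrative.

```python
import math
import random

class LogHistogram:
    """Histogram with logarithmically spaced bins and constant memory use."""

    def __init__(self, min_value=1e-4, max_value=1e3, bins_per_decade=20):
        self.min_value = min_value
        self.bins_per_decade = bins_per_decade
        self.n_bins = math.ceil(math.log10(max_value / min_value) * bins_per_decade)
        self.counts = [0] * self.n_bins
        self.total = 0

    def _index(self, value):
        # Values outside [min_value, max_value] are clamped to the edge bins.
        value = max(value, self.min_value)
        i = int(math.log10(value / self.min_value) * self.bins_per_decade)
        return min(i, self.n_bins - 1)

    def _upper_edge(self, i):
        return self.min_value * 10 ** ((i + 1) / self.bins_per_decade)

    def add(self, value):
        self.counts[self._index(value)] += 1
        self.total += 1

    def cdf(self, value):
        """Fraction of measurements in bins up to and including value's bin."""
        return sum(self.counts[: self._index(value) + 1]) / self.total

    def percentile(self, p):
        """Upper edge of the bin where the cumulative fraction reaches p (0..1)."""
        target = p * self.total
        running = 0
        for i, count in enumerate(self.counts):
            running += count
            if count and running >= target:
                return self._upper_edge(i)
        return self._upper_edge(self.n_bins - 1)

    def sample(self, rng=random):
        """Draw a synthetic measurement by inverting the empirical CDF."""
        return self.percentile(rng.random())
```

The relative error of any reported percentile is bounded by the bin width, here a factor of 10^(1/bins_per_decade), so the accuracy can be traded against memory directly.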
Try the gamma distribution: http://en.wikipedia.org/wiki/Gamma_distribution
From wikipedia
The gamma distribution is frequently a probability model for waiting times; for instance, in life testing, the waiting time until death is a random variable that is frequently modeled with a gamma distribution.
The standard models for randomized arrivals in performance modelling are the exponential distribution for the time between arrivals and the Poisson distribution for the number of arrivals in a fixed interval (the two describe the same process: a Poisson process is the one whose inter-arrival times are independent and exponentially distributed, and the sum of several such exponential times follows an Erlang, i.e. gamma, distribution).
Not exactly answering your question, but relevant still: Mor Harchol-Balter did a very nice analysis of the size of jobs submitted to a scheduler, The effect of heavy-tailed job size distributions on computer systems design (1999). She found that the size of jobs submitted to her distributed task assignment system took a power-law distribution, which meant that certain pieces of conventional wisdom she had assumed in the construction of her task assignment system, most importantly that the jobs should be well load balanced, had awful consequences for submitters of jobs. She's done good follow-up work on this issue.
The broader point is, you need to ask such questions as:
What happens if reasonable-seeming assumptions about the distribution of performance, such as that it follows a normal distribution, break down?
Are the data sets I'm looking at really representative of the problem I'm trying to solve?
