Cluster similar curves considering "belongingness"? - algorithm

Currently, I have 6 curves shown in 6 different colors as below.
The 6 curves are in fact generated by 6 trials of one same experiment. That means, ideally they should be the same curve, but due to the noise and different trial participants, they just look similar but not exactly the same.
Now I wish to create an algorithm that is able to identify that the 6 curves are essentially the same and cluster them together into one cluster. What similarity metrics should I use?
Note:
The x-axis does NOT matter at all! I simply align them together for visual purpose. Thus, feel free to left/right shift the curves, if doing so helps.
"Sub-curves" that are part of the curves may appear. The "belongingness" is important and thus needs identifying as well. But again, left/right shifting is allowed.
I have attemped to learn some of the clustering algorithm, such as DBSCAN, K-means, Fuzzy C-means, etc. But I don't see their appropriateness in this case, because the "belongingness" needs to be spotted!
Any suggestions or comments are well welcomed. I understand that it is hard to give some exact solutions to this question. I am only expecting some enlightening suggestions here.

Have a look at time series similarity functions, such as dynamic time warping.
They can be used with e.g. DBSCAN but NOT with k-means (you cannot compute a reasonable "mean" for these distances; k-means is really designed for squared Euclidean distances).

Related

Mathematically how does one compare classification result to clustering results

Is there a standard methodology to compare results (for accuracy) of a classification algorithm against a clustering algorithm? I have data that has only two true labels. Easy enough to check accuracy when I run a binary classification on it, but if I run clustering, where I ask it to cluster the data into 5 groups, how can I check the accuracy and compare it to the binary classification. I know clustering is not suitable for (two label) data but how can one prove this mathematically?
Clustering into more than two clusters is one way to do 2-class classification (just pick which ever label is more common in each cluster to be the predicted label for the cluster). However it's a very strange approach because it ignores the labels until the very end after the clustering is computed. Supervised learning (i.e. classification) provides much more powerful tools like random forests for classification.
Don't approach clustering as classification
They have very different objectives, and really should not be compared. Classification is about reproducing known labels, and you need to pay attention to overfitting, train/test splitting etc. Clustering on the other hand is exploratory. Any truly exploratory method will eventually not find anything, or will turn up obvious results only.
By trying to evaluate it the same way as classification, you "overfit" to clustering methods that yield the obvious, if anything.
Instead, evaluate Clustering by looking at the results. If you learn something from the result, then it was good. If not, try again.
Don't try to stick a number on everything
There is more than black, white, and 50 shades of grey. Putting everything into a single number is a grayscale view of the world... it's popular (so is "good vs. evil" thinking); but in science we should do better.

Edge detection : Any performance evaluation technique?

I am working on edge detection in images and would like to evaluate the performance of algorithm, if any any one could give me a reference or method on how to proceed it will be really helpful. :)
I do not have ground truth and data set includes color as well as gray images.
Thank you.
Create a synthetic data set with known edges, for example by 3D rendering, by compositing 2D images with precise masks (as may be obtained in royalty free photosets), or by introducing edges directly (thin/faint lines). Remember to add some confounding non-edges that look like edges, of a type appropriate for what you're tuning for.
Use your (non-synthetic) data set. Run the reference algorithms that you want to compare against. Also produce combinations of the reference algorithms, for example by voting (majority, at least K out of N, etc). Calculate stats on your algo vs reference algo performance, in terms of (a) number of points your algo classifies as edge which each reference algo, or the combination, does not classify as edge (false positive), or (b) number of points which the reference algo classifies as edge that your algo does not (false negative). You can also calculate a rank correlation-type number for algos by looking at each point and looking at which algos do (or don't) classify that as an edge.
Create ground truth manually. Use reference edge-finding algos as a starting point, then fix up by hand. Probably valuable to do for a small number of images in any case.
Good luck!
For comparisons, quantitative measures like what #Alex I explained is best. To do so, you need to define what is "correct" with a ground truth set and a way to consistently determine if a given image is correct or on a more granular level, how correct (some number like a percentage) it is. #Alex I gave a way to do that.
Another option that is often used in graphics research where there is no ground truth is user studies. Usually less desirable as they are time consuming and often more costly. However, if it is a qualitative improvement that you are after or if a quantitative measurement is just too hard to do, a user study is an appropriate solution.
When I mean user study I mean to poll people on how well a result is given the input image. You could give them a scale to rate things on and randomly give them samples from both your results and the results of another algorithm
And of course, if you still want more ideas, be sure to check out edge detection papers to see how they measured their results (I'd actually look here first as they've already gone through this same process and determined what was best for them: google scholar).

What are good algorithms for detecting abnormality?

Background
Here is the problem:
A black box outputs a new number each day.
Those numbers have been recorded for a period of time.
Detect when a new number from the black box falls outside the pattern of numbers established over the time period.
The numbers are integers, and the time period is a year.
Question
What algorithm will identify a pattern in the numbers?
The pattern might be simple, like always ascending or always descending, or the numbers might fall within a narrow range, and so forth.
Ideas
I have some ideas, but am uncertain as to the best approach, or what solutions already exist:
Machine learning algorithms?
Neural network?
Classify normal and abnormal numbers?
Statistical analysis?
Cluster your data.
If you don't know how many modes your data will have, use something like a Gaussian Mixture Model (GMM) along with a scoring function (e.g., Bayesian Information Criterion (BIC)) so you can automatically detect the likely number of clusters in your data. I recommend this instead of k-means if you have no idea what value k is likely to be. Once you've constructed a GMM for you data for the past year, given a new datapoint x, you can calculate the probability that it was generated by any one of the clusters (modeled by a Gaussian in the GMM). If your new data point has low probability of being generated by any one of your clusters, it is very likely a true outlier.
If this sounds a little too involved, you will be happy to know that the entire GMM + BIC procedure for automatic cluster identification has been implemented for you in the excellent MCLUST package for R. I have used it several times to great success for such problems.
Not only will it allow you to identify outliers, you will have the ability to put a p-value on a point being an outlier if you need this capability (or want it) at some point.
You could try line fitting prediction using linear regression and see how it goes, it would be fairly easy to implement in your language of choice.
After you fitted a line to your data, you could calculate the mean standard deviation along the line.
If the novel point is on the trend line +- the standard deviation, it should not be regarded as an abnormality.
PCA is an other technique that comes to mind, when dealing with this type of data.
You could also look in to unsuperviced learning. This is a machine learning technique that can be used to detect differences in larger data sets.
Sounds like a fun problem! Good luck
There is little magic in all the techniques you mention. I believe you should first try to narrow the typical abnormalities you may encounter, it helps keeping things simple.
Then, you may want to compute derived quantities relevant to those features. For instance: "I want to detect numbers changing abruptly direction" => compute u_{n+1} - u_n, and expect it to have constant sign, or fall in some range. You may want to keep this flexible, and allow your code design to be extensible (Strategy pattern may be worth looking at if you do OOP)
Then, when you have some derived quantities of interest, you do statistical analysis on them. For instance, for a derived quantity A, you assume it should have some distribution P(a, b) (uniform([a, b]), or Beta(a, b), possibly more complex), you put a priori laws on a, b and you ajust them based on successive information. Then, the posterior likelihood of the info provided by the last point added should give you some insight about it being normal or not. Relative entropy between posterior and prior law at each step is a good thing to monitor too. Consult a book on Bayesian methods for more info.
I see little point in complex traditional machine learning stuff (perceptron layers or SVM to cite only them) if you want to detect outliers. These methods work great when classifying data which is known to be reasonably clean.

Which data clustering algorithm is appropriate to detect an unknown number of clusters in a time series of events?

Here's my scenario. Consider a set of events that happen at various places and times - as an example, consider someone high above recording the lightning strikes in a city during a storm. For my purpose, lightnings are instantaneous and can only hit certain locations (such as high buildings). Also imagine each lightning strike has a unique id so one can reference the strike later. There are about 100,000 such locations in this city (as you guess, this is an analogy as my current employer is sensitive about the actual problem).
For phase 1, my input is the set of (strike id, strike time, strike location) tuples. The desired output is the set of the clusters of more than 1 event that hit the same location within a short time. The number of clusters is not known in advance (so k-means is not that useful here). What is being considered as 'short' could be predefined for a given clustering attempt. That is, I can set it to, say, 3 minutes, than run the algorithm; later try with 4 minutes or 10 minutes. Perhaps a nice touch would be for the algorithm to determine a 'strength' of clustering and recommend that for a given input, the most compact clustering is achieved by using a particular value for 'short', but this is not required initially.
For phase 2, I'd like to take into consideration the amplitude of the strike (i.e., a real number) and look for clusters that are both within a short time and with similar amplitudes.
I googled and checked the answers here about data clustering. The information is a bit bewildering (below is the list of links I found useful). AFAIK, k-means and related algorithms would not be useful because they require the number of clusters to be specified apriori. I'm not asking for someone to solve my problem (I like solving it), but some orientation in the large world of data clustering algorithms would be useful in order to save some time. Specifically, what clustering algorithms are appropriate for when the number of clusters is unknown.
Edit: I realized the location is irrelevant, in the sense that although events happen all the time, I only need to cluster them per location. So each location has its own time-series of events that can thus be analyzed independently.
Some technical details:
- as the dataset is not that large, it can fit all in memory.
- parallel processing is a nice to have, but not essential. I only have a 4-core machine and MapReduce and Hadoop would be too much.
- the language I'm mostly familiar with is Java. I haven't yet used R and the learning curve for it would probably be too much for what time I was given. I'll have a look at it anyway in my spare time.
- for the time being, using tools to run the analysis is ok, I don't have to produce just code. I'm mentioning this because probably Weka will be suggested.
- visualization would be useful. As the dataset is large enough so it doesn't fit in memory, the visualization should at least support zooming and panning. And to clarify: I don't need to build a visualization GUI, it's just a nice capability to use for checking the results produced with a tool.
Thank you. Questions that I found useful are: How to find center of clusters of numbers? statistics problem?, Clustering Algorithm for Paper Boys, Java Clustering Library, How to cluster objects (without coordinates), Algorithm for detecting "clusters" of dots
I would suggest you to look into Mean Shift Clustering. The basic idea behind mean shift clustering is to take the data and perform a kernel density estimation, then find the modes in the density estimate, the regions of convergence of data points towards modes defines the clusters.
The nice thing about mean shift clustering is that the number of clusters do not have to be specified ahead of time.
I have not used Weka, so I am not sure if it has mean shift clustering. However if you are using MATLAB, here is a toolbox (KDE toolbox) to do it. Hope that helps.
Couldn't you just use hierarchical clustering with the difference in times of strikes as part of the distance metric?
It is too late, but still I would add it:
In R, there is a package fpc and it has a method pamk() which provides you the clusters. Using pamk(), you do not need to mention the number of clusters intially. It calculates itself the number of clusters in the input data.

Comparing a "path" (or GPS trail) of a vehicle

I have a bit of a difficult algorithm question, I can't find any suitable algorithm from a lot of searching, so I am hoping that someone here on stackoverflow might know the answer.
I have a set of x,y coordinates for a vehicle as it moves through a 2D space, the coordinates are recorded at "decision points" in the time period (i.e. they have stopped and made a determination of where to move next).
What I want to do is find a mechanism for comparing these trails efficiently (i.e. not going through each point individually). Compounding this is that I am interested in the "pattern" of their movement, not necessarily the individual points they went to. This means that the "path" is considered the same if you reflect it around an axis, or if you rotate it by 90,180 or 270 degrees.
Basically I am trying to distil some sort of "behaviour" to the way they move through the space, then examine the different "behaviours" for classification purposes.
Cheers,
Aidan
This may be way more complicated than you're looking for, but it sounds like what the guys did at astrometry.net may be similar to what you're looking for. Essentially, you can upload a picture of some stars, and it will figure out the position in the sky it belongs, along with rotation, you may be able to use similar pattern matching in what you're looking for.
They have a great pdf explaining how it works here, and apparently you can email them and they'll send you the source code (details are in the pdf).
Edit: apparently you can download the code directly here.
Hope it helps.
there are several approaches you could make:
Using vector paths and translation matricies together with two algorithms, The A* (a star) algorithm ( to locate best routes from what are called greedy functions ), and the "nearest neighbour" algorithm --- these are both commonly used for comparing path efficiencies for routes.
you may not know it but the issue you have is known as the "travelling salesman" problem and has many many approaches.
so look up
traveling salesman problem
A*
Nearest neighbour
also look at
Random walk algorithm - for the most basic approach
for a learned behaviour approach try neural networks "ANN" or genetic algorithms
the mathematics for this type of problem are covered under what is called "graph theory"
It seems that basically what is needed is some metric to compare two(N in general) paths and choose the best one?
If that's the case then I'd suggest plain statistics. I'd start with heading(orientation) histogram, relative(relative to previous heading) heading histogram and so on. Other thing comes to mind - distance/orientation between points covariance. Or just simply make up some kind of "statistics"(number of turns, etc.) and compare those paths using that.

Resources