Mathematically, how does one compare classification results to clustering results?

Is there a standard methodology to compare the results (for accuracy) of a classification algorithm against a clustering algorithm? I have data that has only two true labels. It is easy enough to check accuracy when I run a binary classification on it, but if I run clustering, where I ask it to cluster the data into 5 groups, how can I check the accuracy and compare it to the binary classification? I know clustering is not suitable for (two-label) data, but how can one prove this mathematically?

Clustering into more than two clusters is one way to do 2-class classification (just pick whichever label is more common in each cluster as the predicted label for that cluster). However, it's a very strange approach, because it ignores the labels until the very end, after the clustering has been computed. Supervised learning (i.e. classification) provides much more powerful tools for classification, such as random forests.
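For concreteness, here is a minimal sketch (assuming scikit-learn and a synthetic two-label dataset standing in for the real data) of that majority-label mapping: cluster into 5 groups, assign each cluster its most common true label, and score the induced predictions with the same accuracy metric you would use for the binary classifier.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score

# Stand-in for a dataset with two true labels.
X, y = make_classification(n_samples=500, n_classes=2, random_state=0)

# Cluster into 5 groups, ignoring the labels.
clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# Map each cluster to the majority true label among its members.
y_pred = np.empty_like(y)
for c in np.unique(clusters):
    members = clusters == c
    y_pred[members] = np.bincount(y[members]).argmax()

print("cluster-majority accuracy:", accuracy_score(y, y_pred))
```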

Don't approach clustering as classification
They have very different objectives and really should not be compared. Classification is about reproducing known labels, and you need to pay attention to overfitting, train/test splitting, and so on. Clustering, on the other hand, is exploratory. A truly exploratory method will sometimes find nothing, or only turn up the obvious.
By trying to evaluate it the same way as classification, you "overfit" to clustering methods that yield the obvious, if anything.
Instead, evaluate clustering by looking at the results. If you learn something from the result, then it was good. If not, try again.
Don't try to stick a number on everything
There is more than black, white, and 50 shades of grey. Putting everything into a single number is a grayscale view of the world... it's popular (so is "good vs. evil" thinking); but in science we should do better.

Related

Can humans cluster data sets manually? Which clustering algorithms are closest to human clustering?

Can humans cluster data sets manually? For example, consider the Iris data set, depicted below:
[Scatter plot of the Iris data set: http://i.stack.imgur.com/Ae6qa.png]
Instead of using clustering algorithms such as connectivity-based (hierarchical) clustering, centroid-based clustering, distribution-based clustering, density-based clustering, etc.,
can a human manually cluster the Iris dataset? For our convenience, let us consider it as a two-dimensional dataset. By what means, and how, would a human cluster the dataset?
I am concerned that "human clustering" might not be well defined and could vary according to different people's intuitions and opinions. I would like to know which clustering algorithms are closest to human clustering, or how humans actually cluster a dataset. Is there a clustering algorithm that would perform just like a human doing the clustering?
Humans can and do cluster data manually, but as you say there will be a lot of variation and subjective decisions. Assuming that you could get an algorithm that will use the same features as a human, it's in principle possible to have a computer cluster like a human.
At a first approximation, nearest neighbor algorithms are probably close to how humans cluster, in that they group things that look similar under some measure. Keep in mind that, without training and significant ongoing effort, humans really don't do well on consistency. We seem to be biased toward looking for novelty, so we tend to break things into two big clusters: the stuff we encounter all the time, and everything else.
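As a rough illustration (assuming scikit-learn; the choice of single-linkage clustering and of the petal features as a proxy for "group what looks similar" is mine, not from the question), nearest-neighbor-style grouping on two Iris dimensions might look like this:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data[:, 2:4]  # petal length and petal width only, to mimic a 2-D view

# Single linkage merges the nearest neighbors first, loosely mirroring
# perceptual "these points look close together" grouping.
labels = AgglomerativeClustering(n_clusters=3, linkage="single").fit_predict(X)
print(labels[:10])
```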

Classifying Multivariate Time Series

I am currently working on a time series with 430 attributes and approximately 80k instances. I would like to binary classify each instance (not the whole time series). Everything I have found about classifying time series talks about labeling the whole thing.
Is it possible to classify each instance with something like an SVM, completely disregarding the sequential nature of the data, or would that only result in a really bad classifier?
What other options are there that classify each instance but still treat the data as a time series?
If the data is labeled, you may have luck by concatenating attributes together, so each instance becomes a single long time series, and then applying the so-called Shapelet Transform. This results in a vector of values for each time series, which can be fed into an SVM, Random Forest, or any other classifier. Picking the right shapelets may allow you to focus on a single attribute when classifying instances.
If it is not labeled, you may try unsupervised shapelets first to explore your data, then proceed with the aforementioned shapelet transform.
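Here is a simplified, self-contained sketch of the shapelet-transform idea (the candidate shapelets below are hypothetical random snippets; a real implementation searches for discriminative ones): each series is represented by its minimum sliding-window distance to each shapelet, and that distance vector is what you would feed to an SVM or Random Forest.

```python
import numpy as np

def min_distance(series, shapelet):
    """Smallest Euclidean distance between the shapelet and any window of the series."""
    L = len(shapelet)
    windows = np.lib.stride_tricks.sliding_window_view(series, L)
    return np.min(np.linalg.norm(windows - shapelet, axis=1))

def shapelet_transform(series_list, shapelets):
    """Represent each series as a vector of min-distances, one per shapelet."""
    return np.array([[min_distance(s, sh) for sh in shapelets] for s in series_list])

rng = np.random.default_rng(0)
series_list = [rng.normal(size=100) for _ in range(5)]   # toy series
shapelets = [rng.normal(size=10), rng.normal(size=15)]   # hypothetical candidate shapelets
print(shapelet_transform(series_list, shapelets).shape)  # (5, 2) feature matrix
```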
It certainly depends on the data within the 430 attributes, their data types, and especially the problem you want to solve.
In time series analysis, you usually want to exploit the dependencies between the neighboring points, i.e., how they change in time. The examples you may find in books usually talk about a single function f(t): Time -> Real. If I understand it correctly, you want to focus just on the dependencies among the 430 attributes (vertical dependencies) and disregard the horizontal dependencies.
If I were you, I would first try to train multiple classifiers (SVM, Maximum entropy model, Multi-layer perceptron, Random forest, Probabilistic Neural Network, ...) and compare their prediction performance in the frame of your problem.
For training, you can start by feeding all 430 attributes as features to a Maxent classifier (which can easily handle millions of features).
You also need to perform some N-fold cross-validation to check that the classifiers are not overfitting. Then pick the best one that solves your problem well enough.
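A minimal sketch of that comparison (assuming scikit-learn; the synthetic matrix below stands in for the real 80k x 430 data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression  # a maximum-entropy model
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC

# Stand-in for the real instances-by-attributes matrix and binary labels.
X, y = make_classification(n_samples=2000, n_features=430, random_state=0)

models = {
    "logistic regression (maxent)": LogisticRegression(max_iter=1000),
    "linear SVM": LinearSVC(),
    "multi-layer perceptron": MLPClassifier(max_iter=500),
    "random forest": RandomForestClassifier(n_estimators=200),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```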
Other ideas if this approach does not perform well:
include features from t-1, t-2...
perform feature selection by trying different subsets of features
derive new time series such as moving averages, wavelet spectra, etc., and use them as new features (see the sketch below)
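As a small sketch of the first and third ideas (assuming pandas and a DataFrame df holding the attributes, indexed by time; the names here are illustrative):

```python
import pandas as pd

def add_time_features(df, lags=(1, 2), window=5):
    """Append lagged copies and a rolling mean of every attribute."""
    parts = [df]
    for lag in lags:
        parts.append(df.shift(lag).add_suffix(f"_lag{lag}"))             # features from t-1, t-2, ...
    parts.append(df.rolling(window).mean().add_suffix(f"_ma{window}"))   # moving averages
    return pd.concat(parts, axis=1).dropna()

# Tiny stand-in for the real 430-attribute data.
df = pd.DataFrame({"a": range(10), "b": range(10, 20)})
print(add_time_features(df).head())
```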
A nice implementation of a Maxent classifier can be found in openNLP.

Finding an optimum learning rule for an ANN

How do you find an optimum learning rule for a given problem, say a multiple category classification?
I was thinking of using Genetic Algorithms, but I know there are issues surrounding performance. I am looking for real world examples where you have not used the textbook learning rules, and how you found those learning rules.
Nice question BTW.
Classification algorithms can be characterized along several dimensions, such as:
What the algorithm strongly prefers (i.e., what type of data is most suitable for it).
Training overhead (does it take a long time to train?).
When it is effective (large, medium, or small amounts of data).
The complexity of analysis it can deliver.
Therefore, for your multi-category classification problem I would use online logistic regression (trained with SGD), because it works well with small to medium data sizes (fewer than tens of millions of training examples) and it is really fast.
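A minimal sketch of that choice (assuming scikit-learn; SGDClassifier with a logistic loss is one way to get SGD-trained logistic regression):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Toy multi-category data standing in for the real problem.
X, y = make_classification(n_samples=1000, n_informative=5, n_classes=3, random_state=0)

# Logistic regression objective optimized with SGD; older scikit-learn
# versions name this loss "log" rather than "log_loss".
clf = SGDClassifier(loss="log_loss", random_state=0).fit(X, y)
print(clf.score(X, y))
```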
Another example:
Let's say you have to classify a large amount of text data. Then Naive Bayes is your friend, because it is particularly well suited to text analysis. SVM and SGD are faster and, in my experience, easier to train, but they are better applied when the data size is small or medium rather than large.
In general, anyone doing data mining will ask themselves about the four aforementioned points before starting any ML or simple mining project.
After that, you have to measure the AUC, or any other relevant metric, to see what you have achieved, because you might use more than one classifier in a project, or sometimes a classifier you thought was perfect turns out to perform poorly under some measurement techniques. Then you go back over your questions to find where you went wrong.
Hope that helps.
When you input a vector x to the net, the net gives an output that depends on all the weights (vector w). There will be an error between the output and the true answer. The average error e is a function of w, let's say e = F(w). Suppose you have a one-layer network with two weights; then F is a surface over the two-dimensional weight plane.
When we talk about training, we are actually talking about finding the w which makes e minimal. In other words, we are searching for the minimum of a function. To train is to search.
So, your question is how to choose the search method. My suggestion: it depends on what the surface of F(w) looks like. The wavier it is, the more randomized a method you should use, because a simple method based on gradient descent has a greater chance of getting trapped in a local minimum, so you lose the chance of finding the global minimum. On the other hand, if the surface of F(w) looks like one big pit, then forget the genetic algorithm; simple backpropagation, or anything based on gradient descent, would work very well in that case.
You may ask: how can I know what the surface looks like? That comes with experience. Or you might randomly sample some values of w and calculate F(w) to get an intuitive view of the surface.
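A small sketch of that sampling idea (assuming NumPy; the tiny linear one-layer "network" is only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                                    # toy inputs
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=200)  # toy targets

def F(w):
    """Average squared error of a one-layer net with weight vector w."""
    return np.mean((X @ w - y) ** 2)

# Randomly sample weight vectors and evaluate the error surface.
samples = rng.uniform(-5, 5, size=(1000, 2))
errors = np.array([F(w) for w in samples])
best = samples[errors.argmin()]
print("lowest sampled error", errors.min(), "at w =", best)
```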

What are good algorithms for detecting abnormality?

Background
Here is the problem:
A black box outputs a new number each day.
Those numbers have been recorded for a period of time.
Detect when a new number from the black box falls outside the pattern of numbers established over the time period.
The numbers are integers, and the time period is a year.
Question
What algorithm will identify a pattern in the numbers?
The pattern might be simple, like always ascending or always descending, or the numbers might fall within a narrow range, and so forth.
Ideas
I have some ideas, but am uncertain as to the best approach, or what solutions already exist:
Machine learning algorithms?
Neural network?
Classify normal and abnormal numbers?
Statistical analysis?
Cluster your data.
If you don't know how many modes your data will have, use something like a Gaussian Mixture Model (GMM) along with a scoring function (e.g., the Bayesian Information Criterion, BIC) so you can automatically detect the likely number of clusters in your data. I recommend this instead of k-means if you have no idea what value k is likely to be. Once you've constructed a GMM for your data for the past year, given a new data point x, you can calculate the probability that it was generated by any one of the clusters (each modeled by a Gaussian in the GMM). If your new data point has low probability of being generated by every one of your clusters, it is very likely a true outlier.
If this sounds a little too involved, you will be happy to know that the entire GMM + BIC procedure for automatic cluster identification has been implemented for you in the excellent MCLUST package for R. I have used it several times to great success for such problems.
Not only will it allow you to identify outliers; it also lets you put a p-value on a point being an outlier if you need (or want) that capability at some point.
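If you prefer Python to R, here is a hedged sketch of the same GMM + BIC idea using scikit-learn instead of MCLUST (the synthetic bimodal history below is just a stand-in for a year of recorded numbers):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
history = np.concatenate([rng.normal(10, 1, 200), rng.normal(30, 2, 165)]).reshape(-1, 1)

# Fit GMMs with 1..5 components and keep the one with the lowest BIC.
gmms = [GaussianMixture(n_components=k, random_state=0).fit(history) for k in range(1, 6)]
best = min(gmms, key=lambda g: g.bic(history))

# Flag a new value whose log-likelihood is below the 1st percentile of the history.
threshold = np.quantile(best.score_samples(history), 0.01)
new_point = np.array([[55.0]])
print("outlier:", best.score_samples(new_point)[0] < threshold)
```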
You could try fitting a trend line using linear regression and see how it goes; it would be fairly easy to implement in your language of choice.
After you have fitted a line to your data, you can calculate the standard deviation of the residuals along the line.
If the new point lies within the trend line +- that standard deviation, it should not be regarded as an abnormality.
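A minimal sketch of that trend-line check (assuming NumPy; the short series of daily values is illustrative):

```python
import numpy as np

values = np.array([3, 4, 6, 7, 9, 10, 12, 13, 15, 16], dtype=float)  # one number per day
days = np.arange(len(values))

slope, intercept = np.polyfit(days, values, deg=1)          # least-squares line fit
residual_std = np.std(values - (slope * days + intercept))  # spread around the line

new_day, new_value = len(values), 30.0
predicted = slope * new_day + intercept
print("abnormal:", abs(new_value - predicted) > residual_std)  # outside the +- std band
```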
PCA is another technique that comes to mind when dealing with this type of data.
You could also look into unsupervised learning, a family of machine learning techniques that can be used to detect differences in larger data sets.
Sounds like a fun problem! Good luck.
There is little magic in all the techniques you mention. I believe you should first try to narrow down the typical abnormalities you may encounter; it helps keep things simple.
Then, you may want to compute derived quantities relevant to those features. For instance: "I want to detect numbers changing direction abruptly" => compute u_{n+1} - u_n, and expect it to have constant sign, or to fall in some range. You may want to keep this flexible and allow your code design to be extensible (the Strategy pattern may be worth looking at if you do OOP).
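A small sketch of that derived-quantity check (assuming NumPy; the constant-sign and range tests are just the two expectations mentioned above):

```python
import numpy as np

history = np.array([3, 5, 6, 9, 11, 14])  # recorded numbers so far
steps = np.diff(history)                  # derived quantity u_{n+1} - u_n

new_value = 8
new_step = new_value - history[-1]

same_sign_so_far = np.all(np.sign(steps) == np.sign(steps[0]))
breaks_sign = same_sign_so_far and np.sign(new_step) != np.sign(steps[0])
out_of_range = not (steps.min() <= new_step <= steps.max())
print("suspicious:", breaks_sign or out_of_range)
```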
Then, when you have some derived quantities of interest, you do statistical analysis on them. For instance, for a derived quantity A, you assume it follows some distribution P(a, b) (Uniform([a, b]), or Beta(a, b), possibly something more complex), you put prior laws on a and b, and you adjust them based on successive observations. Then the posterior likelihood of the information provided by the last point added should give you some insight about whether it is normal or not. The relative entropy between the posterior and prior laws at each step is also a good thing to monitor. Consult a book on Bayesian methods for more information.
I see little point in complex traditional machine learning machinery (multilayer perceptrons or SVMs, to name only two) if you want to detect outliers. These methods work great for classifying data which is known to be reasonably clean.

Is Latent Semantic Indexing (LSI) a Statistical Classification algorithm?

Is Latent Semantic Indexing (LSI) a Statistical Classification algorithm? Why or why not?
Basically, I'm trying to figure out why the Wikipedia page for Statistical Classification does not mention LSI. I'm just getting into this stuff and I'm trying to see how all the different approaches for classifying something relate to one another.
No, they're not quite the same. Statistical classification is intended to separate items into categories as cleanly as possible -- to make a clean decision about whether item X is more like the items in group A or group B, for example.
LSI is intended to show the degree to which items are similar or different and, primarily, to find items that show a degree of similarity to a specified item. While this is similar, it's not quite the same.
LSI/LSA is essentially a technique for dimensionality reduction, and it is usually coupled with a nearest neighbor algorithm to turn it into a classification system. Hence, in itself, it is only a way of "indexing" the data in a lower dimension using SVD.
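A minimal sketch of that pairing (assuming scikit-learn; the tiny corpus is made up): TruncatedSVD on TF-IDF vectors is the LSI step, and the nearest-neighbor classifier on top is what turns it into a classification system.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

docs = ["cats purr softly", "dogs bark loudly", "kittens purr", "puppies bark"]
labels = ["cat", "dog", "cat", "dog"]

lsi_knn = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2),       # the LSI step: low-rank "topic" space via SVD
    KNeighborsClassifier(n_neighbors=1),
)
lsi_knn.fit(docs, labels)
print(lsi_knn.predict(["kittens purr softly"]))
```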
Have you read about LSI on Wikipedia? It says it uses matrix factorization (SVD), which in turn is sometimes used in classification.
The primary distinction in machine learning is between "supervised" and "unsupervised" modeling.
Usually the words "statistical classification" refer to supervised models, but not always.
With supervised methods the training set contains a "ground-truth" label that you build a model to predict. When you evaluate the model, the goal is to predict the best guess at (or probability distribution of) the true label, which you will not have at time of evaluation. Often there's a performance metric and it's quite clear what the right vs wrong answer is.
Unsupervised classification methods attempt to cluster a large number of data points which may appear to vary in complicated ways into a smaller number of "similar" categories. Data in each category ought to be similar in some kind of 'interesting' or 'deep' way. Since there is no "ground truth" you can't evaluate 'right or wrong', but 'more' vs 'less' interesting or useful.
Similarly, at evaluation time you can place new examples into one of the clusters (crisp classification), or give some kind of weighting quantifying how similar or different the example looks from the "archetype" of each cluster.
So in some ways both supervised and unsupervised models can yield a "prediction" (of a class or cluster label), but they are intrinsically different.
Often the goal of an unsupervised model is to provide more intelligent and powerfully compact inputs for a subsequent supervised model.
