Use clustering for prediction in Weka

Can I use clustering (e.g. using k-means) to make predictions in Weka?
I have some data from research on presidential elections: answers from questionnaires (numeric attributes), plus one attribute that is the answer to the question "Who are you going to vote for?" (1, 2 or 3).
I already make predictions using some classifiers (e.g. Bayes) in Weka. My results are based on that answer (vote intention), and I get about 60% recall (rate of correct predictions).
I understand that clustering is a different thing, but can I use clustering to make predictions? I have already tried, but I realized that clustering always selects its own centroids and does not use my vote-intention attribute.

The asker of Explain results of K-means must be a colleague of yours. He seems to use the same data set, and it would be helpful if we could all have a look at the data.
In general, clustering is not classification or prediction.
However, you can try to improve your classification by using the information gained from clustering. Two such techniques:
substitute your data set with the cluster centers, and use those for classification (at least if your clusters are reasonably pure with respect to the class label!)
train a separate classifier on each cluster, and build an ensemble out of them (in particular if your clusters are inhomogeneous); see the sketch after this answer
But I believe your understanding of classification and clustering is not yet deep enough to try these out. You need to handle them carefully, and know your data very well.
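As a rough illustration of the second technique, here is a minimal sketch against the Weka Java API: cluster the training data with SimpleKMeans, then train one Naive Bayes model per cluster. The file name, the number of clusters, and the choice of base classifier are placeholder assumptions, not anything prescribed in this answer.

    import weka.classifiers.Classifier;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.clusterers.SimpleKMeans;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;

    public class PerClusterEnsemble {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("votes.arff"); // hypothetical file
            data.setClassIndex(data.numAttributes() - 1);

            // Weka clusterers refuse data with a class attribute set,
            // so strip the class for the clustering step only.
            Remove rm = new Remove();
            rm.setAttributeIndices("" + (data.classIndex() + 1));
            rm.setInputFormat(data);
            Instances noClass = Filter.useFilter(data, rm);

            int k = 3; // placeholder
            SimpleKMeans km = new SimpleKMeans();
            km.setNumClusters(k);
            km.buildClusterer(noClass);

            // Partition the labelled data by cluster membership.
            Instances[] parts = new Instances[k];
            for (int c = 0; c < k; c++) parts[c] = new Instances(data, 0);
            for (int i = 0; i < data.numInstances(); i++) {
                parts[km.clusterInstance(noClass.instance(i))].add(data.instance(i));
            }

            // One classifier per cluster; at prediction time, cluster the
            // incoming instance first and route it to the matching model.
            Classifier[] models = new Classifier[k];
            for (int c = 0; c < k; c++) {
                models[c] = new NaiveBayes();
                models[c].buildClassifier(parts[c]);
            }
        }
    }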

Yes. You can use the Weka interface to do prediction via clustering. First, load your training data using the Preprocess tab. Then go to the Classify tab; under Classifier, click Choose and, under meta, pick ClassificationViaClustering. The default clustering algorithm used by Weka is SimpleKMeans, but you can change that by clicking on the options string (i.e. the text next to the Choose button); Weka will display a dialog in which you can click Choose and pick from a list of clustering algorithms (e.g. EM). After that, you can run cross-validation or supply test data by clicking Set, just as you normally do when using Weka for classification.
Hope this will help anyone having the same question!
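If you prefer to script this rather than click through the GUI, here is a minimal sketch of the same workflow via the Weka Java API. Caveats: the ARFF file name is a placeholder, and in recent Weka versions ClassificationViaClustering ships as a separately installable package rather than in the core distribution.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.meta.ClassificationViaClustering;
    import weka.clusterers.SimpleKMeans;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ClusteringAsClassifier {
        public static void main(String[] args) throws Exception {
            // Questionnaire data; the last attribute is the vote intention (1, 2 or 3).
            Instances data = DataSource.read("election.arff"); // placeholder name
            data.setClassIndex(data.numAttributes() - 1);

            ClassificationViaClustering cvc = new ClassificationViaClustering();
            SimpleKMeans km = new SimpleKMeans();
            km.setNumClusters(3); // one cluster per candidate
            cvc.setClusterer(km);

            // 10-fold cross-validation, as in the Classify tab.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(cvc, data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }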

Related

some confusions in machine learning

I have two points of confusion about using machine learning algorithms. First, I should say that I am merely a user of these tools, not an expert.
1. There are two categories, A and B. If I want to pick out as many A as possible from a mixture of the two (no need to consider the number of samples), what kind of algorithm should I use? At first I thought it should be a classification algorithm, and I used, for example, the boosted decision tree (BDT) in the TMVA package, but someone told me that BDT is actually a regression algorithm.
2. I find that when I have raw data, if I analyze it (do some combinations, ...) before feeding it to the BDT, the result is better than feeding the raw data in directly. Since the raw data contains all the information, why do I need to analyze it myself?
If anything is unclear, please add a comment. I would appreciate any advice.
For 2, you have to perform some manipulation on the data before feeding it in, because analysis is not built into the algorithm: it only looks at the data and classifies it. The problem of "analysis", as you put it, is called feature selection or feature engineering, and it has to be done by hand (unless, of course, you are using a technique that learns features, e.g. deep learning). In machine learning it has been observed many times that manipulated/engineered features perform better than raw features (a small sketch follows this answer).
For 1, BDT can be used for regression as well as classification. This looks like a classification problem (to choose or not to choose), so you should use a classification algorithm.
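To make the feature-engineering point in 2 concrete, here is a hedged sketch using Weka's AddExpression filter (rather than TMVA) to derive a new attribute from two raw ones; the file name and the particular expression are made-up examples.

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.AddExpression;

    public class EngineerFeature {
        public static void main(String[] args) throws Exception {
            Instances raw = DataSource.read("raw.arff"); // placeholder file

            // Derive a new attribute as the ratio of attributes 1 and 2;
            // such hand-built combinations are what "analysis" buys you.
            AddExpression ratio = new AddExpression();
            ratio.setExpression("a1/a2");
            ratio.setName("ratio_1_2");
            ratio.setInputFormat(raw);
            Instances engineered = Filter.useFilter(raw, ratio);

            // Train the BDT (or any classifier) on 'engineered' instead of 'raw'.
            System.out.println(engineered.numAttributes() + " attributes after engineering");
        }
    }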
Are you sure ML is the right approach for your problem? If it is, some classification algorithms to consider would be logistic regression, neural networks, support vector machines, and decision trees, just to name a few.

Contextual Search: Classifying shopping products

I have got a new (non-traditional) task from my client; it is about machine learning.
As I have never done "machine learning" beyond some little data mining work, I need your help.
My task is to classify a product found on any shopping site on the basis of gender (whom the product is for), age group, etc. The training data we have is the product's title, its keywords (available in the HTML of the product page), and the product description.
I did a lot of R&D. I found image recognition APIs (cloudsight, vufind) that return details of the product image, but that did not fulfil the need; I used Google suggestqueries, looked into many machine learning algorithms, and finally...
I came across the decision tree learning algorithm, but I cannot figure out how it applies to my problem.
I tried the "PlayingTennis" dataset but couldn't make sense of what to do next.
Can you give me some direction on where to start this journey? Should I focus on decision tree learning, or is there another algorithm you would suggest for categorizing products by context?
If it helps, I can share in detail what I have already tried in order to solve my problem.
I would suggest doing the following:
1. Go through the items in your dataset and classify them manually (decide for which gender each item is). Store each decision so that you can link every item in the original dataset to its target class.
2. Develop an algorithm for converting each item from your dataset into a feature vector, i.e. a vector of numbers (more about how to do this later).
3. Convert your whole dataset, with the appropriate classes, into a dataset that looks like this:
Feature_1, Feature_2, Feature_3, ..., Gender
value_1, value_2, value_3, ..., male
It would be a good idea to store it in a CSV file, since then you can load and process it in different machine learning tools (more about those later).
4. Load the dataset you created at step 3 into the machine learning tool of your choice and try to come up with the best model for classifying items in your dataset by gender.
5. Store the model created at step 4. It will be part of your production system.
6. Develop production code that takes an unclassified product, creates a feature vector out of it, and passes this feature vector to the model you saved at step 5. The result of this operation is the predicted gender (see the sketch after this list).
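A hedged sketch of what steps 5 and 6 can look like with Weka (paths, attribute layout, and values are placeholders, not part of the recipe above):

    import weka.classifiers.Classifier;
    import weka.core.DenseInstance;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.SerializationHelper;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ProductionScoring {
        public static void main(String[] args) throws Exception {
            // Step 5 (done once, after training):
            // SerializationHelper.write("gender.model", trainedClassifier);

            // Step 6: load the stored model plus the training header, so the
            // new feature vector has exactly the same structure.
            Classifier model = (Classifier) SerializationHelper.read("gender.model");
            Instances header = DataSource.read("training.arff"); // placeholder
            header.setClassIndex(header.numAttributes() - 1);

            Instance x = new DenseInstance(header.numAttributes());
            x.setDataset(header);
            x.setValue(0, 1.0); // hypothetical feature values for the new product
            double prediction = model.classifyInstance(x);
            System.out.println(header.classAttribute().value((int) prediction));
        }
    }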
Details
If there are too many items (say, tens of thousands) in your original dataset, it may be impractical to classify them all yourself. What you can do is use Amazon Mechanical Turk to simplify the task. If you are unable to use it (the last time I checked, you had to have a US address), you can classify just a few hundred items to start working on your model and classify the rest later to improve the accuracy of your classification (the more training data you use the better the accuracy, but only up to a certain point).
How to extract features from a dataset
If a keyword has the form tag=true/false, it is a boolean feature.
If a keyword has the form tag=42, it is a numerical or ordinal feature. For example, it can be a price value or a price range (0-10, 10-50, 50-100, etc.).
If a keyword has the form tag=string_value, you can convert it into a categorical feature.
The class (gender) is simply a boolean value, 0/1.
You can experiment a bit with how you extract your features, since it may influence the resulting accuracy.
How to extract features from product description
There are different ways to convert a text into a feature vector. Look up TF-IDF weighting or something similar; a sketch follows.
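For example, in Weka the StringToWordVector filter can produce TF-IDF weighted vectors from a string attribute; a minimal sketch (the dataset name is a placeholder):

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    public class TextToTfIdf {
        public static void main(String[] args) throws Exception {
            // One string attribute (the description) plus the class attribute.
            Instances text = DataSource.read("descriptions.arff"); // placeholder
            text.setClassIndex(text.numAttributes() - 1);

            StringToWordVector tfidf = new StringToWordVector();
            tfidf.setTFTransform(true);   // log(1 + term frequency)
            tfidf.setIDFTransform(true);  // weight by inverse document frequency
            tfidf.setLowerCaseTokens(true);
            tfidf.setInputFormat(text);
            Instances vectors = Filter.useFilter(text, tfidf);
            System.out.println(vectors.numAttributes() + " features extracted");
        }
    }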
Machine learning tools
You can use one of the existing machine learning libraries and hack up some code that loads your CSV dataset, trains a model and checks the accuracy, but at first I would suggest using something like Weka. It has a more or less intuitive UI, and you can quickly start experimenting with different machine learning algorithms, converting features in your dataset from strings to categories, from real values to ordinal values, etc. A good thing about Weka is that it also has a Java API, so you can automate the whole process of data conversion, train models programmatically, and so on.
What algorithms to choose
I would suggest decision tree algorithms like C4.5. They are fast and show good results on a wide range of machine learning tasks. Additionally, you can use an ensemble of classifiers: there are various algorithms that combine several models (google for boosting or random forest to find out more). They usually give better results but run more slowly, since you need to pass a single feature vector through several models.
Another trick you can use to make your system more accurate is to train models that work on different sets of features (say, one model uses features extracted from the tags and another uses features extracted from the product description). You can then combine them using an algorithm like stacking to come up with the final result, as in the sketch below.
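A minimal sketch of that combination in Weka, assuming J48 and Naive Bayes as the two base models and logistic regression as the meta-learner (an arbitrary but common choice; the file name is a placeholder):

    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.classifiers.functions.Logistic;
    import weka.classifiers.meta.Stacking;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class StackedModels {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("products.arff"); // placeholder
            data.setClassIndex(data.numAttributes() - 1);

            Stacking stack = new Stacking();
            stack.setClassifiers(new Classifier[] { new J48(), new NaiveBayes() });
            stack.setMetaClassifier(new Logistic());

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(stack, data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }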
For classification based on features extracted from text, you can try the Naive Bayes algorithm or an SVM. They both show good results in text classification.
Do consider the support vector classifier (SVC) or, for Google's sake, the support vector machine (SVM). If you have a large training set (which I suspect you do), search for implementations described as "fast" or "scalable".

How to judge performance of algorithms for Text Clustering?

I am using the K-Means algorithm for text clustering, with initial seeding by K-Means++.
I try to make the algorithm more effective with some changes, such as changing the stop-word dictionary and increasing max_no_of_random_iterations.
I get different results. How do I compare them? I could not apply the idea of a confusion matrix here: the output is not some document getting a value or a tag; a document simply goes into a set. What matters is the relative "goodness" of the clustering, i.e. the resulting sets.
So is there some standard way of scoring the performance on such an output?
If a confusion matrix is the answer, please explain how to build it.
Thanks.
You could decide in advance how to measure the quality of the clusters, for example by counting how many are empty, or with a statistic like the within-cluster sum of squares (WSS); a sketch of comparing runs by WSS follows.
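For instance, Weka's SimpleKMeans exposes the within-cluster sum of squared errors directly, so you can compare runs (or values of k) on the same vectorized documents; a sketch, with the file name as a placeholder:

    import weka.clusterers.SimpleKMeans;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CompareByWss {
        public static void main(String[] args) throws Exception {
            // Pre-vectorized documents (e.g. TF-IDF); no class attribute set.
            Instances docs = DataSource.read("doc_vectors.arff"); // placeholder

            for (int k = 2; k <= 10; k++) {
                SimpleKMeans km = new SimpleKMeans();
                km.setNumClusters(k);
                km.setSeed(42); // fix the seed so runs are comparable
                km.buildClusterer(docs);
                System.out.println("k=" + k + "  WSS=" + km.getSquaredError());
            }
        }
    }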
This paper says
"... three distinctive approaches to cluster validity are possible.
The first approach relies on external criteria that investigate the
existence of some predefined structure in clustered data set. The
second approach makes use of internal criteria and the clustering
results are evaluated by quantities describing the data set such as
proximity matrix etc. Approaches based on internal and external
criteria make use of statistical tests and their disadvantage is
high computational cost. The third approach makes use of relative
criteria and relies on finding the best clustering scheme that meets
certain assumptions and requires predefined input parameters values"
Since clustering is unsupervised, you are asking for something difficult. I suggest researching how people cluster using genetic algorithms and seeing what fitness criteria they use.

Which classification algorithm can be used for document categorization?

Hey, here is my problem:
Given a set of documents, I need to assign each document to a predefined category.
I was going to use the n-gram approach to represent the text content of each document and then train an SVM classifier on the training data that I have.
Correct me if I have misunderstood something, please.
The problem is that the categories should be dynamic, meaning my classifier should handle new training data with a new category.
For example, if I trained a classifier to classify a given document as category A, category B or category C, and I was then given new training data with category D, I should be able to incrementally train my classifier on the new training data for category D.
To summarize, I do NOT want to combine the old training data (with 3 categories) and the new training data (with the new/unseen category) and retrain my classifier from scratch. I want to train my classifier on the fly.
Is this possible to implement with an SVM? If not, could you recommend some classification algorithms, or any book/paper that could help me?
Thanks in Advance.
Naive Bayes is a relatively fast incremental classification algorithm.
KNN is also incremental by nature, and even simpler to implement and understand.
Both algorithms are implemented in the open-source project Weka, as NaiveBayesUpdateable and, for KNN, IBk.
However, from personal experience, both are vulnerable to a large number of non-informative features (which is usually the case in text classification), so some kind of feature selection is usually used to squeeze better performance out of these algorithms; that step could be problematic to implement incrementally. A sketch of incremental training follows.
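A minimal sketch of incremental training with Weka's NaiveBayesUpdateable (file names are placeholders). One caveat worth knowing: Weka fixes the set of class values in the dataset header, so a genuinely unseen category D still requires a header that declares it.

    import weka.classifiers.bayes.NaiveBayesUpdateable;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class IncrementalTraining {
        public static void main(String[] args) throws Exception {
            // Initial batch; the header must already declare all class values.
            Instances initial = DataSource.read("initial.arff"); // placeholder
            initial.setClassIndex(initial.numAttributes() - 1);

            NaiveBayesUpdateable nb = new NaiveBayesUpdateable();
            nb.buildClassifier(initial);

            // Later: feed new labelled documents one at a time, no retraining.
            Instances more = DataSource.read("new_examples.arff"); // placeholder
            more.setClassIndex(more.numAttributes() - 1);
            for (int i = 0; i < more.numInstances(); i++) {
                nb.updateClassifier(more.instance(i));
            }
        }
    }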
This blog post by Edwin Chen describes infinite mixture models for clustering. I think this method supports automatically determining the number of clusters, but I am still trying to wrap my head all the way around it.
The class of algorithms that matches your criteria is called "incremental algorithms". There are incremental versions of almost every method. The easiest to implement is naive Bayes.

Appropriate clustering method for 1 or 2 dimensional data

I have a set of data I have generated that consists of extracted mass values (well, m/z, but that is not so important) and a time. I extract the data from a file; however, it is possible to get repeat measurements, and this results in a large amount of redundancy within the dataset. I am looking for a method to cluster these, in order to group those that are related based on either similarity in mass alone, or similarity in mass and time.
An example of data that should be grouped together:
m/z time
337.65 1524.6
337.65 1524.6
337.65 1604.3
However, I have no way to determine how many clusters there will be. Does anyone know of an efficient way to accomplish this, possibly using a simple distance metric? Sadly, I am not familiar with clustering algorithms.
http://en.wikipedia.org/wiki/Cluster_analysis
http://en.wikipedia.org/wiki/DBSCAN
Read the section about hierarchical clustering, and also look into DBSCAN if you really don't want to specify the number of clusters in advance. You will need to define a distance metric, and that is the step where you determine which feature, or combination of features, you cluster on.
Why don't you just set a threshold?
If successive values (by time) do not differ by more than ±0.1 (in m/z), they are grouped together. Alternatively, use a relative threshold: differ by less than ±0.1%. Set these thresholds according to your domain knowledge.
That sounds like the straightforward way of preprocessing this data to me.
Using a "clustering" algorithm here seems like total overkill to me. Clustering algorithms will try to discover much more complex structures than what you are trying to find here; the result will likely be surprising and hard to control. The straightforward change-threshold approach (which I would not call clustering!) is very simple to explain, understand and control; a sketch follows.
For one-dimensional data, simple K-means clustering (http://en.wikipedia.org/wiki/K-means_clustering#Standard_algorithm) is appropriate and can be used directly. The only issue is selecting an appropriate K. The best way to select a good K is either to plot K against the residual variance and pick the K that "dramatically" reduces the variance, or to use an information criterion (e.g. the Bayesian information criterion).
You can extend K-means to multi-dimensional data easily, but you should beware of the scaling of the individual dimensions. E.g., among the items (1 kg, 1 km) and (2 kg, 2 km), the nearest point to (1.7 kg, 1.4 km) is (2 kg, 2 km) at these scales; but once you start expressing the second coordinate in metres, the opposite is probably true. A sketch of standardizing before clustering follows.
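If you do go the k-means route in Weka, the scaling caveat can be handled by standardizing each attribute (zero mean, unit variance) before clustering; a sketch, with placeholder file name and k:

    import weka.clusterers.SimpleKMeans;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Standardize;

    public class ScaledKMeans {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("mz_time.arff"); // placeholder

            // Standardize m/z and time so neither dominates the distance.
            Standardize std = new Standardize();
            std.setInputFormat(data);
            Instances scaled = Filter.useFilter(data, std);

            SimpleKMeans km = new SimpleKMeans();
            km.setNumClusters(5); // placeholder k; pick via the variance plot above
            km.buildClusterer(scaled);
            System.out.println(km);
        }
    }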
