Find most accurate classifier for my data automatically - windows

I am using Weka, and I'm trying to find the most accurate classifier for my dataset.
The interface for selecting a classifier looks like the following:
It works fine, but it only lets me select one classifier at a time, which is not very practical.
Can I somehow make it run all the available classifiers on my data, so I can easily find the most accurate one?

You need to use the Weka Experimenter. In my example I chose one data set and two different classification algorithms.
See the following tutorial for more information:
WEKA Experimenter Tutorial.
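To illustrate the idea behind the Experimenter (run several classifiers on the same folds and rank them by accuracy), here is a minimal sketch in plain Python. It is not Weka code: the two toy classifiers, the fold scheme, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def majority_classifier(train):
    """Baseline: always predict the most common training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label


def nearest_neighbor_classifier(train):
    """1-nearest-neighbour using squared Euclidean distance."""
    def predict(x):
        _, y = min(train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)))
        return y
    return predict


def cross_val_accuracy(make_classifier, data, folds=5):
    """Accuracy over all folds; fold i holds out every folds-th row starting at i."""
    correct = 0
    for i in range(folds):
        test = data[i::folds]
        train = [row for j, row in enumerate(data) if j % folds != i]
        predict = make_classifier(train)
        correct += sum(predict(x) == y for x, y in test)
    return correct / len(data)


def rank_classifiers(candidates, data, folds=5):
    """Return (name, accuracy) pairs sorted from most to least accurate."""
    scores = [(name, cross_val_accuracy(make, data, folds)) for name, make in candidates.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy data (assumption): two well-separated classes on one numeric feature.
data = [((float(i),), "a") for i in range(10)] + [((100.0 + i,), "b") for i in range(10)]
candidates = {"majority": majority_classifier, "1-NN": nearest_neighbor_classifier}
ranking = rank_classifiers(candidates, data)
```

The Experimenter does the same kind of comparison for real Weka classifiers and adds significance testing on top.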

Related

WEKA PartitionMembership filter

I have a question regarding the supervised PartitionMembership filter in WEKA.
When applying this filter using J48 as partition generator, I am able to achieve a much higher accuracy in combination with the KStar classifier.
What exactly does this filter do? The documentation provided by WEKA is quite limited. And is it valid to use this filter to get an increased accuracy?
When I apply this filter to my training set, it generates a number of classes. When I try to reapply the model to my test set, the filter generates a different number of classes. Hence, I am not able to use this trained supervised PartitionMembership filter for my test set. How can I use the PartitionMembership filter that was trained on the training set for the test set as well?
You are asking two or three questions here. Regarding the first two (what does the PartitionMembership filter do, and is it valid to use it?), I don't know how to answer properly. Ultimately you can read the source code to find out.
For the last question (how do I get it to evaluate my test set?), please use the FilteredClassifier, and choose your filter and your classifier in the dialog box of that classifier.
NAME weka.classifiers.meta.FilteredClassifier
SYNOPSIS Class for running an arbitrary classifier on data that has
been passed through an arbitrary filter. Like the classifier, the
structure of the filter is based exclusively on the training data and
test instances will be processed by the filter without changing their
structure.
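The behaviour described in that synopsis can be sketched in plain Python. This is only an illustration of the principle, not Weka's implementation: `OneHotFilter`, `nearest_neighbor`, and `ToyFilteredClassifier` are hypothetical names, and the filter here is a simple one-hot encoding rather than PartitionMembership.

```python
class OneHotFilter:
    """Toy filter whose output columns are fixed by the training data only."""

    def fit(self, rows):
        # Learn the column layout from the training rows.
        self.values = sorted({value for row in rows for value in row})
        return self

    def transform(self, row):
        # Values never seen in training are ignored, so the structure stays fixed.
        return [1 if value in row else 0 for value in self.values]


def nearest_neighbor(vectors, labels):
    """Return a predict function using Hamming distance to the training vectors."""
    def predict(vector):
        distances = [sum(a != b for a, b in zip(vector, v)) for v in vectors]
        return labels[distances.index(min(distances))]
    return predict


class ToyFilteredClassifier:
    def __init__(self, data_filter, make_classifier):
        self.data_filter = data_filter
        self.make_classifier = make_classifier

    def fit(self, rows, labels):
        self.data_filter.fit(rows)  # the filter sees training data only
        vectors = [self.data_filter.transform(row) for row in rows]
        self.predict_vector = self.make_classifier(vectors, labels)
        return self

    def predict(self, row):
        # Test rows pass through the already-fitted filter unchanged in structure.
        return self.predict_vector(self.data_filter.transform(row))


model = ToyFilteredClassifier(OneHotFilter(), nearest_neighbor)
model.fit([{"red", "small"}, {"blue", "large"}], ["x", "y"])
prediction = model.predict({"red", "tiny"})  # "tiny" was never seen in training
```

Because the filter is fitted once and only on the training rows, the test row gets the same four columns as the training rows even though it contains an unseen value.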

Contextual Search: Classifying shopping products

I have a new (non-traditional) task from my client that involves machine learning.
I have never worked with machine learning apart from a little data mining, so I need your help.
My task is to classify a product on any shopping site by gender (whom the product is for), age group, etc. The training data we can have is the product's title, keywords (available in the HTML of the product page), and product description.
I did a lot of R&D. I found image recognition APIs (cloudsight, vufind) that returned details of the product image, but that did not fulfil the need. I also used Google suggest queries and looked into many machine learning algorithms, and finally...
I came to know about the "Decision Tree Learning Algorithm" but cannot figure out how it applies to my problem.
I tried out the "PlayingTennis" dataset but couldn't make sense of what to do.
Can you give me some direction on where to start? Should I focus on the decision tree learning algorithm, or is there another algorithm you would suggest for categorizing products on the basis of context?
If you like, I can share in detail what I have searched for so far to solve my problem.
I would suggest doing the following:
1. Go through the items in your dataset and classify them manually (decide which gender each item is for). Store each decision so that you can link each item in the original dataset to its target class.
2. Develop an algorithm for converting each item in your dataset into a feature vector, that is, a vector of numbers (more about how to do this later).
3. Convert your whole dataset, with the appropriate classes, into a dataset that looks like this:
Feature_1, Feature_2, Feature_3, ..., Gender
value_1, value_2, value_3, ..., male
It is a good idea to store it in a CSV file, since you will then be able to load and process it in different machine learning tools (more about those later).
4. Load the dataset you created in step 3 into the machine learning tool of your choice and try to come up with the best model that can classify the items in your dataset by gender.
5. Store the model created in step 4. It will be part of your production system.
6. Develop production code that takes an unclassified product, creates a feature vector from it, and passes this feature vector to the model you saved in step 5. The result of this operation should be a predicted gender.
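The CSV layout described above could be produced with a few lines of Python. This is only a sketch: the function name `write_dataset` and the example feature values are made up for illustration.

```python
import csv
import io


def write_dataset(rows, out):
    """Write (feature_vector, gender) pairs as CSV with a header row.

    rows: list of (list of numbers, label) pairs, all vectors the same length.
    out:  any writable text file object.
    """
    n_features = len(rows[0][0])
    writer = csv.writer(out)
    writer.writerow([f"Feature_{i + 1}" for i in range(n_features)] + ["Gender"])
    for features, gender in rows:
        writer.writerow(list(features) + [gender])


# Example with made-up values; an in-memory buffer stands in for a real file.
buffer = io.StringIO()
write_dataset([([1, 0, 19.99], "male"), ([0, 1, 5.0], "female")], buffer)
csv_text = buffer.getvalue()
```

To write a real file instead, pass a file opened with `open(path, "w", newline="")`.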
Details
If there are too many items (say tens of thousands) in your original dataset, it may be impractical to classify them yourself. What you can do is use Amazon Mechanical Turk to simplify your task. If you are unable to use it (the last time I checked, you had to have a USA address to use it), you can classify just a few hundred items to start working on your model and classify the rest later to improve the accuracy of your classification (the more training data you use the better the accuracy, but only up to a certain point).
How to extract features from a dataset
If a keyword has a form like tag=true/false, it's a boolean feature.
If a keyword has a form like tag=42, it's a numerical or ordinal one. For example, it can be a price value or a price range (0-10, 10-50, 50-100, etc.).
If a keyword has a form like tag=string_value, you can convert it into a categorical value.
A class (gender) is simply a boolean value 0/1.
You can experiment a bit with how you extract your features, since it may influence the resulting accuracy.
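The keyword rules above could be sketched like this. The helper names, the `schema` format, the encoding of unknown categories as -1, and the default of 0.0 for missing tags are all assumptions for illustration, not a fixed recipe.

```python
def parse_keyword(keyword):
    """Turn 'tag=value' into (tag, value) where value is a bool, float, or str."""
    tag, _, raw = keyword.partition("=")
    tag, raw = tag.strip(), raw.strip()
    if raw.lower() in ("true", "false"):
        return tag, raw.lower() == "true"
    try:
        return tag, float(raw)
    except ValueError:
        return tag, raw


def to_feature_vector(keywords, schema):
    """Build a numeric vector from keywords.

    schema: ordered list of (tag, categories); categories is a list of known
    string values for categorical tags, or None for boolean/numeric tags.
    """
    parsed = dict(parse_keyword(k) for k in keywords)
    vector = []
    for tag, categories in schema:
        value = parsed.get(tag)
        if categories is not None:
            # Categorical: index in the known list, -1 if unknown or missing.
            vector.append(categories.index(value) if value in categories else -1)
        elif value is None:
            vector.append(0.0)  # assumption: a missing tag counts as 0
        else:
            vector.append(float(value))  # True/False become 1.0/0.0
    return vector


schema = [("waterproof", None), ("price", None), ("color", ["blue", "red"])]
vector = to_feature_vector(["waterproof=true", "price=42", "color=red"], schema)
```

Many learners handle a one-hot encoding of categorical values better than a single integer index, so that is one of the things worth experimenting with.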
How to extract features from product description
There are different ways to convert a text into a feature vector. Look for TF-IDF algorithms or something similar.
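As a rough sketch of what TF-IDF does, assuming whitespace tokenization and the plain `log(N / df)` form of inverse document frequency (real libraries usually add smoothing and normalization):

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return (vocabulary, vectors) with one TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(token for doc in tokenized for token in set(doc))
    idf = {token: math.log(n_docs / doc_freq[token]) for token in vocabulary}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        length = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append([counts[token] / length * idf[token] for token in vocabulary])
    return vocabulary, vectors


# Made-up product descriptions.
vocabulary, vectors = tf_idf(["red dress for women", "blue shirt for men"])
```

Words that occur in every description (here "for") get weight zero, while words specific to one description get a positive weight.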
Machine learning tools
You can use one of the existing machine learning libraries and hack together some code that loads your CSV dataset, trains a model, and checks the accuracy, but at first I would suggest using something like Weka. It has a more or less intuitive UI, and you can quickly start to experiment with different machine learning algorithms, convert features in your dataset from strings to categories or from real values to ordinal values, etc. A good thing about Weka is that it has a Java API, so you can automate the whole process of data conversion, train models programmatically, etc.
What algorithms to choose
I would suggest using decision tree algorithms like C4.5. They are fast and show good results on a wide range of machine learning tasks. Additionally, you can use an ensemble of classifiers. There are various techniques that combine several models (google for boosting or random forest to find out more). They usually give better results but work more slowly, since you need to run each feature vector through several models.
Another trick that you can use to make your algorithm more accurate is to use models that work on different sets of features (say one model uses features extracted from tags and another uses features extracted from the product description). You can then combine them using algorithms like stacking to come up with a final result.
For classification on the basis of features extracted from text, you can try the Naive Bayes algorithm or an SVM. They both show good results in text classification.
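To give a feeling for how Naive Bayes works on text, here is a minimal multinomial Naive Bayes sketch with add-one smoothing. The training texts and labels are made up, and in practice you would use a library implementation (for example the one in Weka) rather than this.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(texts, labels):
    """Count words per class and documents per class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for text, label in zip(texts, labels):
        word_counts[label].update(text.lower().split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, class_counts, vocabulary


def predict_naive_bayes(model, text):
    """Return the class with the highest log-probability for the text."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # class prior
        for word in text.lower().split():
            if word in vocabulary:  # words never seen in training are skipped
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log((word_counts[label][word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


texts = ["floral summer dress", "leather handbag women", "mens leather belt", "mens running shorts"]
labels = ["female", "female", "male", "male"]
model = train_naive_bayes(texts, labels)
```

With this toy model, `predict_naive_bayes(model, "summer dress sale")` picks the class whose training texts contain "summer" and "dress".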
Do consider a Support Vector Classifier (SVC), or, for Google's sake, the Support Vector Machine (SVM). If you have a large training set (which I suspect), search for implementations that are "fast" or "scalable".

How to compare two images and check whether both contain the same object in OpenCV Python or JavaCV

I am working on a feature matching project and I am using OpenCV Python as the tool for developing the application.
According to the project requirements, my database has images of some objects such as a glass, a ball, etc., together with their descriptions. A user can send an image to the back end of the application, and the back end is responsible for matching the sent image against the images that exist in the database and sending the image description back to the user.
I have done some research on the above scenario. Unfortunately, I still could not find an algorithm for matching two images and deciding whether they match or not.
If anybody knows such an algorithm, please share it. (I have to use OpenCV Python or JavaCV.)
Thank you
This is a very common problem in computer vision nowadays. A basic solution is straightforward, but there are many, many variants for more sophisticated solutions.
Simple Solution
Feature Detector and Descriptor based.
The idea here is that you get a bunch of keypoints and their descriptors (search for SIFT/SURF/ORB). You can then find matches easily with tools provided in OpenCV. You would match the keypoints in your query image against all keypoints in the training dataset. Because outliers are typical, you should add a robust matching technique such as RANSAC. All of this is part of OpenCV.
Bag-of-Words model
If you just want the image that is most similar to your query image, you can use nearest-neighbour search. Be aware that OpenCV comes with the much faster approximate nearest-neighbour (ANN) algorithm. Or you can use the BruteForceMatcher.
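To show what brute-force matching of descriptors amounts to, here is a small plain-Python sketch. It is not the OpenCV API: the descriptors are short made-up tuples instead of real SIFT/ORB vectors, and the distance threshold is an arbitrary assumption.

```python
import math


def brute_force_match(query, train, max_distance):
    """For each query descriptor, find the nearest train descriptor.

    Returns (query_index, train_index, distance) triples for pairs whose
    Euclidean distance is at most max_distance.
    """
    matches = []
    for qi, q in enumerate(query):
        ti, dist = min(
            ((i, math.dist(q, t)) for i, t in enumerate(train)),
            key=lambda pair: pair[1],
        )
        if dist <= max_distance:
            matches.append((qi, ti, dist))
    return matches


# Made-up 2-D "descriptors"; real ones have 32 to 128 dimensions.
query = [(0.0, 0.0), (5.0, 5.0)]
train = [(0.0, 1.0), (9.0, 9.0)]
matches = brute_force_match(query, train, max_distance=2.0)
```

The number of surviving matches can then serve as a crude similarity score between two images, before any geometric check such as RANSAC.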
Advanced Solution
If you have many images (on the order of 1 million), you can look at Locality-Sensitive Hashing (see Dean et al., 100,000 Object Categories).
If you do use Bag-of-Visual-Words, then you should probably build an Inverted Index.
Have a look at Fisher Vectors for improved accuracy as compared to BOW.
Suggestion
Start by using Bag-of-Visual-Words. There are tutorials on how to train the dictionary for this model.
Training:
Extract local features (just pick SIFT; you can easily change this, as OpenCV is very modular) from a subset of your training images. First detect the features and then extract their descriptors. There are many tutorials on the web about this.
Train the dictionary. The documentation is helpful and references a sample implementation in Python (opencv_source_code/samples/python2/find_obj.py).
Compute a histogram for each training image (this is also covered in the BOW documentation from the previous step).
Put your image descriptors from the step above into a FLANN-Based-matcher.
Querying:
Compute features on your query image.
Use the dictionary from training to build a BOW histogram for your query image.
Use that feature to find the nearest neighbor(s).
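The querying steps can be sketched in plain Python, assuming the dictionary (the visual words) has already been trained. The function names, the tiny 2-D descriptors, and the two-word dictionary are made up for illustration; in OpenCV the equivalent steps would go through its BOW and FLANN classes.

```python
import math


def bow_histogram(descriptors, dictionary):
    """Normalised count of descriptors assigned to each nearest visual word."""
    histogram = [0.0] * len(dictionary)
    for d in descriptors:
        word = min(range(len(dictionary)), key=lambda i: math.dist(d, dictionary[i]))
        histogram[word] += 1.0
    total = sum(histogram) or 1.0  # guard against images without descriptors
    return [h / total for h in histogram]


def nearest_image(query_hist, database):
    """database maps image name to its BOW histogram; return the closest name."""
    return min(database, key=lambda name: math.dist(query_hist, database[name]))


# Assumed pre-trained dictionary with two visual words.
dictionary = [(0.0, 0.0), (10.0, 10.0)]
database = {
    "glass": bow_histogram([(0.0, 1.0), (1.0, 0.0), (9.0, 9.0)], dictionary),
    "ball": bow_histogram([(10.0, 9.0), (9.0, 10.0), (11.0, 11.0)], dictionary),
}
query_hist = bow_histogram([(1.0, 1.0), (0.0, 0.0), (10.0, 10.0)], dictionary)
best = nearest_image(query_hist, database)
```

With many images, the linear scan in `nearest_image` is what an approximate nearest-neighbour index would replace.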
I think you are talking about Content-Based Image Retrieval.
There are many research papers available on the Internet. Pick one and implement the approach that best fits your needs. Select the criteria according to your application, such as texture-based, color-based, or shape-based image retrieval (the last is best for speed when you are working with image retrieval on the Internet).
Since you need a Python implementation, I would suggest going through chapters 7 and 8 of the book Computer Vision Book. It contains working examples, with code, of what you are looking for.
One question you may find useful: Are there any API's that'll let me search by image?

Use clustering for prediction in Weka

Can I use clustering (e.g. using k-means) to make predictions in Weka?
I have some data from a survey about presidential elections. I have answers from questionnaires (numeric attributes), and I have one attribute that is the answer to the question "Who are you going to vote for?" (1, 2 or 3).
I make predictions using some classifiers (e.g. Bayes) in Weka. My results are based on that answer (vote intention), and I have about 60% recall (rate of correct predictions).
I understand that clustering is a different thing, but can I use clustering to make predictions? I've already tried, but I've realized that clustering always selects its own centroids and does not use my vote intention question.
The author of "Explain results of K-means" must be a colleague of yours. He seems to use the same data set, and it would be helpful if we could all have a look at the data.
In general, clustering is not classification or prediction.
However, you can try to improve your classification by using the information gained from clustering. Two such techniques:
substitute your data set with the cluster centers, and use these for classification (at least if your clusters are reasonably pure with respect to the class label!)
train a separate classifier on each cluster, and build an ensemble out of them (in particular, if your clusters are inhomogeneous)
But I believe your understanding of classification and clustering is not yet far enough along to try these out. You need to handle them carefully, and know your data very well.
Yes. You can use the Weka interface to do prediction via clustering. First, load your training data using the Preprocess tab. Then go to the Classify tab, click Choose under Classifier, and under meta choose ClassificationViaClustering. The default clustering algorithm used by Weka is SimpleKMeans, but you can change that by clicking on the options string (i.e. the text next to the Choose button). Weka will display a dialog box; click Choose there and a set of clustering algorithms will be listed to choose from (e.g. EM). After that, you can do cross-validation or load test data by clicking on Set, as you normally do when you use Weka for classification.
Hope this will help anyone having the same question!
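The general idea of classifying via clustering can be sketched in plain Python: cluster the points without looking at the class, then label each cluster with the majority class of its training points. This is only an illustration of the principle, not Weka's implementation; the naive initialisation (the first k points) and the toy data are assumptions.

```python
import math
from collections import Counter


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def k_means(points, k, iterations=20):
    """Plain k-means; uses the first k points as initial centroids."""
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Recompute each centroid as the mean of its cluster (keep it if empty).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids


def fit_cluster_classifier(points, labels, k):
    """Cluster without using labels, then map each cluster to its majority label."""
    centroids = k_means(points, k)
    votes = [Counter() for _ in range(k)]
    for p, y in zip(points, labels):
        votes[nearest(p, centroids)][y] += 1
    cluster_label = [v.most_common(1)[0][0] if v else None for v in votes]
    return lambda p: cluster_label[nearest(p, centroids)]


# Toy questionnaire answers (made up) with vote intention 1 or 2.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 0.5), (8.5, 9.5), (0.5, 1.5), (9.5, 8.5)]
labels = [1, 2, 1, 2, 1, 2]
predict = fit_cluster_classifier(points, labels, k=2)
```

This only works well when the clusters happen to line up with the classes, which is exactly the caveat raised in the other answer.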

Exporting Mahout model output as Weka input

I would like to use the output model of a Mahout decision tree training process as the input model for a Weka based classifier.
As the training of a complex decision tree that is based on millions of training records is almost impractical for a single node Weka classifier, I would like to use Mahout to build the model, using for example Random Forest Partial Implementation.
While the algorithm above can be problematic while training, it is rather simple to use it for prediction with Weka on a single machine.
On Mahout wiki site it is stated that the data formats for import include Weka ARFF format, but not for export.
Is it possible to use some of the existing implementations in Mahout to train models that will be used in production with a simple Weka based system?
I don't think it's possible to do what you're asking: .arff is a data format, as are all of the other options in the import/export menus. The classifiers that Weka can save/load are, in fact, Weka's java Classifier objects written to a file using Java's Serializable interface. They're not so much portable trees as they are Java objects that last longer than the JVMs which create them. Thus, to do what you want, either Mahout or Weka would have to be able to produce/read each other's code, and that's not something I can find any documentation of.
My experience is that with several million training records (consisting of ~45 numeric features/columns each), Weka's Random Forest implementation using the default options is very fast (operating in seconds on a single 2.26GHz core), so it may not be necessary to bother with Mahout. Your data set may well have different results, though.

Resources