Clustering + Regression: the right approach or not?

I have the task of predicting how quickly goods will sell (within one category, for example). E.g., the client enters the price at which he wants his item to be sold, and the algorithm should display that, at that price, it will sell in n days. The output should have three intervals: quick, medium and long sell.
The question: how exactly should I build this algorithm?
My suggestion: use clustering techniques to find these three price ranges, then solve a regression task within each cluster to predict the number of days. Is this the right approach?

There are two questions here, and I think the answer to each lies in a different domain:
1. Given an input price, predict how long it will take to sell the item. This is a well-defined prediction problem and can be tackled using ML algorithms, e.g. use your entire dataset to train and test a regression model.
2. Translate the prediction into a class: quick-, medium- or slow-sell. This problem is product-oriented: there doesn't seem to be any concrete data allowing you to train a classifier on this translation, and I agree with #anony-mousse that using unsupervised learning might not yield easy-to-use results.
You can either consult your users or a product manager on reasonable thresholds to use (there might be considerations here like the type of item, season etc.), or try getting some additional data in order to train a supervised classifier.
E.g. you could ask your users, post-sale, whether they think the sale was quick, medium or slow. Then you'll have some data to use for thresholding or for classification.
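As a rough illustration of this two-step split, here is a minimal Python sketch; the CSV name, its price and days_to_sell columns, the model choice, and the 10/31-day thresholds are all illustrative assumptions, not a prescription.

    # Step 1: a regression model predicting days-to-sell from price.
    # Step 2: a product-defined mapping from predicted days to a label.
    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("sales.csv")              # assumed columns: price, days_to_sell
    X, y = df[["price"]], df["days_to_sell"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = GradientBoostingRegressor().fit(X_train, y_train)

    def sell_speed(price, quick=10, medium=31):
        """Predict days-to-sell, then translate into a product-facing label."""
        days = model.predict([[price]])[0]
        label = "quick" if days <= quick else ("medium" if days <= medium else "slow")
        return days, label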

I suggest you simply define thresholds of 10 days and 31 days. Keep it simple.
These are the values the users will want to understand. If you use clustering, you may end up with 0.31415 days or similar non-intuitive values that you cannot explain to the user anyway.

Related

Contextual Search: Classifying shopping products

I have got a new (non-traditional) task from my client; it is about machine learning.
As I have never worked with machine learning, apart from a little data mining, I need your help.
My task is to classify a product found on any shopping site by gender (whom the product is for), age group, etc. The training data we have is the product's title, keywords (available in the HTML of the product page), and product description.
I did a lot of R&D. I found image recognition APIs (cloudsight, vufind) that returned details about the product image, but that did not fulfill the need; I used Google suggest queries, looked through many machine learning algorithms, and finally...
I came to know about the decision tree learning algorithm, but I cannot figure out how it applies to my problem.
I tried out the "PlayingTennis" dataset but couldn't make sense of what to do.
Can you give me some direction on where to start this journey? Should I focus on the decision tree learning algorithm, or is there another algorithm you would suggest for categorizing products by context?
If you'd like, I can share in detail what I have already tried to solve my problem.
I would suggest doing the following:
1. Go through the items in your dataset and classify them manually (decide for which gender each item is). Store each decision so that you can link each item in the original dataset to a target class.
2. Develop an algorithm for converting each item from your dataset into a feature vector, i.e. a vector of numbers (more about how to do this later).
3. Convert your whole dataset, with the appropriate classes, into a dataset that looks like this:
Feature_1, Feature_2, Feature_3, ..., Gender
value_1, value_2, value_3, ..., male
It would be a good decision to store it in a CSV file, since you would then be able to load and process it in different machine learning tools (more about those later).
4. Load the dataset you've created at step 3 into a machine learning tool of your choice and try to come up with the best model for classifying items in your dataset by gender.
5. Store the model created at step 4. It will be part of your production system.
6. Develop production code that takes an unclassified product, creates a feature vector from it, and passes that vector to the model you saved at step 5 (a sketch follows below). The result of this operation should be a predicted gender.
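To make steps 3-6 concrete, here is a hedged end-to-end sketch in Python; the file names, column names, and model choice are all assumptions, and TF-IDF over title plus description stands in for whatever feature extraction you settle on.

    import joblib
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    df = pd.read_csv("labeled_products.csv")     # assumed columns: title, description, gender
    text = df["title"] + " " + df["description"]

    pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    pipe.fit(text, df["gender"])                 # steps 3-4: train on the labeled CSV
    joblib.dump(pipe, "gender_model.joblib")     # step 5: store the model

    # Step 6: the production side loads the model and classifies a new product.
    model = joblib.load("gender_model.joblib")
    print(model.predict(["Leather wallet for men with coin pocket"]))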
Details
If there are too many items (say, tens of thousands) in your original dataset, it may be impractical to classify them all yourself. You can use Amazon Mechanical Turk to simplify the task. If you are unable to use it (the last time I checked, you had to have a USA address), you can classify a few hundred items to start working on your model and classify the rest later to improve the accuracy of your classification (the more training data you use, the better the accuracy, but only up to a certain point).
How to extract features from a dataset
If a keyword has the form tag=true/false, it's a boolean feature.
If a keyword has the form tag=42, it's a numerical or ordinal one. For example, it can be a price value or a price range (0-10, 10-50, 50-100, etc.).
If a keyword has the form tag=string_value, you can convert it into a categorical value.
The class (gender) is simply a boolean value, 0/1.
You can experiment a bit with how you extract your features, since it may influence the resulting accuracy.
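As a toy sketch of these conversions in Python/pandas (the tag names and bin edges are made up):

    import pandas as pd

    raw = [
        {"waterproof": True,  "price": 42.0, "color": "red"},
        {"waterproof": False, "price": 9.5,  "color": "blue"},
    ]
    df = pd.DataFrame(raw)

    df["waterproof"] = df["waterproof"].astype(int)         # boolean tag -> 0/1
    df["price_range"] = pd.cut(df["price"], [0, 10, 50, 100],
                               labels=["0-10", "10-50", "50-100"])   # numeric -> ordinal bins
    df = pd.get_dummies(df, columns=["color", "price_range"])        # categorical -> one-hot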
How to extract features from product description
There are different ways to convert text into a feature vector. Look for TF-IDF or similar algorithms.
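For instance, a minimal TF-IDF sketch with scikit-learn (the descriptions are placeholders):

    from sklearn.feature_extraction.text import TfidfVectorizer

    descriptions = [
        "Elegant summer dress with floral print",
        "Rugged hiking boots for men",
    ]
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(descriptions)   # sparse matrix: one row per product
    print(vectorizer.get_feature_names_out())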
Machine learning tools
You can use one of the existing machine learning libraries and hack together some code that loads your CSV dataset, trains a model and checks the accuracy, but at first I would suggest using something like Weka. It has a more or less intuitive UI, and you can quickly start experimenting with different machine learning algorithms, converting features in your dataset from strings to categories, from real values to ordinal values, etc. A good thing about Weka is that it has a Java API, so you can automate the whole process: data conversion, training models programmatically, etc.
What algorithms to choose
I would suggest decision tree algorithms like C4.5. It's fast and shows good results on a wide range of machine learning tasks. Additionally, you can use an ensemble of classifiers: there are various algorithms that combine several base algorithms (google for boosting or random forest to find out more). They usually give better results but work more slowly, since you need to run each feature vector through several models.
Another trick you can use to make your algorithm more accurate is to train models on different sets of features (say, one algorithm uses features extracted from tags and another uses data extracted from the product description). You can then combine them using algorithms like stacking to come up with a final result (sketched below).
For classification based on features extracted from text, you can try the Naive Bayes algorithm or an SVM. They both show good results in text classification.
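Here is a hedged sketch of the stacking idea from above, with one model on tag features and one on text features; the column names are assumptions and the concrete estimators are just plausible defaults:

    from sklearn.compose import make_column_transformer
    from sklearn.ensemble import RandomForestClassifier, StackingClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Base learner 1: a random forest on the (already numeric) tag features.
    tags_model = make_pipeline(
        make_column_transformer(("passthrough", ["price", "waterproof"])),
        RandomForestClassifier(n_estimators=100, random_state=0),
    )
    # Base learner 2: naive Bayes on TF-IDF of the description text.
    text_model = make_pipeline(
        make_column_transformer((TfidfVectorizer(), "description")),
        MultinomialNB(),
    )
    clf = StackingClassifier(
        estimators=[("tags", tags_model), ("text", text_model)],
        final_estimator=LogisticRegression(),
    )
    # clf.fit(df[["price", "waterproof", "description"]], df["gender"])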
Do consider the Support Vector Classifier (SVC), or, for Google's sake, the Support Vector Machine (SVM). If you have a large training set (which I suspect you do), search for implementations that are "fast" or "scalable".

evaluating the performance of item-based collaborative filtering for binary (yes/no) product recommendations

I'm attempting to write some code for item-based collaborative filtering for product recommendations. The input has buyers as rows and products as columns, with a simple 0/1 flag to indicate whether or not a buyer has bought an item. The output is a list of similar items for a given purchased item, ranked by cosine similarity.
I am attempting to measure the accuracy of a few different implementations, but I am not sure of the best approach. Most of the literature I find mentions using some form of mean square error, but this really seems more applicable when your collaborative filtering algorithm predicts a rating (e.g. 4 out of 5 stars) instead of recommending which items a user will purchase.
One approach I was considering was as follows...
split data into training/holdout sets, train on training data
For each item (A) in the set, select data from the holdout set where users bought A
Determine what percentage of those A-buyers bought one of the top 3 recommendations for A
The above seems kind of arbitrary, but I think it could be useful for comparing two different algorithms when trained on the same data.
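For concreteness, here is a rough sketch of that scheme in Python, assuming a 0/1 buyer-by-product matrix; the random data just stands in for real purchases:

    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    train = np.random.default_rng(0).integers(0, 2, size=(100, 12))    # stand-in data
    holdout = np.random.default_rng(1).integers(0, 2, size=(50, 12))

    sim = cosine_similarity(train.T)    # item-item cosine similarities
    np.fill_diagonal(sim, 0)            # never recommend an item for itself

    def hit_rate(item, k=3):
        """Fraction of holdout buyers of `item` who also bought a top-k recommendation."""
        top_k = np.argsort(sim[item])[::-1][:k]
        buyers = holdout[holdout[:, item] == 1]
        if len(buyers) == 0:
            return 0.0
        return np.mean(buyers[:, top_k].any(axis=1))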
Actually, your approach is quite similar to the literature, but I think you should consider using recall and precision, as most of the papers do.
http://en.wikipedia.org/wiki/Precision_and_recall
Moreover, if you use Apache Mahout, there is an implementation of recall and precision in the class GenericRecommenderIRStatsEvaluator.
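If you want the metrics without pulling in Mahout, precision@k and recall@k are only a few lines; this sketch assumes per-user recommendation lists and sets of actually bought items:

    def precision_recall_at_k(recommended, bought, k=3):
        """precision@k and recall@k for one user."""
        top_k = set(recommended[:k])
        hits = len(top_k & bought)
        precision = hits / k
        recall = hits / len(bought) if bought else 0.0
        return precision, recall

    # e.g. precision_recall_at_k(["a", "b", "c", "d"], {"b", "x"}) -> (1/3, 1/2)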
The best way to test a recommender is always to manually verify the results. However, some kind of automatic verification is also good.
In the spirit of a recommendation system, you should split your data in time and see if your algorithm can predict which future purchases the user makes. This should be done for all users.
Don't expect it to predict everything; 100% correctness is usually a sign of over-fitting.

how to categorize without using classification or clustering algorithms?

I have a crawler program that stores sports news from 7 different news agencies every day. It stores about 1200 sports news items per day.
I want to categorize the news of the last two days into sub-categories. So every two days I have about 2400 news items, and many of their topics are about exactly the same event.
for example:
70 news items are about Brad Keselowski's 500-mile race.
120 news items are about US swimmer Nyad beginning her swim.
28 news items are about the match between Man United and Man City.
. . .
In other words, I want to make something like Google News.
The problem is that this is not a classification problem, because I don't have predefined classes. For example, my classes are not swimming, golf, football, etc.; my classes are the specific events in every field that happened in those two days. So I cannot use classification algorithms such as Naive Bayes.
On the other hand, my problem cannot be solved with clustering algorithms either, because I don't want to force the news into n clusters. Maybe one news item has no similar news at all, or maybe one two-day window contains 12 different stories while another contains 30 different ones. So I cannot use clustering algorithms such as single link (maximum similarity), complete link (minimum similarity), maximum weighted matching or group average (average intra-similarity).
I have some ideas of my own, for example: any two news items that have 10 words in common should be in the same class. But if we don't consider parameters such as document length and the influence of common and rare words, among other things, this will not work well.
I have read this paper, but it did not answer my question.
Is there any known algorithm to solve this problem?
The problem strikes me as a clustering problem with an unknown quality measure for the clusters. That points to an unsupervised method, which is ultimately based on detecting correlations using redundancy in the data. Perhaps something like principal component analysis or latent semantic analysis could be useful. The different dimensions (principal components or singular vectors) would indicate distinct major themes, with the terms corresponding to the vector components hopefully being the words appearing in the description. One drawback is that there's no guarantee that the strongest correlations would lead easily to a sensible description.
Take a look at "topic models" and "Latent Dirichlet Allocation". These are popular and you'll find code in a variety of languages.
You might use hierarchical clustering algorithms to investigate relationships between your items - the closest items (news with almost the same description) would be in the same clusters, and the closest clusters (groups of similar news) would be in the same super-cluster etc.
Also, there is a pretty nice and fast algorithm called CLOPE: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.7142&rep=rep1&type=pdf
There are many document clustering algorithms out there. Take a look at "Hierarchical document clustering using frequent itemsets", for example, and see if that is similar to what you want. If you're programming in Java, you may get some mileage out of the S-space package, which includes algorithms for latent semantic analysis (LSA) among others.

Which data mining algorithm would you suggest for this particular scenario?

This is not a directly programming-related question, but it is about selecting the right data mining algorithm.
I want to infer people's ages from their first names, from the region they live in, and from whether they have an internet product or not. The idea behind it is that:
there are names that are old-fashioned or popular in a particular decade (celebrities, politicians etc.) (this may not hold in the USA, but in the country of interest that's true),
young people tend to live in highly populated regions whereas old people prefer the countryside, and
Internet is used more by young people than by old people.
I am not sure if those assumptions hold, but I want to test that. So what I have is 100K observations from our customer database with
approx. 500 different names (nominal input variable with too many classes)
20 different regions (nominal input variable)
Internet Yes/No (binary input variable)
91 distinct birthyears (numerical target variable with range: 1910-1992)
Because I have so many nominal inputs, I don't think regression is a good candidate. Because the target is numerical, I don't think a decision tree is a good option either. Can anyone suggest a method that is applicable to such a scenario?
I think you could design discrete variables that reflect the split you are trying to determine. It doesn't seem like you need a regression on their exact age.
One possibility is to cluster the ages, and then treat the clusters as discrete variables. Should this not be appropriate, another possibility is to divide the ages into bins of equal distribution.
One technique that could work very well for your purposes is, instead of clustering or partitioning the ages directly, to cluster or partition the average age per name. That is, generate a list of all of the average ages and work with this instead. (There may be some statistical problems in the classifier if the discrete categories here are too fine-grained, though.)
However, the best case is if you have a clear notion of what age range you consider appropriate for 'young' and 'old'. Then, use these directly.
New answer
I would try using regression, but in the manner that I specify. I would try binarizing each variable ("one-hot encoding" may be the more standard term). The Internet variable is binary, but I would turn it into two separate binary values. I will illustrate with an example, because I feel it will be more illuminating. For my example, I will just use three names (Gertrude, Jennifer, and Mary) and the Internet variable.
I have 4 women. Here are their data:
Gertrude, Internet, 57
Jennifer, Internet, 23
Gertrude, No Internet, 60
Mary, No Internet, 35
I would generate a matrix, A, like this (each row represents a respective woman in my list):
[[1,0,0,1,0],
[0,1,0,1,0],
[1,0,0,0,1],
[0,0,1,0,1]]
The first three columns represent the names and the latter two Internet/No Internet. Thus, the columns represent
[Gertrude, Jennifer, Mary, Internet, No Internet]
You can keep doing this with more names (500 columns for the names), and for the regions (20 columns for those). Then you will just be solving the standard linear algebra problem A*x=b where b for the above example is
b=[[57],
[23],
[60],
[35]]
You may be worried that A will now be a huge matrix, but it is a huge, extremely sparse matrix and thus can be stored very efficiently in sparse-matrix form. Each row has three 1's in it and the rest are 0. You can then just solve this with a sparse-matrix solver, as sketched below. You will want to do some sort of correlation test on the resulting predicted ages to see how effective the model is.
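A minimal sketch of that sparse solve in Python, using the 4-woman example above (scipy's lsqr is one reasonable sparse least-squares solver; nothing here is specific to it):

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.linalg import lsqr

    A = csr_matrix(np.array([
        [1, 0, 0, 1, 0],    # Gertrude, Internet
        [0, 1, 0, 1, 0],    # Jennifer, Internet
        [1, 0, 0, 0, 1],    # Gertrude, No Internet
        [0, 0, 1, 0, 1],    # Mary,     No Internet
    ]))
    b = np.array([57, 23, 60, 35])

    x = lsqr(A, b)[0]    # least-squares coefficient per column/category
    predicted = A @ x    # compare against b to gauge the fit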
You might check out the babynamewizard. It shows the changes in name frequency over time and should help convert your names to a numeric input. Also, you should be able to use population density from census.gov data to get a numeric value associated with your regions. I would suggest an additional flag regarding the availability of DSL access - many rural areas don't have DSL coverage. No coverage = less demand for internet services.
My first inclination would be to divide your response into two groups, those very likely to have used computers in school or work and those much less likely. The exposure to computer use at an age early in their career or schooling probably has some effect on their likelihood to use a computer later in their life. Then you might consider regressions on the groups separately. This should eliminate some of the natural correlation of your inputs.
I would use a classification algorithm that accepts nominal attributes and a numeric class, like M5 (for trees or rules). Perhaps I would combine it with the bagging meta-classifier to reduce variance. The original M5 algorithm was invented by R. Quinlan, and Yong Wang made improvements.
The algorithm is implemented in R (library RWeka).
It can also be found in the open-source machine learning software Weka.
For more information see:
Ross J. Quinlan: Learning with Continuous Classes. In: 5th Australian Joint Conference on Artificial Intelligence, Singapore, 343-348, 1992.
Y. Wang, I. H. Witten: Induction of model trees for predicting continuous classes. In: Poster papers of the 9th European Conference on Machine Learning, 1997.
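M5 itself lives in Weka/RWeka; as a rough Python stand-in (a plain CART regression tree rather than a true M5 model tree, wrapped in bagging as suggested above), something like this could work. The file and column names are assumptions:

    import pandas as pd
    from sklearn.ensemble import BaggingRegressor
    from sklearn.tree import DecisionTreeRegressor

    df = pd.read_csv("customers.csv")   # assumed columns: name, region, internet, birthyear
    X = pd.get_dummies(df[["name", "region", "internet"]])   # nominal -> one-hot
    y = df["birthyear"]

    model = BaggingRegressor(DecisionTreeRegressor(min_samples_leaf=50),
                             n_estimators=25, random_state=0).fit(X, y)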
I think slightly differently from you: I believe that trees are excellent algorithms for dealing with nominal data, because they help you build a model that you can easily interpret, identifying the influence of each of these nominal variables and their different values.
You can also use regression with dummy variables to represent the nominal attributes; this is also a good solution.
But you can also use other algorithms such as SVM (SMO), with the same prior transformation of the nominal variables into binary dummy ones as in regression.

Algorithms to find stuff a user would like based on other users' likes

I'm thinking of writing an app to classify movies on an HTPC based on what the family members like.
I don't know statistics or AI, but the stuff here looks very juicy. I wouldn't know where to start, though.
Here's what I want to accomplish:
Compose a set of samples from each user's likes, rating each sample attribute separately. For example, maybe a user likes western movies a lot, so the western genre would carry a bit more weight for that user (and so on for other attributes, like actors, director, etc.).
A user can get suggestions based on the likes of the other users. For example, if both user A and B like Spielberg (connection between the users), and user B loves Batman Begins, but user A loathes Katie Holmes, weigh the movie for user A accordingly (again, each attribute separately, for example, maybe user A doesn't like action movies so much, so bring the rating down a bit, and since Katie Holmes isn't the main star, don't take that into account as much as the other attributes).
Basically, compare user A's sets with similar sets from user B, and come up with a rating for user A.
I have a crude idea about how to implement this, but I'm certain some bright minds have already thought of a far better solution, so... any suggestions?
Actually, after some quick research, it seems a Bayesian filter would work. If so, would this be the better approach? Would it be as simple as just "normalizing" the movie data, training a classifier for each user, and then just classifying each movie?
If your suggestion includes some brain-melting concepts (I'm not experienced in these subjects, especially in AI), I'd appreciate it if you also included a list of some basics for me to research before diving into the meaty stuff.
Thanks!
Matthew Podwysocki had some interesting articles on this stuff
http://codebetter.com/blogs/matthew.podwysocki/archive/2009/03/30/functional-programming-and-collective-intelligence.aspx
http://codebetter.com/blogs/matthew.podwysocki/archive/2009/04/01/functional-programming-and-collective-intelligence-ii.aspx
http://weblogs.asp.net/podwysocki/archive/2009/04/07/functional-programming-and-collective-intelligence-iii.aspx
This is similar to this question, where the OP wanted to build a recommendation system. In a nutshell, we are given a set of training data consisting of users' ratings of movies (a 1-5 star rating, for example) and a set of attributes for each movie (year, genre, actors, ...). We want to build a recommender that outputs a possible rating for unseen movies. So the input data looks like:
    user  movie  year  genre   ...  | rating
    -----------------------------------------
    1     1      2006  action       | 5
    3     2      2008  drama        | 3.5
    ...
and for an unrated movie X:
    10    20     2009  drama        | ?
we want to predict a rating. Doing this for all unseen movies, then sorting by predicted rating and outputting the top 10, gives you a recommendation system.
The simplest approach is to use a k-nearest neighbor algorithm. Among the rated movies, search for the "closest" ones to movie X, and combine their ratings to produce a prediction.
This approach has the advantage of being very simple and easy to implement from scratch.
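A minimal sketch of that nearest-neighbour prediction, assuming movies are already encoded as numeric feature vectors (e.g. one-hot genre plus year):

    import numpy as np

    def predict_rating(x, rated_features, rated_ratings, k=5):
        """Average the ratings of the k rated movies closest to feature vector x."""
        dists = np.linalg.norm(rated_features - x, axis=1)
        nearest = np.argsort(dists)[:k]
        return rated_ratings[nearest].mean()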
Other, more sophisticated approaches exist. For example, you can build a decision tree and fit a set of rules on the training data. You can also use Bayesian networks, artificial neural networks, support vector machines, among many others... Going through each of these won't be easy for someone without the proper background.
Still, I expect you will be using an external tool/library. Now, you seem to be familiar with Bayesian networks, so a simple naive Bayes net could in fact be very powerful. One advantage is that it allows prediction under missing data.
The main idea would be somewhat the same; take the input data you have, train a model, then use it to predict the class of new instances.
If you want to play around with different algorithms in a simple, intuitive package that requires no programming, I suggest you take a look at Weka (my first choice), Orange, or RapidMiner. The most difficult part is preparing the dataset in the required format; the rest is as easy as choosing an algorithm and applying it (all in a few clicks!).
I guess for someone not looking to go into too much detail, I would recommend going with the nearest-neighbour method, as it is intuitive and easy to implement. Still, the option of using Weka (or one of the other tools) is worth looking into.
There are a few algorithms that are good for this:
ARTMAP: groups items via probability against each other (this isn't fast, but it's the best fit for your problem, IMO). ARTMAP holds a group of common attributes and determines the likelihood of similarity via percentages.
ARTMAP
KMeans: this separates the vectors by the distance they are from each other.
KMeans: Wikipedia
PCA: this separates the average of all the values from the varying bits. It is what you would use to do face detection and background subtraction in computer vision.
PCA
The K-nearest neighbor algorithm may be right up your alley.
Check out some of the work of the top teams for the Netflix Prize.
