What estimator to use in scikit-learn? - algorithm

This is my first brush with machine learning, so I'm trying to figure out how this all works. I have a dataset where I've compiled all the statistics of each player to play with my high school baseball team. I also have a list of all the players that have ever made it to the MLB from my high school. What I'd like to do is split the data into a training set and a test set, and then feed it to some algorithm in the scikit-learn package and predict the probability of making the MLB.
So I looked through a number of sources and found this cheat sheet that suggests I start with linear SVC.
So, then as I understand it I need to break my data into training samples where each row is a player and each column is a piece of data about the player (batting average, on base percentage, yada, yada), X_train; and a corresponding truth matrix of a single row per player that is simply 1 (played in MLB) or 0 (did not play in MLB), Y_train. From there, I just do Fit(X,Y) and then I can use predict(X_test) to see if it gets the right values for Y_test.
Does this seem a logical choice of algorithm, method, and application?
EDIT to provide more information:
The data is made of 20 features such as number of games played, number of hits, number of Home Runs, number of Strike Outs, etc. Most are basic counting statistics about the players career; a few are rates such as batting average.
I have about 10k total rows to work with, so I can split the data based on that; but I have no idea how to optimally split the data, given that <1% have made the MLB.

Alright, here are a few steps that might want to make:
Prepare your data set. In practice, you might want to scale the features, but we'll leave it out to make the first working model as simple as possible. So will just need to split the dataset into test/train set. You could shuffle the records manually and take the first X% of the examples as the train set, but there's already a function for it in scikit-learn library: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html. You might want to make sure that both: positive and negative examples are present in the train and test set. To do so, you can separate them before the test/train split to make sure that, say 70% of negative examples and 70% of positive examples go the training set.
Let's pick a simple classifier. I'll use logistic regression here: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html, but other classifiers have a similar API.
Creating the classifier and training it is easy:
clf = LogisticRegression()
clf.fit(X_train, y_train)
Now it's time to make our first predictions:
y_pred = clf.predict(X_test)
A very important part of the model is its evaluation. Using accuracy is not a good idea here: the number of positive examples is very small, so the model that unconditionally returns 0 can get a very high score. We can use the f1 score instead: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html.
If you want to predict probabilities instead of labels, you can just use the predict_proba method of the classifier.
That's it. We have a working model! Of course, there are a lot thing you may try to improve, such as scaling the features, trying different classifiers, tuning their hyperparameters, but this should be enough to get started.

If you don't have a lot of experience in ML, in scikit learn you have classification algorithms (if the target of your dataset is a boolean or a categorical variable) or regression algorithms (if the target is a continuous variable).
If you have a classification problem, and your variables are in a very different scale a good starting point is a decision tree:
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
The classifier is a Tree and you can see the decisions that are taking in the nodes.
After that you can use random forest, that is a group of decision trees that average results:
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
After that you can put the same scale in every feature:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
And you can use other algorithms like SVMs.
For every algorithm you need a technique to select its parameters, for example cross validation:
https://en.wikipedia.org/wiki/Cross-validation_(statistics)
But a good course is the best option to learn. In coursera you can find several good courses like this:
https://www.coursera.org/learn/machine-learning

Related

Negative Training Image Examples for CNN

I am using the Caffe framework for CNN training. My aim is to perform simple object recognition for a few basic object categories. Since pretrained networks are not an alternative for my proposed usage I prepared an own training- and testset with about 1000 images for each of 2 classes (say chairs and cars).
The results are quite good. If I present an yet unseen image of a chair it is likely classified as such, same for a car image. My problem is that the results on miscellaneous images that do not show any of these classes often shows a very high confidence (=1) for one random class (which is not surprising regarding the onesided training data but a problem for my application). I thought about different solutions:
1) Adding a third class with also about 1000 negative examples that shows any objects except a chair and a car.
2) Adding more object categories in general, just to let the network classify other objects as such and not any more as a chair or car (of course this would require much effort). Maybe also the broader prediction results would show a more uniform distribution at negative images, allowing to evaluate the target objects presence based on a threshold?
Because it was not much time-consuming to grab random images as negative examples from the internet, I already tested my first solution with about 1200 negative examples. It helped, but the problem remains, perhaps because it were just too few? My concern is that if I increment the number of negative examples, the imbalance of the number of examples for each class leads to less accurate detection of the original classes.
After some research I found one person with a similar problem, but there was no solution:
Convolutional Neural Networks with Caffe and NEGATIVE IMAGES
My question is: Has anyone had the same problem and knows how to deal with it? What way would you recommend, adding more negative examples or more object categories or do you have any other recommendation?
The problem is not unique to Caffe or ConvNets. Any Machine Learning technique runs this risk. In the end, all classifiers take a vector in some input space (usually very high-dimensional), which means they partition that input space. You've given examples of two partitions, which helps to estimate the boundary between the two, but only that boundary. Both partitions have very, very large boundaries, precisely because the input space is so high-dimensional.
ConvNets do try to tackle the high-dimensionality of image data by having fairly small convolution kernels. Realistic negative data helps in training those, and the label wouldn't really matter. You could even use the input image as goal (i.e. train it as an autoencoder) when training the convolution kernels.
One general reason why you don't want to lump all counterexamples is because they may be too varied. If you have a class A with some feature value from the range [-1,+1] on some scale, with counterexamples B [-2,-1] and C [+1,+2], lumping B and C together creates a range [-2,+2] for counterexamples which overlaps the real real range. Given enough data and powerful enough classifiers, this is not fatal, but for instance an SVM can fail badly on this.

Machine learning: optimal parameter values in reasonable time

Sorry if this is a duplicate.
I have a two-class prediction model; it has n configurable (numeric) parameters. The model can work pretty well if you tune those parameters properly, but the specific values for those parameters are hard to find. I used grid search for that (providing, say, m values for each parameter). This yields m ^ n times to learn, and it is very time-consuming even when run in parallel on a machine with 24 cores.
I tried fixing all parameters but one and changing this only one parameter (which yields m × n times), but it's not obvious for me what to do with the results I got. This is a sample plot of precision (triangles) and recall (dots) for negative (red) and positive (blue) samples:
Simply taking the "winner" values for each parameter obtained this way and combining them doesn't lead to best (or even good) prediction results. I thought about building regression on parameter sets with precision/recall as dependent variable, but I don't think that regression with more than 5 independent variables will be much faster than grid search scenario.
What would you propose to find good parameter values, but with reasonable estimation time? Sorry if this has some obvious (or well-documented) answer.
I would use a randomized grid search (pick random values for each of your parameters in a given range that you deem reasonable and evaluate each such randomly chosen configuration), which you can run for as long as you can afford to. This paper runs some experiments that show this is at least as good as a grid search:
Grid search and manual search are the most widely used strategies for hyper-parameter optimization.
This paper shows empirically and theoretically that randomly chosen trials are more efficient
for hyper-parameter optimization than trials on a grid. Empirical evidence comes from a comparison
with a large previous study that used grid search and manual search to configure neural networks
and deep belief networks. Compared with neural networks configured by a pure grid search,
we find that random search over the same domain is able to find models that are as good or better
within a small fraction of the computation time.
For what it's worth, I have used scikit-learn's random grid search for a problem that required optimizing about 10 hyper-parameters for a text classification task, with very good results in only around 1000 iterations.
I'd suggest the Simplex Algorithm with Simulated Annealing:
Very simple to use. Simply give it n + 1 points, and let it run up to some configurable value (either number of iterations, or convergence).
Implemented in every possible language.
Doesn't require derivatives.
More resilient to local optimum than the method you're currently using.

Classifying Multivariate Time Series

I currently am working on a time series witch 430 attributes and approx. 80k instances. Now I would like to binary classify each instance (not the whole ts). Everything I found about classifying TS talked about labeling the whole thing.
Is it possible to classify each instance with something like a SVM completely disregarding the sequential nature of the data or would that only result in a really bad classifier?
Which other options are there which classify each instance but still look at the data as a time series?
If the data is labeled, you may have luck by concatenating attributes together, so each instance becomes a single long time series, and by applying the so-called Shapelet Transform. This would result in a vector of values for each of time series which can be fed into SVM, Random Forest, or any other classifier. It could be that picking a right shapelets will allow you to focus on a single attribute when classifying instances.
If it is not labeled, you may try the unsupervised shapelets application first to explore your data and proceed with aforementioned shapelet transform after.
It certainly depends on the data within the 430 attributes,
data types, and especially the problem you want to solve.
In time series analysis, you usually want to exploit the dependencies between the neighboring points, i.e., how they change in time. The examples you may find in books usually talk about a single function f(t): Time -> Real. If I understand it correctly, you want to focus just on the dependencies among the 430 attributes (vertical dependencies) and disregard the horizontal dependencies.
If I were you, I would first try to train multiple classifiers (SVM, Maximum entropy model, Multi-layer perceptron, Random forest, Probabilistic Neural Network, ...) and compare their prediction performance in the frame of your problem.
For training, you can start by feeding all 430 attributes as features to Maxent classifier (can easily handle millions of features).
You also need to perform some N-fold cross-validation to see whether the classifiers are not overfitted. Then pick the best that solves your problem "good enough".
Other ideas if this approach does not perform well:
include features from t-1, t-2...
perform feature selection by trying different subsets of features
derive new time series such as moving averages, wavelet spectrum ... and use them as new features
A nice implementation of Maxent classifier can be found in openNLP.

What are good algorithms for detecting abnormality?

Background
Here is the problem:
A black box outputs a new number each day.
Those numbers have been recorded for a period of time.
Detect when a new number from the black box falls outside the pattern of numbers established over the time period.
The numbers are integers, and the time period is a year.
Question
What algorithm will identify a pattern in the numbers?
The pattern might be simple, like always ascending or always descending, or the numbers might fall within a narrow range, and so forth.
Ideas
I have some ideas, but am uncertain as to the best approach, or what solutions already exist:
Machine learning algorithms?
Neural network?
Classify normal and abnormal numbers?
Statistical analysis?
Cluster your data.
If you don't know how many modes your data will have, use something like a Gaussian Mixture Model (GMM) along with a scoring function (e.g., Bayesian Information Criterion (BIC)) so you can automatically detect the likely number of clusters in your data. I recommend this instead of k-means if you have no idea what value k is likely to be. Once you've constructed a GMM for you data for the past year, given a new datapoint x, you can calculate the probability that it was generated by any one of the clusters (modeled by a Gaussian in the GMM). If your new data point has low probability of being generated by any one of your clusters, it is very likely a true outlier.
If this sounds a little too involved, you will be happy to know that the entire GMM + BIC procedure for automatic cluster identification has been implemented for you in the excellent MCLUST package for R. I have used it several times to great success for such problems.
Not only will it allow you to identify outliers, you will have the ability to put a p-value on a point being an outlier if you need this capability (or want it) at some point.
You could try line fitting prediction using linear regression and see how it goes, it would be fairly easy to implement in your language of choice.
After you fitted a line to your data, you could calculate the mean standard deviation along the line.
If the novel point is on the trend line +- the standard deviation, it should not be regarded as an abnormality.
PCA is an other technique that comes to mind, when dealing with this type of data.
You could also look in to unsuperviced learning. This is a machine learning technique that can be used to detect differences in larger data sets.
Sounds like a fun problem! Good luck
There is little magic in all the techniques you mention. I believe you should first try to narrow the typical abnormalities you may encounter, it helps keeping things simple.
Then, you may want to compute derived quantities relevant to those features. For instance: "I want to detect numbers changing abruptly direction" => compute u_{n+1} - u_n, and expect it to have constant sign, or fall in some range. You may want to keep this flexible, and allow your code design to be extensible (Strategy pattern may be worth looking at if you do OOP)
Then, when you have some derived quantities of interest, you do statistical analysis on them. For instance, for a derived quantity A, you assume it should have some distribution P(a, b) (uniform([a, b]), or Beta(a, b), possibly more complex), you put a priori laws on a, b and you ajust them based on successive information. Then, the posterior likelihood of the info provided by the last point added should give you some insight about it being normal or not. Relative entropy between posterior and prior law at each step is a good thing to monitor too. Consult a book on Bayesian methods for more info.
I see little point in complex traditional machine learning stuff (perceptron layers or SVM to cite only them) if you want to detect outliers. These methods work great when classifying data which is known to be reasonably clean.

Which data mining algorithm would you suggest for this particular scenario?

This is not a directly programming related question, but it's about selecting the right data mining algorithm.
I want to infer the age of people from their first names, from the region they live, and if they have an internet product or not. The idea behind it is that:
there are names that are old-fashioned or popular in a particular decade (celebrities, politicians etc.) (this may not hold in the USA, but in the country of interest that's true),
young people tend to live in highly populated regions whereas old people prefer countrysides, and
Internet is used more by young people than by old people.
I am not sure if those assumptions hold, but I want to test that. So what I have is 100K observations from our customer database with
approx. 500 different names (nominal input variable with too many classes)
20 different regions (nominal input variable)
Internet Yes/No (binary input variable)
91 distinct birthyears (numerical target variable with range: 1910-1992)
Because I have so many nominal inputs, I don't think regression is a good candidate. Because the target is numerical, I don't think decision tree is a good option either. Can anyone suggest me a method that is applicable for such a scenario?
I think you could design discrete variables that reflect the split you are trying to determine. It doesn't seem like you need a regression on their exact age.
One possibility is to cluster the ages, and then treat the clusters as discrete variables. Should this not be appropriate, another possibility is to divide the ages into bins of equal distribution.
One technique that could work very well for your purposes is, instead of clustering or partitioning the ages directly, cluster or partition the average age per name. That is to say, generate a list of all of the average ages, and work with this instead. (There may be some statistical problems in the classifier if you the discrete categories here are too fine-grained, though).
However, the best case is if you have a clear notion of what age range you consider appropriate for 'young' and 'old'. Then, use these directly.
New answer
I would try using regression, but in the manner that I specify. I would try binarizing each variable (if this is the correct term). The Internet variable is binary, but I would make it into two separate binary values. I will illustrate with an example because I feel it will be more illuminating. For my example, I will just use three names (Gertrude, Jennifer, and Mary) and the internet variable.
I have 4 women. Here are their data:
Gertrude, Internet, 57
Jennifer, Internet, 23
Gertrude, No Internet, 60
Mary, No Internet, 35
I would generate a matrix, A, like this (each row represents a respective woman in my list):
[[1,0,0,1,0],
[0,1,0,1,0],
[1,0,0,0,1],
[0,0,1,0,1]]
The first three columns represent the names and the latter two Internet/No Internet. Thus, the columns represent
[Gertrude, Jennifer, Mary, Internet, No Internet]
You can keep doing this with more names (500 columns for the names), and for the regions (20 columns for those). Then you will just be solving the standard linear algebra problem A*x=b where b for the above example is
b=[[57],
[23],
[60],
[35]]
You may be worried that A will now be a huge matrix, but it is a huge, extremely sparse matrix and thus can be stored very efficiently in a sparse matrix form. Each row has 3 1's in it and the rest are 0. You can then just solve this with a sparse matrix solver. You will want to do some sort of correlation test on the resulting predicting ages to see how effective it is.
You might check out the babynamewizard. It shows the changes in name frequency over time and should help convert your names to a numeric input. Also, you should be able to use population density from census.gov data to get a numeric value associated with your regions. I would suggest an additional flag regarding the availability of DSL access - many rural areas don't have DSL coverage. No coverage = less demand for internet services.
My first inclination would be to divide your response into two groups, those very likely to have used computers in school or work and those much less likely. The exposure to computer use at an age early in their career or schooling probably has some effect on their likelihood to use a computer later in their life. Then you might consider regressions on the groups separately. This should eliminate some of the natural correlation of your inputs.
I would use a classification algorithm that accepts nominal attributes and numeric class, like M5 (for trees or rules). Perhaps I would combine it with the bagging meta classifier to reduce variance. The original algorithm M5 was invented by R. Quinlan and Yong Wang made improvements.
The algorithm is implemented in R (library RWeka)
It also can be found in the open source machine learning software Weka
For more information see:
Ross J. Quinlan: Learning with Continuous Classes. In: 5th Australian Joint Conference on Artificial Intelligence, Singapore, 343-348, 1992.
Y. Wang, I. H. Witten: Induction of model trees for predicting continuous classes. In: Poster papers of the 9th European Conference on Machine Learning, 1997.
I think slightly different from you, I believe that trees are excellent algorithms to deal with nominal data because they can help you build a model that you can easily interpret and identify the influence of each one of these nominal variables and it's different values.
You can also use regression with dummy variables in order to represent the nominal attributes, this is also a good solution.
But you can also use other algorithms such as SVM(smo), with the previous transformation of the nominal variables to binary dummy ones, same as in regression.

Resources