Machine learning: optimal parameter values in reasonable time - time

Sorry if this is a duplicate.
I have a two-class prediction model; it has n configurable (numeric) parameters. The model can work pretty well if you tune those parameters properly, but the specific values for those parameters are hard to find. I used grid search for that (providing, say, m values for each parameter). This yields m ^ n times to learn, and it is very time-consuming even when run in parallel on a machine with 24 cores.
I tried fixing all parameters but one and changing this only one parameter (which yields m × n times), but it's not obvious for me what to do with the results I got. This is a sample plot of precision (triangles) and recall (dots) for negative (red) and positive (blue) samples:
Simply taking the "winner" values for each parameter obtained this way and combining them doesn't lead to best (or even good) prediction results. I thought about building regression on parameter sets with precision/recall as dependent variable, but I don't think that regression with more than 5 independent variables will be much faster than grid search scenario.
What would you propose to find good parameter values, but with reasonable estimation time? Sorry if this has some obvious (or well-documented) answer.

I would use a randomized grid search (pick random values for each of your parameters in a given range that you deem reasonable and evaluate each such randomly chosen configuration), which you can run for as long as you can afford to. This paper runs some experiments that show this is at least as good as a grid search:
Grid search and manual search are the most widely used strategies for hyper-parameter optimization.
This paper shows empirically and theoretically that randomly chosen trials are more efficient
for hyper-parameter optimization than trials on a grid. Empirical evidence comes from a comparison
with a large previous study that used grid search and manual search to configure neural networks
and deep belief networks. Compared with neural networks configured by a pure grid search,
we find that random search over the same domain is able to find models that are as good or better
within a small fraction of the computation time.
For what it's worth, I have used scikit-learn's random grid search for a problem that required optimizing about 10 hyper-parameters for a text classification task, with very good results in only around 1000 iterations.

I'd suggest the Simplex Algorithm with Simulated Annealing:
Very simple to use. Simply give it n + 1 points, and let it run up to some configurable value (either number of iterations, or convergence).
Implemented in every possible language.
Doesn't require derivatives.
More resilient to local optimum than the method you're currently using.

Related

What estimator to use in scikit-learn?

This is my first brush with machine learning, so I'm trying to figure out how this all works. I have a dataset where I've compiled all the statistics of each player to play with my high school baseball team. I also have a list of all the players that have ever made it to the MLB from my high school. What I'd like to do is split the data into a training set and a test set, and then feed it to some algorithm in the scikit-learn package and predict the probability of making the MLB.
So I looked through a number of sources and found this cheat sheet that suggests I start with linear SVC.
So, then as I understand it I need to break my data into training samples where each row is a player and each column is a piece of data about the player (batting average, on base percentage, yada, yada), X_train; and a corresponding truth matrix of a single row per player that is simply 1 (played in MLB) or 0 (did not play in MLB), Y_train. From there, I just do Fit(X,Y) and then I can use predict(X_test) to see if it gets the right values for Y_test.
Does this seem a logical choice of algorithm, method, and application?
EDIT to provide more information:
The data is made of 20 features such as number of games played, number of hits, number of Home Runs, number of Strike Outs, etc. Most are basic counting statistics about the players career; a few are rates such as batting average.
I have about 10k total rows to work with, so I can split the data based on that; but I have no idea how to optimally split the data, given that <1% have made the MLB.
Alright, here are a few steps that might want to make:
Prepare your data set. In practice, you might want to scale the features, but we'll leave it out to make the first working model as simple as possible. So will just need to split the dataset into test/train set. You could shuffle the records manually and take the first X% of the examples as the train set, but there's already a function for it in scikit-learn library: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html. You might want to make sure that both: positive and negative examples are present in the train and test set. To do so, you can separate them before the test/train split to make sure that, say 70% of negative examples and 70% of positive examples go the training set.
Let's pick a simple classifier. I'll use logistic regression here: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html, but other classifiers have a similar API.
Creating the classifier and training it is easy:
clf = LogisticRegression()
clf.fit(X_train, y_train)
Now it's time to make our first predictions:
y_pred = clf.predict(X_test)
A very important part of the model is its evaluation. Using accuracy is not a good idea here: the number of positive examples is very small, so the model that unconditionally returns 0 can get a very high score. We can use the f1 score instead: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html.
If you want to predict probabilities instead of labels, you can just use the predict_proba method of the classifier.
That's it. We have a working model! Of course, there are a lot thing you may try to improve, such as scaling the features, trying different classifiers, tuning their hyperparameters, but this should be enough to get started.
If you don't have a lot of experience in ML, in scikit learn you have classification algorithms (if the target of your dataset is a boolean or a categorical variable) or regression algorithms (if the target is a continuous variable).
If you have a classification problem, and your variables are in a very different scale a good starting point is a decision tree:
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
The classifier is a Tree and you can see the decisions that are taking in the nodes.
After that you can use random forest, that is a group of decision trees that average results:
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
After that you can put the same scale in every feature:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
And you can use other algorithms like SVMs.
For every algorithm you need a technique to select its parameters, for example cross validation:
https://en.wikipedia.org/wiki/Cross-validation_(statistics)
But a good course is the best option to learn. In coursera you can find several good courses like this:
https://www.coursera.org/learn/machine-learning

Genetic Algorithm Results Presentation

I have data of an experiment that I ran using the genetic algorithm and am trying to present it in a paper. What is a good/ classic way of representing the results of the genetic algorithm. I was thinking of doing a scatterplot representing the maximum fit individuals by their generations. Is this a good representation of the results?
When you gauge the performance of a Genetic Algorithm (or any other stochastic algorithm), you run it multiple times and then aggregate the results to eliminate the effect of some runs being "lucky" or "unlucky". Then it's about presenting such aggregated results.
For a single run (among many of those), you are typically concerned with the best individual in the fitness only (unless you are analyzing the population dynamics which I think you don't), because that's the output of the algorithm at any given time during its runtime.
When you have such best individuals for each run, you present the results. Typical visual representation of a GA is an "evolution plot" or "progress plot" (I personally use the first term and other researchers use it too) and it looks something like this (from my master thesis):
I know, it's a little bit messy. However, the solid lines are the medians of the aggregated runs. That means that at X evaluations, for each algorithm the solid line is at the median fitness of all the best individuals from each of the run of the particular algorithm (mean is also used sometimes, but it is not resistant to outliers). The error bars stretch from 1st to 3rd quartile, in my case (standard deviation is also used sometimes, but then the error bars are symmetrical about the solid line and do not show the distribution as much as the quantiles).
If you are not interested in the progress of the evolution but rather in the final results, you can use e.g. boxplot to properly show the distribution of final values of the algorithms. It looks something like this (again, from my master thesis, corresponds to the evolution plot above):
This one was created in MATLAB. There is an online tool for creating boxplots: http://boxplot.bio.ed.ac.uk
If you have only a single algorithm to present, you can also combine the evolution plot with boxplot - an evolution plot made of boxplots! You just put a boxplot every N-th evaluation (N depends on the figure size to be readable). The quartile error bars and median solid line is a sort-of a boxplot, in a (distorted) way.
The last option is to present the results textually (or in a table) supported by some statistical tests. For comparison of two algorithms (the final values), you can use e.g. the Mann-Whitney U-test. Comparing more than two algorithms becomes tricky and you need to find a friendly statistician to help you out :).

Multiple parameter optimization with lots of local minima

I'm looking for algorithms to find a "best" set of parameter values. The function in question has a lot of local minima and changes very quickly. To make matters even worse, testing a set of parameters is very slow - on the order of 1 minute - and I can't compute the gradient directly.
Are there any well-known algorithms for this kind of optimization?
I've had moderate success with just trying random values. I'm wondering if I can improve the performance by making the random parameter chooser have a lower chance of picking parameters close to ones that had produced bad results in the past. Is there a name for this approach so that I can search for specific advice?
More info:
Parameters are continuous
There are on the order of 5-10 parameters. Certainly not more than 10.
How many parameters are there -- eg, how many dimensions in the search space? Are they continuous or discrete - eg, real numbers, or integers, or just a few possible values?
Approaches that I've seen used for these kind of problems have a similar overall structure - take a large number of sample points, and adjust them all towards regions that have "good" answers somehow. Since you have a lot of points, their relative differences serve as a makeshift gradient.
Simulated
Annealing: The classic approach. Take a bunch of points, probabalistically move some to a neighbouring point chosen at at random depending on how much better it is.
Particle
Swarm Optimization: Take a "swarm" of particles with velocities in the search space, probabalistically randomly move a particle; if it's an improvement, let the whole swarm know.
Genetic Algorithms: This is a little different. Rather than using the neighbours information like above, you take the best results each time and "cross-breed" them hoping to get the best characteristics of each.
The wikipedia links have pseudocode for the first two; GA methods have so much variety that it's hard to list just one algorithm, but you can follow links from there. Note that there are implementations for all of the above out there that you can use or take as a starting point.
Note that all of these -- and really any approach to this large-dimensional search algorithm - are heuristics, which mean they have parameters which have to be tuned to your particular problem. Which can be tedious.
By the way, the fact that the function evaluation is so expensive can be made to work for you a bit; since all the above methods involve lots of independant function evaluations, that piece of the algorithm can be trivially parallelized with OpenMP or something similar to make use of as many cores as you have on your machine.
Your situation seems to be similar to that of the poster of Software to Tune/Calibrate Properties for Heuristic Algorithms, and I would give you the same advice I gave there: consider a Metropolis-Hastings like approach with multiple walkers and a simulated annealing of the step sizes.
The difficulty in using a Monte Carlo methods in your case is the expensive evaluation of each candidate. How expensive, compared to the time you have at hand? If you need a good answer in a few minutes this isn't going to be fast enough. If you can leave it running over night, it'll work reasonably well.
Given a complicated search space, I'd recommend a random initial distributed. You final answer may simply be the best individual result recorded during the whole run, or the mean position of the walker with the best result.
Don't be put off that I was discussing maximizing there and you want to minimize: the figure of merit can be negated or inverted.
I've tried Simulated Annealing and Particle Swarm Optimization. (As a reminder, I couldn't use gradient descent because the gradient cannot be computed).
I've also tried an algorithm that does the following:
Pick a random point and a random direction
Evaluate the function
Keep moving along the random direction for as long as the result keeps improving, speeding up on every successful iteration.
When the result stops improving, step back and instead attempt to move into an orthogonal direction by the same distance.
This "orthogonal direction" was generated by creating a random orthogonal matrix (adapted this code) with the necessary number of dimensions.
If moving in the orthogonal direction improved the result, the algorithm just continued with that direction. If none of the directions improved the result, the jump distance was halved and a new set of orthogonal directions would be attempted. Eventually the algorithm concluded it must be in a local minimum, remembered it and restarted the whole lot at a new random point.
This approach performed considerably better than Simulated Annealing and Particle Swarm: it required fewer evaluations of the (very slow) function to achieve a result of the same quality.
Of course my implementations of S.A. and P.S.O. could well be flawed - these are tricky algorithms with a lot of room for tweaking parameters. But I just thought I'd mention what ended up working best for me.
I can't really help you with finding an algorithm for your specific problem.
However in regards to the random choosing of parameters I think what you are looking for are genetic algorithms. Genetic algorithms are generally based on choosing some random input, selecting those, which are the best fit (so far) for the problem, and randomly mutating/combining them to generate a next generation for which again the best are selected.
If the function is more or less continous (that is small mutations of good inputs generally won't generate bad inputs (small being a somewhat generic)), this would work reasonably well for your problem.
There is no generalized way to answer your question. There are lots of books/papers on the subject matter, but you'll have to choose your path according to your needs, which are not clearly spoken here.
Some things to know, however - 1min/test is way too much for any algorithm to handle. I guess that in your case, you must really do one of the following:
get 100 computers to cut your parameter testing time to some reasonable time
really try to work out your parameters by hand and mind. There must be some redundancy and at least some sanity check so you can test your case in <1min
for possible result sets, try to figure out some 'operations' that modify it slightly instead of just randomizing it. For example, in TSP some basic operator is lambda, that swaps two nodes and thus creates new route. Your can be shifting some number up/down for some value.
then, find yourself some nice algorithm, your starting point can be somewhere here. The book is invaluable resource for anyone who starts with problem-solving.

What are good algorithms for detecting abnormality?

Background
Here is the problem:
A black box outputs a new number each day.
Those numbers have been recorded for a period of time.
Detect when a new number from the black box falls outside the pattern of numbers established over the time period.
The numbers are integers, and the time period is a year.
Question
What algorithm will identify a pattern in the numbers?
The pattern might be simple, like always ascending or always descending, or the numbers might fall within a narrow range, and so forth.
Ideas
I have some ideas, but am uncertain as to the best approach, or what solutions already exist:
Machine learning algorithms?
Neural network?
Classify normal and abnormal numbers?
Statistical analysis?
Cluster your data.
If you don't know how many modes your data will have, use something like a Gaussian Mixture Model (GMM) along with a scoring function (e.g., Bayesian Information Criterion (BIC)) so you can automatically detect the likely number of clusters in your data. I recommend this instead of k-means if you have no idea what value k is likely to be. Once you've constructed a GMM for you data for the past year, given a new datapoint x, you can calculate the probability that it was generated by any one of the clusters (modeled by a Gaussian in the GMM). If your new data point has low probability of being generated by any one of your clusters, it is very likely a true outlier.
If this sounds a little too involved, you will be happy to know that the entire GMM + BIC procedure for automatic cluster identification has been implemented for you in the excellent MCLUST package for R. I have used it several times to great success for such problems.
Not only will it allow you to identify outliers, you will have the ability to put a p-value on a point being an outlier if you need this capability (or want it) at some point.
You could try line fitting prediction using linear regression and see how it goes, it would be fairly easy to implement in your language of choice.
After you fitted a line to your data, you could calculate the mean standard deviation along the line.
If the novel point is on the trend line +- the standard deviation, it should not be regarded as an abnormality.
PCA is an other technique that comes to mind, when dealing with this type of data.
You could also look in to unsuperviced learning. This is a machine learning technique that can be used to detect differences in larger data sets.
Sounds like a fun problem! Good luck
There is little magic in all the techniques you mention. I believe you should first try to narrow the typical abnormalities you may encounter, it helps keeping things simple.
Then, you may want to compute derived quantities relevant to those features. For instance: "I want to detect numbers changing abruptly direction" => compute u_{n+1} - u_n, and expect it to have constant sign, or fall in some range. You may want to keep this flexible, and allow your code design to be extensible (Strategy pattern may be worth looking at if you do OOP)
Then, when you have some derived quantities of interest, you do statistical analysis on them. For instance, for a derived quantity A, you assume it should have some distribution P(a, b) (uniform([a, b]), or Beta(a, b), possibly more complex), you put a priori laws on a, b and you ajust them based on successive information. Then, the posterior likelihood of the info provided by the last point added should give you some insight about it being normal or not. Relative entropy between posterior and prior law at each step is a good thing to monitor too. Consult a book on Bayesian methods for more info.
I see little point in complex traditional machine learning stuff (perceptron layers or SVM to cite only them) if you want to detect outliers. These methods work great when classifying data which is known to be reasonably clean.

Modeling distribution of performance measurements

How would you mathematically model the distribution of repeated real life performance measurements - "Real life" meaning you are not just looping over the code in question, but it is just a short snippet within a large application running in a typical user scenario?
My experience shows that you usually have a peak around the average execution time that can be modeled adequately with a Gaussian distribution. In addition, there's a "long tail" containing outliers - often with a multiple of the average time. (The behavior is understandable considering the factors contributing to first execution penalty).
My goal is to model aggregate values that reasonably reflect this, and can be calculated from aggregate values (like for the Gaussian, calculate mu and sigma from N, sum of values and sum of squares). In other terms, number of repetitions is unlimited, but memory and calculation requirements should be minimized.
A normal Gaussian distribution can't model the long tail appropriately and will have the average biased strongly even by a very small percentage of outliers.
I am looking for ideas, especially if this has been attempted/analysed before. I've checked various distributions models, and I think I could work out something, but my statistics is rusty and I might end up with an overblown solution. Oh, a complete shrink-wrapped solution would be fine, too ;)
Other aspects / ideas: Sometimes you get "two humps" distributions, which would be acceptable in my scenario with a single mu/sigma covering both, but ideally would be identified separately.
Extrapolating this, another approach would be a "floating probability density calculation" that uses only a limited buffer and adjusts automatically to the range (due to the long tail, bins may not be spaced evenly) - haven't found anything, but with some assumptions about the distribution it should be possible in principle.
Why (since it was asked) -
For a complex process we need to make guarantees such as "only 0.1% of runs exceed a limit of 3 seconds, and the average processing time is 2.8 seconds". The performance of an isolated piece of code can be very different from a normal run-time environment involving varying levels of disk and network access, background services, scheduled events that occur within a day, etc.
This can be solved trivially by accumulating all data. However, to accumulate this data in production, the data produced needs to be limited. For analysis of isolated pieces of code, a gaussian deviation plus first run penalty is ok. That doesn't work anymore for the distributions found above.
[edit] I've already got very good answers (and finally - maybe - some time to work on this). I'm starting a bounty to look for more input / ideas.
Often when you have a random value that can only be positive, a log-normal distribution is a good way to model it. That is, you take the log of each measurement, and assume that is normally distributed.
If you want, you can consider that to have multiple humps, i.e. to be the sum of two normals having different mean. Those are a bit tricky to estimate the parameters of, because you may have to estimate, for each measurement, its probability of belonging to each hump. That may be more than you want to bother with.
Log-normal distributions are very convenient and well-behaved. For example, you don't deal with its average, you deal with it's geometric mean, which is the same as its median.
BTW, in pharmacometric modeling, log-normal distributions are ubiquitous, modeling such things as blood volume, absorption and elimination rates, body mass, etc.
ADDED: If you want what you call a floating distribution, that's called an empirical or non-parametric distribution. To model that, typically you save the measurements in a sorted array. Then it's easy to pick off the percentiles. For example the median is the "middle number". If you have too many measurements to save, you can go to some kind of binning after you have enough measurements to get the general shape.
ADDED: There's an easy way to tell if a distribution is normal (or log-normal). Take the logs of the measurements and put them in a sorted array. Then generate a QQ plot (quantile-quantile). To do that, generate as many normal random numbers as you have samples, and sort them. Then just plot the points, where X is the normal distribution point, and Y is the log-sample point. The results should be a straight line. (A really simple way to generate a normal random number is to just add together 12 uniform random numbers in the range +/- 0.5.)
The problem you describe is called "Distribution Fitting" and has nothing to do with performance measurements, i.e. this is generic problem of fitting suitable distribution to any gathered/measured data sample.
The standard process is something like that:
Guess the best distribution.
Run hypothesis tests to check how well it describes gathered data.
Repeat 1-3 if not well enough.
You can find interesting article describing how this can be done with open-source R software system here. I think especially useful to you may be function fitdistr.
In addition to already given answers consider Empirical Distributions. I have successful experience in using empirical distributions for performance analysis of several distributed systems. The idea is very straightforward. You need to build histogram of performance measurements. Measurements should be discretized with given accuracy. When you have histogram you could do several useful things:
calculate the probability of any given value (you are bound by accuracy only);
build PDF and CDF functions for the performance measurements;
generate sequence of response times according to a distribution. This one is very useful for performance modeling.
Try whit gamma distribution http://en.wikipedia.org/wiki/Gamma_distribution
From wikipedia
The gamma distribution is frequently a probability model for waiting times; for instance, in life testing, the waiting time until death is a random variable that is frequently modeled with a gamma distribution.
The standard for randomized Arrival times for performance modelling is either Exponential distribution or Poisson distribution (which is just the distribution of multiple Exponential distributions added together).
Not exactly answering your question, but relevant still: Mor Harchol-Balter did a very nice analysis of the size of jobs submitted to a scheduler, The effect of heavy-tailed job size distributions on computer systems design (1999). She found that the size of jobs submitted to her distributed task assignment system took a power-law distribution, which meant that certain pieces of conventional wisdom she had assumed in the construction of her task assignment system, most importantly that the jobs should be well load balanced, had awful consequences for submitters of jobs. She's done good follor-up work on this issue.
The broader point is, you need to ask such questions as:
What happens if reasonable-seeming assumptions about the distribution of performance, such as that they take a normal distribution, break down?
Are the data sets I'm looking at really representative of the problem I'm trying to solve?

Resources