Automatic probability densities - probability

I have found automatic differentiation to be extremely useful when writing mathematical software. I now have to work with random variables and functions of the random variables, and it seems to me that an approach similar to automatic differentiation could be used for this, too.
The idea is to start with a basic random vector with given multivariate distribution and then you want to work with the implied probability distributions of functions of components of the random vector. The idea is to define operators that automatically combine two probability distributions appropriately when you add, multiply, divide two random variables and transform the distribution appropriately when you apply scalar functions such as exponentiation. You could then combine these to build any function you need of the original random variables and automatically have the corresponding probability distribution available.
Does this sound feasible? If not, why not? If so and since it's not a particularly original thought, could someone point me to an existing implementation, preferably in C

There has been a lot of work on probabilistic programming. One issue is that as your distribution gets more complicated you start needing more complex techniques to sample from it.
There are a number of ways this is done. Probabilistic graphical models gives one vocabulary for expressing these models, and you can then sample from them using various Metropolis-Hastings-style methods. Here is a crash course.
Another model is Probabilistic Programming, which can be done through an embedded domain specific language, directly. Oleg Kiselyov's HANSEI is an example of this approach. Once they have the program they can inspect the tree of decisions and expand them out by a form of importance sampling to gain the most information possible at each step.
You may also want to read "Nonstandard Interpretations of Probabilistic
Programs for Efficient Inference" by Wingate et al. which describes one way to use extra information about the derivative of your distribution to accelerate Metropolis-Hastings-style sampling techniques. I personally use automatic differentiation to calculate those derivatives and this brings the topic back to automatic-differentiation. ;)

Related

Will non-linear regression algorithms perform better if trained with normally distributed target values?

After finding out about many transformations that can be applied on the target values(y column), of a data set, such as box-cox transformations I learned that linear regression models need to be trained with normally distributed target values in order to be efficient.(https://stats.stackexchange.com/questions/298/in-linear-regression-when-is-it-appropriate-to-use-the-log-of-an-independent-va)
I'd like to know if the same applies for non-linear regression algorithms. For now I've seen people on kaggle use log transformation for mitigation of heteroskedasticity, by using xgboost, but they never mention if it is also being done for getting normally distributed target values.
I've tried to do some research and I found in Andrew Ng's lecture notes(http://cs229.stanford.edu/notes/cs229-notes1.pdf) on page 11 that the least squares cost function, used by many algorithms linear and non-linear, is derived by assuming normal distribution of the error. I believe if the error should be normally distributed then the target values should be as well.
If this is true then all the regression algorithms using least squares cost function should work better with normally distributed target values.
Since xgboost uses least squares cost function for node splitting(http://cilvr.cs.nyu.edu/diglib/lsml/lecture03-trees-boosting.pdf - slide 13) then maybe this algorithm would work better if I transform the target values using box-cox transformations for training the model and then apply inverse box-cox transformations on the output in order to get the predicted values.
Will this theoretically speaking give better results?
Your conjecture "I believe if the error should be normally distributed then the target values should be as well." is totally wrong. So your question does not have any answer at all since it is not a valid question.
There are no assumptions on the target variable to be Normal at all.
Getting the target variable transformed does not mean the errors are normally distributed. In fact, that may ruin normality.
I have no idea what this is supposed to mean: "linear regression models need to be trained with normally distributed target values in order to be efficient." Efficient in what way?
Linear regression models are global models. They simply fit a surface to the overall data. The operations are matrix operations, so the time to "train" the model depends only on the size of data. The distribution of the target has nothing to do with model building performance. And, it has nothing to do with model scoring performance either.
Because targets are generally not normally distributed, I would certainly hope that such a distribution is not required for a machine learning algorithm to work effectively.

algorithm to combine data for linear fit?

I'm not sure if this is the best place to ask this, but you guys have been helpful with plenty of my CS homework in the past so I figure I'll give it a shot.
I'm looking for an algorithm to blindly combine several dependent variables into an index that produces the best linear fit with an external variable. Basically, it would combine the dependent variables using different mathematical operators, include or not include each one, etc. until an index is developed that best correlates with my external variable.
Has anyone seen/heard of something like this before? Even if you could point me in the right direction or to the right place to ask, I would appreciate it. Thanks.
Sounds like you're trying to do Multivariate Linear Regression or Multiple Regression. The simplest method (Read: less accurate) to do this is to individually compute the linear regression lines of each of the component variables and then do a weighted average of each of the lines. Beyond that I am afraid I will be of little help.
This appears to be simple linear regression using multiple explanatory variables. As the implication here is that you are using a computational approach you could do something as simple apply a linear model to your data using every possible combination of your explanatory variables that you have (whether you want to include interaction effects is your choice), choose a goodness of fit measure (R^2 being just one example) and use that to rank the fit of each model you fit?? The quality of a model is also somewhat subjective in many fields - you could reject a model containing 15 variables if it only moderately improves the fit over a far simpler model just containing 3 variables. If you have not read it already I don't doubt that you will find many useful suggestions in the following text :
Draper, N.R. and Smith, H. (1998).Applied Regression Analysis Wiley Series in Probability and Statistics
You might also try doing a google for the LASSO method of model selection.
The thing you're asking for is essentially the entirety of regression analysis.
this is what linear regression does, and this is a good portion of what "machine learning" does (machine learning is basically just a name for more complicated regression and classification algorithms). There are hundreds or thousands of different approaches with various tradeoffs, but the basic ones frequently work quite well.
If you want to learn more, the coursera course on machine learning is a great place to get a deeper understanding of this.

What are the differences between genetic algorithms and evolution strategies?

I've read a couple of introductory sections of books as well as a few papers on both topics, and it looks to me that these two methods are pretty much exactly the same. That said, I haven't had the time to actually deeply research the topics yet, so I might be wrong.
What are the distinctions between genetic algorithms and evolution strategies? What makes them different, and where are they similar?
In evolution strategies, the individuals are coded as vectors of real numbers. On reproduction, parents are selected randomly and the fittest offsprings are selected and inserted in the next generation. ES individuals are self-adapting. The step size or "mutation strength" is encoded in the individual, so good parameters get to the next generation by selecting good individuals.
In genetic algorithms, the individuals are coded as integers. The selection is done by selecting parents proportional to their fitness. So individuals must be evaluated before the first selection is done. Genetic operators work on the bit-level (e.g. cutting a bit string into multiple pieces and interchange them with the pieces of the other parent or switching single bits).
That's the theory. In practice, it is sometimes hard to distinguish between both evolutionary algorithms, and you need to create hybrid algorithms (e.g. integer (bit-string) individuals that encodes the parameters of the genetic operators).
Just stumbled on this thread when researching Evolution Strategies (ES).
As Paul noticed before, the encoding is not really the difference here, as this is an implementation detail of specific algorithms, although it seems more common in ES.
To answer the question, we first need to do a small step back and look at internals of an ES algorithm.
In ES there is a concept of endogenous and exogenous parameters of the evolution. Endogenous parameters are associated with individuals and therefore are evolved together with them, exogenous are provided from "outside" (e.g. set constant by the developer, or there can be a function/policy which sets their value depending on the iteration no).
The individual k consists therefore of two parts:
y(k) - a set of object parameters (e.g. a vector of real/int values) which denote the individual genotype
s(k) - a set of strategy parameters (e.g. a vector of real/int values again) which e.g. can control statistical properties of mutation)
Those two vectors are being selected, mutated, recombined together.
The main difference between GA and ES is that in classic GA there is no distinction between types of algorithm parameters. In fact all the parameters are set from "outside", so in ES terms are exogenous.
There are also other minor differences, e.g. in ES the selection policy is usually one and the same and in GA there are multiple different approaches which can be interchanged.
You can find a more detailed explanation here (see Chapter 3): Evolution strategies. A comprehensive introduction
In most newer textbooks on GA, real-valued coding is introduced as an alternative to the integer one, i.e. individuals can be coded as vectors of real numbers. This is called continuous parameter GA (see e.g. Haupt & Haupt, "Practical Genetic Algorithms", J.Wiley&Sons, 1998). So this is practically identical to ES real number coding.
With respect to parent selection, there are many different strategies published for GA's. I don't know them all, but I assume selection among all (not only the best has been used for some applications).
The main difference seems to be that a genetic algorithm represents a solution using a sequence of integers, whereas an evolution strategy uses a sequence of real numbers -- reference: http://en.wikipedia.org/wiki/Evolutionary_algorithm#
As the wikipedia source (http://en.wikipedia.org/wiki/Genetic_algorithm) and #Vaughn Cato said the difference in both techniques relies on the implementation. EA use
real numbers and GA use integers.
However, in practice I think you could use integers or real numbers in the formulation of your problem and in your program. It depends on you. For instance, for protein folding you can say the set of dihedral angles form a vector. This is a vector of real numbers, but the entries
are labeled by integers so I think you can formulate your problem and write you program based
on an integer arithmetic. It is just an idea.

Bimodal distribution characterization algorithm?

What algorithms can be used to characterize an expected clearly bimodal distribution, say a mixture of 2 normal distributions with well separated peaks, in an array of samples? Something that spits out 2 means, 2 standard deviations, and some sort of robustness estimate, would be the desired result.
I am interested in an algorithm that can be implemented in any programming language (for an embedded controller), not an existing C or Python library or stat package.
Would it be easier if I knew that the two modal means differ by a ratio of approximately 3:1 +- 50%, the standard deviations are "small" relative to the peak separation, but the pair of peaks could be anywhere in a 100:1 range?
There are two separate possibilities here. One is that you have a single distribution that is bimodal. The other is that you are observing data from two different distributions. The usual way to estimate the later is in something called, unsurprisingly, a mixture model.
Your approaches for estimating are to use a maximum likelihood approach or use Markov chain Monte Carlo methods if you want to take a Bayesian view of the problem. If you state your assumptions in a bit more detail I'd be willing to help try and figure out what objective function you'd want to try and maximize.
These type of models can be computationally intensive, so I am not sure you'd want to try and do the whole statistical approach in an embedded controller. A hack might be a better fit. If the peaks are in fact well separated, I think it would be easier to try and identify the two peaks and split your data between them and do the estimation of the mean and standard deviation for each distribution independently.

What are good algorithms for detecting abnormality?

Background
Here is the problem:
A black box outputs a new number each day.
Those numbers have been recorded for a period of time.
Detect when a new number from the black box falls outside the pattern of numbers established over the time period.
The numbers are integers, and the time period is a year.
Question
What algorithm will identify a pattern in the numbers?
The pattern might be simple, like always ascending or always descending, or the numbers might fall within a narrow range, and so forth.
Ideas
I have some ideas, but am uncertain as to the best approach, or what solutions already exist:
Machine learning algorithms?
Neural network?
Classify normal and abnormal numbers?
Statistical analysis?
Cluster your data.
If you don't know how many modes your data will have, use something like a Gaussian Mixture Model (GMM) along with a scoring function (e.g., Bayesian Information Criterion (BIC)) so you can automatically detect the likely number of clusters in your data. I recommend this instead of k-means if you have no idea what value k is likely to be. Once you've constructed a GMM for you data for the past year, given a new datapoint x, you can calculate the probability that it was generated by any one of the clusters (modeled by a Gaussian in the GMM). If your new data point has low probability of being generated by any one of your clusters, it is very likely a true outlier.
If this sounds a little too involved, you will be happy to know that the entire GMM + BIC procedure for automatic cluster identification has been implemented for you in the excellent MCLUST package for R. I have used it several times to great success for such problems.
Not only will it allow you to identify outliers, you will have the ability to put a p-value on a point being an outlier if you need this capability (or want it) at some point.
You could try line fitting prediction using linear regression and see how it goes, it would be fairly easy to implement in your language of choice.
After you fitted a line to your data, you could calculate the mean standard deviation along the line.
If the novel point is on the trend line +- the standard deviation, it should not be regarded as an abnormality.
PCA is an other technique that comes to mind, when dealing with this type of data.
You could also look in to unsuperviced learning. This is a machine learning technique that can be used to detect differences in larger data sets.
Sounds like a fun problem! Good luck
There is little magic in all the techniques you mention. I believe you should first try to narrow the typical abnormalities you may encounter, it helps keeping things simple.
Then, you may want to compute derived quantities relevant to those features. For instance: "I want to detect numbers changing abruptly direction" => compute u_{n+1} - u_n, and expect it to have constant sign, or fall in some range. You may want to keep this flexible, and allow your code design to be extensible (Strategy pattern may be worth looking at if you do OOP)
Then, when you have some derived quantities of interest, you do statistical analysis on them. For instance, for a derived quantity A, you assume it should have some distribution P(a, b) (uniform([a, b]), or Beta(a, b), possibly more complex), you put a priori laws on a, b and you ajust them based on successive information. Then, the posterior likelihood of the info provided by the last point added should give you some insight about it being normal or not. Relative entropy between posterior and prior law at each step is a good thing to monitor too. Consult a book on Bayesian methods for more info.
I see little point in complex traditional machine learning stuff (perceptron layers or SVM to cite only them) if you want to detect outliers. These methods work great when classifying data which is known to be reasonably clean.

Resources