Representing continuous probability distributions


I have a problem involving a collection of continuous probability distribution functions, most of which are determined empirically (e.g. departure times, transit times). What I need is some way of taking two of these PDFs and doing arithmetic on them. E.g. if I have two values x taken from PDF X, and y taken from PDF Y, I need to get the PDF for (x+y), or any other operation f(x,y).
An analytical solution is not possible, so what I'm looking for is some representation of PDFs that allows such things. An obvious (but computationally expensive) solution is monte-carlo: generate lots of values of x and y, and then just measure f(x, y). But that takes too much CPU time.
I did think about representing the PDF as a list of ranges where each range has a roughly equal probability, effectively representing the PDF as the union of a list of uniform distributions. But I can't see how to combine them.
Does anyone have any good solutions to this problem?
Edit: The goal is to create a mini-language (aka Domain Specific Language) for manipulating PDFs. But first I need to sort out the underlying representation and algorithms.
Edit 2: dmckee suggests a histogram implementation. That is what I was getting at with my list of uniform distributions. But I don't see how to combine them to create new distributions. Ultimately I need to find things like P(x < y) in cases where this may be quite small.
Edit 3: I have a bunch of histograms. They are not evenly spaced because I'm generating them from occurrence data, so if I have 100 samples and I want ten bars in the histogram then I allocate 10 samples to each bar, and make the bars variable width but constant area.
I've figured out that to add PDFs you convolve them, and I've boned up on the maths for that. When you convolve two uniform distributions you get a new distribution with three sections: the wider uniform distribution is still there, but with a triangle stuck on each side the width of the narrower one. So if I convolve each element of X and Y I'll get a bunch of these, all overlapping. Now I'm trying to figure out how to sum them all and then get a histogram that is the best approximation to it.
I'm beginning to wonder if Monte-Carlo wasn't such a bad idea after all.
Edit 4: This paper discusses convolutions of uniform distributions in some detail. In general you get a "trapezoid" distribution. Since each "column" in the histograms is a uniform distribution, I had hoped that the problem could be solved by convolving these columns and summing the results.
However the result is considerably more complex than the inputs, and also includes triangles. Edit 5: [Wrong stuff removed]. But if these trapezoids are approximated to rectangles with the same area then you get the Right Answer, and reducing the number of rectangles in the result looks pretty straightforward too. This might be the solution I've been trying to find.
Edit 6: Solved! Here is the final Haskell code for this problem:
import Data.Function (on)
import Data.List (groupBy, sortBy)
import Data.Ord (comparing)

-- | Continuous distributions of scalars are represented as a
-- histogram where each bar has approximately constant area but
-- variable width and height. A histogram with N bars is stored as
-- a list of N+1 values.
data Continuous = C {
    cN :: Int,
    -- ^ Number of bars in the histogram.
    cAreas :: [Double],
    -- ^ Areas of the bars. @length cAreas == cN@
    cBars :: [Double]
    -- ^ Boundaries of the bars. @length cBars == cN + 1@
    } deriving (Show, Read)
{- | Add distributions. If two random variables @vX@ and @vY@ are
taken from distributions @x@ and @y@ respectively then the
distribution of @(vX + vY)@ will be @(x .+. y)@.

This is implemented as the convolution of distributions x and y.
Each is a histogram, which is to say the sum of a collection of
uniform distributions (the "bars"). Therefore the convolution can be
computed as the sum of the convolutions of the cross product of the
components of x and y.

When you convolve two uniform distributions of unequal size you get a
trapezoidal distribution. Let p = p2-p1, q = q2-q1. Then we get:

>        |                          |
>        |     ______               |
>        |     |    |      with     |  _____________
>        |     |    |               |  |           |
>        +-----+----+-------        +--+-----------+-
>              p1   p2                 q1          q2
>
> gives h|....... _______________
>        |       /:             :\
>        |      / :             : \             1
>        |     /  :             :  \  where h = -
>        |    /   :             :   \           q
>        |   /    :             :    \
>        +--+-----+-------------+-----+-----
>         p1+q1 p2+q1         p1+q2 p2+q2

However we cannot keep the trapezoid in the final result because our
representation is restricted to uniform distributions. So instead we
store a uniform approximation to the trapezoid with the same area:

>       h|......___________________
>        |     | /               \ |
>        |     |/                 \|
>        |     |                   |
>        |    /|                   |\
>        |   / |                   | \
>        +-----+-------------------+--------
>          p1+q1+p/2           p2+q2-p/2
-}
(.+.) :: Continuous -> Continuous -> Continuous
c .+. d = C {cN = length bars - 1,
             cBars = map fst bars,
             cAreas = zipWith barArea bars (tail bars)}
  where
    -- The convolve function returns a list of two (x, deltaY) pairs.
    -- These can be sorted by x and then sequentially summed to get
    -- the new histogram. The "b" parameter is the product of the
    -- areas of the input bars, which was omitted from the diagrams
    -- above.
    convolve b c1 c2 d1 d2 =
      if (c2-c1) < (d2-d1)
        then convolve1 b c1 c2 d1 d2
        else convolve1 b d1 d2 c1 c2
    convolve1 b p1 p2 q1 q2 =
      [(p1+q1+halfP, h), (p2+q2-halfP, (-h))]
      where
        halfP = (p2-p1)/2
        h = b / (q2-q1)
    outline = map sumGroup $ groupBy ((==) `on` fst) $ sortBy (comparing fst)
              $ concat
              [convolve (areaC*areaD) c1 c2 d1 d2 |
               (c1, c2, areaC) <- zip3 (cBars c) (tail $ cBars c) (cAreas c),
               (d1, d2, areaD) <- zip3 (cBars d) (tail $ cBars d) (cAreas d)
              ]
    sumGroup pairs = (fst $ head pairs, sum $ map snd pairs)
    bars = tail $ scanl (\(_,y) (x2,dy) -> (x2, y+dy)) (0, 0) outline
    barArea (x1, h) (x2, _) = (x2 - x1) * h
Other operators are left as an exercise for the reader.
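For readers who do not use Haskell, the same rectangle approximation can be sketched in Python. This is only a rough port under the same assumptions as the code above (independent variables, each trapezoid replaced by a rectangle of equal area); the name add_histograms and the (bars, areas) list layout are illustrative choices, not part of the original code.

```python
from collections import defaultdict


def add_histograms(x_bars, x_areas, y_bars, y_areas):
    """Approximate the histogram of X + Y for independent X and Y.

    Each histogram is given as N+1 bar boundaries and N bar areas.
    Returns (boundaries, areas) of the resulting histogram.
    """
    steps = defaultdict(float)  # x position -> change in height at x
    for c1, c2, area_c in zip(x_bars, x_bars[1:], x_areas):
        for d1, d2, area_d in zip(y_bars, y_bars[1:], y_areas):
            # p is the narrower of the two bars, q the wider one.
            (p1, p2), (q1, q2) = sorted(
                [(c1, c2), (d1, d2)], key=lambda bar: bar[1] - bar[0]
            )
            half_p = (p2 - p1) / 2
            # Rectangle of width q with the same area as the trapezoid.
            h = area_c * area_d / (q2 - q1)
            steps[p1 + q1 + half_p] += h
            steps[p2 + q2 - half_p] -= h

    xs = sorted(steps)
    heights = []
    level = 0.0
    for x in xs:
        level += steps[x]  # running sum of the height changes
        heights.append(level)
    areas = [(x2 - x1) * h for x1, x2, h in zip(xs, xs[1:], heights)]
    return xs, areas
```

Unlike the Haskell version, this sketch does not reduce the number of bars afterwards, so repeated additions will keep growing the histogram.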

No need for histograms or symbolic computation: everything can be done at the language level in closed form, if the right point of view is taken.
[I shall use the term "measure" and "distribution" interchangeably. Also, my Haskell is rusty and I ask you to forgive me for being imprecise in this area.]
Probability distributions are really codata.
Let mu be a probability measure. The only thing you can do with a measure is integrate it against a test function (this is one possible mathematical definition of "measure"). Note that this is what you will eventually do: for instance integrating against identity is taking the mean:
mean :: Measure -> Double
mean mu = mu id
another example:
variance :: Measure -> Double
variance mu = (mu $ \x -> x ^ 2) - (mean mu) ^ 2
another example, which computes P(mu < x):
cdf :: Measure -> Double -> Double
cdf mu x = mu $ \z -> if z < x then 1 else 0
This suggests an approach by duality.
The type Measure shall therefore denote the type (Double -> Double) -> Double. This allows you to model results of MC simulation, numerical/symbolic quadrature against a PDF, etc. For instance, the function
empirical :: [Double] -> Measure
-- Average of f over the samples, so that the measure has total mass 1.
empirical xs f = sum (map f xs) / fromIntegral (length xs)
returns the integral of f against an empirical measure obtained by eg. MC sampling. Also
from_pdf :: (Double -> Double) -> Measure
from_pdf rho f = my_favorite_quadrature_method rho f
construct measures from (regular) densities.
Now, the good news. If mu and nu are two measures, the convolution mu ** nu is given by:
(mu ** nu) f = nu $ \y -> (mu $ \x -> f $ x + y)
So, given two measures, you can integrate against their convolution.
Also, given a random variable X of law mu, the law of a * X is given by:
rescale :: Double -> Measure -> Measure
rescale a mu f = mu $ \x -> f(a * x)
Also, the distribution of phi(X) is given by the image measure phi_* X, in our framework:
apply :: (Double -> Double) -> Measure -> Measure
apply phi mu f = mu $ f . phi
So now you can easily work out an embedded language for measures. There is much more to do here, particularly with respect to sample spaces other than the real line, dependencies between random variables, and conditioning, but I hope you get the point.
In particular, the pushforward is functorial:
newtype Measure a = Measure { integrate :: (a -> Double) -> Double }
instance Functor Measure where
  fmap phi (Measure mu) = Measure (\f -> mu (f . phi))
It is a monad too (exercise -- hint: this very much looks like the continuation monad. What is return ? What is the analog of call/cc ?).
Also, combined with a differential geometry framework, this can probably be turned into something that computes Bayesian posterior distributions automatically.
At the end of the day, you can write stuff like
m = mean $ apply cos ((from_pdf gauss) ** (empirical data))
to compute the mean of cos(X + Y) where X has pdf gauss and Y has been sampled by a MC method whose results are in data.
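The same idea carries over to any language with closures. Below is a minimal Python sketch of the definitions above, assuming a measure is simply a function that takes a test function and returns a number. The midpoint rule inside from_pdf and its lo, hi and steps parameters are stand-ins for "my favorite quadrature method", not something the answer prescribes.

```python
import math


def empirical(samples):
    """Measure that averages a test function over the samples."""
    n = len(samples)
    return lambda f: sum(f(x) for x in samples) / n


def from_pdf(rho, lo, hi, steps=1000):
    """Measure with density rho on [lo, hi], using the midpoint rule."""
    width = (hi - lo) / steps

    def mu(f):
        total = 0.0
        for k in range(steps):
            x = lo + (k + 0.5) * width
            total += f(x) * rho(x)
        return total * width

    return mu


def convolve(mu, nu):
    """Law of X + Y for independent X ~ mu and Y ~ nu."""
    return lambda f: nu(lambda y: mu(lambda x: f(x + y)))


def apply(phi, mu):
    """Image (pushforward) measure: the law of phi(X) for X ~ mu."""
    return lambda f: mu(lambda x: f(phi(x)))


def mean(mu):
    return mu(lambda x: x)


def gauss(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)


# Mean of cos(X + Y), X standard normal, Y given by three samples.
m = mean(apply(math.cos, convolve(from_pdf(gauss, -8.0, 8.0), empirical([0.1, 0.4, 0.7]))))
```

Note that each integration against a convolution costs the product of the costs of the two underlying measures, so deeply nested expressions get expensive quickly.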

Probability distributions form a monad; see eg the work of Claire Jones and also the LICS 1989 paper, but the ideas go back to a 1982 paper by Giry (DOI 10.1007/BFb0092872) and to a 1962 note by Lawvere that I cannot track down (http://permalink.gmane.org/gmane.science.mathematics.categories/6541).
But I don't see the comonad: there's no way to get an "a" out of an "(a->Double)->Double". Perhaps if you make it polymorphic - (a->r)->r for all r? (That's the continuation monad.)

Is there anything that stops you from employing a mini-language for this?
By that I mean, define a language that lets you write f = x + y and evaluates f for you just as written. And similarly for g = x * z, h = y(x), etc. ad nauseam. (The semantics I'm suggesting call for the evaluator to select a random number from each innermost PDF appearing on the RHS at evaluation time, and not to try to understand the composed form of the resulting PDFs. This may not be fast enough...)
Assuming that you understand the precision limits you need, you can represent a PDF fairly simply with a histogram or spline (the former being a degenerate case of the latter). If you need to mix analytically defined PDFs with experimentally determined ones, you'll have to add a type mechanism.
A histogram is just an array, the contents of which represent the incidence in a particular region of the input range. You haven't said if you have a language preference, so I'll assume something C-like. You need to know the bin structure (uniform sizes are easy, but not always best) including the high and low limits and possibly the normalization:
struct histogram_struct {
int bins; /* Assumed to be uniform */
double low;
double high;
/* double normalization; */
/* double *errors; */ /* if using, initialize with enough space,
* and store _squared_ errors
*/
double contents[];
};
This kind of thing is very common in scientific analysis software, and you might want to use an existing implementation.

I worked on similar problems for my dissertation.
One way to compute approximate convolutions is to take the Fourier transform of the density functions (histograms in this case), multiply them, then take the inverse Fourier transform to get the convolution.
Look at Appendix C of my dissertation for formulas for various special cases of operations on probability distributions. You can find the dissertation at: http://riso.sourceforge.net
I wrote Java code to carry out those operations. You can find the code at: https://sourceforge.net/projects/riso
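To make the Fourier route concrete, here is a small self-contained Python sketch. It assumes both histograms use the same uniform bin width, so that the bin probabilities can be treated as plain sequences, and it uses a textbook recursive radix-2 FFT rather than any particular library. The function names are illustrative and are not taken from the dissertation or the Java code.

```python
import cmath


def fft(values, invert=False):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def convolve_fft(p, q):
    """Convolve two sequences of bin probabilities via the FFT."""
    size = len(p) + len(q) - 1
    n = 1
    while n < size:
        n *= 2  # zero-pad to a power of two to avoid wrap-around
    fp = fft([complex(v) for v in p] + [0j] * (n - len(p)))
    fq = fft([complex(v) for v in q] + [0j] * (n - len(q)))
    product = [a * b for a, b in zip(fp, fq)]
    result = fft(product, invert=True)
    return [v.real / n for v in result[:size]]
```

For two histograms with a handful of bins the direct double loop is just as good; the transform only pays off when the number of bins is large.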

Autonomous mobile robotics deals with similar issues in localization and navigation, in particular Markov localization and the Kalman filter (sensor fusion). See "An experimental comparison of localization methods continued", for example.
Another approach you could borrow from mobile robots is path planning using potential fields.

A couple of responses:
1) If you have empirically determined PDFs then either you have histograms or you have an approximation to a parametric PDF. A PDF is a continuous function and you don't have infinite data...
2) Let's assume that the variables are independent. Then if you make the PDFs discrete, P(f(x,y) = z) is p(x,y) = p(x)p(y) summed over all the combinations of x and y such that f(x,y) = z.
If you are going to fit the empirical PDFs to standard PDFs, e.g. the normal distribution, then you can use already-determined functions to figure out the sum, etc.
If the variables are not independent, then you have more trouble on your hands and I think you have to use copulas.
I think that defining your own mini-language, etc., is overkill. You can do this with arrays...
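As a rough illustration of point 2, the sum over combinations can be written with plain dictionaries. The sketch below assumes independent variables and discrete distributions stored as value-to-probability mappings; the name combine is made up for this example.

```python
from fractions import Fraction


def combine(px, py, f):
    """Distribution of f(X, Y) for independent discrete X and Y.

    px and py map each value to its probability.
    """
    out = {}
    for x, p in px.items():
        for y, q in py.items():
            z = f(x, y)
            out[z] = out.get(z, 0) + p * q
    return out


# Two fair dice, using exact fractions to avoid rounding noise.
die = {face: Fraction(1, 6) for face in range(1, 7)}
total = combine(die, die, lambda x, y: x + y)
less = combine(die, die, lambda x, y: x < y)
```

Here total[7] is the probability that the two dice sum to 7, and less[True] is P(x < y). The cost is the product of the two table sizes, which is the price of handling an arbitrary f.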

Some initial thoughts:
First, Mathematica has a nice facility for doing this with exact distributions.
Second, representation as histograms (ie, empirical PDFs) is problematic since you have to make choices about bin size. That can be avoided by storing a cumulative distribution instead, ie, an empirical CDF. (In fact, you then retain the ability to recreate the full data set of samples that the empirical distribution is based on.)
Here's some ugly Mathematica code to take a list of samples and return an empirical CDF, namely a list of value-probability pairs. Run the output of this through ListPlot to see a plot of the empirical CDF.
empiricalCDF[t_] :=
Flatten[{{#[[2,1]],#[[1,2]]},#[[2]]}&/@Partition[Prepend[Transpose[{#[[1]],
Rest[FoldList[Plus,0,#[[2]]]]/Length[t]}&[Transpose[{First[#],Length[#]}&/@
Split[Sort[t]]]]],{Null,0}],2,1],1]
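For comparison, a simplified Python sketch of the same idea follows. It returns one (value, cumulative probability) pair per distinct sample value and does not reproduce the extra points the Mathematica version emits to draw the steps; the function name is illustrative.

```python
from itertools import groupby


def empirical_cdf(samples):
    """Return sorted (value, P(X <= value)) pairs for the samples."""
    n = len(samples)
    pairs = []
    seen = 0
    for value, group in groupby(sorted(samples)):
        seen += len(list(group))  # count repeated values together
        pairs.append((value, seen / n))
    return pairs
```

The jump sizes between consecutive pairs, multiplied by the sample count, recover how many times each value occurred, which is why no information is lost relative to the raw samples.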
Finally, here's some information on combining discrete probability distributions:
http://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/Chapter7.pdf

I think the histograms or the list of 1/N area regions is a good idea. For the sake of argument, I'll assume that you'll have a fixed N for all distributions.
Use the paper you linked in edit 4 to generate the new distribution. Then, approximate it with a new N-element distribution.
If you don't want N to be fixed, it's even easier. Take each convex polygon (trapezoid or triangle) in the new generated distribution and approximate it with a uniform distribution.

Another suggestion is to use kernel densities. Especially if you use Gaussian kernels, then they can be relatively easy to work with... except that the distributions quickly explode in size without care. Depending on the application, there are additional approximation techniques like importance sampling that can be used.
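A small sketch shows both why Gaussian kernels are convenient and why the size explodes. It assumes independent variables and represents a kernel density as a list of (weight, mean, variance) components, using the fact that the sum of two independent Gaussians is Gaussian with the means and variances added. The names are illustrative.

```python
def kde(samples, bandwidth):
    """Gaussian kernel density as (weight, mean, variance) components."""
    weight = 1.0 / len(samples)
    return [(weight, x, bandwidth ** 2) for x in samples]


def add_mixtures(a, b):
    """Mixture for X + Y, with X ~ a and Y ~ b independent.

    Every pair of components yields one Gaussian whose mean and
    variance are the sums of the pair's means and variances.
    """
    return [
        (wa * wb, ma + mb, va + vb)
        for wa, ma, va in a
        for wb, mb, vb in b
    ]


def mixture_mean(mixture):
    return sum(w * m for w, m, _ in mixture)


x = kde([0.0, 1.0, 2.0], bandwidth=0.5)
y = kde([10.0, 20.0], bandwidth=1.0)
s = add_mixtures(x, y)  # 3 * 2 = 6 components
```

Each addition multiplies the component counts, so some pruning or merging of components is needed in practice.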

If you want some fun, try representing them symbolically like Maple or Mathematica would do. Maple uses directed acyclic graphs, while Mathematica uses a list/Lisp-like approach (I believe, but it's been a loooong time since I even thought about this).
Do all your manipulations symbolically, then at the end push through numerical values. (Or just find a way to launch off in a shell and do the computations).
Paul.

Related

Extremely small or NaN values appear in training neural network

I'm trying to implement a neural network architecture in Haskell, and use it on MNIST.
I'm using the hmatrix package for linear algebra.
My training framework is built using the pipes package.
My code compiles and doesn't crash. But the problem is, certain combinations of layer size (say, 1000), minibatch size, and learning rate give rise to NaN values in the computations. After some inspection, I see that extremely small values (order of 1e-100) eventually appear in the activations. But, even when that doesn't happen, the training still doesn't work. There's no improvement over its loss or accuracy.
I checked and rechecked my code, and I'm at a loss as to what the root of the problem could be.
Here's the backpropagation training, which computes the deltas for each layer:
backward lf n (out,tar) das = do
  let δout = tr (derivate lf (tar, out)) -- dE/dy
      deltas = scanr (\(l, a') δ ->
                 let w = weights l
                 in (tr a') * (w <> δ)) δout (zip (tail $ toList n) das)
  return (deltas)
lf is the loss function, n is the network (weight matrix and bias vector for each layer), out and tar are the actual output of the network and the target (desired) output, and das are the activation derivatives of each layer.
In batch mode, out, tar are matrices (rows are output vectors), and das is a list of the matrices.
Here's the actual gradient computation:
grad lf (n, (i,t)) = do
  -- Forward propagation: compute layers outputs and activation derivatives
  let (as, as') = unzip $ runLayers n i
      (out) = last as
  (ds) <- backward lf n (out, t) (init as') -- Compute deltas with backpropagation
  let r = fromIntegral $ rows i -- Size of minibatch
  let gs = zipWith (\δ a -> tr (δ <> a)) ds (i:init as) -- Gradients for weights
  return $ GradBatch ((recip r .*) <$> gs, (recip r .*) <$> squeeze <$> ds)
Here, lf and n are the same as above, i is the input, and t is the target output (both in batch form, as matrices).
squeeze transforms a matrix into a vector by summing over each row. That is, ds is a list of matrices of deltas, where each column corresponds to the deltas for a row of the minibatch. So, the gradients for the biases are the average of the deltas over all the minibatch. The same thing for gs, which corresponds to the gradients for the weights.
Here's the actual update code:
move lr (n, (i,t)) (GradBatch (gs, ds)) = do
  -- Update function
  let update = (\(FC w b af) g δ -> FC (w + (lr).*g) (b + (lr).*δ) af)
      n' = Network.fromList $ zipWith3 update (Network.toList n) gs ds
  return (n', (i,t))
lr is the learning rate. FC is the layer constructor, and af is the activation function for that layer.
The gradient descent algorithm makes sure to pass in a negative value for the learning rate. The actual code for the gradient descent is simply a loop around a composition of grad and move, with a parameterized stop condition.
Finally, here's the code for a mean square error loss function:
mse :: (Floating a) => LossFunction a a
mse = let f (y,y') = let gamma = y'-y in gamma**2 / 2
          f' (y,y') = (y'-y)
      in Evaluator f f'
Evaluator just bundles a loss function and its derivative (for calculating the delta of the output layer).
The rest of the code is up on GitHub: NeuralNetwork.
So, if anyone has an insight into the problem, or even just a sanity check that I'm correctly implementing the algorithm, I'd be grateful.

Do you know about "vanishing" and "exploding" gradients in backpropagation? I'm not too familiar with Haskell so I can't easily see what exactly your backprop is doing, but it does look like you are using a logistic curve as your activation function.
If you look at the plot of this function you'll see that the gradient of this function is nearly 0 at the ends (as input values get very large or very small, the slope of the curve is almost flat), so multiplying or dividing by this during backpropagation will result in a very big or very small number. Doing this repeatedly as you pass through multiple layers causes the activations to approach zero or infinity. Since backprop updates your weights by doing this during training, you end up with a lot of zeros or infinities in your network.
Solution: there are loads of methods out there that you can search for to solve the vanishing gradient problem, but one easy thing to try is to change the type of activation function you are using to a non-saturating one. ReLU is a popular choice as it mitigates this particular problem (but might introduce others).
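To get a feel for the numbers, here is a tiny Python sketch of how the logistic derivative shrinks a gradient as it passes through layers. It is a deliberate simplification: it ignores the weight matrices and assumes every layer sees the same pre-activation z, so it only illustrates the saturation effect and says nothing about the specific network in the question.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the logistic function; its maximum is 0.25 at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


def chained_gradient(z, layers):
    """Product of the activation derivatives across `layers` layers."""
    return sigmoid_grad(z) ** layers


best_case = chained_gradient(0.0, 10)   # 0.25 ** 10, already below 1e-6
saturated = chained_gradient(10.0, 10)  # far smaller still
```

Even in the best case the factor is at most 0.25 per layer, which is why a non-saturating activation is often the first thing to try.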

How to generate correlated Uniform[0,1] variables

(This question is related to how to generate a dataset of correlated variables with different distributions?)
In Stata, say that I create a random variable following a Uniform[0,1] distribution:
set seed 100
gen random1 = runiform()
I now want to create a second random variable that is correlated with the first (the correlation should be .75, say), but is bounded by 0 and 1. I would like this second variable to also be more-or-less Uniform[0,1]. How can I do this?

This won't be exact, but the NORTA/copula method should be pretty close and easy to implement.
The relevant citation is:
Cario, Marne C., and Barry L. Nelson. Modeling and generating random vectors with arbitrary marginal distributions and correlation matrix. Technical Report, Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, Illinois, 1997.
The paper can be found here.
The general recipe to generate correlated random variables from any distribution is:
1. Draw two (or more) correlated variables from a joint standard normal distribution using corr2data
2. Calculate the univariate normal CDF of each of these variables using normal()
3. Apply the inverse CDF of any distribution to simulate draws from that distribution.
The third step is pretty easy with the [0,1] uniform: you don't even need it. Typically, the magnitude of the correlations you get will be less than the magnitudes of the original (normal) correlations, so it might be useful to bump those up a bit.
Stata Code for 2 uniformish variables that have a correlation of 0.75:
clear
// Step 1
matrix C = (1, .75 \ .75, 1)
corr2data x y, n(10000) corr(C) double
corr x y, means
// Steps 2-3
replace x = normal(x)
replace y = normal(y)
// Make sure things worked
corr x y, means
stack x y, into(z) clear
lab define vars 1 "x" 2 "y"
lab val _stack vars
capture ssc install bihist
bihist z, by(_stack) density tw1(yline(-1 0 1))
If you want to improve the approximation for the uniform case, you can transform the correlations like this (see section 5 of the linked paper):
matrix C = (1,2*sin(.75*_pi/6)\2*sin(.75*_pi/6),1)
This is 0.76536686 instead of the 0.75.
Code for the question in the comments
The correlation matrix C written more compactly, and I am applying the transformation:
clear
matrix C = ( 1, ///
2*sin(-.46*_pi/6), 1, ///
2*sin(.53*_pi/6), 2*sin(-.80*_pi/6), 1, ///
2*sin(0*_pi/6), 2*sin(-.41*_pi/6), 2*sin(.48*_pi/6), 1 )
corr2data v1 v2 v3 v4, n(10000) corr(C) cstorage(lower)
forvalues i=1/4 {
replace v`i' = normal(v`i')
}
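Outside Stata, the same recipe can be sketched with nothing but a standard library. The Python version below is an assumption-laden illustration rather than a translation of corr2data: it builds the correlated normal pair by hand (z2 = rho*z1 + sqrt(1 - rho^2)*e), applies the 2*sin(r*pi/6) adjustment mentioned above, and uses math.erf for the normal CDF. All function names are made up for the example.

```python
import math
import random


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def correlated_uniforms(n, target_corr, seed=100):
    """Draw n pairs of roughly Uniform[0,1] values with the given correlation."""
    rng = random.Random(seed)
    rho = 2.0 * math.sin(target_corr * math.pi / 6.0)  # adjusted normal correlation
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        e = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * e  # corr(z1, z2) = rho
        pairs.append((normal_cdf(z1), normal_cdf(z2)))
    return pairs


def pearson(pairs):
    """Sample Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


pairs = correlated_uniforms(20000, 0.75)
r = pearson(pairs)  # should land close to 0.75
```

The sample correlation will fluctuate a little from seed to seed, so treat the result as approximate.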

Generate random numbers according to distributions

I want to generate random numbers according to some distributions. How can I do this?

The standard random number generator you've got (rand() in C after a simple transformation, equivalents in many languages) is a fairly good approximation to a uniform distribution over the range [0,1]. If that's what you need, you're done. It's also trivial to convert that to a random number generated over a somewhat larger integer range.
Conversion of a Uniform distribution to a Normal distribution has already been covered on SO, as has going to the Exponential distribution.
[EDIT]: For the triangular distribution, converting a uniform variable is relatively simple (in something C-like):
#include <math.h>
#include <stdlib.h>

/* a = lower limit, b = upper limit, c = mode */
double triangular(double a, double b, double c) {
    double U = rand() / (double) RAND_MAX;
    double F = (c - a) / (b - a);
    if (U <= F)
        return a + sqrt(U * (b - a) * (c - a));
    else
        return b - sqrt((1 - U) * (b - a) * (b - c));
}
That's just converting the formula given on the Wikipedia page. If you want others, that's the place to start looking; in general, you use the uniform variable to pick a point on the vertical axis of the cumulative distribution function (CDF) of the distribution you want (assuming it's continuous), and invert the CDF to get the random value with the desired distribution.

The right way to do this is to decompose the distribution into n-1 binary distributions. That is if you have a distribution like this:
A: 0.05
B: 0.10
C: 0.10
D: 0.20
E: 0.55
You transform it into 4 binary distributions:
1. A/E: 0.20/0.80
2. B/E: 0.40/0.60
3. C/E: 0.40/0.60
4. D/E: 0.80/0.20
Select uniformly from the n-1 distributions, and then select the first or second symbol based on the probability of each in the binary distribution.
Code for this is here
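As a quick check of the example above, the following Python sketch samples from the four binary distributions and also computes the overall probabilities they imply. The table is copied from the example; how to construct such a table for an arbitrary distribution is not shown, and the function names are illustrative.

```python
import random

# (first symbol, second symbol, probability of the first symbol)
TABLE = [
    ("A", "E", 0.20),
    ("B", "E", 0.40),
    ("C", "E", 0.40),
    ("D", "E", 0.80),
]


def sample(table, rng):
    """Pick one binary distribution uniformly, then one of its two symbols."""
    first, second, p_first = rng.choice(table)
    return first if rng.random() < p_first else second


def implied_probabilities(table):
    """Overall probability of each symbol implied by the table."""
    share = 1.0 / len(table)
    probs = {}
    for first, second, p_first in table:
        probs[first] = probs.get(first, 0.0) + share * p_first
        probs[second] = probs.get(second, 0.0) + share * (1.0 - p_first)
    return probs


probs = implied_probabilities(TABLE)
draw = sample(TABLE, random.Random(0))
```

Each draw costs one uniform choice and one comparison, regardless of how many symbols there are.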

It actually depends on distribution. The most general way is the following. Let P(X) be the probability that random number generated according to your distribution is less than X.
You start with generating uniform random X between zero and one. After that you find Y such that P(Y) = X and output Y. You could find such Y using binary search (since P(X) is an increasing function of X).
This is not very efficient, but works for distributions where P(X) could be efficiently computed.
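A minimal Python sketch of this binary search is shown below. It assumes the CDF is continuous and increasing on a known bracket [lo, hi]; the bracket, the tolerance and the function name are choices made for the example.

```python
import math
import random


def sample_by_bisection(cdf, lo, hi, u, tol=1e-9):
    """Find y in [lo, hi] with cdf(y) close to u by binary search."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if cdf(mid) < u:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def exponential_cdf(x):
    """CDF of the exponential distribution with rate 1."""
    return 1.0 - math.exp(-x)


median = sample_by_bisection(exponential_cdf, 0.0, 50.0, 0.5)
draw = sample_by_bisection(exponential_cdf, 0.0, 50.0, random.random())
```

Feeding u = 0.5 recovers the median, which is a handy sanity check before drawing random values of u.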

You can look up inverse transform sampling, rejection sampling as well as the book by Devroye "Nonuniform random variate generation"/Springer Verlag 1986

You can convert from discrete bins to float/double with interpolation. Simple linear works well. If your table memory is constrained other interpolation methods can be used. -jlp

It's a standard textbook matter. See here for some code, or here at Section 3.2 for some reference mathematical background (actually very quick and simple to read).

Trying to build algorithm for optimal tower placement in a game

This is going to be a long post and just for fun, so if you don't have much time better go help folks with more important questions instead :)
There is a game called "Tower Bloxx" recently released on xbox. One part of the game is to place different colored towers on a field in a most optimal way in order to maximize number of most valuable towers. I wrote an algorithm that would determine the most efficient tower placement but it is not very efficient and pretty much just brute forcing all possible combinations. For 4x4 field with 4 tower types it solves it in about 1 hr, 5 tower types would take about 40 hours which is too much.
Here are the rules:
There are 5 types of towers that can be placed on a field. There are several types of fields; the easiest one is just a 4x4 matrix, and other fields have some "blanks" where you can't build. Your aim is to put as many of the most valuable towers on a field as possible to maximize the total tower value on the field (let's assume that all towers are built at once, so there are no turns).
Tower types (in order from less to most valuable):
Blue - can be placed anywhere, value = 10
Red - can be placed only besides blue, value = 20
Green - placed besides red and blue, value = 30
Yellow - besides green, red and blue, value = 40
White - besides yellow, green, red and blue, value = 100
Which means that for example green tower should have at least 1 red and 1 blue towers at either north, south, west or east neighbor cells (diagonals don't count). White tower should be surrounded with all other colors.
Here is my algorithm for 4 towers on 4x4 field:
Total number of combinations = 4^16
Loop through [0..4^16-1] and convert every number to a base-4 string in order to encode the tower placement, so 4^16-1 = "3333 3333 3333 3333", where each digit represents a tower type (0=blue,...,3=yellow)
Convert tower placement string into matrix.
For every tower in a matrix check its neighbors and if any of requirements fails this whole combination fails.
Put all correct combinations into an array and then sort this array as strings in lexicographic order to find best possible combination (first need to sort characters in a string).
The only optimization I came up with is to skip combinations that don't contain any most valuable towers. It skips some processing but I still loop through all 4^16 combinations.
Any thoughts on how this can be improved? Code samples would be helpful if in Java or PHP.
-------Update--------
After adding more illegal states (yellow cannot be built in the corners, white cannot be built in corners and on the edges, field should contain at least one tower of each type), realizing that only 1 white tower could be possibly built on 4x4 field and optimizing java code the total time was brought down from 40 to ~16 hours. Maybe threading would bring it down to 10 hrs but that's probably brute forcing limit.
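For reference, the base-4 encoding and the neighbor check described in the steps above can be sketched as follows. This is a Python illustration of the 4-tower case only, under the assumption that the least significant base-4 digit is the top-left cell; the names decode, is_valid and score are made up for the example.

```python
# Tower codes: 0 = blue, 1 = red, 2 = green, 3 = yellow.
REQUIRED = {0: set(), 1: {0}, 2: {0, 1}, 3: {0, 1, 2}}
VALUES = {0: 10, 1: 20, 2: 30, 3: 40}


def decode(n, size=4):
    """Turn a number into a size x size board of base-4 digits."""
    cells = []
    for _ in range(size * size):
        cells.append(n % 4)
        n //= 4
    return [cells[r * size:(r + 1) * size] for r in range(size)]


def is_valid(board):
    """Check that every tower has all its required neighbor colors."""
    size = len(board)
    for r in range(size):
        for c in range(size):
            neighbors = {
                board[rr][cc]
                for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                if 0 <= rr < size and 0 <= cc < size
            }
            if not REQUIRED[board[r][c]] <= neighbors:
                return False
    return True


def score(board):
    return sum(VALUES[tower] for row in board for tower in row)
```

Looping over range(4 ** 16) with these functions is still the brute-force approach; the sketch only pins down the encoding and the rule check.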

I found this question intriguing, and since I'm teaching myself Haskell, I decided to try my hand at implementing a solution in that language.
I thought about branch-and-bound, but couldn't come up with a good way to bound the solutions, so I just did some pruning by discarding boards that violate the rules.
My algorithm works by starting with an "empty" board. It places each possible color of tower in the first empty slot and in each case (each color) then recursively calls itself. The recursed calls try each color in the second slot, recursing again, until the board is full.
As each tower is placed, I check the just-placed tower and all of its neighbors to verify that they're obeying the rules, treating any empty neighbors as wild cards. So if a white tower has four empty neighbors, I consider it valid. If a placement is invalid, I do not recurse on that placement, effectively pruning the entire tree of possibilities under it.
The way the code is written, I generate a list of all possible solutions, then look through the list to find the best one. In actuality, thanks to Haskell's lazy evaluation, the list elements are generated as the search function needs them, and since they're never referred to again they become available for garbage collection right away, so even for a 5x5 board memory usage is fairly small (2 MB).
Performance is pretty good. On my 2.1 GHz laptop, the compiled version of the program solves the 4x4 case in ~50 seconds, using one core. I'm running a 5x5 example right now to see how long it will take. Since functional code is quite easy to parallelize, I'm also going to experiment with parallel processing. There's a parallelized Haskell compiler that will not only spread the work across multiple cores, but across multiple machines as well, and this is a very parallelizable problem.
Here's my code so far. I realize that you specified Java or PHP, and Haskell is quite different. If you want to play with it, you can modify the definition of the variable "bnd" just above the bottom to set the board size. Just set it to ((1,1),(x, y)), where x and y are the number of columns and rows, respectively.
import Array
import Data.List
-- Enumeration of Tower types. "Empty" isn't really a tower color,
-- but it allows boards to have empty cells
data Tower = Empty | Blue | Red | Green | Yellow | White
deriving(Eq, Ord, Enum, Show)
type Location = (Int, Int)
type Board = Array Location Tower
-- towerScore omputes the score of a single tower
towerScore :: Tower -> Int
towerScore White = 100
towerScore t = (fromEnum t) * 10
-- towerUpper computes the upper bound for a single tower
towerUpper :: Tower -> Int
towerUpper Empty = 100
towerUpper t = towerScore t
-- boardScore computes the score of a board
boardScore :: Board -> Int
boardScore b = sum [ towerScore (b!loc) | loc <- range (bounds b) ]
-- boardUpper computes the upper bound of the score of a board
boardUpper :: Board -> Int
boardUpper b = sum [ bestScore loc | loc <- range (bounds b) ]
where
bestScore l | tower == Empty =
towerScore (head [ t | t <- colors, canPlace b l t ])
| otherwise = towerScore tower
where
tower = b!l
colors = reverse (enumFromTo Empty White)
-- Compute the neighbor locations of the specified location
neighborLoc :: ((Int,Int),(Int,Int)) -> (Int,Int) -> [(Int,Int)]
neighborLoc bounds (col, row) = filter valid neighborLoc'
where
valid loc = inRange bounds loc
neighborLoc' = [(col-1,row),(col+1,row),(col,row-1),(col,row+1)]
-- Array to store all of the neighbors of each location, so we don't
-- have to recalculate them repeatedly.
neighborArr = array bnd [(loc, neighborLoc bnd loc) | loc <- range bnd]
-- Get the contents of neighboring cells
neighborTowers :: Board -> Location -> [Tower]
neighborTowers board loc = [ board!l | l <- (neighborArr!loc) ]
-- The tower placement rule. Yields a list of tower colors that must
-- be adjacent to a tower of the specified color.
requiredTowers :: Tower -> [Tower]
requiredTowers Empty = []
requiredTowers Blue = []
requiredTowers Red = [Blue]
requiredTowers Green = [Red, Blue]
requiredTowers Yellow = [Green, Red, Blue]
requiredTowers White = [Yellow, Green, Red, Blue]
-- cellValid determines if a cell satisfies the rule.
cellValid :: Board -> Location -> Bool
cellValid board loc = null required ||
null needed ||
(length needed <= length empties)
where
neighbors = neighborTowers board loc
required = requiredTowers (board!loc)
needed = required \\ neighbors
empties = filter (==Empty) neighbors
-- canPlace determines if 'tower' can be placed in 'cell' without
-- violating the rule.
canPlace :: Board -> Location -> Tower -> Bool
canPlace board loc tower =
    let b' = board // [(loc,tower)]
    in cellValid b' loc && and [ cellValid b' l | l <- neighborArr!loc ]
-- Generate a board full of empty cells
cleanBoard :: Array Location Tower
cleanBoard = listArray bnd (replicate 80 Empty)
-- The heart of the algorithm, this function takes a partial board
-- (and a list of empty locations, just to avoid having to search for
-- them) and a score and returns the best board obtainable by filling
-- in the partial board
solutions :: Board -> [Location] -> Int -> Board
solutions b empties best | null empties = b
solutions b empties best =
    fst (foldl' f (cleanBoard, best) [ b // [(l,t)] | t <- colors, canPlace b l t ])
  where
    f :: (Board, Int) -> Board -> (Board, Int)
    f (b1, best) b2 | boardUpper b2 <= best = (b1, best)
                    | otherwise = if newScore > lstScore
                                    then (new, max newScore best)
                                    else (b1, best)
      where
        lstScore = boardScore b1
        new = solutions b2 e' best
        newScore = boardScore new
    l = head empties
    e' = tail empties
    colors = reverse (enumFromTo Blue White)
-- showBoard converts a board to a printable string representation
showBoard :: Board -> String
showBoard board = unlines [ printRow row | row <- [minrow..maxrow] ]
  where
    ((mincol, minrow), (maxcol, maxrow)) = bounds board
    printRow row = unwords [ printCell col row | col <- [mincol..maxcol] ]
    printCell col row = take 1 (show (board!(col,row)))
-- Set 'bnd' to the size of the desired board.
bnd = ((1,1),(4,4))
-- Main function generates the solutions, finds the best and prints
-- it out, along with its score
main = do putStrLn (showBoard best); putStrLn (show (boardScore best))
  where
    s = solutions cleanBoard (range (bounds cleanBoard)) 0
    best = s
Also, please remember this is my first non-trivial Haskell program. I'm sure it can be done much more elegantly and succinctly.
Update: Since it was still very time-consuming to do a 5x5 with 5 colors (I waited 12 hours and it hadn't finished), I took another look at how to use bounding to prune more of the search tree.
My first approach was to estimate the upper bound of a partially-filled board by assuming every empty cell is filled with a white tower. I then modified the 'solution' function to track the best score seen and to ignore any board whose upper bound is less than that best score.
That helped some, reducing a 4x4x5 board from 23s to 15s. To improve it further, I modified the upper bound function to assume that each Empty is filled with the best tower possible, consistent with the existing non-empty cell contents. That helped a great deal, reducing the 4x4x5 time to 2s.
Running it on 5x5x5 took 2600s, giving the following board:
G B G R B
R B W Y G
Y G R B R
B W Y G Y
G R B R B
with a score of 730.
I may make another modification and have it find all of the maximal-scoring boards, rather than just one.

If you don't want to do A*, use a branch and bound approach. The problem should be relatively easy to code up because your value functions are well defined. I imagine you should be able to prune off huge sections of the search space with relative ease. However because your search space is pretty large it may still take some time. Only one way to find out :)
The wiki article isn't the best in the world. Google can find you a ton of nice examples and trees and stuff to further illustrate the approach.

One easy way to improve the brute force method is to explore only legal states. For example, if you are trying all possible states, you will be testing many states where the top right corner is a white tower. All of these states will be illegal. It doesn't make sense to generate and test all of those states. So you want to generate your states one block at a time, and only go deeper into the tree when you are actually at a potentially valid state. This will cut down your search tree by many orders of magnitude.
There may be further fancy things you can do, but this is an easy to understand (hopefully) improvement to your current solution.

I think you will want to use a branch-and-bound algorithm because I think coming up with a good heuristic for an A* implementation will be hard (but that's just my intuition).
The pseudo-code for a branch-and-bound implementation is:
board = initial board with nothing on it, probably a 2D array
bestBoard = {}
function findBest(board)
    if no more pieces can be added to board then
        if score(board) > score(bestBoard) then
            bestBoard = board
        return
    else
        for each piece P we can legally add to board
            newBoard = board with piece P added
            //loose upper bound, could be improved
            if score(newBoard) + 100*number of blanks in newBoard > score(bestBoard)
                findBest(newBoard)
The idea is that we search all possible boards, in order, but we keep track of the best one we have found so far (this is the bound). Then, if we find a partial board which we know will never be better than the best one so far, we stop working on that partial board: we trim that branch of the search tree.
In the code above I am doing the check by assuming that all the blanks would be filled by the white pieces, as we can't do better than that. I am sure that with a little bit of thought you can come up with a tighter bound than that.
Another place where you can try to optimize is in the order of the for-each loop. You want to try pieces in the correct order. That is, optimally you want the first solution found to be the best one, or at least one with a really high score.
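To make the pseudo-code above concrete, here is a minimal Python sketch of the same branch-and-bound idea. It assumes the rules as they appear in the Haskell code in the question (a tower of color c scores 10*c and needs every lower color among its four neighbors), encodes colors as integers, and uses a 50-per-blank bound because White scores 50 under that encoding. The names `solve`, `can_place` and `cell_ok` are made up for this illustration.

```python
# Colors as integers: 0 = empty, 1 = Blue, 2 = Red, 3 = Green, 4 = Yellow, 5 = White.
# Assumed rule: a tower of color c scores 10*c and must have every lower
# color (1 .. c-1) among its orthogonal neighbors.
EMPTY, WHITE = 0, 5


def neighbors(cell, rows, cols):
    """Orthogonal neighbors of a cell that lie on the board."""
    r, c = cell
    cand = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [(a, b) for a, b in cand if 0 <= a < rows and 0 <= b < cols]


def cell_ok(board, cell, rows, cols):
    """A cell is satisfiable if its missing colors still fit in empty neighbors."""
    around = [board[n] for n in neighbors(cell, rows, cols)]
    missing = [c for c in range(1, board[cell]) if c not in around]
    return len(missing) <= around.count(EMPTY)


def can_place(board, cell, color, rows, cols):
    """Tentatively place a color on an empty cell and check it and its neighbors."""
    board[cell] = color
    ok = cell_ok(board, cell, rows, cols) and all(
        cell_ok(board, n, rows, cols) for n in neighbors(cell, rows, cols))
    board[cell] = EMPTY
    return ok


def score(board):
    return 10 * sum(board.values())


def solve(rows, cols):
    """Return (best_score, best_board) for a fully filled rows x cols board."""
    board = {(r, c): EMPTY for r in range(rows) for c in range(cols)}
    cells = list(board)
    best = {"score": -1, "board": dict(board)}

    def search(i):
        if i == len(cells):                      # board is full: candidate solution
            if score(board) > best["score"]:
                best["score"], best["board"] = score(board), dict(board)
            return
        for color in range(WHITE, 0, -1):        # try high-scoring colors first
            if not can_place(board, cells[i], color, rows, cols):
                continue
            board[cells[i]] = color
            blanks = len(cells) - i - 1
            # Loose upper bound: pretend every remaining blank becomes White.
            if score(board) + 10 * WHITE * blanks > best["score"]:
                search(i + 1)
            board[cells[i]] = EMPTY

    search(0)
    return best["score"], best["board"]
```

Calling `solve(4, 4)` returns the best score and a dictionary mapping (row, column) to a color number. Trying colors from White downwards is one way to get a high-scoring board early, which makes the bound cut more branches.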

It seems like a good approach would be to start with a white tower and then build a set of towers around it based on the requirements, trying to find the smallest possible colored set of shapes which can act as interlocking tiles.

I wanted to advocate linear programming with integer unknowns, but it turns out that it's NP-hard even in the binary case. However, you can still get great success at optimizing a problem like yours, where there are many valid solutions and you simply want the best one.
Linear programming, for this kind of problem, essentially amounts to having a lot of variables (for example, the number of red towers present in cell (M, N)) and relationships among the variables (for example, the number of white towers in cell (M, N) must be less than or equal to the number of towers of the non-white color that has the smallest such number, among all its neighbors). It's kind of a pain to write up a linear program, but if you want a solution that runs in seconds, it's probably your best bet.

You've received a lot of good advice on the algorithmic side of things, so I don't have a lot to add. But, assuming Java as the language, here are a few fairly obvious suggestions for performance improvement.
Make sure you're not instantiating any objects inside that 4^16 loop. It's much, much cheaper for the JVM to re-initialize an existing object than to create a new one. Even cheaper to use arrays of primitives. :)
If you can help it, step away from the collection classes. They'll add a lot of overhead that you probably don't need.
Make sure you're not concatenating any strings. Use StringBuilder.
And lastly, consider re-writing the whole thing in C.

Converting a Uniform Distribution to a Normal Distribution

How can I convert a uniform distribution (as most random number generators produce, e.g. between 0.0 and 1.0) into a normal distribution? What if I want a mean and standard deviation of my choosing?

There are plenty of methods:
Do not use Box-Muller, especially if you draw many Gaussian numbers. Box-Muller yields a result which is clamped between -6 and 6 (assuming double precision; things worsen with floats), and it is less efficient than other available methods.
Ziggurat is fine, but needs a table lookup (and some platform-specific tweaking due to cache size issues)
Ratio-of-uniforms is my favorite: only a few additions/multiplications, and a log only 1/50th of the time (e.g. look there).
Inverting the CDF is efficient (and overlooked, why?); fast implementations of it are available if you search Google. It is mandatory for quasi-random numbers.
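For readers who have not met ratio-of-uniforms, here is a minimal Python sketch of the basic form of the method for the standard normal. It leaves out the quick accept/reject tests that let tuned implementations skip the logarithm most of the time, so treat it as an illustration of the idea rather than a fast implementation. The function name and the `rng` parameter are made up for this example.

```python
import math
import random


def normal_ratio_of_uniforms(rng=random):
    """Draw one standard normal sample with the basic ratio-of-uniforms method."""
    bound = math.sqrt(2.0 / math.e)      # v is drawn from [-bound, bound]
    while True:
        u = 1.0 - rng.random()           # in (0, 1], so log(u) and v/u are safe
        v = (2.0 * rng.random() - 1.0) * bound
        x = v / u
        # Accept when (u, v) lies in the region u <= sqrt(exp(-x*x/2)).
        if x * x <= -4.0 * math.log(u):
            return x
```

Scaling works as for any other method: `mean + std_dev * normal_ratio_of_uniforms()`.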

The Ziggurat algorithm is pretty efficient for this, although the Box-Muller transform is easier to implement from scratch (and not crazy slow).

Changing the distribution of any function to another involves using the inverse of the function you want.
In other words, if you aim for a specific probability density function p(x), you get the cumulative distribution by integrating it, d(x) = integral(p(x)), and then use its inverse, Inv(d). Now take a random number generator with a uniform distribution and pass each value through Inv(d). You should get random values distributed according to the function you chose.
This is the generic math approach; with it you can use any probability or distribution function you have, as long as it has an inverse or a good approximation of one.
Hope this helped and thanks for the small remark about using the distribution and not the probability itself.
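As a small illustration of the inverse approach, here is a Python sketch with two cases: the exponential distribution, where the inverse of the cumulative distribution has a closed form, and the normal distribution, where it relies on `statistics.NormalDist.inv_cdf` from the standard library (Python 3.8 or later). The function names and the `rng` parameter are assumptions of this example.

```python
import math
import random
from statistics import NormalDist


def sample_exponential(rate, rng=random):
    """Inverse-CDF sampling with an analytic inverse: Inv(d)(u) = -ln(1 - u) / rate."""
    u = rng.random()                     # uniform on [0, 1)
    return -math.log(1.0 - u) / rate


def sample_normal(mean=0.0, std_dev=1.0, rng=random):
    """Inverse-CDF sampling for the normal distribution via a library inverse."""
    u = rng.random()
    while u == 0.0:                      # inv_cdf requires 0 < u < 1
        u = rng.random()
    return NormalDist(mean, std_dev).inv_cdf(u)
```

For example, `sample_normal(50, 3)` draws a value from a normal distribution with mean 50 and standard deviation 3.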

Here is a javascript implementation using the polar form of the Box-Muller transformation.
/*
* Returns member of set with a given mean and standard deviation
* mean: mean
* standard deviation: std_dev
*/
function createMemberInNormalDistribution(mean,std_dev){
return mean + (gaussRandom()*std_dev);
}
/*
* Returns random number in normal distribution centering on 0.
* ~95% of numbers returned should fall between -2 and 2
* ie within two standard deviations
*/
function gaussRandom() {
var u = 2*Math.random()-1;
var v = 2*Math.random()-1;
var r = u*u + v*v;
/*if outside interval [0,1] start over*/
if(r == 0 || r >= 1) return gaussRandom();
var c = Math.sqrt(-2*Math.log(r)/r);
return u*c;
/* todo: optimize this algorithm by caching (v*c)
* and returning next time gaussRandom() is called.
* left out for simplicity */
}

Where R1, R2 are random uniform numbers:
NORMAL DISTRIBUTION, with SD of 1:
sqrt(-2*log(R1))*cos(2*pi*R2)
This is exact... no need to do all those slow loops!
Reference: dspguide.com/ch2/6.htm
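A direct Python transcription of that formula might look like the sketch below. It also returns the sine counterpart, since the same two uniforms yield a second independent normal value for free. The function name and the `rng` parameter are made up for this example, and the `1.0 - rng.random()` is only there to keep the argument of the logarithm away from zero.

```python
import math
import random


def box_muller(rng=random):
    """Return two independent standard normal values from two uniforms."""
    r1 = 1.0 - rng.random()              # in (0, 1], avoids log(0)
    r2 = rng.random()
    radius = math.sqrt(-2.0 * math.log(r1))
    angle = 2.0 * math.pi * r2
    return radius * math.cos(angle), radius * math.sin(angle)
```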

Use the central limit theorem (Wikipedia entry, MathWorld entry) to your advantage.
Generate n uniformly distributed numbers, sum them, and subtract n*0.5; the result is approximately normally distributed with mean 0 and variance n/12 (see Wikipedia on uniform distributions: a single uniform on [0,1] has variance 1/12). Divide by sqrt(n/12) if you want unit variance.
n=10 gives you something half decent fast. If you want something more than half decent go for tylers solution (as noted in the wikipedia entry on normal distributions)
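Here is a minimal Python sketch of the summing approach. It relies on the fact that a sum of n independent uniforms on [0, 1) has mean n/2 and variance n/12, so with n = 12 the centered sum already has unit variance. The function name and defaults are assumptions of this example. Note that the output can never fall outside plus or minus sqrt(3n), so the tails are cut off.

```python
import math
import random


def approx_normal(n=12, rng=random):
    """Approximate a standard normal value by summing n uniforms (central limit theorem)."""
    total = sum(rng.random() for _ in range(n))
    # Center on the mean n/2 and rescale by the standard deviation sqrt(n/12).
    return (total - n / 2.0) / math.sqrt(n / 12.0)
```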

I would use Box-Muller. Two things about this:
You end up with two values per iteration
Typically, you cache one value and return the other. On the next call for a sample, you return the cached value.
Box-Muller gives a Z-score
You have to then scale the Z-score by the standard deviation and add the mean to get the full value in the normal distribution.
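The two points above (cache the second value, then scale the Z-score) can be sketched in Python as follows. The class name `GaussianSource` and its attributes are invented for this illustration, and it uses the basic trigonometric form of Box-Muller.

```python
import math
import random


class GaussianSource:
    """Box-Muller sampler that caches the second value of each pair."""

    def __init__(self, mean=0.0, std_dev=1.0, rng=None):
        self.mean = mean
        self.std_dev = std_dev
        self.rng = rng if rng is not None else random.Random()
        self._spare = None               # cached Z-score from the previous pair

    def sample(self):
        if self._spare is not None:
            z, self._spare = self._spare, None
        else:
            radius = math.sqrt(-2.0 * math.log(1.0 - self.rng.random()))
            angle = 2.0 * math.pi * self.rng.random()
            z = radius * math.cos(angle)
            self._spare = radius * math.sin(angle)
        # Scale the Z-score by the standard deviation and shift by the mean.
        return self.mean + self.std_dev * z
```

Every second call to `sample()` costs no new uniform draws, because it returns the cached value.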

It seems incredible that I could add something to this after eight years, but for the case of Java I would like to point readers to the Random.nextGaussian() method, which generates a Gaussian distribution with mean 0.0 and standard deviation 1.0 for you.
A simple addition and/or multiplication will change the mean and standard deviation to your needs.

The standard Python library module random has what you want:
normalvariate(mu, sigma)
Normal distribution. mu is the mean, and sigma is the standard deviation.
For the algorithm itself, take a look at the function in random.py in the Python library.
The manual entry is here
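A short usage sketch, in case it helps. The wrapper function `draw_normals` and its seed parameter are made up here; `normalvariate` itself is the standard library call mentioned above.

```python
import random


def draw_normals(count, mu, sigma, seed=None):
    """Draw `count` values from a normal distribution using the standard library."""
    rng = random.Random(seed)            # seeded generator for reproducible runs
    return [rng.normalvariate(mu, sigma) for _ in range(count)]


samples = draw_normals(10_000, mu=50.0, sigma=3.0, seed=42)
sample_mean = sum(samples) / len(samples)
```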

This is a Matlab implementation using the polar form of the Box-Muller transformation:
Function randn_box_muller.m:
function [values] = randn_box_muller(n, mean, std_dev)
if nargin == 1
mean = 0;
std_dev = 1;
end
r = gaussRandomN(n);
values = r.*std_dev + mean;
end
function [values] = gaussRandomN(n)
[u, v, r] = gaussRandomNValid(n);
c = sqrt(-2*log(r)./r);
values = u.*c;
end
function [u, v, r] = gaussRandomNValid(n)
r = zeros(n, 1);
u = zeros(n, 1);
v = zeros(n, 1);
filter = r==0 | r>=1;
% if outside interval [0,1] start over
while n ~= 0
u(filter) = 2*rand(n, 1)-1;
v(filter) = 2*rand(n, 1)-1;
r(filter) = u(filter).*u(filter) + v(filter).*v(filter);
filter = r==0 | r>=1;
n = size(r(filter),1);
end
end
And invoking histfit(randn_box_muller(10000000),100); this is the result:
Obviously it is really inefficient compared with the Matlab built-in randn.

This is my JavaScript implementation of Algorithm P (Polar method for normal deviates) from Section 3.4.1 of Donald Knuth's book The Art of Computer Programming:
function normal_random(mean,stddev)
{
var V1
var V2
var S
do{
var U1 = Math.random() // return uniform distributed in [0,1[
var U2 = Math.random()
V1 = 2*U1-1
V2 = 2*U2-1
S = V1*V1+V2*V2
}while(S >= 1 || S === 0) // reject points outside the unit circle, and the origin (log(0))
return mean+stddev*(V1*Math.sqrt(-2*Math.log(S)/S))
}

I think you should try this in Excel: =norminv(rand();0;1). This will produce random numbers that are normally distributed with zero mean and unit variance. The "0" can be replaced with any value to get the desired mean, and by changing the "1" you get a variance equal to the square of your input.
For example, =norminv(rand();50;3) will yield normally distributed numbers with MEAN = 50 and VARIANCE = 9.

Q: How can I convert a uniform distribution (as most random number generators produce, e.g. between 0.0 and 1.0) into a normal distribution?
For a software implementation, I know a couple of random generators that give you a pseudo-random uniform sequence in [0,1] (Mersenne Twister, linear congruential generator). Let's call its output U.
The relevant area of mathematics is probability theory.
First: if you want to model a random variable with cumulative distribution function F, you can simply evaluate F^-1(U). Probability theory proves that the resulting random variable has distribution function F.
This approach generates a random variable distributed as F without any numerical methods when F^-1 can be derived analytically (e.g. for the exponential distribution).
To model the normal distribution you can calculate y2*cos(y1), where y1 is uniform on [0, 2pi] and y2 has the Rayleigh distribution.
Q: What if I want a mean and standard deviation of my choosing?
You can calculate sigma*N(0,1) + m.
It can be shown that such shifting and scaling leads to N(m, sigma^2).

I have the following code which maybe could help:
set.seed(123)
n <- 1000
u <- runif(n) #creates U
x <- -log(u)
y <- runif(n, max=u*sqrt((2*exp(1))/pi)) #create Y
z <- ifelse (y < dnorm(x)/2, -x, NA)
z <- ifelse ((y > dnorm(x)/2) & (y < dnorm(x)), x, z)
z <- z[!is.na(z)]

It is also easier to use the built-in function rnorm(), since it is faster than writing your own random number generator for the normal distribution. See the following code as proof:
n <- length(z)
t0 <- Sys.time()
z <- rnorm(n)
t1 <- Sys.time()
t1-t0

function distRandom(){
do{
x=random(DISTRIBUTION_DOMAIN);
}while(random(DISTRIBUTION_RANGE)>=distributionFunction(x));
return x;
}
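The snippet above is rejection sampling: pick x uniformly over the domain, pick a height uniformly up to the peak of the density, and keep x only if the height falls under the curve. A runnable Python sketch of the same loop follows. The parameter names `lo`, `hi` and `peak` and the example density are assumptions of this illustration; the method only works on a bounded domain with a known upper bound on the density, so the normal is truncated here to [-4, 4].

```python
import math
import random


def rejection_sample(density, lo, hi, peak, rng=random):
    """Sample from `density` on [lo, hi], where `peak` bounds the density from above."""
    while True:
        x = lo + (hi - lo) * rng.random()        # candidate, uniform over the domain
        if peak * rng.random() < density(x):     # keep it if the height is under the curve
            return x


def std_normal_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


# Example: standard normal truncated to [-4, 4]; the density peaks at x = 0.
value = rejection_sample(std_normal_pdf, -4.0, 4.0, std_normal_pdf(0.0))
```

The wider the domain relative to where the density has its mass, the more candidates get rejected, which is why this is usually slower than the specialized methods.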
