Extremely small or NaN values appear in training neural network - algorithm

I'm trying to implement a neural network architecture in Haskell, and use it on MNIST.
I'm using the hmatrix package for linear algebra.
My training framework is built using the pipes package.
My code compiles and doesn't crash. But the problem is, certain combinations of layer size (say, 1000), minibatch size, and learning rate give rise to NaN values in the computations. After some inspection, I see that extremely small values (order of 1e-100) eventually appear in the activations. But, even when that doesn't happen, the training still doesn't work. There's no improvement over its loss or accuracy.
I checked and rechecked my code, and I'm at a loss as to what the root of the problem could be.
Here's the backpropagation training, which computes the deltas for each layer:
backward lf n (out,tar) das = do
let δout = tr (derivate lf (tar, out)) -- dE/dy
deltas = scanr (\(l, a') δ ->
let w = weights l
in (tr a') * (w <> δ)) δout (zip (tail $ toList n) das)
return (deltas)
lf is the loss function, n is the network (weight matrix and bias vector for each layer), out and tar are the actual output of the network and the target (desired) output, and das are the activation derivatives of each layer.
In batch mode, out, tar are matrices (rows are output vectors), and das is a list of the matrices.
Here's the actual gradient computation:
grad lf (n, (i,t)) = do
-- Forward propagation: compute layers outputs and activation derivatives
let (as, as') = unzip $ runLayers n i
(out) = last as
(ds) <- backward lf n (out, t) (init as') -- Compute deltas with backpropagation
let r = fromIntegral $ rows i -- Size of minibatch
let gs = zipWith (\δ a -> tr (δ <> a)) ds (i:init as) -- Gradients for weights
return $ GradBatch ((recip r .*) <$> gs, (recip r .*) <$> squeeze <$> ds)
Here, lf and n are the same as above, i is the input, and t is the target output (both in batch form, as matrices).
squeeze transforms a matrix into a vector by summing over each row. That is, ds is a list of matrices of deltas, where each column corresponds to the deltas for a row of the minibatch. So, the gradients for the biases are the average of the deltas over all the minibatch. The same thing for gs, which corresponds to the gradients for the weights.
Here's the actual update code:
move lr (n, (i,t)) (GradBatch (gs, ds)) = do
-- Update function
let update = (\(FC w b af) g δ -> FC (w + (lr).*g) (b + (lr).*δ) af)
n' = Network.fromList $ zipWith3 update (Network.toList n) gs ds
return (n', (i,t))
lr is the learning rate. FC is the layer constructor, and af is the activation function for that layer.
The gradient descent algorithm makes sure to pass in a negative value for the learning rate. The actual code for the gradient descent is simply a loop around a composition of grad and move, with a parameterized stop condition.
Finally, here's the code for a mean square error loss function:
mse :: (Floating a) => LossFunction a a
mse = let f (y,y') = let gamma = y'-y in gamma**2 / 2
f' (y,y') = (y'-y)
in Evaluator f f'
Evaluator just bundles a loss function and its derivative (for calculating the delta of the output layer).
The rest of the code is up on GitHub: NeuralNetwork.
So, if anyone has an insight into the problem, or even just a sanity check that I'm correctly implementing the algorithm, I'd be grateful.

Do you know about "vanishing" and "exploding" gradients in backpropagation? I'm not too familiar with Haskell so I can't easily see what exactly your backprop is doing, but it does look like you are using a logistic curve as your activation function.
If you look at the plot of this function you'll see that the gradient of this function is nearly 0 at the ends (as input values get very large or very small, the slope of the curve is almost flat), so multiplying or dividing by this during backpropagation will result in a very big or very small number. Doing this repeatedly as you pass through multiple layers causes the activations to approach zero or infinity. Since backprop updates your weights by doing this during training, you end up with a lot of zeros or infinities in your network.
Solution: there are loads of methods out there that you can search for to solve the vanishing gradient problem, but one easy thing to try is to change the type of activation function you are using to a non-saturating one. ReLU is a popular choice as it mitigates this particular problem (but might introduce others).

Related

Latent factor recovery with probabilistic matrix factorization using Edward

I implemented a probabilistic matrix factorization model (R = U'V) following the example in Edward's repo:
# data
U_true = np.random.randn(D, N)
V_true = np.random.randn(D, M)
R_true = np.dot(np.transpose(U_true), V_true) + np.random.normal(0, 0.1, size=(N, M))
# model
I = tf.placeholder(tf.float32, [N, M])
U = Normal(loc=tf.zeros([D, N]), scale=tf.ones([D, N]))
V = Normal(loc=tf.zeros([D, M]), scale=tf.ones([D, M]))
R = Normal(loc=tf.matmul(tf.transpose(U), V), scale=tf.ones([N, M]))
I get a good performance when predicting the data in matrix R. However, when I evaluate the inferred traits in U and V, the error varies a lot and can get very high.
I tried with a latent space of small dimension (e.g. 2) and checked if latent traits weren't simply permuted. They sometimes get permuted but even after realigning them the error is still significant.
To throw some numbers: for a synthetic R matrix generated from U and V both normally distributed (mean 0 and variance 1), I can achieve a mean absolute error of 0.003 on R, but on U and V it's usually around 0.5.
I know this model is symmetric, but I am not sure about the implications. I would like to ask:
Is it actually possible to guarantee the recovery of the original latent traits in some way?
If so, how could it be achieved, preferably using Edward?

Frequency computation and fast fourier transform in Matlab

I have a question related to Fast Fourier transform. I want to calculate the phase and make FFT to draw power spectral density. However when I calculate the frequency f, there are some errors. This is my program code:
n = 1:32768;
T = 0.2*10^-9; % Sampling period
Fs = 1/T; % Sampling frequency
Fn = Fs/2; % Nyquist frequency
omega = 2*pi*200*10^6; % Carrier frequency
L = 32768; % % Length of signal
t = (0:L-1)*T; % Time vector
x_signal(n) = cos(omega*T*n + 0.1*randn(size(n))); % Additive phase noise (random)
y_signal(n) = sin(omega*T*n + 0.1*randn(size(n))); % Additive phase noise (random)
theta(n) = atan(y_signal(n)/x_signal(n));
f = (theta(n)-theta(n-1))/(2*pi)
Y = fft(f,t);
PSD = Y.*conj(Y); % Power Spectral Density
%Fv = linspace(0, 1, fix(L/2)+1)*Fn; % Frequency Vector
As posted, you would get the error 
error: subscript indices must be either positive integers less than 2^31 or logicals
which refers to the operation theta(n-1) when n=1 which results in an index of 0 (which is out of bounds since Matlab uses 1-based indexing). To avoid that could use a subset of indices in n:
f = (theta(n(2:end))-theta(n(1:end-1)))/(2*pi);
That said, if you are doing this to try to obtain an instantaneous measure of the frequency, then you will have a few more issues to deal with. The most trivial one is that you should also divide by T. Not as obvious is the fact that as given, theta is a scalar due to the use of the / operator (see Matlab's mrdivide) rather than the ./ operator which performs element-wise division. So a better expression would be:
theta(n) = atan(y_signal(n)./x_signal(n));
Now, the next problem you might notice is that you are actually losing some phase information since the result of atan is [-pi/2,pi/2] instead of the full [-pi,pi] range. To avoid this you should instead be using atan2:
theta(n) = atan2(y_signal(n), x_signal(n));
Even with this, you are likely to notice that the estimated frequency regularly has spikes whenever the phase jumps between near -pi and near pi. This can be avoided by computing the phase difference modulo 2*pi:
f = mod(theta(n(2:end))-theta(n(1:end-1)),2*pi)/(2*pi*T);
A final thing to note: when calling the fft, you should not be passing in a time variable (the input is implicitly assumed to be sampled at regular time intervals). You may however specify the desired length of the FFT. So, you would thus compute Y as follow:
Y = fft(f, L);
And you could then plot the resulting PSD using:
Fv = linspace(0, 1, fix(L/2)+1)*Fn; % Frequency Vector
plot(Fv, abs(PSD(1:L/2+1)));

Finding parameters of exponentially decaying sinusoids (Matrix Pencil Method)

The matrix pencil method is an algorithm which can be used to find the individual exponential decaying sinusoids' parameters (frequency, amplitude, decay factor and initial phase) in a signal consisting of multiple such signals added. I am trying to implement the algorithm. The algorithm can be found in the paper from this link:
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=370583 OR
http://krein.unica.it/~cornelis/private/IEEE/IEEEAntennasPropagMag_37_48.pdf
In order to test the algorithm, I created a synthetic signal composed of four exponentially decaying sinusoids generated as follows:
fs=2205;
t=0:1/fs:249/fs;
f(1)=80;
f(2)=120;
f(3)=250;
f(4)=560;
a(1)=.4;
a(2)=1;
a(3)=0.89;
a(4)=.65;
d(1)=70;
d(2)=50;
d(3)=90;
d(4)=80;
for i=1:4
x(i,:)=a(i)*exp(-d(i)*t).*cos(2*pi*f(i)*t);
end
y=x(1,:)+x(2,:)+x(3,:)+x(4,:);
I then feed this signal to the algorithm described in the paper as follows:
function [f d] = mpencil(y)
%construct hankel matrix
N = size(y,2);
L1 = ceil(1/3 * N);
L2 = floor(2/3 * N);
L = ceil((L1 + L2) / 2);
fs=2205;
for i=1:1:(N-L)
Y(i,:)=y(i:(i+L));
end
Y1=Y(:,1:L);
Y2=Y(:,2:(L+1));
[U,S,V] = svd(Y);
D=diag(S);
tol=1e-3;
m=0;
l=length(D);
for i=1:l
if( abs(D(i)/D(1)) >= tol)
m=m+1;
end
end
Ss=S(:,1:m);
Vnew=V(:,1:m);
a=size(Vnew,1);
Vs1=Vnew(1:(a-1),:);
Vs2=Vnew(2:end,:);
Y1=U*Ss*(Vs1');
Y2=U*Ss*(Vs2');
D_fil=(pinv(Y1))*Y2;
z = eig(D_fil);
l=length(z);
for i=1:2:l
f((i+1)/2)= (angle(z(i))*fs)/(2*pi);
d((i+1)/2)=-real(z(i))*fs;
end
In the output from the above code, I am correctly getting the four constituent frequency components but am not getting their decaying factors. If anybody has prior experience with this algorithm or has some understanding about why this discrepancy might be there, I would be very grateful for your help. I have tried rewriting the code from a scratch multiple times but it has been of no help, giving the same results.
Any help would be highly appreciated.
I found the problem.
There are two small glitches in the code:
SVD output is a complex conjugate of the right singular matrix—i.e, Vh—and according to IEEE, it needs to be converted to V first.
Now, this V is filtered for reducing the dimension.
After reducing the dimensions of V, V1 and V2 are calculated from V. (In your case, you are using Vh directly for calculating V1 and V2!)
When calculating Y1 and Y2, the complex conjugates of V1 and V2 are used.
You did not consider the absolute magnitude of complex eigen values, but only the real part.
damping coefficient "zeta"= log(magnitude(z))/Ts

Continuous Fourier transform on discrete data using Mathematica?

I have some periodic data, but the amount of data is not a multiple of
the period. How can I Fourier analyze this data? Example:
% Let's create some data for testing:
data = Table[N[753+919*Sin[x/623-125]], {x,1,25000}]
% I now receive this data, but have no idea that it came from the
formula above. I'm trying to reconstruct the formula just from 'data'.
% Looking at the first few non-constant terms of the Fourier series:
ListPlot[Table[Abs[Fourier[data]][[x]], {x,2,20}], PlotJoined->True,
PlotRange->All]
shows an expected spike at 6 (since the number of periods is really
25000/(623*2*Pi) or about 6.38663, though we don't know this).
% Now, how do I get back 6.38663? One way is to "convolve" the data with
arbitrary multiples of Cos[x].
convolve[n_] := Sum[data[[x]]*Cos[n*x], {x,1,25000}]
% And graph the "convolution" near n=6:
Plot[convolve[n],{n,5,7}, PlotRange->All]
we see a spike roughly where expected.
% We try FindMaximum:
FindMaximum[convolve[n],{n,5,7}]
but the result is useless and inaccurate:
FindMaximum::fmmp:
Machine precision is insufficient to achieve the requested accuracy or
precision.
Out[119]= {98.9285, {n -> 5.17881}}
because the function is very wiggly.
% By refining our interval (using visual analysis on the plots), we
finally find an interval where convolve[] doesn't wiggle too much:
Plot[convolve[n],{n,6.2831,6.2833}, PlotRange->All]
and FindMaximum works:
FindMaximum[convolve[n],{n,6.2831,6.2833}] // FortranForm
List(1.984759605826571e7,List(Rule(n,6.2831853071787975)))
% However, this process is ugly, requires human intervention, and
computing convolve[] is REALLY slow. Is there a better way to do this?
% Looking at the Fourier series of the data, can I somehow divine the
"true" number of periods is 6.38663? Of course, the actual result
would be 6.283185, since my data fits that better (because I'm only
sampling at a finite number of points).
Based on Mathematica help for the Fourier function / Applications / Frequency Identification:
Checked on version 7
n = 25000;
data = Table[N[753 + 919*Sin[x/623 - 125]], {x, 1, n}];
pdata = data - Total[data]/Length[data];
f = Abs[Fourier[pdata]];
pos = Ordering[-f, 1][[1]]; (*the position of the first Maximal value*)
fr = Abs[Fourier[pdata Exp[2 Pi I (pos - 2) N[Range[0, n - 1]]/n],
FourierParameters -> {0, 2/n}]];
frpos = Ordering[-fr, 1][[1]];
N[(pos - 2 + 2 (frpos - 1)/n)]
returns 6.37072
Look for the period length using autocorrelation to get an estimate:
autocorrelate[data_, d_] :=
Plus ## (Drop[data, d]*Drop[data, -d])/(Length[data] - d)
ListPlot[Table[{d, autocorrelate[data, d]}, {d, 0, 5000, 100}]]
A smart search for the first maximum away from d=0 may be the best estimate you can get form the available data?
(* the data *)
data = Table[N[753+919*Sin[x/623-125]], {x,1,25000}];
(* Find the position of the largest Fourier coefficient, after
removing the last half of the list (which is redundant) and the
constant term; the [[1]] is necessary because Ordering returns a list *)
f2 = Ordering[Abs[Take[Fourier[data], {2,Round[Length[data]/2+1]}]],-1][[1]]
(* Result: 6 *)
(* Directly find the least squares difference between all functions of
the form a+b*Sin[c*n-d], with intelligent starting values *)
sol = FindMinimum[Sum[((a+b*Sin[c*n-d]) - data[[n]])^2, {n,1,Length[data]}],
{{a,Mean[data]},{b,(Max[data]-Min[data])/2},{c,2*f2*Pi/Length[data]},d}]
(* Result (using //InputForm):
FindMinimum::sszero:
The step size in the search has become less than the tolerance prescribed by
the PrecisionGoal option, but the gradient is larger than the tolerance
specified by the AccuracyGoal option. There is a possibility that the method
has stalled at a point that is not a local minimum.
{2.1375902350021628*^-19, {a -> 753., b -> -919., c -> 0.0016051364365971107,
d -> 2.477886509998064}}
*)
(* Create a table of values for the resulting function to compare to 'data' *)
tab = Table[a+b*Sin[c*x-d], {x,1,Length[data]}] /. sol[[2]];
(* The maximal difference is effectively 0 *)
Max[Abs[data-tab]] // InputForm
(* Result: 7.73070496506989*^-12 *)
Although the above doesn't necessarily fully answer my question, I found it
somewhat remarkable.
Earlier, I'd tried using FindFit[] with Method -> NMinimize (which is
supposed to give a better global fit), but that didn't work well,
possibly because you can't give FindFit[] intelligent starting values.
The error I get bugs me but appears to be irrelevant.

Representing continuous probability distributions

I have a problem involving a collection of continuous probability distribution functions, most of which are determined empirically (e.g. departure times, transit times). What I need is some way of taking two of these PDFs and doing arithmetic on them. E.g. if I have two values x taken from PDF X, and y taken from PDF Y, I need to get the PDF for (x+y), or any other operation f(x,y).
An analytical solution is not possible, so what I'm looking for is some representation of PDFs that allows such things. An obvious (but computationally expensive) solution is monte-carlo: generate lots of values of x and y, and then just measure f(x, y). But that takes too much CPU time.
I did think about representing the PDF as a list of ranges where each range has a roughly equal probability, effectively representing the PDF as the union of a list of uniform distributions. But I can't see how to combine them.
Does anyone have any good solutions to this problem?
Edit: The goal is to create a mini-language (aka Domain Specific Language) for manipulating PDFs. But first I need to sort out the underlying representation and algorithms.
Edit 2: dmckee suggests a histogram implementation. That is what I was getting at with my list of uniform distributions. But I don't see how to combine them to create new distributions. Ultimately I need to find things like P(x < y) in cases where this may be quite small.
Edit 3: I have a bunch of histograms. They are not evenly distributed because I'm generating them from occurance data, so basically if I have 100 samples and I want ten points in the histogram then I allocate 10 samples to each bar, and make the bars variable width but constant area.
I've figured out that to add PDFs you convolve them, and I've boned up on the maths for that. When you convolve two uniform distributions you get a new distribution with three sections: the wider uniform distribution is still there, but with a triangle stuck on each side the width of the narrower one. So if I convolve each element of X and Y I'll get a bunch of these, all overlapping. Now I'm trying to figure out how to sum them all and then get a histogram that is the best approximation to it.
I'm beginning to wonder if Monte-Carlo wasn't such a bad idea after all.
Edit 4: This paper discusses convolutions of uniform distributions in some detail. In general you get a "trapezoid" distribution. Since each "column" in the histograms is a uniform distribution, I had hoped that the problem could be solved by convolving these columns and summing the results.
However the result is considerably more complex than the inputs, and also includes triangles. Edit 5: [Wrong stuff removed]. But if these trapezoids are approximated to rectangles with the same area then you get the Right Answer, and reducing the number of rectangles in the result looks pretty straightforward too. This might be the solution I've been trying to find.
Edit 6: Solved! Here is the final Haskell code for this problem:
-- | Continuous distributions of scalars are represented as a
-- | histogram where each bar has approximately constant area but
-- | variable width and height. A histogram with N bars is stored as
-- | a list of N+1 values.
data Continuous = C {
cN :: Int,
-- ^ Number of bars in the histogram.
cAreas :: [Double],
-- ^ Areas of the bars. #length cAreas == cN#
cBars :: [Double]
-- ^ Boundaries of the bars. #length cBars == cN + 1#
} deriving (Show, Read)
{- | Add distributions. If two random variables #vX# and #vY# are
taken from distributions #x# and #y# respectively then the
distribution of #(vX + vY)# will be #(x .+. y).
This is implemented as the convolution of distributions x and y.
Each is a histogram, which is to say the sum of a collection of
uniform distributions (the "bars"). Therefore the convolution can be
computed as the sum of the convolutions of the cross product of the
components of x and y.
When you convolve two uniform distributions of unequal size you get a
trapezoidal distribution. Let p = p2-p1, q - q2-q1. Then we get:
> | |
> | ______ |
> | | | with | _____________
> | | | | | |
> +-----+----+------- +--+-----------+-
> p1 p2 q1 q2
>
> gives h|....... _______________
> | /: :\
> | / : : \ 1
> | / : : \ where h = -
> | / : : \ q
> | / : : \
> +--+-----+-------------+-----+-----
> p1+q1 p2+q1 p1+q2 p2+q2
However we cannot keep the trapezoid in the final result because our
representation is restricted to uniform distributions. So instead we
store a uniform approximation to the trapezoid with the same area:
> h|......___________________
> | | / \ |
> | |/ \|
> | | |
> | /| |\
> | / | | \
> +-----+-------------------+--------
> p1+q1+p/2 p2+q2-p/2
-}
(.+.) :: Continuous -> Continuous -> Continuous
c .+. d = C {cN = length bars - 1,
cBars = map fst bars,
cAreas = zipWith barArea bars (tail bars)}
where
-- The convolve function returns a list of two (x, deltaY) pairs.
-- These can be sorted by x and then sequentially summed to get
-- the new histogram. The "b" parameter is the product of the
-- height of the input bars, which was omitted from the diagrams
-- above.
convolve b c1 c2 d1 d2 =
if (c2-c1) < (d2-d1) then convolve1 b c1 c2 d1 d2 else convolve1 b d1
d2 c1 c2
convolve1 b p1 p2 q1 q2 =
[(p1+q1+halfP, h), (p2+q2-halfP, (-h))]
where
halfP = (p2-p1)/2
h = b / (q2-q1)
outline = map sumGroup $ groupBy ((==) `on` fst) $ sortBy (comparing fst)
$ concat
[convolve (areaC*areaD) c1 c2 d1 d2 |
(c1, c2, areaC) <- zip3 (cBars c) (tail $ cBars c) (cAreas c),
(d1, d2, areaD) <- zip3 (cBars d) (tail $ cBars d) (cAreas d)
]
sumGroup pairs = (fst $ head pairs, sum $ map snd pairs)
bars = tail $ scanl (\(_,y) (x2,dy) -> (x2, y+dy)) (0, 0) outline
barArea (x1, h) (x2, _) = (x2 - x1) * h
Other operators are left as an exercise for the reader.
No need for histograms or symbolic computation: everything can be done at the language level in closed form, if the right point of view is taken.
[I shall use the term "measure" and "distribution" interchangeably. Also, my Haskell is rusty and I ask you to forgive me for being imprecise in this area.]
Probability distributions are really codata.
Let mu be a probability measure. The only thing you can do with a measure is integrate it against a test function (this is one possible mathematical definition of "measure"). Note that this is what you will eventually do: for instance integrating against identity is taking the mean:
mean :: Measure -> Double
mean mu = mu id
another example:
variance :: Measure -> Double
variance mu = (mu $ \x -> x ^ 2) - (mean mu) ^ 2
another example, which computes P(mu < x):
cdf :: Measure -> Double -> Double
cdf mu x = mu $ \z -> if z < x then 1 else 0
This suggests an approach by duality.
The type Measure shall therefore denote the type (Double -> Double) -> Double. This allows you to model results of MC simulation, numerical/symbolic quadrature against a PDF, etc. For instance, the function
empirical :: [Double] -> Measure
empirical h:t f = (f h) + empirical t f
returns the integral of f against an empirical measure obtained by eg. MC sampling. Also
from_pdf :: (Double -> Double) -> Measure
from_pdf rho f = my_favorite_quadrature_method rho f
construct measures from (regular) densities.
Now, the good news. If mu and nu are two measures, the convolution mu ** nu is given by:
(mu ** nu) f = nu $ \y -> (mu $ \x -> f $ x + y)
So, given two measures, you can integrate against their convolution.
Also, given a random variable X of law mu, the law of a * X is given by:
rescale :: Double -> Measure -> Measure
rescale a mu f = mu $ \x -> f(a * x)
Also, the distribution of phi(X) is given by the image measure phi_* X, in our framework:
apply :: (Double -> Double) -> Measure -> Measure
apply phi mu f = mu $ f . phi
So now you can easily work out an embedded language for measures. There are much more things to do here, particularly with respect to sample spaces other than the real line, dependencies between random variables, conditionning, but I hope you get the point.
In particular, the pushforward is functorial:
newtype Measure a = (a -> Double) -> Double
instance Functor Measure a where
fmap f mu = apply f mu
It is a monad too (exercise -- hint: this very much looks like the continuation monad. What is return ? What is the analog of call/cc ?).
Also, combined with a differential geometry framework, this can probably be turned into something which compute Bayesian posterior distributions automatically.
At the end of the day, you can write stuff like
m = mean $ apply cos ((from_pdf gauss) ** (empirical data))
to compute the mean of cos(X + Y) where X has pdf gauss and Y has been sampled by a MC method whose results are in data.
Probability distributions form a monad; see eg the work of Claire Jones and also the LICS 1989 paper, but the ideas go back to a 1982 paper by Giry (DOI 10.1007/BFb0092872) and to a 1962 note by Lawvere that I cannot track down (http://permalink.gmane.org/gmane.science.mathematics.categories/6541).
But I don't see the comonad: there's no way to get an "a" out of an "(a->Double)->Double". Perhaps if you make it polymorphic - (a->r)->r for all r? (That's the continuation monad.)
Is there anything that stops you from employing a mini-language for this?
By that I mean, define a language that lets you write f = x + y and evaluates f for you just as written. And similarly for g = x * z, h = y(x), etc. ad nauseum. (The semantics I'm suggesting call for the evaluator to select a random number on each innermost PDF appearing on the RHS at evaluation time, and not to try to understand the composted form of the resulting PDFs. This may not be fast enough...)
Assuming that you understand the precision limits you need, you can represent a PDF fairly simply with a histogram or spline (the former being a degenerate case of the later). If you need to mix analytically defined PDFs with experimentally determined ones, you'll have to add a type mechanism.
A histogram is just an array, the contents of which represent the incidence in a particular region of the input range. You haven't said if you have a language preference, so I'll assume something c-like. You need to know the bin-structure (uniorm sizes are easy, but not always best) including the high and low limits and possibly the normalizatation:
struct histogram_struct {
int bins; /* Assumed to be uniform */
double low;
double high;
/* double normalization; */
/* double *errors; */ /* if using, intialize with enough space,
* and store _squared_ errors
*/
double contents[];
};
This kind of thing is very common in scientific analysis software, and you might want to use an existing implementation.
I worked on similar problems for my dissertation.
One way to compute approximate convolutions is to take the Fourier transform of the density functions (histograms in this case), multiply them, then take the inverse Fourier transform to get the convolution.
Look at Appendix C of my dissertation for formulas for various special cases of operations on probability distributions. You can find the dissertation at: http://riso.sourceforge.net
I wrote Java code to carry out those operations. You can find the code at: https://sourceforge.net/projects/riso
Autonomous mobile robotics deals with similar issue in localization and navigation, in particular the Markov localization and Kalman filter (sensor fusion). See An experimental comparison of localization methods continued for example.
Another approach you could borrow from mobile robots is path planning using potential fields.
A couple of responses:
1) If you have empirically determined PDFs they either you have histograms or you have an approximation to a parametric PDF. A PDF is a continuous function and you don't have infinite data...
2) Let's assume that the variables are independent. Then if you make the PDF discrete then P(f(x,y)) = f(x,y)p(x,y) = f(x,y)p(x)p(y) summed over all the combinations of x and y such that f(x,y) meets your target.
If you are going to fit the empirical PDFs to standard PDFs, e.g. the normal distribution, then you can use already-determined functions to figure out the sum, etc.
If the variables are not independent, then you have more trouble on your hands and I think you have to use copulas.
I think that defining your own mini-language, etc., is overkill. you can do this with arrays...
Some initial thoughts:
First, Mathematica has a nice facility for doing this with exact distributions.
Second, representation as histograms (ie, empirical PDFs) is problematic since you have to make choices about bin size. That can be avoided by storing a cumulative distribution instead, ie, an empirical CDF. (In fact, you then retain the ability to recreate the full data set of samples that the empirical distribution is based on.)
Here's some ugly Mathematica code to take a list of samples and return an empirical CDF, namely a list of value-probability pairs. Run the output of this through ListPlot to see a plot of the empirical CDF.
empiricalCDF[t_] :=
Flatten[{{#[[2,1]],#[[1,2]]},#[[2]]}&/#Partition[Prepend[Transpose[{#[[1]],
Rest[FoldList[Plus,0,#[[2]]]]/Length[t]}&[Transpose[{First[#],Length[#]}&/#
Split[Sort[t]]]]],{Null,0}],2,1],1]
Finally, here's some information on combining discrete probability distributions:
http://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/Chapter7.pdf
I think the histograms or the list of 1/N area regions is a good idea. For the sake of argument, I'll assume that you'll have a fixed N for all distributions.
Use the paper you linked edit 4 to generate the new distribution. Then, approximate it with a new N-element distribution.
If you don't want N to be fixed, it's even easier. Take each convex polygon (trapezoid or triangle) in the new generated distribution and approximate it with a uniform distribution.
Another suggestion is to use kernel densities. Especially if you use Gaussian kernels, then they can be relatively easy to work with... except that the distributions quickly explode in size without care. Depending on the application, there are additional approximation techniques like importance sampling that can be used.
If you want some fun, try representing them symbolically like Maple or Mathemetica would do. Maple uses directed acyclic graphs, while Matematica uses a list/lisp like appoach (I believe, but it's been a loooong time, since I even thought about this).
Do all your manipulations symbolically, then at the end push through numerical values. (Or just find a way to launch off in a shell and do the computations).
Paul.

Resources