Importance of a Random Variable using Entropy or other method - random

I have a two-dimensional random vector x = [x1, x2]^T with a known joint probability density function (PDF). The PDF is non-Gaussian and the two entries of the random vector are statistically dependent. I need to show that, for example, x1 is more important than x2 in terms of the amount of information that it carries. Is there a classical solution to this problem? Can I show that, for example, n% of the total information carried by x is in x1 and (100-n)% is in x2?
I assume that the standard way of measuring the amount of information is to calculate the entropy. Any clues?

Related

Where is the gaussian distribution function in the pseudocode below?

I am working on my final assignment, and I chose the Box-Muller Gaussian distribution method to generate random numbers in Unity.
I am very confused about where the Gaussian distribution function is in the pseudocode that I found in one of the journals.
Pseudocode for the Box-Muller algorithm (Sukajaya et al., 2012):
a. Generate uniform random numbers u, v in the range [-1, 1]
b. Calculate s = u^2 + v^2
c. Repeat steps a and b until 0 < s < 1
d. Find normal random numbers `z0 = u . √((-2 ln s)/s)` and `z1 = v . √((-2 ln s)/s)`
I think the pseudocode only talks about Box-Muller, and the Gaussian Distribution function is only for displaying diagrams of randomized numbers.
The Box-Muller algorithm does not contain a direct implementation of the Gaussian density formula. Instead, it produces outcomes which (cumulatively) follow that density. The results z0 and z1 produced by the algorithm are two independent Gaussian random values. If you iterate the algorithm hundreds or thousands of times and build a histogram of all the z values, it will start looking like the bell-shaped curve of a Gaussian distribution. The math behind it is beyond the scope of a StackOverflow post, so I'm going to advise that you just push the "I believe!" button, or see the Wikipedia article if you want more explanation and links to various original sources.
I'm not sure what you mean when you say "the Gaussian Distribution function is only for displaying diagrams of randomized numbers." The Gaussian is one of the most important modeling distributions out there because suitably normalized sums of independent values from any distribution with finite variance converge to the Gaussian in distribution (this is the central limit theorem). That means if you're studying averages (which are built from sums) or aggregates of lots of little errors, the Gaussian distribution does a great job of characterizing the results.
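As a rough illustration of the pseudocode in the question, here is a minimal Python sketch of the polar form of Box-Muller. The function names and the use of Python's `random` module are my own choices for illustration, not something taken from the cited paper. Note that the Gaussian density formula never appears in it; the Gaussian shape only shows up in the statistics of the returned values.

```python
import math
import random


def polar_box_muller(rng=random):
    """Return two independent standard-normal samples (polar Box-Muller form)."""
    while True:
        # Steps a-c: draw (u, v) uniformly in the square [-1, 1] x [-1, 1]
        # and reject the pair unless it lies strictly inside the unit circle.
        u = rng.uniform(-1.0, 1.0)
        v = rng.uniform(-1.0, 1.0)
        s = u * u + v * v
        if 0.0 < s < 1.0:
            break
    # Step d: both coordinates are scaled by the same factor.
    factor = math.sqrt(-2.0 * math.log(s) / s)
    return u * factor, v * factor


def sample_normals(n_pairs, seed=0):
    """Collect 2 * n_pairs samples using a seeded generator (for repeatability)."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_pairs):
        values.extend(polar_box_muller(rng))
    return values
```

Building a histogram of `sample_normals(20000)` should approach the bell-shaped curve, with a sample mean near 0 and a sample variance near 1.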

optimize integral f(x)exp(-x) from x=0,infinity

I need a robust integration algorithm for f(x)exp(-x) between x=0 and infinity, with f(x) a positive, differentiable function.
I do not know the array x a priori (it's an intermediate output of my routine). The x array is typically ~log-equispaced, but highly irregular.
Currently, I'm using the Simpson algorithm, but my problem is that the domain is often highly undersampled by the x array, which produces unrealistic values for the integral.
On each run of my code I need to do this integration thousands of times (each with a different set of x values), so I need to find an efficient and robust way to integrate this function.
More details:
The x array can have between 2 and N points (N known). The first value is always x[0] = 0.0. The last point is always a value greater than a tunable threshold x_max (such that exp(-x_max) approx 0). I only know the values of f at the points x[i] (though the function is a smooth function).
My first idea was to do a Laguerre-Gauss quadrature integration. However, this algorithm seems to be highly unreliable when one does not use the optimal quadrature points.
My current idea is to add a set of auxiliary points, interpolating f, such that the Simpson algorithm becomes more stable. If I do this, is there an optimal selection of auxiliary points?
I'd appreciate any advice,
Thanks.
Set t=1-exp(-x), then dt = exp(-x) dx and the integral value is equal to
integral[ f(-log(1-t)) , t=0..1 ]
which you can evaluate with the standard Simpson formula and hopefully get good results.
Note that piecewise linear interpolation will always result in an order 2 error for the integral, as the result amounts to a trapezoid formula even if the method was Simpson. For better errors in the Simpson method you will need higher interpolation degrees, ideally cubic splines. Cubic Bezier polynomials with estimated derivatives to compute the control points could be a fast compromise.
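A minimal Python sketch of the substitution, under some assumptions of mine: the samples are given as plain lists, x is increasing with x[0] = 0, and the tail beyond the last point is neglected (its width in t is exp(-x[-1])). For simplicity it uses the composite trapezoid rule on the non-uniform t grid rather than Simpson, so it is a stand-in with order 2 error, not the higher-order variant discussed above. The function name is hypothetical.

```python
import math


def integrate_weighted(x, f_values):
    """Approximate the integral of f(x) * exp(-x) over [0, inf) from samples.

    Applies t = 1 - exp(-x), which turns the integral into the integral of
    f over t in [0, 1], then uses the trapezoid rule on the resulting grid.
    """
    if len(x) != len(f_values) or len(x) < 2:
        raise ValueError("need at least two matching samples")
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for small x.
    t = [-math.expm1(-xi) for xi in x]
    total = 0.0
    for i in range(len(t) - 1):
        total += 0.5 * (f_values[i] + f_values[i + 1]) * (t[i + 1] - t[i])
    return total
```

Because the weight exp(-x) is absorbed into the spacing of the t grid, the exponential decay no longer has to be resolved by the quadrature rule itself; only f needs to be well sampled in t.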

Two different Covariance Matrices?

I am a little bit confused!
Assume we have observed the data X = [x1,..,xn], where the xi are vectors in R^d (with zero mean).
X^T denotes the transpose of X.
Sometimes I see the covariance matrix in the form 1/n * X*X^T (e.g. Principal Component Analysis) and sometimes I see it in the form 1/n * X^T*X (e.g. kernel covariance matrix with kernel k(x,y) = x^T*y).
So why are there two different forms, or am I mixing things up? Thank you for your help.
Well, the results differ in their dimension. One is an nxn matrix, the other is a dxd matrix. If the xi are the columns of X, then X*X^T is dxd and X^T*X is nxn.
I don't know the application for the nxn result, but when I used the covariance matrix to describe the variation of a vector in R^d (with measurements X = [x1,..,xn]), the result had to be a dxd matrix, whose eigenvectors and eigenvalues indicate the main axes and extents of a "variance ellipsoid" (which must be given in dxd).
PS: Only half an answer, I know
Addendum:
Kernels are used for creating inner products of pairwise features, thus reducing the dimension to 1 to find patterns more easily. Have a look at
http://en.wikipedia.org/wiki/Kernel_principal_component_analysis#Introduction_of_the_Kernel_to_PCA
to get an impression of what the kernel covariance matrix is used for.
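To make the dimension argument concrete, here is a small pure-Python sketch. The toy data and the helper names are made up for illustration, and it assumes the samples are stored as the columns of X. With d = 2 and n = 3, X*X^T/n is the dxd covariance matrix and X^T*X/n is the nxn matrix of scaled inner products (the Gram matrix of the linear kernel k(x,y) = x^T*y).

```python
def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*a)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


# d = 2 features, n = 3 zero-mean samples stored as the COLUMNS of X.
X = [[1.0, -1.0, 0.0],
     [2.0, 0.0, -2.0]]
n = len(X[0])

# d x d covariance matrix (the PCA form).
cov = [[v / n for v in row] for row in matmul(X, transpose(X))]
# n x n matrix of pairwise inner products (the linear-kernel form).
gram = [[v / n for v in row] for row in matmul(transpose(X), X)]

trace_cov = sum(cov[i][i] for i in range(len(cov)))
trace_gram = sum(gram[i][i] for i in range(len(gram)))
```

The two traces agree, and more generally the two matrices share their nonzero eigenvalues, which is why kernel PCA can work with the nxn form when d is large or only inner products are available.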

Epanechnikov multivariate density

I have data which consists of vectors of size 1x5, each representing a pixel: [x,y,r,g,b]. x and y are the position: 0 <= x <= M, 0 <= y <= N. r,g,b are the colors of the pixel: 0 <= r,g,b <= 255.
I want to perform density estimation using the multivariate Epanechnikov kernel. I read that there are basically 2 ways to do that:
1. Multiplicative method - calculate the kernel for each dimension and then multiply them.
2. Calculate the norm of the vector and calculate the kernel for that value.
How exactly would each of the two methods work with my data? What do I need to normalize knowing that the Epanechnikov kernel yields 0 for normalized values > 1 or < -1?
I am programming in C++.
1. Multiplicative method - calculate the kernel for each dimension and then multiply them.
2. Calculate the norm of the vector and calculate the kernel for that value.
Method 1 assumes that your variables x and y are statistically independent, an assumption that method 2 does not make. On the other hand, method 2 is a radially symmetric kernel.
How exactly would each of the two methods work with my data?
I would try both and see which one gives a better result (e.g. which one gives a better likelihood on the data but taking care not to overfit the data e.g. by using cross validation).
In its most basic form, this means that you split your sample, use one part to calculate the density estimate (i.e. place kernels around the data points), and evaluate the likelihood on the other part: the product of the values of the density estimate at the test points or, better, the sum of their logarithms. Then see which method gives the higher likelihood on the 'other' sample (the one NOT used for calculating the estimate).
The same argument (cross validation) also applies to the choice of the width of the kernel ('scaling factor', make the kernel narrow or broad).
You can of course just select a kernel width by hand to start with. Choosing the kernel width too small will give a 'spiky' density estimate, choosing it too large will 'wash out' the important features of your data.
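The held-out likelihood idea can be sketched in a few lines of Python. This is a one-dimensional toy version with hypothetical function names. The `floor` argument is an assumption of mine: because the Epanechnikov kernel has compact support, a test point can get density exactly 0, and the floor keeps the logarithm finite while still penalizing such points heavily.

```python
import math
import random


def epanechnikov(u):
    """1D Epanechnikov kernel, normalized on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def kde(x, train, h):
    """Kernel density estimate at x with kernel width h."""
    return sum(epanechnikov((x - xi) / h) for xi in train) / (len(train) * h)


def held_out_log_likelihood(train, test, h, floor=1e-300):
    """Mean log density of the test points under the estimate built from train."""
    return sum(math.log(max(kde(x, train, h), floor)) for x in test) / len(test)


def pick_bandwidth(train, test, candidates):
    """Return the candidate width with the highest held-out log-likelihood."""
    return max(candidates, key=lambda h: held_out_log_likelihood(train, test, h))
```

A very narrow width leaves many held-out points outside every kernel and scores badly, while a very broad one flattens the estimate, so the score tends to peak somewhere in between. The same comparison can be used to choose between the product and the radial kernel.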
What do I need to normalize knowing that the Epanechnikov kernel yields 0 for normalized values > 1 or < -1?
The feature you mention is not related to the normalization. You should use a normalized expression for the kernel itself, i.e. the integral over the range where the kernel is non-zero should be one. For your case 1., if the 1D kernels are normalized (which is the case, for example, for 3/4*(1-u^2) on [-1..1]), the 2D product will also be normalized. For case 2., one has to calculate the 2D integral.
Assuming the kernel is normalized, you can then normalize the density estimate as follows:
p(x,y) = (1/N) * sum_{i=1..N} K(x - x_i, y - y_i)
where K is the kernel, (x_i, y_i) are the data points and N is the number of data points. This will be normalized, i.e. the integral of p(x,y) over the 2D plane is one.
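Here is a hedged Python sketch of both kernel forms and the resulting density estimate. The function names are hypothetical. The kernel width h and the matching factor 1/h^d are additions of mine so that the estimate stays normalized when the kernel is rescaled. For the radial form, the constant (d+2)/(2*V_d), with V_d the volume of the unit ball in d dimensions, is the standard normalization of 1 - |u|^2 on the unit ball. For the pixel data one would presumably rescale each coordinate first so that positions and colors have comparable ranges.

```python
import math


def epanechnikov_1d(u):
    """Normalized 1D kernel 3/4 * (1 - u^2) on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def product_kernel(u):
    """Method 1: product of 1D kernels, one per coordinate."""
    result = 1.0
    for ui in u:
        result *= epanechnikov_1d(ui)
    return result


def radial_kernel(u):
    """Method 2: kernel of the Euclidean norm, normalized over the unit d-ball."""
    d = len(u)
    r2 = sum(ui * ui for ui in u)
    if r2 > 1.0:
        return 0.0
    unit_ball_volume = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    return (d + 2) / (2.0 * unit_ball_volume) * (1.0 - r2)


def density(point, data, h, kernel):
    """p(point) = 1 / (N * h^d) * sum_i K((point - x_i) / h)."""
    d = len(point)
    total = sum(kernel([(p - x) / h for p, x in zip(point, xi)]) for xi in data)
    return total / (len(data) * h ** d)
```

For example, `density([0.0, 0.0], data, 1.0, radial_kernel)` evaluates the radial estimate at the origin for a list of 2D points `data`; passing `product_kernel` instead switches to the multiplicative method.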
Note that neither of the functional forms you mentioned allows arbitrary covariance matrices. One way to work around this is to first 'decorrelate' the dataset (i.e. apply a matrix transformation such that the covariance matrix of the dataset becomes the unit matrix), then perform the density estimate, and then apply the inverse transformation.
There are also extensions such as adaptive kernel density estimation, where the width of the kernel itself varies as a function of x and y, in case you want to refine your estimate at some point.

Pseudorandom Number Generation with Specific Non-Uniform Distributions

I'm writing a program that simulates various random walks (with differing distributions). At each timestep, I need randomly generated, two dimensional step distances and angles from the distribution of the random walk. I'm hoping someone can check my understanding of how to generate these random numbers.
As I understand it I can use Inverse Transform Sampling as follows:
Suppose f(x) is the pdf of our random walk, which has a non-uniform distribution, and y is a random number drawn from a uniform distribution.
Then if we set f(x) = y and solve for x, we have a random number from the non-uniform distribution.
Is this a feasible solution?
Not quite. The function that needs to be inverted is not f(x), the pdf, but F(x)=P(X<=x)=int_{-inf}^{x}f(t)dt, the cdf. The good thing is that F is monotone, so actually has a unique inverse (unlike f).
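As a small illustration of inverting the cdf rather than the pdf, here is a Python sketch for a case where the inverse is available in closed form. The exponential distribution is my choice of example, not something from the question: with F(x) = 1 - exp(-rate*x), solving F(x) = y gives x = -ln(1 - y) / rate.

```python
import math
import random


def sample_exponential(rate, rng=random):
    """Draw one sample with cdf F(x) = 1 - exp(-rate * x) by inverting F."""
    y = rng.random()  # uniform in [0, 1), so 1 - y lies in (0, 1]
    return -math.log(1.0 - y) / rate
```

Averaging many samples from `sample_exponential(2.0)` should give a value close to the theoretical mean 1/rate = 0.5.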
There are multiple other ways of generating random numbers according to a given distribution. For example, if the cdf F is difficult to compute or to invert, rejection sampling can be a good option if f is easy to compute.
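Rejection sampling can be sketched as follows. The target density f(x) = 6x(1-x) on [0, 1] and the uniform proposal are assumptions chosen only to keep the example short; the method needs a proposal you can sample from and a bound on f (here 1.5, the maximum of f).

```python
import random


def target_pdf(x):
    """Example target density: f(x) = 6 x (1 - x) on [0, 1], zero elsewhere."""
    return 6.0 * x * (1.0 - x) if 0.0 <= x <= 1.0 else 0.0


PDF_MAX = 1.5  # maximum of target_pdf, attained at x = 0.5


def rejection_sample(rng=random):
    """Draw one sample from target_pdf using a uniform proposal on [0, 1]."""
    while True:
        x = rng.random()                 # candidate from the proposal
        y = rng.uniform(0.0, PDF_MAX)    # uniform height under the bound
        if y <= target_pdf(x):           # keep the candidate if it falls under f
            return x
```

Only f itself is evaluated, never its cdf or the inverse. The price is wasted draws: here about one candidate in three is rejected, since the area under f is 1 and the area of the bounding box is 1.5.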
You are close, but not quite. Every probability density function (pdf) has a corresponding cumulative distribution function (CDF). An important property of CDF(x) is that its values always lie between 0 and 1. Because it is relatively easy to draw a random number between 0 and 1, we can use that to work our way backwards to the distribution. So changing the word pdf to CDF in your question makes the statement correct.
As an aside, for this to make sense computationally you need an inverse of the CDF that is easy to calculate. One way to do this is to fit a polynomial approximation to the CDF and find the inverse of that function. There are more advanced techniques for simulating messy probability distributions. See this book chapter for the details.
