I am doing a calculation that involves too many for-loops. I would appreciate any idea that could eliminate some of the loops and make the algorithm more efficient. Here is the mathematical expression I want to compute: a discrete distribution of the random variable Y.
Pr(Y=y) = Σ_z Pr(Z=z) · Σ_x Pr(X=x) · Σ_w Pr(W=w) · Σ_r Pr(R=r|W=w) · Pr(S=z+y-x-r|W=w)
Y, Z, X, W, R, S are discrete random variables, and they are dependent. I know the expression for each term, but these are just probability calculations – not closed-form distributions.
array Y[max_Y+1];   % store the distribution of Y
temp1=0; temp2=0; temp3=0; temp4=0;   % summation for partial distributions
for y = 0 : max_Y
    temp1=0;
    for z = 0 : 5-y
        temp2=0;
        for x = 0 : 5
            temp3=0;
            for w = 0 : 5
                temp4=0;
                for r = 0 : w
                    temp4 = temp4 + Pr(R=r|W=w)*Pr(S=z+y-x-r|W=w);
                end
                temp3 = temp3 + temp4*Pr(W=w);
            end
            temp2 = temp2 + temp3*Pr(X=x);
        end
        temp1 = temp1 + temp2*Pr(Z=z);
    end
    Y[y] = temp1;
end
Thanks a lot!
Ester
From what I notice, in every iteration only the terms Pr(S=z+y-x-r|W=w) and Pr(Z=z) depend on your function input variable y, so all the other values can be precomputed using separate for-loops; then you only combine Pr(S=z+y-x-r|W=w)*Pr(Z=z) with the precomputed values.
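One way to act on that: the whole w/r double sum depends on z, y, x only through the combination s = z + y - x, so it can be tabulated once up front. Here is a rough sketch in Python (not the poster's pseudocode language; pr_Z, pr_X, pr_W, pr_R_given_W and pr_S_given_W are hypothetical stand-ins for your probability terms, and max_Y = 5 is assumed from the loop limits):

# Sketch of the precomputation idea; the pr_* functions are hypothetical
# stand-ins for your probability terms.
max_Y = 5   # assumed bound, matching the loop limits above

# The w/r double sum depends on z, y, x only through s = z + y - x,
# so tabulate it once for every reachable s (-5..5 here).
inner = {}
for s in range(-5, 6):
    total = 0.0
    for w in range(6):
        acc = 0.0
        for r in range(w + 1):
            acc += pr_R_given_W(r, w) * pr_S_given_W(s - r, w)
        total += acc * pr_W(w)
    inner[s] = total

Y = [0.0] * (max_Y + 1)
for y in range(max_Y + 1):
    for z in range(0, 5 - y + 1):
        for x in range(6):
            Y[y] += pr_Z(z) * pr_X(x) * inner[z + y - x]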
I am going through my book, and it states: "Write a sampling algorithm for this density function"
y = x^2 + (2/3)*x + 1/3; 0 < x < 1
Or can I use Monte Carlo?
Any help would be appreciated!
I'm assuming you mean you want to generate random x values that have the distribution specified by density y(x).
It's often desirable to derive the cumulative distribution function by integrating the density, and then use inverse transform sampling to generate x values. In your case the CDF is a third-order polynomial which doesn't factor to yield a simple cube-root solution, so you would have to use a numerical solver to find the inverse. Time to consider alternatives.
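(For concreteness: integrating the density gives the CDF F(x) = x^3/3 + x^2/3 + x/3 on 0 < x < 1, so the inverse transform means solving the cubic x^3 + x^2 + x - 3u = 0 for every uniform draw u.)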
Another option is to use the acceptance/rejection method. After checking the derivative, it's clear that your density is convex, so it's easy to create a bounding function b(x) by drawing a straight line from f(0) to f(1). This yields b(x) = 1/3 + 5x/3. This bounding function has area 7/6, while your f(x) has an area of 1, since it is a valid density. Consequently, 6/7 of points generated uniformly under b(x) will also fall under f(x), and only 1 out of 7 attempts will fail in the rejection scheme. Here's a plot of f(x) and b(x):
Since b(x) is linear, it is easy to generate x values using it as a distribution after scaling by 6/7 to make it a valid distribution function. The algorithm, expressed in pseudocode, then becomes:
function generate():
while TRUE:
x <- (sqrt(1 + 35 * U(0,1)) - 1) / 5 # inverse CDF transform of b(x)
if U(0, b(x)) <= f(x):
return x
end while
end function
where U(a,b) means generate a value uniformly distributed between a and b, f(x) is your density, and b(x) is the bounding function described above.
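For anyone wondering where the x <- (sqrt(1 + 35 * U(0,1)) - 1) / 5 line comes from: scaling b(x) by 6/7 gives the density (2 + 10x)/7 with CDF (2x + 5x^2)/7; setting that equal to u and solving 5x^2 + 2x - 7u = 0 yields x = (sqrt(1 + 35u) - 1) / 5. Below is a minimal Python sketch of the same scheme (f and b are spelled out from the formulas above; it is a translation of the pseudocode, not part of the original answer):

import math
import random

def f(x):
    # target density from the question
    return x**2 + (2.0/3.0)*x + 1.0/3.0

def b(x):
    # linear bounding function through f(0) = 1/3 and f(1) = 2
    return 1.0/3.0 + 5.0*x/3.0

def generate():
    while True:
        # inverse-CDF sample from b(x) scaled to a valid density
        x = (math.sqrt(1.0 + 35.0*random.random()) - 1.0) / 5.0
        # accept x with probability f(x)/b(x)
        if random.uniform(0.0, b(x)) <= f(x):
            return x

samples = [generate() for _ in range(100000)]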
I implemented the algorithm described above to generate 100,000 candidate values, of which 14,199 (~1/7) were rejected, as expected. The end results are presented in the following histogram, which you can compare to f(x) in the plot above.
I'm assuming that you have a function y(x), which takes a value between [0,1] and returns the value of y. You just need to provide a random value of x and return the corresponding value of y.
import numpy

def getSample():
    # get uniform random number
    x = numpy.random.random()
    # sample my custom function
    return y(x)
I have a small question and I would be very happy if you could give me a solution, or any idea towards a solution, for the probability distribution in the following setup:
I have a random variable x which follows an exponential distribution with parameter lambda1, and one more variable y which follows an exponential distribution with parameter lambda2. z is a discrete value. How can I define the probability distribution of k in the following formula?
k=z-x-y
Thank you so much
OK, let's start by rewriting the formula a bit:
k = z - x - y = -(x + y) + z = -(x + y + (-z))
The part in the parentheses looks manageable. Let's start with x + y. For independent random variables x and y, the PDF of their sum is the convolution of their PDFs.
q = x + y
PDF_q(q) = ∫ PDF_x(q - t) · PDF_y(t) dt
For x and y being exponential, this convolution integral is known in closed form: when the lambdas are different it is the expression given on the Exponential distribution wiki (the hypoexponential case), and when the lambdas are equal it is Gamma(2, lambda), Gamma being the Gamma distribution.
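For reference, that closed form is
PDF_{x+y}(q) = lambda1*lambda2/(lambda2 - lambda1) * (exp(-lambda1*q) - exp(-lambda2*q)),  q >= 0,
when lambda1 != lambda2, and lambda^2 * q * exp(-lambda*q) when lambda1 = lambda2 = lambda.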
If z is some constant discrete value, then we can express it as a continuous RV with PDF
PDF(t) = δ(t + z)
where δ is the Dirac delta function, and we take into account that the peak is at -z, as expected. It is normalized, so its integral over t is equal to 1. This is easily extended to a discrete RV, as a sum of δ-functions at the possible values, multiplied by probabilities that sum to 1.
Again, we have a sum of two RVs with known PDFs, and the solution is a convolution, which is easy to compute thanks to the sifting property of the δ-function. So the final PDF of x + y + (-z), evaluated at q, is
PDF(q + z)
where PDF is the sum expression from the Exponential distribution wiki, or from the Gamma distribution wiki when the lambdas are equal.
You just have to negate the argument (PDF_k(t) = PDF_{x+y-z}(-t)), and that's it.
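If you want to sanity-check the result numerically, here is a small Python/numpy sketch (not from the answer; it assumes x and y are independent and picks lambda1 = 1, lambda2 = 2, z = 3 purely for illustration), comparing the closed-form PDF of k = z - x - y with a Monte Carlo estimate:

import numpy as np

lambda1, lambda2, z = 1.0, 2.0, 3.0   # example values, assumed for illustration

def pdf_k(k):
    # k = z - (x + y), so evaluate the PDF of x + y at q = z - k
    q = z - k
    if q < 0:
        return 0.0
    return lambda1*lambda2/(lambda2 - lambda1) * (np.exp(-lambda1*q) - np.exp(-lambda2*q))

rng = np.random.default_rng(0)
n = 500000
samples = z - rng.exponential(1/lambda1, n) - rng.exponential(1/lambda2, n)

# density estimate from the samples in a narrow window around k = 1.0
width = 0.05
empirical = np.mean(np.abs(samples - 1.0) < width/2) / width
print(pdf_k(1.0), empirical)   # the two numbers should be close (about 0.234)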
Calculation of Average clustering coefficient of a graph
I am getting the correct result, but it takes a huge amount of time when the graph dimension increases. I need some alternative so that it takes less time to execute. Is there any way to simplify the code?
%// A is adjacency matrix N X N,
%// d is degree ,
N=100;
d=10;
rand('state',0)
A = zeros(N,N);
kv=d*(d-1)/2;
%% Creating A matrix %%%
for i = 1:(d*N/2)
j = floor(N*rand)+1;
k = floor(N*rand)+1;
while (j==k)||(A(j,k)==1)
j = floor(N*rand)+1;
k = floor(N*rand)+1;
end
A(j,k)=1;
A(k,j)=1;
end
%% Calculation of clustering Coeff %%
for i=1:N
J=find(A(i,:));
et=0;
for ii=1:(size(J,2))-1
for jj=ii+1:size(J,2)
et=et+A(J(ii),J(jj));
end
end
Cv(i)=et/kv;
end
Avg_clustering_coeff=sum(Cv)/N;
Output I got.
Avg_clustering_coeff = 0.1107
That Calculation of clustering Coeff part could be vectorized using nchoosek to remove the innermost two nested loops, like so -
CvOut = zeros(1,N);
for k=1:N
J=find(A(k,:));
if numel(J)>1
idx = nchoosek(J,2);
CvOut(k) = sum(A(sub2ind([N N],idx(:,1),idx(:,2))));
end
end
CvOut=CvOut/kv;
Hopefully, this would boost up the performance quite a bit!
To speed up your code you can read my comment, but you are not going to reduce the computation time drastically, because the time complexity doesn't change.
But if you don't need an exact result, you can use probability.
probnum = cumsum(1:d);
probnum = mean(probnum(end-1:end)); %theoretical number of elements created by your second loop (for each row).
probfind = d*N/(N^2); %probability of finding a non zero value.
coeff = probnum*probfind/kv;
This probabilistic coeff is going to be equal to Avg_clustering_coeff for big N.
So you can use the normal method for small N and this method for big N.
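To make that concrete with the numbers from the question (N = 100, d = 10): cumsum(1:10) ends with [45 55], so probnum = 50; probfind = 10*100/100^2 = 0.1; kv = 45; and coeff = 50*0.1/45 ≈ 0.111, which is indeed close to the looped result of 0.1107.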
I have a scalar function f([x,y],[i,j])= exp(-norm([x,y]-[i,j])^2/sigma^2) which receives two 2-dimensional vectors as input (norm here implements the Euclidean norm). The values of x,i range in 1:w and the values y,j range in 1:h. I want to create a cell array X such that X{x,y} will contain a w x h matrix such that X{x,y}(i,j) = f([x,y],[i,j]). This can obviously be done using 4 nested loops like so:
for x=1:w;
for y=1:h;
X{x,y}=zeros(w,h);
for i=1:w
for j=1:h
X{x,y}(i,j)=f([x,y],[i,j])
end
end
end
end
This is however extremely inefficient. I would very much appreciate an efficient way to create X.
One way to do this is to remove the 2 innermost loops and replace them with a vectorised version. By the look of your f function this shouldn't be too bad.
First we need to construct two w x h matrices, one containing 1 to w down every column and one containing 1 to h along every row, like so
wMat=repmat(1:w,h,1)';
hMat=repmat(1:h,w,1);
These represent the inner two loops, and the transpose makes the two matrices line up so that every (i,j) combination appears in matching positions. Now we can vectorise the calculation (f([x,y],[i,j]) = exp(-norm([x,y]-[i,j])^2/sigma^2)):
for x=1:w
    for y=1:h
        temp1=(x-wMat).^2+(y-hMat).^2;   % squared distances from [x,y] to every [i,j]
        X{x,y}=exp(-temp1/sigma^2);
    end
end
Here we compute the squared Euclidean distances for all pairs handled by the inner loops at once.
Some discussion and code
The trick here is to perform the norm-calculations with numeric arrays and save the results into a cell array version as late as possible. For performing the norm-calculations you can take help of ndgrid, bsxfun and some permute + reshape to give it the "shape" as needed for the final cell array version. So, here's the vectorized approach to perform these tasks -
%// Create x-y/i-j values to be used for calculation of function values
[xi,yi] = ndgrid(1:w,1:h);
%// Get the norm values
normvals = sqrt(bsxfun(@minus,xi(:),xi(:).').^2 + ...
    bsxfun(@minus,yi(:),yi(:).').^2);
%// Get the actual function values
vals = exp(-normvals.^2/sigma^2);
%// Get the values into blocks of a 4D array and then re-arrange to match
%// with the shape of numeric array version of X
blks = reshape(permute(reshape(vals, w*h, h, []), [2 1 3]), h, w, h, w);
arranged_blks = reshape(permute(blks,[2 3 1 4]),w,h,w,h);
%// Finally get the cell array version
X = squeeze(mat2cell(arranged_blks,w,h,ones(1,w),ones(1,h)));
Benchmarking and runtimes
After improving the original loopy code with pre-allocation for X and inlining of f, runtime benchmarks were performed with it against the proposed vectorized approach with datasizes w, h = 60, and the runtime results thus obtained were -
----------- With Improved loopy code
Elapsed time is 41.227797 seconds.
----------- With Vectorized code
Elapsed time is 2.116782 seconds.
This suggested a whopping, close-to-20x speedup with the proposed solution!
For extremely huge datasizes
If you are dealing with huge datasizes, essentially you are not giving enough memory for bsxfun to work with, and bsxfun is known to use up a lot of memory for giving you a performance-efficient vectorized solution. So, for such huge-datasize cases, you can use the following loopy approach to replace normvals calculations that was listed in the earlier bsxfun based solution -
%// Get the norm values
nx = numel(xi);
normvals = zeros(nx,nx);
for ii = 1:nx
normvals(:,ii) = sqrt( (xi(:) - xi(ii)).^2 + (yi(:) - yi(ii)).^2 );
end
It seems to me that when you run through the cycle for x=w, y=h, you are calculating all the values you need at once. So you don't need recalculate them. Once you have this:
for i=1:w
for j=1:h
temp(i,j)=f([x,y],[i,j])
end
end
Then, e.g. X{1,1} is just temp(1,1), X{2,2} is just temp(1:2,1:2), and so on. If you can vectorise the calculation of f (norm here is just the Euclidean norm of that vector?) then it will get even simpler.
Suppose I have a simple program that simulates a coin toss, with a given probability specified by an expression. It might look something like this:
# This is the probability that you will get heads.
$expr = "rand < 0.5"
def get_result(expr)
eval(expr)
end
def toss_coin
if get_result($expr)
return "Head"
else
return "Tail"
end
end
Now, I also want to tell the user what the probability of getting Head is.
For the given expression
"rand < 0.5"
We can eye-ball it and say the probability is 50%, because rand returns a number between 0 and 1, and therefore the expression evaluates to true 50% of the time on average.
However, if I decided to provide a rigged coin toss where the expression used to determine the outcome is
"rand < 0.3"
Now, I have a 30% chance of getting Head.
Is it possible to write a method that will take an arbitrary expression (that evaluates to a boolean!) and return the probability that it resolves to true?
def get_expected_probability(expr)
# Returns the probability the `expr` returns true
# `rand < 0.5` would return 0.5
# `rand < 0.3` would return 0.3
# `true` would return 1
# `false` would return 0
end
My guess would be that it would be theoretically possible to write such a method, assuming you restricted yourself to rand and deterministic mathematical functions and had complete knowledge of the system's floating point implementation, etc.
It would be much more straightforward, however, to approximate the probability by executing the expression a large number of times and keeping track of the percentage of times it succeeded.
For simple comparisons to a uniform random number, yes, but in general, no. It depends on the distribution of the expression you're using to determine your boolean, and you could write arbitrarily complex expressions with bizarre distributions. However, it's pretty straightforward to estimate the probability empirically.
Create a Bernoulli (0/1) outcome based on the expression, yielding 1 when the expression is true and 0 when it is false. Generate a large number (n) of them. The long run average of the Bernoulli outcomes will converge to the probability of getting a true. If you call that p-hat and the true value is p, then p-hat should fall within the range p +/- (1.96 * sqrt(p*(1-p)/n)) 95% of the time. You can see from this that the larger the sample size n is, the more precise your estimate is.
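For example, with p = 0.5 and n = 1,000,000, the half-width works out to 1.96 * sqrt(0.5*0.5/1000000) ≈ 0.00098, so an empirical estimate should land within about ±0.001 of the true probability 95% of the time.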
An incredibly slow way of approximating this would be to evaluate the expression a very large number of times and estimate the probability it converges to. The Law of Large Numbers guarantees that as n approaches infinity, the estimate converges to that probability.
$expr = "rand < 0.5"
def get_result(expr)
eval(expr)
end
n = 1000000
a = Array.new(n)
n.times do |i|
a[i] = eval($expr)
end
puts a.count(true)/n.to_f
Returned 0.499899 for me.