Optimizing an algorithm calculating (sin(x)-x)*x^(-3) in MATLAB

My task is to write an optimal program that calculates matrix Y, given matrix X, where (element-wise):
y = (sin(x) - x) * x^(-3)
Here's the code I have written so far:
n = size(X, 1);
m = size(X, 2);
Y = zeros(n, m);
d = n*m;
for i = 1:d
x = X(i);
if abs(x)<0.1
Y(i) = -1/6+x.^2/120-x.^4/5040+x.^6/362880;
else
Y(i) = (sin(x)-x).*(x.^(-3));
end
end
The formula is numerically inaccurate around 0, so I have approximated it there using Taylor's theorem.
Unfortunately this program has an accuracy of 91% and an efficiency of only 24% (so it's 4 times slower than the optimal solution).
The tests use around 13 million samples, out of which around 6 million have absolute value less than 0.1. The range of samples is (-8π, 8π).
The target accuracy (100%) is 4*epsilon, where epsilon equals 2^(-52) (meaning the numbers calculated by the program should not differ from the "perfectly" calculated numbers by more than 4*epsilon).
An error of 100*epsilon corresponds to an accuracy of 86%.
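For reference, one way to express the error in multiples of epsilon (Yref here is a hypothetical reference result computed "perfectly", e.g. in extended precision):
relErr = abs(Y - Yref) ./ abs(Yref); % element-wise relative error
max(relErr(:)) / eps % worst-case error, measured in multiples of epsilon (eps = 2^(-52))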
Do you have any ideas on how to make it faster and more accurate? I'm looking both for mathematical tricks to further transform the given formula, and for general MATLAB tips that can accelerate programs.
EDIT:
Using Horner's method, I have managed to bring efficiency up to 81% (accuracy still 91%) with this program:
function Y = main(X)
Y = (sin(X)-X).*(X.^(-3));
i = abs(X) < 0.1;
Y(i) = horner(X(i));
function y = horner (x)
pow = x.*x;
y = -1/6+pow.*(1/120+pow.*(-1/5040+pow./362880));
Do you have any further ideas on how to improve it?

The program seems to work fine over a wide range of input:
x = linspace(-8*pi,8*pi,13e6); % 13 million samples in the desired range
y = (sin(x)-x)./x.^3;
plot(x,y)
Due to round-off errors, you may have problems calculating it for very small values of x:
x = 0
y = (sin(x)-x)./x.^3
y =
NaN
You already have the Taylor series expansion of the function around 0. As the Taylor expansion does not include a division by x, you can expect it to behave better in this region:
x = -1e-6:1e-9:1e-6;
y = (sin(x)-x)./x.^3;
y_taylor = -1/6 + x.^2/120 - x.^4/5040 + x.^6/362880;
plot(x,y,x,y_taylor); legend('y','taylor expansion','location','best')

You can replace your loop with vectorized code. This is usually more efficient than a loop, especially since the loop contains a conditional, which is bad for branch prediction:
Y = (sin(X)-X).*(X.^(-3));
i = abs(X) < 0.1;
Y(i) = -1/6+X(i).^2/120-X(i).^4/5040+X(i).^6/362880;
Rewriting the primary equation to avoid the explicit X.^(-3) power yields a 3x speedup for that computation:
Y = (sin(X)./X - 1) ./ (X.*X);
Speed comparison:
The following script compares timing for this method against OP's loop code. I use data with 7 million values uniformly spaced in (-8π, 8π), and another 6 million values uniformly spaced in (-0.1, 0.1).
OP's loop code takes 2.4412 s, and the vectorized solution takes 0.7224 s. Using OP's Horner method and the rewritten sin expression it takes 0.1437 s.
X = [linspace(-8*pi,8*pi,7e6), linspace(-0.1,0.1,6e6)];
timeit(@()method1(X))
timeit(@()method2(X))
function Y = method1(X)
n = size(X, 1);
m = size(X, 2);
Y = zeros(n, m);
d = n*m;
for i = 1:d
x = X(i);
if abs(x)<0.1
Y(i) = -1/6+x.^2/120-x.^4/5040+x.^6/362880;
else
Y(i) = (sin(x)-x).*(x.^(-3));
end
end
end
function Y = method2(X)
Y = (sin(X)-X).*(X.^(-3));
i = abs(X) < 0.1;
Y(i) = -1/6+X(i).^2/120-X(i).^4/5040+X(i).^6/362880;
end
function Y = method3(X)
Y = (sin(X)./X - 1) ./ (X.*X);
i = abs(X) < 0.1;
Y(i) = horner(X(i));
end
function y = horner (x)
pow = x.*x;
y = -1/6+pow.*(1/120+pow.*(-1/5040+pow./362880));
end

Related

Speeding up program in matlab

I have 2 functions:
ccexpan - calculates the coefficients of the interpolating polynomial of a function f with N nodes, in the basis of Chebyshev polynomials of the first kind.
csum - calculates the polynomial's values at the arguments t using the coefficients c from ccexpan (via the Clenshaw algorithm).
This is what I have written so far:
function c = ccexpan(f,N)
z = zeros (1,N+1);
s = zeros (1,N+1);
for i = 1:(N+1)
z(i) = pi*(i-1)/N;
end
t = f(cos(z));
for k = 1:(N+1)
s(k) = sum(t.*cos(z.*(k-1)));
s(k) = s(k)-(f(1)+f(-1)*cos(pi*(k-1)))/2;
end
c = s.*2/N;
and:
function y = csum(t,c)
M = length(t);
N = length(c);
y = t;
b = zeros(1,N+2);
for k = 1:M
for i = N:-1:1
b(i) = c(i)+2*t(k)*b(i+1)-b(i+2);
end
y(k)=(b(1)-b(3))/2;
end
Unfortunately these programs are very slow, and also slightly inaccurate. Please give me some tips on how to speed them up, and how to improve accuracy.
Where possible try to get away from looping structures. At first blush, I would trade out your first for loop of
for i = 1:(N+1)
z(i) = pi*(i-1)/N;
end
and replace with
i=1:(N+1)
z = pi*(i-1)/N
I did not check the rest of your code, but the above example will definitely speed it up. A second strategy is to combine loops when possible.
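For what it's worth, the intermediate index vector is not even needed; the same vector can be built directly:
z = pi*(0:N)/N; % identical to the loop above, in a single vectorized statement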
Martin,
Consider the following strategy.
% create hypothetical N and f
N = 3
f = @(x) 1./(1+15*x.*x)
% calculate z and t
i=1:(N+1)
z = pi*(i-1)/N
t = f(cos(z))
% make a column vector of k's
k = (1:(N+1))'
% do this: s(k) = sum(t.*cos(z.*(k-1)))
s1 = t.*cos(z.*(k-1)) % should be a matrix with one row for each row of k
% via implicit expansion
s2 = sum(s1,2) % row sum, i.e., one value for each row of k
% do this: s(k) = s(k)-(f(1)+f(-1)*cos(pi*(k-1)))/2
s3 = s2 - (f(1)+f(-1)*cos(pi*(k-1)))/2
% calculate c
c = s3 .* 2/N
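The same strategy applies to csum: the Clenshaw recurrence over i has to stay sequential, but the loop over the evaluation points t(k) can be removed by letting b carry one column per point. A rough, untested sketch that simply mirrors the original recurrence:
function y = csum_vec(t,c)
t = t(:).'; % row vector of evaluation points
N = length(c);
b = zeros(N+2, length(t)); % one column of b per evaluation point
for i = N:-1:1
b(i,:) = c(i) + 2*t.*b(i+1,:) - b(i+2,:); % same recurrence, all points at once
end
y = (b(1,:) - b(3,:))/2;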

Gradient descent algorithm giving incorrect answer in matlab

I'm taking a course in machine learning and trying to implement the gradient descent algorithm in MATLAB. The function computeCost works fine, as I have tested it separately. I use it to see the cost at every iteration, and it doesn't seem to be decreasing at all; it just fluctuates randomly. The value of alpha was given as 0.01, so I know it's not an issue with the learning rate being too big. The answers I get for theta are far off from the expected output. Where am I going wrong? Thanks in advance!
function theta = gradientDescent(X, y, theta, alpha, num_iters)
%GRADIENTDESCENT Performs gradient descent to learn theta
% Initialize some useful values
m = length(y); % number of training examples
temp1=0;
temp2=0;
for iter = 1:num_iters
for k = 1:m
temp1 = temp1 + (theta(1) + theta(2)*X(k, 2) - y(k));
temp2 = temp2 + ((theta(1) + theta(2)*X(k, 2) - y(k))*X(k, 2));
end
theta(1) = theta(1)-(1/m)*alpha*temp1;
theta(2) = theta(2)-(1/m)*alpha*temp2;
computeCost(X, y, theta)
end
end
Edit: here is computeCost as well
function J = computeCost(X, y, theta)
m = length(y); % number of training examples
J = 0;
temp = 0;
for index = 1:m
temp = temp + (theta(1) + theta(2)*X(index, 2)-y(index))^2;
end
J = temp/(2*m);
end
Try changing:
temp1=0;
temp2=0;
for iter = 1:num_iters
to
for iter = 1:num_iters
temp1=0;
temp2=0;
The gradient needs to be computed fresh for each iteration (or you are effectively building in a momentum term).
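For what it's worth, once the resets are inside the outer loop, the inner loop over k can also be vectorized. A minimal sketch, assuming (as in the usual course setup) that X already contains a leading column of ones and theta is a 2x1 column vector:
function theta = gradientDescentVec(X, y, theta, alpha, num_iters)
m = length(y); % number of training examples
for iter = 1:num_iters
grad = (X' * (X*theta - y)) / m; % gradient recomputed fresh every iteration
theta = theta - alpha * grad; % simultaneous update of all components of theta
end
end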

Implementation of EM algorithm for Gaussian Mixture Models

Using the EM algorithm, I want to train a Gaussian Mixture model with four components on a given dataset. The set is three dimensional and contains 300 samples.
The problem is that after about 6 rounds of the EM algorithm, the covariance matrices sigma become close to singular according to MATLAB (rank(sigma) = 2 instead of 3). This in turn leads to undesired results, like complex values when evaluating the Gaussian distribution gm(k,i).
Furthermore, I used the log of the Gaussian to account for underflow troubles - see the E-step. I am not sure if this is correct, and whether I have to take the exp of the responsibilities p(w_k | x^(i), theta) somewhere else?
Can you tell me if my implementation of the EM algorithm is correct so far?
And how can I account for the problem of the close-to-singular covariance sigma?
Here is my implementation of the EM algorithm:
First I initialized the means and the covariance of the components using kmeans:
load('data1.mat');
X = Data'; % 300x3 data set
D = size(X,2); % dimension
N = size(X,1); % number of samples
K = 4; % number of Gaussian Mixture components
% Initialization
p = [0.2, 0.3, 0.2, 0.3]; % arbitrary pi
[idx,mu] = kmeans(X,K); % initial means of the components
% compute the covariance of the components
sigma = zeros(D,D,K);
for k = 1:K
sigma(:,:,k) = cov(X(idx==k,:));
end
For the E-step I am using the following formula to calculate the responsibilities.
w_k are the k gaussian components.
x^(i) is a single datapoint (sample)
theta stands for the parameters of the gaussian mixture model: mu, Sigma, pi.
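(Written out in the notation above, the responsibility being computed is p(w_k | x^(i), theta) = pi_k * N(x^(i) | mu_k, Sigma_k) / SUM_j pi_j * N(x^(i) | mu_j, Sigma_j), i.e. the weighted Gaussian density of component k divided by the sum of the weighted densities over all components.)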
Here is the corresponding code:
% variables for convergence
converged = 0;
prevLoglikelihood = Inf;
prevMu = mu;
prevSigma = sigma;
prevPi = p;
round = 0;
while (converged ~= 1)
round = round +1
gm = zeros(K,N); % gaussian component in the numerator
sumGM = zeros(N,1); % denominator of responsibilities
% E-step: Evaluate the responsibilities using the current parameters
% compute the numerator and denominator of the responsibilities
for k = 1:K
for i = 1:N
Xmu = X-mu;
% I am using log to prevent underflow of the gaussian distribution (exp("small value"))
logPdf = log(1/sqrt(det(sigma(:,:,k))*(2*pi)^D)) + (-0.5*Xmu*(sigma(:,:,k)\Xmu'));
gm(k,i) = log(p(k)) * logPdf;
sumGM(i) = sumGM(i) + gm(k,i);
end
end
% calculate responsibilities
res = zeros(K,N); % responsibilities
Nk = zeros(4,1);
for k = 1:K
for i = 1:N
% I tried to use the exp(gm(k,i)/sumGM(i)) to compute res but this leads to sum(pi) > 1.
res(k,i) = gm(k,i)/sumGM(i);
end
Nk(k) = sum(res(k,:));
end
Nk(k) is computed using the formula given in the M-step and is used in the M-step to calculate the new probabilities p(k).
M-step
% M-step: Re-estimate the parameters using the current responsibilities
for k = 1:K
for i = 1:N
mu(k,:) = mu(k,:) + res(k,i).*X(k,:);
sigma(:,:,k) = sigma(:,:,k) + res(k,i).*(X(k,:)-mu(k,:))*(X(k,:)-mu(k,:))';
end
mu(k,:) = mu(k,:)./Nk(k);
sigma(:,:,k) = sigma(:,:,k)./Nk(k);
p(k) = Nk(k)/N;
end
Now in order to check for convergence the log-likelihood is computed using this formula:
% Evaluate the log-likelihood and check for convergence of either
% the parameters or the log-likelihood. If not converged, go to E-step.
loglikelihood = 0;
for i = 1:N
loglikelihood = loglikelihood + log(sum(gm(:,i)));
end
% Check for convergence of parameters
errorLoglikelihood = abs(loglikelihood-prevLoglikelihood);
if (errorLoglikelihood <= eps)
converged = 1;
end
errorMu = abs(mu(:)-prevMu(:));
errorSigma = abs(sigma(:)-prevSigma(:));
errorPi = abs(p(:)-prevPi(:));
if (all(errorMu <= eps) && all(errorSigma <= eps) && all(errorPi <= eps))
converged = 1;
end
prevLoglikelihood = loglikelihood;
prevMu = mu;
prevSigma = sigma;
prevPi = p;
end % while
Is there something wrong with my Matlab implementation of EM algorithm for Gaussian Mixture Models?
Previous troubles:
The problem is that I cannot check for convergence using the log-likelihood, because it is -Inf. This results from values that are rounded down to zero while evaluating the Gaussian in the responsibility formula (see E-step).
Can you tell me if my implementation of the EM algorithm is correct so far?
And how to account for the problem with the rounded zero values?
Here is my implementation of the EM algorithm:
First I initialized the means and the covariance of the components using kmeans:
load('data1.mat');
X = Data'; % 300x3 data set
D = size(X,2); % dimension
N = size(X,1); % number of samples
K = 4; % number of Gaussian Mixture components
% Initialization
p = [0.2, 0.3, 0.2, 0.3]; % arbitrary pi
[idx,mu] = kmeans(X,K); % initial means of the components
% compute the covariance of the components
sigma = zeros(D,D,K);
for k = 1:K
sigma(:,:,k) = cov(X(idx==k,:));
end
For the E-step I am using the following formula to calculate the responsibilities
Here is the corresponding code:
% variables for convergence
converged = 0;
prevLoglikelihood = Inf;
prevMu = mu;
prevSigma = sigma;
prevPi = p;
round = 0;
while (converged ~= 1)
round = round +1
gm = zeros(K,N); % gaussian component in the numerator -
% some values evaluate to zero
sumGM = zeros(N,1); % denominator of responsibilities
% E-step: Evaluate the responsibilities using the current parameters
% compute the numerator and denominator of the responsibilities
for k = 1:K
for i = 1:N
% HERE some values underflow to zero, e.g. exp(-746.6228) evaluates to 0
gm(k,i) = p(k)/sqrt(det(sigma(:,:,k))*(2*pi)^D)*exp(-0.5*(X(i,:)-mu(k,:))*inv(sigma(:,:,k))*(X(i,:)-mu(k,:))');
sumGM(i) = sumGM(i) + gm(k,i);
end
end
% calculate responsibilities
res = zeros(K,N); % responsibilities
Nk = zeros(4,1);
for k = 1:K
for i = 1:N
res(k,i) = gm(k,i)/sumGM(i);
end
Nk(k) = sum(res(k,:));
end
Nk(k) is computed using the formula given in the M-step.
M-step
% M-step: Re-estimate the parameters using the current responsibilities
mu = zeros(K,3);
for k = 1:K
for i = 1:N
mu(k,:) = mu(k,:) + res(k,i).*X(k,:);
sigma(:,:,k) = sigma(:,:,k) + res(k,i).*(X(k,:)-mu(k,:))*(X(k,:)-mu(k,:))';
end
mu(k,:) = mu(k,:)./Nk(k);
sigma(:,:,k) = sigma(:,:,k)./Nk(k);
p(k) = Nk(k)/N;
end
Now in order to check for convergence the log-likelihood is computed using this formula:
% Evaluate the log-likelihood and check for convergence of either
% the parameters or the log-likelihood. If not converged, go to E-step.
loglikelihood = 0;
for i = 1:N
loglikelihood = loglikelihood + log(sum(gm(:,i)));
end
% Check for convergence of parameters
errorLoglikelihood = abs(loglikelihood-prevLoglikelihood);
if (errorLoglikelihood <= eps)
converged = 1;
end
errorMu = abs(mu(:)-prevMu(:));
errorSigma = abs(sigma(:)-prevSigma(:));
errorPi = abs(p(:)-prevPi(:));
if (all(errorMu <= eps) && all(errorSigma <= eps) && all(errorPi <= eps))
converged = 1;
end
prevLoglikelihood = loglikelihood;
prevMu = mu;
prevSigma = sigma;
prevPi = p;
end % while
After the first round the loglikelihood is around 700.
In the second round it is -Inf because some gm(k,i) values in the E-step are zero. Therefore the log is obviously negative infinity.
The zero values also lead to sumGM being equal to zero, which in turn leads to all-NaN entries inside the mu and sigma matrices.
How can I solve this problem?
Can you tell me if there is something wrong with my implementation?
Could it be solved by increasing Matlab's precision somehow?
EDIT:
I added a scaling for the exp() term in gm(k,i).
Unfortunately this doesn't help much. After some more rounds I still get the underflow problem.
scale = zeros(N,D);
for i = 1:N
max = 0;
for k = 1:K
Xmu = X(i,:)-mu(k,:);
if (norm(scale(i,:) - Xmu) > max)
max = norm(scale(i,:) - Xmu);
scale(i,:) = Xmu;
end
end
end
for k = 1:K
for i = 1:N
Xmu = X(i,:)-mu(k,:);
% scale gm to prevent underflow
Xmu = Xmu - scale(i,:);
gm(k,i) = p(k)/sqrt(det(sigma(:,:,k))*(2*pi)^D)*exp(-0.5*Xmu*inv(sigma(:,:,k))*Xmu');
sumGM(i) = sumGM(i) + gm(k,i);
end
end
Furthermore, I noticed that kmeans initializes the means to values completely different from those of the following rounds, where the means are computed in the M-step.
kmeans:
mu = 13.500000000000000 0.026602138870044 0.062415945993735
88.500000000000000 -0.009869960132085 -0.075177888210981
39.000000000000000 -0.042569305020309 0.043402772876513
64.000000000000000 -0.024519281362918 -0.012586980924762
after M-step:
round = 2
mu = 1.000000000000000 0.077230046948357 0.024498886414254
2.000000000000000 0.074260118474053 0.026484346404660
3.000000000000002 0.070944016105476 0.029043085983168
4.000000000000000 0.067613431480832 0.031641849205021
In the next rounds mu doesn't change at all. It stays the same as in round 2.
I guess this is caused by the underflow in gm(k,i)?
Either my implementation of the scaling is incorrect, or the whole implementation of the algorithm is wrong somewhere :(
EDIT 2
After four rounds I got NaN values and looked into gm in more detail. Looking at only one sample (and without the 0.5 factor), gm becomes zero in all components. In MATLAB terms, gm(:,1) = [0 0 0 0]. This in turn leads to sumGM equal to zero -> NaN, because I divided by zero. More details are given below:
round = 1
mu = 62.0000 -0.0298 -0.0078
37.0000 -0.0396 0.0481
87.5000 -0.0083 -0.0728
12.5000 0.0303 0.0614
gm(:,1) = [11.7488, 0.0000, 0.0000, 0.0000]
round = 2
mu = 1.0000 0.0772 0.0245
2.0000 0.0743 0.0265
3.0000 0.0709 0.0290
4.0000 0.0676 0.0316
gm(:,1) = [0.0000, 0.0000, 0.0000, 0.3128]
round = 3
mu = 1.0000 0.0772 0.0245
2.0000 0.0743 0.0265
3.0000 0.0709 0.0290
4.0000 0.0676 0.0316
gm(:,1) = [0, 0, 0.0000, 0.2867]
round = 4
mu = 1.0000 0.0772 0.0245
NaN NaN NaN
3.0000 0.0709 0.0290
4.0000 0.0676 0.0316
gm(:,1) = 1.0e-105 * [0, NaN, 0, 0.5375]
First of all, the means don't seem to change and are completely different from the kmeans initialization.
And every sample (not just the first one shown here) corresponds to only one Gaussian component according to the output of gm(:,1). Shouldn't each sample be "partially distributed" among every Gaussian component?
EDIT3:
So I guess the problem with mu not changing was the first line in the M-step: mu = zeros(K,3);.
To account for the underflow problem, I am currently trying to use the log of the Gaussian:
function logPdf = logmvnpdf(X, mu, sigma, D)
Xmu = X-mu;
logPdf = log(1/sqrt(det(sigma)*(2*pi)^D)) + (-0.5*Xmu*inv(sigma)*Xmu');
end
The new problem is the covariance matrix sigma. Matlab claims:
Warning: Matrix is close to singular or badly scaled. Results may be inaccurate.
After 6 rounds I get imaginary values for gm (gaussian distribution).
The updated E-Step looks like this now:
gm = zeros(K,N); % gaussian component in the numerator
sumGM = zeros(N,1); % denominator of responsibilities
for k = 1:K
for i = 1:N
%gm(k,i) = p(k)/sqrt(det(sigma(:,:,k))*(2*pi)^D)*exp(-0.5*Xmu*inv(sigma(:,:,k))*Xmu');
%gm(k,i) = p(k)*mvnpdf(X(i,:),mu(k,:),sigma(:,:,k));
gm(k,i) = log(p(k)) + logmvnpdf(X(i,:), mu(k,:), sigma(:,:,k), D);
sumGM(i) = sumGM(i) + gm(k,i);
end
end
It looks like you should be able to use a scale factor scale(i) to bring gm(k, i) into a representable range, because if you multiply gm(k, i) by scale(i) this will end up multiplying sumGM(i) as well, and be cancelled away when you work out res(k, i) = gm(k, i) / sumGM(i).
In theory I would make scale(i) = 1 / max_k(exp(e_k(i))), where e_k(i) is the exponent -0.5*(X(i,:)-mu(k,:))*inv(sigma(:,:,k))*(X(i,:)-mu(k,:))'. In practice you calculate it without doing the exponentiation, so you end up dealing with its log, max_k(e_k(i)). This gives you a common term you can subtract from each e_k(i) before calling exp(), and it keeps at least the maximum within a representable range. Anything that still underflows to zero after this correction you don't care about anyway, because it is vanishingly small compared to the other contributions.
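A minimal sketch of that idea (the standard log-sum-exp trick), assuming the log-domain values logGm(k,i) = log(p(k)) + logmvnpdf(X(i,:), mu(k,:), sigma(:,:,k), D) have already been stored (the names logGm and res here are illustrative):
res = zeros(K,N); % responsibilities
for i = 1:N
M = max(logGm(:,i)); % largest log term for this sample, i.e. -log(scale(i))
res(:,i) = exp(logGm(:,i) - M); % shift so the largest term becomes exp(0) = 1
res(:,i) = res(:,i) / sum(res(:,i)); % the common shift cancels in this ratio
end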

Efficiently Calculate Frequency Averaged Periodogram Using GPU

In MATLAB I am looking for the most efficient way to calculate a frequency-averaged periodogram on a GPU.
I understand that the most important thing is to minimise for loops and to use the built-in GPU functions. However, my code still feels relatively unoptimised and I was wondering what changes I can make to gain a better speed-up.
r = 5; % Dimension
n = 100; % Time points
m = 20; % Bandwidth of smoothing
% Generate some random rxn data
X = rand(r, n);
% Generate normalised weights according to a cos window
w = cos(pi * (-m/2:m/2)/m);
w = w/sum(w);
% Generate non-smoothed Periodogram
FT = (n)^(-0.5)*(ctranspose(fft(ctranspose(X))));
Pdgm = zeros(r, r, n/2 + 1);
for j = 1:n/2 + 1
Pdgm(:,:,j) = FT(:,j)*FT(:,j)';
end
% Finally smooth with our weights
SmPdgm = zeros(r, r, n/2 + 1);
% Take advantage of the GPU filter function
% Create new Periodogram WrapPdgm with m/2 values wrapped around in front and
% behind it (it seems like there is redundancy here)
WrapPdgm = zeros(r,r,n/2 + 1 + m);
WrapPdgm(:,:,m/2+1:n/2+m/2+1) = Pdgm;
WrapPdgm(:,:,1:m/2) = flip(Pdgm(:,:,2:m/2+1),3);
WrapPdgm(:,:,n/2+m/2+2:end) = flip(Pdgm(:,:,n/2-m/2+1:end-1),3);
% Perform filtering
for i = 1:r
for j = 1:r
temp = filter(w, [1], WrapPdgm(i,j,:));
SmPdgm(i,j,:) = temp(:,:,m+1:end);
end
end
In particular, I couldn't see a way to optimise out the for loop when calculating the initial Pdgm from the Fourier-transformed data, and the trick I play with WrapPdgm in order to take advantage of filter() on the GPU seems like it would be unnecessary if there were a smoothing function instead.
Solution Code
This seems to be pretty efficient, as the benchmark runtimes in the next section suggest:
%// Select the portion of FT to be processed and
%// send copy to GPU for calculating everything
gFT = gpuArray(FT(:,1:n/2 + 1));
%// Perform non-smoothed Periodogram, thus removing the first loop
Pdgm1 = bsxfun(@times,permute(gFT,[1 3 2]),permute(conj(gFT),[3 1 2]));
%// Generate WrapPdgm right on GPU
WrapPdgm1 = zeros(r,r,n/2 + 1 + m,'gpuArray');
WrapPdgm1(:,:,m/2+1:n/2+m/2+1) = Pdgm1;
WrapPdgm1(:,:,1:m/2) = Pdgm1(:,:,m/2+1:-1:2);
WrapPdgm1(:,:,n/2+m/2+2:end) = Pdgm1(:,:,end-1:-1:n/2-m/2+1);
%// Perform filtering on GPU and get the final output, SmPdgm1
filt_data = filter(w,1,reshape(WrapPdgm1,r*r,[]),[],2);
SmPdgm1 = gather(reshape(filt_data(:,m+1:end),r,r,[]));
Benchmarking
Benchmarking Code
%// Input parameters
r = 50; % Dimension
n = 1000; % Time points
m = 200; % Bandwidth of smoothing
% Generate some random rxn data
X = rand(r, n);
% Generate normalised weights according to a cos window
w = cos(pi * (-m/2:m/2)/m);
w = w/sum(w);
% Generate non-smoothed Periodogram
FT = (n)^(-0.5)*(ctranspose(fft(ctranspose(X))));
tic, %// ... Code from original approach, toc
tic %// ... Code from proposed approach, toc
Runtime results thus obtained on a GTX 750 Ti GPU against an i7-4790K CPU:
------------------------------ With Original Approach on CPU
Elapsed time is 0.279816 seconds.
------------------------------ With Proposed Approach on GPU
Elapsed time is 0.169969 seconds.
To get rid of the first loop you can do the following:
Pdgm_cell = cellfun(@(x) x * x', mat2cell(FT(:, 1 : 51), [5], ones(51, 1)), 'UniformOutput', false);
Pdgm = reshape(cell2mat(Pdgm_cell),5,5,[]);
Then in your filter you can do the following:
temp = filter(w, 1, WrapPdgm, [], 3);
SmPdgm = temp(:, :, m + 1 : end);
The 3 lets the filter know to operate along the 3rd dimension of your data.
You can use pagefun on the GPU for the first loop. (Note that the implementation of cellfun is basically a hidden loop, whereas pagefun runs natively on the GPU using a batched GEMM operation). Here's how:
n = 16;
r = 8;
X = gpuArray.rand(r, n);
R = gpuArray.zeros(r, r, n/2 + 1);
for jj = 1:(n/2+1)
R(:,:,jj) = X(:,jj) * X(:,jj)';
end
X2 = X(:,1:(n/2+1));
R2 = pagefun(@mtimes, reshape(X2, r, 1, []), reshape(X2, 1, r, []));
R - R2

Improving performance of interpolation (Barycentric formula)

I have been given an assignment in which I am supposed to write an algorithm which performs polynomial interpolation using the barycentric formula. The formula states that:
p(x) = (SIGMA_(j=0 to n) w(j)*f(j)/(x - x(j)))/(SIGMA_(j=0 to n) w(j)/(x - x(j)))
I have written an algorithm which works just fine, and I get the polynomial output I desire. However, it requires some quite long loops, and for a large number of grid points lots of nasty loop operations have to be done. Thus, I would appreciate it greatly if anyone has any hints as to how I may improve this, so that I can avoid all these loops.
In the algorithm, x and f stand for the given points we are supposed to interpolate. w stands for the barycentric weights, which have been calculated before running the algorithm. And grid is the linspace over which the interpolation should take place:
function p = barycentric_formula(x,f,w,grid)
%Assert x-vectors and f-vectors have same length.
if length(x) ~= length(f)
sprintf('Not equal amounts of x- and y-values. Function is terminated.')
return;
end
n = length(x);
m = length(grid);
p = zeros(1,m);
% Loops for finding polynomial values at grid points. All values are
% calculated by the barycentric formula.
for i = 1:m
var = 0;
sum1 = 0;
sum2 = 0;
for j = 1:n
if grid(i) == x(j)
p(i) = f(j);
var = 1;
else
sum1 = sum1 + (w(j)*f(j))/(grid(i) - x(j));
sum2 = sum2 + (w(j)/(grid(i) - x(j)));
end
end
if var == 0
p(i) = sum1/sum2;
end
end
This is a classical case for matlab 'vectorization'. I would say - just remove the loops. It is almost that simple. First, have a look at this code:
function p = bf2(x, f, w, grid)
m = length(grid);
p = zeros(1,m);
for i = 1:m
var = grid(i)==x;
if any(var)
p(i) = f(var);
else
sum1 = sum((w.*f)./(grid(i) - x));
sum2 = sum(w./(grid(i) - x));
p(i) = sum1/sum2;
end
end
end
I have removed the inner loop over j. All I did here was in fact remove the (j) indexing and change the arithmetic operators from / to ./ and from * to .* - the same operations, but with a dot in front to signify that they are performed on an element-by-element basis. These are called array operators, in contrast to the ordinary matrix operators. Also note that treating the special case where the grid points fall onto x is very similar to what you had in the original implementation, only using a logical vector var such that x(var)==grid(i).
Now, you can also remove the outermost loop. This is a bit more tricky and there are two major approaches how you can do that in MATLAB. I will do it the simpler way, which can be less efficient, but more clear to read - using repmat:
function p = bf3(x, f, w, grid)
% Find grid points that coincide with x.
% The below compares all grid values with all x values
% and returns a matrix of 0/1. 1 is in the (row,col)
% for which grid(row)==x(col)
var = bsxfun(@eq, grid', x);
% find the logical indexes of those x entries
varx = sum(var, 1)~=0;
% and of those grid entries
varp = sum(var, 2)~=0;
% Outer-most loop removal - use repmat to
% replicate the vectors into matrices.
% Thus, instead of having a loop over j
% you have matrices of values that would be
% referenced in the loop
ww = repmat(w, numel(grid), 1);
ff = repmat(f, numel(grid), 1);
xx = repmat(x, numel(grid), 1);
gg = repmat(grid', 1, numel(x));
% perform the calculations element-wise on the matrices
sum1 = sum((ww.*ff)./(gg - xx),2);
sum2 = sum(ww./(gg - xx),2);
p = sum1./sum2;
% fix the case where grid==x and return
p(varp) = f(varx);
end
The fully vectorized version can be implemented with bsxfun rather than repmat. This can potentially be a bit faster, since the matrices are not explicitly formed. However, the speed difference may not be large for small system sizes.
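A rough sketch of that bsxfun variant (same arithmetic as the core of bf3, assuming row-vector inputs as before; the grid==x special case would be fixed afterwards exactly as in bf3):
dd = bsxfun(@minus, grid', x); % numel(grid)-by-numel(x) matrix of differences
sum1 = sum(bsxfun(@rdivide, w.*f, dd), 2); % numerator sums, one per grid point
sum2 = sum(bsxfun(@rdivide, w, dd), 2); % denominator sums
p = (sum1./sum2).'; % row vector, as in the original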
Also, the first solution with one loop is not too bad performance-wise. I suggest you test both and see which is better. Maybe it is not worth it to fully vectorize? The first code looks a bit more readable.