MATLAB Optimisation of Weighted Gram-Schmidt Orthogonalisation - performance

I have a function in MATLAB which performs the Gram-Schmidt Orthogonalisation with a very important weighting applied to the inner-products (I don't think MATLAB's built in function supports this).
This function works well as far as I can tell, however, it is too slow on large matrices.
What would be the best way to improve this?
I have tried converting to a MEX file but I lose parallelisation with the compiler I'm using and so it is then slower.
I was thinking of running it on a GPU as the element-wise multiplications are highly parallelised. (But I'd prefer the implementation to be easily portable)
Can anyone vectorise this code or make it faster? I am not sure how to do it elegantly ...
I know the stackoverflow minds here are amazing, consider this a challenge :)
Function
function [Q, R] = Gram_Schmidt(A, w)
[m, n] = size(A);
Q = complex(zeros(m, n));
R = complex(zeros(n, n));
v = zeros(n, 1);
for j = 1:n
v = A(:,j);
for i = 1:j-1
R(i,j) = sum( v .* conj( Q(:,i) ) .* w ) / ...
sum( Q(:,i) .* conj( Q(:,i) ) .* w );
v = v - R(i,j) * Q(:,i);
end
R(j,j) = norm(v);
Q(:,j) = v / R(j,j);
end
end
where A is an m x n matrix of complex numbers and w is an m x 1 vector of real numbers.
Bottle-neck
This is the expression for R(i,j), which is the slowest part of the function (I'm not 100% sure the notation is correct):
R(i,j) = < A_j, Q_i >_w / < Q_i, Q_i >_w
       = sum_k( A(k,j) * conj(Q(k,i)) * w(k) ) / sum_k( Q(k,i) * conj(Q(k,i)) * w(k) )
where w is a non-negative weight function.
The weighted inner-product is mentioned on several Wikipedia pages, this is one on the weight function and this is one on orthogonal functions.
Reproduction
You can produce results using the following script:
A = complex( rand(360000,100), rand(360000,100));
w = rand(360000, 1);
[Q, R] = Gram_Schmidt(A, w);
where A and w are the inputs.
Speed and Computation
If you run the above script, profiling shows that virtually all of the time is spent computing R(i,j).
Testing Result
You can test the results by comparing a function with the one above using the following script:
A = complex( rand( 100, 10), rand( 100, 10));
w = rand( 100, 1);
[Q , R ] = Gram_Schmidt( A, w);
[Q2, R2] = Gram_Schmidt2( A, w);
zeros1 = norm( Q - Q2 );
zeros2 = norm( R - R2 );
where Gram_Schmidt is the function described earlier and Gram_Schmidt2 is an alternative function. The results zeros1 and zeros2 should then be very close to zero.
Note:
I tried speeding up the calculation of R(i,j) with the following but to no avail ...
R(i,j) = ( w' * ( v .* conj( Q(:,i) ) ) ) / ...
( w' * ( Q(:,i) .* conj( Q(:,i) ) ) );

1)
My first attempt at vectorization:
function [Q, R] = Gram_Schmidt1(A, w)
[m, n] = size(A);
Q = complex(zeros(m, n));
R = complex(zeros(n, n));
for j = 1:n
v = A(:,j);
QQ = Q(:,1:j-1);
QQ = bsxfun(@rdivide, bsxfun(@times, w, conj(QQ)), w.' * abs(QQ).^2);
for i = 1:j-1
R(i,j) = (v.' * QQ(:,i));
v = v - R(i,j) * Q(:,i);
end
R(j,j) = norm(v);
Q(:,j) = v / R(j,j);
end
end
Unfortunately, it turned out to be slower than the original function.
2)
Then I realized that the columns of this intermediate matrix QQ are built incrementally, and that previous ones are not modified. So here is my second attempt:
function [Q, R] = Gram_Schmidt2(A, w)
[m, n] = size(A);
Q = complex(zeros(m, n));
R = complex(zeros(n, n));
QQ = complex(zeros(m, n-1));
for j = 1:n
if j>1
qj = Q(:,j-1);
QQ(:,j-1) = (conj(qj) .* w) ./ (w.' * (qj.*conj(qj)));
end
v = A(:,j);
for i = 1:j-1
R(i,j) = (v.' * QQ(:,i));
v = v - R(i,j) * Q(:,i);
end
R(j,j) = norm(v);
Q(:,j) = v / R(j,j);
end
end
Technically no major vectorization was done; I've only precomputed intermediate results, and moved the computation outside the inner loop.
Based on a quick benchmark, this new version is definitely faster:
% some random data
>> M = 10000; N = 100;
>> A = complex(rand(M,N), rand(M,N));
>> w = rand(M,1);
% time
>> timeit(@() Gram_Schmidt(A,w), 2) % original version
ans =
1.2444
>> timeit(@() Gram_Schmidt1(A,w), 2) % first attempt (vectorized)
ans =
2.0990
>> timeit(@() Gram_Schmidt2(A,w), 2) % final version
ans =
0.4698
% check results
>> [Q,R] = Gram_Schmidt(A,w);
>> [Q2,R2] = Gram_Schmidt2(A,w);
>> norm(Q-Q2)
ans =
4.2796e-14
>> norm(R-R2)
ans =
1.7782e-12
EDIT:
Following the comments, we can rewrite the second solution to get rid of the if-statement, by moving that part to the end of the outer loop (i.e., immediately after computing the new column Q(:,j), we compute and store the corresponding QQ(:,j)).
The function is identical in output, and timing is not that different either; the code is just a bit shorter!
function [Q, R] = Gram_Schmidt3(A, w)
[m, n] = size(A);
Q = zeros(m, n, 'like',A);
R = zeros(n, n, 'like',A);
QQ = zeros(m, n, 'like',A);
for j = 1:n
v = A(:,j);
for i = 1:j-1
R(i,j) = (v.' * QQ(:,i));
v = v - R(i,j) * Q(:,i);
end
R(j,j) = norm(v);
Q(:,j) = v / R(j,j);
QQ(:,j) = (conj(Q(:,j)) .* w) ./ (w.' * (Q(:,j).*conj(Q(:,j))));
end
end
Note that I used the zeros(..., 'like',A) syntax (new in recent MATLAB versions). This allows us to run the function unmodified on the GPU (assuming you have the Parallel Computing Toolbox):
% CPU
[Q3,R3] = Gram_Schmidt3(A, w);
vs.
% GPU
AA = gpuArray(A);
[Q3,R3] = Gram_Schmidt3(AA, w);
Unfortunately in my case, it wasn't any faster. In fact it was many times slower to run on the GPU than on the CPU, but it was worth a shot :)

There is a long discussion here, but, to jump to the answer: you have weighted the numerator and denominator of the R calculation by a vector w. The weighting occurs in the inner loop, and consists of a triple dot product, A dot Q dot w in the numerator and Q dot Q dot w in the denominator. If you make one change, I think the code will run significantly faster: write num = (A dot sqrt(w)) dot (Q dot sqrt(w)) and write den = (Q dot sqrt(w)) dot (Q dot sqrt(w)). That moves the (A dot sqrt(w)) and (Q dot sqrt(w)) product calculations out of the inner loop.
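To see why that factoring is equivalent to the weighted inner product in the question, here is a small numeric check (an illustrative MATLAB sketch of my own; v and q merely stand in for A(:,j) and Q(:,i)):
m  = 1000;
v  = complex(rand(m,1), rand(m,1));   % stands in for A(:,j)
q  = complex(rand(m,1), rand(m,1));   % stands in for Q(:,i)
w  = rand(m,1);
sw = sqrt(w);
lhs = sum(v .* conj(q) .* w) / sum(q .* conj(q) .* w);   % original triple product
rhs = ((q.*sw)' * (v.*sw)) / ((q.*sw)' * (q.*sw));       % pre-scaled vectors
disp(abs(lhs - rhs))                                     % differs only by rounding error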
I would like to give a description of the formulation of Gram-Schmidt Orthogonalization that, hopefully, in addition to giving an alternative computational solution, gives further insight into the advantage of GSO.
The "goals" of GSO are two fold. First, to enable the solution of an equation like Ax=y, where A has far more rows than columns. This situation occurs frequently when measuring data, in that it is easy to measure more data than the number of states. The approach to the first goal is to rewrite A as QR such that the columns of Q are orthogonal and normalized, and R is a triangular matrix. The algorithm you provided, I believe, achieves the first goal. The Q represents the basis space of the A matrix, and R represents the amplitude of each basis space required to generate each column of A.
The second goal of GSO is to rank the basis vectors in order of significance. This is the step that you have not done. And while including this step may increase the solution time, the results will identify which elements of x are important, according to the data contained in the measurements represented by A.
But, I think, with this implementation, the solution is faster than the approach you presented.
A = QR, where the columns Q_j are orthonormal and R is upper triangular (R_ij = 0 for i > j). The Q_j are the orthogonal basis vectors for A, and R_ij is the participation of each Q_j in creating a column of A. So,
A_j1 = Q_j1 * R_1,1
A_j2 = Q_j1 * R_1,2 + Q_j2 * R_2,2
A_j3 = Q_j1 * R_1,3 + Q_j2 * R_2,3 + Q_j3 * R_3,3
By inspection, you can write
A_j1 = ( A_j1 / | A_j1 | ) * | A_j1 | = Q_j1 * R_1,1
Then you project Q_j1 onto every other column of A to get the R_1,j elements
R_1,2 = Q_j1 dot Aj2
R_1,3 = Q_j1 dot Aj3
...
R_1,j(j>1) = A_j dot Q_j1
Then you subtract the projection onto Q_j1 from the columns of A (this would set the first column to zero, so you can skip the first column):
for j = 2,n
A_j = A_j - R_1,j * Q_j1
end
Now one column of A has been removed, the first orthonormal basis vector Q_j1 has been determined, the contribution of the first basis vector to each column, R_1,j, has been determined, and the contribution of the first basis vector has been subtracted from each column. Repeat this process for the remaining columns of A to obtain the remaining columns of Q and rows of R.
for i = 1,n
R_ii = |A_i| A_i is the ith column of A, |A_i| is magnitude of A_i
Q_i = A_i / R_ii Q_i is the ith column of Q
for j = i+1, n
R_ij = A_j dot Q_i
A_j = A_j - R_ij * Q_i
end
end
You are trying to weight the rows of A with w. Here is one approach. I would normalize w and incorporate the effect into R. You "removed" the effects of w by multiplying and dividing by w. An alternative to "removing" the effect is to normalize the amplitude of w to one.
w = w / | w |
for i = 1,n
R_ii = |A_i inner product w| # A_i inner product w = A_i .* w
Q_i = A_i / R_ii
for j = i+1, n
R_ij = (A_j inner product w) dot Q_i # A dot B = A' * B
A_j = A_j - R_ij * Q_i
end
end
Another approach to implementing w is to normalize w and then premultiply every column of A by w. That cleanly weights the rows of A, and reduces the number of multiplications.
Using the following may help in speeding up your code
A inner product B = A .* B
A dot w = A' w
(A B)' = B'A'
A' conj(A) = |A|^2
The above can be vectorized easily in MATLAB, pretty much as written.
But you are missing the second portion, the ranking of A, which tells you which states (elements of x in A x = y) are significantly represented in the data.
The ranking procedure is easy to describe, but I'll let you work out the programming details. The above procedure essentially assumes the columns of A are in order of significance: the first column is subtracted off all the remaining columns, then the 2nd column is subtracted off the remaining columns, etc. The first row of R represents the contribution of the first column of Q to each column of A. If you sum the absolute values of the first row of R, you get a measurement of the contribution of the first column of Q to the matrix A. So, you just evaluate each column of A as the first (or next) column of Q, and determine the ranking score of the contribution of that Q column to the remaining columns of A. Then select the A column that has the highest rank as the next Q column. Coding this essentially comes down to pre-estimating the next row of R, for every remaining column of A, in order to determine which ranked R magnitude has the largest amplitude. Having an index vector that represents the original column order of A will be beneficial. By ranking the basis vectors, you end up with the "principal" basis vectors that represent A, which is typically much smaller in number than the number of columns in A.
Also, if you rank the columns, it is not necessary to calculate every column of R. When you know which columns of A don't contain any useful information, there's no real benefit to keeping those columns.
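Since the programming details of the ranking step are left to the reader, here is one hypothetical sketch of how it could be bolted onto the weighted GSO from the question (my own illustration, not the answerer's code). The pivot rule below, which picks the remaining residual column with the largest weighted norm, is a simpler stand-in for the row-of-R scoring described above; on exit, A(:,order) equals Q*R up to rounding.
function [Q, R, order] = Gram_Schmidt_ranked(A, w)
    [m, n] = size(A);
    Q = complex(zeros(m, n));
    R = complex(zeros(n, n));
    order = 1:n;                      % original column indices of A
    V = A;                            % residual columns (projections removed so far)
    for j = 1:n
        % rank the remaining residual columns by weighted energy and pivot
        scores = w.' * abs(V(:, j:n)).^2;
        [~, k] = max(scores);
        k = k + j - 1;
        V(:, [j k])  = V(:, [k j]);   % swap the winner (and its bookkeeping) into place j
        R(:, [j k])  = R(:, [k j]);
        order([j k]) = order([k j]);
        % usual weighted Gram-Schmidt step against the chosen column
        R(j,j) = norm(V(:,j));
        Q(:,j) = V(:,j) / R(j,j);
        qw = (conj(Q(:,j)) .* w) ./ (w.' * abs(Q(:,j)).^2);
        for i = j+1 : n
            R(j,i) = V(:,i).' * qw;
            V(:,i) = V(:,i) - R(j,i) * Q(:,j);
        end
    end
end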
In structural dynamics, one approach to reducing the number of degrees of freedom is to calculate the eigenvalues, assuming you have representative values for the mass and stiffness matrices. If you think about it, the above approach can be used to "calculate" the M and K (and C) matrices from measured response, and also identify the "measurement response shapes" that are significantly represented in the data. These are different from, and potentially more important than, the mode shapes. So you can solve very difficult problems, i.e., estimation of state matrices and the number of degrees of freedom represented, from measured response, by the above approach. If you read up on N4SID, its author did something similar, except he used SVD instead of GSO. I don't like the technical description of N4SID; there is too much focus on vector projection notation, which is simply a dot product.
There may be one or two errors in the above information, I'm writing this off the top of my head, before rushing off to work. So, check the algorithm / equations as you implement... Good Luck
Coming back to your question, of how to optimize the algorithm when you weight with w. Here is a basic GSO algorithm, without the sorting, written compatible with your function.
Note: the code below is written in Octave, not MATLAB; there are some minor differences.
function [Q, R] = Gram_Schmidt_2(A, w)
[m, n] = size(A);
Q = complex(zeros(m, n));
R = complex(zeros(n, n));
# Outer loop identifies the basis vectors
for j = 1:n
aCol = A(:,j);
# Subtract off the basis vector
for i = 1:(j-1)
R(i,j) = ctranspose(Q(:,i)) * aCol;
aCol = aCol - R(i,j) * Q(:,i);
end
amp_A_col = norm(aCol);
R(j,j) = amp_A_col;
Q(:,j) = aCol / amp_A_col;
end
end
To get your algorithm, only change one line. But you lose a lot of speed, because "ctranspose(Q(:,i)) * aCol" is a single vector operation, whereas "sum( aCol .* conj( Q(:,i) ) .* w )" forms an element-wise triple product and then reduces it.
function [Q, R] = Gram_Schmidt_2(A, w)
[m, n] = size(A);
Q = complex(zeros(m, n));
R = complex(zeros(n, n));
# Outer loop identifies the basis vectors
for j = 1:n
aCol = A(:,j);
# Subtract off the basis vector
for i = 1:(j-1)
# R(i,j) = ctranspose(Q(:,i)) * aCol;
R(i,j) = sum( aCol .* conj( Q(:,i) ) .* w ) / ...
sum( Q(:,i) .* conj( Q(:,i) ) .* w );
aCol = aCol - R(i,j) * Q(:,i);
end
amp_A_col = norm(aCol);
R(j,j) = amp_A_col;
Q(:,j) = aCol / amp_A_col;
end
end
You can change it back to a vector operation by weighting aCol and Q by the sqrt of w.
function [Q, R] = Gram_Schmidt_3(A, w)
[m, n] = size(A);
Q = complex(zeros(m, n));
R = complex(zeros(n, n));
Q_sw = complex(zeros(m, n));
sw = w .^ 0.5;
for j = 1:n
aCol = A(:,j);
aCol_sw = aCol .* sw;
# Subtract off the basis vector
for i = 1:(j-1)
# R(i,j) = ctranspose(Q(:,i)) * aCol;
numTerm = ctranspose( Q_sw(:,i) ) * aCol_sw;
denTerm = ctranspose( Q_sw(:,i) ) * Q_sw(:,i);
R(i,j) = numTerm / denTerm;
aCol_sw = aCol_sw - R(i,j) * Q_sw(:,i);
end
aCol = aCol_sw ./ sw;
amp_A_col = norm(aCol);
R(j,j) = amp_A_col;
Q(:,j) = aCol / amp_A_col;
Q_sw(:,j) = Q(:,j) .* sw;
end
end
As pointed out by JacobD, the above function does not run faster. Possibly it takes time to create the additional arrays. Another grouping strategy for the triple product is to group w with conj(Q). Hope this is faster...
function [Q, R] = Gram_Schmidt_4(A, w)
[m, n] = size(A);
Q = complex(zeros(m, n));
R = complex(zeros(n, n));
for j = 1:n
aCol = A(:,j);
for i = 1:(j-1)
cqw = conj(Q(:,i)) .* w;
R(i,j) = ( transpose( aCol ) * cqw) ...
/ (transpose( Q(:,i) ) * cqw);
aCol = aCol - R(i,j) * Q(:,i);
end
amp_A_col = norm(aCol);
R(j,j) = amp_A_col;
Q(:,j) = aCol / amp_A_col;
end
end
Here's a driver function to time different versions.
function Gram_Schmidt_tester_2
nSamples = 360000;
nMeas = 100;
nMeas = 15;
A = complex( rand(nSamples,nMeas), rand(nSamples,nMeas));
w = rand(nSamples, 1);
profile on;
[Q1, R1] = Gram_Schmidt_basic(A);
profile off;
data1 = profile ("info");
tData1=data1.FunctionTable(1).TotalTime;
approx_zero1 = A - Q1 * R1;
max_value1 = max(max(abs(approx_zero1)));
profile on;
[Q2, R2] = Gram_Schmidt_w_Orig(A, w);
profile off;
data2 = profile ("info");
tData2=data2.FunctionTable(1).TotalTime;
approx_zero2 = A - Q2 * R2;
max_value2 = max(max(abs(approx_zero2)));
sw=w.^0.5;
profile on;
[Q3, R3] = Gram_Schmidt_sqrt_w(A, w);
profile off;
data3 = profile ("info");
tData3=data3.FunctionTable(1).TotalTime;
approx_zero3 = A - Q3 * R3;
max_value3 = max(max(abs(approx_zero3)));
profile on;
[Q4, R4] = Gram_Schmidt_4(A, w);
profile off;
data4 = profile ("info");
tData4=data4.FunctionTable(1).TotalTime;
approx_zero4 = A - Q4 * R4;
max_value4 = max(max(abs(approx_zero4)));
profile on;
[Q5, R5] = Gram_Schmidt_5(A, w);
profile off;
data5 = profile ("info");
tData5=data5.FunctionTable(1).TotalTime;
approx_zero5 = A - Q5 * R5;
max_value5 = max(max(abs(approx_zero5)));
profile on;
[Q2a, R2a] = Gram_Schmidt2a(A, w);
profile off;
data2a = profile ("info");
tData2a=data2a.FunctionTable(1).TotalTime;
approx_zero2a = A - Q2a * R2a;
max_value2a = max(max(abs(approx_zero2a)));
profshow (data1, 6);
profshow (data2, 6);
profshow (data3, 6);
profshow (data4, 6);
profshow (data5, 6);
profshow (data2a, 6);
sprintf('Time for %s is %5.3f sec with %d samples and %d meas, max value is %g',
data1.FunctionTable(1).FunctionName,
data1.FunctionTable(1).TotalTime,
nSamples, nMeas, max_value1)
sprintf('Time for %s is %5.3f sec with %d samples and %d meas, max value is %g',
data2.FunctionTable(1).FunctionName,
data2.FunctionTable(1).TotalTime,
nSamples, nMeas, max_value2)
sprintf('Time for %s is %5.3f sec with %d samples and %d meas, max value is %g',
data3.FunctionTable(1).FunctionName,
data3.FunctionTable(1).TotalTime,
nSamples, nMeas, max_value3)
sprintf('Time for %s is %5.3f sec with %d samples and %d meas, max value is %g',
data4.FunctionTable(1).FunctionName,
data4.FunctionTable(1).TotalTime,
nSamples, nMeas, max_value4)
sprintf('Time for %s is %5.3f sec with %d samples and %d meas, max value is %g',
data5.FunctionTable(1).FunctionName,
data5.FunctionTable(1).TotalTime,
nSamples, nMeas, max_value5)
sprintf('Time for %s is %5.3f sec with %d samples and %d meas, max value is %g',
data2a.FunctionTable(1).FunctionName,
data2a.FunctionTable(1).TotalTime,
nSamples, nMeas, max_value2a)
end
On my old home laptop, in Octave, the results are
ans = Time for Gram_Schmidt_basic is 0.889 sec with 360000 samples and 15 meas, max value is 1.57009e-16
ans = Time for Gram_Schmidt_w_Orig is 0.952 sec with 360000 samples and 15 meas, max value is 6.36717e-16
ans = Time for Gram_Schmidt_sqrt_w is 0.390 sec with 360000 samples and 15 meas, max value is 6.47366e-16
ans = Time for Gram_Schmidt_4 is 0.452 sec with 360000 samples and 15 meas, max value is 6.47366e-16
ans = Time for Gram_Schmidt_5 is 2.636 sec with 360000 samples and 15 meas, max value is 6.47366e-16
ans = Time for Gram_Schmidt2a is 0.905 sec with 360000 samples and 15 meas, max value is 6.68443e-16
These results indicate the fastest algorithm is the sqrt_w algorithm above at 0.39 sec, followed by the grouping of conj(Q) with w (above) at 0.452 sec, then version 2 of Amro's solution at 0.905 sec, then the original algorithm in the question at 0.952 sec, then a version 5 which interchanges rows/columns to see whether row-oriented storage helps (code not included) at 2.636 sec. These results indicate the sqrt(w) split between A and Q is the fastest solution. But these results are not consistent with JacobD's comment about sqrt(w) not being faster.

It is possible to vectorize this so only one loop is necessary. The important fundamental change from the original algorithm is that if you swap the inner and outer loops you can vectorize the projection of the reference vector onto all remaining vectors. Working off @Amro's solution, I found that an inner loop is actually faster than the matrix subtraction. I do not understand why this would be. Timing this against @Amro's solution, it is about 45% faster.
function [Q, R] = Gram_Schmidt5(A, w)
Q = A;
n_dimensions = size(A, 2);
R = zeros(n_dimensions);
R(1, 1) = norm(Q(:, 1));
Q(:, 1) = Q(:, 1) ./ R(1, 1);
for i = 2 : n_dimensions
Qw = (Q(:, i - 1) .* w)' * Q(:, (i - 1) : end);
R(i - 1, i : end) = Qw(2:end) / Qw(1);
%% Surprisingly this loop beats the matrix multiply
for j = i : n_dimensions
Q(:, j) = Q(:, j) - Q(:, i - 1) * R(i - 1, j);
end
%% This multiply is slower than above
% Q(:, i : end) = ...
% Q(:, i : end) - ...
% Q(:, i - 1) * R(i - 1, i : end);
R(i, i) = norm(Q(:,i));
Q(:, i) = Q(:, i) ./ R(i, i);
end

Related

Tricks to improve the performance of a custom function in Julia

I am replicating in Julia a sequence of steps originally written in MATLAB. In Octave, this procedure takes 1.4582 seconds, and in Julia (using Jupyter) it takes approximately 10 seconds. I'll try to be brief in the scripts. My goal is to match or improve on Octave's performance. First of all, I will describe my variables and some functions:
zgrid (double 1x7 size)
kgrid (double 500x1 size)
V0 (double 500x7 size)
P (double 7x7 size) a transition matrix
delta and beta are fixed parameters.
F(z,k) and u(c) are particular functions and are specified in the Julia script.
% Octave script
% V0 is given
[K, Z, K2] = meshgrid(kgrid, zgrid, kgrid);
K = permute(K, [2, 1, 3]);
Z = permute(Z, [2, 1, 3]);
K2 = permute(K2, [2, 1, 3]);
C = max(f(Z,K) + (1-delta)*K - K2,0);
U = u(C);
EV = V0*P';% EV is a 500x7 matrix size
EV = permute(repmat(EV, 1, 1, 500), [3, 2, 1]);
H = U + beta*EV;
[TV, index] = max(H, [], 3);
In Julia, I created a function that replicates this procedure. I used loops, but it takes roughly 9 times longer.
# Julia script
# V0 is the input of my T operator function
V0 = repeat(sqrt.(kgrid), outer = [1,7]);
F = (z,k) -> exp(z)*(k^α);
u = (c) -> (c^(1-μ) - 1)/(1-μ)
# parameters
α = 1/3
β = 0.987
δ = 0.012;
μ = 2
Kss = 48.1905148382166
kgrid = range(0.75*Kss, stop=1.25*Kss, length=500);
zgrid = [-0.06725382459813659, -0.044835883065424395, -0.0224179415327122, 0 , 0.022417941532712187, 0.04483588306542438, 0.06725382459813657]
function T(V)
E=V*P'
T1 = zeros(Float64, 500, 7 )
aux = zeros(Float64, 500)
for i = 1:7
for j = 1:500
for l = 1:500
c = maximum( (F(zgrid[i],kgrid[j]) + (1-δ)*kgrid[j] - kgrid[l], 0) )
aux[l] = u(c) + β*E[l,i]
end
T1[j,i] = maximum(aux)
end
end
return T1
end
I would very much like to improve my performance in Julia. I believe there is a way to improve, but I am new in Julia programming.
This code runs for me in 5ms. Note that I have made F and u into proper (not anonymous) functions, F_ and u_, but you could get a similar effect by making the anonymous functions const.
Your main problem is that you have a lot of non-const global variables, and also that your main function is doing unnecessary work multiple times, and creating an unnecessary array, aux.
The performance tips section in the manual is essential reading: https://docs.julialang.org/en/v1/manual/performance-tips/
F_(z,k) = exp(z) * (k^(1/3)); # you can still use α, but it must be const
u_(c) = (c^(1-2) - 1)/(1-2)
function T_(V, P, kgrid, zgrid, β, δ)
E = V * P'
T1 = similar(V)
for i in axes(T1, 2)
for j in axes(T1, 1)
temp = F_(zgrid[i], kgrid[j]) + (1-δ)*kgrid[j]
aux = -Inf
for l in eachindex(kgrid)
c = max(0.0, temp - kgrid[l])
aux = max(aux, u_(c) + β * E[l, i])
end
T1[j,i] = aux
end
end
return T1
end
Benchmark:
V0 = repeat(sqrt.(kgrid), outer = [1,7]);
zgrid = sort!(rand(1, 7); dims=2)
kgrid = sort!(rand(500, 1); dims=1)
P = rand(length(zgrid), length(zgrid))
using BenchmarkTools
@btime T_($V0, $P, $kgrid, $zgrid, $β, $δ);
# output: 5.126 ms (4 allocations: 54.91 KiB)
The following should perform much better. The most noticeable differences are that it calculates F 500x less, and doesn't rely on global variables.
function T(V, P, kgrid, zgrid, β, δ)
E=V*P'
T1 = zeros(Float64, 500, 7)
for j = 1:500
for i = 1:7
x = F(zgrid[i],kgrid[j]) + (1-δ)*kgrid[j]
T1[j,i] = maximum(u(max(x - kgrid[l], 0)) + β*E[l,i] for l in 1:500)
end
end
return T1
end

Finding the continued fraction of 2^(1/3) to very high precision

Here I'll use the notation [a0; a1, a2, ...] for the continued fraction a0 + 1/(a1 + 1/(a2 + ...)).
It is possible to find the continued fraction of a number by computing it and then applying the definition, but that requires at least O(n) bits of memory to find a0, a1, ..., an; in practice it is much worse. Using double floating-point precision it is only possible to find a0, a1, ..., a19.
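For reference, here is what that naive definition-based approach looks like in double precision (an illustrative MATLAB sketch of my own; the quotients stop being trustworthy after roughly the 19th):
x  = 2^(1/3);
cf = zeros(1, 20);
for k = 1:20
    cf(k) = floor(x);
    x = 1 / (x - cf(k));   % rounding error compounds with every step
end
disp(cf)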
An alternative is to use the fact that if a, b, c are rational numbers (not all zero) then there exist unique rationals x, y, z such that 1/(a + b*2^(1/3) + c*2^(2/3)) = x + y*2^(1/3) + z*2^(2/3), namely
x = (a^2 - 2bc)/d,   y = (2c^2 - ab)/d,   z = (b^2 - ac)/d,   where d = a^3 + 2b^3 + 4c^3 - 6abc.
So if I represent x, y, and z to absolute precision using the Boost rational library, I can obtain floor(x + y*2^(1/3) + z*2^(2/3)) accurately using only double precision for 2^(1/3) and 2^(2/3), because I only need it to be within 1/2 of the true value. Unfortunately the numerators and denominators of x, y, and z grow considerably fast, and if you use regular floats instead the errors pile up quickly.
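As a quick sanity check of that inversion identity (the same formulas the C++ code below evaluates with exact rationals), here is an illustrative numeric test in MATLAB with arbitrarily chosen coefficients:
t1 = 2^(1/3); t2 = 2^(2/3);
a = 3; b = -2; c = 5;                           % arbitrary coefficients
d = a^3 + 2*b^3 + 4*c^3 - 6*a*b*c;
x = (a^2 - 2*b*c)/d;  y = (2*c^2 - a*b)/d;  z = (b^2 - a*c)/d;
abs( 1/(a + b*t1 + c*t2) - (x + y*t1 + z*t2) )  % agrees to rounding error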
This way I was able to compute a0, a1, ..., a10000 in under an hour, but somehow Mathematica can do that in 2 seconds. Here's my code for reference.
#include <iostream>
#include <cstdint>
#include <boost/multiprecision/cpp_int.hpp>
namespace mp = boost::multiprecision;
int main()
{
const double t_1 = 1.259921049894873164767210607278228350570251;
const double t_2 = 1.587401051968199474751705639272308260391493;
mp::cpp_rational p = 0;
mp::cpp_rational q = 1;
mp::cpp_rational r = 0;
for(unsigned int i = 1; i != 10001; ++i) {
double p_f = static_cast<double>(p);
double q_f = static_cast<double>(q);
double r_f = static_cast<double>(r);
uint64_t floor = p_f + t_1 * q_f + t_2 * r_f;
std::cout << floor << ", ";
p -= floor;
//std::cout << floor << " " << p << " " << q << " " << r << std::endl;
mp::cpp_rational den = (p * p * p + 2 * q * q * q +
4 * r * r * r - 6 * p * q * r);
mp::cpp_rational a = (p * p - 2 * q * r) / den;
mp::cpp_rational b = (2 * r * r - p * q) / den;
mp::cpp_rational c = (q * q - p * r) / den;
p = a;
q = b;
r = c;
}
return 0;
}
The Lagrange algorithm
The algorithm is described for example in Knuth's book The Art of Computer Programming, vol 2 (Ex 13 in section 4.5.3 Analysis of Euclid's Algorithm, p. 375 in 3rd edition).
Let f be a polynomial of integer coefficients whose only real root is an irrational number x0 > 1. Then the Lagrange algorithm calculates the consecutive quotients of the continued fraction of x0.
I implemented it in python
def cf(a, N=10):
"""
a : list - coefficients of the polynomial,
i.e. f(x) = a[0] + a[1]*x + ... + a[n]*x^n
N : number of quotients to output
"""
# Degree of the polynomial
n = len(a) - 1
# List of consecutive quotients
ans = []
def shift_poly():
"""
Replaces polynomial f(x) with f(x+1) (shifts its graph to the left).
"""
for k in range(n):
for j in range(n - 1, k - 1, -1):
a[j] += a[j+1]
for _ in range(N):
quotient = 1
shift_poly()
# While the root is >1 shift it left
while sum(a) < 0:
quotient += 1
shift_poly()
# Otherwise, we have the next quotient
ans.append(quotient)
# Replace polynomial f(x) with -x^n * f(1/x)
a.reverse()
a = [-x for x in a]
return ans
It takes about 1s on my computer to run cf([-2, 0, 0, 1], 10000). (The coefficients correspond to the polynomial x^3 - 2 whose only real root is 2^(1/3).) The output agrees with the one from Wolfram Alpha.
Caveat
The coefficients of the polynomials evaluated inside the function quickly become quite large integers, so this approach needs some bigint implementation in other languages. (Pure Python 3 deals with it natively, but NumPy, for example, doesn't.)
You might have more luck computing 2^(1/3) to high accuracy and then trying to derive the continued fraction from that, using interval arithmetic to determine if the accuracy is sufficient.
Here's my stab at this in Python, using Halley iteration to compute 2^(1/3) in fixed point. The dead code is an attempt to compute fixed-point reciprocals more efficiently than Python via Newton iteration -- no dice.
Timing from my machine is about thirty seconds, spent mostly trying to extract the continued fraction from the fixed point representation.
prec = 40000
a = 1 << (3 * prec + 1)
two_a = a << 1
x = 5 << (prec - 2)
while True:
x_cubed = x * x * x
two_x_cubed = x_cubed << 1
x_prime = x * (x_cubed + two_a) // (two_x_cubed + a)
if -1 <= x_prime - x <= 1: break
x = x_prime
cf = []
four_to_the_prec = 1 << (2 * prec)
for i in range(10000):
q = x >> prec
r = x - (q << prec)
cf.append(q)
if True:
x = four_to_the_prec // r
else:
x = 1 << (2 * prec - r.bit_length())
while True:
delta_x = (x * ((four_to_the_prec - r * x) >> prec)) >> prec
if not delta_x: break
x += delta_x
print(cf)

How to keep the distance between $n$ particles within a certain range?

I am working on a problem in Molecular Dynamics and need to randomly generate a position array for np particles within a box of size [-L,L] x [-L,L]. In fact, I need to generate the x-array for the x-coordinates with x(1) = 0 and the y-array for the y-coordinates with y(1)=y(2) =0. I need the particles to be such that the distances between neighboring particles are within some range (e.g: 0.9 <= r <= 1.1) like in the following picture:
However in my code I get something like this:
See how the red lines are larger than what I want.
My code is
REAL, DIMENSION(np) :: x, y
REAL :: w1, w2, minv, maxv, xij, yij, rij
INTEGER :: i, j
!Generating random coordinates for the particles
x(1) = 0.0d0
y(1) = 0.0d0
y(2) = 0.0d0
!-------------------------------------------------------------------------
! translation and rotation of the whole system were frozen (saving 4 degrees of
! freedom)
! x(1) = 0.0d0; y(1) = 0.0d0 fix one particle in the origin
! y(2) = 0.0d0 fix the second particle on the x-axis
!-------------------------------------------------------------------------
rmatrix = 100.0
minv = 0.0
maxv = 10
iter0 = 0
101 DO WHILE(maxv >= 1.1 .OR. minv <= 0.9)
iter0 = iter0 + 1
PRINT *, iter0
CALL init_random_seed()
DO i = 2, np
CALL RANDOM_NUMBER(w1)
x(i) = 10 * w1 - 5
END DO
DO i = 3, np
CALL RANDOM_NUMBER(w2)
y(i) = 10 * w2 - 5
END DO
! rmatrix contains the distances between all particles
DO i = 1, np
DO j = 1, np
IF(j .NE. i) THEN
xij = x(i) - x(j)
yij = y(i) - y(j)
rij = SQRT(xij * xij + yij * yij)
rmatrix(i,j) = rij
END IF
END DO
END DO
minv = MINVAL(rmatrix) ! This is the minimum distance between any two
! particles ( distance cannot be smaller)
! which is the left endpoint of the range interval
DO i = 1, np ! Here is my attempt to control the right endpoint of
DO j = 1, np ! the range interval. ( This needs to be edited)
IF(j .NE. i) THEN
maxv = MIN(maxv, rmatrix(i,j))
END IF
END DO
IF(maxv >= 1.1) THEN
GOTO 101
END IF
END DO
END DO
CONTAINS
SUBROUTINE init_random_seed()
INTEGER :: i, n, clock
INTEGER, DIMENSION(:), ALLOCATABLE :: seed
CALL RANDOM_SEED(size = n)
ALLOCATE(seed(n))
CALL SYSTEM_CLOCK(COUNT=clock)
seed = clock + 37 * (/ (i - 1, i = 1, n) /)
CALL RANDOM_SEED(PUT = seed)
END SUBROUTINE init_random_seed

Vectorizing nested loops in matlab using bsxfun and with GPU

For loops seem to be extremely slow, so I was wondering if the nested loops in the code shown next could be vectorized using bsxfun and maybe GPU could be introduced too.
Code
%// Paramaters
i = 1;
j = 3;
n1 = 1500;
n2 = 1500;
%// Pre-allocate for output
LInc(n1+n2,n1+n2)=0;
%// Nested Loops - I
for x = 1:n1
for y = 1:n1
num = ((n2 ^ 2) * (L1(i, i) + L2(j, j) + 1)) - (n2 * n * (L1(x,i) + L1(y,i)));
LInc(x, y) = L1(x, y) + (num/denom);
LInc(y, x) = LInc(x, y);
end
end
%// Nested Loops - II
for x = 1:n1
for y = 1:n2
num = (n1 * n * L1(x,i)) + (n2 * n * L2(y,j)) - ((n1 * n2 * (L1(i, i) + L2(j, j) + 1)));
LInc(x, n1+y) = num/denom;
LInc(n1+y, x) = LInc(x, n1+y);
end
end
Edit 1: n and denom could be assumed as constants too.
Here are vectorized CPU and GPU codes and I am hoping that I am using at least good practices for the GPU code and the benchmarking later on.
CPU Code
%// Pre-allocate for output
LInc(n1+n2,n1+n2)=0;
%// Calculate num/denom value for stage 1 and 2
nd1 = L1 + (((n2 ^ 2) * (L1(i, i) + L2(j, j) + 1)) - n2*n*bsxfun(@plus,L1(:,i),L1(:,i).'))./denom; %//'
nd2 = (bsxfun(@plus,n1*n*L1(:,i),n2*n*L2(:,j).') - ((n1 * n2 * (L1(i, i) + L2(j, j) + 1))))./denom; %//'
%// Plug in the values in the output matrix
LInc(1:n1,1:n1) = tril(nd1) + tril(nd1,-1).'; %//'
LInc(n1+1:end,1:n1) = nd2.'; %//'
LInc(1:n1,n1+1:end) = nd2;
GPU Code
%// Pre-allocate for output
gLInc = zeros(n1+n2,n1+n2,'gpuArray');
%// Convert to gpu arrays
gL1 = gpuArray(L1);
gL2 = gpuArray(L2);
%// Calculate num/denom value for stage 1 and 2
nd1 = gL1 + (((n2 ^ 2) * (gL1(i, i) + gL2(j, j) + 1)) - n2*n*bsxfun(@plus,gL1(:,i),gL1(:,i).'))./denom; %//'
nd2 = (bsxfun(@plus,n1*n*gL1(:,i),n2*n*gL2(:,j).') - ((n1 * n2 * (gL1(i, i) + gL2(j, j) + 1))))./denom; %//'
%// Plug in the values in the output matrix
gLInc(1:n1,1:n1) = tril(nd1) + tril(nd1,-1).'; %//'
gLInc(n1+1:end,1:n1) = nd2.'; %//'
gLInc(1:n1,n1+1:end) = nd2;
%// Gather data from GPU back to CPU
LInc = gather(gLInc);
Benchmarking
GPU benchmarking tips were taken from Measure and Improve GPU Performance.
%// Warm up GPU call with insignificant small scalar inputs, just in case
%// gputimeit doesn't do the same
temp1 = modp2(1,1,1,1,1,1,1,1); %// This is vectorized GPU code
i = 1;
j = 3;
n = 1000; %// Assumed
denom = 1e6; %// Assumed
N_arr = [50 100 200 500 1000 1500]; %// array elements for N (datasize)
timeall = zeros(3,numel(N_arr));
for k1 = 1:numel(N_arr)
N = N_arr(k1);
n1 = N; %// n1, n2 are assumed identical for less-complicated benchmarking
n2 = N;
L1 = rand(n1,n1);
L2 = rand(n2,j);
f = @() modp0(i,j,n1,n2,L1,L2,n,denom);%// Original CPU w/ preallocation
timeall(1,k1) = timeit(f);
clear f
f = @() modp1(i,j,n1,n2,L1,L2,n,denom);%// Vectorized CPU code
timeall(2,k1) = timeit(f);
clear f
f = @() modp2(i,j,n1,n2,L1,L2,n,denom);%// Vectorized GPU (GTX 750 Ti) code
timeall(3,k1) = gputimeit(f);
clear f
end
%// Display benchmark results
figure,hold on, grid on
plot(N_arr,timeall(1,:),'-b.')
plot(N_arr,timeall(2,:),'-ro')
plot(N_arr,timeall(3,:),'-kx')
legend('Original CPU','Vectorized CPU','Vectorized GPU (GTX 750 Ti)')
xlabel('Datasize (N) ->'),ylabel('Time(sec) ->')
Results
(benchmark plot: runtime vs. datasize N for the original CPU, vectorized CPU, and vectorized GPU versions)
Conclusions
Results show that the vectorized GPU code performs really well with higher datasize and goes from slower than both the vectorized CPU and original code to being twice as fast as the vectorized CPU code.
If you have not done so, you should preallocate LInc.
LInc = zeros(n1+n2, n1+n2);
If you want to vectorize it, you don't need to use bsxfun to vectorize your code. I think you can do something like
x = 1:n1;
y = 1:n1;
num = ((n2 ^ 2) * (L1(i, i) + L2(j, j) + 1)) - (n2 * n * (L1(x,i) + L1(y,i)));
LInc(x, y) = L1(x, y) + (num/denom);
However, this code is confusing to me because as it is, you are overwriting the value of LInc several times. Without knowing what your goal is its hard for me to help more. The above code probably will not return the same values as your function.

Facebook Hacker Cup: Power Overwhelming

A lot of people at Facebook like to play Starcraft II™. Some of them have made a custom game using the Starcraft II™ map editor. In this game, you play as the noble Protoss defending your adopted homeworld of Shakuras from a massive Zerg army. You must do as much damage to the Zerg as possible before getting overwhelmed. You can only build two types of units, shield generators and warriors. Shield generators do no damage, but your army survives for one second per shield generator that you build. Warriors do one damage every second. Your army is instantly overrun after your shield generators expire. How many shield generators and how many warriors should you build to inflict the maximum amount of damage on the Zerg before your army is overrun? Because the Protoss value bravery, if there is more than one solution you should return the one that uses the most warriors.
Constraints
1 ≤ G (cost for one shield generator) ≤ 100
1 ≤ W (cost for one warrior) ≤ 100
G + W ≤ M (available funds) ≤ 1000000000000 (10^12)
Here's a solution whose complexity is O(W). Let g be the number of generators we build, and similarly let w be the number of warriors we build (and G, W be the corresponding prices per unit).
We note that we want to maximize w*g subject to w*W + g*G <= M.
First, we'll get rid of one of the variables. Note that if we choose a value for g, then obviously we should buy as many warriors as possible with the remaining amount of money M - g*G. In other words, w = floor((M-g*G)/W).
Now, the problem is to maximize g*floor((M-g*G)/W) subject to 0 <= g <= floor(M/G). We want to get rid of the floor, so let's consider W distinct cases. Let's write g = W*k + r, where 0 <= r < W is the remainder when dividing g by W.
The idea is now to fix r, and insert the expression for g and then let k be the variable in the equation. We'll get the following quadratic equation in k:
Let p = floor((M - r*G)/W), then the equation is (-GW) * k^2 + (Wp - rG)k + rp.
This is a quadratic in k which goes to negative infinity as k goes to plus or minus infinity, so it has a global maximum at k = -B/(2A). To find the maximum value over legal values of k, we'll try the minimum legal value of k, the maximum legal value of k, and the two integers nearest the real maximum, if they are within the legal range.
The overall maximum for all values of r is the one we are seeking. Since there are W values for r, and it takes O(1) to compute the maximum for a fixed value, the overall time is O(W).
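For concreteness, here is a hypothetical MATLAB/Octave sketch of that O(W) search (my own illustration, not part of the original answer). Bear in mind that doubles are exact only up to 2^53, so for the full contest limits (M up to 10^12) the damage values overflow that range and you would need exact integer arithmetic, as in the gmp-based C++ answer further down.
function [bestG, bestW, bestD] = max_damage(G, W, M)
    % O(W) search: write g = W*k + r and, for each r, try the boundary and vertex values of k.
    % bestG/bestW are the unit counts to build; G/W are the unit costs.
    bestD = -1; bestG = 0; bestW = 0;
    gMax = floor(M / G);                     % most generators we can afford
    for r = 0 : W-1
        kMax = floor((gMax - r) / W);
        if kMax < 0, continue; end
        p = floor((M - r*G) / W);            % warriors bought when k = 0 (g = r)
        kStar = (W*p - r*G) / (2*G*W);       % real-valued vertex of the parabola in k
        cand = unique(max(0, min(kMax, [0, kMax, floor(kStar), ceil(kStar)])));
        for k = cand
            g = W*k + r;
            w = floor((M - g*G) / W);
            d = g * w;
            if d > bestD || (d == bestD && w > bestW)   % prefer more warriors on ties
                bestD = d; bestG = g; bestW = w;
            end
        end
    end
end
For example, max_damage(2, 3, 20) returns bestG = 4, bestW = 4, bestD = 16.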
If you build g generators, and w warriors, you can do a total damage of
w (damage per time) × g (time until game-over).
The funds constraint restricts the value of g and w to W × w + G × g ≤ M.
If you build g generators, you can build at most (M - g × G)/W warriors, and do g × (M - g × G)/W damage.
This function has a maximum at g = M / (2 G), which results in M^2 / (4 G W) damage.
Summary:
Build M / (2 G) shield generators.
Build M / (2 W) warriors.
Do M^2 / (4 G W) damage.
Since you can only build integer amounts of any of the two units, this reduces to the optimization problem:
maximize g × w
with respect to g × G + w × W ≤ M and g, w ∈ ℤ+
The general problem of Integer Programming is NP-complete, so the best algorithm for this is to check all integer values close to the real-valued solution above.
If you find some pair (gi, wi), with total damage di, you only have to check values where gj × wj ≥ di. This and the original condition W × w + G × g ≤ M constrain the search space with each pair found.
F#-code:
let findBestSetup (G : int) (W : int) (M : int) =
let mutable bestG = int (float M / (2.0 * float G))
let mutable bestW = int (float M / (2.0 * float W))
let mutable bestScore = bestG * bestW
let maxW = (M + isqrt (M*M - 4 * bestScore * G * W)) / (2*G)
let minW = (M - isqrt (M*M - 4 * bestScore * G * W)) / (2*G)
for w = minW to maxW do
// ceiling of (bestScore / w)
let minG = (bestScore + w - 1) / w
let maxG = (M - W*w)/G
for g = minG to maxG do
let score = g * w
if score > bestScore || score = bestScore && w > bestW then
bestG <- g
bestW <- w
bestScore <- score
bestG, bestW, bestScore
This assumed W and G were the counts and the cost of each was equal to 1. So it's obsolete with the updated question.
Damage = LifeTime*DamagePerSecond = W * G
So you need to maximize W*G with the constraint G+W <= M. Since both Generators and Warriors are always good we can use G+W = M.
Thus the function we want to maximize becomes W*(M-W).
Now we set the derivative = 0:
M-2W=0
W = M/2
But since we need the solution to the discrete case (you can't have x.5 warriors and x.5 generators) we use the values closest to the continuous solution (this is optimal due to the properties of a parabola).
If M is even then the continuous solution is identical to the discrete solution. If M is odd then we have two closest solutions, one with one more warrior than generators, and one the other way round. And the OP said we should choose more warriors.
So the final solution is:
G = W = M/2 for even M
and G+1 = W = (M+1)/2 for odd M.
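For example, with M = 7 the two candidates are (G, W) = (3, 4) and (4, 3); both give 12 damage, and the warrior tie-break picks G = 3, W = 4, matching the odd-M formula above.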
g = total generators
gc = generator cost
w = warriors
wc = warrior cost
m = money
d = total damage
g = (m - (w*wc))/gc
w = (m - (g*gc))/wc
d = g * w
d = ((m - (w*wc))/gc) * ((m - (g*gc))/wc)
d = ((m - (w*wc))/gc) * ((m - (((m - (w*wc))/gc)*gc))/wc) damage as a function of warriors
I then tried to compute an array of all damages then find max but of course it'd not complete in 6 mins with m in the trillions.
To find the max you'd have to differentiate that equation and find where it equals zero, which I've forgotten how to do, seeing as I haven't done math in about 6 years.
Not a really a solution but here goes.
The assumption is that you already get a high value of damage when the number of shields equals 1 (cannot equal zero or no damage will be done) and the number of warriors equals (m-g)/w. Iterating up should (again an assumption) reach the point of compromise between the number of shields and warriors where damage is maximized. This is handled by the bestDamage > calc branch.
There is very likely a flaw in this reasoning, and it would be preferable to understand the maths behind the problem. As I haven't practised mathematics for a while, I'll just guess that this requires deriving a function.
long bestDamage = 0;
long numShields = 0;
long numWarriors = 0;
for( int k = 1;; k++ ){
// Should move declaration outside of loop
long calc = m / ( k * g ); // k = number of shields
if( bestDamage < calc ) {
bestDamage = calc;
}
if( bestDamage > calc ) {
numShields = k;
numWarriors = (m - (numShields*g))/w;
break;
}
}
System.out.println( "numShields:" + numShields );
System.out.println( "numWarriors:" + numWarriors );
System.out.println( bestDamage );
Since I solved this last night, I thought I'd post my C++ solution. The algorithm starts with an initial guess, located at the global maximum of the continuous case. Then it searches a little to the left/right of the initial guess, terminating early when the continuous case dips below an already established maximum. Interestingly, the 5 example answers posted by FB contained 3 wrong answers:
Case #1
ours: 21964379805 dmg: 723650970382348706550
theirs: 21964393379 dmg: 723650970382072360271 Wrong
Case #2
ours: 1652611083 dmg: 6790901372732348715
theirs: 1652611083 dmg: 6790901372732348715
Case #3
ours: 12472139015 dmg: 60666158566094902765
theirs: 12472102915 dmg: 60666158565585381950 Wrong
Case #4
ours: 6386438607 dmg: 10998633262062635721
theirs: 6386403897 dmg: 10998633261737360511 Wrong
Case #5
ours: 1991050385 dmg: 15857126540443542515
theirs: 1991050385 dmg: 15857126540443542515
Finally the code (it uses libgmpxx for large numbers). I doubt the code is optimal, but it does complete in 0.280ms on my personal computer for the example input given by FB....
#include <iostream>
#include <gmpxx.h>
using namespace std;
typedef mpz_class Integer;
typedef mpf_class Real;
static Integer getDamage( Integer g, Integer G, Integer W, Integer M)
{
Integer w = (M - g * G) / W;
return g * w;
}
static Integer optimize( Integer G, Integer W, Integer M)
{
Integer initialNg = M / ( 2 * G);
Integer bestNg = initialNg;
Integer bestDamage = getDamage ( initialNg, G, W, M);
// search left
for( Integer gg = initialNg - 1 ; ; gg -- ) {
Real bestTheoreticalDamage = gg * (M - gg * G) / (Real(W));
if( bestTheoreticalDamage < bestDamage) break;
Integer dd = getDamage ( gg, G, W, M);
if( dd >= bestDamage) {
bestDamage = dd;
bestNg = gg;
}
}
// search right
for( Integer gg = initialNg + 1 ; ; gg ++ ) {
Real bestTheoreticalDamage = gg * (M - gg * G) / (Real(W));
if( bestTheoreticalDamage < bestDamage) break;
Integer dd = getDamage ( gg, G, W, M);
if( dd > bestDamage) {
bestDamage = dd;
bestNg = gg;
}
}
return bestNg;
}
int main( int, char **)
{
Integer N;
cin >> N;
for( int i = 0 ; i < N ; i ++ ) {
cout << "Case #" << i << "\n";
Integer G, W, M, FB;
cin >> G >> W >> M >> FB;
Integer g = optimize( G, W, M);
Integer ourDamage = getDamage( g, G, W, M);
Integer fbDamage = getDamage( FB, G, W, M);
cout << " ours: " << g << " dmg: " << ourDamage << "\n"
<< " theirs: " << FB << " dmg: " << fbDamage << " "
<< (ourDamage > fbDamage ? "Wrong" : "") << "\n";
}
}
