Loop optimisation - performance

I am trying to understand what cache or other optimizations could be done in the source code to get this loop faster. I think it is quite cache friendly but, are there any experts out there that could squeeze a bit more performance tuning this code?
DO K = 1, NZ
DO J = 1, NY
DO I = 1, NX
SIDEBACK = STEN(I-1,J-1,K-1) + STEN(I-1,J,K-1) + STEN(I-1,J+1,K-1) + &
STEN(I ,J-1,K-1) + STEN(I ,J,K-1) + STEN(I ,J+1,K-1) + &
STEN(I+1,J-1,K-1) + STEN(I+1,J,K-1) + STEN(I+1,J+1,K-1)
SIDEOWN = STEN(I-1,J-1,K) + STEN(I-1,J,K) + STEN(I-1,J+1,K) + &
STEN(I ,J-1,K) + STEN(I ,J,K) + STEN(I ,J+1,K) + &
STEN(I+1,J-1,K) + STEN(I+1,J,K) + STEN(I+1,J+1,K)
SIDEFRONT = STEN(I-1,J-1,K+1) + STEN(I-1,J,K+1) + STEN(I-1,J+1,K+1) + &
STEN(I ,J-1,K+1) + STEN(I ,J,K+1) + STEN(I ,J+1,K+1) + &
STEN(I+1,J-1,K+1) + STEN(I+1,J,K+1) + STEN(I+1,J+1,K+1)
RES(I,J,K) = ( SIDEBACK + SIDEOWN + SIDEFRONT ) / 27.0
END DO
END DO
END DO

Ok, I think I've tried everything I reasonably could, and my conclusion unfortunately is that there is not too much room for optimizations, unless you are willing to go into parallelization. Let's see why, let's see what you can and can't do.
Compiler optimizations
Compilers nowadays are extremely good at optimizing code, much much more than humans are. Relying on the optimizations done by the compilers also have the added benefit that they don't ruin the readability of your source code. Whatever you do, (when optimizing for speed) always try it with every reasonable combination of compiler flags. You can even go as far as to try multiple compilers. Personally I only used gfortran (included in GCC) (OS is 64-bit Windows), which I trust to have efficient and correct optimization techniques.
-O2 almost always improve the speed drastically, but even -O3 is a safe bet (among others, it includes delicious loop unrolling). For this problem, I also tried -ffast-math and -fexpensive-optimizations, they didn't have any measurable effect, but -march-corei7(cpu architecture-specific tuning, specific to Core i7) had, so I did the measurements with -O3 -march-corei7
So how fast it actually is?
I wrote the following code to test your solution and compiled it with -O3 -march-corei7. Usually it ran under 0.78-0.82 seconds.
program benchmark
implicit none
real :: start, finish
integer :: I, J, K
real :: SIDEBACK, SIDEOWN, SIDEFRONT
integer, parameter :: NX = 600
integer, parameter :: NY = 600
integer, parameter :: NZ = 600
real, dimension (0 : NX + 2, 0 : NY + 2, 0 : NZ + 2) :: STEN
real, dimension (0 : NX + 2, 0 : NY + 2, 0 : NZ + 2) :: RES
call random_number(STEN)
call cpu_time(start)
DO K = 1, NZ
DO J = 1, NY
DO I = 1, NX
SIDEBACK = STEN(I-1,J-1,K-1) + STEN(I-1,J,K-1) + STEN(I-1,J+1,K-1) + &
STEN(I ,J-1,K-1) + STEN(I ,J,K-1) + STEN(I ,J+1,K-1) + &
STEN(I+1,J-1,K-1) + STEN(I+1,J,K-1) + STEN(I+1,J+1,K-1)
SIDEOWN = STEN(I-1,J-1,K) + STEN(I-1,J,K) + STEN(I-1,J+1,K) + &
STEN(I ,J-1,K) + STEN(I ,J,K) + STEN(I ,J+1,K) + &
STEN(I+1,J-1,K) + STEN(I+1,J,K) + STEN(I+1,J+1,K)
SIDEFRONT = STEN(I-1,J-1,K+1) + STEN(I-1,J,K+1) + STEN(I-1,J+1,K+1) + &
STEN(I ,J-1,K+1) + STEN(I ,J,K+1) + STEN(I ,J+1,K+1) + &
STEN(I+1,J-1,K+1) + STEN(I+1,J,K+1) + STEN(I+1,J+1,K+1)
RES(I,J,K) = ( SIDEBACK + SIDEOWN + SIDEFRONT ) / 27.0
END DO
END DO
END DO
call cpu_time(finish)
!Use the calculated value, so the compiler doesn't optimize away everything.
!Print the original value as well, because one can never be too paranoid.
print *, STEN(1,1,1), RES(1,1,1)
print '(f6.3," seconds.")',finish-start
end program
Ok, so this is as far as the compiler can take us. What's next?
Store intermediate results?
As you might suspect from the question mark, this one didn't really work. Sorry. But let's not rush that forward.
As mentioned in the comments, your current code calculates every partial sum multiple times, meaning one iteration's STEN(I+1,J-1,K-1) + STEN(I+1,J,K-1) + STEN(I+1,J+1,K-1) will be the next iteration's STEN(I,J-1,K-1) + STEN(I,J,K-1) + STEN(I,J+1,K-1), so no need to fetch and calculate again, you can store those partial results.
The problem is, that we cannot store too many partial results. As you said, your code is already quite cache-friendly, every partial sum you store means one less array element you can store in L1 cache. We could store a few values, from the last few iterations of I (values for index I-2, I-3, etc.), but the compiler almost certainly does that already. I have 2 proofs for this suspicion. First, my manual loop unrolling made the program slower, by about 5%
DO K = 1, NZ
DO J = 1, NY
DO I = 1, NX, 8
SIDEBACK(0) = STEN(I-1,J-1,K-1) + STEN(I-1,J,K-1) + STEN(I-1,J+1,K-1)
SIDEBACK(1) = STEN(I ,J-1,K-1) + STEN(I ,J,K-1) + STEN(I ,J+1,K-1)
SIDEBACK(2) = STEN(I+1,J-1,K-1) + STEN(I+1,J,K-1) + STEN(I+1,J+1,K-1)
SIDEBACK(3) = STEN(I+2,J-1,K-1) + STEN(I+2,J,K-1) + STEN(I+2,J+1,K-1)
SIDEBACK(4) = STEN(I+3,J-1,K-1) + STEN(I+3,J,K-1) + STEN(I+3,J+1,K-1)
SIDEBACK(5) = STEN(I+4,J-1,K-1) + STEN(I+4,J,K-1) + STEN(I+4,J+1,K-1)
SIDEBACK(6) = STEN(I+5,J-1,K-1) + STEN(I+5,J,K-1) + STEN(I+5,J+1,K-1)
SIDEBACK(7) = STEN(I+6,J-1,K-1) + STEN(I+6,J,K-1) + STEN(I+6,J+1,K-1)
SIDEBACK(8) = STEN(I+7,J-1,K-1) + STEN(I+7,J,K-1) + STEN(I+7,J+1,K-1)
SIDEBACK(9) = STEN(I+8,J-1,K-1) + STEN(I+8,J,K-1) + STEN(I+8,J+1,K-1)
SIDEOWN(0) = STEN(I-1,J-1,K) + STEN(I-1,J,K) + STEN(I-1,J+1,K)
SIDEOWN(1) = STEN(I ,J-1,K) + STEN(I ,J,K) + STEN(I ,J+1,K)
SIDEOWN(2) = STEN(I+1,J-1,K) + STEN(I+1,J,K) + STEN(I+1,J+1,K)
SIDEOWN(3) = STEN(I+2,J-1,K) + STEN(I+2,J,K) + STEN(I+2,J+1,K)
SIDEOWN(4) = STEN(I+3,J-1,K) + STEN(I+3,J,K) + STEN(I+3,J+1,K)
SIDEOWN(5) = STEN(I+4,J-1,K) + STEN(I+4,J,K) + STEN(I+4,J+1,K)
SIDEOWN(6) = STEN(I+5,J-1,K) + STEN(I+5,J,K) + STEN(I+5,J+1,K)
SIDEOWN(7) = STEN(I+6,J-1,K) + STEN(I+6,J,K) + STEN(I+6,J+1,K)
SIDEOWN(8) = STEN(I+7,J-1,K) + STEN(I+7,J,K) + STEN(I+7,J+1,K)
SIDEOWN(9) = STEN(I+8,J-1,K) + STEN(I+8,J,K) + STEN(I+8,J+1,K)
SIDEFRONT(0) = STEN(I-1,J-1,K+1) + STEN(I-1,J,K+1) + STEN(I-1,J+1,K+1)
SIDEFRONT(1) = STEN(I ,J-1,K+1) + STEN(I ,J,K+1) + STEN(I ,J+1,K+1)
SIDEFRONT(2) = STEN(I+1,J-1,K+1) + STEN(I+1,J,K+1) + STEN(I+1,J+1,K+1)
SIDEFRONT(3) = STEN(I+2,J-1,K+1) + STEN(I+2,J,K+1) + STEN(I+2,J+1,K+1)
SIDEFRONT(4) = STEN(I+3,J-1,K+1) + STEN(I+3,J,K+1) + STEN(I+3,J+1,K+1)
SIDEFRONT(5) = STEN(I+4,J-1,K+1) + STEN(I+4,J,K+1) + STEN(I+4,J+1,K+1)
SIDEFRONT(6) = STEN(I+5,J-1,K+1) + STEN(I+5,J,K+1) + STEN(I+5,J+1,K+1)
SIDEFRONT(7) = STEN(I+6,J-1,K+1) + STEN(I+6,J,K+1) + STEN(I+6,J+1,K+1)
SIDEFRONT(8) = STEN(I+7,J-1,K+1) + STEN(I+7,J,K+1) + STEN(I+7,J+1,K+1)
SIDEFRONT(9) = STEN(I+8,J-1,K+1) + STEN(I+8,J,K+1) + STEN(I+8,J+1,K+1)
RES(I ,J,K) = ( SIDEBACK(0) + SIDEOWN(0) + SIDEFRONT(0) + &
SIDEBACK(1) + SIDEOWN(1) + SIDEFRONT(1) + &
SIDEBACK(2) + SIDEOWN(2) + SIDEFRONT(2) ) / 27.0
RES(I + 1,J,K) = ( SIDEBACK(1) + SIDEOWN(1) + SIDEFRONT(1) + &
SIDEBACK(2) + SIDEOWN(2) + SIDEFRONT(2) + &
SIDEBACK(3) + SIDEOWN(3) + SIDEFRONT(3) ) / 27.0
RES(I + 2,J,K) = ( SIDEBACK(2) + SIDEOWN(2) + SIDEFRONT(2) + &
SIDEBACK(3) + SIDEOWN(3) + SIDEFRONT(3) + &
SIDEBACK(4) + SIDEOWN(4) + SIDEFRONT(4) ) / 27.0
RES(I + 3,J,K) = ( SIDEBACK(3) + SIDEOWN(3) + SIDEFRONT(3) + &
SIDEBACK(4) + SIDEOWN(4) + SIDEFRONT(4) + &
SIDEBACK(5) + SIDEOWN(5) + SIDEFRONT(5) ) / 27.0
RES(I + 4,J,K) = ( SIDEBACK(4) + SIDEOWN(4) + SIDEFRONT(4) + &
SIDEBACK(5) + SIDEOWN(5) + SIDEFRONT(5) + &
SIDEBACK(6) + SIDEOWN(6) + SIDEFRONT(6) ) / 27.0
RES(I + 5,J,K) = ( SIDEBACK(5) + SIDEOWN(5) + SIDEFRONT(5) + &
SIDEBACK(6) + SIDEOWN(6) + SIDEFRONT(6) + &
SIDEBACK(7) + SIDEOWN(7) + SIDEFRONT(7) ) / 27.0
RES(I + 6,J,K) = ( SIDEBACK(6) + SIDEOWN(6) + SIDEFRONT(6) + &
SIDEBACK(7) + SIDEOWN(7) + SIDEFRONT(7) + &
SIDEBACK(8) + SIDEOWN(8) + SIDEFRONT(8) ) / 27.0
RES(I + 7,J,K) = ( SIDEBACK(7) + SIDEOWN(7) + SIDEFRONT(7) + &
SIDEBACK(8) + SIDEOWN(8) + SIDEFRONT(8) + &
SIDEBACK(9) + SIDEOWN(9) + SIDEFRONT(9) ) / 27.0
END DO
END DO
END DO
And what's worse, it's easy to show we are already pretty close the theoretical minimal possible execution time. In order to calculate all these averages, the absolute minimum we need to do, is access every element at least once, and divide them by 27.0. So you can never get faster than the following code, which executes under 0.48-0.5 seconds on my machine.
program benchmark
implicit none
real :: start, finish
integer :: I, J, K
integer, parameter :: NX = 600
integer, parameter :: NY = 600
integer, parameter :: NZ = 600
real, dimension (0 : NX + 2, 0 : NY + 2, 0 : NZ + 2) :: STEN
real, dimension (0 : NX + 2, 0 : NY + 2, 0 : NZ + 2) :: RES
call random_number(STEN)
call cpu_time(start)
DO K = 1, NZ
DO J = 1, NY
DO I = 1, NX
!This of course does not do what you want to do,
!this is just an example of a speed limit we can never surpass.
RES(I, J, K) = STEN(I, J, K) / 27.0
END DO
END DO
END DO
call cpu_time(finish)
!Use the calculated value, so the compiler doesn't optimize away everything.
print *, STEN(1,1,1), RES(1,1,1)
print '(f6.3," seconds.")',finish-start
end program
But hey, even a negative result is a result. If just accessing every element once (and dividing by 27.0) takes up more than half of the execution time, that just means memory access is the bottle neck. Then maybe you can optimize that.
Less data
If you don't need the full precision of 64-bit doubles, you can declare your array with a type of real(kind=4). But maybe your reals are already 4 bytes. In that case, I believe some Fortran implementations support non-standard 16-bit doubles, or depending on your data you can just use integers (maybe floats multiplied by a number then rounded to integer). The smaller your base type is, the more elements you can fit into the cache. The most ideal would be integer(kind=1), of course, it caused more than a 2x speed up on my machine, compared to real(kind=4). But it depends on the precision you need.
Better locality
Column major arrays are slow when you need data from neighbouring column, and row major ones are slow for neighbouring rows.
Fortunately there is a funky way to store data, called a Z-order curve, which does have applications similar to your use case in computer graphics.
I can't promise it will help, maybe it will be terribly counterproductive, but maybe not. Sorry, I didn't feel like implementing it myself, to be honest.
Parallelization
Speaking of computer graphics, this problem is trivially and extremely well parallelizable, maybe even on a GPU, but if you don't want to go that far, you can just use a normal multicore CPU. The Fortran Wiki seems like a good place to search for Fortran parallelization libraries.

Related

Random walk algorithm for pricing barrier options

How to realize the random walk algorithm with MATLAB? I don't understand this algorithm, because the condition 3 never holds. I post my code in the end.
Choose a time-step h>0 so that M=9/h is a integer.
Set log(L_0) = L_0 = 0.13 and Z(0)=0.
rho_k are independent random variables distributed by the law P(+-1)=0.5
1) log(L_{k+1}) = log(L_k)-0.5*sigma^2*h+sigma*sqrt(h)*rho_{k+1}
2) Z_{k+1} = Z_k - sigma* [(erfc((2*2^(1/2)*(log((25*exp(log L_k))/7) + 9/32))/3)/2 - erfc((2*2^(1/2)*(log(100*exp(log L_k)) + 9/32))/3)/2 + erfc((2*2^(1/2)*(log(7/(25*exp(log L_k))) - 9/32))/3)/56 - erfc((2*2^(1/2)*(log(196/(25*exp(log L_k))) - 9/32))/3)/56 + (exp(log L_k)*((2*2^(1/2)*exp(-(8*(log(7/(25*exp(log L_k))) - 9/32)^2)/9))/(3*exp(log L_k)*pi^(1/2)) - (2*2^(1/2)*exp(-(8*(log(196/(25*exp(log L_k))) - 9/32)^2)/9))/(3*exp(log L_k)*pi^(1/2))))/28 - exp(log L_k)*((2*2^(1/2)*exp(-(8*(log((25*exp(log L_k))/7) + 9/32)^2)/9))/(3*exp(log L_k)*pi^(1/2)) - (2*2^(1/2)*exp(-(8*(log(100*exp(log L_k)) + 9/32)^2)/9))/(3*exp(log L_k)*pi^(1/2))) - (14*2^(1/2)*exp(-(8*(log(7/(25*exp(log L_k))) + 9/32)^2)/9))/(75*exp(log L_k)*pi^(1/2)) + (14*2^(1/2)*exp(-(8*(log(196/(25*exp(log L_k))) + 9/32)^2)/9))/(75*exp(log L_k)*pi^(1/2)) + (2^(1/2)*exp(-(8*(log((25*exp(log L_k))/7) - 9/32)^2)/9))/(150*exp(log L_k)*pi^(1/2)) - (2^(1/2)*exp(-(8*(log(100*exp(log L_k)) - 9/32)^2)/9))/(150*exp(log L_k)*pi^(1/2))))]*sqrt(h)*rho_{k+1};
3) log L_k < log(H)+1/2*sigma^2*h-sigma*sqrt(h)
H=0.28
sigma=0.25
K=0.01
To realize the algorithm we follow the random walk generated by (1),
and at each time t_k; we check whether the condition 3 holds. If it does not, L_k has reached the boundary zone and we stop the chain at log(H). If it does,
we perform (1)-(2) to find log(L_k+1) and Z_{k+1}. If k+1=M; we stop, otherwise we continue with the algorithm.
The outcome of simulating each trajectory is a point (t_kappa, log(L_kappa), Z_kappa).
Evaluate the expectation E[(exp(log(L_{kappa}))-K)^+ * Chi(kappa=M) + Z_{kappa}]=price with the Monte Carlo technique and do 10^6 Monte Carlo runs.
Here is my code. It doesn't work, because I don't get something near 0.0657 (the result).
Y = zeros(1,M+1); %ln L_k = Y(k)
Z = zeros(1,M+1);
sigma = 0.25;
H = 0.28;
K = 0.01;
Y(1) = 0.13;
Z(1) = 0;
M = 90;
h = 0.1;
for k = 1:M+1
vec = [-1 1];
index = random('unid', length(vec),1);
x(1,k) = vec(index);
end %'Rho'
for k = 1:M
if Y(k) < log(H) + 1/2*sigma^2*h - sigma*sqrt(h) %Bedingung
Y(k+1) = Y(k) - (1/2)*(sigma)^2*h + sigma*sqrt(h)*x(k+1);
Z(k+1) = Z(k) + (-sigma*(erfc((2*2^(1/2)*(log((25*exp(Y(k)))/7) + 9/32))/3)/2 - erfc((2*2^(1/2)*(log(100*exp(Y(k))) + 9/32))/3)/2 + erfc((2*2^(1/2)*(log(7/(25*exp(Y(k)))) - 9/32))/3)/56 - erfc((2*2^(1/2)*(log(196/(25*exp(Y(k)))) - 9/32))/3)/56 + (exp(Y(k))*((2*2^(1/2)*exp(-(8*(log(7/(25*exp(Y(k)))) - 9/32)^2)/9))/(3*exp(Y(k))*pi^(1/2)) - (2*2^(1/2)*exp(-(8*(log(196/(25*exp(Y(k)))) - 9/32)^2)/9))/(3*exp(Y(k))*pi^(1/2))))/28 - exp(Y(k))*((2*2^(1/2)*exp(-(8*(log((25*exp(Y(k)))/7) + 9/32)^2)/9))/(3*exp(Y(k))*pi^(1/2)) - (2*2^(1/2)*exp(-(8*(log(100*exp(Y(k))) + 9/32)^2)/9))/(3*exp(Y(k))*pi^(1/2))) - (14*2^(1/2)*exp(-(8*(log(7/(25*exp(Y(k)))) + 9/32)^2)/9))/(75*exp(Y(k))*pi^(1/2)) + (14*2^(1/2)*exp(-(8*(log(196/(25*exp(Y(k)))) + 9/32)^2)/9))/(75*exp(Y(k))*pi^(1/2)) + (2^(1/2)*exp(-(8*(log((25*exp(Y(k)))/7) - 9/32)^2)/9))/(150*exp(Y(k))*pi^(1/2)) - (2^(1/2)*exp(-(8*(log(100*exp(Y(k))) - 9/32)^2)/9))/(150*exp(Y(k))*pi^(1/2))))*sqrt(h)*x(k+1);
else
Y(k)=log(H);
break
end
end
if k == M
end
if (L(M)-K) > 0
(L(M)-K);
else
0;
end
return
exp(Y(M))-K+Z(M)

sum of series AP GP clrs appendix A.1-4

I am trying to prove an equation given in the CLRS exercise book. The equation is:
Sigma k=0 to k=infinity (k-1)/2^k = 0
I solved the LHS but my answer is 1 whereas the RHS should be 0
Following is my solution:
Let's say S = k/2^k = 1/2 + 2/2^2 + 3/2^3 + 4/2^4 ....
2S = 1 + 2/2 + 3/2^2 + 4/2^3 ...
2S - S = 1 + ( 2/2 - 1/2) + (3/2^2 - 2/2^2) + (4/2^3 - 3/2^3)..
S = 1+ 1/2 + 1/2^2 + 1/2^3 + 1/2^4..
S = 2 -- eq 1
Now let's say S1 = (k-1)/2^k = 0/2 + 1/2^2 + 2/2^3 + 3/2^4...
S - S1 = 1/2 + (2/2^2 - 1/2^2) + (3/2^3 - 2/2^3) + (4/2^4 - 3/2^4)....
S - S1 = 1/2 + 1/2^2 + 1/2^3 + 1/2^4...
= 1
From eq 1
2 - S1 = 1
S1 = 1
Whereas the required RHS is 0. Is there anything wrong with my solution? Thanks..
Yes, you have issues in your solution to the problem.
While everything is correct in formulating the value of S, you have calculated the value of S1 incorrectly. You missed substituting the value for k=0 in S1. Whereas, for S, even after putting the value of k, the first term will be 0, so no effect.
Therefore,
S1 = (k-1)/2^k = -1 + 0/2 + 1/2^2 + 2/2^3 + 3/2^4...
// you missed -1 here because you started substituting values from k=1
S - S1 = -(-1) + 1/2 + (2/2^2 - 1/2^2) + (3/2^3 - 2/2^3) + (4/2^4 - 3/2^4)....
S - S1 = 1 + (1/2 + 1/2^2 + 1/2^3 + 1/2^4...)
= 1 + 1
= 2.
From eq 1
2 - S1 = 2
S1 = 0.

Solving a System of four equations (with logs) in Mathematica

I am trying to solve a system of four equations in four variables. I have read a number of threads on similar issues and tried to follow the suggestions. But I think it is a bit messy because of the logs and cross products here. This is the exact system:
7*w = (7*w+5*x+2*y+z) * ( 0.76 + 0.12*Log[w] -0.08*Log[x] -0.03*Log[y] -0.07*Log[7*w+5*x + 2*y + z]),
5*x = (7*w+5*x+2*y+z) * ( 0.84 - 0.08*Log[w] +0.11*Log[x] -0.02*Log[y] -0.08*Log[7*w+5*x + 2*y + z]),
2*y = (7*w+5*x+2*y+z) * (-0.45 - 0.03*Log[w] -0.02*Log[x] +0.05*Log[y] +0.12*Log[7*w+5*x + 2*y + z]),
1*z = (7*w+5*x+2*y+z) * (-0.16 + 0*Log[w] - 0*Log[x] - 0*Log[y] + 0.03*Log[7*w+5*x + 2*y + z])
(FYI-I am a young economist, and this is an extension of a consumer demand system.) Theoretically, we know that there exists a unique solution to this system that is positive.
Trys
Solve & NSolve : As there should be a solution I tried these but neither works. I guess that the system has too many logs to handle.
FindRoot : I started with an initial value of (14,15,10,100) which I get from my data. FindRoot returns the last value (which does not satisfy my system) and the following message.
FindRoot::lstol: The line search decreased the step size to within tolerance specified by AccuracyGoal and PrecisionGoal but was unable.....
I tried different initial values, including the value returned by FindRoot. I tried to analyze the pattern of the solution value at each step. I didn’t see any pattern but noticed that the z values become negative early in the process. So I put bounds on the values. This just stops the code at the minimum value of 0.1. I also tried an exponential system instead of log, same issues.
Reap[FindRoot[{
7*w==(7*w+5*x + 2*y + z)*(0.76 + 0.12*Log[w] -0.08*Log[x] -0.03*Log[y] -0.07*Log[7*w+5*x + 2*y + z]),
5*x==(7*w+5*x + 2*y + z)*(0.84 -0.08*Log[w] +0.11*Log[x] -0.02*Log[y] -0.08*Log[7*w+5*x + 2*y + z]),
2*y==(7*w+5*x + 2*y + z)*(-0.45 - 0.03*Log[w] -0.02*Log[x] +0.05*Log[y] +0.12*Log[7*w+5*x + 2*y + z]),
z==(7*w+5*x + 2*y + z)*(-0.16 + 0*Log[w] -0*Log[x] -0*Log[y] +0.03*Log[7*w+5*x + 2*y + z])},
{{w,14,0.1,500},{x,15,0.1,500},{y,10,0.1,500},
{z,100,0.1,500}},EvaluationMonitor:>Sow[{w,x,y,z}] ]]
FindMinimum : As we can write this problem as a minimization problem, I tried this (following the suggestion here). The value returned did not converge the system or equations to zero. I tried with only the first two equations, and that sort of converged to zero.
Hope this is engaging enough for the experts here! Any ideas how I should find the solution or why can’t I? It’s the first time I am using Mathematica, and unfortunately the first time I am empirically solving a system/optimizing! Thanks a lot.
{g1,g2,g3, g4}={7*w - (7*w+5*x+2*y+z)* (0.76+0.12*Log[w]-0.08*Log[x]-0.03*Log[y] -0.07*Log[7*w+5*x+2*y+z]),5*x - (7*w+5*x+2*y+z)*(0.84-0.08*Log[w]+0.11*Log[x]-0.02*Log[y] -0.08*Log[7*w+5*x+2*y+z]),2*y - (7*w+5*x+2*y+z)*(-0.45-0.03*Log[w]-0.02*Log[x]+0.05*Log[y]+0.12*Log[7*w+5*x+2*y+z]), 1*z - (7*w+5*x+2*y+z)*(-0.16+0*Log[w]-0*Log[x]-0*Log[y]+0.03*Log[7*w+5*x+2*y+z])};subdomain=0<w<100 &&0<x<100 && 0<y<100 && 0<z<100;res=FindMinimum[{Total[{g1,g2,g3, g4}^2],subdomain},{w,x,y,z},AccuracyGoal->5]{g1,g2,g3,g4}/.res[[2]]
I don't have access to Mathematica, I put your equations into AMPL which is free for students. Here is what I did:
var w := 14 >= 0;
var x := 15 >= 0;
var y := 10 >= 0;
var z := 100 >= 0;
eq1: 7*w = (7*w+5*x+2*y+z) * ( 0.76 + 0.12*log(w) -0.08*log(x) -0.03*log(y) -0.07*log(7*w+5*x + 2*y + z));
eq2: 5*x = (7*w+5*x+2*y+z) * ( 0.84 - 0.08*log(w) +0.11*log(x) -0.02*log(y) -0.08*log(7*w+5*x + 2*y + z));
eq3: 2*y = (7*w+5*x+2*y+z) * (-0.45 - 0.03*log(w) -0.02*log(x) +0.05*log(y) +0.12*log(7*w+5*x + 2*y + z));
eq4: 1*z = (7*w+5*x+2*y+z) * (-0.16 + 0*log(w) - 0*log(x) - 0*log(y) +0.03*log(7*w+5*x + 2*y + z));
option show_stats 1;
option presolve 10;
option solver "/home/ali/ampl/ipopt"; # put your path here
option seed 1731;
# Initial solve
solve;
display w, x, y, z;
# Multistart
for {1..10} {
for {j in 1.._snvars}
let _svar[j] := Uniform(1, 50);
solve;
if (solve_result_num < 200) then {
display w, x, y, z;
}
}
If I only require that the variables are nonnegative, I get rubbish, for example
w = 2.39266e-11
x = 6.62678e-11
y = 1.57043e-24
z = 7.0842e-10
or
w = 1.09972e-12
x = 9.77807e-11
y = 3.36229e-21
z = 1.85441e-09
Numerically, these are indeed solutions, they satisfy the equations to a fairly high precision, although I am pretty sure it's not what you are looking for. This indicates issues with your model.
If I increase the lower bounds of the variables a bit:
var w := 14 >= 0.1;
var x := 15 >= 0.1;
var y := 10 >= 0.1;
var z := 100 >= 0.01;
I get, even with multistart, Ipopt 3.11.6: Converged to a locally infeasible point. Problem may be infeasible. This again indicates issues with your model equations.
I am afraid you will have to revise your model.
This won't fix the issues with your model equations but I would introduce new variables: a=log(w), b=log(x), c=log(y), d=log(z). Then w=exp(a) and so on. It has the advantage that the function evaluations won't fail due to negative arguments for the logarithm function.
I would probably also introduce a new variable for (7*w+5*x+2*y+z) just to make the equations more compact.
Neither of these new variables will solve the above issues with your model equations.
If it is really your first time using Mathematica, you might be better off with AMPL and IPOPT; these tools are custom-tailored for solving equations and optimization problems. I suggest you use the AMPL mailing list if you have question and not Stackoverflow; you will get better answers on the mailing list.
This method will often rapidly find approximate solutions, NMinimize the sum of squares with constraints.
In[2]:= NMinimize[{
(7*w - (7*w + 5*x + 2*y + z)*(0.76 + 0.12*Log[w] - 0.08*Log[x] -
0.03*Log[y] - 0.07*Log[7*w + 5*x + 2*y + z]))^2 +
(5*x - (7*w + 5*x + 2*y + z)*(0.84 - 0.08*Log[w] + 0.11*Log[x] -
0.02*Log[y] - 0.08*Log[7*w + 5*x + 2*y + z]))^2 +
(2*y - (7*w + 5*x + 2*y + z)*(-0.45 - 0.03*Log[w] - 0.02*Log[x] +
0.05*Log[y] + 0.12*Log[7*w + 5*x + 2*y + z]))^2 +
(1*z - (7*w + 5*x + 2*y + z)*(-0.16 + 0*Log[w] +
0.03*Log[7*w + 5*x + 2*y + z]))^2,
w > 0 && x > 0 && y > 0 && z > 0}, {w, x, y, z},
Method -> "RandomSearch"]
Out[2]= {9.34024*10^-12, {w->1.86998*10^-8, x->3.83383*10^-8, y->4.59973*10^-8, z->5.29581*10^-7}}

OpenCL (JOCL) - 2D calculus over two arrays in Kernel

I'm asking this here because I thought I've understood how OpenCL works but... I think there are several things I don't get.
What I want to do is to get the difference between all the values of two arrays, then calculate the hypot and finally get the maximum hypot value, so If I have:
double[] arrA = new double[]{1,2,3}
double[] arrB = new double[]{6,7,8}
Calculate
dx1 = 1 - 1; dx2 = 2 - 1; dx3 = 3 - 1, dx4= 1 - 2;... dxLast = 3 - 3
dy1 = 6 - 6; dy2 = 7 - 6; dy3 = 8 - 6, dy4= 6 - 7;... dyLast = 8 - 8
(Extreme dx and dy will get 0, but i don't care about ignoring those cases by now)
Then calculate each hypot based on hypot(dx(i), dy(i))
And once all these values where obtained, get the maximum hypot value
So, I have the next defined Kernel:
String programSource =
"#ifdef cl_khr_fp64 \n"
+ " #pragma OPENCL EXTENSION cl_khr_fp64 : enable \n"
+ "#elif defined(cl_amd_fp64) \n"
+ " #pragma OPENCL EXTENSION cl_amd_fp64 : enable \n"
+ "#else "
+ " #error Double precision floating point not supported by OpenCL implementation.\n"
+ "#endif \n"
+ "__kernel void "
+ "sampleKernel(__global const double *bufferX,"
+ " __global const double *bufferY,"
+ " __local double* scratch,"
+ " __global double* result,"
+ " __const int lengthX,"
+ " __const int lengthY){"
+ " const int index_a = get_global_id(0);"//Get the global indexes for 2D reference
+ " const int index_b = get_global_id(1);"
+ " const int local_index = get_local_id(0);"//Current thread id -> Should be the same as index_a * index_b + index_b;
+ " if (local_index < (lengthX * lengthY)) {"// Load data into local memory
+ " if(index_a < lengthX && index_b < lengthY)"
+ " {"
+ " double dx = (bufferX[index_b] - bufferX[index_a]);"
+ " double dy = (bufferY[index_b] - bufferY[index_a]);"
+ " scratch[local_index] = hypot(dx, dy);"
+ " }"
+ " } "
+ " else {"
+ " scratch[local_index] = 0;"// Infinity is the identity element for the min operation
+ " }"
//Make a Barrier to make sure all values were set into the local array
+ " barrier(CLK_LOCAL_MEM_FENCE);"
//If someone can explain to me the offset thing I'll really apreciate that...
//I just know there is alway a division by 2
+ " for(int offset = get_local_size(0) / 2; offset > 0; offset >>= 1) {"
+ " if (local_index < offset) {"
+ " float other = scratch[local_index + offset];"
+ " float mine = scratch[local_index];"
+ " scratch[local_index] = (mine > other) ? mine : other;"
+ " }"
+ " barrier(CLK_LOCAL_MEM_FENCE);"
//A barrier to make sure that all values where checked
+ " }"
+ " if (local_index == 0) {"
+ " result[get_group_id(0)] = scratch[0];"
+ " }"
+ "}";
For this case, the defined GWG size is (100, 100, 0) and a LWI size of (10, 10, 0).
So, for this example, both arrays have size 10 and the GWG and LWI are obtained as follows:
//clGetKernelWorkGroupInfo(kernel, device, CL.CL_KERNEL_WORK_GROUP_SIZE, Sizeof.size_t, Pointer.to(buffer), null);
long kernel_work_group_size = OpenClUtil.getKernelWorkGroupSize(kernel, device.getCl_device_id(), 3);
//clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES, Sizeof.size_t * numValues, Pointer.to(buffer), null);
long[] maxSize = device.getMaximumSizes();
maxSize[0] = ( kernel_work_group_size > maxSize[0] ? maxSize[0] : kernel_work_group_size);
maxSize[1] = ( kernel_work_group_size > maxSize[1] ? maxSize[1] : kernel_work_group_size);
maxSize[2] = ( kernel_work_group_size > maxSize[2] ? maxSize[2] : kernel_work_group_size);
// maxSize[2] =
long xMaxSize = (x > maxSize[0] ? maxSize[0] : x);
long yMaxSize = (y > maxSize[1] ? maxSize[1] : y);
long zMaxSize = (z > maxSize[2] ? maxSize[2] : z);
long local_work_size[] = new long[] { xMaxSize, yMaxSize, zMaxSize };
int numWorkGroupsX = 0;
int numWorkGroupsY = 0;
int numWorkGroupsZ = 0;
if(local_work_size[0] != 0)
numWorkGroupsX = (int) ((total + local_work_size[0] - 1) / local_work_size[0]);
if(local_work_size[1] != 0)
numWorkGroupsY = (int) ((total + local_work_size[1] - 1) / local_work_size[1]);
if(local_work_size[2] != 0)
numWorkGroupsZ = (int) ((total + local_work_size[2] - 1) / local_work_size[2]);
long global_work_size[] = new long[] { numWorkGroupsX * local_work_size[0],
numWorkGroupsY * local_work_size[1], numWorkGroupsZ * local_work_size[2]};
The thing is I'm not getting the espected values so I decided to make some tests based on a smaller kernel and changing the [VARIABLE TO TEST VALUES] object returned in a result array:
/**
* The source code of the OpenCL program to execute
*/
private static String programSourceA =
"#ifdef cl_khr_fp64 \n"
+ " #pragma OPENCL EXTENSION cl_khr_fp64 : enable \n"
+ "#elif defined(cl_amd_fp64) \n"
+ " #pragma OPENCL EXTENSION cl_amd_fp64 : enable \n"
+ "#else "
+ " #error Double precision floating point not supported by OpenCL implementation.\n"
+ "#endif \n"
+ "__kernel void "
+ "sampleKernel(__global const double *bufferX,"
+ " __global const double *bufferY,"
+ " __local double* scratch,"
+ " __global double* result,"
+ " __const int lengthX,"
+ " __const int lengthY){"
//Get the global indexes for 2D reference
+ " const int index_a = get_global_id(0);"
+ " const int index_b = get_global_id(1);"
//Current thread id -> Should be the same as index_a * index_b + index_b;
+ " const int local_index = get_local_id(0);"
// Load data into local memory
//Only print values if index_a < ArrayA length
//Only print values if index_b < ArrayB length
//Only print values if local_index < (lengthX * lengthY)
//Only print values if this is the first work group.
+ " if (local_index < (lengthX * lengthY)) {"
+ " if(index_a < lengthX && index_b < lengthY)"
+ " {"
+ " double dx = (bufferX[index_b] - bufferX[index_a]);"
+ " double dy = (bufferY[index_b] - bufferY[index_a]);"
+ " result[local_index] = hypot(dx, dy);"
+ " }"
+ " } "
+ " else {"
// Infinity is the identity element for the min operation
+ " result[local_index] = 0;"
+ " }"
The returned values are far of being the espected but, if the [VARIABLE TO TEST VALUES] is (index_a * index_b) + index_a, almost each value of the returned array has the correct (index_a * index_b) + index_a value, i mean:
result[0] -> 0
result[1] -> 1
result[2] -> 2
....
result[97] -> 97
result[98] -> 98
result[99] -> 99
but some values are: -3.350700319577517E-308....
What I'm not doing correctly???
I hope this is well explained and not that big to make you angry with me....
Thank you so much!!!!!
TomRacer
You have many problems in your code, and some of them are concept related. I think you should read the standard or OpenCL guide completely before starting to code. Because some of the system calls you are using have a different behaviour that what you expect.
Work-groups and work-items are NOT like CUDA. If you want 100x100 work-items, separated into 10x10 work-groups you use as global-size (100x100) and local-size(10x10). Unlike CUDA, where the global work item is multiplied by the local size internally.
1.1. In your test code, if you are using 10x10 with 10x10. Then you are not filling the whole space, the non filled area will still have garbage like -X.xxxxxE-308.
You should not use lengthX and lengthY and put a lot of ifs in your code. OpenCL has a method to call the kernels with offsets and with specific number of items, so you can control this from the host side. BTW doing this is a performance loss and is never a good practice since the code is less readable.
get_local_size(0) gives you the local size of axis 0 (10 in your case). What is what you do not understand in this call? Why do you divide it by 2 always?
I hope this can help you in your debugging process.
Cheers
thank you for your answer, first of all this kernel code is based on the commutative reduction code explained here: http://developer.amd.com/resources/documentation-articles/articles-whitepapers/opencl-optimization-case-study-simple-reductions/.
So I'm using that code but I added some things like the 2D operations.
Regarding to the point you mentioned before:
1.1- Actually the global work group size is (100, 100, 0)... That 100 is a result of multiplying 10 x 10 where 10 is the current array size, so my global work group size is based on this rule... then the local work item size is (10, 10, 0).
Global work group size must be a multiple of local work item size, I have read this in many examples and I think this is ok.
1.2- In my test code I'm using the same arrays, in fact if I change the array size GWG size and LWI size will change dinamically.
2.1- There are not so many "if" there, there are just 3 "if", the first one checks when I must compute the hypot() based on the array objects or fill that object with zero.
The second and third "if"s are just part of the reduction algorithm that seems to be fine.
2.2- Regarding to the lengthX and lengthY yeah you are right but I haven't got that yet, how should I use that??
3.1- Yeah, I know that, but I realized that I'm not using the Y axis id so maybe there is another problem here.
3.2- The reduction algorithm iterates over each pair of elements stored in the scratch variable and checking for the maximum value between them, so for each "for" that it does it is reducing the elements to be computed to the half of the previous quantity.
Also I'm going to post a some changes on the main kernel code and in the test kernel code because there where some errors.
Greetings...!!!

vectorize/optimize this code in MATLAB?

I am building my first large-scale MATLAB program, and I've managed to write original vectorized code for everything so for until I came to trying to create an image representing vector density in stereographic projection. After a couple failed attempts I went to the Mathworks file exchange site and found an open source program which fits my needs courtesy of Malcolm Mclean. With a test matrix his function produces something like this:
And while this is almost exactly what I wanted, his code relies on a triply nested for-loop. On my workstation a test data matrix of size 25000x2 took 65 seconds in this section of code. This is unacceptable since I will be scaling up to a data matrices of size 500000x2 in my project.
So far I've been able to vectorize the innermost loop (which was the longest/worst loop), but I would like to continue and be rid of the loops entirely if possible. Here is Malcolm's original code that I need to vectorize:
dmap = zeros(height, width); % height, width: scalar with default value = 32
for ii = 0: height - 1 % 32 iterations of this loop
yi = limits(3) + ii * deltay + deltay/2; % limits(3) & deltay: scalars
for jj = 0 : width - 1 % 32 iterations of this loop
xi = limits(1) + jj * deltax + deltax/2; % limits(1) & deltax: scalars
dd = 0;
for kk = 1: length(x) % up to 500,000 iterations in this loop
dist2 = (x(kk) - xi)^2 + (y(kk) - yi)^2;
dd = dd + 1 / ( dist2 + fudge); % fudge is a scalar
end
dmap(ii+1,jj+1) = dd;
end
end
And here it is with the changes I've already made to the innermost loop (which was the biggest drain on efficiency). This cuts the time from 65 seconds down to 12 seconds on my machine for the same test matrix, which is better but still far slower than I would like.
dmap = zeros(height, width);
for ii = 0: height - 1
yi = limits(3) + ii * deltay + deltay/2;
for jj = 0 : width - 1
xi = limits(1) + jj * deltax + deltax/2;
dist2 = (x - xi) .^ 2 + (y - yi) .^ 2;
dmap(ii + 1, jj + 1) = sum(1 ./ (dist2 + fudge));
end
end
So my main question, are there any further changes I can make to optimize this code? Or even an alternative method to approach the problem? I've considered using C++ or F# instead of MATLAB for this section of the program, and I may do so if I cannot get to a reasonable efficiency level with the MATLAB code.
Please also note that at this point I don't have ANY additional toolboxes, if I did then I know this would be trivial (using hist3 from the statistics toolbox for example).
Mem consuming solution
yi = limits(3) + deltay * ( 1:height ) - .5 * deltay;
xi = limits(1) + deltax * ( 1:width ) - .5 * deltax;
dx = bsxfun( #minus, x(:), xi ) .^ 2;
dy = bsxfun( #minus, y(:), yi ) .^ 2;
dist2 = bsxfun( #plus, permute( dy, [2 3 1] ), permute( dx, [3 2 1] ) );
dmap = sum( 1./(dist2 + fudge ) , 3 );
EDIT
handling extremely large x and y by breaking the operation into blocks:
blockSize = 50000; % process up to XX elements at once
dmap = 0;
yi = limits(3) + deltay * ( 1:height ) - .5 * deltay;
xi = limits(1) + deltax * ( 1:width ) - .5 * deltax;
bi = 1;
while bi <= numel(x)
% take a block of x and y
bx = x( bi:min(end, bi + blockSize - 1) );
by = y( bi:min(end, bi + blockSize - 1) );
dx = bsxfun( #minus, bx(:), xi ) .^ 2;
dy = bsxfun( #minus, by(:), yi ) .^ 2;
dist2 = bsxfun( #plus, permute( dy, [2 3 1] ), permute( dx, [3 2 1] ) );
dmap = dmap + sum( 1./(dist2 + fudge ) , 3 );
bi = bi + blockSize;
end
This is a good example of why starting a loop from 1 matters. The only reason that ii and jj are initiated at 0 is to kill the ii * deltay and jj * deltax terms which however introduces sequentiality in the dmap indexing, preventing parallelization.
Now, by rewriting the loops you could use parfor() after opening a matlabpool:
dmap = zeros(height, width);
yi = limits(3) + deltay*(1:height) - .5*deltay;
matlabpool 8
parfor ii = 1: height
for jj = 1: width
xi = limits(1) + (jj-1) * deltax + deltax/2;
dist2 = (x - xi) .^ 2 + (y - yi(ii)) .^ 2;
dmap(ii, jj) = sum(1 ./ (dist2 + fudge));
end
end
matlabpool close
Keep in mind that opening and closing the pool has significant overhead (10 seconds on my Intel Core Duo T9300, vista 32 Matlab 2013a).
PS. I am not sure whether the inner loop instead of the outer one can be meaningfully parallelized. You can try to switch the parfor to the inner one and compare speeds (I would recommend going for the big matrix immediately since you are already running in 12 seconds and the overhead is almost as big).
Alternatively, this problem can be solved in using kernel density estimation techniques. This is part of the Statistics Toolbox, or there's this KDE implementation by Zdravko Botev (no toolboxes required).
For the example code below, I get 0.3 seconds for N = 500000, or 0.7 seconds for N = 1000000.
N = 500000;
data = [randn(N,2); rand(N,1)+3.5, randn(N,1);]; % 2 overlaid distrib
tic; [bandwidth,density,X,Y] = kde2d(data); toc;
imagesc(density);

Resources