Optimizing a loop: huge array operations - performance

I am doing huge calculations (derivatives here, but they look similar to image operations) on arrays that do not fit in cache, meaning the CPU has to load a part into the cache, calculate, then load another part, and so on. But because of the shape of the calculations, some data is loaded, evicted and reloaded multiple times. I was wondering if there is a way to optimize this. I am already using SIMD instructions via compiler optimization (GCC and Intel).
The calculations are in Fortran, but it's similar to C/C++: the memory order is just inverted, arrays use () instead of [], and for is replaced by do.
On the x axis:
do k=1,N(3)
  do j=1,N(2)
    do i=3,N(1)
      DF(i,j,k)=(F(i+1,j,k)-F(i-1,j,k))*B+(F(i-2,j,k)-F(i+2,j,k))*C
    end do
  end do
end do
On the y axis:
do k=1,N(3)
  do j=1,N(2)
    do i=3,N(1)
      DF(i,j,k)=(F(i,j+1,k)-F(i,j-1,k))*B+(F(i,j-2,k)-F(i,j+2,k))*C
    end do
  end do
end do
On the z axis:
do k=1,N(3)
  do j=1,N(2)
    do i=3,N(1)
      DF(i,j,k)=(F(i,j,k+1)-F(i,j,k-1))*B+(F(i,j,k-2)-F(i,j,k+2))*C
    end do
  end do
end do
The first derivative along the x axis is fine because memory is read contiguously: in Fortran's column-major layout, F(i,j,k) and F(i+1,j,k) are adjacent in memory, while F(i,j±1,k) are N(1) elements apart and F(i,j,k±1) are N(1)*N(2) elements apart. So the derivatives along the y and z axes do not access memory contiguously.
And the worst calculation I have combines all axes (this is a Laplacian operator):
do k=1,N(3)
  do j=1,N(2)
    do i=3,N(1)
      V(i,j,k) = M(i,j,k,1) * p(i,j,k)   &
             & + M(i,j,k,2) * p(i-1,j,k) &
             & + M(i,j,k,3) * p(i+1,j,k) &
             & + M(i,j,k,4) * p(i,j-1,k) &
             & + M(i,j,k,5) * p(i,j+1,k) &
             & + M(i,j,k,6) * p(i,j,k-1) &
             & + M(i,j,k,7) * p(i,j,k+1)
    end do
  end do
end do
Note that compilers fail to vectorize the last operation (the Laplacian). To get SIMD (vectorized) code, I need to split the operation like this, which gives a 2.5x speedup:
do k=1,N(3)
  do j=1,N(2)
    do i=3,N(1)
      V(i,j,k) = M(i,j,k,1) * p(i,j,k)   &
             & + M(i,j,k,2) * p(i-1,j,k) &
             & + M(i,j,k,3) * p(i+1,j,k)
    end do
  end do
end do
do k=1,N(3)
  do j=1,N(2)
    do i=3,N(1)
      V(i,j,k) = V(i,j,k)                &
             & + M(i,j,k,4) * p(i,j-1,k) &
             & + M(i,j,k,5) * p(i,j+1,k)
    end do
  end do
end do
do k=1,N(3)
  do j=1,N(2)
    do i=3,N(1)
      V(i,j,k) = V(i,j,k)                &
             & + M(i,j,k,6) * p(i,j,k-1) &
             & + M(i,j,k,7) * p(i,j,k+1)
    end do
  end do
end do
Maybe with SIMD I have already reached maximum speed, but because these calculations take days, even with MPI and more than 1024 CPUs, reducing the calculation time by even 20% would be a great step!
Does anyone have ideas on how to optimize this?

When you use 3D stencils and reference elements like (i,j,k-1) and (i,j,k+1), the linear order in which you traverse the array is not optimal. Cache efficiency can be increased by loop tiling.
In my code I use
!$omp parallel private(i,j,k,bi,bj,bk)
!$omp do schedule(runtime) collapse(3)
do bk = 1, Unz, tnz
  do bj = 1, Uny, tny
    do bi = 1, Unx, tnx
      do k = bk, min(bk+tnz-1,Unz)
        do j = bj, min(bj+tny-1,Uny)
          do i = bi, min(bi+tnx-1,Unx)
            U2(i,j,k) = U2(i,j,k) + &
                        (U(i+1,j,k)-U(i,j,k)) * ...
            U2(i,j,k) = U2(i,j,k) - &
                        (U(i,j,k)-U(i-1,j,k)) * ...
            U2(i,j,k) = U2(i,j,k) + &
                        (U(i,j+1,k)-U(i,j,k)) * ...
            U2(i,j,k) = U2(i,j,k) - &
                        (U(i,j,k)-U(i,j-1,k)) * ...
            U2(i,j,k) = U2(i,j,k) + &
                        (U(i,j,k+1)-U(i,j,k)) * ...
            U2(i,j,k) = U2(i,j,k) - &
                        (U(i,j,k)-U(i,j,k-1)) * ...
          end do
        end do
      end do
    end do
  end do
end do
!$omp end do
!$omp end parallel
where tnx, tny, tnz are the sizes of a tile that you iterate through in the i,j,k order. The tile size should be chosen so that a tile fits in the L1 cache. This increases reuse of the data loaded into the cache.
If you need to separate the directions, you can of course do that and still keep the tiling.
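For concreteness, here is a minimal sketch (my illustration, not code from the answer) of the y-derivative from the question with the same tiling applied; the tile sizes are placeholder assumptions to be tuned, and the bounds mirror the question's code and assume the arrays carry the halo cells the stencil needs:
! Hypothetical tile sizes: 16*8*8 double-precision values = 8 KB,
! comfortably inside a typical 32 KB L1 data cache.
integer, parameter :: tnx = 16, tny = 8, tnz = 8
do bk = 1, N(3), tnz
  do bj = 1, N(2), tny
    do bi = 3, N(1), tnx
      do k = bk, min(bk+tnz-1, N(3))
        do j = bj, min(bj+tny-1, N(2))
          do i = bi, min(bi+tnx-1, N(1))
            DF(i,j,k) = (F(i,j+1,k)-F(i,j-1,k))*B + (F(i,j-2,k)-F(i,j+2,k))*C
          end do
        end do
      end do
    end do
  end do
end do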

Related

Loop optimisation

I am trying to understand what cache or other optimizations could be done in the source code to make this loop faster. I think it is quite cache friendly, but are there any experts out there who could squeeze a bit more performance out of tuning this code?
DO K = 1, NZ
  DO J = 1, NY
    DO I = 1, NX
      SIDEBACK  = STEN(I-1,J-1,K-1) + STEN(I-1,J,K-1) + STEN(I-1,J+1,K-1) + &
                  STEN(I  ,J-1,K-1) + STEN(I  ,J,K-1) + STEN(I  ,J+1,K-1) + &
                  STEN(I+1,J-1,K-1) + STEN(I+1,J,K-1) + STEN(I+1,J+1,K-1)
      SIDEOWN   = STEN(I-1,J-1,K) + STEN(I-1,J,K) + STEN(I-1,J+1,K) + &
                  STEN(I  ,J-1,K) + STEN(I  ,J,K) + STEN(I  ,J+1,K) + &
                  STEN(I+1,J-1,K) + STEN(I+1,J,K) + STEN(I+1,J+1,K)
      SIDEFRONT = STEN(I-1,J-1,K+1) + STEN(I-1,J,K+1) + STEN(I-1,J+1,K+1) + &
                  STEN(I  ,J-1,K+1) + STEN(I  ,J,K+1) + STEN(I  ,J+1,K+1) + &
                  STEN(I+1,J-1,K+1) + STEN(I+1,J,K+1) + STEN(I+1,J+1,K+1)
      RES(I,J,K) = ( SIDEBACK + SIDEOWN + SIDEFRONT ) / 27.0
    END DO
  END DO
END DO
Ok, I think I've tried everything I reasonably could, and my conclusion unfortunately is that there is not much room for optimization, unless you are willing to go into parallelization. Let's see why; let's see what you can and can't do.
Compiler optimizations
Compilers nowadays are extremely good at optimizing code, much better than humans are. Relying on the optimizations done by the compiler also has the added benefit that they don't ruin the readability of your source code. Whatever you do (when optimizing for speed), always try it with every reasonable combination of compiler flags. You can even go as far as trying multiple compilers. Personally I only used gfortran (included in GCC) (OS is 64-bit Windows), which I trust to have efficient and correct optimization techniques.
-O2 almost always improves the speed drastically, and even -O3 is a safe bet (among others, it includes delicious loop unrolling). For this problem, I also tried -ffast-math and -fexpensive-optimizations; they didn't have any measurable effect, but -march=corei7 (CPU architecture-specific tuning, specific to Core i7) did, so I did the measurements with -O3 -march=corei7.
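For reference, a typical compile line with those flags (the file name is a placeholder) might look like:
gfortran -O3 -march=corei7 benchmark.f90 -o benchmark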
So how fast is it actually?
I wrote the following code to test your solution and compiled it with -O3 -march=corei7. Usually it ran in 0.78-0.82 seconds.
program benchmark
  implicit none
  real :: start, finish
  integer :: I, J, K
  real :: SIDEBACK, SIDEOWN, SIDEFRONT
  integer, parameter :: NX = 600
  integer, parameter :: NY = 600
  integer, parameter :: NZ = 600
  real, dimension (0 : NX + 2, 0 : NY + 2, 0 : NZ + 2) :: STEN
  real, dimension (0 : NX + 2, 0 : NY + 2, 0 : NZ + 2) :: RES
  call random_number(STEN)
  call cpu_time(start)
  DO K = 1, NZ
    DO J = 1, NY
      DO I = 1, NX
        SIDEBACK  = STEN(I-1,J-1,K-1) + STEN(I-1,J,K-1) + STEN(I-1,J+1,K-1) + &
                    STEN(I  ,J-1,K-1) + STEN(I  ,J,K-1) + STEN(I  ,J+1,K-1) + &
                    STEN(I+1,J-1,K-1) + STEN(I+1,J,K-1) + STEN(I+1,J+1,K-1)
        SIDEOWN   = STEN(I-1,J-1,K) + STEN(I-1,J,K) + STEN(I-1,J+1,K) + &
                    STEN(I  ,J-1,K) + STEN(I  ,J,K) + STEN(I  ,J+1,K) + &
                    STEN(I+1,J-1,K) + STEN(I+1,J,K) + STEN(I+1,J+1,K)
        SIDEFRONT = STEN(I-1,J-1,K+1) + STEN(I-1,J,K+1) + STEN(I-1,J+1,K+1) + &
                    STEN(I  ,J-1,K+1) + STEN(I  ,J,K+1) + STEN(I  ,J+1,K+1) + &
                    STEN(I+1,J-1,K+1) + STEN(I+1,J,K+1) + STEN(I+1,J+1,K+1)
        RES(I,J,K) = ( SIDEBACK + SIDEOWN + SIDEFRONT ) / 27.0
      END DO
    END DO
  END DO
  call cpu_time(finish)
  !Use the calculated value, so the compiler doesn't optimize away everything.
  !Print the original value as well, because one can never be too paranoid.
  print *, STEN(1,1,1), RES(1,1,1)
  print '(f6.3," seconds.")', finish-start
end program
Ok, so this is as far as the compiler can take us. What's next?
Store intermediate results?
As you might suspect from the question mark, this one didn't really work. Sorry. But let's not rush that forward.
As mentioned in the comments, your current code calculates every partial sum multiple times, meaning one iteration's STEN(I+1,J-1,K-1) + STEN(I+1,J,K-1) + STEN(I+1,J+1,K-1) will be the next iteration's STEN(I,J-1,K-1) + STEN(I,J,K-1) + STEN(I,J+1,K-1), so there is no need to fetch and calculate it again; you can store those partial results.
The problem is that we cannot store too many partial results. As you said, your code is already quite cache-friendly: every partial sum you store means one less array element you can keep in the L1 cache. We could store a few values from the last few iterations of I (values for index I-2, I-3, etc.), but the compiler almost certainly does that already. I have two pieces of evidence for this suspicion. First, my manual loop unrolling made the program slower, by about 5%:
DO K = 1, NZ
  DO J = 1, NY
    DO I = 1, NX, 8
      SIDEBACK(0) = STEN(I-1,J-1,K-1) + STEN(I-1,J,K-1) + STEN(I-1,J+1,K-1)
      SIDEBACK(1) = STEN(I  ,J-1,K-1) + STEN(I  ,J,K-1) + STEN(I  ,J+1,K-1)
      SIDEBACK(2) = STEN(I+1,J-1,K-1) + STEN(I+1,J,K-1) + STEN(I+1,J+1,K-1)
      SIDEBACK(3) = STEN(I+2,J-1,K-1) + STEN(I+2,J,K-1) + STEN(I+2,J+1,K-1)
      SIDEBACK(4) = STEN(I+3,J-1,K-1) + STEN(I+3,J,K-1) + STEN(I+3,J+1,K-1)
      SIDEBACK(5) = STEN(I+4,J-1,K-1) + STEN(I+4,J,K-1) + STEN(I+4,J+1,K-1)
      SIDEBACK(6) = STEN(I+5,J-1,K-1) + STEN(I+5,J,K-1) + STEN(I+5,J+1,K-1)
      SIDEBACK(7) = STEN(I+6,J-1,K-1) + STEN(I+6,J,K-1) + STEN(I+6,J+1,K-1)
      SIDEBACK(8) = STEN(I+7,J-1,K-1) + STEN(I+7,J,K-1) + STEN(I+7,J+1,K-1)
      SIDEBACK(9) = STEN(I+8,J-1,K-1) + STEN(I+8,J,K-1) + STEN(I+8,J+1,K-1)
      SIDEOWN(0) = STEN(I-1,J-1,K) + STEN(I-1,J,K) + STEN(I-1,J+1,K)
      SIDEOWN(1) = STEN(I  ,J-1,K) + STEN(I  ,J,K) + STEN(I  ,J+1,K)
      SIDEOWN(2) = STEN(I+1,J-1,K) + STEN(I+1,J,K) + STEN(I+1,J+1,K)
      SIDEOWN(3) = STEN(I+2,J-1,K) + STEN(I+2,J,K) + STEN(I+2,J+1,K)
      SIDEOWN(4) = STEN(I+3,J-1,K) + STEN(I+3,J,K) + STEN(I+3,J+1,K)
      SIDEOWN(5) = STEN(I+4,J-1,K) + STEN(I+4,J,K) + STEN(I+4,J+1,K)
      SIDEOWN(6) = STEN(I+5,J-1,K) + STEN(I+5,J,K) + STEN(I+5,J+1,K)
      SIDEOWN(7) = STEN(I+6,J-1,K) + STEN(I+6,J,K) + STEN(I+6,J+1,K)
      SIDEOWN(8) = STEN(I+7,J-1,K) + STEN(I+7,J,K) + STEN(I+7,J+1,K)
      SIDEOWN(9) = STEN(I+8,J-1,K) + STEN(I+8,J,K) + STEN(I+8,J+1,K)
      SIDEFRONT(0) = STEN(I-1,J-1,K+1) + STEN(I-1,J,K+1) + STEN(I-1,J+1,K+1)
      SIDEFRONT(1) = STEN(I  ,J-1,K+1) + STEN(I  ,J,K+1) + STEN(I  ,J+1,K+1)
      SIDEFRONT(2) = STEN(I+1,J-1,K+1) + STEN(I+1,J,K+1) + STEN(I+1,J+1,K+1)
      SIDEFRONT(3) = STEN(I+2,J-1,K+1) + STEN(I+2,J,K+1) + STEN(I+2,J+1,K+1)
      SIDEFRONT(4) = STEN(I+3,J-1,K+1) + STEN(I+3,J,K+1) + STEN(I+3,J+1,K+1)
      SIDEFRONT(5) = STEN(I+4,J-1,K+1) + STEN(I+4,J,K+1) + STEN(I+4,J+1,K+1)
      SIDEFRONT(6) = STEN(I+5,J-1,K+1) + STEN(I+5,J,K+1) + STEN(I+5,J+1,K+1)
      SIDEFRONT(7) = STEN(I+6,J-1,K+1) + STEN(I+6,J,K+1) + STEN(I+6,J+1,K+1)
      SIDEFRONT(8) = STEN(I+7,J-1,K+1) + STEN(I+7,J,K+1) + STEN(I+7,J+1,K+1)
      SIDEFRONT(9) = STEN(I+8,J-1,K+1) + STEN(I+8,J,K+1) + STEN(I+8,J+1,K+1)
      RES(I  ,J,K) = ( SIDEBACK(0) + SIDEOWN(0) + SIDEFRONT(0) + &
                       SIDEBACK(1) + SIDEOWN(1) + SIDEFRONT(1) + &
                       SIDEBACK(2) + SIDEOWN(2) + SIDEFRONT(2) ) / 27.0
      RES(I+1,J,K) = ( SIDEBACK(1) + SIDEOWN(1) + SIDEFRONT(1) + &
                       SIDEBACK(2) + SIDEOWN(2) + SIDEFRONT(2) + &
                       SIDEBACK(3) + SIDEOWN(3) + SIDEFRONT(3) ) / 27.0
      RES(I+2,J,K) = ( SIDEBACK(2) + SIDEOWN(2) + SIDEFRONT(2) + &
                       SIDEBACK(3) + SIDEOWN(3) + SIDEFRONT(3) + &
                       SIDEBACK(4) + SIDEOWN(4) + SIDEFRONT(4) ) / 27.0
      RES(I+3,J,K) = ( SIDEBACK(3) + SIDEOWN(3) + SIDEFRONT(3) + &
                       SIDEBACK(4) + SIDEOWN(4) + SIDEFRONT(4) + &
                       SIDEBACK(5) + SIDEOWN(5) + SIDEFRONT(5) ) / 27.0
      RES(I+4,J,K) = ( SIDEBACK(4) + SIDEOWN(4) + SIDEFRONT(4) + &
                       SIDEBACK(5) + SIDEOWN(5) + SIDEFRONT(5) + &
                       SIDEBACK(6) + SIDEOWN(6) + SIDEFRONT(6) ) / 27.0
      RES(I+5,J,K) = ( SIDEBACK(5) + SIDEOWN(5) + SIDEFRONT(5) + &
                       SIDEBACK(6) + SIDEOWN(6) + SIDEFRONT(6) + &
                       SIDEBACK(7) + SIDEOWN(7) + SIDEFRONT(7) ) / 27.0
      RES(I+6,J,K) = ( SIDEBACK(6) + SIDEOWN(6) + SIDEFRONT(6) + &
                       SIDEBACK(7) + SIDEOWN(7) + SIDEFRONT(7) + &
                       SIDEBACK(8) + SIDEOWN(8) + SIDEFRONT(8) ) / 27.0
      RES(I+7,J,K) = ( SIDEBACK(7) + SIDEOWN(7) + SIDEFRONT(7) + &
                       SIDEBACK(8) + SIDEOWN(8) + SIDEFRONT(8) + &
                       SIDEBACK(9) + SIDEOWN(9) + SIDEFRONT(9) ) / 27.0
    END DO
  END DO
END DO
And what's worse, it's easy to show that we are already pretty close to the theoretical minimum possible execution time. To calculate all these averages, the absolute minimum we need to do is access every element at least once and divide it by 27.0. So you can never get faster than the following code, which executes in 0.48-0.5 seconds on my machine.
program benchmark
  implicit none
  real :: start, finish
  integer :: I, J, K
  integer, parameter :: NX = 600
  integer, parameter :: NY = 600
  integer, parameter :: NZ = 600
  real, dimension (0 : NX + 2, 0 : NY + 2, 0 : NZ + 2) :: STEN
  real, dimension (0 : NX + 2, 0 : NY + 2, 0 : NZ + 2) :: RES
  call random_number(STEN)
  call cpu_time(start)
  DO K = 1, NZ
    DO J = 1, NY
      DO I = 1, NX
        !This of course does not do what you want to do,
        !this is just an example of a speed limit we can never surpass.
        RES(I, J, K) = STEN(I, J, K) / 27.0
      END DO
    END DO
  END DO
  call cpu_time(finish)
  !Use the calculated value, so the compiler doesn't optimize away everything.
  print *, STEN(1,1,1), RES(1,1,1)
  print '(f6.3," seconds.")', finish-start
end program
But hey, even a negative result is a result. If just accessing every element once (and dividing by 27.0) takes up more than half of the execution time, that means memory access is the bottleneck. Then maybe you can optimize that.
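As a rough sanity check (my arithmetic, assuming the benchmark's 4-byte default reals): each array holds 603^3 ≈ 2.2e8 elements ≈ 0.88 GB, so one full read of STEN plus one full write of RES moves about 1.75 GB. Doing that in roughly 0.49 seconds corresponds to about 3.6 GB/s of sustained memory traffic, which is exactly the signature of a memory-bound loop.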
Less data
If you don't need the full precision of 64-bit doubles, you can declare your array with type real(kind=4). But maybe your reals are already 4 bytes. In that case, I believe some Fortran implementations support non-standard 16-bit reals, or, depending on your data, you may be able to use integers (for example, floats multiplied by a number and then rounded to integer). The smaller your base type is, the more elements fit into the cache. The most ideal would be integer(kind=1), of course; it gave more than a 2x speedup on my machine compared to real(kind=4). But it depends on the precision you need.
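A minimal declaration sketch (my addition, using the portable kind names from iso_fortran_env; the bounds are the benchmark's):
use iso_fortran_env, only: real32, int8
! 4-byte reals: half the footprint of double precision
real(real32),  dimension(0:NX+2, 0:NY+2, 0:NZ+2) :: STEN_R4
! 1-byte integers: 1/8 the footprint, if scaled integer data is acceptable
integer(int8), dimension(0:NX+2, 0:NY+2, 0:NZ+2) :: STEN_I1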
Better locality
Column-major arrays are slow when you need data from neighbouring columns, and row-major ones are slow for neighbouring rows.
Fortunately there is a funky way to store data, called a Z-order curve, which does have applications similar to your use case in computer graphics.
I can't promise it will help, maybe it will be terribly counterproductive, but maybe not. Sorry, I didn't feel like implementing it myself, to be honest.
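For the curious, here is a minimal sketch (my illustration, not code from the answer) of computing a 3D Morton (Z-order) index from 0-based coordinates; actually re-indexing the arrays around it is the hard part:
! Interleave the bits of ix, iy, iz (each assumed < 2**nbits)
! into a single Z-order index.
pure function morton3(ix, iy, iz, nbits) result(m)
  integer, intent(in) :: ix, iy, iz, nbits
  integer(kind=8)     :: m
  integer             :: b
  m = 0_8
  do b = 0, nbits - 1
    m = ior(m, ishft(int(ibits(ix, b, 1), 8), 3*b    ))
    m = ior(m, ishft(int(ibits(iy, b, 1), 8), 3*b + 1))
    m = ior(m, ishft(int(ibits(iz, b, 1), 8), 3*b + 2))
  end do
end function morton3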
Parallelization
Speaking of computer graphics, this problem is trivially and extremely well parallelizable, maybe even on a GPU, but if you don't want to go that far, you can just use a normal multicore CPU. The Fortran Wiki seems like a good place to search for Fortran parallelization libraries.
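As a minimal sketch (my addition, assuming OpenMP; compile with -fopenmp), the outermost loop of the benchmark parallelizes directly because every RES(I,J,K) is independent:
!$omp parallel do private(I, J, K, SIDEBACK, SIDEOWN, SIDEFRONT)
DO K = 1, NZ
  DO J = 1, NY
    DO I = 1, NX
      SIDEBACK  = STEN(I-1,J-1,K-1) + STEN(I-1,J,K-1) + STEN(I-1,J+1,K-1) + &
                  STEN(I  ,J-1,K-1) + STEN(I  ,J,K-1) + STEN(I  ,J+1,K-1) + &
                  STEN(I+1,J-1,K-1) + STEN(I+1,J,K-1) + STEN(I+1,J+1,K-1)
      SIDEOWN   = STEN(I-1,J-1,K) + STEN(I-1,J,K) + STEN(I-1,J+1,K) + &
                  STEN(I  ,J-1,K) + STEN(I  ,J,K) + STEN(I  ,J+1,K) + &
                  STEN(I+1,J-1,K) + STEN(I+1,J,K) + STEN(I+1,J+1,K)
      SIDEFRONT = STEN(I-1,J-1,K+1) + STEN(I-1,J,K+1) + STEN(I-1,J+1,K+1) + &
                  STEN(I  ,J-1,K+1) + STEN(I  ,J,K+1) + STEN(I  ,J+1,K+1) + &
                  STEN(I+1,J-1,K+1) + STEN(I+1,J,K+1) + STEN(I+1,J+1,K+1)
      RES(I,J,K) = ( SIDEBACK + SIDEOWN + SIDEFRONT ) / 27.0
    END DO
  END DO
END DO
!$omp end parallel do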

Filling a matrix using parallel processing in Julia

I'm trying to speed up the solution time for a dynamic programming problem in Julia (v0.5.0) via parallel processing. The problem involves choosing the optimal values for every element of a 1073 x 19 matrix at every iteration, until successive matrix differences fall within a tolerance. I thought that, within each iteration, filling in the values for each element of the matrix could be parallelized. However, I'm seeing a huge performance degradation using SharedArray, and I'm wondering if there's a better way to approach parallel processing for this problem.
I construct the arguments for the function below:
est_params = [.788,.288,.0034,.1519,.1615,.0041,.0077,.2,0.005,.7196]
r = 0.015
tau = 0.35
rho =est_params[1]
sigma =est_params[2]
delta = 0.15
gamma =est_params[3]
a_capital =est_params[4]
lambda1 =est_params[5]
lambda2 =est_params[6]
s =est_params[7]
theta =est_params[8]
mu =est_params[9]
p_bar_k_ss =est_params[10]
beta = (1+r)^(-1)
sigma_range = 4
gz = 19
gp = 29
gk = 37
lnz=collect(linspace(-sigma_range*sigma,sigma_range*sigma,gz))
z=exp(lnz)
gk_m = fld(gk,2)
# Need to add mu somewhere to k_ss
k_ss = (theta*(1-tau)/(r+delta))^(1/(1-theta))
k=cat(1,map(i->k_ss*((1-delta)^i),collect(1:gk_m)),map(i->k_ss/((1-delta)^i),collect(1:gk_m)))
insert!(k,gk_m+1,k_ss)
sort!(k)
p_bar=p_bar_k_ss*k_ss
p = collect(linspace(-p_bar/2,p_bar,gp))
#Tauchen
N = length(z)
Z = zeros(N,1)
Zprob = zeros(Float32,N,N)
Z[N] = lnz[length(z)]
Z[1] = lnz[1]
zstep = (Z[N] - Z[1]) / (N - 1)
for i=2:(N-1)
    Z[i] = Z[1] + zstep * (i - 1)
end
for a = 1 : N
    for b = 1 : N
        if b == 1
            Zprob[a,b] = 0.5*erfc(-((Z[1] - mu - rho * Z[a] + zstep / 2) / sigma)/sqrt(2))
        elseif b == N
            Zprob[a,b] = 1 - 0.5*erfc(-((Z[N] - mu - rho * Z[a] - zstep / 2) / sigma)/sqrt(2))
        else
            Zprob[a,b] = 0.5*erfc(-((Z[b] - mu - rho * Z[a] + zstep / 2) / sigma)/sqrt(2)) -
                         0.5*erfc(-((Z[b] - mu - rho * Z[a] - zstep / 2) / sigma)/sqrt(2))
        end
    end
end
# Collecting tauchen results in a 2 element array of linspace and array; [2] gets array
# Zprob=collect(tauchen(gz, rho, sigma, mu, sigma_range))[2]
Zcumprob=zeros(Float32,gz,gz)
# 2 in cumsum! denotes the 2nd dimension, i.e. columns
cumsum!(Zcumprob, Zprob,2)
gm = gk * gp
control=zeros(gm,2)
for i=1:gk
    control[(1+gp*(i-1)):(gp*i),1]=fill(k[i],(gp,1))
    control[(1+gp*(i-1)):(gp*i),2]=p
end
endog=copy(control)
E=Array(Float32,gm,gm,gz)
for h=1:gm
    for m=1:gm
        for j=1:gz
            # set the nonzero net debt indicator
            if endog[h,2]<0
                p_ind=1
            else
                p_ind=0
            end
            # set the investment indicator
            if (control[m,1]-(1-delta)*endog[h,1])!=0
                i_ind=1
            else
                i_ind=0
            end
            E[m,h,j] = (1-tau)*z[j]*(endog[h,1]^theta) + control[m,2]-endog[h,2]*(1+r*(1-tau)) +
                       delta*endog[h,1]*tau-(control[m,1]-(1-delta)*endog[h,1]) -
                       (i_ind*gamma*endog[h,1]+endog[h,1]*(a_capital/2)*(((control[m,1]-(1-delta)*endog[h,1])/endog[h,1])^2)) +
                       s*endog[h,2]*p_ind
            elem = E[m,h,j]
            if E[m,h,j]<0
                E[m,h,j]=elem+lambda1*elem-.5*lambda2*elem^2
            else
                E[m,h,j]=elem
            end
        end
    end
end
I then constructed the function with serial processing. The two for loops iterate through each element to find the largest value in a 1073-element (= the gm scalar argument in the function) array:
function dynam_serial(E,gm,gz,beta,Zprob)
    v = Array(Float32,gm,gz)
    fill!(v,E[cld(gm,2),cld(gm,2),cld(gz,2)])
    Tv = Array(Float32,gm,gz)
    # Set parameters for the loop
    convcrit = 0.0001 # chosen convergence criterion
    diff = 1 # arbitrary initial value greater than convcrit
    while diff>convcrit
        exp_v=v*Zprob'
        for h=1:gm
            for j=1:gz
                Tv[h,j]=findmax(E[:,h,j] + beta*exp_v[:,j])[1]
            end
        end
        diff = maxabs(Tv - v)
        v=copy(Tv)
    end
end
Timing this, I get:
@time dynam_serial(E,gm,gz,beta,Zprob)
> 106.880008 seconds (91.70 M allocations: 203.233 GB, 15.22% gc time)
Now, I try using SharedArrays to benefit from parallel processing. Note that I reconfigured the iteration so that I only have one for loop, rather than two. I also use v=deepcopy(Tv); otherwise, v is copied as an Array object rather than a SharedArray:
function dynam_parallel(E,gm,gz,beta,Zprob)
    v = SharedArray(Float32,(gm,gz),init = S -> S[Base.localindexes(S)] = myid() )
    fill!(v,E[cld(gm,2),cld(gm,2),cld(gz,2)])
    # Set parameters for the loop
    convcrit = 0.0001 # chosen convergence criterion
    diff = 1 # arbitrary initial value greater than convcrit
    while diff>convcrit
        exp_v=v*Zprob'
        Tv = SharedArray(Float32,gm,gz,init = S -> S[Base.localindexes(S)] = myid() )
        @sync @parallel for hj=1:(gm*gz)
            j=cld(hj,gm)
            h=mod(hj,gm)
            if h==0;h=gm;end;
            @async Tv[h,j]=findmax(E[:,h,j] + beta*exp_v[:,j])[1]
        end
        diff = maxabs(Tv - v)
        v=deepcopy(Tv)
    end
end
Timing the parallel version, using a 4-core 2.5 GHz i7 processor with 16 GB of memory, I get:
addprocs(3)
@time dynam_parallel(E,gm,gz,beta,Zprob)
> 164.237208 seconds (2.64 M allocations: 201.812 MB, 0.04% gc time)
Am I doing something incorrect here? Or is there a better way to approach parallel processing in Julia for this particular problem? I've considered using Distributed Arrays, but it's difficult for me to see how to apply them to the present problem.
UPDATE:
Per @DanGetz and his helpful comments, I turned instead to trying to speed up the serial processing version. I was able to get performance down to 53.469780 seconds (67.36 M allocations: 103.419 GiB, 19.12% gc time) through:
1) Upgrading to 0.6.0 (saved about 25 seconds), which includes the helpful @views macro.
2) Preallocating the main array I'm trying to fill in (Tv), per the section on Preallocating Outputs in the Julia Performance Tips: https://docs.julialang.org/en/latest/manual/performance-tips/. (saved another 25 or so seconds)
The biggest remaining slow-down seems to be coming from the add_vecs function, which sums together subarrays of two larger matrices. I've tried devectorizing and using BLAS functions, but haven't been able to produce better performance.
In any event, the improved code for dynam_serial is below:
function add_vecs(r::Array{Float32},h::Int,j::Int,E::Array{Float32},exp_v::Array{Float32},beta::Float32)
    @views r=E[:,h,j] + beta*exp_v[:,j]
    return r
end
function dynam_serial(E::Array{Float32},gm::Int,gz::Int,beta::Float32,Zprob::Array{Float32})
    v = Array{Float32}(gm,gz)
    fill!(v,E[cld(gm,2),cld(gm,2),cld(gz,2)])
    Tv = Array{Float32}(gm,gz)
    r = Array{Float32}(gm)
    # Set parameters for the loop
    convcrit = 0.0001 # chosen convergence criterion
    diff = 1 # arbitrary initial value greater than convcrit
    while diff>convcrit
        exp_v=v*Zprob'
        for h=1:gm
            for j=1:gz
                @views Tv[h,j]=findmax(add_vecs(r,h,j,E,exp_v,beta))[1]
            end
        end
        diff = maximum(abs,Tv - v)
        v=copy(Tv)
    end
    return Tv
end
If add_vecs seems to be the critical function, writing an explicit for loop could offer more optimization. How does the following version benchmark?
function add_vecs!(r::Array{Float32},h::Int,j::Int,E::Array{Float32},
                   exp_v::Array{Float32},beta::Float32)
    @inbounds for i=1:size(E,1)
        r[i]=E[i,h,j] + beta*exp_v[i,j]
    end
    return r
end
UPDATE
To continue optimizing dynam_serial I have tried to remove more allocations. The result is:
function add_vecs_and_max!(gm::Int,r::Array{Float64},h::Int,j::Int,E::Array{Float64},
                           exp_v::Array{Float64},beta::Float64)
    @inbounds for i=1:gm
        r[i] = E[i,h,j]+beta*exp_v[i,j]
    end
    return findmax(r)[1]
end
function dynam_serial(E::Array{Float64},gm::Int,gz::Int,
                      beta::Float64,Zprob::Array{Float64})
    v = Array{Float64}(gm,gz)
    fill!(v,E[cld(gm,2),cld(gm,2),cld(gz,2)])
    r = Array{Float64}(gm)
    exp_v = Array{Float64}(gm,gz)
    # Set parameters for the loop
    convcrit = 0.0001 # chosen convergence criterion
    diff = 1.0 # arbitrary initial value greater than convcrit
    while diff>convcrit
        A_mul_Bt!(exp_v,v,Zprob)
        diff = -Inf
        for h=1:gm
            for j=1:gz
                oldv = v[h,j]
                newv = add_vecs_and_max!(gm,r,h,j,E,exp_v,beta)
                v[h,j]= newv
                diff = max(diff, oldv-newv, newv-oldv)
            end
        end
    end
    return v
end
Switching the functions to use Float64 should increase speed (as CPUs are inherently optimized for 64-bit word lengths). Also, using the mutating A_mul_Bt! directly saves another allocation, and the copy(...) is avoided by swapping the arrays v and Tv.
How do these optimizations improve your running time?
2nd UPDATE
Updated the code in the UPDATE section to use findmax. Also, changed dynam_serial to use v without Tv, as there was no need to save the old version except for the diff calculation, which is now done inside the loop.
Here's the code I copied and pasted, provided by Dan Getz above. I include the array and scalar definitions exactly as I ran them. Performance was: 39.507005 seconds (11 allocations: 486.891 KiB) when running @time dynam_serial(E,gm,gz,beta,Zprob).
using SpecialFunctions
est_params = [.788,.288,.0034,.1519,.1615,.0041,.0077,.2,0.005,.7196]
r = 0.015
tau = 0.35
rho =est_params[1]
sigma =est_params[2]
delta = 0.15
gamma =est_params[3]
a_capital =est_params[4]
lambda1 =est_params[5]
lambda2 =est_params[6]
s =est_params[7]
theta =est_params[8]
mu =est_params[9]
p_bar_k_ss =est_params[10]
beta = (1+r)^(-1)
sigma_range = 4
gz = 19 #15 #19
gp = 29 #19 #29
gk = 37 #25 #37
lnz=collect(linspace(-sigma_range*sigma,sigma_range*sigma,gz))
z=exp.(lnz)
gk_m = fld(gk,2)
# Need to add mu somewhere to k_ss
k_ss = (theta*(1-tau)/(r+delta))^(1/(1-theta))
k=cat(1,map(i->k_ss*((1-delta)^i),collect(1:gk_m)),map(i->k_ss/((1-delta)^i),collect(1:gk_m)))
insert!(k,gk_m+1,k_ss)
sort!(k)
p_bar=p_bar_k_ss*k_ss
p = collect(linspace(-p_bar/2,p_bar,gp))
#Tauchen
N = length(z)
Z = zeros(N,1)
Zprob = zeros(Float64,N,N)
Z[N] = lnz[length(z)]
Z[1] = lnz[1]
zstep = (Z[N] - Z[1]) / (N - 1)
for i=2:(N-1)
    Z[i] = Z[1] + zstep * (i - 1)
end
for a = 1 : N
    for b = 1 : N
        if b == 1
            Zprob[a,b] = 0.5*erfc(-((Z[1] - mu - rho * Z[a] + zstep / 2) / sigma)/sqrt(2))
        elseif b == N
            Zprob[a,b] = 1 - 0.5*erfc(-((Z[N] - mu - rho * Z[a] - zstep / 2) / sigma)/sqrt(2))
        else
            Zprob[a,b] = 0.5*erfc(-((Z[b] - mu - rho * Z[a] + zstep / 2) / sigma)/sqrt(2)) -
                         0.5*erfc(-((Z[b] - mu - rho * Z[a] - zstep / 2) / sigma)/sqrt(2))
        end
    end
end
# Collecting tauchen results in a 2 element array of linspace and array; [2] gets array
# Zprob=collect(tauchen(gz, rho, sigma, mu, sigma_range))[2]
Zcumprob=zeros(Float64,gz,gz)
# 2 in cumsum! denotes the 2nd dimension, i.e. columns
cumsum!(Zcumprob, Zprob,2)
gm = gk * gp
control=zeros(gm,2)
for i=1:gk
    control[(1+gp*(i-1)):(gp*i),1]=fill(k[i],(gp,1))
    control[(1+gp*(i-1)):(gp*i),2]=p
end
endog=copy(control)
E=Array(Float64,gm,gm,gz)
for h=1:gm
    for m=1:gm
        for j=1:gz
            # set the nonzero net debt indicator
            if endog[h,2]<0
                p_ind=1
            else
                p_ind=0
            end
            # set the investment indicator
            if (control[m,1]-(1-delta)*endog[h,1])!=0
                i_ind=1
            else
                i_ind=0
            end
            E[m,h,j] = (1-tau)*z[j]*(endog[h,1]^theta) + control[m,2]-endog[h,2]*(1+r*(1-tau)) +
                       delta*endog[h,1]*tau-(control[m,1]-(1-delta)*endog[h,1]) -
                       (i_ind*gamma*endog[h,1]+endog[h,1]*(a_capital/2)*(((control[m,1]-(1-delta)*endog[h,1])/endog[h,1])^2)) +
                       s*endog[h,2]*p_ind
            elem = E[m,h,j]
            if E[m,h,j]<0
                E[m,h,j]=elem+lambda1*elem-.5*lambda2*elem^2
            else
                E[m,h,j]=elem
            end
        end
    end
end
function add_vecs_and_max!(gm::Int,r::Array{Float64},h::Int,j::Int,E::Array{Float64},
                           exp_v::Array{Float64},beta::Float64)
    maxr = -Inf
    @inbounds for i=1:gm
        r[i] = E[i,h,j]+beta*exp_v[i,j]
        maxr = max(r[i],maxr)
    end
    return maxr
end
function dynam_serial(E::Array{Float64},gm::Int,gz::Int,
                      beta::Float64,Zprob::Array{Float64})
    v = Array{Float64}(gm,gz)
    fill!(v,E[cld(gm,2),cld(gm,2),cld(gz,2)])
    Tv = Array{Float64}(gm,gz)
    r = Array{Float64}(gm)
    exp_v = Array{Float64}(gm,gz)
    # Set parameters for the loop
    convcrit = 0.0001 # chosen convergence criterion
    diff = 1.0 # arbitrary initial value greater than convcrit
    while diff>convcrit
        A_mul_Bt!(exp_v,v,Zprob)
        diff = -Inf
        for h=1:gm
            for j=1:gz
                Tv[h,j]=add_vecs_and_max!(gm,r,h,j,E,exp_v,beta)
                diff = max(abs(Tv[h,j]-v[h,j]),diff)
            end
        end
        (v,Tv)=(Tv,v)
    end
    return v
end
Now, here's another version of the algorithm and inputs. The functions are similar to what Dan Getz suggested, except that I use findmax rather than an iterated max function to find the array maximum. In the input construction, I am using Float32 and mixing different bit types together. However, I've consistently achieved better performance this way: 24.905569 seconds (1.81 k allocations: 46.829 MiB, 0.01% gc time). It's not at all clear why.
using SpecialFunctions
est_params = [.788,.288,.0034,.1519,.1615,.0041,.0077,.2,0.005,.7196]
r = 0.015
tau = 0.35
rho =est_params[1]
sigma =est_params[2]
delta = 0.15
gamma =est_params[3]
a_capital =est_params[4]
lambda1 =est_params[5]
lambda2 =est_params[6]
s =est_params[7]
theta =est_params[8]
mu =est_params[9]
p_bar_k_ss =est_params[10]
beta = Float32((1+r)^(-1))
sigma_range = 4
gz = 19
gp = 29
gk = 37
lnz=collect(linspace(-sigma_range*sigma,sigma_range*sigma,gz))
z=exp(lnz)
gk_m = fld(gk,2)
# Need to add mu somewhere to k_ss
k_ss = (theta*(1-tau)/(r+delta))^(1/(1-theta))
k=cat(1,map(i->k_ss*((1-delta)^i),collect(1:gk_m)),map(i->k_ss/((1-delta)^i),collect(1:gk_m)))
insert!(k,gk_m+1,k_ss)
sort!(k)
p_bar=p_bar_k_ss*k_ss
p = collect(linspace(-p_bar/2,p_bar,gp))
#Tauchen
N = length(z)
Z = zeros(N,1)
Zprob = zeros(Float32,N,N)
Z[N] = lnz[length(z)]
Z[1] = lnz[1]
zstep = (Z[N] - Z[1]) / (N - 1)
for i=2:(N-1)
    Z[i] = Z[1] + zstep * (i - 1)
end
for a = 1 : N
    for b = 1 : N
        if b == 1
            Zprob[a,b] = 0.5*erfc(-((Z[1] - mu - rho * Z[a] + zstep / 2) / sigma)/sqrt(2))
        elseif b == N
            Zprob[a,b] = 1 - 0.5*erfc(-((Z[N] - mu - rho * Z[a] - zstep / 2) / sigma)/sqrt(2))
        else
            Zprob[a,b] = 0.5*erfc(-((Z[b] - mu - rho * Z[a] + zstep / 2) / sigma)/sqrt(2)) -
                         0.5*erfc(-((Z[b] - mu - rho * Z[a] - zstep / 2) / sigma)/sqrt(2))
        end
    end
end
# Collecting tauchen results in a 2 element array of linspace and array; [2] gets array
# Zprob=collect(tauchen(gz, rho, sigma, mu, sigma_range))[2]
Zcumprob=zeros(Float32,gz,gz)
# 2 in cumsum! denotes the 2nd dimension, i.e. columns
cumsum!(Zcumprob, Zprob,2)
gm = gk * gp
control=zeros(gm,2)
for i=1:gk
    control[(1+gp*(i-1)):(gp*i),1]=fill(k[i],(gp,1))
    control[(1+gp*(i-1)):(gp*i),2]=p
end
endog=copy(control)
E=Array(Float32,gm,gm,gz)
for h=1:gm
    for m=1:gm
        for j=1:gz
            # set the nonzero net debt indicator
            if endog[h,2]<0
                p_ind=1
            else
                p_ind=0
            end
            # set the investment indicator
            if (control[m,1]-(1-delta)*endog[h,1])!=0
                i_ind=1
            else
                i_ind=0
            end
            E[m,h,j] = (1-tau)*z[j]*(endog[h,1]^theta) + control[m,2]-endog[h,2]*(1+r*(1-tau)) +
                       delta*endog[h,1]*tau-(control[m,1]-(1-delta)*endog[h,1]) -
                       (i_ind*gamma*endog[h,1]+endog[h,1]*(a_capital/2)*(((control[m,1]-(1-delta)*endog[h,1])/endog[h,1])^2)) +
                       s*endog[h,2]*p_ind
            elem = E[m,h,j]
            if E[m,h,j]<0
                E[m,h,j]=elem+lambda1*elem-.5*lambda2*elem^2
            else
                E[m,h,j]=elem
            end
        end
    end
end
function add_vecs!(gm::Int,r::Array{Float32},h::Int,j::Int,E::Array{Float32},
                   exp_v::Array{Float32},beta::Float32)
    @inbounds @views for i=1:gm
        r[i]=E[i,h,j] + beta*exp_v[i,j]
    end
    return r
end
function dynam_serial(E::Array{Float32},gm::Int,gz::Int,beta::Float32,Zprob::Array{Float32})
    v = Array{Float32}(gm,gz)
    fill!(v,E[cld(gm,2),cld(gm,2),cld(gz,2)])
    Tv = Array{Float32}(gm,gz)
    # Set parameters for the loop
    convcrit = 0.0001 # chosen convergence criterion
    diff = 1.00000 # arbitrary initial value greater than convcrit
    iter=0
    exp_v=Array{Float32}(gm,gz)
    r=Array{Float32}(gm)
    while diff>convcrit
        A_mul_Bt!(exp_v,v,Zprob)
        for h=1:gm
            for j=1:gz
                Tv[h,j]=findmax(add_vecs!(gm,r,h,j,E,exp_v,beta))[1]
            end
        end
        diff = maximum(abs,Tv - v)
        (v,Tv)=(Tv,v)
    end
    return v
end

How to improve graph centrality calculation parallel performance?

I am getting performance degradation after parallelizing code that calculates graph centrality. The graph is relatively large, 100K vertices. The single-threaded application takes approximately 7 minutes. As recommended on the julialang site (http://julia.readthedocs.org/en/latest/manual/parallel-computing/#man-parallel-computing), I adapted the code and used the pmap API in order to parallelize the calculations. I started the calculation with 8 processes (julia -p 8 calc_centrality.jl). To my surprise, I got a 10-fold slowdown. The parallel process now takes more than an hour. I noticed that it takes several minutes for the parallel process to initialize and start the calculation. Even after all 8 CPUs are 100% busy with the julia app, the calculation is super slow.
Any suggestion how to improve parallel performance is appreciated.
calc_centrality.jl :
using Graphs
require("read_graph.jl")
require("centrality_mean.jl")
function main()
    file_name = "test_graph.csv"
    println("graph creation: ", file_name)
    g = create_generic_graph_from_file(file_name)
    println("num edges: ", num_edges(g))
    println("num vertices: ", num_vertices(g))
    data = cell(8)
    data[1] = {g, 1, 2500}
    data[2] = {g, 2501, 5000}
    data[3] = {g, 5001, 7500}
    data[4] = {g, 7501, 10000}
    data[5] = {g, 10001, 12500}
    data[6] = {g, 12501, 15000}
    data[7] = {g, 15001, 17500}
    data[8] = {g, 17501, 20000}
    cm = pmap(centrality_mean, data)
    println(cm)
end
println("Elapsed: ", #elapsed main(), "\n")
centrality_mean.jl
using Graphs
function centrality_mean(gr, start_vertex)
    centrality_cnt = Dict()
    vertex_to_visit = Set()
    push!(vertex_to_visit, start_vertex)
    cnt = 0
    while !isempty(vertex_to_visit)
        next_vertex_set = Set()
        for vertex in vertex_to_visit
            if !haskey(centrality_cnt, vertex)
                centrality_cnt[vertex] = cnt
                for neigh in out_neighbors(vertex, gr)
                    push!(next_vertex_set, neigh)
                end
            end
        end
        cnt += 1
        vertex_to_visit = next_vertex_set
    end
    mean([ v for (k,v) in centrality_cnt ])
end
function centrality_mean(data::Array{})
    gr = data[1]
    v_start = data[2]
    v_end = data[3]
    n = v_end - v_start + 1;
    cm = Array(Float64, n)
    v = vertices(gr)
    cnt = 0
    for i = v_start:v_end
        cnt += 1
        if cnt%10 == 0
            println(cnt)
        end
        cm[cnt] = centrality_mean(gr, v[i])
    end
    return cm
end
I'm guessing this has nothing to do with parallelism. Your second centrality_mean method has no clue what types of objects gr, v_start, and v_end are, so it has to use non-optimized, slow code for that "outer loop."
While there are several potential solutions, probably the easiest is to break up the function that receives the "command" from pmap:
function centrality_mean(data::Array{})
    gr = data[1]
    v_start = data[2]
    v_end = data[3]
    centrality_mean(gr, v_start, v_end)
end
function centrality_mean(gr, v_start, v_end)
    n = v_end - v_start + 1;
    cm = Array(Float64, n)
    v = vertices(gr)
    cnt = 0
    for i = v_start:v_end
        cnt += 1
        if cnt%10 == 0
            println(cnt)
        end
        cm[cnt] = centrality_mean(gr, v[i])
    end
    return cm
end
All this does is create a break, giving julia a chance to optimize the second part (which contains the performance-critical loop) for the actual types of the inputs.
Below is the code with @everywhere, suggested in the comments by https://stackoverflow.com/users/1409374/rickhg12hs. That fixed the performance issue!
test_parallel_pmap.jl
using Graphs
require("read_graph.jl")
require("centrality_mean.jl")
function main()
    @everywhere file_name = "test_data.csv"
    println("graph creation from: ", file_name)
    @everywhere data_graph = create_generic_graph_from_file(file_name)
    @everywhere data_graph_vertex = vertices(data_graph)
    println("num edges: ", num_edges(data_graph))
    println("num vertices: ", num_vertices(data_graph))
    range = cell(2)
    range[1] = {1, 25000}
    range[2] = {25001, 50000}
    cm = pmap(centrality_mean_pmap, range)
    for i = 1:length(cm)
        println(length(cm[i]))
    end
end
println("Elapsed: ", #elapsed main(), "\n")
centrality_mean.jl
using Graphs
function centrality_mean(start_vertex::ExVertex)
    centrality_cnt = Dict{ExVertex, Int64}()
    vertex_to_visit = Set{ExVertex}()
    push!(vertex_to_visit, start_vertex)
    cnt = 0
    while !isempty(vertex_to_visit)
        next_vertex_set = Set()
        for vertex in vertex_to_visit
            if !haskey(centrality_cnt, vertex)
                centrality_cnt[vertex] = cnt
                for neigh in out_neighbors(vertex, data_graph)
                    push!(next_vertex_set, neigh)
                end
            end
        end
        cnt += 1
        vertex_to_visit = next_vertex_set
    end
    mean([ v for (k,v) in centrality_cnt ])
end
function centrality_mean(v_start::Int64, v_end::Int64)
    n = v_end - v_start + 1;
    cm = Array(Float64, n)
    cnt = 0
    for i = v_start:v_end
        cnt += 1
        cm[cnt] = centrality_mean(data_graph_vertex[i])
    end
    return cm
end
function centrality_mean_pmap(range::Array{})
    v_start = range[1]
    v_end = range[2]
    centrality_mean(v_start, v_end)
end
From the Julia page on parallel computing:
Julia provides a multiprocessing environment based on message passing to allow programs to run on multiple processes in separate memory domains at once.
If I interpret this correctly, Julia's parallelism requires message passing to synchronize the processes. If each individual process does only a little work and then does a message pass, the computation will be dominated by message-passing overhead rather than by doing any work.
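A back-of-the-envelope cost model (my framing, not from the original answer) makes the point. With p processes, total work T_work, and n_msg messages of cost t_msg each:
T_parallel ≈ T_work / p + n_msg * t_msg
Once the n_msg * t_msg term dominates T_work / p, adding processes makes the run slower, which is consistent with the 10-fold slowdown observed above.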
I can't see from your code, and I don't know Julia well enough to tell, where the parallelism breaks are. But you have a big complicated graph that may be spread willy-nilly across multiple processes. If they need to interact while walking across graph links, you'll have exactly that kind of overhead.
You may be able to fix it by precomputing a partition of the graph into roughly equal-size, highly cohesive regions. I suspect that breakup requires the same type of complex graph processing that you already want to do, so you may have a chicken-and-egg problem to boot.
It may be that Julia offers you the wrong parallelism model. You might want a shared address space so that threads walking the graph don't have to use messages to traverse arcs.

Speeding up some for loops in MATLAB

Basically I am trying to solve a 2nd-order differential equation with the forward Euler method. I have some for loops inside my code, which take considerable time to solve, and I would like to speed things up a bit. Does anyone have any suggestions how I could do this?
Also, when looking at the timings, I notice that the end on line 14 takes 45% of my total time. What is end actually doing, and why is it taking so much time?
Here is my simplified code:
t = 0:0.01:100;
dt = t(2)-t(1);
B = 3.5 * t;
F0 = 2 * t;
BB = zeros(1,length(t)); % Preallocation
x = 2; % Initial value
u = 0; % Initial value
for ii = 1:length(t)
    for kk = 1:ii
        BB(ii) = BB(ii) + B(kk) * u(ii-kk+1)*dt; % This line takes the most time
    end % This end takes 45% of the other time
    x(ii+1) = x(ii) + dt*u(ii);
    u(ii+1) = u(ii) + dt * (F0(ii) - BB(ii));
end
Running the code takes me 8.552 sec.
You can remove the inner loop, I think:
for ii = 1:length(t)
    for kk = 1:ii
        BB(ii) = BB(ii) + B(kk) * u(ii-kk+1)*dt; % This line takes the most time
    end % This end takes 45% of the other time
    x(ii+1) = x(ii) + dt*u(ii);
    u(ii+1) = u(ii) + dt * (F0(ii) - BB(ii));
end
So BB(ii) = BB(ii) (zero at initialisation) + the sum over kk = 1:ii of B(kk) * u(ii-kk+1) * dt.
But kk = 1:ii, so for a given ii, ii-kk+1 → ii-(1:ii)+1 → ii:-1:1.
So I think this is equivalent to:
for ii = 1:length(t)
    BB(ii) = sum(B(1:ii).*u(ii:-1:1)*dt);
    x(ii+1) = x(ii) + dt*u(ii);
    u(ii+1) = u(ii) + dt * (F0(ii) - BB(ii));
end
It doesn't take as long as 8 seconds for me using either method, but the version with only one loop is about 2x as fast (the output of BB appears to be the same).
Isn't the sum loop of B(kk) * u(ii-kk+1) just conv(B(1:ii),u(1:ii),'same')?
The best way to speed up loops in MATLAB is to try to avoid them. Try to perform a matrix operation instead of the inner loop. For example, try to break the calculation you do there into small parts, then decide if there are parts you can perform in advance without knowing the results of the next iteration of the loop.
As to the second part of your question, my guess: the end contains the check of whether the loop runs for another round, and this check by itself is not that long, but it is called 50,015,001 times!
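That count is easy to verify: with t = 0:0.01:100 there are length(t) = 10001 outer iterations, and the inner loop body (and hence its end) runs ii times in outer iteration ii, so it executes
1 + 2 + ... + 10001 = 10001 * 10002 / 2 = 50,015,001 times.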

Speed up an Enumeration process

After a few days of optimization, this is my code for an enumeration process that consists in finding the best combination for every row of W. The algorithm separates the matrix W into one where the elements of W are greater than LimiteInferiore (called W_legali) and one that has only elements below the limit (called W_nlegali).
Using some parameters like Media (aka the mean) and rho_b_legali, the algorithm minimizes the total cost function. In the last part, I find the combination with the lowest value of the objective function and save it in W_ottimo.
As you can see, the algorithm is not so "clean", and with a very large matrix (142506x3000) it is damn slow... So, can somebody help me speed it up a little bit?
for i=1:3000
    W = PesoIncertezza * MatriceCombinazioni';
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    W_legali = W;
    W_legali(W<LimiteInferiore) = nan;
    if i==1
        Media = W_legali;
        rho_b_legale = ones(size(W_legali,1),size(MatriceCombinazioni,1));
    else
        Media = (repmat(sum(W_tot_migl,2),1,size(MatriceCombinazioni,1))+W_legali)/(size(W_tot_migl,2)+1);
        rho_b_legale = repmat(((n_b+1)/i),1,size(MatriceCombinazioni,1));
    end
    [W_legali_migl,comb] = min(C_u .* Media .* (1./rho_b_legale) + (1./rho_b_legale) .* c_0 + (c_1./(i * rho_b_legale)),[],2);
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    MatriceCombinazioni_2 = MatriceCombinazioni;
    MatriceCombinazioni_2(sum(MatriceCombinazioni_2,2)<2,:)=[];
    W_nlegali = PesoIncertezza * MatriceCombinazioni_2';
    W_nlegali(W_nlegali>=LimiteInferiore) = nan;
    if i==1
        Media = W_nlegali;
        rho_b_nlegale = zeros(size(W_nlegali,1),size(MatriceCombinazioni_2,1));
    else
        Media = (repmat(sum(W_tot_migl,2),1,size(MatriceCombinazioni_2,1))+W_nlegali)/(size(W_tot_migl,2)+1);
        rho_b_nlegale = repmat(((n_b)/i),1,size(MatriceCombinazioni_2,1));
    end
    [W_nlegali_migliori,comb2] = min(C_u .* Media .* (1./rho_b_nlegale) + (1./rho_b_nlegale) .* c_0 + (c_1./(i * rho_b_nlegale)),[],2);
    z = [W_legali_migl, W_nlegali_migliori];
    [z_ott,comb3] = min(z,[],2);
    %Increasing n_b
    if i==1
        n_b = zeros(size(W,1),1);
    end
    index = find(comb3==1);
    increment = ones(size(index,1),1);
    B = accumarray(index,increment);
    nzIndex = (B ~= 0);
    n_b(nzIndex) = n_b(nzIndex) + B(nzIndex);
    %Using comb3 to find where the best configuration is:
    %in W_legali or in W_nlegali?
    combinazione = comb.*logical(comb3==1) + comb2.*logical(comb3==2);
    W_ottimo = W(sub2ind(size(W),[1:size(W,1)],combinazione'))';
    W_tot_migl(:,i) = W_ottimo;
    FunzObb(:,i) = z_ott;
    [PesoCestelli] = Simulazione_GenerazioneNumeriCasuali(PianoSperimentale,NumeroCestelli,NumeroEsperimenti,Alfa);
    [PesoIncertezza_2] = Simulazione_GenerazioneIncertezza(NumeroCestelli,NumeroEsperimenti,IncertezzaCella,PesoCestelli);
    PesoIncertezza(MatriceCombinazioni(combinazione,:)~=0) = PesoIncertezza_2(MatriceCombinazioni(combinazione,:)~=0); %updating just the hoppers that have been discharged
end
When you see repmat you should think bsxfun. For example, replace:
Media = (repmat(sum(W_tot_migl,2),1,size(MatriceCombinazioni,1))+W_legali) / ...
(size(W_tot_migl,2)+1);
with
Media = bsxfun(@plus,sum(W_tot_migl,2),W_legali) / ...
        (size(W_tot_migl,2)+1);
The purpose of bsxfun is to do a virtual "singleton expansion" like repmat, without actually replicating the array into a matrix of the same size as W_legali.
Also note that in the above code, sum(W_tot_migl,2) is computed twice. There are other small optimizations, but changing to bsxfun should give you a good improvement.
The values of 1./rho_b_legale are effectively computed three times. Store this quotient matrix.
