With OpenMP parallelized nested loops run slow

With OpenMP parallelized nested loops run slow - parallel-processing

I've got a part of a fortran program consisting of some nested loops which I want to parallelize with OpenMP.
integer :: nstates , N, i, dima, dimb, dimc, a_row, b_row, b_col, c_row, row, col
double complex, dimension(4,4):: mat
double complex, dimension(:), allocatable :: vecin,vecout
nstates = 2
N = 24
allocate(vecin(nstates**N), vecout(nstates**N))
vecin = ...some data
vecout = 0
mat = reshape([...some data...],[4,4])
dimb=nstates**2
!$OMP PARALLEL DO PRIVATE(dima,dimc,row,col,a_row,b_row,c_row,b_col)
do i=1,N-1
dima=nstates**(i-1)
dimc=nstates**(N-i-1)
do a_row = 1, dima
do b_row = 1,dimb
do c_row = 1,dimc
row = ((a_row-1)*dimb + b_row - 1)*dimc + c_row
do b_col = 1,dimb
col = ((a_row-1)*dimb + b_col - 1)*dimc + c_row
!$OMP ATOMIC
vecout(row) = vecout(row) + vecin(col)*mat(b_row,b_col)
end do
end do
end do
end do
end do
!$OMP END PARALLEL DO
The program runs and the result I get is also correct, it's just incredible slow. Much slower than without OpenMP. I don't know much about OpenMP. Have I done something wrong with the use of PRIVATE or OMP ATOMIC? I would be grateful for every advice how to improve the performance of my code.

If your arrays are too large and you get stack overflows with automatic reduction, you can implement the reduction yourself with allocatable temporary arrays.
As Francois Jacq pointed out, you also have a race condition caused by dima and dimb which should be private.
double complex, dimension(:), allocatable :: tmp
!$OMP PARALLEL PRIVATE(dima,dimb,row,col,a_row,b_row,c_row,b_col,tmp)
allocate(tmp(size(vecout)))
tmp = 0
!$OMP DO
do i=1,N-1
dima=nstates**(i-1)
dimc=nstates**(N-i-1)
do a_row = 1, dima
do b_row = 1,dimb
do c_row = 1,dimc
row = ((a_row-1)*dimb + b_row - 1)*dimc + c_row
do b_col = 1,dimb
col = ((a_row-1)*dimb + b_col - 1)*dimc + c_row
tmp(row) = tmp(row) + vecin(col)*mat(b_row,b_col)
end do
end do
end do
end do
end do
!$OMP END DO
!$OMP CRITICAL
vecout = vecout + tmp
!$OMP END CRITICAL
!$OMP END PARALLEL

Could you try something like :
do b_col=1,dimb
do i=1,N-1
dima=nstates**(i-1)
dimc=nstates**(N-i-1)
!$OMP PARALLEL DO COLLAPSE(3) PRIVATE(row,col,a_row,b_row,c_row)
do a_row = 1, dima
do b_row = 1,dimb
do c_row = 1,dimc
row = ((a_row-1)*dimb + b_row - 1)*dimc + c_row
col = ((a_row-1)*dimb + b_col - 1)*dimc + c_row
vecout(row) = vecout(row) + vecin(col)*mat(b_row,b_col)
enddo
enddo
enddo
enddo
enddo
The advantage is that the // loop does not cause collision now : all the indexes row are different.

Related

OpenMP on fortran 90 lasts almost the same(if not more) as non parallelized program

I'm trying to parallelize a simulation of an Ising 2D model to get some expected values as a function of the temperature of the system. For L=48, the one-threaded version takes about 240 seconds to run 20 temperatures and 1 seed each, but the parallelized version takes about 268 seconds, which is similar.
If you take the time per seed per temperature, it results in 12 seconds for the one-threaded version and 13.4 seconds for the parallelized version. I'm looking for help with my code because I don't understand these durations. I thought that the parallelized version would split one temperature among all threads and therefore should take about 30 seconds to complete.
I need to run the simulation for 50 temperatures and 200 seeds each, for 5 values of L. It would be helpful to reduce the compute time, because otherwise it could take 20 hours for L=48 and some days for L=72.
I'm using an i7-10700KF (8 cores, 16 logical threads).
program Ising
use omp_lib
implicit none
integer L, seed, i, j, seed0, nseed,k
parameter (L=48)
integer s(1:L, 1:L)
integer*4 pbc(0:L+1), mctot, N, mcd, mcini, difE
real*8 genrand_real2, magne, energ, energia, temp, temp1, DE
real*8 mag, w(-8:8)
real*8 start, finish
real*8 sum, sume, sume2, summ, summ2, sumam, vare, varm, maxcv, maxx
real*8 cv, x, Tmaxcv, Tmaxx
integer irand, jrand
11 format(10(f20.6))
! Initialize variables
mctot = 80000
mcd = 20
mcini = 8000
N = L*L
seed0 = 20347880
nseed = 20
maxcv=0.d0
maxx=0.d0
! Initialize vector pbc
pbc(0) = L
pbc(L+1) = 1
do i = 1, L
pbc(i) = i
end do
! Initialize matrix s with random values
do i = 1, L
do j = 1, L
if (genrand_real2() < 0.5) then
s(i,j) = 1
else
s(i,j) = -1
endif
end do
end do
! Metropolis algorithm
open(1, file='Expectation values.dat')
start = omp_get_wtime()
write(1,*) '#Temp, ','E, ','E2, ','M, ','M2, ','|M|, ','VarE, ','VarM, ',&
'Cv, ','X, '
!Start loop to calculate for different temperatures
!$OMP PARALLEL PRIVATE(s,seed,w,energia,difE,irand,jrand,temp,mag,sum,sume,sume2,summ,summ2,sumam,vare,varm,cv,x)
temp1 = 1.59d0
!$OMP DO ordered schedule(dynamic)
do k = 1, 10
temp = temp1 + (0.01d0*k)
!Define the matrix w, which contains the values of the Boltzmann function for each temperature, so as not to have to calculate them each iteration
do i = -8, 8
w(i) = dexp(-i/temp)
end do
write(*,*) "Temperature: ", temp, "Thread", omp_get_thread_num()
sum = 0.d0
sume = 0.d0
sume2 = 0.d0
summ = 0.d0
summ2 = 0.d0
sumam = 0.d0
do seed = seed0, seed0 + nseed-1, 1
call init_genrand(seed)
call reinicia(s,l)
energia = energ(s,l,pbc)
do i = 1, mctot
do j = 1, N
irand = int(genrand_real2()*L) + 1
jrand = int(genrand_real2()*L) + 1
difE = int(DE(s,l,irand,jrand,pbc))
if (difE < 0) then
s(irand,jrand) = -s(irand,jrand)
energia = energia + difE
else if (genrand_real2() < w(int(difE))) then
s(irand,jrand) = -s(irand,jrand)
energia = energia + difE
endif
end do
if ((i > mcini).and.(mcd*(i/mcd)==i)) then
mag= magne(s,l)
sum = sum + 1.d0
sume = sume + energia
sume2 = sume2 + energia**2
summ = summ + mag
summ2 = summ2 + mag**2
sumam = sumam + abs(mag)
endif
end do
end do
!Energy
sume=sume/(sum*N)
sume2=sume2/(sum*N*N)
!Magnetitzation
summ = summ/(sum*N)
sumam=sumam/(sum*N)
summ2=summ2/(sum*N*N)
!Variances
vare = dsqrt(sume2-sume*sume)/dsqrt(sum)
varm = dsqrt(summ2-summ*summ)/dsqrt(sum)
!Cv
cv = (N*(sume2-sume*sume))/temp**2
if (cv.gt.maxcv) then
maxcv=cv
Tmaxcv=temp
endif
!X
x = (N*(summ2-summ*summ))/temp
if (x.gt.maxx) then
maxx=x
Tmaxx=temp
endif
write(1,11) temp,sume,sume2,summ,summ2,sumam,vare,varm,cv,x
end do
!$OMP END DO
!$OMP END PARALLEL
finish = omp_get_wtime()
close(1)
print*, "Time: ",(finish-start),"Seconds"
end program Ising
! Functions
!Function that calculates the energy of the matrix s
real*8 function energ(S,L, pbc)
implicit none
integer s(1:L, 1:L), i, j, L
integer*4 pbc(0:L+1)
real*8 ene
ene = 0.0d0
do i = 1, L
do j = 1, L
ene = ene - s(i,j) * s(pbc(i+1),j) - s(i,j) * s(i,pbc(j+1))
end do
end do
energ = ene
return
end function energ
!Function that calculates the difference in energy that occurs when the spin of position (i, j) is changed
real*8 function DE(S,L,i,j,pbc)
implicit none
integer s(1:L, 1:L), i, j, L, difE
integer*4 pbc(0:L+1)
real*8 suma
difE = 0
suma = 0.0d0
suma = suma + s(pbc(i-1),j) + s(pbc(i+1),j) + s(i,pbc(j-1)) + s(i,pbc(j+1))
difE = difE + int(2 * s(i,j) * suma)
DE = difE
return
end function DE
!Function that calculates the magnetization of the matrix s
real*8 function magne(S,L)
implicit none
integer s(1:L, 1:L),L
magne = sum(s)
return
end function magne
! SUBRUTINES
!Subroutine that resets the matrix s with random values
subroutine reinicia(S,L)
implicit none
integer s(1:L, 1:L), i,j,L
real*8 genrand_real2
do i = 1, L
do j = 1, L
if (genrand_real2() < 0.5) then
s(i,j) = 1
else
s(i,j) = -1
endif
end do
end do
return
end subroutine
I have tried parallelizing the seeds loop instead of the temperatures, but it lasts almost the same, so i think i'm not parallelizing it correctly, because it looks a nice code to parallelize.
The other option I thought of is to manually parallelize the simulation. I could do this by compiling 16 programs, each of which handles a different range of temperatures. Then I could run all of the programs concurrently, so each program would get its own thread. However, this approach would require a lot of extra RAM.

Wrong results and crashes from Fortran subroutine with OpenMP

I wrote the following code, and then tried using OpenMP to parallelise it. However, after I compiled the following OpenMP code using f2py, Python always generates certain errors when I run it. There are no error messages, only that the numbers are a bit off, and whenever I compile it with f2py and run it in Python, it kills the kernel.
I am wondering if this has anything to do with my parallel region. I am always a bit confused about which variables to take private so can anyone observe any errors?
subroutine simulate_omp(m,nt,s0,l,d,a,numthreads,x,y,psi)
!Pedestrian motion model
!input variables:
!n = m^2 = number of students
!nt: number of time steps
!s0: student speed
!l: initial spacing between students
!d: student motion influence by all other students within distance <= d
!a: noise amplitude
!numthreads: number of threads to use in parallel regions
!output variables:
!x,y: all m^2 student paths from i=1 to nt+1
!psi: synchronization parameter, store at all nt+1 times (including initial
condition)
use omp_lib
implicit none
integer, intent(in) :: m,nt,numthreads
real(kind=8), intent(in) :: s0,l,d,a
real(kind=8), dimension(m*m,nt+1), intent(out) :: x,y
real(kind=8), dimension(nt+1), intent(out) :: psi
real(kind=8), dimension(m*m,nt+1) :: xtemp,ytemp,u,v
real(kind=8), dimension(m*m,nt) :: usum,vsum,umean,vmean
real(kind=8) :: r(m*m)
real(kind=8),parameter :: pi = 4*atan(1.0_8)
integer :: i1,j1,k1,i2,j2,k2,count
!$call omp_set_num_threads(numthreads)
! initialize student positions
x = 0.d0
y = 0.d0
k1 = 0
do i1 = 1,m
do j1=1,m
k1 = k1 + 1
x(k1,1) = (j1-1)*l/2 - (m-1)*l/4
y(k1,1) = (i1-1)*l/2 - (m-1)*l/4
end do
end do
x(:,1) = x(:,1)/(m-1)
y(:,1) = y(:,1)/(m-1)
! initialize
xtemp(:,1) = x(:,1)
ytemp(:,1) = y(:,1)
call random_number(r)
u(:,1) = s0*cos(r*2*pi-pi)
v(:,1) = s0*sin(r*2*pi-pi)
psi(1) = sqrt(sum(u(:,1))**2+sum(v(:,1)**2))/dble(m)/dble(m)/s0
do i2 = 1,nt
!$OMP parallel do private(j2,k2,l)
do j2 = 1,m*m
usum(j2,i2) = 0
vsum(j2,i2) = 0
count = 0
!$OMP parallel do reduction(+:usum,vsum,count)
do k2 = 1,m*m
if ((xtemp(k2,i2)-xtemp(j2,i2))**2+(ytemp(k2,i2)-ytemp(j2,i2))**2<=d**2)
then
usum(j2,i2) = usum(j2,i2)+u(k2,i2)
vsum(j2,i2) = vsum(j2,i2)+v(k2,i2)
count = count+1
end if
end do
!$OMP end parallel do
umean(j2,i2) = usum(j2,i2)/dble(count)
vmean(j2,i2) = vsum(j2,i2)/dble(count)
u(j2,i2+1) = s0*cos(atan(vmean(j2,i2)/umean(j2,i2))+a*(r(j2)*2*pi-pi))
v(j2,i2+1) = s0*sin(atan(vmean(j2,i2)/umean(j2,i2))+a*(r(j2)*2*pi-pi))
xtemp(j2,i2+1) = xtemp(j2,i2)+u(j2,i2+1)
ytemp(j2,i2+1) = ytemp(j2,i2)+v(j2,i2+1)
! boundary conditions
if (xtemp(j2,i2+1)>l) then
xtemp(j2,i2+1) = xtemp(j2,i2+1)-2*l
else
if (xtemp(j2,i2+1)<-l) then
xtemp(j2,i2+1) = xtemp(j2,i2+1)+2*l
end if
end if
if (ytemp(j2,i2+1)>l) then
ytemp(j2,i2+1) = ytemp(j2,i2+1)-2*l
else
if (ytemp(j2,i2+1)<-l) then
ytemp(j2,i2+1) = ytemp(j2,i2+1)+2*l
end if
end if
end do
!$OMP end parallel do
psi(i2+1) = sqrt(sum(u(:,i2+1))**2+sum(v(:,i2+1))**2)/dble(m)/dble(m)/s0
end do
x(:,1:nt+1) = xtemp(:,1:nt+1)
y(:,1:nt+1) = ytemp(:,1:nt+1)
end subroutine simulate_omp

The argument l is declared with intent(in) and not modified in the loop so there is no need to declare it private. Below is a suggestion without the outer parallel loop:
subroutine simulate_omp(m,nt,s0,l,d,a,numthreads,x,y,psi)
!Pedestrian motion model
!input variables:
!n = m^2 = number of students
!nt: number of time steps
!s0: student speed
!l: initial spacing between students
!d: student motion influence by all other students within distance <= d
!a: noise amplitude
!numthreads: number of threads to use in parallel regions
!output variables:
!x,y: all m^2 student paths from i=1 to nt+1
!psi: synchronization parameter, store at all nt+1 times (including initial
condition)
use omp_lib
implicit none
integer, intent(in) :: m,nt,numthreads
real(kind=8), intent(in) :: s0,l,d,a
real(kind=8), dimension(m*m,nt+1), intent(out) :: x,y
real(kind=8), dimension(nt+1), intent(out) :: psi
real(kind=8), dimension(m*m,nt+1) :: xtemp,ytemp,u,v
real(kind=8), dimension :: usum,vsum,umean,vmean
real(kind=8) :: r(m*m)
real(kind=8),parameter :: pi = 4*atan(1.0_8)
integer :: i1,j1,k1,i2,j2,k2,count
!$call omp_set_num_threads(numthreads)
! initialize student positions
x = 0.d0
y = 0.d0
k1 = 0
do i1 = 1,m
do j1=1,m
k1 = k1 + 1
x(k1,1) = (j1-1)*l/2 - (m-1)*l/4
y(k1,1) = (i1-1)*l/2 - (m-1)*l/4
end do
end do
x(:,1) = x(:,1)/(m-1)
y(:,1) = y(:,1)/(m-1)
! initialize
xtemp(:,1) = x(:,1)
ytemp(:,1) = y(:,1)
call random_number(r)
u(:,1) = s0*cos(r*2*pi-pi)
v(:,1) = s0*sin(r*2*pi-pi)
psi(1) = sqrt(sum(u(:,1))**2+sum(v(:,1)**2))/dble(m)/dble(m)/s0
do i2 = 1,nt
do j2 = 1,m*m
usum = 0
vsum = 0
count = 0
!$OMP parallel do private(k2), reduction(+:usum,vsum,count)
do k2 = 1,m*m
if ((xtemp(k2,i2)-xtemp(j2,i2))**2+(ytemp(k2,i2)-ytemp(j2,i2))**2<=d**2) then
usum = usum+u(k2,i2)
vsum = vsum+v(k2,i2)
count = count+1
end if
end do
!$OMP end parallel do
umean = usum/dble(count)
vmean = vsum/dble(count)
u(j2,i2+1) = s0*cos(atan(vmean/umean)+a*(r(j2)*2*pi-pi))
v(j2,i2+1) = s0*sin(atan(vmean/umean)+a*(r(j2)*2*pi-pi))
xtemp(j2,i2+1) = xtemp(j2,i2)+u(j2,i2+1)
ytemp(j2,i2+1) = ytemp(j2,i2)+v(j2,i2+1)
! boundary conditions
if (xtemp(j2,i2+1)>l) then
xtemp(j2,i2+1) = xtemp(j2,i2+1)-2*l
else
if (xtemp(j2,i2+1)<-l) then
xtemp(j2,i2+1) = xtemp(j2,i2+1)+2*l
end if
end if
if (ytemp(j2,i2+1)>l) then
ytemp(j2,i2+1) = ytemp(j2,i2+1)-2*l
else
if (ytemp(j2,i2+1)<-l) then
ytemp(j2,i2+1) = ytemp(j2,i2+1)+2*l
end if
end if
end do
psi(i2+1) = sqrt(sum(u(:,i2+1))**2+sum(v(:,i2+1))**2)/dble(m)/dble(m)/s0
end do
x(:,1:nt+1) = xtemp(:,1:nt+1)
y(:,1:nt+1) = ytemp(:,1:nt+1)
end subroutine simulate_omp
You can time it and compare it with the outer loop parallelised using private(j2,k2,umean,vmean,usum,vsum,count), shared(u,v,xtemp,ytemp). Make sure to have OMP_NESTED set to true for the latter tests.

Avoiding false sharing when in OpenMP parallel loop

Consider a parallel loop, where each thread will be computing on a private vector dudz(izfirst:izlast). In my implementation, I want to accomplish two things:
Not allocate memory when this parallel region is entered (it is called every time step)
Avoid false sharing (I am currently rewriting the code to avoid excess cache misses)
To avoid problem 1, I was thinking of creating the array dudz(izfirst:izlast,nproc) where each thread only accesses dudz(:, omp_id), but isn't this vulnerable to false sharing? To avoid false sharing, I was thinking about using private(dudz), but doesn't this allocate memory?
The following code can be adapted to either of my solutions, but which one is better? Is there a third alternative that handles both my concerns?
!$omp parallel do num_threads(nproc) private(ix, iz, ishift)
do ix = ixfirst, ixlast
do iz = izfirst, izfirst+ophalf-1
dudz(iz) = 0.0
enddo
!$omp simd
do iz = izfirst+ophalf, izlast-ophalf+1
dudz(iz) = az(1)*( u(iz,ix) - u(iz-1,ix) )
do ishift = 2, ophalf
dudz(iz) = dudz(iz) + az(ishift)*( u(iz+ishift-1,ix) - u(iz-ishift,ix) )
enddo
dudz(iz) = dudz(iz)*buoy_z(iz,ix)
enddo
!$omp end simd
do iz = izlast-ophalf+2, izlast
dudz(iz) = 0.0
enddo
enddo
!$omp end parallel do
Thank you for any advice.

Optimizing a loop : huge arrays operations

I am doing huge calculations (derivatives here, but look similar to images operations) on arrays that do not fit in cache, meaning the CPU have to load parts in the cache, calculate, then load another part, etc. But because of the shape of calculations, some data are load, unload and reload multiple times. I was wondering if there is a way to optimize this. I am already using SIMD instructions using the compiler optimization (GCC and Intel).
This is Fortran calculations, but it's similar to C/C++, the memory order is just inverted and arrays use () instead of []. for is replaced by do.
On x axe:
do k=1,N(3)
do j=1,N(2)
do i=3,N(1)
DF(i,j,k)=(F(i+1,j,k)-F(i-1,j,k))*B+(F(i-2,j,k)-F(i+2,j,k))*C
end do
end do
end do
On y axe:
do k=1,N(3)
do j=1,N(2)
do i=3,N(1)
DF(i,j,k)=(F(i,j+1,k)-F(i,j-1,k))*B+(F(i,j-2,k)-F(i,j+2,k))*C
end do
end do
end do
on z axe:
do k=1,N(3)
do j=1,N(2)
do i=3,N(1)
DF(i,j,k)=(F(i,j,k+1)-F(i,j,k-1))*B+(F(i,j,k-2)-F(i,j,k+2))*C
end do
end do
end do
First derivative on axe x is OK because memory is read continuously. Derivatives on y and z axes are not continuous.
And the worst calculation I have combine all axes (This is a Laplacian operator):
do k=1,N(3)
do j=1,N(2)
do i=3,N(1)
V(i,j,k) = M(i,j,k,1) * p(i,j,k) &
& + M(i,j,k,2) * p(i-1,j,k) &
& + M(i,j,k,3) * p(i+1,j,k) &
& + M(i,j,k,4) * p(i,j-1,k) &
& + M(i,j,k,5) * p(i,j+1,k) &
& + M(i,j,k,6) * p(i,j,k-1) &
& + M(i,j,k,7) * p(i,j,k+1)
end do
end do
end do
Note that compilers do not understand the last operation (Laplacian). To use SIMD (vectorized calculations), I need to split the operation like this, which gives a 2.5x speedup:
do k=1,N(3)
do j=1,N(2)
do i=3,N(1)
V(i,j,k) = M(i,j,k,1) * p(i,j,k) &
& + M(i,j,k,2) * p(i-1,j,k) &
& + M(i,j,k,3) * p(i+1,j,k)
end do
end do
end do
do k=1,N(3)
do j=1,N(2)
do i=3,N(1)
V(i,j,k) = V(i,j,k) + &
& + M(i,j,k,4) * p(i,j-1,k) &
& + M(i,j,k,5) * p(i,j+1,k)
end do
end do
end do
do k=1,N(3)
do j=1,N(2)
do i=3,N(1)
V(i,j,k) = V(i,j,k) + &
& + M(i,j,k,6) * p(i,j,k-1) &
& + M(i,j,k,7) * p(i,j,k+1)
end do
end do
end do
Maybe using SIMD I already reached maximum speed, but because these calculations takes days, even with MPI and more than 1024 CPU, reducing the time of calculations, even of 20% would be a great step!
Does anyone of you have ideas on how to optimize this?

When you use 3D stencils and you reference elements like i,j,k-1, i,j,k+1 the linear order in which you go through the array will not be optimal. The cache efficiency can be increased by loop tiling.
In my code I use
!$omp parallel private(i,j,k,bi,bj,bk)
!$omp do schedule(runtime) collapse(3)
do bk = 1, Unz, tnz
do bj = 1, Uny, tny
do bi = 1, Unx, tnx
do k = bk, min(bk+tnz-1,Unz)
do j = bj, min(bj+tny-1,Uny)
do i = bi, min(bi+tnx-1,Unx)
U2 (i,j,k) = U2(i,j,k) + &
(U(i+1,j,k)-U(i,j,k)) * ...
U2(i,j,k) = U2(i,j,k) - &
(U(i,j,k)-U(i-1,j,k)) * ...
U2(i,j,k) = U2(i,j,k) + &
(U(i,j+1,k)-U(i,j,k)) * ...
U2(i,j,k) = U2(i,j,k) - &
(U(i,j,k)-U(i,j-1,k)) * ...
U2(i,j,k) = U2(i,j,k) + &
(U(i,j,k+1)-U(i,j,k)) * ...
U2(i,j,k) = U2(i,j,k) - &
(U(i,j,k)-U(i,j,k-1)) * ...
end do
end do
end do
end do
end do
end do
!$omp end do
where tnx, tny, tnz are the sizes of a tile in which you iterate in the i,j,k order. The size must be set to be close to the L1 cache. This will increase reuse of the content loaded into the cache.
If you need to separate the directions, you can of course do that and still keep the tiling.

More efficient Matlab Code please

I am new to matlab so I do not know all the shortcuts matlab has to make the code more efficient and faster. I have been hacking together something in matlab for a homework assignment while focusing on completing the assignment rather than efficiency. Now I'm finding that I'm spending more time waiting on the program than actually coding it. Below is a headache of nested for loops that takes forever to finish. Is there a faster or efficient way of coding this without so many forloops?
for i = 1:ysize
for j = 1:xsize
MArr = zeros(windowSize^2, 2, 2);
for i2 = i - floor(windowSize/2): i + floor(windowSize/2)
if i2 > 0 && i2 < ysize + 1
for j2 = j - floor(windowSize/2): j + floor(windowSize/2)
if j2 > 0 && j2 < xsize + 1
mat = weight*[mappedGX(i2,j2)^2, mappedGX(i2,j2)*mappedGY(i2,j2); mappedGX(i2,j2)*mappedGY(i2,j2), mappedGY(i2,j2)^2];
for i3 = 1:2
for j3 = 1:2
MArr(windowSize*(j2-(j - floor(windowSize/2))+1) + (i2-(i - floor(windowSize/2)) + 1),i3,j3) = mat(i3,j3);
end
end
end
end
end
end
Msum = zeros(2,2);
for k = size(MArr)
for i2 = 1:2
for j2 = 1:2
Msum = Msum + MArr(k,i2,j2);
end
end
end
R(i,j) = det(Msum) - alpha*(trace(Msum)^2);
R = -1 * R;
end
end

Instead of looping, use colons. For example:
for i3 = 1:2
for j3 = 1:2
MArr(windowSize*(j2-(j - floor(windowSize/2))+1) + (i2-(i - floor(windowSize/2)) + 1),i3,j3) = mat(i3,j3);
end
end
Can be written as:
MArr(windowSize*(j2-(j-floor(windowSize/2))+1)+(i2-(i-floor(windowSize/2))+1),:,:)=mat;
After you find all places where this can be done, learn to use indexing instead of looping, e.g.,
i2 = i - floor(windowSize/2): i + floor(windowSize/2);
i2=i2(i2>0 && i2<ysize+1);
j2 = j - floor(windowSize/2): j + floor(windowSize/2);
j2=j2(j2>0 && j2<xsize+1);
mat = weight*[mappedGX(i2,j2)^2, mappedGX(i2,j2)*mappedGY(i2,j2);
(Note for advanced users: the last line may not work if mappedGX is a matrix, and i2/j2 don't represent a rectangular sub-matrix. In such a case you will need sub2ind())

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

With OpenMP parallelized nested loops run slow - parallel-processing

Related

OpenMP on fortran 90 lasts almost the same(if not more) as non parallelized program

Wrong results and crashes from Fortran subroutine with OpenMP

Avoiding false sharing when in OpenMP parallel loop

Optimizing a loop : huge arrays operations

More efficient Matlab Code please

Categories

Resources