OpenMP on fortran 90 lasts almost the same(if not more) as non parallelized program

OpenMP on fortran 90 lasts almost the same(if not more) as non parallelized program - parallel-processing

I'm trying to parallelize a simulation of an Ising 2D model to get some expected values as a function of the temperature of the system. For L=48, the one-threaded version takes about 240 seconds to run 20 temperatures and 1 seed each, but the parallelized version takes about 268 seconds, which is similar.
If you take the time per seed per temperature, it results in 12 seconds for the one-threaded version and 13.4 seconds for the parallelized version. I'm looking for help with my code because I don't understand these durations. I thought that the parallelized version would split one temperature among all threads and therefore should take about 30 seconds to complete.
I need to run the simulation for 50 temperatures and 200 seeds each, for 5 values of L. It would be helpful to reduce the compute time, because otherwise it could take 20 hours for L=48 and some days for L=72.
I'm using an i7-10700KF (8 cores, 16 logical threads).
program Ising
use omp_lib
implicit none
integer L, seed, i, j, seed0, nseed,k
parameter (L=48)
integer s(1:L, 1:L)
integer*4 pbc(0:L+1), mctot, N, mcd, mcini, difE
real*8 genrand_real2, magne, energ, energia, temp, temp1, DE
real*8 mag, w(-8:8)
real*8 start, finish
real*8 sum, sume, sume2, summ, summ2, sumam, vare, varm, maxcv, maxx
real*8 cv, x, Tmaxcv, Tmaxx
integer irand, jrand
11 format(10(f20.6))
! Initialize variables
mctot = 80000
mcd = 20
mcini = 8000
N = L*L
seed0 = 20347880
nseed = 20
maxcv=0.d0
maxx=0.d0
! Initialize vector pbc
pbc(0) = L
pbc(L+1) = 1
do i = 1, L
pbc(i) = i
end do
! Initialize matrix s with random values
do i = 1, L
do j = 1, L
if (genrand_real2() < 0.5) then
s(i,j) = 1
else
s(i,j) = -1
endif
end do
end do
! Metropolis algorithm
open(1, file='Expectation values.dat')
start = omp_get_wtime()
write(1,*) '#Temp, ','E, ','E2, ','M, ','M2, ','|M|, ','VarE, ','VarM, ',&
'Cv, ','X, '
!Start loop to calculate for different temperatures
!$OMP PARALLEL PRIVATE(s,seed,w,energia,difE,irand,jrand,temp,mag,sum,sume,sume2,summ,summ2,sumam,vare,varm,cv,x)
temp1 = 1.59d0
!$OMP DO ordered schedule(dynamic)
do k = 1, 10
temp = temp1 + (0.01d0*k)
!Define the matrix w, which contains the values of the Boltzmann function for each temperature, so as not to have to calculate them each iteration
do i = -8, 8
w(i) = dexp(-i/temp)
end do
write(*,*) "Temperature: ", temp, "Thread", omp_get_thread_num()
sum = 0.d0
sume = 0.d0
sume2 = 0.d0
summ = 0.d0
summ2 = 0.d0
sumam = 0.d0
do seed = seed0, seed0 + nseed-1, 1
call init_genrand(seed)
call reinicia(s,l)
energia = energ(s,l,pbc)
do i = 1, mctot
do j = 1, N
irand = int(genrand_real2()*L) + 1
jrand = int(genrand_real2()*L) + 1
difE = int(DE(s,l,irand,jrand,pbc))
if (difE < 0) then
s(irand,jrand) = -s(irand,jrand)
energia = energia + difE
else if (genrand_real2() < w(int(difE))) then
s(irand,jrand) = -s(irand,jrand)
energia = energia + difE
endif
end do
if ((i > mcini).and.(mcd*(i/mcd)==i)) then
mag= magne(s,l)
sum = sum + 1.d0
sume = sume + energia
sume2 = sume2 + energia**2
summ = summ + mag
summ2 = summ2 + mag**2
sumam = sumam + abs(mag)
endif
end do
end do
!Energy
sume=sume/(sum*N)
sume2=sume2/(sum*N*N)
!Magnetitzation
summ = summ/(sum*N)
sumam=sumam/(sum*N)
summ2=summ2/(sum*N*N)
!Variances
vare = dsqrt(sume2-sume*sume)/dsqrt(sum)
varm = dsqrt(summ2-summ*summ)/dsqrt(sum)
!Cv
cv = (N*(sume2-sume*sume))/temp**2
if (cv.gt.maxcv) then
maxcv=cv
Tmaxcv=temp
endif
!X
x = (N*(summ2-summ*summ))/temp
if (x.gt.maxx) then
maxx=x
Tmaxx=temp
endif
write(1,11) temp,sume,sume2,summ,summ2,sumam,vare,varm,cv,x
end do
!$OMP END DO
!$OMP END PARALLEL
finish = omp_get_wtime()
close(1)
print*, "Time: ",(finish-start),"Seconds"
end program Ising
! Functions
!Function that calculates the energy of the matrix s
real*8 function energ(S,L, pbc)
implicit none
integer s(1:L, 1:L), i, j, L
integer*4 pbc(0:L+1)
real*8 ene
ene = 0.0d0
do i = 1, L
do j = 1, L
ene = ene - s(i,j) * s(pbc(i+1),j) - s(i,j) * s(i,pbc(j+1))
end do
end do
energ = ene
return
end function energ
!Function that calculates the difference in energy that occurs when the spin of position (i, j) is changed
real*8 function DE(S,L,i,j,pbc)
implicit none
integer s(1:L, 1:L), i, j, L, difE
integer*4 pbc(0:L+1)
real*8 suma
difE = 0
suma = 0.0d0
suma = suma + s(pbc(i-1),j) + s(pbc(i+1),j) + s(i,pbc(j-1)) + s(i,pbc(j+1))
difE = difE + int(2 * s(i,j) * suma)
DE = difE
return
end function DE
!Function that calculates the magnetization of the matrix s
real*8 function magne(S,L)
implicit none
integer s(1:L, 1:L),L
magne = sum(s)
return
end function magne
! SUBRUTINES
!Subroutine that resets the matrix s with random values
subroutine reinicia(S,L)
implicit none
integer s(1:L, 1:L), i,j,L
real*8 genrand_real2
do i = 1, L
do j = 1, L
if (genrand_real2() < 0.5) then
s(i,j) = 1
else
s(i,j) = -1
endif
end do
end do
return
end subroutine
I have tried parallelizing the seeds loop instead of the temperatures, but it lasts almost the same, so i think i'm not parallelizing it correctly, because it looks a nice code to parallelize.
The other option I thought of is to manually parallelize the simulation. I could do this by compiling 16 programs, each of which handles a different range of temperatures. Then I could run all of the programs concurrently, so each program would get its own thread. However, this approach would require a lot of extra RAM.

Related

How to reduce the allocations in Julia?

I am starting to use Julia mainly because of its speed. Currently, I am solving a fixed point problem. Although the current version of my code runs fast I would like to know some methods to improve its speed.
First of all, let me summarize the algorithm.
There is an initial seed called C0 that maps from the space (b,y) into an action space c, then we have C0(b,y)
There is a formula that generates a rule Ct from C0.
Then, using an additional restriction, I can obtain an updating of b [let's called it bt]. Thus,it generates a rule Ct(bt,y)
I need to interpolate the previous rule to move from the grid bt into the original grid b. It gives me an update for C0 [let's called that C1]
I will iterate until the distance between C1 and C0 is below a convergence threshold.
To implement it I created two structures:
struct Parm
lC::Array{Float64, 2} # Lower limit
uC::Array{Float64, 2} # Upper limit
γ::Float64 # CRRA coefficient
δ::Float64 # factor in the euler
γ1::Float64 #
r1::Float64 # inverse of the gross interest rate
yb1::Array{Float64, 2} # y - b(t+1)
P::Array{Float64, 2} # Transpose of transition matrix
end
mutable struct Upd1
pol::Array{Float64,2} # policy function
b::Array{Float64, 1} # exogenous grid for interpolation
dif::Float64 # updating difference
end
The first one is a set of parameters while the second one stores the decision rule C1. I also define some functions:
function eulerm(x::Upd1,p::Parm)
ct = p.δ *(x.pol.^(-p.γ)*p.P).^(-p.γ1); #Euler equation
bt = p.r1.*(ct .+ p.yb1); #Endeogenous grid for bonds
return ct,bt
end
function interp0!(bt::Array{Float64},ct::Array{Float64},x::Upd1, p::Parm)
polold = x.pol;
polnew = similar(x.pol);
#inbounds #simd for col in 1:size(bt,2)
F1 = LinearInterpolation(bt[:,col], ct[:,col],extrapolation_bc=Line());
polnew[:,col] = F1(x.b);
end
polnew[polnew .< p.lC] .= p.lC[polnew .< p.lC];
polnew[polnew .> p.uC] .= p.uC[polnew .> p.uC];
dif = maximum(abs.(polnew - polold));
return polnew,dif
end
function updating!(x::Upd1,p::Parm)
ct, bt = eulerm(x,p); # endogeneous grid
x.pol, x.dif = interp0!(bt,ct,x,p);
end
function conver(x::Upd1,p::Parm)
while x.dif>1e-8
updating!(x,p);
end
end
The first formula implements steps 2 and 3. The third one makes the updating (last part of step 4), and the last one iterates until convergence (step 5).
The most important function is the second one. It makes the interpolation. While I was running the function #time and #btime I realized that the largest number of allocations are in the loop inside this function. I tried to reduce it by not defining polnew and goes directly to x.pol but in this case, the results are not correct since it only need two iterations to converge (I think that Julia is thinking that polold is exactly the same than x.pol and it is updating both at the same time).
Any advice is well received.
To anyone that wants to run it by themselves, I add the rest of the required code:
function rouwen(ρ::Float64, σ2::Float64, N::Int64)
if (N % 2 != 1)
return "N should be an odd number"
end
sigz = sqrt(σ2/(1-ρ^2));
zn = sigz*sqrt(N-1);
z = range(-zn,zn,N);
p = (1+ρ)/2;
q = p;
Rho = [p 1-p;1-q q];
for i = 3:N
zz = zeros(i-1,1);
Rho = p*[Rho zz; zz' 0] + (1-p)*[zz Rho; 0 zz'] + (1-q)*[zz' 0; Rho zz] + q *[0 zz'; zz Rho];
Rho[2:end-1,:] = Rho[2:end-1,:]/2;
end
return z,Rho;
end
#############################################################
# Parameters of the model
############################################################
lb = 0; ub = 1000; pivb = 0.25; nb = 500;
ρ = 0.988; σz = 0.0439; μz =-σz/2; nz = 7;
ϕ = 0.0; σe = 0.6376; μe =-σe/2; ne = 7;
β = 0.98; r = 1/400; γ = 1;
b = exp10.(range(start=log10(lb+pivb), stop=log10(ub+pivb), length=nb)) .- pivb;
#=========================================================
Algorithm
======================================================== =#
(z,Pz) = rouwen(ρ,σz, nz);
μZ = μz/(1-ρ);
z = z .+ μZ;
(ee,Pe) = rouwen(ϕ,σe,ne);
ee = ee .+ μe;
y = exp.(vec((z .+ ee')'));
P = kron(Pz,Pe);
R = 1 + r;
r1 = R^(-1);
γ1 = 1/γ;
δ = (β*R)^(-γ1);
m = R*b .+ y';
lC = max.(m .- ub,0);
uC = m .- lb;
by1 = b .- y';
# initial guess for C0
c0 = 0.1*(m);
# Set of parameters
pp = Parm(lC,uC,γ,δ,γ1,r1,by1,P');
# Container of results
up1 = Upd1(c0,b,1);
# Fixed point problem
conver(up1,pp)
UPDATE As it was reccomend, I made the following changes to the third function
function interp0!(bt::Array{Float64},ct::Array{Float64},x::Upd1, p::Parm)
polold = x.pol;
polnew = similar(x.pol);
#inbounds for col in 1:size(bt,2)
F1 = LinearInterpolation(#view(bt[:,col]), #view(ct[:,col]),extrapolation_bc=Line());
polnew[:,col] = F1(x.b);
end
for j in eachindex(polnew)
polnew[j] < p.lC[j] ? polnew[j] = p.lC[j] : nothing
polnew[j] > p.uC[j] ? polnew[j] = p.uC[j] : nothing
end
dif = maximum(abs.(polnew - polold));
return polnew,dif
end
This leads to an improvement in the speed (from ~1.5 to ~1.3 seconds). And a reduction in the number of allocations. Somethings that I noted were:
Changing from polnew[:,col] = F1(x.b) to polnew[:,col] .= F1(x.b) can reduce the total allocations but the time is slower, why is that?
How should I understand the difference between #time and #btime. For this case, I have:
up1 = Upd1(c0,b,1);
#time conver(up1,pp)
1.338042 seconds (385.72 k allocations: 1.157 GiB, 3.37% gc time)
up1 = Upd1(c0,b,1);
#btime conver(up1,pp)
4.200 ns (0 allocations: 0 bytes)
Just to be precise, in both cases, I run it several times and I choose representative numbers for each line.
Does it mean that all the time is due allocations during the compilation?

Start going through the "performance tips" as advised by #DNF but below you will find most important comments for your code.
Vectorize vector assignments - a small dot makes big difference
julia> julia> a = rand(3,4);
julia> #btime $a[3,:] = $a[3,:] ./ 2;
40.726 ns (2 allocations: 192 bytes)
julia> #btime $a[3,:] .= $a[3,:] ./ 2;
20.562 ns (1 allocation: 96 bytes)
Use views when doing something with subarrays:
julia> #btime sum($a[3,:]);
18.719 ns (1 allocation: 96 bytes)
julia> #btime sum(#view($a[3,:]));
5.600 ns (0 allocations: 0 bytes)
Your code around a lines polnew[polnew .< p.lC] .= p.lC[polnew .< p.lC]; will make much less allocations when you do it with a for loop over each element of polnew
#simd will have no effect on conditionals (point 3) neither when code is calling complex external functions

I want to give an update about this problem. I made two main changes to my code: (i) I define my own linear interpolation function and (ii) I include the check of bounds in the interpolation.
With this the new function three is
function interp0!(bt::Array{Float64},ct::Array{Float64},x::Upd1, p::Parm)
polold = x.pol;
polnew = similar(x.pol);
#inbounds #views for col in 1:size(bt,2)
polnew[:,col] = myint(bt[:,col], ct[:,col],x.b[:],p.lC[:,col],p.uC[:,col]);
end
dif = maximum(abs.(polnew - polold));
return polnew,dif
end
And the interpolation is now:
function myint(x0,y0,x1,ly,uy)
y1 = similar(x1);
n = size(x0,1);
j = 1;
#simd for i in eachindex(x1)
while (j <= n) && (x1[i] > x0[j])
j+=1;
end
if j == 1
y1[i] = y0[1] + ((y0[2]-y0[1])/(x0[2]-x0[1]))*(x1[i]-x0[1]) ;
elseif j == n+1
y1[i] = y0[n] + ((y0[n]-y0[n-1])/(x0[n]-x0[n-1]))*(x1[i]-x0[n]);
else
y1[i] = y0[j-1]+ ((x1[i]-x0[j-1])/(x0[j]-x0[j-1]))*(y0[j]-y0[j-1]);
end
y1[i] > uy[i] ? y1[i] = uy[i] : nothing;
y1[i] < ly[i] ? y1[i] = ly[i] : nothing;
end
return y1;
end
As you can see, I am taking advantage (and assuming) that both vectors that we use as basis are ordered while the two last lines in the outer loops checks the bounds imposed by lC and uC.
With that I get the following total time
up1 = Upd1(c0,b,1);
#time conver(up1,pp)
0.734630 seconds (28.93 k allocations: 752.214 MiB, 3.82% gc time)
up1 = Upd1(c0,b,1);
#btime conver(up1,pp)
4.200 ns (0 allocations: 0 bytes)
which is almost as twice faster with ~8% of the total allocations. the use of views in the loop of the function interp0! also helps a lot.

Replace the fmincon function with another optimization algorithm

This source code is an implementation for the epsilon-constraint method. How can I replace the fmincon() function with PSO or GA optimization algorithm (I do not want to use a build-in function).
This code for the main function
x0 = [1 1]; % Starting point
UB = [1 1]; % Upper bound
LB = [0 0]; % Lower bound
options = optimset('LargeScale', 'off', 'MaxFunEvals', 1000, ...
'TolFun', 1e-6, 'TolCon', 1e-6, 'disp', 'off');
% Create constraint bound vector:
n = 50; % Number of Pareto points
eps_min = -1;
eps_max = 0;
eps = eps_min:(eps_max - eps_min)/(n-1):eps_max;
% Solve scalarized problem for each epsilon value:
xopt = zeros(n,length(x0));
for i=1:n
xopt(i,:)=fmincon('obj_eps', x0, [], [], [], [], LB, UB,...
'nonlcon_eps', options, eps(i));
end
This is the constraints function:
function [C,constraintViolation] = nonlcon_eps(x, eps)
constraintViolation= 0;
Ceq = [];
C(1) =x(2)+(x(1)-1)^3;
if C(1) > 0
constraintViolation= constraintViolation+ 1;
end
C(2) = -x(1) - eps;
if C(2) > 0
constraintViolation= constraintViolation+ 1;
end
This is the objective function:
function f = obj_eps(x, ~)
f = 2*x(1)-x(2);
I have replaced this part:
for i=1:n
xopt(i,:)=fmincon('obj_eps', x0, [], [], [], [], LB, UB,'nonlcon_eps', options, eps(i));
end
with this:
maxIteration = 1000;
dim = 2;
n = 50; % Number of Pareto points
eps_min = -1;
eps_max = 0;
EpsVal = eps_min:(eps_max - eps_min)/(n-1):eps_max;
for i=1:n
[gbest]= PSOalgo(N,T,lb,ub,dim,fobj,fcon,EpsVal(i));
end
function [gbest]= PSOalgo(N,maxite,lb,ub,dim,fobj,fcon,EpsVal)
% initialization
wmax=0.9; % inertia weight
wmin=0.4; % inertia weight
c1=2; % acceleration factor
c2=2; % acceleration factor
% pso initialization
X=initialization(N,dim,ub,lb);
v = 0.1*X; % initial velocity
for i=1:N
fitnessX(i,1)= fobj(X(i,:));
end
[fmin0,index0]= min(fitnessX);
pbest= X; % initial pbest
pbestfitness = fitnessX;
gbest= X(index0,:); % initial gbest
gbestfitness = fmin0;
ite=0; % Loop counter
while ite<maxite
w=wmax-(wmax-wmin)*ite/maxite; % update inertial weight
% pso velocity updates
for i=1:N
for j=1:dim
v(i,j)=w*v(i,j)+c1*rand()*(pbest(i,j)- X(i,j)) + c2*rand()*(gbest(1,j)- X(i,j));
end
end
% pso position update
for i=1:N
for j=1:dim
X(i,j)= X(i,j)+v(i,j);
end
% Check boundries
FU=X(i,:)>ub;
FL=X(i,:)<lb;
X(i,:)=(X(i,:).*(~(FU+FL)))+ub.*FU+lb.*FL;
% evaluating fitness
fitnessX(i,1) = fobj(X(i,:));
[~,constraintViolation(i,1)] = fcon(X(i,:), EpsVal);
end
% updating pbest and fitness
for i=1:N
if fitnessX(i,1) < pbestfitness(i,1) && constraintViolation(i,1) == 0
pbest(i,:)= X(i,:);
pbestfitness(i,1)= fitnessX(i,1);
end
[~,constraintViolation(i,1)] = fcon(pbest(i,:), EpsVal);
end
% updating gbest and best fitness
for i=1:N
if pbestfitness(i,1)<gbestfitness && constraintViolation(i,1) == 0
gbest=pbest(i,:);
gbestfitness= pbestfitness(i,1);
end
end
ite = ite+1;
end
end
However, when I run the code the output is only one solution, while it should be 50 (n=50). This because all other solution do not satisfy the constraints. How can modify the code to get one solution at each run(n=1:50)
UPDATE:
I have included PSOalgo() code and did some modification. The output now is 50 solutions.However, the obtained result is not correct.
fmincon () result:
PSO result:

Bellard's algorithm for calculating Pi

For a university project I wanted to implement Bellard's algorithm for calculating the n-th digit of pi in Fortran. I stumbled across this question on math.stackexchange: https://math.stackexchange.com/questions/1776840/confusion-with-bellards-algorithm-for-pi
With the answer to that question I managed to implement the code, but I'm not getting a result and I can't figure out what I'm doing wrong:
PROGRAM pi
IMPLICIT NONE
INTEGER :: find_number_of_primes, number_primes, last_digit, digit, N, &
eps, i, j, multiplicity, test
REAL :: log2, base, v_max, m, s, v, b, A, pi_sum
INTEGER, ALLOCATABLE :: primes(:)
digit = 3
eps = 2
base = 10
N = INT((digit+eps)*log2(base))
pi_sum = 0
number_primes = find_number_of_primes(2*N)
ALLOCATE(primes(number_primes))
CALL get_primes(number_primes, primes)
DO i=1,SIZE(primes)
v_max = INT(log(REAL(2*N))/log(REAL(primes(i))))
m = primes(i)**v_max
s = 0
v = 0
b = 1
A = 1
DO j = 1,(N)
b = MOD((j/(primes(i)**multiplicity(digit, j)) * b), m)
A = MOD((2*j-1)/(primes(i)**multiplicity(primes(i), (2*j-1)))*A, m)
v = v - multiplicity(primes(i),j)+multiplicity(primes(i),(2*j-1))
IF (v > 0) THEN
s = MOD(s*j*b*(A**(-1))*a**(v_max-v), m)
END IF
END DO
s = MOD(s * base**(digit-1), m)
pi_sum = pi_sum + MOD((s/m), REAL(1))
END DO
PRINT*, pi_sum
END PROGRAM
The custom functions find_number_of_primes, get_primes, log2 and multiplicityare working as intended. First one finds the number of prime numbers in the given interval, second one returns the prime numbers in an array, the third one calculates
log_2(x) = log(x)/log(2)
and the last one calculates the multiplicity (I guess that's what it is called) by checking how often the second number has to be divided by the first number until the rest of the division is no longer zero. Here are the source codes:
The logarithm:
real function log2(x)
implicit none
real, intent(in) :: x
log2 = log(x) / log(2.)
end function
Finding the number of prime numbers in interval:
integer function find_number_of_primes(limit) result(number_primes)
implicit none
INTEGER, INTENT(IN) :: limit
INTEGER :: i, j
LOGICAL :: is_prime
number_primes = 0
DO i = 3,(limit-1)
is_prime = .TRUE.
DO j = 2, (i-1)
IF (MODULO(i, j) == 0) THEN
is_prime = .FALSE.
EXIT
END IF
END DO
IF (is_prime .EQV. .TRUE.) THEN
number_primes = number_primes + 1
END IF
END DO
end function find_number_of_primes
Getting the actual prime numbers:
SUBROUTINE get_primes(number_primes, primes)
IMPLICIT NONE
INTEGER, INTENT(IN) :: number_primes
INTEGER :: i, found_primes, j
LOGICAL:: is_prime
INTEGER, INTENT(OUT) :: primes(number_primes)
i = 3
found_primes = 0
DO
is_prime = .TRUE.
DO j = 2, (i-1)
IF (MODULO(i,j) == 0) THEN
is_prime = .FALSE.
END IF
END DO
IF (is_prime .EQV. .TRUE.) THEN
found_primes = found_primes + 1
primes(found_primes) = i
IF (found_primes == number_primes) EXIT
END IF
i = i + 1
END DO
end SUBROUTINE get_primes
integer function multiplicity(a, b)
implicit none
INTEGER, INTENT(IN) :: a, b
multiplicity = 0
DO
multiplicity = multiplicity + 1
IF (MOD(b, (a**(multiplicity+1))) /= 0) THEN
EXIT
END IF
END DO
end function multiplicity
Pastebin for the whole file: https://pastebin.com/0F4zQYaH
Now in the question that I've linked, at the calculation of b in the inner loop there is a^{v(n, k)} in the denominator, but in the answer to the question the term in the denominator is a^{v(a, k)}. Also, the inner loop in the question runs from 1 to N while the loop in the answer goes from 1 to 2N.
The final result is NaN, here is some console output for digit = 2 and eps = 1:
************ i = 1 ***************************
v_max = 2.00000000
m = 9.00000000
------------- j = 1 -------------------------------
b = 0.00000000
A = 0.00000000
v = 0.00000000
------------- j = 2 -------------------------------
b = 0.00000000
A = 0.00000000
v = 0.00000000
------------- j = 3 -------------------------------
b = 0.00000000
A = 0.00000000
v = 0.00000000
------------- j = 4 -------------------------------
b = 0.00000000
A = 0.00000000
v = 0.00000000
------------- j = 5 -------------------------------
b = 0.00000000
A = 0.00000000
v = 1.00000000
v > 0 --> s = NaN
------------- j = 6 -------------------------------
b = 0.00000000
A = 0.00000000
v = 1.00000000
v > 0 --> s = NaN
and so one, complete output: https://pastebin.com/ucJNg6Vi
Can anybody help me figure out what I'm doing wrong?

There is a bug in your code:
v = v - multiplicity(primes(i),j)+multiplicity(primes(i),(2*j-1))
should be
v = v - multiplicity(primes(i),j)- multiplicity(primes(i),(2*j-1))
Also,
s = MOD(s*j*b*(A**(-1))*a**(v_max-v), m)
should be
s = MOD(s + j*b*(A**(-1))*a**(v_max-v), m)

Vectorized code slower than loops? MATLAB

In the problem Im working on there is such a part of code, as shown below. The definition part is just to show you the sizes of arrays. Below I pasted vectorized version - and it is >2x slower. Why it happens so? I know that i happens if vectorization requiers large temporary variables, but (it seems) it is not true here.
And generally, what (other than parfor, with I already use) can I do to speed up this code?
maxN = 100;
levels = maxN+1;
xElements = 101;
umn = complex(zeros(levels, levels));
umn2 = umn;
bessels = ones(xElements, xElements, levels); % 1.09 GB
posMcontainer = ones(xElements, xElements, maxN);
tic
for j = 1 : xElements
for i = 1 : xElements
for n = 1 : 2 : maxN
nn = n + 1;
mm = 1;
for m = 1 : 2 : n
umn(nn, mm) = bessels(i, j, nn) * posMcontainer(i, j, m);
mm = mm + 1;
end
end
end
end
toc % 0.520594 seconds
tic
for j = 1 : xElements
for i = 1 : xElements
for n = 1 : 2 : maxN
nn = n + 1;
m = 1:2:n;
numOfEl = ceil(n/2);
umn2(nn, 1:numOfEl) = bessels(i, j, nn) * posMcontainer(i, j, m);
end
end
end
toc % 1.275926 seconds
sum(sum(umn-umn2)) % veryfying, if all done right
Best regards,
Alex
From the profiler:
Edit:
In reply to #Jason answer, this alternative takes the same time:
for n = 1:2:maxN
nn(n) = n + 1;
numOfEl(n) = ceil(n/2);
end
for j = 1 : xElements
for i = 1 : xElements
for n = 1 : 2 : maxN
umn2(nn(n), 1:numOfEl(n)) = bessels(i, j, nn(n)) * posMcontainer(i, j, 1:2:n);
end
end
end
Edit2:
In reply to #EBH :
The point is to do the following:
parfor i = 1 : xElements
for j = 1 : xElements
umn = complex(zeros(levels, levels)); % cleaning
for n = 0:maxN
mm = 1;
for m = -n:2:n
nn = n + 1; % for indexing
if m < 0
umn(nn, mm) = bessels(i, j, nn) * negMcontainer(i, j, abs(m));
end
if m > 0
umn(nn, mm) = bessels(i, j, nn) * posMcontainer(i, j, m);
end
if m == 0
umn(nn, mm) = bessels(i, j, nn);
end
mm = mm + 1; % for indexing
end % m
end % n
beta1 = sum(sum(Aj1.*umn));
betaSumSq1(i, j) = abs(beta1).^2;
beta2 = sum(sum(Aj2.*umn));
betaSumSq2(i, j) = abs(beta2).^2;
end % j
end % i
I speeded it up as much, as I was able to. What you have written is taking only the last bessels and posMcontainer values, so it does not produce the same result. In the real code, those two containers are filled not with 1, but with some precalculated values.

After your edit, I can see that umn is just a temporary variable for another calculation. It still can be mostly vectorizable:
betaSumSq1 = zeros(xElements); % preallocating
betaSumSq2 = zeros(xElements); % preallocating
% an index matrix to fetch the right values from negMcontainer and
% posMcontainer:
indmat = tril(repmat([0 1;1 0],ceil((maxN+1)/2),floor(levels/2)));
indmat(end,:) = [];
% an index matrix to fetch the values in correct order for umn:
b_ind = repmat([1;0],ceil((maxN+1)/2),1);
b_ind(end) = [];
tempind = logical([fliplr(indmat) b_ind indmat+triu(ones(size(indmat)))]);
% permute the arrays to prevent squeeze:
PM = permute(posMcontainer,[3 1 2]);
NM = permute(negMcontainer,[3 1 2]);
B = permute(bessels,[3 1 2]);
for k = 1 : maxN+1 % third dim
for jj = 1 : xElements % columns
b = B(:,jj,k); % get one vector of B
% perform b*NM for every row of NM*indmat, than flip the result:
neg = fliplr(bsxfun(#times,bsxfun(#times,indmat,NM(:,jj,k).'),b));
% perform b*PM for every row of PM*indmat:
pos = bsxfun(#times,bsxfun(#times,indmat,PM(:,jj,k).'),b);
temp = [neg mod(1:levels,2).'.*b pos].'; % concat neg and pos
% assign them to the right place in umn:
umn = reshape(temp(tempind.'),[levels levels]).';
beta1 = Aj1.*umn;
betaSumSq1(jj,k) = abs(sum(beta1(:))).^2;
beta2 = Aj2.*umn;
betaSumSq2(jj,k) = abs(sum(beta2(:))).^2;
end
end
This reduce running time from ~95 seconds to less 3 seconds (both without parfor), so it improves in almost 97%.

I would suspect it is memory allocation. You are re-allocating the m array in a 3 deep loop.
try rearranging the code:
tic
for n = 1 : 2 : maxN
nn = n + 1;
m = 1:2:n;
numOfEl = ceil(n/2);
for j = 1 : xElements
for i = 1 : xElements
umn2(nn, 1:numOfEl) = bessels(i, j, nn) * posMcontainer(i, j, m);
end
end
end
toc % 1.275926 seconds
I was trying this in Igor pro, which a similar language, but with different optimizations. So the direct translations don't time the same way as Matlab (vectorized was slightly faster in Igor). But reordering the loops did speed up the vectorized form.
In your second part of the code, that is setting umn2, inside the loops, you have:
nn = n + 1;
m = 1:2:n;
numOfEl = ceil(n/2);
Those 3 lines don't require any input from the i and j loops, they only use the n loop. So reordering the loops such that i and j are inside the n loop will mean that those 3 lines are done xElements^2 (100^2) times less often. I suspect it is that m = 1:2:n line that takes time, since that is allocating an array.

How to keep the distance between $n$ particles within a certain range?

I am working on a problem in Molecular Dynamics and need to randomly generate a position array for np particles within a box of size [-L,L] x [-L,L]. In fact, I need to generate the x-array for the x-coordinates with x(1) = 0 and the y-array for the y-coordinates with y(1)=y(2) =0. I need the particles to be such that the distances between neighboring particles are within some range (e.g: 0.9 <= r <= 1.1) like in the following picture:
However in my code I get something like this:
See how the red lines are larger than what I want.
My code is
REAL, DIMENSION(np) :: x, y
REAL :: w1, w2, minv, maxv, xij, yij, rij
INTEGER :: i, j
!Generating random coordinates for the particles
x(1) = 0.0d0
y(1) = 0.0d0
y(2) = 0.0d0
!-------------------------------------------------------------------------
! translation and rotaion of the whole system were froze (saving 4 degrees of
! freedome)
! x(1) = 0.0d0; y(1) = 0.0d0 fix one particle in the origin
! y(2) = 0.0d0 fix the second particle on the x-axis
!-------------------------------------------------------------------------
rmatrix = 100.0
minv = 0.0
maxv = 10
iter0 = 0
101 DO WHILE(maxv >= 1.1 .OR. minv <= 0.9)
iter0 = iter0 + 1
PRINT *, iter0
CALL init_random_seed()
DO i = 2, np
CALL RANDOM_NUMBER(w1)
x(i) = 10 * w1 - 5
END DO
DO i = 3, np
CALL RANDOM_NUMBER(w2)
y(i) = 10 * w2 - 5
END DO
! rmatrix contains the distances between all particles
DO i = 1, np
DO j = 1, np
IF(j .NE. i) THEN
xij = x(i) - x(j)
yij = y(i) - y(j)
rij = SQRT(xij * xij + yij * yij)
rmatrix(i,j) = rij
END IF
END DO
END DO
minv = MINVAL(rmatrix) ! This is the minimum distance between any two
! particles ( distance cannot be smaller)
! which is the left endpoint of the range interval
DO i = 1, np ! Here is my attempt to control the righ endpoint of
DO j = 1, np ! the range interval. ( This needs to be edited)
IF(j .NE. i) THEN
maxv = MIN(maxv, rmatrix(i,j))
END IF
END DO
IF(maxv >= 1.1) THEN
GOTO 101
END IF
END DO
END DO
CONTANIS
SUBROUTINE init_random_seed()
INTEGER :: i, n, clock
INTEGER, DIMENSION(:), ALLOCATABLE :: seed
CALL RANDOM_SEED(size = n)
ALLOCATE(seed(n))
CALL SYSTEM_CLOCK(COUNT=clock)
seed = clock + 37 * (/ (i - 1, i = 1, n) /)
CALL RANDOM_SEED(PUT = seed)
END SUBROUTINE init_random_seed

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

OpenMP on fortran 90 lasts almost the same(if not more) as non parallelized program - parallel-processing

Related

How to reduce the allocations in Julia?

Replace the fmincon function with another optimization algorithm

Bellard's algorithm for calculating Pi

Vectorized code slower than loops? MATLAB

How to keep the distance between $n$ particles within a certain range?

Categories

Resources