How can I use basic tricks to improve my Julia code's performance?

I'm relatively new to Julia and I'm currently using version 1.0. I have a piece of code that is intended to produce a sequence of integers, based on an input matrix. The code takes 3 hours to run on my machine (i5, dual core, 16 GB RAM), using 16% of the CPU and 3% of the memory. Are there any basic tips I can learn and apply to optimize my Julia code and improve its performance? Does indentation have an effect on performance? Is there a package that can track my code and suggest improvements? I provide my code below. It includes an R script that generates the data to which the Julia code is applied. If an error occurs in the R part, it just means the simulation did not succeed and it must be run again until the simulation completes.
using Distances
using RCall
using Distributions
using BSON: @save, @load
using StatsBase
using LinearAlgebra
R"simul<-function(m){
comb<-expand.grid(c(0.01,0.2,0.4),
c(sample(2:7,1),sample(8:12,1),sample(13:20,1)),
c(sample(2:5,1),sample(6:10,1),sample(11:20,1)),
c(150,500,1500))
gener<-function(i){
maxoverlap<-comb[i,1]
nbvar<-comb[i,2]
nbclass<-comb[i,3]
propmix<-runif(1,0.001,1/nbclass)
Q<-MixSim(MaxOmega = maxoverlap, K = nbclass, p = nbvar,PiLow = propmix,resN = 1000)
A <- simdataset(n = comb[i,4], Pi = Q$Pi, Mu = Q$Mu, S = Q$S)
results<-list(Q,A)
return(results)
}
donnees<-sapply(1:nrow(comb),gener)
}
library(MixSim)
donneesimul=simul(1)"
@rget donneesimul
function pointsdpp(t)
    datasim = donneesimul[2, t][:X]
    # squared pairwise Euclidean distances between observations
    Eucldist = pairwise(Euclidean(), transpose(datasim))
    D = maximum(Eucldist.^2)
    sigma2hat = mean(((Eucldist.^2) ./ D)[tril!(trues(size((Eucldist.^2) ./ D)), -1)])
    # Gaussian (RBF) similarity kernel and its eigendecomposition
    L = exp.(-(Eucldist.^2 / D) / (2 * sigma2hat))
    eigenv = eigvals(L)
    prob = eigenv ./ (eigenv .+ 1)
    eigenvectors = eigvecs(L)
    # draw one sample (a set of indices) from the DPP defined by L
    function sampledpp(m)
        u = rand(size(L, 1))
        V = eigenvectors[:, findall(u .<= prob)]
        k = size(V, 2)
        Y = zeros(Int64, k)
        for i = k:-1:1
            P = sum(V.^2, dims = 2)
            Pri = P / sum(P)
            Cumpri = cumsum(Pri, dims = 1)
            u = rand()
            Y[i] = findfirst(u .<= Cumpri)[1]
            if i == 1 break end
            j = findfirst(V[Y[i], :] .!= 0)
            Vj = V[:, j]
            V = V[:, deleteat!(collect(1:1:size(V, 2)), j)]
            V = V - repeat(Vj, 1, size(V, 2)) .* repeat(transpose(V[Y[i], :] / Vj[Y[i]]), size(V, 1))
            # Gram-Schmidt re-orthonormalization of the remaining columns
            for a = 1:i-1
                for b = 1:a-1
                    V[:, a] = V[:, a] - transpose(V[:, a]) * V[:, b] * V[:, b]
                end
                V[:, a] = V[:, a] / norm(V[:, a])
            end
        end
        Y = sort(Y)
        return(Y)
    end
    m = collect(1:1000)
    sampleY_repet = map(sampledpp, m)
end
w=collect(1:1:81)
echantdpp=map(pointsdpp,w)
#save "echantdppdatasim1.bson" echantdpp

There are many issues to consider when evaluating Julia performance. The code you provided is far from a minimal working example (MWE) and is not reproducible either.
However, here are some general guidelines:
Take some time to read the Julia performance tips carefully and apply them.
Since you process arrays, your code will likely benefit from the @simd macro. Using array views is also very often low-hanging fruit for code like yours.
You use 16% of CPU power (your machine likely has several logical cores and your program uses just one). Consider using either multi-threading or multiprocessing; your program could run many times faster (a minimal sketch follows this list of tips).
For some scenarios you might consider GPU computing, e.g. with Flux.jl.
Consider moving your multi-core computation to the cloud (Julia scales very well on AWS EC2 instances).
Since each of those topics is a big area on its own, work on your code step by step and ask questions along the way to get help.
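To illustrate the multi-threading and array-view points, here is a minimal, hypothetical sketch (not taken from your code): work_on_column and the matrix A are stand-ins for your own per-sample computation and data. It shows @views to avoid copying slices and Threads.@threads to spread independent iterations over cores (start Julia with the environment variable JULIA_NUM_THREADS set to the number of cores you want to use):

# hypothetical expensive kernel applied independently to each column
function work_on_column(col)
    s = 0.0
    @simd for i in eachindex(col)
        @inbounds s += col[i]^2
    end
    return sqrt(s)
end

function process_columns(A)
    results = Vector{Float64}(undef, size(A, 2))
    # each column is independent, so the loop can be split across threads;
    # @views avoids allocating a copy of every column
    Threads.@threads for j in 1:size(A, 2)
        @views results[j] = work_on_column(A[:, j])
    end
    return results
end

A = rand(1000, 200)
process_columns(A)   # run once to compile, then measure with @time

The same pattern would apply to the map(sampledpp, m) call in your code, since each of the 1000 draws is independent, with the caveat that each thread should then use its own random-number generator state rather than the shared global one.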

Related

Poor performance in matlab

So I had to write a program in Matlab to calculate the convolution of two functions manually. I wrote this simple piece of code that I know is probably not that optimized:
syms recP(x);
recP(x) = rectangularPulse(-1,1,x);
syms triP(x);
triP(x) = triangularPulse(-1,1,x);
t = -10:0.1:10;
s1 = -10:0.1:10;
for i = 1:201
    s1(i) = 0;
    for j = t
        s1(i) = s1(i) + ( recP(j) * triP(t(i)-j) );
    end
end
plot(t,s1);
I have a Core i7-7700HQ coupled with 32 GB of RAM. Matlab is stored on my HDD and my Windows is on my SSD. The problem is that this simple code takes, I think, at least 20 minutes to run. I have it in a section and I don't run the whole code. Matlab is only using 18% of my CPU and 3 GB of RAM for this task, which I think is probably enough, I don't know, but I don't think it should take that long.
Am I doing anything wrong? I've searched for how to increase the RAM limit of Matlab, and I found that it is not limited and it takes as much as it needs. I don't know whether I can increase its CPU usage or not.
Is there any solution to make things a little bit faster? I have 6 or 7 of these for loops in my homework and it takes forever if I run the whole live script. Thanks in advance for your help.
(Also, the editor highlights the piece of code that is currently running; it is the outer for loop that is highlighted.)
Like Ander said, use the symbolic toolbox in Matlab only as a last resort. Additionally, when trying to speed up Matlab code, focus on taking advantage of Matlab's vectorized operations. What I mean by this is that Matlab is very efficient at performing operations like this:
y = x.*z;
where x and z are each Nx1 vectors and the operator '.*' is called element-wise (dot) multiplication. This is essentially telling Matlab to compute x(1)*z(1), x(2)*z(2), ..., x(n)*z(n) and assign each value to the corresponding element of the vector y. Additionally, many Matlab functions accept vectors as inputs, perform their operation on each element, and return an equally sized vector with the output at each element. You can check this for any given function by scrolling down to the inputs and outputs section of its documentation and checking what form of array the inputs and outputs can take. For example, rectangularPulse's documentation says it can accept vectors as inputs. Therefore, you can replace your inner loop with this:
s1(i) = sum( rectangularPulse(-1,1,t) .* triP(t(i)-t) );
So to summarize:
Avoid the symbolic toolbox in Matlab until you have a better handle on what you're doing or you absolutely have to use it.
Use Matlab's ability to handle vectors and arrays very well.
Deconstruct any nested loops you write, one at a time, from the inside out. This usually accelerates Matlab code dramatically, especially when you are new to writing it.
See if you can simplify the code even further and get rid of your outer loop as well.

TensorFlow weight initialization taking 99% of total run time

Problem
I noticed that one of my networks was running slowly and nvidia-smi was reporting only around ~10% GPU usage. After running the profiler, I saw that the TruncatedNormal op was taking the vast majority of the running time (see photo). What could be causing this kind of problem?
Code
Weight declaration function (from MNIST tutorial):
def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)
Code in action:
# First Layer
with tf.name_scope('input'):
    x = tf.placeholder(tf.float32, [None, Nvars])
w1 = weight_variable([Nvars, 8])
b1 = bias_variable([8])
y1 = tf.nn.relu(tf.matmul(x, w1) + b1)
The issue is that the tensorflow functions that you are using have multiplicative properties. When a CPU needs to invoke the product handler, performance degrades exponentially.
The problem may be avoided if the calculation is pre-supported by the kernel. What you need to do is use compiler optimization; i.e. change some compiler settings which will cause the compiler to 'ease' the multiplication.
Or... you could just not multiply, and just add a few times.
Hope I helped.

Can I take advantage of parallelization to make this piece of code faster?

OK, this is a follow-up to this and this question. The code I want to modify is, of course:
function fdtd1d_local(steps, ie = 200)
    ez = zeros(ie + 1);
    hy = zeros(ie);
    for n in 1:steps
        for i in 2:ie
            ez[i] += (hy[i] - hy[i-1])
        end
        ez[1] = sin(n/10)
        for i in 1:ie
            hy[i] += (ez[i+1] - ez[i])
        end
    end
    (ez, hy)
end
fdtd1d_local(1);
@time sol1 = fdtd1d_local(10);
elapsed time: 3.4292e-5 seconds (4148 bytes allocated)
And I've naively tried:
function fdtd1d_local_parallel(steps, ie = 200)
    ez = dzeros(ie + 1);
    hy = dzeros(ie);
    for n in 1:steps
        for i in 2:ie
            localpart(ez)[i] += (hy[i] - hy[i-1])
        end
        localpart(ez)[1] = sin(n/10)
        for i in 1:ie
            localpart(hy)[i] += (ez[i+1] - ez[i])
        end
    end
    (ez, hy)
end
fdtd1d_local_parallel(1);
@time sol2 = fdtd1d_local_parallel(10);
elapsed time: 0.0418593 seconds (3457828 bytes allocated)
sol2==sol1
true
The result is correct, but the performance is much worse. So why? Is it because parallelization isn't meant for an old dual-core laptop, or am I wrong again?
Well, I admit that the only thing I know about parallelization is that it can speed up code, but not every piece of code can be parallelized. Is there any basic knowledge one should have before trying parallel programming?
Any help would be appreciated.
There are several things going on. First, notice the difference in memory consumed; that's a sign that something is wrong. You'll get greater clarity by separating allocation (your zeros and dzeros lines) from the core algorithm. However, it's unlikely that very much of that memory is being used by allocation; more likely, something in your loop is using memory. Notice that you're assigning to the localpart on the left-hand side, but you're using the raw DArray on the right-hand side. That may be triggering some IPC traffic. If you need to debug the memory consumption, see the ProfileView package.
Second, it's not obvious to me that you're really breaking the problem up among processes. You're looping over each element of the whole array; instead, you should have each worker loop over its own piece of the array. However, you're going to run into problems at the edges between localparts, because the updates require the neighboring values. You'd be much better off using a SharedArray (a rough sketch follows this answer).
Finally, launching threads has overhead; for small problems, you're better off not parallelizing and just using simple algorithms. Only when the computation time gets to hundreds of milliseconds (or more) would I even think about going to the effort to parallelize.
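To make the SharedArray suggestion concrete, here is a minimal sketch of my own (not part of the original answer, and written against a recent Julia where Distributed and SharedArrays are standard libraries, unlike the older version used in the question). All workers see the same memory, so no boundary values need to be exchanged explicitly; each @sync acts as a barrier so one sweep finishes before the next one starts.

using Distributed
addprocs(2)                      # two worker processes
@everywhere using SharedArrays

function fdtd1d_shared(steps, ie = 200)
    ez = SharedArray{Float64}(ie + 1); fill!(ez, 0.0)   # visible to all workers
    hy = SharedArray{Float64}(ie);     fill!(hy, 0.0)
    for n in 1:steps
        # the ez sweep only reads hy, so its iterations are independent
        @sync @distributed for i in 2:ie
            ez[i] += hy[i] - hy[i-1]
        end
        ez[1] = sin(n/10)
        # the hy sweep only reads ez, so it can be split the same way
        @sync @distributed for i in 1:ie
            hy[i] += ez[i+1] - ez[i]
        end
    end
    (ez, hy)
end

fdtd1d_shared(1);                # run once to compile
@time sol3 = fdtd1d_shared(10);

As the answer above warns, for ie = 200 the scheduling and communication overhead will dominate; this only pays off once each sweep does substantial work.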
N.B.: I'm a relative Julia, FDTD, Maxwell's Equations, and parallel processing noob.
@tholy provided a good answer presenting the important issues to be considered.
In addition, the Wikipedia Finite-difference time-domain method page presents some good info with references and links to software packages, some of which use some style of parallel processing.
It seems that many parallel processing approaches to FDTD partition the physical environment into smaller chunks and then calculate the chunks in parallel. One complication is that the boundary conditions must be passed between adjacent chunks.
Using your toy 1D problem, and my limited Julia skills, I implemented the toy problem to use two cores on my machine. It's not the most general, modular, extendable, effective, or efficient implementation, but it does demonstrate parallel processing. Hopefully a Julia wizard will improve it.
Here's the Julia code I used:
addprocs(2)

@everywhere function ez_front(n::Int, ez::DArray, hy::DArray)
    ez_local = localpart(ez)
    hy_local = localpart(hy)
    ez_local[1] = sin(n/10)
    @simd for i = 2:length(ez_local)
        @inbounds ez_local[i] += (hy_local[i] - hy_local[i-1])
    end
end

@everywhere function ez_back(ez::DArray, hy::DArray)
    ez_local = localpart(ez)
    hy_local = localpart(hy)
    index_boundary::Int = first(localindexes(hy)[1]) - 1
    ez_local[1] += (hy_local[1] - hy[index_boundary])
    @simd for i = 2:length(ez_local)
        @inbounds ez_local[i] += (hy_local[i] - hy_local[i-1])
    end
end

@everywhere function hy_front(ez::DArray, hy::DArray)
    ez_local = localpart(ez)
    hy_local = localpart(hy)
    index_boundary = last(localindexes(ez)[1]) + 1
    @simd for i = 1:(length(hy_local)-1)
        @inbounds hy_local[i] += (ez_local[i+1] - ez_local[i])
    end
    hy_local[end] += (ez[index_boundary] - ez_local[end])
end

@everywhere function hy_back(ez::DArray, hy::DArray)
    ez_local = localpart(ez)
    hy_local = localpart(hy)
    @simd for i = 2:(length(hy_local)-1)
        @inbounds hy_local[i] += (ez_local[i+1] - ez_local[i])
    end
    hy_local[end] -= ez_local[end]
end

function fdtd1d_parallel(steps::Int, ie::Int = 200)
    ez = dzeros((ie,), workers()[1:2], 2)
    hy = dzeros((ie,), workers()[1:2], 2)
    for n = 1:steps
        @sync begin
            @async begin
                remotecall(workers()[1], ez_front, n, ez, hy)
                remotecall(workers()[2], ez_back, ez, hy)
            end
        end
        @sync begin
            @async begin
                remotecall(workers()[1], hy_front, ez, hy)
                remotecall(workers()[2], hy_back, ez, hy)
            end
        end
    end
    (convert(Array{Float64}, ez), convert(Array{Float64}, hy))
end

fdtd1d_parallel(1);
@time sol2 = fdtd1d_parallel(10);
On my machine (an old 32-bit 2-core laptop), this parallel version wasn't faster than the local version until ie was set to somewhere around 5000000.
This is an interesting case for learning parallel processing in Julia, but if I needed to solve Maxwell's equations using FDTD, I'd first consider the many FDTD software libraries that are already available. Perhaps a Julia package could interface to one of those.

A concrete example of a parfor loop in Matlab that outperforms the for loop

I am still somewhat new to parallel computing in Matlab. I have used OpenMP in C successfully, but could not get better performance in Matlab.
First, since I'm on a machine at a university that I am new to, I verified that it has the Parallel Computing Toolbox by typing ver at the command prompt; it displayed: Parallel Computing Toolbox Version 5.2 (R2011b). Note that the machine has 4 cores.
I tried simple examples of using parfor vs. for, but for always won, though this might be because of the overhead cost. I was doing simple things like the example here: MATLAB parfor is slower than for -- what is wrong?
Before trying to apply parfor to my bigger, more complicated program (I need to compute 500 evaluations of a function and each evaluation takes about a minute, so parallelizing will help here), I would very much like to see a concrete example where parfor beats for. Examples are abundant for OpenMP, but I did not find a simple example that I can copy and paste which shows that parfor is better than for.
I use the following code (once per Matlab session) in order to use parfor:
pools = matlabpool('size');
cpus = feature('numCores');
if pools ~= (cpus - 1)
    if pools > 0
        matlabpool('close');
    end
    matlabpool('open', cpus - 1);
end
This leaves 1 core for other processes.
Note, the feature() command is undocumented.
There is an example of improved performance from parfor on Loren Shure's MATLAB blog.
Her example is simply computing the rank of a magic square matrix:
function ranks = parMagic(n)
    ranks = zeros(1,n);
    parfor (ind = 1:n)
        ranks(ind) = rank(magic(ind)); % last index could be ind, not n-ind+1
    end
Serg describes how to "enable" parallel functionality.
Here is a very simple cut-and-paste example to test it with, as requested. Simply copy and paste the following into an m-file and run it.
function parfortest()
    enable_parallel;
    pause on
    tic;
    N = 500;
    for i = 1:N
        sequential_answer = slow_fun(i);
    end
    sequential_time = toc
    tic;
    parfor i = 1:N
        sequential_answer = slow_fun(i);
    end
    parallel_time = toc
end

function result = slow_fun(x)
    pause(0.001);
    result = x;
end
If you have run the code to enable parallel as shown in the answer by Serg, you should see a pretty obvious improvement in performance.

Non-linear performance of Java function in parallel MATLAB

Recently, I implemented parallelisation in my MATLAB program, following the suggestions offered in Slow xlsread in MATLAB. However, implementing the parallelism has given rise to another problem: processing time that increases non-linearly with scale.
The culprit seems to be the java.util.concurrent.LinkedBlockingQueue method, as can be seen from the attached profiler images and the corresponding condensed graphs.
Problem: How do I remove this non-linearity? My work involves processing more than 1000 sheets in a single run, which would otherwise take an insanely long time.
Note: The parallelised part of the program involves just reading all the .xls files and storing them in matrices, after which I start the remainder of my program. dlmwrite is used towards the end of the program and optimizing its time is not really required, although suggestions are welcome.
Culprit: (profiler screenshot, not reproduced here, showing java.util.concurrent.LinkedBlockingQueue dominating the run time)
Code being parallelised:
parfor i = 1:runs
    sin = 'Sheet';
    sno = num2str(i);
    sna = strcat(sin, sno);
    data(i, :, :) = xlsread('Processes.xls', sna, '', 'basic');
end
Doing parallel I/O operations is likely to be a problem (it could in fact be slower), unless perhaps you keep everything on an SSD. If you are always reading the same file and it's not enormous, you may want to try reading it before your loop and doing only your data manipulation in parallel.
