temp=0
@elapsed for k in 1:1000
    global temp+=k
end
will return elapsed time. But how can you save this into a variable?
temp=0
time=@elapsed for k in 1:1000
    global temp+=k
end
I think this worked in previous versions of Julia? But in 1.0.0 I get
cannot assign variable Libc.time from module Main
Also, it does time the whole for loop, correct? I'm really saddened by tic and toc being unusable in 1.0.0; I think the logic was simpler there.
Well, it quite clearly tells you that time is an existing variable (namely, a function) in Main:
julia> time
time (generic function with 2 methods)
So, just name your result differently:
julia> ime=@elapsed for k in 1:1000
           global temp+=k
       end
6.6707e-5
julia> ime
6.6707e-5
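If you'd rather avoid both the name clash and the global variable, here is a minimal sketch that wraps the loop in a function and stores the timing under a name Base doesn't export. The names sum_to and elapsed_seconds are just illustrative, not from the question:

```julia
# Hypothetical helper: the same loop as in the question, wrapped in a function
# so that no `global` annotation is needed.
function sum_to(n)
    total = 0
    for k in 1:n
        total += k
    end
    return total
end

sum_to(10)  # warm-up call so the timing below does not include compilation

# @elapsed evaluates the expression, discards its value and returns seconds.
# Any name that does not clash with an existing binding such as `time` works.
elapsed_seconds = @elapsed sum_to(1000)
println("elapsed: ", elapsed_seconds, " s")
```

And yes, @elapsed times the whole expression it is given, so in the original version it covers the entire for loop.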
For the following MWE, @code_warntype flags a poorly inferred type, ##1469::JuMP.Containers.SparseAxisArray.
using JuMP, Gurobi
function MWE(n)
    m = Model(Gurobi.Optimizer)
    @variable(m, x[i=1:n, j=i+1:n], Bin)
    @variable(m, y[i=1:n], Bin)
end
@code_warntype MWE(5)
Meanwhile, the adapted version, where j goes from 1 to n instead of from i+1 to n, passes @code_warntype cleanly.
function MWE_codewarntype_safe(n)
    m = Model(Gurobi.Optimizer)
    @variable(m, x[i=1:n, j=1:n], Bin)
    @variable(m, y[i=1:n], Bin)
end
@code_warntype MWE_codewarntype_safe(5)
However, I can't let my model have nearly twice as many variables, more than half of them unused. I ran both versions on larger instances and performance quickly deteriorates. Does this mean I should ignore what @code_warntype tells me? If so, it isn't the first time I've had to ignore it, and I find it particularly unclear how to tell when the output of @code_warntype is meaningful. Maybe I should ask a more general question about this macro: how do I read and understand it?
Hmm. I thought we fixed this. Note that x is a concrete type, so this is just a failure of Julia's inference. It also means that when x is passed to another function (e.g., add_constraint), it will be fast.
Edit: opened an issue to discuss: https://github.com/jump-dev/JuMP.jl/issues/2457
Here's the MWE:
using JuMP
function MWE(n)
    model = Model()
    @variable(model, x[i=1:n, j=i+1:n])
end
@code_warntype MWE(5)
The question to ask is: is the time difference material? If it's only marginally faster, I would go with the more readable version.
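To illustrate the point about a concrete value being fast once it is passed to another function, here is a small sketch that has nothing to do with JuMP itself. The names build, total and process are hypothetical. Inference only sees Any for the value returned by build, but at run time it is a plain Vector{Float64}, so the call to total dispatches once and then runs specialized code (this pattern is usually called a function barrier):

```julia
# Hypothetical stand-in for a constructor whose result type inference cannot see:
# storing the vector in an Any[] container hides its concrete type.
function build(n)
    container = Any[collect(1.0:n)]
    return container[1]          # inferred as Any, but really a Vector{Float64}
end

# The "barrier": compiled separately for each concrete argument type.
total(v) = sum(v)

function process(n)
    v = build(n)                 # @code_warntype would flag v::Any here
    return total(v)              # one dynamic dispatch, then specialized code
end

println(process(4))
```

So a red entry in @code_warntype matters most when the poorly inferred value is used inside a hot loop in the same function; if it is only handed off to other functions, the cost is typically a single dynamic dispatch.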
I have a function, say foo(), that returns an int value, and I have to pass two different values to this function to obtain two different results that have to be summed up, e.g.
result = foo(2) + foo(37)
and I would like to make those foo(2) and foo(37) to be calculated in parallel (at the same time). It may help to have two versions of foo, one that uses a for loop and another one recursive. I am quite new to Julia and parallel programming but would like to get this problem going so that I can keep it up until I can build it as a web app with Genie.jl. Also, any resources to learn about parallel programming with Julia besides its docs will be highly appreciated!
If you want to use processes for the parallelisation you can use the "distributed for" loop:
8.2.3. Aggregate results
The second situation is when you want to perform a small operation on each of the items but you also want to perform an “aggregation function” at the end to retrieve a scalar value (or an array if the input is a matrix).
In these cases, you can use the @distributed (aggregationFunction) for construct.
As an example, you run in parallel a division by 2 and then use the sum as the aggregation function (assume three working processes are available):
using Distributed, BenchmarkTools # needed for @distributed and @benchmark
function f(n)
    s = 0.0
    for i = 1:n
        s += i/2
    end
    return s
end
function pf(n)
    s = @distributed (+) for i = 1:n # aggregate using sum on variable s
        i/2
        # last element of for cycle is used by the aggregator
    end
    return s
end
@benchmark f(10000000) # median time: 11.478 ms
@benchmark pf(10000000) # median time: 4.458 ms
(From Julia Quick Syntax Reference)
Alternatively, you can use threads. Julia already has multithreading, but Julia 1.3 (due in a few days/weeks, in rc4 at the time of writing) will introduce a comprehensive threading API.
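For the foo(2) + foo(37) case from the question, a minimal threaded sketch could look like the following. It assumes Julia 1.3 or later, started with more than one thread (for example by setting the JULIA_NUM_THREADS environment variable); foo below is only a hypothetical stand-in for the real function, and parallel_sum is a name I made up:

```julia
# Hypothetical stand-in for the question's foo: any function returning an Int.
foo(n::Int) = sum(1:n)

function parallel_sum(a::Int, b::Int)
    task = Threads.@spawn foo(a)  # scheduled on any available thread
    y = foo(b)                    # computed on the current task in the meantime
    return fetch(task) + y        # fetch waits for the spawned task and returns its value
end

result = parallel_sum(2, 37)
println(result)
```

With a single thread this still gives the right answer, it just runs the two calls one after the other. For a function as cheap as this stand-in the task overhead will outweigh any gain, so it only pays off when foo does substantial work.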
Is there any way to generate pseudo-random numbers to less precision and thus speed the process up?
Another thing is that I know it saves time if random numbers are generated all at once (e.g. rand(100,1000)), instead of one by one. Could someone explain why this is true?
If you have a CUDA-capable GPU, you can do random number generation on it, as it's supposed to be much faster... Specifically Philox4x32-10:
parallel.gpu.rng(0, 'Philox4x32-10');
R = gpuArray.rand(sZ,'single'); % run this for more info: doc('gpuArray/rand')
MATLAB actually implements more than one random number generator. They differ significantly in terms of execution time and in terms of "randomness" (I think, but I didn't verify). However, I understand from your question that speed is more important for you.
% 'twister' is the default in MATLAB Versions 7.4 and later
tic();
for i=1:1000000
rand('twister');
end
toc();
%Elapsed time is 2.912960 seconds.
% 'state' is the default in MATLAB versions 5 through 7.3
tic();
for i=1:1000000
rand('state');
end
toc();
% Elapsed time is 2.162040 seconds.
% 'seed' is the default in MATLAB version 4
tic();
for i=1:1000000
rand('seed');
end
toc();
% Elapsed time is 0.758830 seconds.
Important note: I ran the script above with a rather old version of MATLAB (v.7.6, a.k.a. R2008a). In newer versions, the syntax rand(generator) is discouraged. Instead, you should use the function rng(seed, generator) (online documentation). As a side effect, rng(seed, generator) gives you even more random number generators to choose from. Check the documentation for details.
Regarding the second question: whatever generator you pick, generating many random numbers in one call will generally be faster than generating them one at a time. MATLAB's internals are heavily optimized for operating on whole arrays, and a single call pays the function-call overhead only once.
tic();
for i=1:100000
rand();
end
toc();
% Elapsed time is 0.024388 seconds.
tic();
rand(100, 1000);
toc();
% Elapsed time is 0.000680 seconds.
Since R2015a the rng function for configuring and seeding the global generator has a 'simdTwister' option that uses a faster "SIMD-oriented Fast Mersenne Twister" algorithm:
rng(1,'twister');
R = rand(1e4); % Warmup for timing
tic
R = rand(1e4);
toc
rng(1,'simdTwister');
R = rand(1e4); % Warmup for timing
tic
R = rand(1e4);
toc
This will probably be the fastest builtin generator for your system (excepting the possibility of GPU-based generators). On my computer it's a little more than twice as fast as the default Mersenne Twister algorithm for large arrays.
I'm trying to vectorize or make this loop run faster (it's a minimal code):
n=1000;
L=100;
x=linspace(-L/2,L/2);
V1=rand(n);
for i=1:length(x)
for k=1:n
for j=1:n
V2(j,k)=V1(j,k)*log(2/L)*tan(pi/L*(x(i)+L/2)*j);
end
end
V3(i,:)=sum(V2);
end
I would appreciate your help.
An alternative to vectorization is to recognize the expensive operations in the code and reduce how often they run. For instance, log(2/L) is called 100*1000*1000 times with an input that does not depend on any of the three for loops. If we calculate this value once, outside of the for loops, then we can use the stored result instead:
logResult = log(2/L);
and
V2(j,k)=V1(j,k)*log(2/L)*tan(pi/L*(x(i)+L/2)*j);
becomes
V2(j,k)=(V1(j,k)*logResult*tan(pi/L*(x(i)+L/2)*j));
Likewise, the code calls the tan function the same 100*1000*1000 times. Note how this calculation, tan(pi/L*(x(i)+L/2)*j) does not depend on k. And so if we calculate these values outside of the for loops, we can reduce this calculation by 1000 times:
lenx = length(x);
tanValues = zeros(lenx,n);
for i=1:lenx
    for j=1:n
        tanValues(i,j) = tan(pi/L*(x(i)+L/2)*j);
    end
end
and the calculation for V2(j,k) becomes
V2(j,k)=V1(j,k)*logResult*tanValues(i,j);
Also, memory can be pre-allocated to the V2 and V3 matrices to avoid the internal resizing that occurs on each iteration. Just do the following outside the for loops
V2 = zeros(n,n);
V3 = zeros(lenx,n);
Timed with tic and toc, these changes reduce the original execution time from ~14 seconds to ~6 on my workstation. This is still three times slower than natan's solution, which takes ~2 seconds for me.
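As a side note, once the tangent values are stored in a matrix, the remaining loops have the shape of a matrix product. Writing T for the matrix called tanValues above, and reading the indices straight off the original loop (sum(V2) sums over j), the derivation is, if I read the code correctly:

```latex
V3_{ik} = \sum_{j=1}^{n} V1_{jk}\,\log(2/L)\,
          \tan\!\left(\frac{\pi}{L}\left(x_i + \frac{L}{2}\right) j\right)
        = \log(2/L) \sum_{j=1}^{n} T_{ij}\, V1_{jk},
\qquad
T_{ij} = \tan\!\left(\frac{\pi}{L}\left(x_i + \frac{L}{2}\right) j\right)
```

In other words, V3 should equal logResult*tanValues*V1, a single lenx-by-n times n-by-n multiplication. I have not timed this variant, so treat it as a suggestion to try and not as a measured result.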
Here's a vectorized solution using meshgrid, bsxfun and repmat:
% fast preallocation
jj(n,n)=0; B(n,n,L)=0; V3(L,n)=0;
lg=log(2/L);
% the vectorization part
jj=meshgrid(1:n);
B=bsxfun(@times,ones(n),permute(x,[3 1 2]));
V3=squeeze(sum(lg*repmat(V1,1,1,numel(x)).*tan(bsxfun(@times,jj',pi/L*(B+L/2))),1)).';
Running your code on my computer, timed with tic/toc, took ~25 seconds. The bsxfun code took ~4.5 seconds...
OK, a follow-up of this and this question. The code I want to modify is of course:
function fdtd1d_local(steps, ie = 200)
    ez = zeros(ie + 1);
    hy = zeros(ie);
    for n in 1:steps
        for i in 2:ie
            ez[i]+= (hy[i] - hy[i-1])
        end
        ez[1]= sin(n/10)
        for i in 1:ie
            hy[i]+= (ez[i+1]- ez[i])
        end
    end
    (ez, hy)
end
fdtd1d_local(1);
@time sol1=fdtd1d_local(10);
elapsed time: 3.4292e-5 seconds (4148 bytes allocated)
And I've naively tried:
function fdtd1d_local_parallel(steps, ie = 200)
    ez = dzeros(ie + 1);
    hy = dzeros(ie);
    for n in 1:steps
        for i in 2:ie
            localpart(ez)[i]+= (hy[i] - hy[i-1])
        end
        localpart(ez)[1]= sin(n/10)
        for i in 1:ie
            localpart(hy)[i]+= (ez[i+1]- ez[i])
        end
    end
    (ez, hy)
end
fdtd1d_local_parallel(1);
@time sol2=fdtd1d_local_parallel(10);
elapsed time: 0.0418593 seconds (3457828 bytes allocated)
sol2==sol1
true
The result is correct, but the performance is much worse. Why? Is it because parallelization isn't worthwhile on an old dual-core laptop, or am I doing something wrong again?
Well, I admit that the only thing I know about parallelization is that it can speed up code, but that not every piece of code can be parallelized. Is there any basic knowledge one should have before trying parallel programming?
Any help would be appreciated.
There are several things going on. First, notice the difference in memory consumed. That's a sign that something is wrong. You'll get greater clarity by separating allocation (your zeros and dzeros lines) from the core algorithm. However, it's unlikely that very much of that memory is being used by allocation; more likely, something in your loop is using memory. Notice that you're describing the localpart on the left hand side, but you're using the raw DArray on the right hand side. That may be triggering some IPC traffic. If you need to debug the memory consumption, see the ProfileView package.
Second, it's not obvious to me that you're really breaking the problem up among processes. You're looping over each element of the whole array, instead you should have each worker loop over its own piece of the array. However, you're going to run into problems at the edges between localparts, because the updates require the neighboring values. You'd be much better off using a SharedArray.
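To make "each worker loops over its own piece" a bit more concrete, here is a sketch of splitting an index range into contiguous chunks, one per worker. chunk_ranges is a hypothetical helper written for this illustration; it is not part of DistributedArrays or SharedArrays:

```julia
# Hypothetical helper: split 1:n into `nchunks` contiguous ranges whose
# lengths differ by at most one.
function chunk_ranges(n::Int, nchunks::Int)
    base, extra = divrem(n, nchunks)
    ranges = UnitRange{Int}[]
    start = 1
    for c in 1:nchunks
        len = base + (c <= extra ? 1 : 0)   # the first `extra` chunks get one more element
        push!(ranges, start:(start + len - 1))
        start += len
    end
    return ranges
end

# Each worker would update only its own range. The first index of every chunk
# except the first still needs the value just before it, which lives in the
# neighbouring chunk: that is the edge problem mentioned above.
for r in chunk_ranges(201, 2)
    println("own indices: ", r, ", also reads index ", first(r) - 1)
end
```

With a shared-memory array every worker can read that neighbouring value directly, which is why it is a simpler fit here than a distributed array.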
Finally, launching threads has overhead; for small problems, you're better off not parallelizing and just using simple algorithms. Only when the computation time gets to hundreds of milliseconds (or more) would I even think about going to the effort to parallelize.
N.B.: I'm a relative Julia, FDTD, Maxwell's Equations, and parallel processing noob.
@tholy provided a good answer presenting the important issues to be considered.
In addition, the Wikipedia Finite-difference time-domain method page presents some good info with references and links to software packages, some of which use some style of parallel processing.
It seems that many parallel processing approaches to FDTD partition the physical environment into smaller chunks and then calculate the chunks in parallel. One complication is that the boundary conditions must be passed between adjacent chunks.
Using your toy 1D problem, and my limited Julia skills, I implemented the toy to use two cores on my machine. It's not the most general, modular, extendable, effective, nor efficient, but it does demonstrate parallel processing. Hopefully a Julia wizard will improve it.
Here's the Julia code I used:
addprocs(2)
@everywhere function ez_front(n::Int, ez::DArray, hy::DArray)
    ez_local=localpart(ez)
    hy_local=localpart(hy)
    ez_local[1]=sin(n/10)
    @simd for i=2:length(ez_local)
        @inbounds ez_local[i] += (hy_local[i] - hy_local[i-1])
    end
end
@everywhere function ez_back(ez::DArray, hy::DArray)
    ez_local=localpart(ez)
    hy_local=localpart(hy)
    index_boundary::Int = first(localindexes(hy)[1])-1
    ez_local[1] += (hy_local[1]-hy[index_boundary])
    @simd for i=2:length(ez_local)
        @inbounds ez_local[i] += (hy_local[i] - hy_local[i-1])
    end
end
@everywhere function hy_front(ez::DArray, hy::DArray)
    ez_local=localpart(ez)
    hy_local=localpart(hy)
    index_boundary = last(localindexes(ez)[1])+1
    @simd for i=1:(length(hy_local)-1)
        @inbounds hy_local[i] += (ez_local[i+1] - ez_local[i])
    end
    hy_local[end] += (ez[index_boundary] - ez_local[end])
end
@everywhere function hy_back(ez::DArray, hy::DArray)
    ez_local=localpart(ez)
    hy_local=localpart(hy)
    @simd for i=2:(length(hy_local)-1)
        @inbounds hy_local[i] += (ez_local[i+1] - ez_local[i])
    end
    hy_local[end] -= ez_local[end]
end
function fdtd1d_parallel(steps::Int, ie::Int = 200)
    ez = dzeros((ie,),workers()[1:2],2)
    hy = dzeros((ie,),workers()[1:2],2)
    for n = 1:steps
        @sync begin
            @async begin
                remotecall(workers()[1],ez_front,n,ez,hy)
                remotecall(workers()[2],ez_back,ez,hy)
            end
        end
        @sync begin
            @async begin
                remotecall(workers()[1],hy_front,ez,hy)
                remotecall(workers()[2],hy_back,ez,hy)
            end
        end
    end
    (convert(Array{Float64},ez), convert(Array{Float64},hy))
end
fdtd1d_parallel(1);
@time sol2=fdtd1d_parallel(10);
On my machine (an old 32-bit 2-core laptop), this parallel version wasn't faster than the local version until ie was set to somewhere around 5000000.
This is an interesting case for learning parallel processing in Julia, but if I needed to solve Maxwell's equations using FDTD, I'd first consider the many FDTD software libraries that are already available. Perhaps a Julia package could interface to one of those.
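For readers on a recent Julia (1.x), here is a different, shared-memory take on the same toy problem, using threads in place of distributed arrays. This is a sketch under assumptions: it is not the approach from the answers above, the names fdtd1d_serial and fdtd1d_threads are mine, and Julia has to be started with more than one thread for it to actually run in parallel. It relies on the fact that within one time step the ez update only reads hy and the hy update only reads ez, so the iterations of each inner loop are independent:

```julia
# Serial reference with the same update rule as fdtd1d_local in the question.
function fdtd1d_serial(steps, ie = 200)
    ez = zeros(ie + 1)
    hy = zeros(ie)
    for n in 1:steps
        for i in 2:ie
            ez[i] += hy[i] - hy[i-1]
        end
        ez[1] = sin(n / 10)
        for i in 1:ie
            hy[i] += ez[i+1] - ez[i]
        end
    end
    return ez, hy
end

# Threaded variant: each inner loop is split across the available threads.
function fdtd1d_threads(steps, ie = 200)
    ez = zeros(ie + 1)
    hy = zeros(ie)
    for n in 1:steps
        Threads.@threads for i in 2:ie
            @inbounds ez[i] += hy[i] - hy[i-1]   # writes ez[i], reads only hy
        end
        ez[1] = sin(n / 10)
        Threads.@threads for i in 1:ie
            @inbounds hy[i] += ez[i+1] - ez[i]   # writes hy[i], reads only ez
        end
    end
    return ez, hy
end

println(fdtd1d_threads(10) == fdtd1d_serial(10))
```

As with the distributed version, I would expect the threaded loops to be slower than the serial ones for ie = 200, since the per-loop scheduling overhead dominates at that size; it should only start to help for much larger grids.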