How to measure the run time of a Julia program?

If I want to calculate four things in Julia:
invQa = ChebyExp(g->1/Q(g),0,1,5)
a1Inf = ChebyExp(g->Q(g),1,10,5)
invQb = ChebyExp(g->1/Qd(g),0,1,5)
Qb1Inf = ChebyExp(g->Qd(g),1,10,5)
How can I measure the time? How many seconds do I have to wait for the four things above to be done? Do I put tic() at the beginning and toc() at the end?
I tried @elapsed, but got no results.

The basic way is to use
@time begin
# code
end
But note that you should never benchmark in the global scope.
A package that can help you benchmark your code is BenchmarkTools.jl, which you should check out as well.

You could do something like this (I guess that g is input parameter):
function cheby_test(g::Your_Type)
invQa = ChebyExp(g->1/Q(g),0,1,5)
a1Inf = ChebyExp(g->Q(g),1,10,5)
invQb = ChebyExp(g->1/Qd(g),0,1,5)
Qb1Inf = ChebyExp(g->Qd(g),1,10,5)
end
function test()
g::Your_Type = small_quick #
cheby_test(g) #= function is compiled here and
you like to exclude compile time from test =#
g = real_data()
@time cheby_test(g) # here you measure time for real data
end
test()
I suggest calling @time outside the global scope if you want proper allocation information from the @time macro.
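The warm-up-then-measure pattern described above is not specific to Julia. A minimal sketch in Python, assuming a stand-in workload (`work` and `timed` are hypothetical names, not part of any library): the function runs once on a small input so that one-time setup costs are excluded, then a single call on the real input is timed from inside a function rather than at the top level.

```python
import time

def work(n):
    # Stand-in workload (hypothetical); replace with the real computation.
    return sum(i * i for i in range(n))

def timed(func, *args, warmup_args=None):
    """Optionally run func once on a small input, then time one real call."""
    if warmup_args is not None:
        func(*warmup_args)  # warm-up call, excluded from the measurement
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed

result, seconds = timed(work, 200_000, warmup_args=(10,))
print(f"result={result}, elapsed={seconds:.6f} s")
```

A single measurement is noisy; repeating the timed call several times and taking the minimum usually gives a steadier number.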

Related

Parallelising nested do loop Fortran

I don't get any speedup when I try to do the following in the subroutine:
!$ call omp_set_num_threads(threadno)
call system_clock(x1)
!$OMP PARALLEL do private(i), reduction(+:total)
do i = 1,m
total = 0.d0
call result(a,l,b,qm,q,en) !here l is input for subroutine and en is output
qm(:,i) = q
qtv(i) = qt
mean = sum(q)/size(q)
do i2 = 1,k
total = total + ((mean-q(i2))**2)/(a+b)
end do
qvv(i1) = total
end do
call system_clock(x2)
print *, x2-x1
!$OMP END PARALLEL do
Comments on the OpenMP part:
total should not be reset in the loop but before the !$OMP clause.
i2 and mean should be private.
If q does not change between iterations of the loop, sum(q)/size(q) should be placed outside.
The lack of private setting can lead to memory access conflicts (and thus slowdowns).
I guess that the code you show is close to, but not exactly, the one that you compile. It would be useful to have compilable code in order to provide better help.
Cheers,
Pierre
EDIT: for timing OpenMP code, you should use omp_get_wtime (see https://gcc.gnu.org/onlinedocs/libgomp/omp_005fget_005fwtime.html), which gives the wall time (https://en.wikipedia.org/wiki/Wall-clock_time). The module for OpenMP routines is loaded with use omp_lib.
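Wall-clock time is the quantity to watch when judging a parallel speedup, because CPU time is accumulated over all threads and does not shrink when work is spread out. A rough illustration of the difference between the two clocks, sketched in Python rather than Fortran (the helper name `measure` is an assumption for this example): sleeping consumes wall time but almost no CPU time.

```python
import time

def measure(func):
    """Return (wall_seconds, cpu_seconds) for one call of func."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    func()
    return time.perf_counter() - wall0, time.process_time() - cpu0

# Sleeping advances the wall clock while the process uses almost no CPU.
wall, cpu = measure(lambda: time.sleep(0.2))
print(f"wall={wall:.3f} s, cpu={cpu:.3f} s")
```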

Code running very slow

My code seems to run very slowly and I can't think of any way to make it faster. All my arrays have been preallocated. S is a large number of elements (say 10000 elements, for example). I know my code runs slowly because of the "for k=1:S" loop, but I can't think of another way to perform this loop at a reasonable speed. Can I please get some help, because it takes hours to run?
[M,~] = size(Sample2000_X);
[N,~] = size(Sample2000_Y);
[S,~] = size(Prediction_Point);
% Speed Preallocation
Distance = zeros(M,N);
Distance_Prediction = zeros(M,1);
for k=1:S
for i=1:M
for j=1:N
Distance(i,j) = sqrt(power((Sample2000_X(i)-Sample2000_X(j)),2)+power((Sample2000_Y(i)-Sample2000_Y(j)),2));
end
Distance_Prediction(i,1) = sqrt(power((Prediction_Point(k,1)-Sample2000_X(i)),2)+power((Prediction_Point(k,2)-Sample2000_Y(i)),2));
end
end
Thanks.
I realized the major problem was the organization of my code. I was performing a calculation inside a loop where it was absolutely unnecessary. So I separated the code into two blocks and it works much faster.
for i=1:M
for j=1:N
Distance(i,j) = sqrt(power((Sample2000_X(i)-Sample2000_X(j)),2)+power((Sample2000_Y(i)-Sample2000_Y(j)),2));
end
end
for k=1:S
for i=1:M
Distance_Prediction(i,1) = sqrt(power((Prediction_Point(k,1)-Sample2000_X(i)),2)+power((Prediction_Point(k,2)-Sample2000_Y(i)),2));
end
end
Thanks to the community for the help.
Your matrix Distance does not depend on k, so you can easily calculate it outside the main for-loop, for instance using:
d = sqrt((repmat(Sample2000_X, [1,M]) - repmat(Sample2000_X', [M,1])).^2 + (repmat(Sample2000_Y, [1,N]) - repmat(Sample2000_Y', [N,1])).^2);
I assume M=N, because otherwise your code won't work. Next, you can calculate your Distance_Prediction matrix. It is rather strange that you calculate this inside the for-loop over k, because the matrix is overwritten in every iteration without being used. Anyway, this will do exactly the same as your code:
for k=1:S
Distance_Prediction = sqrt((Sample2000_X - Prediction_Point(k,1)).^2 + (Sample2000_Y - Prediction_Point(k,2)).^2);
end
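The restructuring above, computing the k-independent distance matrix once and only the per-point distances inside the loop, can be sketched in plain Python. This is an illustration with made-up sample data and hypothetical helper names, not a translation of the MATLAB code; unlike the original loop it keeps the distances for every prediction point instead of overwriting them.

```python
import math

def pairwise_distances(xs, ys):
    """Distance between every pair of sample points (independent of k)."""
    n = len(xs)
    return [[math.hypot(xs[i] - xs[j], ys[i] - ys[j]) for j in range(n)]
            for i in range(n)]

def distances_to_point(xs, ys, px, py):
    """Distance from one prediction point to every sample point."""
    return [math.hypot(px - x, py - y) for x, y in zip(xs, ys)]

# Made-up sample data for the illustration.
xs = [0.0, 3.0, 0.0]
ys = [0.0, 0.0, 4.0]
points = [(0.0, 0.0), (3.0, 4.0)]

distance = pairwise_distances(xs, ys)  # hoisted: computed once, not once per k
predictions = [distances_to_point(xs, ys, px, py) for px, py in points]

print(distance[1][2])   # 5.0
print(predictions[1])   # [5.0, 4.0, 3.0]
```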

Julia @parallel for loop does not update array

I am new to Julia, and to get started I wanted to port some numpy code to Julia, hoping for a nice performance increase. So far the results are not to my satisfaction.
This is the function I want to compute
function s(x_list, r_list)
result_list = zeros(size(x_list,1))
for i = 1:size(x_list,1)
dotprods = r_list * x_list[i,:]'
expcall = exp(im * dotprods)
sumprod = sum(expcall) * sum(conj(expcall))
result_list[i] = sumprod
end
return result_list
end
with data input that looks like
v = rand(3)
r = rand(6000,3)
x = linspace(1.0, 2.0, 300) * (v./sqrt(sumabs2(v)))'
For this function and the given input, @time s(x,r) gives me
0.110619 seconds (3.60 k allocations: 96.256 MB, 8.47% gc time)
For this case, numpy does the same job in ~70ms, so I'm not very happy! Now if I do a @parallel for loop with julia -p 2:
function s(x_list, r_list)
result_list = SharedArray(Float64, size(x_list,1))
@parallel for i = 1:size(x_list,1)
dotprods = r_list * x_list[i,:]'
expcall = exp(im * dotprods)
sumprod = sum(expcall) * sum(conj(expcall))
result_list[i] = sumprod
end
return result_list
end
the problem is that
result_list[i] = sumprod
doesn't get updated and I get the list of zeros returned from the array initialization. What am I doing wrong here?
Further attempts to increase speed also did not show any benefit, e.g.
@vectorize_2arg Array{Float64,2} s
and declaring types
function s{T<:Float64}(x_list::Array{T,2}, r_list::Array{T,2})
But now, starting the same @parallel for loop in a session with just one process (no -p 2, just julia), the array does get updated and @time s(x,r) tells me
0.000040 seconds (36 allocations: 4.047 KB)
which is actually impossible for the function and input given! Is this a bug?
Any help is very appreciated!
Julia's @parallel macro does a distributed for loop: it copies all the data to other processes and does computations on each of them, reducing over the results and returning that result. The processes do not share memory, and may even be on other machines altogether. Your original data is never touched because each worker is modifying its own copy of that data. You may be thinking of threads, which is a currently-experimental feature that Julia will be adding in the future.
One problem is that you're not waiting for the @parallel call to complete. From the docs:
...the reduction operator can be omitted if it is not needed. In that case, the loop executes asynchronously, i.e. it spawns independent tasks on all available workers and returns an array of Future immediately without waiting for completion. The caller can wait for the Future completions at a later point by calling fetch() on them, or wait for completion at the end of the loop by prefixing it with @sync, like @sync @parallel for.
Try prefixing the for loop with @sync.
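The asynchronous behaviour quoted from the docs has a close analogue in Python's concurrent.futures, sketched here only as an illustration: submit returns a future immediately, and the caller has to wait for the futures before reading the results. Note the assumption that makes this an imperfect analogy: Python threads share memory, whereas Julia worker processes do not, so this shows only the "wait before reading" part, not the copying of data.

```python
from concurrent.futures import ThreadPoolExecutor, wait

results = [0] * 8

def store(i):
    # Each task writes one slot of the shared list.
    results[i] = i * i

with ThreadPoolExecutor(max_workers=2) as pool:
    # submit() returns immediately with a Future; nothing is guaranteed
    # to have run yet at this point.
    futures = [pool.submit(store, i) for i in range(len(results))]
    # Explicitly block until every task has finished before reading results.
    wait(futures)

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```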

MATLAB parfor slicing a 3D array

I'm trying to speed up my code using parfor. The purpose of the code is to slide a 3D square window on a 3D image and for each block of mxmxm apply a function.
I wrote this code:
function [ o_image ] = SlidingWindow( i_image, i_padSize, i_fun, i_options )
%SLIDINGWINDOW Summary of this function goes here
% Detailed explanation goes here
o_image = zeros(size(i_image,1),size(i_image,2),size(i_image,3));
i_image = padarray(i_image,i_padSize,'symmetric');
i_padSize = num2cell(i_padSize);
[m,n,p] = deal(i_padSize{:});
[row,col,depth] = size(i_image);
windowShape = i_options.windowShape;
mask = i_options.mask;
parfor (i = m+1:row-m,i_options.cores)
temp = i_image(i-m:i+m,:,:);
for j = n+1:col-n
for h = p+1:depth-p
ii = i-m;
jj = j-n;
hh = h-p;
temp = temp(:,j-n:j+n, h-p:h+p);
o_image(ii,jj,hh) = parfeval(i_fun, temp, windowShape, mask);
end
end
end
end
I get one warning and one error that I don't understand how to solve.
The warning says:
the entire array or structure 'i_image' is a broadcast variable.
The error says:
the PARFOR loop can not run due to the way variable 'o_image' is used.
I don't understand how to fix these two things. Any help is greatly appreciated!
As far as I understand, parfeval takes care of running your function on the available number of workers, which is why it doesn't need to be surrounded by parfor. Assuming you already have an active parpool, changing the external parfor into for eliminates both problems.
Unfortunately, I can't support my answer with a benchmark or suggest a more fitting solution because your inputs are unknown.
It seems to me that the code can be optimized in other ways, mainly by vectorization. I suggest you look into the following resources:
This question, for additional info on parfeval.
Examples on how to use bsxfun and permute and benchmarks thereof: ex1, ex2, ex3.
P.S.: The 2nd part of (i = m+1:row-m,i_options.cores) seems out of place...

How to decide at runtime whether to use a for or parfor loop in MATLAB, and why are the differences between them minor?

I am trying to test the effect of parfor compared to for in MATLAB, so I built a simple function that calculates π.
Here is the function with the parfor:
function [calc_pi,epsilon] = calcPi(max)
format long;
in = 0;
tic
parfor k=1:max
x = rand();
y = rand();
if sqrt(x^2 + y^2)<1
in = in + 1;
end
end
toc
calc_pi = 4*in/max;
epsilon = abs(pi - calc_pi);
end
I ran it with parfor and got this output:
>> [calc,err] = calcPi(1000000000)
Elapsed time is 92.2923 seconds.
calc =
3.141638468000000
err =
4.581441020690136e-05
>>
With the for loop I got:
>> [calc,err] = calcPi(1000000000)
Elapsed time is 121.3432 seconds.
calc =
3.141645132000000
err =
5.247841020672439e-05
I have two questions:
Why do both take about the same amount of time? (Unlike what is shown here)
I would like to add an argument to the function that indicates whether to use for or parfor, with minimal change to the code:
i.e. :
if (use_par):
parfor k=1:10
else
for k=1:10
end
<--rest of code here-->
How can I write it with the minimal amount of code?
The main requirement of parfor is that the loop iterations are independent. Here they are clearly not, as each iteration can update the variable in.
The good news is that you may be able to solve this by using in(k) instead.
One way to use one loop or the other without writing the body twice would be to put everything you do in a function or script, for example doeverything.m,
then write
if use_par
parfor k=1:10
doeverything
end
else
for k=1:10
doeverything
end
end
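The same idea, one loop body shared by a serial and a parallel driver chosen by a flag, can be sketched in Python. The names `count_hits` and `calc_pi` and the chunking scheme are assumptions made for this illustration. Each chunk has its own seeded generator, so the chunks are independent and both paths give the same estimate. Python threads do not speed up CPU-bound work like this, so the sketch shows only the switching structure; a process pool would be needed for a real speedup.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def count_hits(task):
    """Loop body: count random points that fall inside the unit circle."""
    seed, n = task
    rng = random.Random(seed)  # per-chunk generator keeps chunks independent
    hits = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1.0:
            hits += 1
    return hits

def calc_pi(total, chunks=4, use_par=False):
    per_chunk = total // chunks
    tasks = [(seed, per_chunk) for seed in range(chunks)]
    if use_par:
        with ThreadPoolExecutor(max_workers=chunks) as pool:
            counts = list(pool.map(count_hits, tasks))
    else:
        counts = list(map(count_hits, tasks))
    return 4.0 * sum(counts) / (per_chunk * chunks)

serial = calc_pi(40_000, use_par=False)
parallel = calc_pi(40_000, use_par=True)
print(serial, parallel)
```

Because only the mapping function differs between the two branches, the body of the computation is written once.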
