In trying to optimise some code, I find that using OpenMP linearly increases the time it takes to run. The representative section of code that I am trying to speed up is as follows:
CALL system_clock(count_rate=cr)
CALL system_clock(count_max=cm)
rate = REAL(cr)
CALL SYSTEM_CLOCK(c1)
DO k=1,ntotal
  CALL OMP_INIT_LOCK(locks(k))
END DO
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i,j,k)
DO k=1,niac
  i = pair_i(k)
  j = pair_j(k)
  dvx(:,k) = vx(:,i)-vx(:,j)
  CALL omp_set_lock(locks(i))
  CALL DGER(dim,dim,-1.d0, (disp_nmh(:,j)-disp_nmh(:,i)),1, &
            (dwdx_nor(dim+1:2*dim,k)*V_0(j)),1, particle_data(i)%def_grad,dim)
  CALL DGER(dim,dim,-1.d0, (-dvx(:,k)),1, &
            (dwdx_nor(dim+1:2*dim,k)*V_0(j)) ,1, particle_data(i)%vel_grad(1:dim,1:dim),dim)
  CALL omp_unset_lock(locks(i))
  CALL omp_set_lock(locks(j))
  CALL DGER(dim,dim,-1.d0, (dvx(:,k)),1, &
            (dwdx_nor(3*dim+1:4*dim,k)*V_0(i)) ,1, particle_data(j)%vel_grad(1:dim,1:dim),dim)
  CALL DGER(dim,dim,-1.d0, (disp_nmh(:,i)-disp_nmh(:,j)),1, &
            (dwdx_nor(3*dim+1:4*dim,k)*V_0(i)),1, particle_data(j)%def_grad,dim)
  CALL omp_unset_lock(locks(j))
END DO
!$OMP END PARALLEL DO
CALL SYSTEM_CLOCK(c2)
t_el = t_el + (c2-c1)/rate
WRITE(*,*) "Wall time elapsed: ", t_el
Note that for the simulation I am testing k=14000 which I thought was a reasonable candidate for running in parallel. So far as I know I have to use the locks to ensure that threads which are given the same value of "i" (but a different value of "j") cannot access the same index of the arrays which are being written to at the same time. I cannot figure out if the version of BLAS (sudo apt-get install libblas-dev liblapack-dev) which I use is thread safe. I ran a simulation with 8 cores and got the same result as without OpenMP so I am guessing that it could be. BLAS is used, in this case, to calculate and sum the outer product of many 3x3 matrices.
Is the implementation of OpenMP above the best way to speed up this code? I know very little about OpenMP but my guesses are that:
the memory being all over the place ("i" is sequential but "j" is not)
the overhead in starting and closing down all the threads
the constant locking and unlocking
and maybe the small loop size (although I thought 14000 would be sufficient)
are significantly outweighing the performance benefits. Is this correct? Or can the code above be modified to get some performance gain?
EDIT
I should probably add that the code above is part of a time integration loop. Hopefully this explains why the elapsed time is summed.
I am having trouble understanding the logic behind the "last warp loop unrolling" technique in Nvidia's parallel reduction tutorial, available here.
In case of thread31 (for which tid=31), before unrolling the loop:
this thread only executes these operations:
sdata[31] += sdata[31+64]
sdata[31] += sdata[31+32]
But after the loop unrolling (as shown below):
The condition if(tid < 32) becomes true for thread31, so the warpReduce function is executed for it. As a result, all of these operations, which were not executed in the version before unrolling, are executed now:
sdata[31] += sdata[31+32] //for second time
sdata[31] += sdata[31+16]
...
sdata[31] += sdata[31+1]
What's the logic behind it?
First:
sdata[31] += sdata[31+32] //for second time
No, that's not the case, it doesn't get executed a second time. The loop terminates when the s variable is shifted right from 64 to 32, and the body of the loop is not executed for s=32. Therefore the above statement is not executed during the body of the loop, because that would imply s=32, which is excluded by the loop termination condition.
Now, on to your question. It's true there is a behavioral difference between the two cases, however the only result that matters at the end is sdata[0] and this behavioral difference does not affect the results calculation for sdata[0]. So the only thing left would be "does it matter for performance?"
I don't have an answer for you, but I doubt it would make a significant difference. In the non-warp-reduce case, at each loop iteration there is a shift-right operation on a register variable, followed by a test, followed by a predicated set of shared memory instructions. In the warp-reduce case, there is some extra shared memory load/store activity and add arithmetic, but no shift arithmetic or testing per reduction step.
With respect to the extra load/store activity, the only portion of this that matters is the portion that will reach "above" the warp range (i.e. 0-31). There is extra shared loading activity going on here. The extra store activity and extra add arithmetic is irrelevant, because constraining these operations to less than a single warp is not any better performance-wise (this point is covered in the presentation itself, "We don’t need if (tid < s) because it doesn’t save any work"). So the only consideration here is the once-per-step "extra" read of shared memory, one additional transaction, basically, per step. Against that we have the shifting, conditional test, and predication.
I don't know which is faster, but my guess as to the "logic" would be:
The difference would be small. Shared memory pressure is unlikely to be an issue at this point in this code.
The person who wrote it either didn't consider this at all, or considered it and decided it was probably so trivial as to be not worthy of cluttering a presentation that is really focused on other things, and will be read by many people.
EDIT: Based on comments, there appears to still be some question about my claim that the behavioral difference does not affect the results calculation for sdata[0].
First, let's acknowledge that the only item we care about at the end is sdata[0]. sdata[1] or any other "result" is irrelevant for this discussion.
Let's make an observation about which thread calculations matter at each step. We can observe that at a given step in the final-warp reduction, the only threads that matter (i.e. that can have an effect on the final value in sdata[0]) are those whose tid is less than the offset value:
sdata[tid] += sdata[tid + offset]; // where offset is 32, then 16, then 8, etc.
Why is this? In order to understand that, we need to understand 2 things. First, we must understand at this point that there is an expectation of warp-synchronous behavior. This is already identified in the presentation (slide 21) as a necessary precondition to convert the loop reduction to the unrolled final warp reduction. I'm not going to spend a lot of time on the definition of warp-synchronous, but it essentially means we are depending on the warp to execute in lockstep. A warp is 32 threads, and it means that when one thread is executing a particular instruction, every thread in the warp is executing that instruction, at that point in the instruction stream. Second, we need to carefully decompose the above line to understand the sequence of operations. The above line of C++ code will decompose into the following pseudo-machine-language code that the GPU is actually executing:
LD R0, sdata[tid]
LD R1, sdata[tid+offset]
ADD R3, R0, R1
ST sdata[tid], R3
In English, at each step in the final warp unrolled reduction, each thread will load its sdata[tid] value, then each thread will load its sdata[tid+offset] value, then each thread will add those 2 values together, then each thread will store the result. Because the warp is executing in lockstep at this point, when each thread loads its sdata[tid] value, it means that every thread is loading its respective value, at that instruction cycle/clock cycle, i.e. at that instant.
Now, let's revisit the overall operation. At the point in the sequence where we have:
sdata[tid] += sdata[tid + 16];
how can we justify the statement that the only threads here that matter are those whose tid value is less than the offset? The first thing each thread does is load sdata[tid]. Then each thread loads sdata[tid+16]. So at this point, threads 0-15 have loaded their own value, plus the values from locations 16-31. Threads 16-31 have loaded their own value, plus the values from locations 32-47. Then all 32 threads perform the addition, then all 32 threads perform the store operation. So thread 16, which also picked up the value from location 32, did not update the location 16 value until after the previous value at location 16 had been consumed (by thread 0 in this case). So the behavior of threads 16-31 at this point has no impact on the value computed for thread 0.
We can repeat the above process to show that for each offset, the threads whose indexes lie at or above the offset have no impact on the calculation for thread 0.
I'm trying to write code in which each OpenMP thread drives a single GPU. I have found very few case studies or example codes on this. I do not come from a computer science background, so my programming skills are limited.
This is what the basic idea looks like:
And this is the code developed so far:
CALL OMP_SET_NUM_THREADS(2)
!$omp parallel num_threads(acc_get_num_devices(acc_device_nvidia))
do while ( num.gt.iteration)
id = omp_get_thread_num()
call acc_set_device_num(id+1, acc_device_nvidia)
!!$acc kernels
!error=0.0_rk
!!$omp do
!$acc kernels
!!$omp do
do j=2,nj-1
!!$acc kernels
do i=2,ni-1
T(i,j)=0.25*(T_o(i+1,j)+T_o(i-1,j)+ T_o(i,j+1)+T_o(i,j-1) )
enddo
!!$acc end kernels
enddo
!!$omp end do
!$acc end kernels
!!$acc update host(T,T_o)
error=0.0_rk
do j=2,nj-1
do i=2,ni-1
error = max( abs(T(i,j) - T_o(i,j)), error)
T_o(i,j) = T(i,j)
enddo
enddo
!!$acc end kernels
!!$acc update host(T,T_o,error)
iteration = iteration+1
print*,iteration , error
!print*,id
enddo
!$omp end parallel
There's a number of issues here.
First, you can't put an OpenMP (or OpenACC) parallel loop on a do while. A do while loop has an indeterminate number of iterations, which creates a dependency: whether to exit the loop depends on the previous iteration. You need to use a DO loop where the number of iterations is known upon entry into the loop.
Second, even if you convert this to a DO loop, you'd get a race condition if it were run in parallel. Each OpenMP thread would be assigning values to the same elements of the T and T_o arrays. In addition, the result in T_o is used as input to the next iteration, creating a dependency. In other words, you'd get wrong answers if you tried to parallelize the outer iteration loop.
For the OpenACC code, I'd suggest adding a data region around the iteration loop, i.e. "!$acc data copy(T,T_o) " before the iteration loop and then after the loop "!$acc end data", so that the data is created on the device only once. As you have it now, the data would be implicitly created and copied each time through the iteration loop causing unnecessary data movement. Also add a kernels region around the max error reduction loop so this is offloaded as well.
In general, I prefer using MPI+OpenACC for multi-GPU programming rather than OpenMP. With MPI, the domain decomposition is inherent and you then have a one-to-one mapping of MPI rank to a device. Not that OpenMP can't work, but you then often need to manually decompose the domain. Also, trying to manage multiple device memories and keep them in sync can be tricky. Plus, with MPI, your code can also go across nodes rather than be limited to a single node.
I have some performance problems with parallel computing in Julia. I am new to both Julia and parallel computing.
In order to learn, I parallelized a code that should benefit from parallelization, but it does not.
The program estimates the mean of the means of the components of arrays whose elements are drawn at random from a uniform distribution.
Serial version
tic()
function mean_estimate(N::Int)
    iter = 100000*2
    p = 5000
    vec_mean = zeros(iter)
    for i = 1:iter
        vec_mean[i] = mean( rand(p) )
    end
    return mean(vec_mean)
end
a = mean_estimate(0)
toc()
println("The mean is: ", a)
Parallelized version
addprocs(CPU_CORES - 1)
println("CPU cores ", CPU_CORES)
tic()
@everywhere function mean_estimate(N::Int)
    iter = 100000
    p = 5000
    vec_mean = zeros(iter)
    for i = 1:iter
        vec_mean[i] = mean( rand(p) )
    end
    return mean(vec_mean)
end
the_mean = mean(vcat(pmap(mean_estimate,[1,2])...))
toc()
println("The mean is: ", the_mean)
Notes:
The factor 2 in the definition of iter in the serial code is there because I ran the code on a PC with two cores.
I checked the usage of the two cores with htop, and it seems to be ok.
The outputs I get are:
me@pentium-ws:~/average$ time julia serial.jl
elapsed time: 2.68671022 seconds
The mean is: 0.49999736055814215
real 0m2.961s
user 0m2.928s
sys 0m0.116s
and
me@pentium-ws:~/average$ time julia -p 2 parallel.jl
CPU cores 2
elapsed time: 2.890163089 seconds
The mean is: 0.5000104221069994
real 0m7.576s
user 0m11.744s
sys 0m0.308s
I've noticed that the serial version is slightly faster than the parallelized one for the timed part of the code. Also, there is a large difference in the total execution time.
Questions
Why is the parallelized version slower? (What am I doing wrong?)
Which is the right way to parallelize this program?
Note: I use pmap with vcat because I wish to try with the median too.
Thanks for your help
EDIT
I measured times as @HighPerformanceMark suggested. The tic()/toc() times are the following. The iteration number is 2E6 for every case.
Array size    Single thread (s)    Parallel (s)    Ratio (parallel/single)
5000                       2.69            2.89    1.07
100 000                  488.77          346.00    0.71
1 000 000               4776.58         4438.09    0.93
I am puzzled about why there is no clear trend with array size.
You should pay prime attention to suggestions in the comments.
As @ChrisRackauckas points out, type instability is a common stumbling block for performant Julia code. If you want highly performant code, then make sure that your functions are type-stable. Consider annotating the return type of the function pmap and/or vcat, e.g. f(pids::Vector{Int}) = mean(vcat(pmap(mean_estimate, pids))) :: Float64 or something similar, since pmap does not strongly type its output. Another strategy is to roll your own parallel scheduler. You can use pmap source code as a springboard (see code here).
Furthermore, as @AlexMorley commented, you are confounding your performance measurements by including compilation times. Normally performance of a function f() is measured in Julia by running it twice and measuring only the second run. In the first run, the JIT compiler compiles f() before running it, while the second run uses the compiled function. Compilation incurs an (unwanted) performance cost, so timing the second run avoids measuring the compilation.
If possible, preallocate all outputs. In your code, you have set each worker to allocate its own zeros(iter) and its own rand(p). This can have dramatic performance consequences. A sketch of your code:
# code mean_estimate as two functions
f(p::Int) = mean(rand(p))
function g(iter::Int, p::Int)
    vec_mean = zeros(iter)
    for i in eachindex(vec_mean)
        vec_mean[i] = f(p)
    end
    return mean(vec_mean)
end
# run twice, time on second run to get compute time
g(200000, 5000)
@time g(200000, 5000)
### output on my machine
# 2.792953 seconds (600.01 k allocations: 7.470 GB, 24.65% gc time)
# 0.4999951853035917
The @time macro is alerting you that the garbage collector is cleaning up a lot of allocated memory during execution, several gigabytes in fact. This kills performance. Memory allocations may be overshadowing any distinction between your serial and parallel compute times.
Lastly, remember that parallel computing incurs overhead from scheduling and managing individual workers. Your workers are computing the mean of the means of many random vectors of length 5000. But you could succinctly compute the mean (or median) of, say, 5M entries with
x = rand(5_000_000)
mean(x)
@time mean(x) # 0.002854 seconds (5 allocations: 176 bytes)
so it is unclear how your parallel computing scheme improves upon serial performance. Parallel computing generally provides the best help when your arrays are truly beefy or your calculations are arithmetically intense, and vector means probably do not fall in that domain.
One last note: you may want to peek at SharedArrays, which distribute arrays over several workers with a common memory pool, or the experimental multithreading facilities in Julia. You may find those parallel frameworks more intuitive than pmap.
I'm working on code in which I have to perform a vector-matrix multiplication on a chunk of data, copy the results back to the CPU, and then start multiplying another chunk. I perform the vector-matrix multiplication using the cuBLAS library (code below).
clock_t a,b;
a = clock();
for(int i=0;i<n;i++)
{
    cublasSgemv(handle,CUBLAS_OP_T,m,k,&alpha, dev_b1+((i+1)*m), m, dev_b1+(i*m),1, &beta,out,1);
    out+=(n-(i+1));
    cudaMemcpy(b3,dev_b3, sizeof(float)*(cor_size), cudaMemcpyDeviceToHost);
}
b = clock();
cout<<"Running time is: "<<(double)(b-a)/CLOCKS_PER_SEC;
I have to measure the running time of this for loop. I read about CUDA events, but in my case I want to measure the time of the whole loop rather than a single kernel, so I used the clock function. Is this a correct way to measure the time for this chunk of code, or are there more accurate ways to do it?
I know that to measure elapsed time we should run the code multiple times and take the average over all runs, so my other question is: is there a trade-off in choosing how many times the code should be repeated?
Thanks
cudaMemcpy synchronizes host and device, so a CPU timer such as clock_t should give results that are identical with those produced by a CUDA timer, making the necessary allowances for the granularity/resolution of clock_t.
As far as the accuracy of the measurements is concerned, from what I have seen, the timing of the first iteration can be disregarded in the calculations. Whether subsequent measurements yield the same numbers at every iteration depends on factors such as load imbalance in the algorithm being run. I would reckon that that would not be an issue here, with Sgemv.
You can still use CUDA events to measure the entire loop runtime, by recording two events (one before starting the loop, one after the end, i.e. in the positions where you are currently using clock()), synchronizing on the second event and then getting the elapsed time using cudaEventElapsedTime(). This should have the advantage of being more accurate than clock().
The following loop in Fortran takes almost no time:
j=0
do i=1,1000000000000000000
j=j+1
end do
print*,j
But I just don't understand: our CPU runs at about a GHz, which means about 10^9 cycles per second, while the loop above has far more than 10^9 iterations. Why does it take almost no time?
It seems that the value is not computed at compile time. We can add an outer loop:
do m=1,1000000000
do i=1,1000000000000000000
j=j+1
end do
end do
print*,j
Now it takes a second on my computer
Edit
I am using windows, intel parallel studio 15, with no extra compilation option: simply ifort test.f90. Timing method is simple, just wait after I press Enter in command line to execute the .exe
I don't know Fortran, but if this were C, the compiler could optimize the above code by removing the loop altogether, as the value of j can be computed at compile time.
So the above code would be reduced to
print 1000000000000000000
Your logic about cycles and instructions is flawed. Modern CPUs parallelize code on hardware level, even if the code is serial:
a CPU has several ALUs that can execute arithmetic instructions in parallel
instructions are executed in a pipeline, so at any one point, different stages of consecutive instructions are executed in parallel.
So "max of one instruction per cycle" doesn't hold.
Also, increment by one is one of the fastest instructions in the CPU.