On the order of CUDA printf outputs

I'm new to CUDA and I'm trying to do parallel printing with CUDA printf.
In my example below, I have 6 threads and 6 data arrays, and I want to print all 6 arrays "at the same time" in CUDA, with each array assigned to one thread that prints it. I've been trying for more than a week and can't get it to work: the results always come out in a fixed order (the first array printed first, the second array second, and so on). I would like to see interleaved printing, to demonstrate the "randomness" of parallel execution. Here is my code:
no code
What did I do wrong?

Since you have a very tiny kernel containing only 1 block with 6 threads, all the threads run in a single warp. Within a warp, threads on different execution paths have to wait for each other.
Please refer to the programming guide for more details.
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#simt-architecture
A warp executes one common instruction at a time, so full efficiency
is realized when all 32 threads of a warp agree on their execution
path. If threads of a warp diverge via a data-dependent conditional
branch, the warp serially executes each branch path taken, disabling
threads that are not on that path, and when all paths complete, the
threads converge back to the same execution path. Branch divergence
occurs only within a warp; different warps execute independently
regardless of whether they are executing common or disjoint code
paths.
As a result, your data will be printed in the same order as in your code (first if(id==1){...}, then if(id==2){...}, and so on).

Have a look at the CUDA C Programming Guide pp. 113-114: it provides some information on how printf flushes its output.
EDIT
Also, according to Eric's answer, with printf you will only see a "granular" randomness, related to the unpredictable order of warp execution. Everything inside a warp can appear to be ordered.
Also have a look at this other thread
CUDA : unexpected printf behavior
where Robert Crovella explains the logic behind CUDA printf outputs.


Octave parallel function worsens the running time on a single machine

I created a test script in Octave to evaluate the time efficiency on my Windows machine with an 8-core processor (parallelization on a single machine), starting from the simple code example provided in the documentation, as follows:
pkg load parallel
fun = @(x) x^2;
vector_x = 1:20000;
# Serial Format of the Program
tic()
for i = 1:20000
  vector_y(1,i) = vector_x(1,i)^2;
endfor
toc()
# Parallel Format of the Program
tic()
vector_y1 = pararrayfun(nproc, fun, vector_x);
toc()
To my surprise, the serial code is much faster than the version using the parallel function: the serial case ran in 0.0758219 s, the parallel one in 3.79864 s.
Could someone explain whether this is parallel overhead, whether I should change something in my Octave setup, and in which cases parallelization is really helpful?
TL;DR: open up your pool outside the timer and choose a more difficult operation.
There are two main issues. One is what Ander mentioned in his comment: starting up the parallel pool takes a second or two. You can open it beforehand (in MATLAB you can do this via parpool) to speed that up. Alternatively, run a single parallel operation, thus opening the pool, and then redo the timing.
The second issue is the simplicity of your operation. Just squaring a number cannot go much faster than it already goes in serial. There's no point in passing data back-and-forth between workers for such a simple operation. Redo your test with a more expensive function, e.g. eig() as MATLAB does in their examples.
Parallelisation is thus useful when the runtime of your operations greatly outweighs the overhead of passing data to and from workers. Basically this means you either have a very large data set on which you need to perform the same operation per item (e.g. taking the mean of every 1000 rows or so), or you have a few heavy, but independent, tasks to perform.
For a more in-depth explanation I can recommend this answer of mine and references therein.
Just as a sidenote, I'm surprised your serial for is that fast, given that you do not initialise your output vector. Preallocation is very important, as "growing" arrays in loops requires the creation of a new array and copying all previous content to it every iteration.
You also might want to consider not using i or j as variable names, as they denote the imaginary unit. It won't affect runtime much, but can result in very hard to debug errors. Simply use idx, ii, or a more descriptive variable name.

CUDA critical sections, thread/warp execution model and NVCC compiler decisions

Recently I posted this question about a critical section. Here is a similar question. In those questions, the given answer says that whether the code "works" or not is up to the compiler, because the ordering of the various execution paths is up to the compiler.
To elaborate the rest of the question I need the following excerpts from the CUDA programming guide:
... Individual threads composing a warp start together at the same program address, but they have their own instruction address counter and register state and are therefore free to branch and execute independently....
A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to the same execution path....
The execution context (program counters, registers, etc.) for each warp processed by a multiprocessor is maintained on-chip during the entire lifetime of the warp. Therefore, switching from one execution context to another has no cost, and at every instruction issue time, a warp scheduler selects a warp that has threads ready to execute its next instruction (the active threads of the warp) and issues the instruction to those threads.
What I understand from these three excerpts is that threads can diverge freely from the rest, that all the branch possibilities are serialized if there is divergence between threads, and that once a branch is taken it executes until completion. That is why the questions mentioned above end in deadlock: the ordering of the execution paths imposed by the compiler results in taking a branch that never acquires the lock.
Now the questions are: shouldn't the compiler always emit the branches in the order written by the user? Is there a high-level way to enforce that order? I know the compiler can optimize and reorder instructions, etc., but it should not fundamentally change the logic of the code (yes, there are exceptions, such as some memory accesses without the volatile keyword, but that is why the keyword exists: to give control to the user).
Edit
The main point of this question is not critical sections; it is the compiler. For example, in the first link, a compilation flag drastically changes the logic of the code: one version "works" and the other doesn't. What bothers me is that all the references only say to be careful; none mention undefined behaviour introduced by the nvcc compiler.
I believe the order of execution is not set, nor guaranteed, by the CUDA compiler. It's the hardware that sets it, as far as I can recall.
Thus,
the compiler shouldn't always put the branches in the order written by the user?
It doesn't control the execution order anyway.
is there a high level way to enforce the order?
Just the synchronization instructions like __syncthreads().
The compiler... should not fundamentally change the logic of the code
The semantics of CUDA code is not the same as for C++ code... sequential execution of if branches is not part of the semantics.
I realize this answer may not be satisfying to you, but that's how things stand, for better or for worse.

Parallel code slower than serial code (value function iteration example)

I'm trying to make my code faster in Julia using parallelization. My code has nested serial for-loops and performs value function iteration (as described in http://www.parallelecon.com/vfi/).
The following link shows the serial and parallelized version of the code I wrote:
https://github.com/minsuc/MyProject/blob/master/VFI_parallel.ipynb (you can find the functions defined in DefinitionPara.jl on the GitHub page too). The serial code is defined as main() and the parallel code as main_paral().
The third for-loop in main() is the step where I find the maximizer given (nCapital, nProductivity). As suggested in the official parallel documentation, I distribute the work over nCapital grid, which consists of many points.
When I use @time for the serial and the parallel code, I get
Serial: 0.001041 seconds
Parallel: 0.004515 seconds
My questions are as follows:
1) I added two workers, and each of them works for 0.000714 seconds and 0.000640 seconds, as you can see in the IPython notebook. Is the parallel code slower because of the overhead cost?
2) I increased the number of grid points by changing
vGridCapital = collect(0.5*capitalSteadyState:0.000001:1.5*capitalSteadyState)
Even though each worker now does a significant amount of work, the serial code is still much faster than the parallel code. When I add more workers, the serial code is still faster. I think something is wrong, but I haven't been able to figure out what... Could it be related to the fact that I pass too many arguments to the parallelized function
final_shared(mValueFunctionNew, mPolicyFunction, pparams, vGridCapital, mOutput, expectedValueFunction)?
I will really appreciate your comments and suggestions!
If the amount of work between synchronizations is really small, the task-sync overhead may dominate. Remember that a common OS timeslicing quantum is 10 ms, and you are measuring in the 1 ms range, so with a bit of load, a 4 ms latency for getting all worker threads synced is perfectly reasonable.
If all tasks access the same shared data structure and that structure is thread-safe, access-locking overhead may well be the culprit, even with longer parallel tasks.
In some cases, it may be possible to use non-thread-safe shared arrays for both input and output, but then it must be ensured that the workers don't clobber each other's results.
Depending on what exactly the work threads are doing, for example if they are outputting to the same array elements, it might be necessary to give each worker its own output array, and merge them together in the end, but that doesn't seem to be the case with your task.

Distributing independent iterations in a subroutine over multiple machines

I have an application that's written in Fortran and there is one particular subroutine call that takes a long time to execute. I was wondering if it's possible to distribute the tasks for computation over multiple nodes.
The current serial flow of the code is as follows:
D = some computations that give me D, held in memory
subroutine call
  <within the subroutine>
  iterate from 1 .. n
  {
    independent operations on D
  }
I wish to distribute the iterations over n/4 machines. Can someone please guide me with this? Do let me know if something's not very clear!
Depending on the underlying implementation, coarrays (F2008) may allow processing to be distributed over multiple nodes. Partitioning the iteration space across the images is relatively straightforward, communication of the results back to one image (or to all images) is where some complexity might arise. Some introductory material on coarrays can be found here.
Again, depending on the underlying implementation, DO CONCURRENT (F2008) may allow parallel processing of iterations (though unlikely to be across nodes). Restrictions exist on what can be done in the scope of a DO CONCURRENT construct that mean that iterations can be executed in any order, appropriately capable compilers may be able to then transform that further into concurrent execution.
When one has existing code and wants to parallelize incrementally (or just one routine), shared-memory approaches are the "quick hit". Especially when it is known that the iterations are independent, I'd first recommend looking at compiler flags for auto-parallelization, language constructs such as DO CONCURRENT (thanks to @IanH for reminding me of that), and OpenMP compiler directives.
As my extended comment is about distributed memory, however, I'll come to that.
I'll assume you don't have access to some advanced process-spawning setup on all of your potential machines. That is, you'll have processes running on various machines each being charged for the time regardless of what work is being done. Then, the work-flow looks like
Serial outer loop
Calculate D
Distribute D to the parallel environment
Inner parallel loop on subsets of D
Gather D on the master
If the processors/processes in the parallel environment are doing nothing else, or you're being charged regardless, then this is equivalent, from your point of view, to
Outer loop
All processes calculate D
Each process works on its subset of D
Synchronize D
The communication side, whether MPI or coarrays (which I'd recommend in this case; again see @IanH's answer, since image synchronization etc. amounts to little more than a few loops with [..] syntax), then lies entirely in the synchronization step.
As an endnote: multi-machine coarray support is very limited. As I understand it, ifort requires a licence beyond the basic one, g95 has some support, and the Cray compiler may well support it. That's a separate question, however. MPI would be well supported.

How to remove Fortran race condition?

Forgive me if this is not actually a race condition; I'm not that familiar with the nomenclature.
The problem I'm having is that this code runs slower with OpenMP enabled. I think the loop should be plenty big enough (k=100,000), so I don't think overhead is the issue.
As I understand it, a race condition is occurring here because all the loops are trying to access the same v(i,j) values all the time, slowing down the code.
Would the best fix here be to create as many copies of the v() array as threads and have each thread access a different one?
I'm using the Intel compiler on 16 cores, and the code runs just slightly slower than on a single core.
Thanks all!
!$OMP PARALLEL DO
Do 500, k=1,n
Do 10, i=-(b-1),b-1
Do 20, j=-(b-1),b-1
if (abs(i).le.l.and.abs(j).eq.d) then
cycle
endif
v(i,j)=.25*(v(i+1,j)+v(i-1,j)+v(i,j+1)+v(i,j-1))
if (k.eq.n-1) then
vtest(i,j,1)=v(i,j)
endif
if (k.eq.n) then
vtest(i,j,2)=v(i,j)
endif
20 continue
10 continue
500 continue
!$OMP END PARALLEL DO
You certainly have programmed a race condition though I'm not sure that that is the cause of your program's failure to execute more quickly. This line
v(i,j)=.25*(v(i+1,j)+v(i-1,j)+v(i,j+1)+v(i,j-1))
which will be executed by all threads for the same (set of) values for i and j is where the racing happens. Given that your program does nothing to coordinate reads and writes to the elements of v your program is, in practice, not deterministic as there is no way to know the order in which updates to v are made.
You should have observed this non-determinism on inspecting the results of the program, and have noticed that changing the number of threads has an impact on the results too. Then again, with a long-running stencil operation over an array the results may have converged to the same (or similar enough) values.
OpenMP gives you the tools to coordinate access to variables but it doesn't automatically implement them; there is definitely nothing going on under the hood to prevent quasi-simultaneous reads from and writes to v. So the explanation for the lack of performance improvement lies elsewhere. It may be down to the impact of multiple threads on cache at some level in your system's memory hierarchy. A nice, cache-friendly, run over every element of an array in memory order for a serial program becomes a blizzard of (as far as the cache is concerned) random accesses to memory requiring access to RAM at every go.
It's possible that the explanation lies elsewhere. If the time to execute the OpenMP version is slightly longer than the time to execute a serial version I suspect that the program is not, in fact, being executed in parallel. Failure to compile properly is a common (here on SO) cause of that.
How to fix this ?
Well the usual pattern of OpenMP across an array is to parallelise on one of the array indices. The statements
!$omp parallel do
do i=-(b-1),b-1
....
end do
ensure that each thread gets a different set of values for i, which means that they write to different elements of v, removing (almost) the data race. As you've written the program, each thread gets a different set of values of k, but k is not used (much) in the inner loops.
In passing, testing
if (k==n-1) then
and
if (k==n) then
in every iteration looks like you are tying an anchor to your program. Why not just write
do k=1,n-2
and deal with the updates to vtest at the end of the loop.
You could separate the !$omp parallel do like this
!$omp parallel
do k=1,n-2
!$omp do
do i=-(b-1),b-1
(and make the corresponding changes at the end of the parallel loop and region). Now all threads execute the entire contents of the parallel region but each gets its own set of i values to use. I recommend that you add clauses to your directives to specify the accessibility (eg private or shared) of each variable; but this answer is getting a bit too long and I won't go into more detail on these. Or on using a schedule clause.
Finally, of course, even with the changes I've suggested your program will be non-deterministic because this statement
v(i,j)=.25*(v(i+1,j)+v(i-1,j)+v(i,j+1)+v(i,j-1))
will read neighbouring elements from v which are updated (at a time you have no control over) by another thread. To sort that out ... got to go back to work.
