How to remove Fortran race condition?

Forgive me if this is not actually a race condition; I'm not that familiar with the nomenclature.
The problem I'm having is that this code runs slower with OpenMP enabled. I think the loop should be plenty big enough (k=100,000), so I don't think overhead is the issue.
As I understand it, a race condition occurs here because all the threads are trying to access the same v(i,j) values all the time, which slows down the code.
Would the best fix here be to create as many copies of the v() array as threads and have each thread access a different one?
I'm using the Intel compiler on 16 cores, and it runs just slightly slower than on a single core.
Thanks all!
!$OMP PARALLEL DO
      Do 500, k=1,n
        Do 10, i=-(b-1),b-1
          Do 20, j=-(b-1),b-1
            if (abs(i).le.l.and.abs(j).eq.d) then
              cycle
            endif
            v(i,j)=.25*(v(i+1,j)+v(i-1,j)+v(i,j+1)+v(i,j-1))
            if (k.eq.n-1) then
              vtest(i,j,1)=v(i,j)
            endif
            if (k.eq.n) then
              vtest(i,j,2)=v(i,j)
            endif
 20       continue
 10     continue
 500  continue
!$OMP END PARALLEL DO

You certainly have programmed a race condition, though I'm not sure that it is the cause of your program's failure to execute more quickly. This line
v(i,j)=.25*(v(i+1,j)+v(i-1,j)+v(i,j+1)+v(i,j-1))
which will be executed by all threads for the same (set of) values of i and j, is where the racing happens. Given that your program does nothing to coordinate reads and writes to the elements of v, your program is, in practice, non-deterministic: there is no way to know the order in which updates to v are made.
You should have observed this non-determinism when inspecting the results of the program, and noticed that changing the number of threads has an impact on the results too. Then again, with a long-running stencil operation over an array, the results may have converged to the same (or similar enough) values.
OpenMP gives you the tools to coordinate access to variables, but it doesn't implement them automatically; there is definitely nothing going on under the hood to prevent quasi-simultaneous reads from and writes to v. So the explanation for the lack of performance improvement lies elsewhere. It may be down to the impact of multiple threads on the cache at some level of your system's memory hierarchy: a nice, cache-friendly run over every element of an array in memory order in a serial program becomes a blizzard of (as far as the cache is concerned) random accesses to memory, requiring a trip to RAM at every go.
It's possible that the explanation lies elsewhere. If the time to execute the OpenMP version is only slightly longer than the time to execute a serial version, I suspect that the program is not, in fact, being executed in parallel. Failure to compile properly is a common (here on SO) cause of that.
How to fix this?
Well, the usual pattern for OpenMP across an array is to parallelise over one of the array indices. The statements
!$omp parallel do
do i=-(b-1),b-1
....
end do
ensure that each thread gets a different set of values of i, which means that the threads write to different elements of v, removing (almost all of) the data race. As you've written the program, each thread gets a different set of values of k, but k is hardly used in the inner loops.
In passing, testing
if (k==n-1) then
and
if (k==n) then
in every iteration looks like tying an anchor to your program; why not just write
do k=1,n-2
and deal with the updates to vtest at the end of the loop?
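A sketch of that restructuring (serial form; sweep is a hypothetical subroutine holding the i and j loops from your code, with the tests on k removed):
do k=1,n-2
   call sweep(v)
end do
call sweep(v)
vtest(-(b-1):b-1,-(b-1):b-1,1) = v(-(b-1):b-1,-(b-1):b-1)   ! state after sweep n-1
call sweep(v)
vtest(-(b-1):b-1,-(b-1):b-1,2) = v(-(b-1):b-1,-(b-1):b-1)   ! state after sweep n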
You could separate the !$omp parallel do like this
!$omp parallel
do k=1,n-2
!$omp do
do i=-(b-1),b-1
(and make the corresponding changes at the end of the parallel loop and region). Now all threads execute the entire contents of the parallel region, but each gets its own set of i values to use. I recommend that you add clauses to your directives to specify the accessibility (e.g. private or shared) of each variable, but this answer is getting a bit too long and I won't go into more detail on these, or on using a schedule clause.
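Putting those pieces together, a sketch of the overall structure might look like this (the default(none), shared and private lists are my assumption about which variables exist in the enclosing scope):
!$omp parallel default(none) shared(v,b,l,d,n) private(i,j,k)
do k=1,n-2        ! the final two sweeps and the vtest copies are done after this region
!$omp do
   do i=-(b-1),b-1
      do j=-(b-1),b-1
         if (abs(i).le.l.and.abs(j).eq.d) cycle
         v(i,j)=.25*(v(i+1,j)+v(i-1,j)+v(i,j+1)+v(i,j-1))
      end do
   end do
!$omp end do
end do
!$omp end parallel
The implicit barrier at !$omp end do keeps the threads in step from one sweep to the next.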
Finally, of course, even with the changes I've suggested your program will be non-deterministic because this statement
v(i,j)=.25*(v(i+1,j)+v(i-1,j)+v(i,j+1)+v(i,j-1))
will read neighbouring elements from v which are updated (at a time you have no control over) by another thread. To sort that out ... got to go back to work.

Related

Octave parallel function worsens the running time on a single machine

I tried to create test code in Octave to evaluate the time efficiency on my Windows machine with an 8-core processor (parallelization on a single machine), starting from the simple code example provided in the documentation, as follows:
pkg load parallel
fun = @(x) x^2;
vector_x = 1:20000;
# Serial Format of the Program
tic()
for i=1:20000
vector_y(1,i)=vector_x(1,i)^2;
endfor
toc()
# Parallel Format of the Program
tic()
vector_y1 = pararrayfun(nproc, fun, vector_x);
toc()
To my surprise, the serial code runs much faster than the version using the parallel function. The serial case ran in 0.0758219 s, the parallel one in 3.79864 s.
Could someone explain to me whether this is parallel overhead, whether I should set something up in my Octave configuration, or in which cases parallelization is really helpful?
TL;DR: open up your pool outside the timer and choose a more difficult operation.
There are two main issues. One is what Ander mentioned in his comment: starting up the parallel pool takes a second or two. You can open it up beforehand (in MATLAB you can do this via parpool) to speed that up. Alternatively, run a single parallel operation first, thus opening the pool, and then redo the timing.
The second issue is the simplicity of your operation. Just squaring a number cannot go much faster than it already goes in serial. There's no point in passing data back-and-forth between workers for such a simple operation. Redo your test with a more expensive function, e.g. eig() as MATLAB does in their examples.
Parallelisation is thus useful when the runtime of your operations greatly outweighs the overhead of passing data to and from the workers. Basically this means you either have a very large data set on which you need to perform the same operation for each item (e.g. taking the mean of every 1000 rows or so), or you have a few heavy, but independent, tasks to perform.
For a more in-depth explanation I can recommend this answer of mine and references therein.
Just as a side note, I'm surprised your serial for loop is that fast, given that you do not initialise your output vector. Preallocation is very important, as "growing" an array in a loop requires creating a new array and copying all previous content to it on every iteration.
You also might want to consider not using i or j as variable names, as they denote the imaginary unit. It won't affect runtime much, but it can result in very hard-to-debug errors. Simply use idx, ii, or a more descriptive variable name.

Parallel code slower than serial code (value function iteration example)

I'm trying to make my code faster in Julia using parallelization. My code has nested serial for-loops and performs value function iteration (as described in http://www.parallelecon.com/vfi/).
The following link shows the serial and parallelized version of the code I wrote:
https://github.com/minsuc/MyProject/blob/master/VFI_parallel.ipynb (You can find the functions defined in DefinitionPara.jl on the GitHub page too.) The serial code is defined as main() and the parallel code is defined as main_paral().
The third for-loop in main() is the step where I find the maximizer given (nCapital, nProductivity). As suggested in the official parallel documentation, I distribute the work over nCapital grid, which consists of many points.
When I do @time for the serial and the parallel code, I get
Serial: 0.001041 seconds
Parallel: 0.004515 seconds
My questions are as follows:
1) I added two workers, and they work for 0.000714 seconds and 0.000640 seconds respectively, as you can see in the notebook. Is the parallel code slower because of the cost of this overhead?
2) I increased the number of grid points by changing
vGridCapital = collect(0.5*capitalSteadyState:0.000001:1.5*capitalSteadyState)
Even though each worker does a significant amount of work, the serial code is way faster than the parallel code. When I add more workers, the serial code is still faster. I think something is wrong, but I haven't been able to figure out what... Could it be related to the fact that I pass too many arguments to the parallelized function
final_shared(mValueFunctionNew, mPolicyFunction, pparams, vGridCapital, mOutput, expectedValueFunction)?
I will really appreciate your comments and suggestions!
If the amount of work between synchronizations is really small, the task sync overhead may dominate. Remember that a common OS timeslicing quantum is 10 ms, and you are measuring in the 1 ms range, so with a bit of load, 4 ms of latency for getting all worker threads synced is perfectly reasonable.
If all tasks access the same shared data structure, and that structure is thread safe, access-locking overhead may well be the culprit, even with longer parallel tasks.
In some cases, it may be possible to use non-thread-safe shared arrays for both input and output, but then it must be ensured that the workers don't clobber each other's results.
Depending on what exactly the work threads are doing, for example if they are outputting to the same array elements, it might be necessary to give each worker its own output array, and merge them together in the end, but that doesn't seem to be the case with your task.

Distributing independant iterations in a subroutine over multiple machines

I have an application that's written in Fortran and there is one particular subroutine call that takes a long time to execute. I was wondering if it's possible to distribute the tasks for computation over multiple nodes.
The current serial flow of the code is as follows:
D = Some computations that give me D and it is in memory
subroutine call
<within the subroutine>
iteration from 1 .. n
{
independent operations on D
}
I wish to distribute the iterations over n/4 machines. Can someone please guide me with this? Do let me know if something's not very clear!
Depending on the underlying implementation, coarrays (F2008) may allow processing to be distributed over multiple nodes. Partitioning the iteration space across the images is relatively straightforward, communication of the results back to one image (or to all images) is where some complexity might arise. Some introductory material on coarrays can be found here.
Again, depending on the underlying implementation, DO CONCURRENT (F2008) may allow parallel processing of iterations (though unlikely to be across nodes). Restrictions exist on what can be done in the scope of a DO CONCURRENT construct that mean that iterations can be executed in any order, appropriately capable compilers may be able to then transform that further into concurrent execution.
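In the shape of the question's loop, that might look like the following sketch, where the square root is merely a stand-in for the real, independent operation on each element of D:
do concurrent (i = 1:n)
   d(i) = sqrt(d(i))   ! must be free of dependencies between iterations
end do
An appropriately capable compiler, given its auto-parallelization flags, may then execute the iterations concurrently.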
When one has existing code and wants to parallelize incrementally (or just one routine), shared-memory approaches are the "quick hit". Especially when it is known that the iterations are independent, I'd first recommend looking at compiler flags for auto-parallelization, language constructs such as DO CONCURRENT (thanks to @IanH for reminding me of that), and OpenMP compiler directives.
As my extended comment is about distributed memory, however, I'll come to that.
I'll assume you don't have access to some advanced process-spawning setup on all of your potential machines. That is, you'll have processes running on various machines each being charged for the time regardless of what work is being done. Then, the work-flow looks like
Serial outer loop
    Calculate D
    Distribute D to the parallel environment
    Inner parallel loop on subsets of D
    Gather D on the master
If the processors/processes in the parallel environment are doing nothing else - or you're being charged regardless - then this is the same to you as
Outer loop
    All processes calculate D
    Each process works on its subset of D
    Synchronize D
The communication side, with MPI or coarrays (which I'd recommend in this case; again see @IanH's answer), is here just in the synchronization, and with coarrays that amounts to image synchronization and a few loops with [..].
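A minimal coarray sketch of one pass of that second work-flow, under the assumption that D is a one-dimensional real array and that the independent operations can be stood in for by a simple element-wise update:
program distribute_d
   implicit none
   integer, parameter :: n = 1000
   real    :: d(n)[*]                    ! one copy of D on every image
   integer :: i, me, nimg, chunk, lo, hi, img

   me    = this_image()
   nimg  = num_images()
   chunk = (n + nimg - 1)/nimg           ! block of iterations per image
   lo    = (me - 1)*chunk + 1
   hi    = min(me*chunk, n)

   ! Outer step: every image calculates D (cheaper than communicating it)
   d = [(real(i), i = 1, n)]

   ! Each image performs the independent operations on its own block
   if (lo <= hi) d(lo:hi) = sqrt(d(lo:hi))   ! placeholder for the real work

   sync all                              ! every image's block is now ready

   ! Synchronize D: image 1 gathers the other images' blocks
   if (me == 1) then
      do img = 2, nimg
         lo = (img - 1)*chunk + 1
         hi = min(img*chunk, n)
         if (lo <= hi) d(lo:hi) = d(lo:hi)[img]
      end do
      print *, 'd(1), d(n) =', d(1), d(n)
   end if
end program distribute_d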
As an endnote: multi-machine coarray support is very limited. ifort, as I understand it, requires a licence beyond the basic one; g95 has some support; the Cray compiler may well support it. That's a separate question, however. MPI would be well supported.

On the order of CUDA printf outputs

I'm new to CUDA and I'm trying to do parallel printing with CUDA printf.
In my example below, I have 6 threads and 6 data arrays, and I need to print all 6 arrays "at the same time" in CUDA. Each array should be assigned to one thread, which will print it. I've been trying for more than a week and can't work out how to do it, because the results always come out in order: the first array is printed first, the second array second, and so on. However, I would like to observe mixed printing, to prove the "randomness" of the parallel execution. Here is my code:
no code
What did I do wrong?
Since you have a very tiny kernel containing only one block with 6 threads, all the threads run in a single warp. Within a warp, different threads have to wait for each other.
Please refer to the programming guide for more details.
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#simt-architecture
A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to the same execution path. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjoint code paths.
As a result, your data will be printed out with the same order of your code (First if(id==1){...}, then if(id==2){...}, ...).
Have a look at the CUDA C Programming Guide pp. 113-114: it provides some information on how printf flushes its output.
EDIT
Also, according to Eric's answer, with printf you will see only a "granular" randomness, the randomness being related to the random nature of warp execution. Everything inside a warp can appear to be ordered.
Also have a look at this other thread:
CUDA : unexpected printf behavior
where Robert Crovella explains the logic behind CUDA printf outputs.

Is there ever a situation where an infinite loop may be desired?

Infinite loops are taught as evil. Is there ever a good use?
When they're coded by accident, the CPU peaks, and I imagine memory does too, especially if variables are assigned inside the loop.
If there is a good use, how are those issues prevented?
Basically every operating system or server spins in an infinite loop.
To avoid memory issues you normally wouldn't allocate memory inside the loop unless it can be freed later inside the same loop. For example, you would allocate memory for a request and delete it once the request has been served.
To avoid CPU peaks you would wait for interrupts (in the case of an OS), or call a blocking function such as poll(), which waits for a new event, once per iteration.
First of all, the word "infinite" in this phrase should be taken a bit more loosely. I am presuming you are talking about a while (true) loop with a break instruction, which will eventually end, as opposed to a loop which will run until the end of time and all humanity.
In the former sense, yes, there are use cases where it's appropriate:
Games use infinite game loops.
Embedded programs use infinite main loops.
Windows applications use infinite message loops.
One example where they might be used inappropriately is when they are used to create time delays by spinning the CPU, which is what novice programmers tend to do to avoid dealing with timer interrupts (or timer events, or other non-procedural constructs). However, when spinning the CPU is done to acquire a shared resource, then the "infinite loop" is also a perfectly valid implementation choice. Even the .NET CLR Monitor, for example, tries spinning for several hundred cycles before issuing a true wait on a kernel event handle and creating a more expensive thread switch.
In addition to programs that run on event loops (like the system processes that @Christoph mentions), some languages have a concept known as generators, which allow and even encourage you to write an infinite loop. The trick is that the object only runs for a finite time when it "yields" (returns) some expression. After that, its state is "frozen" until it is needed again. For example, in Python you can have an object that alternates between LEFT and RIGHT:
def side():
    while True:
        yield "LEFT"
        yield "RIGHT"

a = side()
print(next(a))
print(next(a))
print(next(a))
This would print LEFT, RIGHT, LEFT. The side function looks like an infinite loop because of its while True: statement, but it only ever runs for a finite amount of time per call.
All the applications on your handset run in infinite event loops.
