Can Ruby threads not collide on a write? - ruby

From past work in C# and Java, I am accustomed to a statement such as this not being thread-safe:
x += y;
However, I have not been able to observe any collision among threads when running the above code in parallel with Ruby.
I have read that Ruby automatically prevents multiple threads from writing to the same data concurrently. Is this true? Is the += operator therefore thread-safe in Ruby?

Well, it depends on your implementation and a lot of things. In MRI, there is a such thing as the GVL (Giant VM Lock) which controls which thread is actually executing code at a time. You see, in MRI, only one thread can execute Ruby code at a time. So while the C librarys underneath can let another thread run while they use CPU in C code to multiply giant numbers, the code itself can't execute at the same time. That means, a statement such as the assignment might not run at the same time as another one of the assignments (though the additions may run in parallel). The other thing that could be happening is this: I think I heard that assignments to ints are atomic on Linux, so if you're on Linux, that might be something too.

x += 1
is equivalent in every way to
x = x + 1
(if you re-define +, you also automatically redefine the result of +=)
In this notation, it's clearer this is not an atomic operation, and is therefore not guaranteed thread-safe.

Related

Or-equals on constant as reduction operation (ex. value |= 1 ) thread-safe?

Let's say that I have a variable x.
x = 0
I then spawn some number of threads, and each of them may or may not run the following expression WITHOUT the use of atomics.
x |= 1
After all threads have joined with my main thread, the main thread branches on the value.
if(x) { ... } else { ... }
Is it possible for there to be a race condition in this situation? My thoughts say no, because it doesn't seem to matter whether or not a thread is interrupted by another thread between reading and writing 'x' (in both cases, either 'x == 1', or 'x == 1'). That said, I want to make sure I'm not missing something stupid obvious or ridiculously subtle.
Also, if you happen to provide an answer to the contrary, please provide an instruction-by-instruction example!
Context:
I'm trying to, in OpenCL, have my threads indicate the presence or absence of a feature among any of their work-items. If any of the threads indicate the presence of the feature, my host ought to be able to branch on the result. I'm thinking of using the above method. If you guys have a better suggestion, that works too!
Detail:
I'm trying to add early-exit to my OpenCL radix-sort implementation, to skip radix passes if the data is banded (i.e. 'x' above would be x[RADIX] and I'd have all work groups, right after partial reduction of the data, indicate presence or absence of elements in the RADIX bins via 'x').
It may work within a work-group. You will need to insert a barrier before testing x. I'm not sure it will be faster than using atomic increments.
It will not work across several work-groups. Imagine you have 1000 work-groups to run on 20 cores. Typically, only a small number of work-groups can be resident on a single core, for example 4, meaning only 80 work-groups can be in flight inside the GPU at a given time. Once a work-group is done executing, it is retired, and another one is started. Halting a kernel in the middle of execution to wait for all 1000 work-groups to reach the same point is impossible.

OpenMP ensemble execution

I am new to the OpenMP and at the moment with no access to my workstation where I can check the details. Had a quick question to set the basics right before moving on to the hands on part.
Suppose I have a serial program written in FORTRAN90 which evolves a map with iterations and gives the final value of the variable after the evolution, the code looks like:
call random_number(xi) !! RANDOM INITIALIZATION OF THE VARIABLE
do i=1,50000 !! ITERATION OF THE SYSTEM
xf=4.d0*xi*(1.d0-xi) !! EVOLUTION OF THE SYSTEM
xi=xf
enddo !! END OF SYSTEM ITERATION
print*, xf
I want to run the same code as independent processes on a cluster for 100 different random initial conditions and see how the output changes with the initial conditions. A serial program for this purpose would look like:
do iter=1,100 !! THE INITIAL CONDITION LOOP
call random_number(xi) !! RANDOM INITIALIZATION OF THE VARIABLE
do i=1,50000 !! ITERATION OF THE SYSTEM
xf=4.d0*xi*(1.d0-xi) !! EVOLUTION OF THE SYSTEM
xi=xf
enddo !! END OF SYSTEM ITERATION
print*, xf
Will the OpenMP implementation that I could think of work? The code I could come up with is as follows:
!$ OMP PARALLEL PRIVATE(xi,xf,i)
!$ OMP DO
do iter=1,100 !! THE INITIAL CONDITION LOOP
call random_number(xi) !! RANDOM INITIALIZATION OF THE VARIABLE
do i=1,50000 !! ITERATION OF THE SYSTEM
xf=4.d0*xi*(1.d0-xi) !! EVOLUTION OF THE SYSTEM
xi=xf
enddo !! END OF SYSTEM ITERATION
print*, xf
!$ OMP ENDDO
!$ OMP END PARALLEL
Thank you in advance for any suggestions or help.
I think that this line
call random_number(xi) !! RANDOM INITIALIZATION OF THE VARIABLE
might cause some problems. Is the implementation of random_number on your system thread-safe ? I haven't a clue, I know nothing about your compiler or operating system. If it isn't thread-safe then your program might do a number of things when the OpenMP threads all start using the random number generator; those things include crashing or deadlocking.
If the implementation is thread-safe you will want to figure out how to ensure that the threads either do or don't all generate the same sequence of random numbers. It's entirely sensible to write programs which use the same random numbers in each thread, or that use different sequences in different threads, but you ought to figure out that what you get is what you want.
And if the random number generator is thread safe and generates different sequences for each thread, do those sequences pass the sort of tests for randomness that a single-threaded random number generator might pass ?
It's quite tricky to generate properly independent sequences of pseudo-random numbers in parallel programs; certainly not something I can cover in the space of an SO answer.
While you figure all that out one workaround which might help would be to generate, in a sequential part of your code, all the random numbers you need (into an array perhaps) and let the different threads read different elements out of the array.
I want to run the same code as independent processes on a cluster
Then you do not want OpenMP. OpenMP is about exploiting parallelism inside a single address space.
I suggest you look at MPI, if you want to operate on a cluster

Parallelizing an algorithm with many exit points?

I'm faced with parallelizing an algorithm which in its serial implementation examines the six faces of a cube of array locations within a much larger three dimensional array. (That is, select an array element, and then define a cube or cuboid around that element 'n' elements distant in x, y, and z, bounded by the bounds of the array.
Each work unit looks something like this (Fortran pseudocode; the serial algorithm is in Fortran):
do n1=nlo,nhi
do o1=olo,ohi
if (somecondition(n1,o1) .eq. .TRUE.) then
retval =.TRUE.
RETURN
endif
end do
end do
Or C pseudocode:
for (n1=nlo,n1<=nhi,n++) {
for (o1=olo,o1<=ohi,o++) {
if(somecondition(n1,o1)!=0) {
return (bool)true;
}
}
}
There are six work units like this in the total algorithm, where the 'lo' and 'hi' values generally range between 10 and 300.
What I think would be best would be to schedule six or more threads of execution, round-robin if there aren't that many CPU cores, ideally with the loops executing in parallel, with the goal the same as the serial algorithm: somecondition() becomes True, execution among all the threads must immediately stop and a value of True set in a shared location.
What techniques exist in a Windows compiler to facilitate parallelizing tasks like this? Obviously, I need a master thread which waits on a semaphore or the completion of the worker threads, so there is a need for nesting and signaling, but my experience with OpenMP is introductory at this point.
Are there message passing mechanisms in OpenMP?
EDIT: If the highest difference between "nlo" and "nhi" or "olo" and "ohi" is eight to ten, that would imply no more than 64 to 100 iterations for this nested loop, and no more than 384 to 600 iterations for the six work units together. Based on that, is it worth parallelizing at all?
Would it be better to parallelize the loop over the array elements and leave this algorithm serial, with multiple threads running the algorithm on different array elements? I'm thinking this from your comment "The time consumption comes from the fact that every element in the array must be tested like this. The arrays commonly have between four million and twenty million elements." The design of implementing the parallelelization of the array elements is also flexible in terms of the number threads. Unless there is a reason that the array elements have to be checked in some order?
It seems that the portion that you are showing us doesn't take that long to execute so making it take less clock time by making it parallel might not be easy ... there is always some overhead to multiple threads, and if there is not much time to gain, parallel code might not be faster.
One possibility is to use OpenMP to parallelize over the 6 loops -- declare logical :: array(6), allow each loop to run to completion, and then retval = any(array). Then you can check this value and return outside the parallelized loop. Add a schedule(dynamic) to the parallel do statement if you do this. Or, have a separate !$omp parallel and then put !$omp do schedule(dynamic) ... !$omp end do nowait around each of the 6 loops.
Or, you can follow the good advice by #M.S.B. and parallelize the outermost loop over the whole array. The problem here is that you cannot have a RETURN inside a parallel loop -- so label the second outermost loop (the largest one within the parallel part), and EXIT that loop -- smth like
retval = .FALSE.
!$omp parallel do default(private) shared(BIGARRAY,retval) schedule(dynamic,1)
do k=1,NN
if(.not. retval) then
outer2: do j=1,NN
do i=1,NN
! --- your loop #1
do n1=nlo,nhi
do o1=olo,ohi
if (somecondition(BIGARRAY(i,j,k),n1,o1)) then
retval =.TRUE.
exit outer2
endif
end do
end do
! --- your loops #2 ... #6 go here
end do
end do outer2
end if
end do
!$omp end parallel do
[edit: the if statement is there presuming that you need to find out if there is at least one element like that in the big array. If you need to figure the condition for every element, you can similarly either add a dummy loop exit or goto, skipping the rest of the processing for that element. Again, use schedule(dynamic) or schedule(guided).]
As a separate point, you might also want to check if it may be a good idea to go through the innermost loop by some larger step (depending on float size), compute a vector of logicals on each iteration and then aggregate the results, eg. smth like if(count(somecondition(x(o1:o1+step,n1,k)))>0); in this case the compiler may be able to vectorize somecondition.
I believe you can do what you want with the task construct introduced in OpenMP 3; Intel Fortran supports tasking in OpenMP. I don't use tasks often so I won't offer you any wonky pseudocode.
You already mentioned the obvious way to stop all threads as soon as any thread finds the ending condition: have each check some shared variable which gives the status of the ending condition, thereby determining whether to break out of the loops. Obviously this is an overhead, so if you decide to take this approach I would suggest a few things:
Use atomics to check the ending condition, this avoids expensive memory flushing as just the variable in question is flushed. Move to OpenMP 3.1, there are some new atomic operations supported.
Check infrequently, maybe like once per outer iteration. You should only be parallelizing large cases to overcome the overhead of multithreading.
This one is optional, but you can try adding compiler hints, e.g. if you expect a certain condition to be false most of the time, the compiler will optimize the code accordingly.
Another (somewhat dirty) approach is to use shared variables for the loop ranges for each thread, maybe use a shared array where index n is for thread n. When one thread finds the ending condition, it changes the loop ranges of all the other threads so that they stop. You'll need the appropriate memory synchronization. Basically the overhead has now moved from checking a dummy variable to synchronizing/checking loop conditions. Again probably not so good to do this frequently, so maybe use shared outer loop variables and private inner loop variables.
On another note, this reminds me of the classic polling versus interrupt problem. Unfortunately I don't think OpenMP supports interrupts where you can send some kind of kill signal to each thread.
There are hacking work-arounds like using a child process for just this parallel work and invoking the operating system scheduler to emulate interrupts, however this is rather tricky to get correct and would make your code extremely unportable.
Update in response to comment:
Try something like this:
char shared_var = 0;
#pragma omp parallel
{
//you should have some method for setting loop ranges for each thread
for (n1=nlo; n1<=nhi; n1++) {
for (o1=olo; o1<=ohi; o1++) {
if (somecondition(n1,o1)!=0) {
#pragma omp atomic write
shared_var = 1; //done marker, this will also trigger the other break below
break; //could instead use goto to break out of both loops in 1 go
}
}
#pragma omp atomic read
private_var = shared_var;
if (private_var!=0) break;
}
}
A suitable parallel approach might be, to let each worker examine a part of the overall problem, exactly as in the serial case and use a local (non-shared) variable for the result (retval). Finally do a reduction over all workers on these local variables into a shared overall result.

What is atomic in boost / C++0x / C++1x / computer sciences?

What is atomic in C/C++ programming ?
I just visited the dearly cppreference.com (well I don't take the title for granted but wait for my story to finish), and the home changed to describe some of the C++0x/C++1x (let's call it C+++, okay ?) new features.
There was a mysterious and never seen by my zombie programmer's eye, the new <atomic>.
I guess its purpose is not to program atomic bombs or black holes (but I highly doubt this could have ANY connection with black holes, I don't know how those 2 words slipped here), but I'd like to know something:
What is the purpose of this feature ? Is it a type ? A function ? Is it a data container ? Is it related to threads ? May it have some relation with python's "import antigravity" ? I mean, we are programming here, we're not bloody physicist or semanticists !
Atomic refers to something which is not divisible.
An atomic expression is one that is actually executed by a single operation.
For example a++ is not atomic, since to exec it you need first to get the value of a, then to sum 1 to it, then to store the result into a.
Reading the value of an int should instead be atomic.
Atomic-ness is important in shared-memory parallel computations (eg: when using threads): because it tells you that an expression will give you the result you're expecting no matter what the other threads are doing.
AFAIK you could use atomic functions to create your own semaphores etc. The name atomic came from atom, you cant break it smaller, so those function calls can't be "broken apart" and paused by the operating system. This is for thread programming.
Is intended for multithreading. It avoids you to have concurrent threads mix operations. An atomic operation is an indivisible operation. You can’t observe such an operation half-done from any thread in the system; it’s either done or not done. With an atomic operation you cannot get a data race between threads. In a real world analogy you will use atomic not for physics but for semaphores and other traffic signals on roads. Cars will be threads, roads will be rules, locations will be data. Semaphores will be atomic. You don't need semaphores when there is only one car on all roads, right?

Thread safety of simultaneous updates of a variable to the same value

Is the following construct thread-safe, assuming that the elements of foo are aligned and sized properly so that there is no word tearing? If not, why not?
Note: The code below is a toy example of what I want to do, not my actual real world scenario. Obviously, there are better ways of coding the observable behavior in my example.
uint[] foo;
// Fill foo with data.
// In thread one:
for(uint i = 0; i < foo.length; i++) {
if(foo[i] < SOME_NUMBER) {
foo[i] = MAGIC_VAL;
}
}
// In thread two:
for(uint i = 0; i < foo.length; i++) {
if(foo[i] < SOME_OTHER_NUMBER) {
foo[i] = MAGIC_VAL;
}
}
This obviously looks unsafe at first glance, so I'll highlight why I think it could be safe:
The only two options are for an element of foo to be unchanged or to be set to MAGIC_VAL.
If thread two sees foo[i] in an intermediate state while it's being updated, only two things can happen: The intermediate state is < SOME_OTHER_NUMBER or it's not. If it is < SOME_OTHER_NUMBER, thread two will also try to set it to MAGIC_VAL. If not, thread two will do nothing.
Edit: Also, what if foo is a long or a double or something, so that updating it can't be done atomically? You may still assume that alignment, etc. is such that updating one element of foo will not affect any other element. Also, the whole point of multithreading in this case is performance, so any type of locking would defeat this.
On a modern multicore processor your code is NOT threadsafe (at least in most languages) without a memory barrier. Simply put, without explicit barriers each thread can see a different entirely copy of foo from caches.
Say that your two threads ran at some point in time, then at some later point in time a third thread read foo, it could see a foo that was completely uninitialized, or the foo of either of the other two threads, or some mix of both, depending on what's happened with CPU memory caching.
My advice - don't try to be "smart" about concurrency, always try to be "safe". Smart will bite you every time. The broken double-checked locking article has some eye-opening insights into what can happen with memory access and instruction reordering in the absence of memory barriers (though specifically about Java and it's (changing) memory model, it's insightful for any language).
You have to be really on top of your language's specified memory model to shortcut barriers. For example, Java allows a variable to be tagged volatile, which combined with a type which is documented as having atomic assignment, can allow unsynchronized assignment and fetch by forcing them through to main memory (so the thread is not observing/updating cached copies).
You can do this safely and locklessly with a compare-and-swap operation. What you've got looks thread safe but the compiler might create a writeback of the unchanged value under some circumstances, which will cause one thread to step on the other.
Also you're probably not getting as much performance as you think out of doing this, because having both threads writing to the same contiguous memory like this will cause a storm of MESI transitions inside the CPU's cache, each of which is quite slow. For more details on multithread memory coherence you can look at section 3.3.4 of Ulrich Drepper's "What Every Programmer Should Know About Memory".
If reads and writes to each array element are atomic (i.e. they're aligned properly with no word tearing as you mentioned), then there shouldn't be any problems in this code. If foo[i] is less than either of SOME_NUMBER or SOME_OTHER_NUMBER, then at least one thread (possibly both) will set it to MAGIC_VAL at some point; otherwise, it will be untouched. With atomic reads and writes, there are no other possibilities.
However, since your situation is more complicated, be very very careful -- make sure that foo[i] is truly only read once per loop and stored in a local variable. If you read it more than once during the same iteration, you could get inconsistent results. Even the slightest change you make to your code could immediately make it unsafe with race conditions, so comment heavily about the code with big red warning signs.
It's bad practice, you should never be in the state where two threads are accessesing the same variable at the same time, regardless of the consequences. The example you give is over simplified, any majority complex samples will almost always have problems associated with it.. ...
Remember: Semaphores are your friend!
That particular example is thread-safe.
There are no intermediate states really involved here.
That particular program would not get confused.
I would suggest a Mutex on the array, though.

Resources