What is atomic in C/C++ programming ?
I just visited the dearly cppreference.com (well I don't take the title for granted but wait for my story to finish), and the home changed to describe some of the C++0x/C++1x (let's call it C+++, okay ?) new features.
There was a mysterious and never seen by my zombie programmer's eye, the new <atomic>.
I guess its purpose is not to program atomic bombs or black holes (but I highly doubt this could have ANY connection with black holes, I don't know how those 2 words slipped here), but I'd like to know something:
What is the purpose of this feature ? Is it a type ? A function ? Is it a data container ? Is it related to threads ? May it have some relation with python's "import antigravity" ? I mean, we are programming here, we're not bloody physicist or semanticists !
Atomic refers to something which is not divisible.
An atomic expression is one that is actually executed by a single operation.
For example a++ is not atomic, since to exec it you need first to get the value of a, then to sum 1 to it, then to store the result into a.
Reading the value of an int should instead be atomic.
Atomic-ness is important in shared-memory parallel computations (eg: when using threads): because it tells you that an expression will give you the result you're expecting no matter what the other threads are doing.
AFAIK you could use atomic functions to create your own semaphores etc. The name atomic came from atom, you cant break it smaller, so those function calls can't be "broken apart" and paused by the operating system. This is for thread programming.
Is intended for multithreading. It avoids you to have concurrent threads mix operations. An atomic operation is an indivisible operation. You can’t observe such an operation half-done from any thread in the system; it’s either done or not done. With an atomic operation you cannot get a data race between threads. In a real world analogy you will use atomic not for physics but for semaphores and other traffic signals on roads. Cars will be threads, roads will be rules, locations will be data. Semaphores will be atomic. You don't need semaphores when there is only one car on all roads, right?
Related
Concerning NVIDIA GPU the author in High Performance and Scalable GPU Graph Traversal paper says:
1-A sequence of kernel invocations is bulk- synchronous: each kernel is
initially presented with a consistent view of the results from the
previous.
2-Prefix-sum is a bulk-synchronous algorithmic primitive
I am unable to understand these two points (I know GPU based prefix sum though), Can smeone help me this concept.
1-A sequence of kernel invocations is bulk- synchronous: each kernel is initially presented with a consistent view of the results from the previous.
It's about parallel computation model: each processor has its own memory which is fast (like cache in CPU) and is performing computation using values stored there without any synchronization. Then non-blocking synchronization takes place - processor puts data it has computed so far and gets data from its neighbours. Then there's another synchronization - barrier, which makes all of them wait for each other.
2-Prefix-sum is a bulk-synchronous algorithmic primitive
I believe that's about the second step of BSP model - synchronization. That's the way processors store and get data for the next step.
Name of the model implies that it is highly concurrent (many many processes that work synchronously relatively to each other). And this is how we get to the second point.
As far as we want to live up to the name (be highly concurrent) we want get rid of sequential parts where it is possible. We can achieve that with prefix-sum.
Consider prefix-sum associative operator +. Then scan on set [5 2 0 3 1] returns the set [0 5 7 7 10 11]. So, now we can replace such sequential pseudocode:
foreach i = 1...n
foo[i] = foo[i-1] + bar(i);
with this pseudocode, which now can be parallel(!):
foreach(i)
baz[i] = bar(i);
scan(foo, baz);
That is very much naive version, but it's okay for explanation.
Let's say that I have a variable x.
x = 0
I then spawn some number of threads, and each of them may or may not run the following expression WITHOUT the use of atomics.
x |= 1
After all threads have joined with my main thread, the main thread branches on the value.
if(x) { ... } else { ... }
Is it possible for there to be a race condition in this situation? My thoughts say no, because it doesn't seem to matter whether or not a thread is interrupted by another thread between reading and writing 'x' (in both cases, either 'x == 1', or 'x == 1'). That said, I want to make sure I'm not missing something stupid obvious or ridiculously subtle.
Also, if you happen to provide an answer to the contrary, please provide an instruction-by-instruction example!
Context:
I'm trying to, in OpenCL, have my threads indicate the presence or absence of a feature among any of their work-items. If any of the threads indicate the presence of the feature, my host ought to be able to branch on the result. I'm thinking of using the above method. If you guys have a better suggestion, that works too!
Detail:
I'm trying to add early-exit to my OpenCL radix-sort implementation, to skip radix passes if the data is banded (i.e. 'x' above would be x[RADIX] and I'd have all work groups, right after partial reduction of the data, indicate presence or absence of elements in the RADIX bins via 'x').
It may work within a work-group. You will need to insert a barrier before testing x. I'm not sure it will be faster than using atomic increments.
It will not work across several work-groups. Imagine you have 1000 work-groups to run on 20 cores. Typically, only a small number of work-groups can be resident on a single core, for example 4, meaning only 80 work-groups can be in flight inside the GPU at a given time. Once a work-group is done executing, it is retired, and another one is started. Halting a kernel in the middle of execution to wait for all 1000 work-groups to reach the same point is impossible.
I have a function that I use to look up a value based on an index. The value takes some time to calculate, so I want to do it with ParallelMap, and references another similar such function that returns a list of expressions, also based on an index.
However, when I set it all up in a seemingly reasonable fashion, I see some very bizarre behaviour. First, I see that the function appears to work, albeit very slowly. For large indexes, however, the processor activity in Taskmangler stays entirely at zero for an extended period of time (i.e. 2-4 minutes) where all instances of Mathematica are seemingly inert. Then, without the slightest blip of CPU use, a result appears. Is this another case of Mathematica spukhafte Fernwirkung?
That is, I want to create a variable/function that stores an expression, here a list of integers (ListOfInts), and then on the parallel workers I want to perform some function on that expression (here I apply a set of replacement rules and take the Min). I want the result of that function to also be indexed by the same index under another variable/function (IndexedFunk), whose result is then available back on the main instance of Mathematica:
(*some arbitrary rules that will convert some of the integers to negative values:*)
rulez=Dispatch[Thread[Rule[Range[222],-Range[222]]]];
maxIndex = 333;
Clear[ListOfInts]
Scan[(ListOfInts[#]=RandomInteger[{1,999},55])&,Range[maxIndex ]]
(*just for safety's sake:*)
DistributeDefinitions[rulez, ListOfInts]
Clear[IndexedFunk]
(*I believe I have to have at least one value of IndexedFunk defined before I Share the definition to the workers:*)
IndexedFunk[1]=Min[ListOfInts[1]]/.rulez
(*... and this should let me retrieve the values back on the primary instance of MMA:*)
SetSharedFunction[IndexedFunk]
(*Now, here is the mysterious part: this just sits there on my multiprocessor machine for many minutes until suddenly a result appears. If I up maxIndex to say 99999 (and of course re-execute the above code again) then the effect can more clearly be seen.*)
AbsoluteTiming[Short[ParallelMap[(IndexedFunk[#]=Min[ListOfInts[#]/.rulez])&, Range[maxIndex]]]]
I believe this is some bug, but then I am still trying to figure out Mathematica Parallel, so I can't be too confident in this conclusion. Despite its being depressingly slow, it is nonetheless impressive in its ability to perform calculations without actually requiring a CPU to do so.
I thought perhaps it was due to whatever communications protocol is being used between the master and slave processes, perhaps it is so slow that it just appears that the processors are doing nothing when if fact they are just waiting to send the next bit of some definition or other. In which case I thought ParallelMap[..., Method->"CoarsestGrained"] would be of some use. But no, that doesn't work neither.
A question: "Am I doing something obviously wrong, or is this a bug?"
I am afraid you are. The problem is with the shared definition of a variable. Mathematica maintains a single coherent value in all copies of the variable across kernels, and therefore that variable becomes a single point of huge contention. CPU is idle because kernels line up to the queue waiting for the variable IndexedFunk, and most time is spent in interprocess or inter-machine communication. Go figure.
By the way, there is no function SetSharedDefinition in any Mathematica version I know of. You probably intended to write SetSharedVariable. But remove that evil call anyway! To avoid contention, return results from the parallelized computation as a list of pairs, and then assemble them into downvalues of your variable at the main kernel:
Clear[IndexedFunk]
Scan[(IndexedFunk[#[[1]]] = #[[2]]) &,
ParallelMap[{#, Min[ListOfInts[#] /. rulez]} &, Range[maxIndex]]
]
ParallelMap takes care of distributing definition automagically, so the call to DistributeDefinitions is superfluous. (As a minor note, it is not correct as written, omitting the maxIndex variable, but the omission is automatically taken care of by ParallelMap in this particular case.)
EDIT, NB!: The automatic distribution applies only to the version 8 of Mathematica. Thanks #MikeHoneychurch for the correction.
Consider something like...
for (int i = 0; i < test.size(); ++i) {
test[i].foo();
test[i].bar();
}
Now consider..
for (int i = 0; i < test.size(); ++i) {
test[i].foo();
}
for (int i = 0; i < test.size(); ++i) {
test[i].bar();
}
Is there a large difference in time spent between these two? I.e. what is the cost of the actual iteration? It seems like the only real operations you are repeating are an increment and a comparison (though I suppose this would become significant for a very large n). Am I missing something?
First, as noted above, if your compiler can't optimize the size() method out so it's just called once, or is nothing more than a single read (no function call overhead), then it will hurt.
There is a second effect you may want to be concerned with, though. If your container size is large enough, then the first case will perform faster. This is because, when it gets to test[i].bar(), test[i] will be cached. The second case, with split loops, will thrash the cache, since test[i] will always need to be reloaded from main memory for each function.
Worse, if your container (std::vector, I'm guessing) has so many items that it won't all fit in memory, and some of it has to live in swap on your disk, then the difference will be huge as you have to load things in from disk twice.
However, there is one final thing that you have to consider: all this only makes a difference if there is no order dependency between the function calls (really, between different objects in the container). Because, if you work it out, the first case does:
test[0].foo();
test[0].bar();
test[1].foo();
test[1].bar();
test[2].foo();
test[2].bar();
// ...
test[test.size()-1].foo();
test[test.size()-1].bar();
while the second does:
test[0].foo();
test[1].foo();
test[2].foo();
// ...
test[test.size()-1].foo();
test[0].bar();
test[1].bar();
test[2].bar();
// ...
test[test.size()-1].bar();
So if your bar() assumes that all foo()'s have run, you will break it if you change the second case to the first. Likewise, if bar() assumes that foo() has not been run on later objects, then moving from the second case to the first will break your code.
So be careful and document what you do.
There are many aspects in such comparison.
First, complexity for both options is O(n), so difference isn't very big anyway. I mean, you must not care about it if you write quite big and complex program with a large n and "heavy" operations .foo() and bar(). So, you must care about it only in case of very small simple programs (this is kind of programs for embedded devices, for example).
Second, it will depend on programming language and compiler. I'm assured that, for instance, most of C++ compilers will optimize your second option to produce same code as for the first one.
Third, if compiler haven't optimized your code, performance difference will heavily depend on the target processor. Consider loop in a term of assembly commands - it will look something like this (pseudo assembly language):
LABEL L1:
do this ;; some commands
call that
IF condition
goto L1
;; some more instructions, ELSE part
I.e. every loop passage is just IF statement. But modern processors don't like IF. This is because processors may rearrange instructions to execute them beforehand or just to avoid idles. With the IF (in fact, conditional goto or jump) instructions, processors do not know if they may rearrange operation or not.
There's also a mechanism called branch predictor. From material of Wikipedia:
branch predictor is a digital circuit that tries to guess which way a branch (e.g. an if-then-else structure) will go before this is known for sure.
This "soften" effect of IF's, through if the predictor's guess is wrong, no optimization will be performed.
So, you can see that there's a big amount of conditions for both your options: target language and compiler, target machine, it's processor and branch predictor. This all makes very complex system, and you cannot foresee what exact result you will get. I believe, that if you don't deal with embedded systems or something like that, the best solution is just to use the form which your are more comfortable with.
For your examples you have the additional concern of how expensive .size() is, since it's compared for each time i increments in most languages.
How expensive is it? Well that depends, it's certainly all relative. If .foo() and .bar() are expensive, the cost of the actual iteration is probably minuscule in comparison. If they're pretty lightweight, then it'll be a larger percentage of your execution time. If you want to know about a particular case test it, this is the only way to be sure about your specific scenario.
Personally, I'd go with the single iteration to be on the cheap side for sure (unless you need the .foo() calls to happen before the .bar() calls).
I assume .size() will be constant. Otherwise, the first code example might not give the same as the second one.
Most compilers would probably store .size() in a variable before the loop starts, so the .size() time will be cut down.
Therefore the time of the stuff inside the two for loops will be the same, but the other part will be twice as much.
Performance tag, right.
As long as you are concentrating on the "cost" of this or that minor code segment, you are oblivious to the bigger picture (isolation); and your intention is to justify something that, at a higher level (outside your isolated context), is simply bad practice, and breaks guidelines. The question is too low level and therefore too isolated. A system or program which is set of integrated components will perform much better that a collection of isolated components.
The fact that this or that isolated component (work inside the loop) is fast or faster is irrelevant when the loop itself is repeated unnecessarily, and which would therefore take twice the time.
Given that you have one family car (CPU), why on Earth would you:
sit at home and send your wife out to do her shopping
wait until she returns
take the car, go out and do your shopping
leaving her to wait until you return
If it needs to be stated, you would spend (a) almost half of your hard-earned resources executing one trip and shopping at the same time and (b) have those resources available to have fun together when you get home.
It has nothing to do with the price of petrol at 9:00 on a Saturday, or the time it takes to grind coffee at the café, or cost of each iteration.
Yes, there is a large diff in the time and the resources used. But the cost is not merely in the overhead per iteration; it is in the overall cost of the one organised trip vs the two serial trips.
Performance is about architecture; never doing anything twice (that you can do once), which are the higher levels of organisation; integrated of the parts that make up the whole. It is not about counting pennies at the bowser or cycles per iteration; those are lower orders of organisation; which ajust a collection of fragmented parts (not a systemic whole).
Masseratis cannot get through traffic jams any faster than station wagons.
Is the following construct thread-safe, assuming that the elements of foo are aligned and sized properly so that there is no word tearing? If not, why not?
Note: The code below is a toy example of what I want to do, not my actual real world scenario. Obviously, there are better ways of coding the observable behavior in my example.
uint[] foo;
// Fill foo with data.
// In thread one:
for(uint i = 0; i < foo.length; i++) {
if(foo[i] < SOME_NUMBER) {
foo[i] = MAGIC_VAL;
}
}
// In thread two:
for(uint i = 0; i < foo.length; i++) {
if(foo[i] < SOME_OTHER_NUMBER) {
foo[i] = MAGIC_VAL;
}
}
This obviously looks unsafe at first glance, so I'll highlight why I think it could be safe:
The only two options are for an element of foo to be unchanged or to be set to MAGIC_VAL.
If thread two sees foo[i] in an intermediate state while it's being updated, only two things can happen: The intermediate state is < SOME_OTHER_NUMBER or it's not. If it is < SOME_OTHER_NUMBER, thread two will also try to set it to MAGIC_VAL. If not, thread two will do nothing.
Edit: Also, what if foo is a long or a double or something, so that updating it can't be done atomically? You may still assume that alignment, etc. is such that updating one element of foo will not affect any other element. Also, the whole point of multithreading in this case is performance, so any type of locking would defeat this.
On a modern multicore processor your code is NOT threadsafe (at least in most languages) without a memory barrier. Simply put, without explicit barriers each thread can see a different entirely copy of foo from caches.
Say that your two threads ran at some point in time, then at some later point in time a third thread read foo, it could see a foo that was completely uninitialized, or the foo of either of the other two threads, or some mix of both, depending on what's happened with CPU memory caching.
My advice - don't try to be "smart" about concurrency, always try to be "safe". Smart will bite you every time. The broken double-checked locking article has some eye-opening insights into what can happen with memory access and instruction reordering in the absence of memory barriers (though specifically about Java and it's (changing) memory model, it's insightful for any language).
You have to be really on top of your language's specified memory model to shortcut barriers. For example, Java allows a variable to be tagged volatile, which combined with a type which is documented as having atomic assignment, can allow unsynchronized assignment and fetch by forcing them through to main memory (so the thread is not observing/updating cached copies).
You can do this safely and locklessly with a compare-and-swap operation. What you've got looks thread safe but the compiler might create a writeback of the unchanged value under some circumstances, which will cause one thread to step on the other.
Also you're probably not getting as much performance as you think out of doing this, because having both threads writing to the same contiguous memory like this will cause a storm of MESI transitions inside the CPU's cache, each of which is quite slow. For more details on multithread memory coherence you can look at section 3.3.4 of Ulrich Drepper's "What Every Programmer Should Know About Memory".
If reads and writes to each array element are atomic (i.e. they're aligned properly with no word tearing as you mentioned), then there shouldn't be any problems in this code. If foo[i] is less than either of SOME_NUMBER or SOME_OTHER_NUMBER, then at least one thread (possibly both) will set it to MAGIC_VAL at some point; otherwise, it will be untouched. With atomic reads and writes, there are no other possibilities.
However, since your situation is more complicated, be very very careful -- make sure that foo[i] is truly only read once per loop and stored in a local variable. If you read it more than once during the same iteration, you could get inconsistent results. Even the slightest change you make to your code could immediately make it unsafe with race conditions, so comment heavily about the code with big red warning signs.
It's bad practice, you should never be in the state where two threads are accessesing the same variable at the same time, regardless of the consequences. The example you give is over simplified, any majority complex samples will almost always have problems associated with it.. ...
Remember: Semaphores are your friend!
That particular example is thread-safe.
There are no intermediate states really involved here.
That particular program would not get confused.
I would suggest a Mutex on the array, though.