lock free programming - c++ atomic - c++11

I am trying to develop the following lock free code (c++11):
int val_max;
std::array<std::atomic<int>, 255> vector;
if (vector[i] > val_max) {
    val_max = vector[i];
}
The problem is that with many threads (128), the result is not correct. For example, if val_max = 1 and three threads (with vector[i] = 5, 15 and 20) all pass the comparison and then execute the assignment on the next line, there is a data race and the final value depends on which thread writes last.
So I don't know the best way to protect this whole block using only the functions available in C++11 (I cannot use mutexes or locks).
Any suggestions? Thanks in advance!

You need to describe the bigger problem you need to solve, and why you want multiple threads for this.
If you have a lot of data and want to find the maximum, and you want to split that problem, you're doing it wrong. If all threads try to access a shared maximum, not only is it hard to get right, but by the time you have it right, you have fully serialized accesses, thus making the entire thing an exercise in adding complexity and thread overhead to a serial program.
The correct way to make it parallel is to give every thread a chunk of the array to work on (and the array members are not atomics), for which the thread calculates a local maximum, then when all threads are done have one thread find the maximum of the individual results.
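A minimal sketch of that chunked approach, assuming the data sits in a plain std::vector<int>. The name parallel_max and the way the chunks are split are illustrative choices, not something from the question:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: each thread scans its own chunk of plain (non-atomic)
// ints and writes exactly one local maximum; the results are combined by the
// calling thread after every worker has been joined.
int parallel_max(const std::vector<int>& data, std::size_t num_threads) {
    assert(!data.empty() && num_threads > 0);
    // Never use more threads than elements, so that no chunk is empty.
    num_threads = std::min(num_threads, data.size());

    std::vector<int> local_max(num_threads);
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t) {
        const std::size_t begin = t * data.size() / num_threads;
        const std::size_t end = (t + 1) * data.size() / num_threads;
        workers.emplace_back([&data, &local_max, t, begin, end] {
            local_max[t] = *std::max_element(data.data() + begin,
                                             data.data() + end);
        });
    }
    for (auto& w : workers) w.join();

    // Serial reduction over one value per thread.
    return *std::max_element(local_max.begin(), local_max.end());
}
```

The only shared writes are one slot per thread in local_max, and they are read only after join(), so no atomics are needed.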

1. Do an atomic fetch of val_max.
2. If the value fetched is greater than or equal to vector[i], stop, you are done.
3. Do an atomic compare exchange -- compare val_max to the value you read in step 1 and exchange it for the value of vector[i] if it compares.
4. If the compare succeeded, stop, you are done.
5. Go to step 1, you raced with another thread that made forward progress.
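The compare-exchange loop described above could be sketched like this in C++11. It assumes val_max is changed to a std::atomic<int> (in the question it is a plain int), and the names atomic_store_max and max_of are made up for the example:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: raise `target` to at least `candidate` without locks.
void atomic_store_max(std::atomic<int>& target, int candidate) {
    int seen = target.load();  // atomic fetch of the current maximum
    // Stop as soon as the stored value is already >= candidate, or as soon
    // as the exchange succeeds. When the exchange fails, compare_exchange_weak
    // writes the current value into `seen`, which is the "go back and fetch
    // again" step.
    while (seen < candidate &&
           !target.compare_exchange_weak(seen, candidate)) {
    }
    // Holds as long as every writer goes through this function.
    assert(target.load() >= candidate);
}

// Runs one thread per value and returns the resulting maximum.
int max_of(const std::vector<int>& values, int initial) {
    std::atomic<int> val_max{initial};
    std::vector<std::thread> workers;
    for (int v : values)
        workers.emplace_back([&val_max, v] { atomic_store_max(val_max, v); });
    for (auto& w : workers) w.join();
    return val_max.load();
}
```

compare_exchange_weak may fail spuriously, which is harmless here because the loop simply retries.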

Related

Get all possible letter combinations with more than 5 letters

This code successfully would make array with all possible letter combinations with 5 characters.
a = ('aaaaa'..'zzzzz').to_a
However, when I try to go for 6+ characters, it runs for about 10 minutes and then the task is killed. Is there any way for it to actually load without the task being killed? Is it limited by hardware?
You are indeed limited by hardware. In oversimplified terms, there are two limitations that you are facing here: processing power and memory capacity.
Counting sequences of k letters with repetition allowed (n**k) tells us that you are trying to generate and process 26**6 = 308_915_776 elements.
(x..y) creates a Range, which knows how to generate all of its elements, but doesn't eagerly do so. When you call Range#to_a however, your processor tries to generate all those elements. After some time, the process runs out of memory and dies.
To avoid the memory restriction, you could instead take advantage of the fact that Range is also Enumerable. For example:
('aaaaaaa'..'zzzzzzz').each { |seven_letter_word| puts seven_letter_word }
will instantly start printing strings. Eventually (after a lot of waiting) it will loop through all of them.
However, note that this only lets you bypass the memory restriction, not the processing one. For that there are no shortcuts other than understanding the specifics of the problem at hand.
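To illustrate why iterating needs almost no memory while to_a needs a lot, here is a rough sketch of the same idea in C++ (used only to stay consistent with the other examples on this page, not because Ruby needs it). The names next_word and count_words are hypothetical, and the code assumes an ASCII-compatible character set:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: advance `word` to the next lowercase string of the
// same length, like an odometer. Returns false once it wraps past "zz...z".
bool next_word(std::string& word) {
    for (std::size_t i = word.size(); i-- > 0;) {
        if (word[i] != 'z') {
            ++word[i];
            return true;
        }
        word[i] = 'a';  // carry into the next position to the left
    }
    return false;
}

// Visits every lowercase word of the given length while holding only one
// word in memory at a time.
std::uint64_t count_words(std::size_t length) {
    assert(length > 0);
    std::string word(length, 'a');
    std::uint64_t count = 0;
    do {
        ++count;  // a real program would process `word` here
    } while (next_word(word));
    return count;
}
```

The loop still runs 26**length times, so the processing cost is unchanged; only the memory footprint shrinks.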

How Many cuRand States Required for Thread-Unique Random Numbers? (CUDA)

How many cuRand states are required to get unique random numbers in every thread? From other questions posted on the site, some have said that you need one per thread and others say you need one per block.
Does using one cuRand state per thread mean better random numbers?
Does using 1 cuRand state per thread slow down CUDA applications significantly (5000 + threads)?
Also for the implementation of using 1 cuRand state per thread, does this kernel look right and efficient?:
__global__ void myKernel (const double *seeds) // seeds is an array of length = #threads
{
    int tid = ... // set tid = global thread ID
    curandState s;
    curand_init(seeds[tid], 0, 0, &s);
    ....
    double r = curand_uniform(&s); // curand_uniform returns a float in (0, 1]
    ...
}
Assuming that all your threads stay synchronized, you want to generate the random numbers in all threads at the same time, as shown in your sample code. However, from what I understand, you do not need to seed cuRAND differently in each thread. I may be wrong on that one though...
Now, they use the term "block" in the documentation as in "create all your random numbers in one block". They do not mean that one block of threads will do the work; they mean that one block of memory will hold all the random numbers, generated in a single call. So if you need, say, 4096 random numbers in your loop, you should create them all at once at the start, then load them back from memory later... You'll have to test whether that makes things faster in your case. Many memory accesses often slow things down, but calling the generator many times may well be slower, as it certainly needs to reload a heavy set of values to compute the next pseudo-random number(s).
Source:
http://docs.nvidia.com/cuda/curand/host-api-overview.html#performance-notes2

How do goroutines behave on a multi-core processor

I am a newbie in Go language, so please excuse me if my question is very basic. I have written a very simple code:
package main

import (
    "fmt"
    "time"
)

func main() {
    var count int // Default 0
    cptr := &count
    go incr(cptr)
    time.Sleep(100)
    fmt.Println(*cptr)
}

// Increments the value of count through pointer var
func incr(cptr *int) {
    for i := 0; i < 1000; i++ {
        go func() {
            fmt.Println(*cptr)
            *cptr = *cptr + 1
        }()
    }
}
The value of count should increment by one the number of times the loop runs. Consider the cases:
Loop runs for 100 times--> value of count is 100 (Which is correct as the loop runs 100 times).
Loop runs for >510 times --> Value of count is either 508 OR 510. This happens even if it is 100000.
I am running this on an 8 core processor machine.
First of all: prior to Go 1.5, the runtime used a single processor by default, using multiple threads only for blocking system calls, unless you told it to use more processors via GOMAXPROCS.
As of Go 1.5, GOMAXPROCS defaults to the number of CPUs.
Also, the operation *cptr = *cptr + 1 is not guaranteed to be atomic. If you look carefully, it can be split up into 3 operations: fetch old value by dereferencing pointer, increment value, save value into pointer address.
The fact that you're getting 508/510 is due to some magic in the runtime and not defined to stay that way. More information on the behaviour of operations with concurrency can be found in the Go memory model.
You're probably getting the correct values for fewer than 510 started goroutines because, below that number, they are not (yet) being interrupted.
Generally, what you're trying to do is not advisable in any language, and it is not the "Go" way to do concurrency. A very good example of using channels to synchronize is this code walk: Share Memory By Communicating (rather than communicating by sharing memory)
Here is a little example to show you what I mean: use a channel with a buffer of 1 to store the current number, fetch it from the channel when you need it, change it at will, then put it back for others to use.
Your code is racy: you write to the same memory location from different, unsynchronized goroutines without any locking. The result is basically undefined. You must either a) make sure that all the goroutines write one after another in a nice, ordered way, or b) protect each write with e.g. a mutex, or c) use atomic operations.
If you write such code: Always try it under the race detector like $ go run -race main.go and fix all races.
A nice alternative to using channels in this case might be the sync/atomic package, which contains specifically functions for atomically incrementing/decrementing numbers.
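A rough sketch of the atomic-counter idea, written in C++ to match the other examples on this page rather than in Go. In Go the corresponding tools would be atomic.AddInt64 from sync/atomic and a sync.WaitGroup; the function name count_with_atomics is hypothetical:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: start `workers` threads that each add 1 to a shared
// counter, wait for all of them, and return the final count.
int count_with_atomics(int workers) {
    assert(workers >= 0);
    std::atomic<int> count{0};
    std::vector<std::thread> threads;
    threads.reserve(static_cast<std::size_t>(workers));
    for (int i = 0; i < workers; ++i) {
        // fetch_add performs the read, the increment and the write as one
        // indivisible step, so no update can be lost.
        threads.emplace_back([&count] { count.fetch_add(1); });
    }
    // Wait for every worker explicitly instead of sleeping and hoping.
    for (auto& t : threads) t.join();
    return count.load();
}
```

Two things change compared with the racy version: the increment is a single atomic operation, and the result is read only after all workers have finished.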
You are spawning 500 or 1000 goroutines without any synchronization among them. This creates a race condition, which makes the result unpredictable.
Imagine you work in an office, keeping track of the expense balance for your boss.
Your boss was in a meeting with 1000 of his subordinates, and in this meeting he said, "I will pay you all a bonus, so we will have to increment our expense record." He issued the following instructions:
i) Go to Nerve and ask him/her what the current expense balance is.
ii) Call my secretary to ask how much bonus you will receive.
iii) Add your bonus to the current expense balance. Do the math yourself.
iv) Ask Nerve to write down the new expense balance on my authority.
All of the 1000 eager participants rushed to record their bonus and created a race condition.
Say 50 of the eager gophers hit Nerve at (almost) the same time. They ask:
i) "What is the current expense balance?"
-- Nerve says $1000 to all 50 gophers, as they asked the same question at (almost) the same time, when the balance was $1000.
ii) The gophers then call the secretary: "How much bonus should be paid to me?"
The secretary answers, "Just $1, to be fair."
iii) The boss said to do the math, so they all calculate that $1000 + $1 = $1001 should be the new expense balance for the company.
iv) They all ask Nerve to write $1001 as the new balance.
Do you see the problem with the eager gophers' method?
There were 50 units of computation, each adding $1 to the existing balance, but the balance did not increase by $50; it increased by only $1.
Hope that clarifies the problem. Now for the solutions, other contributors gave very good solutions that would be sufficient I believe.
None of those approaches worked for me (newbie here), but I found a better way: http://play.golang.org/p/OcMsuUpv2g
I'm using the sync package to solve the problem and wait for all goroutines to finish, without Sleep or channels.
And don't forget to take a look at that awesome post http://devs.cloudimmunity.com/gotchas-and-common-mistakes-in-go-golang/

Or-equals on constant as reduction operation (ex. value |= 1 ) thread-safe?

Let's say that I have a variable x.
x = 0
I then spawn some number of threads, and each of them may or may not run the following expression WITHOUT the use of atomics.
x |= 1
After all threads have joined with my main thread, the main thread branches on the value.
if(x) { ... } else { ... }
Is it possible for there to be a race condition in this situation? My thoughts say no, because it doesn't seem to matter whether or not a thread is interrupted by another thread between reading and writing 'x' (in both cases, either 'x == 1', or 'x == 1'). That said, I want to make sure I'm not missing something stupid obvious or ridiculously subtle.
Also, if you happen to provide an answer to the contrary, please provide an instruction-by-instruction example!
Context:
I'm trying to, in OpenCL, have my threads indicate the presence or absence of a feature among any of their work-items. If any of the threads indicate the presence of the feature, my host ought to be able to branch on the result. I'm thinking of using the above method. If you guys have a better suggestion, that works too!
Detail:
I'm trying to add early-exit to my OpenCL radix-sort implementation, to skip radix passes if the data is banded (i.e. 'x' above would be x[RADIX] and I'd have all work groups, right after partial reduction of the data, indicate presence or absence of elements in the RADIX bins via 'x').
It may work within a work-group. You will need to insert a barrier before testing x. I'm not sure it will be faster than using atomic increments.
It will not work across several work-groups. Imagine you have 1000 work-groups to run on 20 cores. Typically, only a small number of work-groups can be resident on a single core, for example 4, meaning only 80 work-groups can be in flight inside the GPU at a given time. Once a work-group is done executing, it is retired, and another one is started. Halting a kernel in the middle of execution to wait for all 1000 work-groups to reach the same point is impossible.
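For comparison, the atomic variant mentioned above can be sketched on the host side in C++. This is only an analogue using std::thread, not OpenCL code; in an OpenCL C kernel the counterpart would presumably be atomic_or on a global int. The name any_feature is made up for the example:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: one thread per entry; a thread sets bit 0 of the
// shared flag if its entry reports the feature. The flag is read only after
// every thread has been joined.
int any_feature(const std::vector<bool>& has_feature) {
    std::atomic<int> x{0};
    std::vector<std::thread> workers;
    for (bool present : has_feature) {
        workers.emplace_back([&x, present] {
            // fetch_or is an atomic read-modify-write, so concurrent
            // setters cannot interfere with each other.
            if (present) x.fetch_or(1, std::memory_order_relaxed);
        });
    }
    for (auto& w : workers) w.join();
    const int result = x.load();
    assert(result == 0 || result == 1);  // the only reachable values
    return result;
}
```

Relaxed ordering is enough in this sketch because join() already orders the workers' writes before the final load.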

What is atomic in boost / C++0x / C++1x / computer sciences?

What is atomic in C/C++ programming ?
I just visited the dearly cppreference.com (well I don't take the title for granted but wait for my story to finish), and the home changed to describe some of the C++0x/C++1x (let's call it C+++, okay ?) new features.
There was a mysterious and never seen by my zombie programmer's eye, the new <atomic>.
I guess its purpose is not to program atomic bombs or black holes (but I highly doubt this could have ANY connection with black holes, I don't know how those 2 words slipped here), but I'd like to know something:
What is the purpose of this feature ? Is it a type ? A function ? Is it a data container ? Is it related to threads ? May it have some relation with python's "import antigravity" ? I mean, we are programming here, we're not bloody physicist or semanticists !
Atomic refers to something which is not divisible.
An atomic expression is one that is actually executed by a single operation.
For example a++ is not atomic, since to exec it you need first to get the value of a, then to sum 1 to it, then to store the result into a.
Reading the value of an int, by contrast, is typically atomic on common hardware, although C++ only guarantees this for std::atomic<int>.
Atomic-ness is important in shared-memory parallel computations (eg: when using threads): because it tells you that an expression will give you the result you're expecting no matter what the other threads are doing.
AFAIK you could use atomic functions to create your own semaphores etc. The name atomic comes from atom: you can't break it into anything smaller, so those function calls can't be "broken apart" and paused by the operating system. This is for thread programming.
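As an illustration of building a synchronization primitive out of atomics, here is a minimal sketch of a spinlock based on std::atomic_flag. A spinlock is used here instead of a full semaphore because it is shorter; SpinLock and guarded_count are hypothetical names:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical minimal spinlock built on std::atomic_flag.
class SpinLock {
public:
    void lock() {
        // test_and_set atomically sets the flag and returns its previous
        // value, so only a thread that saw "false" gets past the loop.
        while (flag_.test_and_set(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    }
    void unlock() { flag_.clear(std::memory_order_release); }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Increments a plain int from several threads, guarded by the spinlock.
int guarded_count(int threads, int increments_each) {
    assert(threads >= 0 && increments_each >= 0);
    SpinLock lock;
    int count = 0;  // deliberately not atomic; the lock protects it
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&lock, &count, increments_each] {
            for (int i = 0; i < increments_each; ++i) {
                lock.lock();
                ++count;
                lock.unlock();
            }
        });
    }
    for (auto& w : workers) w.join();
    return count;
}
```

In real code std::mutex is usually the better choice; the point of the sketch is only that one indivisible test_and_set is enough to build mutual exclusion.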
It is intended for multithreading. It prevents concurrent threads from interleaving their operations. An atomic operation is an indivisible operation. You can't observe such an operation half-done from any thread in the system; it's either done or not done. With an atomic operation you cannot get a data race between threads. In a real-world analogy, you will use atomic not for physics but for semaphores and other traffic signals on roads. Cars will be threads, roads will be rules, locations will be data. Semaphores will be atomic. You don't need semaphores when there is only one car on all roads, right?

Resources