I've seen many questions scattered across the Internet about branch divergence, and how to avoid it. However, even after reading dozens of articles on how CUDA works, I can't seem to see how avoiding branch divergence helps in most cases. Before anyone jumps on me with claws outstretched, allow me to describe what I consider to be "most cases".
It seems to me that most instances of branch divergence involve a number of truly distinct blocks of code. For example, we have the following scenario:
if (A):
    foo(A)
else:
    bar(B)
If we have two threads that encounter this divergence, thread 1 will execute first, taking path A. Following this, thread 2 will take path B. In order to remove the divergence, we might change the block above to read like this:
foo(A)
bar(B)
Assuming it is safe to call foo(A) on thread 2 and bar(B) on thread 1, one might expect performance to improve. However, here's the way I see it:
In the first case, threads 1 and 2 execute in serial. Call this two clock cycles.
In the second case, threads 1 and 2 execute foo(A) in parallel, then execute bar(B) in parallel. This still looks to me like two clock cycles. The difference is that in the former case, if foo(A) involves a read from memory, I imagine thread 2 can begin execution during that latency, which results in latency hiding. If this is the case, the branch-divergent code is faster.
You're assuming (at least it's the example you give and the only reference you make) that the only way to avoid branch divergence is to allow all threads to execute all the code.
In that case I agree there's not much difference.
But avoiding branch divergence probably has more to do with algorithm re-structuring at a higher level than just the addition or removal of some if statements and making code "safe" to execute in all threads.
I'll offer up one example. Suppose I know that odd threads will need to handle the blue component of a pixel and even threads will need to handle the green component:
#define N 2 // number of pixel components
#define BLUE 0
#define GREEN 1
// pixel order: px0BL px0GR px1BL px1GR ...
if (threadIdx.x & 1) foo(pixel(N*threadIdx.x+BLUE));
else bar(pixel(N*threadIdx.x+GREEN));
This means that every alternate thread is taking a given path, whether it be foo or bar. So now my warp takes twice as long to execute.
However, if I rearrange my pixel data so that the color components are contiguous perhaps in chunks of 32 pixels:
BL0 BL1 BL2 ... GR0 GR1 GR2 ...
I can write similar code:
if (threadIdx.x & 32) foo(pixel(threadIdx.x));
else bar(pixel(threadIdx.x));
It still looks like I have the possibility for divergence. But since the divergence happens on warp boundaries, a given warp executes either the if path or the else path, so no actual divergence occurs.
This is a trivial example, and probably stupid, but it illustrates that there may be ways to work around warp divergence that don't involve running all the code of all the divergent paths.
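To make the difference between the two predicates concrete, here is a small host-side model in Python (not CUDA, and not a description of how the hardware is implemented). It counts how many distinct branch outcomes occur inside each warp, on the assumption that a warp needs one serialized pass per distinct path taken by its threads. The names WARP_SIZE and paths_per_warp are made up for this sketch.

```python
WARP_SIZE = 32  # assumption: 32 threads per warp


def paths_per_warp(predicate, num_threads):
    """For each warp, count the distinct branch outcomes among its threads."""
    counts = []
    for start in range(0, num_threads, WARP_SIZE):
        outcomes = {bool(predicate(t)) for t in range(start, start + WARP_SIZE)}
        counts.append(len(outcomes))
    return counts


# Interleaved layout: odd and even threads take different paths.
interleaved = paths_per_warp(lambda t: t & 1, 128)
# Chunked layout: the branch outcome only changes at multiples of 32.
chunked = paths_per_warp(lambda t: t & 32, 128)

print(interleaved)  # [2, 2, 2, 2]
print(chunked)      # [1, 1, 1, 1]
```

Under that assumption, every warp in the interleaved layout runs both paths, while every warp in the chunked layout runs exactly one.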
Related
Let's say that I have a variable x.
x = 0
I then spawn some number of threads, and each of them may or may not run the following expression WITHOUT the use of atomics.
x |= 1
After all threads have joined with my main thread, the main thread branches on the value.
if(x) { ... } else { ... }
Is it possible for there to be a race condition in this situation? My thoughts say no, because it doesn't seem to matter whether or not a thread is interrupted by another thread between reading and writing 'x' (either way, the result is 'x == 1'). That said, I want to make sure I'm not missing something stupid obvious or ridiculously subtle.
Also, if you happen to provide an answer to the contrary, please provide an instruction-by-instruction example!
Context:
I'm trying to, in OpenCL, have my threads indicate the presence or absence of a feature among any of their work-items. If any of the threads indicate the presence of the feature, my host ought to be able to branch on the result. I'm thinking of using the above method. If you guys have a better suggestion, that works too!
Detail:
I'm trying to add early-exit to my OpenCL radix-sort implementation, to skip radix passes if the data is banded (i.e. 'x' above would be x[RADIX] and I'd have all work groups, right after partial reduction of the data, indicate presence or absence of elements in the RADIX bins via 'x').
It may work within a work-group. You will need to insert a barrier before testing x. I'm not sure it will be faster than using atomic increments.
It will not work across several work-groups. Imagine you have 1000 work-groups to run on 20 cores. Typically, only a small number of work-groups can be resident on a single core, for example 4, meaning only 80 work-groups can be in flight inside the GPU at a given time. Once a work-group is done executing, it is retired, and another one is started. Halting a kernel in the middle of execution to wait for all 1000 work-groups to reach the same point is impossible.
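The question's interleaving argument can be checked by brute force. The sketch below is plain Python, not OpenCL: it models each thread's update as a separate load and store and enumerates every ordering of those steps for two threads. It assumes sequentially consistent interleavings only, so it says nothing about compiler reordering, the OpenCL memory model, or when a write becomes visible to other work-groups or to the host. The helper final_values is hypothetical.

```python
from itertools import permutations


def final_values(update, num_threads=2):
    """Return the set of final x values over all load/store interleavings."""
    steps = [(t, op) for t in range(num_threads) for op in ("load", "store")]
    results = set()
    for order in permutations(steps):
        # Keep only orders in which each thread loads before it stores.
        if any(order.index((t, "load")) > order.index((t, "store"))
               for t in range(num_threads)):
            continue
        x = 0
        regs = {}  # per-thread register holding the loaded value
        for t, op in order:
            if op == "load":
                regs[t] = x
            else:
                x = update(regs[t])
        results.add(x)
    return results


or_results = final_values(lambda r: r | 1)   # x |= 1
add_results = final_values(lambda r: r + 1)  # x += 1, for contrast

print(or_results)   # {1}
print(add_results)  # {1, 2}
```

In this simplified model the OR update always ends at 1 because every store writes the same value, whereas the increment can lose an update.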
Consider the following two loops where N = 10^9 or something large enough to notice inefficiencies with.
Loop x = 1 to N
    total += A(x)
    total += B(x)
or
Loop x = 1 to N
    total += A(x)
Loop x = 1 to N
    total += B(x)
Where each function takes x, performs some arbitrary arithmetic calculation (e.g. x^2 and 3x^3 or something, doesn't matter), and returns a value.
Are there going to be any differences in overall runtime, and when would this not be the case, if at all?
Each loop requires four actions:
1. Preparation (once per loop)
2. Checking of the stopping condition (once per iteration)
3. Executing the body of the loop (once per iteration)
4. Adjusting the values used to determine if the iteration should continue (once per iteration)
When you have one loop, you "pay" for items 1, 2 and 4 only once; when you have two loops, you "pay" for those overhead items twice, while the total work done in the loop bodies (item 3) stays the same.
Assuming that the order of invoking the two functions is not important, the difference will not be noticeable in most common situations. However, in very uncommon situations of extremely tight loops a single loop will take less CPU resources. In fact, a common technique of loop unwinding relies on reducing the share of per-iteration checks and setup operations in the overall CPU load during the loop by repeating the body several times, and reducing the number of iterations by the corresponding factor.
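As a rough illustration of the bookkeeping described above, the following Python sketch counts stopping-condition checks for a fused loop, two split loops, and a fused loop unrolled by a factor of four. The functions fused, split and unrolled are invented for this example, A and B are arbitrary arithmetic stand-ins, and the unrolled version assumes n is a multiple of the factor. It counts checks only; it does not measure time.

```python
def fused(n, a, b):
    total, checks, x = 0, 0, 1
    while True:
        checks += 1            # stopping-condition check
        if x > n:
            break
        total += a(x)
        total += b(x)
        x += 1
    return total, checks


def split(n, a, b):
    total, checks = 0, 0
    for f in (a, b):           # one full pass per function
        x = 1
        while True:
            checks += 1
            if x > n:
                break
            total += f(x)
            x += 1
    return total, checks


def unrolled(n, a, b, factor=4):
    """Fused loop with the body repeated `factor` times per check."""
    total, checks, x = 0, 0, 1
    while True:
        checks += 1
        if x > n:
            break
        for k in range(factor):  # stands in for a textually repeated body
            total += a(x + k) + b(x + k)
        x += factor
    return total, checks


A = lambda x: x * x
B = lambda x: 3 * x ** 3
n = 1000
print(fused(n, A, B)[1], split(n, A, B)[1], unrolled(n, A, B)[1])  # 1001 2002 251
```

All three variants compute the same total; only the number of checks differs.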
There are a few things to think about. One is you're doing twice as many instructions for the loop itself (condition check, incrementing x, etc) in the second version. If your functions are really trivial, that could be a major cost.
However, in more realistic situations, cache performance, register pressure, and things like that are going to make a bigger difference. For instance, if both functions need a lot of registers, you might find that the first version performs worse than the second because the compiler has to spill more registers to memory when both bodies share one loop. Or, if A and B both access the same memory, the first version might be faster than the second because all of B's accesses will be cache hits in the first version but misses in the second.
All of this is highly program- and platform-specific. If there's some particular program you want to optimize, you need to benchmark it.
The primary difference is that the first one tests X against N, N times, while the second one tests X against N, 2N times.
There is a slight overhead on the loop itself.
In each iteration you need to do at least 2 operations: increase the counter, and then compare it to the end value.
So you're doing 2*10^9 more operations.
If both functions used lots of memory, for example if they created some big array and recursively modified it in each iteration, it is possible that the first loop would be slower due to memory caching effects.
There are a lot of potential factors to be considered:
1) number of iterations -- does loop setup dominate over the task
2) loop comparison penalty vs. the task complexity
for (i=0;i<2;i++) a[i]=b[i];
3) general complexity of function
- with two complex functions one might run out of registers
4) register dependencies -- are the tasks serial in nature?
- two independent tasks intermixed vs. result of other loop depends on the first one
5) can the loop be executed completely on a prefetch queue -- no need for cache access
- mixing in the second tasks may ruin the throughput
6) what kind of cache hit patterns there are
I'm faced with parallelizing an algorithm which in its serial implementation examines the six faces of a cube of array locations within a much larger three dimensional array. (That is, select an array element, and then define a cube or cuboid around that element 'n' elements distant in x, y, and z, bounded by the bounds of the array.)
Each work unit looks something like this (Fortran pseudocode; the serial algorithm is in Fortran):
do n1=nlo,nhi
  do o1=olo,ohi
    if (somecondition(n1,o1)) then
      retval = .TRUE.
      RETURN
    endif
  end do
end do
Or C pseudocode:
for (n1=nlo; n1<=nhi; n1++) {
    for (o1=olo; o1<=ohi; o1++) {
        if (somecondition(n1,o1)!=0) {
            return (bool)true;
        }
    }
}
There are six work units like this in the total algorithm, where the 'lo' and 'hi' values generally range between 10 and 300.
What I think would be best would be to schedule six or more threads of execution, round-robin if there aren't that many CPU cores, ideally with the loops executing in parallel, with the goal the same as in the serial algorithm: as soon as somecondition() becomes True, execution among all the threads must stop immediately and a value of True must be set in a shared location.
What techniques exist in a Windows compiler to facilitate parallelizing tasks like this? Obviously, I need a master thread which waits on a semaphore or the completion of the worker threads, so there is a need for nesting and signaling, but my experience with OpenMP is introductory at this point.
Are there message passing mechanisms in OpenMP?
EDIT: If the highest difference between "nlo" and "nhi" or "olo" and "ohi" is eight to ten, that would imply no more than 64 to 100 iterations for this nested loop, and no more than 384 to 600 iterations for the six work units together. Based on that, is it worth parallelizing at all?
Would it be better to parallelize the loop over the array elements and leave this algorithm serial, with multiple threads running the algorithm on different array elements? I'm thinking this from your comment "The time consumption comes from the fact that every element in the array must be tested like this. The arrays commonly have between four million and twenty million elements." The design of parallelizing over the array elements is also flexible in terms of the number of threads. Unless there is a reason that the array elements have to be checked in some order?
It seems that the portion that you are showing us doesn't take that long to execute so making it take less clock time by making it parallel might not be easy ... there is always some overhead to multiple threads, and if there is not much time to gain, parallel code might not be faster.
One possibility is to use OpenMP to parallelize over the 6 loops -- declare logical :: array(6), allow each loop to run to completion, and then retval = any(array). Then you can check this value and return outside the parallelized loop. Add a schedule(dynamic) to the parallel do statement if you do this. Or, have a separate !$omp parallel and then put !$omp do schedule(dynamic) ... !$omp end do nowait around each of the 6 loops.
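The "one flag per work unit, then any()" structure can be sketched outside of OpenMP as well. The Python below is only a structural illustration: somecondition, the six sets of loop bounds in faces, and work_unit are hypothetical stand-ins, and CPython threads will not actually speed up CPU-bound work the way an OpenMP parallel do would.

```python
from concurrent.futures import ThreadPoolExecutor


def somecondition(n1, o1):
    # Hypothetical stand-in for the real test.
    return n1 == 7 and o1 == 3


def work_unit(bounds):
    """Scan one face; the early return only ends this work unit."""
    nlo, nhi, olo, ohi = bounds
    for n1 in range(nlo, nhi + 1):
        for o1 in range(olo, ohi + 1):
            if somecondition(n1, o1):
                return True
    return False


# Six hypothetical faces, each with its own (nlo, nhi, olo, ohi) bounds.
faces = [(0, 5, 0, 5), (6, 9, 0, 5), (0, 5, 6, 9),
         (10, 12, 0, 3), (0, 3, 10, 12), (13, 15, 13, 15)]

with ThreadPoolExecutor(max_workers=6) as pool:
    flags = list(pool.map(work_unit, faces))  # plays the role of array(6)

retval = any(flags)
print(flags, retval)
```

Each work unit writes only its own result, so no synchronization is needed until the final any().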
Or, you can follow the good advice by @M.S.B. and parallelize the outermost loop over the whole array. The problem here is that you cannot have a RETURN inside a parallel loop -- so label the second outermost loop (the largest one within the parallel part), and EXIT that loop -- something like
retval = .FALSE.
!$omp parallel do default(private) shared(BIGARRAY,retval) schedule(dynamic,1)
do k=1,NN
  if(.not. retval) then
    outer2: do j=1,NN
      do i=1,NN
        ! --- your loop #1
        do n1=nlo,nhi
          do o1=olo,ohi
            if (somecondition(BIGARRAY(i,j,k),n1,o1)) then
              retval =.TRUE.
              exit outer2
            endif
          end do
        end do
        ! --- your loops #2 ... #6 go here
      end do
    end do outer2
  end if
end do
!$omp end parallel do
[edit: the if statement is there presuming that you need to find out if there is at least one element like that in the big array. If you need to figure the condition for every element, you can similarly either add a dummy loop exit or goto, skipping the rest of the processing for that element. Again, use schedule(dynamic) or schedule(guided).]
As a separate point, you might also want to check if it may be a good idea to go through the innermost loop by some larger step (depending on float size), compute a vector of logicals on each iteration and then aggregate the results, e.g. something like if(count(somecondition(x(o1:o1+step,n1,k)))>0); in this case the compiler may be able to vectorize somecondition.
I believe you can do what you want with the task construct introduced in OpenMP 3; Intel Fortran supports tasking in OpenMP. I don't use tasks often so I won't offer you any wonky pseudocode.
You already mentioned the obvious way to stop all threads as soon as any thread finds the ending condition: have each check some shared variable which gives the status of the ending condition, thereby determining whether to break out of the loops. Obviously this is an overhead, so if you decide to take this approach I would suggest a few things:
Use atomics to check the ending condition; this avoids expensive memory flushing, as just the variable in question is flushed. Move to OpenMP 3.1, which supports some new atomic operations.
Check infrequently, maybe like once per outer iteration. You should only be parallelizing large cases to overcome the overhead of multithreading.
This one is optional, but you can try adding compiler hints, e.g. if you expect a certain condition to be false most of the time, the compiler will optimize the code accordingly.
Another (somewhat dirty) approach is to use shared variables for the loop ranges for each thread, maybe use a shared array where index n is for thread n. When one thread finds the ending condition, it changes the loop ranges of all the other threads so that they stop. You'll need the appropriate memory synchronization. Basically the overhead has now moved from checking a dummy variable to synchronizing/checking loop conditions. Again probably not so good to do this frequently, so maybe use shared outer loop variables and private inner loop variables.
On another note, this reminds me of the classic polling versus interrupt problem. Unfortunately I don't think OpenMP supports interrupts where you can send some kind of kill signal to each thread.
There are hacky workarounds, like using a child process for just this parallel work and invoking the operating system scheduler to emulate interrupts; however, this is rather tricky to get correct and would make your code extremely unportable.
Update in response to comment:
Try something like this:
char shared_var = 0;
#pragma omp parallel
{
    char private_var;   //per-thread copy of the done marker
    int n1, o1;         //loop counters must be private to each thread
    //you should have some method for setting loop ranges for each thread
    for (n1=nlo; n1<=nhi; n1++) {
        for (o1=olo; o1<=ohi; o1++) {
            if (somecondition(n1,o1)!=0) {
                #pragma omp atomic write
                shared_var = 1; //done marker, this will also trigger the other break below
                break; //could instead use goto to break out of both loops in 1 go
            }
        }
        #pragma omp atomic read
        private_var = shared_var;
        if (private_var!=0) break;
    }
}
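The same polling pattern, with a shared done marker checked once per outer iteration, can be sketched with standard-library threads in Python. This is an analogue for illustration, not OpenMP: threading.Event plays the role of the atomically read and written shared_var, and somecondition, search and the per-thread ranges are hypothetical.

```python
import threading


def somecondition(n1, o1):
    # Hypothetical stand-in for the real test.
    return (n1, o1) == (42, 17)


def search(nlo, nhi, olo, ohi, done):
    for n1 in range(nlo, nhi + 1):
        if done.is_set():        # poll the shared marker once per outer iteration
            return
        for o1 in range(olo, ohi + 1):
            if somecondition(n1, o1):
                done.set()       # done marker; other threads see it at their next poll
                return


done = threading.Event()
ranges = [(0, 24), (25, 49), (50, 74), (75, 99)]  # outer-loop range per thread
threads = [threading.Thread(target=search, args=(lo, hi, 0, 99, done))
           for lo, hi in ranges]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(done.is_set())  # True
```

Polling only in the outer loop keeps the checking overhead small, at the cost of a thread finishing its current inner loop before it notices the marker.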
A suitable parallel approach might be to let each worker examine a part of the overall problem, exactly as in the serial case, and use a local (non-shared) variable for the result (retval). Finally, do a reduction over all workers on these local variables into a shared overall result.
Consider something like...
for (int i = 0; i < test.size(); ++i) {
    test[i].foo();
    test[i].bar();
}
Now consider..
for (int i = 0; i < test.size(); ++i) {
    test[i].foo();
}
for (int i = 0; i < test.size(); ++i) {
    test[i].bar();
}
Is there a large difference in time spent between these two? I.e. what is the cost of the actual iteration? It seems like the only real operations you are repeating are an increment and a comparison (though I suppose this would become significant for a very large n). Am I missing something?
First, as noted above, if your compiler can't optimize the size() method out so it's just called once, or is nothing more than a single read (no function call overhead), then it will hurt.
There is a second effect you may want to be concerned with, though. If your container size is large enough, then the first case will perform faster. This is because, when it gets to test[i].bar(), test[i] will be cached. The second case, with split loops, will thrash the cache, since test[i] will always need to be reloaded from main memory for each function.
Worse, if your container (std::vector, I'm guessing) has so many items that it won't all fit in memory, and some of it has to live in swap on your disk, then the difference will be huge as you have to load things in from disk twice.
However, there is one final thing that you have to consider: all this only makes a difference if there is no order dependency between the function calls (really, between different objects in the container). Because, if you work it out, the first case does:
test[0].foo();
test[0].bar();
test[1].foo();
test[1].bar();
test[2].foo();
test[2].bar();
// ...
test[test.size()-1].foo();
test[test.size()-1].bar();
while the second does:
test[0].foo();
test[1].foo();
test[2].foo();
// ...
test[test.size()-1].foo();
test[0].bar();
test[1].bar();
test[2].bar();
// ...
test[test.size()-1].bar();
So if your bar() assumes that all foo()'s have run, you will break it if you change the second case to the first. Likewise, if bar() assumes that foo() has not been run on later objects, then moving from the first case to the second will break your code.
So be careful and document what you do.
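A tiny Python model makes the order dependency visible. Here foo(i) is modelled as marking item i ready, and bar(i) as reading how many items are ready at the moment it runs; both are invented stand-ins, not the real methods.

```python
def fused(n):
    ready = [False] * n
    seen = []
    for i in range(n):
        ready[i] = True          # foo(i)
        seen.append(sum(ready))  # bar(i): depends on which foo calls have run
    return seen


def split(n):
    ready = [False] * n
    seen = []
    for i in range(n):
        ready[i] = True          # every foo first
    for i in range(n):
        seen.append(sum(ready))  # then every bar
    return seen


print(fused(3))  # [1, 2, 3]
print(split(3))  # [3, 3, 3]
```

The two loop structures agree only when bar does not look at state written by foo on other elements.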
There are many aspects in such comparison.
First, the complexity of both options is O(n), so the difference isn't very big anyway. You need not care about it if you are writing a fairly big and complex program with a large n and "heavy" operations .foo() and .bar(). You only need to care about it for very small, simple programs (the kind of programs written for embedded devices, for example).
Second, it will depend on the programming language and compiler. I'd expect that some C++ compilers, for instance, can optimize your second option to produce the same code as the first one when they can see that doing so is safe.
Third, if the compiler hasn't optimized your code, the performance difference will depend heavily on the target processor. Consider the loop in terms of assembly instructions; it will look something like this (pseudo assembly language):
LABEL L1:
do this ;; some commands
call that
IF condition
goto L1
;; some more instructions, ELSE part
That is, every loop pass is just an IF statement. But modern processors don't like IFs. This is because processors may reorder instructions to execute them ahead of time or simply to avoid idling. With IF instructions (in fact, conditional gotos or jumps), the processor does not know whether it may reorder operations or not.
There's also a mechanism called branch predictor. From material of Wikipedia:
branch predictor is a digital circuit that tries to guess which way a branch (e.g. an if-then-else structure) will go before this is known for sure.
This softens the effect of IFs, though if the predictor's guess is wrong, that benefit is lost.
So you can see that there is a large number of conditions for both of your options: the target language and compiler, the target machine, its processor and branch predictor. All this makes for a very complex system, and you cannot foresee exactly what result you will get. I believe that if you don't deal with embedded systems or something like that, the best solution is just to use the form which you are more comfortable with.
For your examples you have the additional concern of how expensive .size() is, since in most languages it is evaluated for the comparison each time i increments.
How expensive is it? Well that depends, it's certainly all relative. If .foo() and .bar() are expensive, the cost of the actual iteration is probably minuscule in comparison. If they're pretty lightweight, then it'll be a larger percentage of your execution time. If you want to know about a particular case, test it; this is the only way to be sure about your specific scenario.
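A measurement along those lines might look like the following Python sketch using the standard timeit module. The Thing class and its trivial foo and bar are hypothetical, and the absolute numbers from an interpreter like CPython will not transfer to compiled C++; the point is only the shape of the experiment, which should be repeated in the real language on the real data.

```python
import timeit


class Thing:
    """Hypothetical element type with two lightweight methods."""
    def __init__(self):
        self.a = 0
        self.b = 0

    def foo(self):
        self.a += 1

    def bar(self):
        self.b += 1


def one_loop(test):
    for i in range(len(test)):
        test[i].foo()
        test[i].bar()


def two_loops(test):
    for i in range(len(test)):
        test[i].foo()
    for i in range(len(test)):
        test[i].bar()


test = [Thing() for _ in range(10_000)]
t1 = timeit.timeit(lambda: one_loop(test), number=20)
t2 = timeit.timeit(lambda: two_loops(test), number=20)
print(f"one loop: {t1:.4f}s, two loops: {t2:.4f}s")
```

Running it several times and comparing the spread is more informative than a single pair of numbers.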
Personally, I'd go with the single iteration to be on the cheap side for sure (unless you need the .foo() calls to happen before the .bar() calls).
I assume .size() will be constant. Otherwise, the first code example might not give the same result as the second one.
Most compilers would probably store .size() in a variable before the loop starts, so the .size() time will be cut down.
Therefore the time spent inside the bodies of the for loops will be the same in both cases, but the loop overhead will be doubled with two loops.
Performance tag, right.
As long as you are concentrating on the "cost" of this or that minor code segment, you are oblivious to the bigger picture (isolation); and your intention is to justify something that, at a higher level (outside your isolated context), is simply bad practice, and breaks guidelines. The question is too low level and therefore too isolated. A system or program which is a set of integrated components will perform much better than a collection of isolated components.
The fact that this or that isolated component (work inside the loop) is fast or faster is irrelevant when the loop itself is repeated unnecessarily, and which would therefore take twice the time.
Given that you have one family car (CPU), why on Earth would you:
sit at home and send your wife out to do her shopping
wait until she returns
take the car, go out and do your shopping
leaving her to wait until you return
If it needs to be stated, you would spend (a) almost half of your hard-earned resources executing one trip and shopping at the same time and (b) have those resources available to have fun together when you get home.
It has nothing to do with the price of petrol at 9:00 on a Saturday, or the time it takes to grind coffee at the café, or cost of each iteration.
Yes, there is a large diff in the time and the resources used. But the cost is not merely in the overhead per iteration; it is in the overall cost of the one organised trip vs the two serial trips.
Performance is about architecture: never doing anything twice (that you can do once). That belongs to the higher levels of organisation: the integration of the parts that make up the whole. It is not about counting pennies at the bowser or cycles per iteration; those are lower orders of organisation, which are just a collection of fragmented parts (not a systemic whole).
Maseratis cannot get through traffic jams any faster than station wagons.
I currently have an application which can contain 100s of user defined formulae. Currently, I use reverse polish notation to perform the calculations (pushing values and variables on to a stack, then popping them off the stack and evaluating). What would be the best way to start parallelizing this process? Should I be looking at a functional language?
The calculations are performed on arrays of numbers, so for example a simple A+B could actually mean 100s of additions. I'm currently using Delphi, but this is not a requirement going forward. I'll use the tool most suited to the job. Formulae may also be dependent on each other, so we may have one formula C=A+B and a second one D=C+A, for example.
Let's assume your formulae (equations) are not cyclic, as otherwise you cannot "just" evaluate them. If you have vectorized equations like A = B + C where A, B and C are arrays, let's conceptually split them into equations on the components, so that if the array size is 5, this equation is split into
a1 = b1 + c1
a2 = b2 + c2
...
a5 = b5 + c5
Now assuming this, you have a large set of equations on simple quantities (whether integer, rational or something else).
If you have two equations E and F, let's say that F depends_on E if the right-hand side of F mentions the left-hand side of E, for example
E: a = b + c
F: q = 2*a + y
Now to get towards how to calculate this, you could always use randomized iteration to solve this (this is just an intermediate step in the explanation), following this algorithm:
1 while (there is at least one equation which has not been computed yet)
2     select one such pending equation E so that:
3         for every equation D such that E depends_on D:
4             D has been already computed
5     calculate the left-hand side of E
This process terminates with the correct answer regardless of how you make your selections on line 2. Now the cool thing is that it also parallelizes easily. You can run it in an arbitrary number of threads! What you need is a concurrency-safe queue which holds those equations whose prerequisites (those the equations depend on) have been computed but which have not been computed themselves yet. Every thread pops out (thread-safely) one equation from this queue at a time, calculates the answer, and then checks whether there are now new equations all of whose prerequisites have been computed, and then adds those equations (thread-safely) to the work queue. Done.
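The queue-based scheme described above could be sketched as follows in Python. This is a minimal illustration under several assumptions: the equations are acyclic, each is given as a list of prerequisite names plus a function of the already computed values, and the example set reuses C = A + B and D = C + A with made-up constants for A and B. The evaluate function and its bookkeeping names are hypothetical.

```python
import queue
import threading

# Hypothetical equations: name -> (prerequisite names, function of computed values)
equations = {
    "A": ([], lambda v: 2),
    "B": ([], lambda v: 3),
    "C": (["A", "B"], lambda v: v["A"] + v["B"]),
    "D": (["C", "A"], lambda v: v["C"] + v["A"]),
}


def evaluate(equations, num_threads=4):
    if not equations:
        return {}
    values = {}
    lock = threading.Lock()
    ready = queue.Queue()  # thread-safe queue of equations whose prerequisites are done
    pending = {name: len(deps) for name, (deps, _) in equations.items()}
    dependents = {name: [] for name in equations}
    for name, (deps, _) in equations.items():
        for dep in deps:
            dependents[dep].append(name)
    for name, count in pending.items():
        if count == 0:
            ready.put(name)
    remaining = [len(equations)]

    def worker():
        while True:
            name = ready.get()
            if name is None:       # sentinel: everything has been computed
                return
            result = equations[name][1](values)
            with lock:
                values[name] = result
                remaining[0] -= 1
                for child in dependents[name]:
                    pending[child] -= 1
                    if pending[child] == 0:
                        ready.put(child)   # all prerequisites now computed
                if remaining[0] == 0:
                    for _ in range(num_threads):
                        ready.put(None)    # wake and stop every worker

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return values


values = evaluate(equations)
print(values["C"], values["D"])  # 5 7
```

With a cyclic set of equations the workers in this sketch would wait forever, which matches the assumption stated at the start of the answer.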
Without knowing more, I would suggest taking a SIMD-style approach if possible. That is, create threads to compute all formulas for a single data set. Trying to divide the computation of the formulas themselves to parallelise them wouldn't yield much speed improvement: the logic required to split the computations into discrete units suitable for threading would be hard to write and harder to get right, and the overhead would cancel out any speed gains. It would also suffer quickly from diminishing returns.
Now, if you've got a set of formulas that are applied to many sets of data then the parallelisation becomes easier and would scale better. Each thread does all computations for one set of data. Create one thread per CPU core and set its affinity to each core. Each thread instantiates one instance of the formula evaluation code. Create a supervisor which loads a single data set and passes it to an idle thread. If no threads are idle, wait for the first thread to finish processing its data. When all data sets are processed and all threads have finished, then exit. Using this method, there's no advantage to having more threads than there are cores on the CPU as thread switching is slow and will have a negative effect on overall speed.
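A rough Python sketch of the "each worker handles one whole data set" arrangement is shown below. The pool stands in for the supervisor handing data sets to idle threads. The RPN evaluator eval_rpn, the formulas list (C = A + B, D = C + A, already in dependency order) and the sample data sets are all hypothetical, and no thread affinity is set. Note that CPython threads do not run CPU-bound code in parallel, so a process pool or a compiled language would be needed for a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor
import operator
import os

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}


def eval_rpn(tokens, variables):
    """Evaluate one formula given in reverse Polish notation."""
    stack = []
    for tok in tokens:
        if tok in OPS:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[tok](left, right))
        elif tok in variables:
            stack.append(variables[tok])
        else:
            stack.append(float(tok))  # numeric literal
    return stack.pop()


# Hypothetical formulas, listed in dependency order.
formulas = [("C", ["A", "B", "+"]), ("D", ["C", "A", "+"])]


def process(data_set):
    """Run every formula for a single data set; no sharing between workers."""
    variables = dict(data_set)
    for name, tokens in formulas:
        variables[name] = eval_rpn(tokens, variables)
    return variables


data_sets = [{"A": 1.0, "B": 2.0}, {"A": 10.0, "B": 20.0}, {"A": -1.0, "B": 1.0}]

with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
    results = list(pool.map(process, data_sets))

print([r["D"] for r in results])  # [4.0, 40.0, -1.0]
```

Because each worker owns its data set, the formulas never need cross-thread synchronization.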
If you've only got one data set then it is not a trivial task. It would require parsing the evaluation tree for branches without dependencies on other branches and farming those branches to separate threads running on each core and waiting for the results. You then get problems synchronizing the data and ensuring data coherency.