Why doesn't the OpenMP atomic directive support assignment?

The atomic directive in OpenMP supports statements like
x += expr
x *= expr
where expr is an expression of scalar type that does not reference x. I get that, but I don't get why you can't do:
#pragma omp atomic
x = y;
Is this somehow more taxing instruction-wise for the CPU? It seems to me that both the legal and the illegal statements load the value of x and some other scalar value, change the register value of x, and write it back. If anyone could explain how these instructions are (I assume) fundamentally different, I would be very grateful.

Because the suggested atomic assignment does not protect against anything.
Remember that an atomic instruction can be thought of as a critical section that could be (but does not have to be) efficiently implemented by the compiler using magic hardware. Think about two threads reaching x = y with shared x and private y. After all the threads finish, the last thread to execute "wins" and sets x to its y. Wrap the assignment in a critical section and nothing changes: the last thread still "wins". Now, if the threads do something else with x afterwards, the slowest thread may not have caught up, and even if it has, the compiler could legitimately choose to use some cached value for x (i.e. the thread's local y). To avoid this, you would need a barrier (so the winning thread has won) and its implied flush (so the local cache has been invalidated):
x = y;
#pragma omp barrier
// do something with shared x...
but I cannot think of a good reason to do this. Why do all the work to find y on many threads if most of them will be (non-deterministically) thrown away?
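To make the "last writer wins" point concrete, here is a minimal OpenMP sketch (illustrative only; the variable names are made up): even a full critical section around the assignment leaves the final value of x non-deterministic, and only the barrier (with its implied flush) guarantees every thread afterwards sees the winning value.
#include <omp.h>
#include <stdio.h>

int main(void)
{
    int x = 0;
    #pragma omp parallel
    {
        int y = omp_get_thread_num() + 1;  // private, per-thread value
        #pragma omp critical                // even a full critical section...
        x = y;                              // ...still means "last writer wins"
        #pragma omp barrier                 // winner decided, caches flushed
        // do something with shared x here...
    }
    printf("x = %d (whichever thread wrote last)\n", x);
    return 0;
}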

Related

Is reusing variables bad for instruction-level parallelism and OoO execution?

I'm studying processors, and one thing that caught my attention is the fact that high-performance CPUs can execute more than one instruction per clock cycle, and even execute them out of order, in order to improve performance. All this without any help from the compiler.
As far as I understand, processors are able to do that by analysing data dependencies to determine which instructions can be run first or in the same ILP step (issue).
Edit: I'll try giving an example. Imagine these two pieces of code:
int myResult;
myResult = myFunc1(); // 1
myResult = myFunc2(); // 2
j = myResult + 3; // 3
versus:
int myFirstResult, mySecondResult;
myFirstResult = myFunc1(); // 1
mySecondResult = myFunc2(); // 2
j = mySecondResult + 3; // 3
They both do the same thing; the difference is that in the first I reuse my variable and in the second I don't.
I assume (and please correct me if I'm wrong) that the processor could run instructions 2 and 3 before instruction 1 in the second example, because the data would be stored in two different places (registers?).
The same would not be possible in the first example, because if it ran instructions 2 and 3 before instruction 1, the value assigned in instruction 1 would be the one kept in memory (instead of the value from instruction 2).
The question is:
Is there any strategy to run instructions 2 and 3 before 1 if I reuse the variable (like in the first example)?
Or does reusing variables prevent instruction-level parallelism and OoO execution?
A modern microprocessor is an extremely sophisticated piece of equipment, complex enough that understanding every single aspect of how it functions is beyond the reach of most people. There's an additional layer introduced by your compiler or runtime which increases the complexity. It's only really possible to speak in generalities here, as ARM processor X might handle this differently than ARM processor Y, and both of those differently from Intel U or AMD V.
Looking more closely at your code:
int myResult;
myResult = myFunc1(); // 1
myResult = myFunc2(); // 2
j = myResult + 3; // 3
The int myResult line doesn't necessarily do anything CPU-wise. It just tells the compiler that there will be a variable named myResult of type int. It's not initialized, so there's no need to do anything yet.
On the first assignment the value is never used. By default the compiler usually does a pretty straightforward conversion of your code to machine instructions, but when you turn on optimization, which you normally do for production code, that assumption goes out the window. A good compiler will recognize that this value is never used and will omit the assignment. A better compiler will warn you that the value is never used.
The second one actually assigns to the variable and that variable is later used. Obviously before the third assignment can happen the second assignment must be completed. There's not much optimizing that can go on here unless those functions are trivial and end up inlined. Then it's a matter of what those functions do.
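As a sketch of what that dead-store elimination amounts to (illustrative only; real compilers do this on an intermediate representation, not on your source), the first snippet effectively becomes:
int myResult;
// myResult = myFunc1(); -- dead store: the value is overwritten before
// any use, so the optimizer drops the assignment (the call itself is
// only removed if myFunc1 is known to have no side effects)
myResult = myFunc2(); // 2
j = myResult + 3;     // 3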
A "superscalar" processsor, or one capable of running things out-of-order, has limitations on how ambitious it can get. The type of code it works best with resembles the following:
int a = 1;
int b = f();
int c = a * 2;
int d = a + 2;
int e = g(b);
The assignment of a is straightforward and immediate. b is a computed value. Where it gets interesting is that c and d have the same dependency and can actually execute in parallel. They also don't depend on b so theoretically they could run before, during, or after the f() call so long as the end-state is correct.
A single thread can execute multiple operations concurrently, but most processors have limits on the types and number of them. For example, a floating-point multiply and an integer add could happen, or two integer adds, but not two floating point multiply ops. It depends on what operations the CPU has, what registers those can operate on, and how the compiler has arranged the data in advance.
If you're looking to optimize code and shave nanoseconds off of things you'll need to find a really good technical manual on the CPU(s) you're targeting, plus spend untold hours trying different approaches and benchmarking things.
The short answer is variables don't matter. It's all about dependencies, your compiler, and what capabilities your CPU has.

Parallel programming dependency (OpenACC)

I am trying to parallelize these loops, but I get an error from the PGI compiler and I don't understand what's wrong:
#pragma acc kernels
{
    #pragma acc loop independent
    for (i = 0; i < k; i++)
    {
        for (; dt*j <= Ms[i+1].t; j++)
        {
            w = (j*dt - Ms[i].t)/(Ms[i+1].t - Ms[i].t);
            X[j] = Ms[i].x*(1-w) + Ms[i+1].x*w;
            Y[j] = Ms[i].y*(1-w) + Ms[i+1].y*w;
        }
    }
}
Error
85, Generating Multicore code
87, #pragma acc loop gang
89, Accelerator restriction: size of the GPU copy of Y,X is unknown
Complex loop carried dependence of Ms->t,Ms->x,X->,Ms->y,Y-> prevents parallelization
Loop carried reuse of Y->,X-> prevents parallelization
So what i can do to solve this dependence problem?
I see a few issues here. Also, given the output, I'm assuming that you're compiling with "-ta=multicore,tesla" (i.e. targeting both a multicore CPU and a GPU).
First, since "j" is not initialized in the "i" loop, the starting value of "j" will depend on the ending value of "j" from the previous iteration of "i". Hence, the loops are not parallelizable. By using "loop independent", you have forced parallelization on the outer loop, but you will get answers that differ from running the code sequentially. You will need to rethink your algorithm.
I would suggest making X and Y two dimensional, with the first dimension of size "k". The second dimension can be a jagged array (i.e. each row having a differing size) with the size corresponding to the "Ms[i+1].t" value.
I wrote an example of using jagged arrays as part of my Chapter (#5) of the Parallel Programming with OpenACC book. See: https://github.com/rmfarber/ParallelProgrammingWithOpenACC/blob/master/Chapter05/jagged_array.c
Alternatively, you might be able to set "j=Ms[i].t" assuming "Ms[0].t" is set.
for(j=Ms[i].t;dt*j <= Ms[i+1].t;j++)
"Accelerator restriction: size of the GPU copy of Y,X is unknown"
This is telling you that the compiler cannot implicitly copy the X and Y arrays to the device. In C/C++, unbounded pointers don't carry size information, so the compiler can't tell how big these arrays are. Often it can derive this information from the loop trip counts, but since the loop trip count is unknown (see above), it can't in this case. To fix this, you need to add a data clause to the "kernels" directive or add a data region to your code. For example:
#pragma acc kernels copyout(X[0:size], Y[0:size])
or
#pragma acc data copyout(X[0:size], Y[0:size])
{
    ...
    #pragma acc kernels
    ...
}
Another thing to keep in mind is pointer aliasing. In C/C++, pointers of the same type are allowed to point at the same object. Hence, without additional information such as the "restrict" attribute, the "independent" clause, or the PGI compiler flag "-Msafeptr", the compiler must assume your pointers do point to the same object, making the loop not parallelizable.
This would most likely go away by either adding "loop independent" to the inner loop as well, or by using the "collapse" clause to flatten the loop nest, applying "independent" to both. It might also go away if all of your arrays are passed in using "restrict", but maybe not.
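Putting those suggestions together, here is a hedged sketch of what the fixed loop nest might look like (the struct layout, function signature, and the "size" bound are made-up placeholders, and it assumes the algorithm has been reworked so each "i" iteration computes its own starting "j", per the suggestion above):
typedef struct { double t, x, y; } Sample;  /* hypothetical layout for Ms */

void interpolate(const Sample *restrict Ms, int k, double dt,
                 double *restrict X, double *restrict Y, int size)
{
    #pragma acc data copyin(Ms[0:k+1]) copyout(X[0:size], Y[0:size])
    {
        #pragma acc kernels
        #pragma acc loop independent
        for (int i = 0; i < k; i++)
        {
            /* per-iteration start for j, per the "j = Ms[i].t" idea above */
            #pragma acc loop independent
            for (int j = (int)Ms[i].t; dt*j <= Ms[i+1].t; j++)
            {
                double w = (j*dt - Ms[i].t)/(Ms[i+1].t - Ms[i].t);
                X[j] = Ms[i].x*(1-w) + Ms[i+1].x*w;
                Y[j] = Ms[i].y*(1-w) + Ms[i+1].y*w;
            }
        }
    }
}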

Is there ever a point to swap two variables without using a third?

I know not to use them, but there are techniques to swap two variables without using a third, such as
x ^= y;
y ^= x;
x ^= y;
and
x = x + y
y = x - y
x = x - y
In class the prof mentioned that these were popular 20 years ago when memory was very limited, and that they are still used in high-performance applications today. Is this true? My understanding of why it's pointless to use such techniques is that:
Using a third variable can never be the bottleneck.
The optimizer does this anyway.
So is there ever a good time to not swap with a third variable? Is it ever faster?
Compared to each other, is the method that uses XOR faster than the method that uses +/-? Most architectures have a unit for addition/subtraction and for XOR, so wouldn't that mean they are all the same speed? Or does having a unit for an operation not imply that all such operations run at the same speed?
These techniques are still important to know for the programmers who write the firmware of your average washing machine and the like. Lots of that kind of hardware still runs on Z80 CPUs or similar, often with no more than 4K of memory. Outside of that scene, knowing this kind of algorithmic "trickery" has, as you say, next to no real practical use.
(I do want to remark though that nonetheless, the programmers who remember and know this kind of stuff often turn out to be better programmers even for "regular" applications than their "peers" who won't bother. Precisely because the latter often take that attitude of "memory is big enough anyway" too far.)
There's no point to it at all. It is an attempt to demonstrate cleverness. Considering that it doesn't work in many cases (floating point, pointers, structs), is unreadable, and uses three dependent operations, which will be much slower than just exchanging the values, it's absolutely pointless and demonstrates a failure to actually be clever.
You are right: if it were faster, then optimising compilers would detect the pattern when two numbers are exchanged and replace it; that's easy enough to do. Compilers do actually notice when you exchange two variables and may produce no code at all, simply using the variables in swapped roles afterwards. For example, if you exchange x and y and then write a += x; b += y;, the compiler may just change this to a += y; b += x;. The XOR or add/subtract pattern, on the other hand, will not be recognised, because it is so rare, and so it won't get improved.
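As a small illustration of that point (a sketch; use() and demo() are made-up names): with the plain temp-variable swap, a typical optimizing compiler emits no exchange code at all and simply uses the values in swapped roles.
void use(int a, int b);  /* hypothetical external function */

void demo(int x, int y)
{
    int t = x;  /* ordinary three-assignment swap */
    x = y;
    y = t;
    use(x, y);  /* typically compiled as just passing the original y, x */
}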
Yes, there is, especially in assembly code.
Processors have only a limited number of registers. When the registers are pretty full, this trick can avoid spilling a register to another memory location (possibly in an unfetched cache line).
I've actually used the three-way XOR to swap a register with a memory location in the critical path of high-performance hand-coded lock routines for x86, where the register pressure was high and there was no (lock-safe!) place to put the temp. (On x86 it is useful to know that the XCHG instruction to memory has a high cost associated with it, because it includes its own lock, whose effect I did not want. Given that the x86 has the LOCK prefix opcode, this was really unnecessary, but historical mistakes are just that.)
Moral: every solution, no matter how ugly it looks in isolation, likely has some uses. It's good to know them; you can always choose not to use them if inappropriate. And where they are useful, they can be very effective.
Such a construct can be useful on many members of the PIC series of microcontrollers, which require that almost all operations go through a single accumulator (the "working register"). [Note that while this can sometimes be a hindrance, the fact that each instruction only needs to encode one register address and a destination bit, rather than two register addresses, makes it possible for the PIC to have a much larger working set than other microcontrollers.]
If the working register holds a value and it's necessary to swap its contents with those of RAM, the alternative to:
xorwf other,w ; w=(w ^ other)
xorwf other,f ; other=(w ^ other)
xorwf other,w ; w=(w ^ other)
would be
movwf temp1 ; temp1 = w
movf other,w ; w = other
movwf temp2 ; temp2 = w
movf temp1,w ; w = temp1 [old w]
movwf other ; other = w
movf temp2,w ; w = temp2 [old other]
Three instructions and no extra storage, versus six instructions and two extra registers.
Incidentally, another trick, which can be helpful in cases where one wishes to make another register hold the maximum of its present value or W, and the value of W will not be needed afterward, is:
subwf other,w ; w = other-w
btfss STATUS,C ; Skip next instruction if carry set (other >= W)
subwf other,f ; other = other-w [i.e. other-(other-oldW), i.e. old W]
I'm not sure how many other processors have a subtract instruction but no non-destructive compare, but on such processors that trick can be a good one to know.
These tricks are not very likely to be useful if you want to exchange two whole words in memory or two whole registers. Still, you could take advantage of them if you have no free registers (or only one free register for a memory-to-memory swap) and there is no "exchange" instruction available (as when swapping two SSE registers in x86), or the "exchange" instruction is too expensive (like the register-memory XCHG in x86), and it is not possible to avoid the exchange or lower the register pressure.
But if your variables are two bitfields in a single word, a modification of the 3-XOR approach may be a good idea:
y = (x ^ (x >> d)) & mask
x = x ^ y ^ (y << d)
This snippet is from Knuth's "The Art of Computer Programming", Vol. 4A, Sec. 7.1.3. Here y is just a temporary variable. Both bitfields to exchange are in x; mask selects one bitfield, and d is the distance between the bitfields.
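For concreteness, here is a hedged, self-contained C example of the trick (the names and constants are made up): it exchanges the low nibble of a 32-bit word with the nibble 8 bits above it.
#include <stdint.h>
#include <stdio.h>

/* Delta swap: exchange the bitfield selected by mask with the
   bitfield d positions to its left, both inside x. */
static uint32_t delta_swap(uint32_t x, uint32_t mask, unsigned d)
{
    uint32_t y = (x ^ (x >> d)) & mask;  /* XOR difference of the two fields */
    return x ^ y ^ (y << d);             /* apply it to both fields */
}

int main(void)
{
    uint32_t x = 0x00000A0B;  /* 0xA at bits 8-11, 0xB at bits 0-3 */
    printf("%08X\n", delta_swap(x, 0x0000000Fu, 8));  /* prints 00000B0A */
    return 0;
}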
You could also use tricks like this in hardness proofs (to preserve planarity). See for example the crossover gadget from this slide (page 7), from recent lectures on "Algorithmic Lower Bounds" by Prof. Erik Demaine.
Of course it is still useful to know. What is the alternative?
c = a
a = b
b = c
three operations with three resources rather than three operations with two resources?
Sure, the instruction set may have an exchange, but that only comes into play if you are 1) writing assembly or 2) the optimizer figures this out as a swap and then encodes that instruction. You could also use inline assembly, but that is not portable and a pain to maintain; and if you called an asm function, the compiler would have to set up for the call, burning a bunch more resources and instructions. Although it can be done, you are not likely to actually exploit the instruction set's feature unless the language has a swap operation.
Now, the average programmer doesn't NEED to know this any more now than back in the day; folks will bash this kind of premature optimization, and unless you know the trick and use it often, the code is not obvious if it isn't documented, so it is bad programming because it is unreadable and unmaintainable.
It is still a valuable programming education and exercise, for example, to have someone invent a test to prove that it actually swaps for all combinations of bit patterns. And, just like doing an xor reg,reg on an x86 to zero a register, it gives a small but real performance boost in highly optimized code.

What is the fastest way to copy a variable into another?

Let's say I have 2 variables:
x = 1
y = 2
The end result should be:
x = 2
y = 1
I thought about the following ways to do so:
temp = x // clone x
x = y
y = temp
or (XOR swap)
x = x XOR y
y = x XOR y
x = y XOR x
I'd like to get an answer regarding low-level memory etc...
What is the fastest way to do so?
Note:
As a bonus: hypothetically, with no side effects (of the code or the CPU), which is the fastest, or are there any other, faster ones?
The problem is that modern CPU architectures will not let you get this answer. They will hide many effects and will expose many very subtle effects.
If you have the values in CPU registers and you have a spare register, then the temp way is either the fastest way, or the way which consumes the least power.
Using the XOR or the +/- method (very neat, by the way!) is for situations where you cannot afford an extra location (an extra memory variable or an extra register). This might seem strange, but inside a C preprocessor macro, for example, one cannot (easily) declare new variables.
When the variables are in memory all variants are very likely to behave the same on any high performance CPU. Even if the compiler does not optimize the code, the CPU will avoid virtually all memory accesses and make them as fast as register accesses.
In total I am inclined to say: don't worry about the speed of this. It is unimportant to optimize at this level. Try to avoid the swap altogether; that will be the fastest!
http://en.wikipedia.org/wiki/XOR_swap_algorithm
Most modern compilers can optimize away the temporary variable in the naive swap, in which case the naive swap uses the same amount of memory and the same number of registers as the XOR swap and is at least as fast, and often faster. The XOR swap is also much less readable and completely opaque to anyone unfamiliar with the technique. On modern CPU architectures, the XOR technique is considerably slower than using a temporary variable to do swapping. One reason is that modern CPUs strive to execute instructions in parallel via instruction pipelines. In the XOR technique, the inputs to each operation depend on the results of the previous operation, so they must be executed in strictly sequential order.
Also see this question:
How fast is std::swap for integer types?
It's important to note that the XOR swap requires that you first check that the two variables do not reference the same memory location. If they did, you would end up setting it to zero.
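A minimal C sketch of that pitfall (xor_swap is a made-up helper): if both pointers refer to the same object, the first XOR zeroes it and the value is lost.
#include <stdio.h>

/* XOR swap through pointers; breaks when a and b alias. */
static void xor_swap(int *a, int *b)
{
    *a ^= *b;  /* if a == b, this zeroes the object... */
    *b ^= *a;  /* ...and the remaining steps keep it zero */
    *a ^= *b;
}

int main(void)
{
    int x = 42;
    xor_swap(&x, &x);
    printf("%d\n", x);  /* prints 0, not 42 */
    return 0;
}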
The XOR swap isn't always the most efficient, since most modern CPU architectures try to parallelize instructions, but in the XOR swap each line depends on the previous result (not parallelizable). For the temp-variable swap, most compilers will optimize the temporary variable away, which ends up with the naive way running as fast or faster, as well as using the same amount of memory.
Another swap alternative is:
x = x + y
y = x - y
x = x - y
Similarly, the arguments about efficiency and speed made for the XOR swap apply here too.
EDIT: as hatchet said, the (+/-) approach can also cause overflow if not done carefully.

Can Ruby threads not collide on a write?

From past work in C# and Java, I am accustomed to a statement such as this not being thread-safe:
x += y;
However, I have not been able to observe any collision among threads when running the above code in parallel with Ruby.
I have read that Ruby automatically prevents multiple threads from writing to the same data concurrently. Is this true? Is the += operator therefore thread-safe in Ruby?
Well, it depends on your implementation and a lot of things. In MRI, there is such a thing as the GVL (Giant VM Lock), which controls which thread is actually executing code at any given time. You see, in MRI, only one thread can execute Ruby code at a time. So while the C libraries underneath can let another thread run while they use the CPU in C code to multiply giant numbers, the Ruby code itself can't execute at the same time. That means a statement such as the assignment might not run at the same time as another one of the assignments (though the additions may run in parallel). The other thing that could be happening is this: I think I heard that assignments to ints are atomic on Linux, so if you're on Linux, that might be a factor too.
x += 1
is equivalent in every way to
x = x + 1
(if you re-define +, you also automatically redefine the result of +=)
In this notation it's clearer that this is not an atomic operation, and it is therefore not guaranteed to be thread-safe.
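The same read-modify-write race the asker remembers from C# and Java is easy to demonstrate outside Ruby. Here is a minimal C sketch (assuming POSIX threads; the names are made up) in which the load/add/store steps of x += 1 interleave and updates get lost:
#include <pthread.h>
#include <stdio.h>

static long x = 0;  /* shared, deliberately unsynchronized */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        x += 1;  /* read-modify-write: load x, add 1, store back */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("x = %ld (expected 2000000)\n", x);  /* typically prints less */
    return 0;
}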
