StoreStore reordering happens when compiling C++ for x86 - c++11

while (true) {
    int x(0), y(0);
    std::thread t0([&x, &y]() {
        x = 1;
        y = 3;
    });
    std::thread t1([&x, &y]() {
        std::cout << "(" << y << ", " << x << ")" << std::endl;
    });
    t0.join();
    t1.join();
}
Firstly, I know that it is UB because of the data race.
But, I expected only the following outputs:
(3,1), (0,1), (0,0)
I was convinced that it was not possible to get (3,0), but I did. So I am confused: after all, x86 doesn't allow StoreStore reordering.
So x = 1 should be globally visible before y = 3.
I suppose that, from a theoretical point of view, the output (3,0) is impossible because of the x86 memory model, and that it appeared because of the UB. But I am not sure. Please explain.
What else besides StoreStore reordering could explain getting (3,0)?

You're writing in C++, which has a weak memory model. You didn't do anything to prevent reordering at compile-time.
If you look at the asm, you'll probably find that the stores happen in the opposite order from the source, and/or that the loads happen in the opposite order from what you expect.
The loads don't have any ordering in the source: the compiler can load x before y if it wants to, even if they were std::atomic types:
t2 <- x(0)
t1 -> x(1)
t1 -> y(3)
t2 <- y(3)
This isn't even "re"ordering, since there was no defined order in the first place:
std::cout << "(" << y << ", " << x << ")" << std::endl; doesn't necessarily evaluate y before x. The << operator has left-to-right associativity, and operator overloading is just syntactic sugar for
op<<( op<<( op<<(cout, y), x ), endl ); // omitting the string constants.
Since the order of evaluation of function arguments is unspecified (even if we're talking about nested function calls), the compiler is free to evaluate x before evaluating op<<(cout, y). IIRC, gcc often just evaluates right to left, matching the order of pushing args onto the stack if necessary. The answers on the linked question indicate that that's often the case. But of course that behaviour is in no way guaranteed by anything.
The order they're loaded is unspecified even if they were std::atomic. I'm not sure if there's a sequence point between the evaluation of x and y. If not, then it would be the same as if you evaluated x+y: the compiler is free to evaluate the operands in any order because they're unsequenced. If there is a sequence point, then there is an order, but it's unspecified which order (i.e. they're indeterminately sequenced).
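A minimal sketch (a drop-in body for t1 in the question's code): reading the shared variables into named locals first pins down that order within this thread, because each statement is its own fully sequenced full-expression. This only removes the evaluation-order uncertainty; the data race and the memory-ordering problem remain.
std::thread t1([&x, &y]() {
    int ty = y;   // load y...
    int tx = x;   // ...then x, in this order
    std::cout << "(" << ty << ", " << tx << ")\n";
});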
Slightly related: gcc doesn't reorder non-inline function calls in expression evaluation, to take advantage of the fact that C leaves the order of evaluation unspecified. I assume after inlining it does optimize better, but in this case you haven't given it any reason to favour loading y before x.
How to do it correctly
The key point is that it doesn't matter exactly why the compiler decided to reorder, just that it's allowed to. If you don't impose all the necessary ordering requirements, your code is buggy, full-stop. It doesn't matter if it happens to work with some compilers with some specific surrounding code; that just means it's a latent bug.
See http://en.cppreference.com/w/cpp/atomic/atomic for docs on how/why this works:
// Safe version, which should compile to the asm you expected.
while (true) {
    int x(0); // should be atomic, too, because it can be read+written at the same time. You can use memory_order_relaxed, though.
    std::atomic<int> y(0);
    std::thread t0([&x, &y]() {
        x = 1;
        // std::atomic_thread_fence(std::memory_order_release); // A StoreStore fence is an alternative to using a release-store
        y.store(3, std::memory_order_release);
    });
    std::thread t1([&x, &y]() {
        int tx, ty;
        ty = y.load(std::memory_order_acquire);
        // std::atomic_thread_fence(std::memory_order_acquire); // A LoadLoad fence is an alternative to using an acquire-load
        tx = x;
        std::cout << ty + tx << "\n"; // Don't use endl, we don't need to force a buffer flush here.
    });
    t0.join();
    t1.join();
}
For Acquire/Release semantics to give you the ordering you want, the last store has to be the release-store, and the acquire-load has to be the first load. That's why I made y the std::atomic, even though it's x that you're setting to 0 or 1 more like a flag.
If you don't want to use release/acquire, you could put a StoreStore fence between the stores and a LoadLoad fence between the loads. On x86, this would just prevent compile-time reordering, but on ARM you'd get a memory-barrier instruction. (Note that y still technically needs to be atomic to obey C++'s data-race rules, but you can use std::memory_order_relaxed on it.)
Actually, even with Release/Acquire ordering for y, x should be atomic as well. The load of x still happens even if we see y==0. So reading x in thread 2 is not synchronized with writing y in thread 1, so it's UB. In practice, int loads/stores on x86 (and most other architectures) are atomic. But remember that std::atomic implies other semantics, like the fact that the value can be changed asynchronously by other threads.
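Putting those two points together, here is a minimal sketch (my own, not part of the original code) of the fence-based variant: both variables are atomics accessed with memory_order_relaxed, and explicit fences provide the StoreStore / LoadLoad ordering instead of release/acquire operations.
#include <atomic>
#include <iostream>
#include <thread>

int main() {
    std::atomic<int> x(0), y(0);
    std::thread t0([&] {
        x.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_release); // at least a StoreStore barrier
        y.store(3, std::memory_order_relaxed);
    });
    std::thread t1([&] {
        int ty = y.load(std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_acquire); // at least a LoadLoad barrier
        int tx = x.load(std::memory_order_relaxed);
        std::cout << ty + tx << "\n"; // a sum of 3 would mean (3,0), which the fences rule out
    });
    t0.join();
    t1.join();
}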
The hardware-reordering test could run a lot faster if you looped inside one thread storing i and -i or something, and looped inside the other thread checking that abs(y) is always >= abs(x). Creating and destroying two threads per test is a lot of overhead.
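Here is a hedged sketch of that kind of looped test, adapted slightly: both counters store the same value i (instead of i and -i), so the invariant that holds if the stores are not reordered is tx >= ty. Everything is memory_order_relaxed so that any compile-time or hardware reordering can actually show up.
#include <atomic>
#include <cstdio>
#include <thread>

int main() {
    std::atomic<int> x(0), y(0);
    std::atomic<bool> done(false);

    std::thread writer([&] {
        for (int i = 1; i <= 100000000; ++i) {
            x.store(i, std::memory_order_relaxed); // intended to happen first...
            y.store(i, std::memory_order_relaxed); // ...and this one second
        }
        done.store(true, std::memory_order_relaxed);
    });
    std::thread reader([&] {
        while (!done.load(std::memory_order_relaxed)) {
            int ty = y.load(std::memory_order_relaxed); // read y first...
            int tx = x.load(std::memory_order_relaxed); // ...then x
            if (tx < ty)
                std::printf("reordering observed: x=%d y=%d\n", tx, ty);
        }
    });
    writer.join();
    reader.join();
}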
Of course, to get this right, you have to know how to use C++ to generate the asm you want (or write in asm directly).


How do post- and pre-increment work with the multiplication operator? [duplicate]

What are "sequence points"?
What is the relation between undefined behaviour and sequence points?
I often use funny and convoluted expressions like a[++i] = i;, to make myself feel better. Why should I stop using them?
If you've read this, be sure to visit the follow-up question Undefined behavior and sequence points reloaded.
C++98 and C++03
This answer is for the older versions of the C++ standard. The C++11 and C++14 versions of the standard do not formally contain 'sequence points'; operations are 'sequenced before' or 'unsequenced' or 'indeterminately sequenced' instead. The net effect is essentially the same, but the terminology is different.
Disclaimer: Okay. This answer is a bit long. So have patience while reading it. If you already know these things, reading them again won't make you crazy.
Prerequisites: an elementary knowledge of the C++ Standard
What are Sequence Points?
The Standard says
At certain specified points in the execution sequence called sequence points, all side effects of previous evaluations
shall be complete and no side effects of subsequent evaluations shall have taken place. (§1.9/7)
Side effects? What are side effects?
Evaluation of an expression produces something, and if, in addition, there is a change in the state of the execution environment, then the expression (its evaluation) is said to have some side effect(s).
For example:
int x = y++; //where y is also an int
In addition to the initialization operation the value of y gets changed due to the side effect of ++ operator.
So far so good. Moving on to sequence points. An alternative definition of seq-points given by the comp.lang.c author Steve Summit:
Sequence point is a point in time at which the dust has settled and all side effects which have been seen so far are guaranteed to be complete.
What are the common sequence points listed in the C++ Standard?
Those are:
at the end of the evaluation of a full expression (§1.9/16) (A full-expression is an expression that is not a subexpression of another expression.)1
Example :
int a = 5; // ; is a sequence point here
in the evaluation of each of the following expressions after the evaluation of the first expression (§1.9/18) 2
a && b (§5.14)
a || b (§5.15)
a ? b : c (§5.16)
a , b (§5.18) (here a , b is a comma operator; in func(a,a++) , is not a comma operator, it's merely a separator between the arguments a and a++. Thus the behaviour is undefined in that case (if a is considered to be a primitive type))
at a function call (whether or not the function is inline), after the evaluation of all function arguments (if any) which
takes place before execution of any expressions or statements in the function body (§1.9/17).
1 : Note : the evaluation of a full-expression can include the evaluation of subexpressions that are not lexically
part of the full-expression. For example, subexpressions involved in evaluating default argument expressions (8.3.6) are considered to be created in the expression that calls the function, not the expression that defines the default argument
2 : The operators indicated are the built-in operators, as described in clause 5. When one of these operators is overloaded (clause 13) in a valid context, thus designating a user-defined operator function, the expression designates a function invocation and the operands form an argument list, without an implied sequence point between them.
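As a small illustration of the comma-operator point in the list above (the function and variable names here are purely for demonstration):
void f(int, int);

void demo() {
    int a = 0;
    int b = (a++, a); // comma operator: sequence point after a++, so b == 1 (well defined)
    // f(a, a++);     // here the comma is only an argument separator: no sequence point
                      // between the two uses of a, so this line would invoke undefined behaviour
    (void)b;
}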
What is Undefined Behaviour?
The Standard defines Undefined Behaviour in Section §1.3.12 as
behavior, such as might arise upon use of an erroneous program construct or erroneous data, for which this International Standard imposes no requirements 3.
Undefined behavior may also be expected when this
International Standard omits the description of any explicit definition of behavior.
3 : permissible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message).
In short, undefined behaviour means anything can happen from daemons flying out of your nose to your girlfriend getting pregnant.
What is the relation between Undefined Behaviour and Sequence Points?
Before I get into that you must know the difference(s) between Undefined Behaviour, Unspecified Behaviour and Implementation Defined Behaviour.
You must also know that the order of evaluation of operands of individual operators and subexpressions of individual expressions, and the order in which side effects take place, is unspecified.
For example:
int x = 5, y = 6;
int z = x++ + y++; //it is unspecified whether x++ or y++ will be evaluated first.
Now the Standard in §5/4 says
Between the previous and next sequence point a scalar object shall have its stored value modified at most once by the evaluation of an expression.
What does it mean?
Informally it means that between two sequence points a variable must not be modified more than once.
In an expression statement, the next sequence point is usually at the terminating semicolon, and the previous sequence point is at the end of the previous statement. An expression may also contain intermediate sequence points.
From the above sentence the following expressions invoke Undefined Behaviour:
i++ * ++i; // UB, i is modified more than once btw two SPs
i = ++i; // UB, same as above
++i = 2; // UB, same as above
i = ++i + 1; // UB, same as above
++++++i; // UB, parsed as (++(++(++i)))
i = (i, ++i, ++i); // UB, there's no SP between `++i` (right most) and assignment to `i` (`i` is modified more than once btw two SPs)
But the following expressions are fine:
i = (i, ++i, 1) + 1; // well defined (AFAIK)
i = (++i, i++, i); // well defined
int j = i;
j = (++i, i++, j*i); // well defined
Furthermore, the prior value shall be accessed only to determine the value to be stored.
What does it mean? It means if an object is written to within a full expression, any and all accesses to it within the same expression must be directly involved in the computation of the value to be written.
For example in i = i + 1 all the access of i (in L.H.S and in R.H.S) are directly involved in computation of the value to be written. So it is fine.
This rule effectively constrains legal expressions to those in which the accesses demonstrably precede the modification.
Example 1:
std::printf("%d %d", i,++i); // invokes Undefined Behaviour because of Rule no 2
Example 2:
a[i] = i++; // or a[++i] = i or a[i++] = ++i etc
is disallowed because one of the accesses of i (the one in a[i]) has nothing to do with the value which ends up being stored in i (which happens over in i++), and so there's no good way to define--either for our understanding or the compiler's--whether the access should take place before or after the incremented value is stored. So the behaviour is undefined.
Example 3 :
int x = i + i++; // Similar to above
A follow-up answer for C++11 is given below.
This is a follow-up to my previous answer and contains C++11-related material.
Prerequisites: an elementary knowledge of relations (mathematics).
Is it true that there are no Sequence Points in C++11?
Yes! This is very true.
Sequence Points have been replaced by Sequenced Before and Sequenced After (and Unsequenced and Indeterminately Sequenced) relations in C++11.
What exactly is this 'Sequenced before' thing?
Sequenced Before (§1.9/13) is a relation which is:
Asymmetric
Transitive
between evaluations executed by a single thread and induces a strict partial order1
Formally it means given any two evaluations(See below) A and B, if A is sequenced before B, then the execution of A shall precede the execution of B. If A is not sequenced before B and B is not sequenced before A, then A and B are unsequenced 2.
Evaluations A and B are indeterminately sequenced when either A is sequenced before B or B is sequenced before A, but it is unspecified which3.
[NOTES]
1 : A strict partial order is a binary relation "<" over a set P which is asymmetric, and transitive, i.e., for all a, b, and c in P, we have that:
(i) if a < b then ¬(b < a) (asymmetry);
(ii) if a < b and b < c then a < c (transitivity).
2 : The execution of unsequenced evaluations can overlap.
3 : Indeterminately sequenced evaluations cannot overlap, but either could be executed first.
What is the meaning of the word 'evaluation' in context of C++11?
In C++11, evaluation of an expression (or a sub-expression) in general includes:
value computations (including determining the identity of an object for glvalue evaluation and fetching a value previously assigned to an object for prvalue evaluation) and
initiation of side effects.
Now (§1.9/14) says:
Every value computation and side effect associated with a full-expression is sequenced before every value computation and side effect associated with the next full-expression to be evaluated.
Trivial example:
int x;
x = 10;
++x;
Value computation and side effect associated with ++x is sequenced after the value computation and side effect of x = 10;
So there must be some relation between Undefined Behaviour and the above-mentioned things, right?
Yes! Right.
In (§1.9/15) it has been mentioned that
Except where noted, evaluations of operands of individual operators and of subexpressions of individual expressions are unsequenced4.
For example :
int main()
{
    int num = 19;
    num = (num << 3) + (num >> 3);
}
Evaluation of operands of + operator are unsequenced relative to each other.
Evaluation of operands of << and >> operators are unsequenced relative to each other.
4: In an expression that is evaluated more than once during the execution
of a program, unsequenced and indeterminately sequenced evaluations of its subexpressions need not be performed consistently in different evaluations.
(§1.9/15)
The value computations of the operands of an
operator are sequenced before the value computation of the result of the operator.
That means in x + y the value computation of x and y are sequenced before the value computation of (x + y).
More importantly
(§1.9/15) If a side effect on a scalar object is unsequenced relative to either
(a) another side effect on the same scalar object
or
(b) a value computation using the value of the same scalar object,
the behaviour is undefined.
Examples:
int i = 5, v[10] = { };
void f(int, int);
i = i++ * ++i; // Undefined Behaviour
i = ++i + i++; // Undefined Behaviour
i = ++i + ++i; // Undefined Behaviour
i = v[i++]; // Undefined Behaviour
i = v[++i]; // Well-defined Behaviour
i = i++ + 1; // Undefined Behaviour
i = ++i + 1; // Well-defined Behaviour
++++i; // Well-defined Behaviour
f(i = -1, i = -1); // Undefined Behaviour (see below)
When calling a function (whether or not the function is inline), every value computation and side effect associated with any argument expression, or with the postfix expression designating the called function, is sequenced before execution of every expression or statement in the body of the called function. [Note: Value computations and side effects associated with different argument expressions are unsequenced. — end note]
The expressions marked well-defined above (i = v[++i], i = ++i + 1, and ++++i) do not invoke undefined behaviour. Check out the following answers for a more detailed explanation.
Multiple preincrement operations on a variable in C++0x
Unsequenced Value Computations
C++17 (N4659) includes a proposal Refining Expression Evaluation Order for Idiomatic C++
which defines a stricter order of expression evaluation.
In particular, the following sentence
8.18 Assignment and compound assignment operators:....
In all cases, the assignment is sequenced after the value
computation of the right and left operands, and before the value computation of the assignment expression.
The right operand is sequenced before the left operand.
together with the following clarification
An expression X is said to be sequenced before an expression Y if every
value computation and every side effect associated with the expression X is sequenced before every value
computation and every side effect associated with the expression Y.
make several cases of previously undefined behavior valid, including the one in question:
a[++i] = i;
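A small worked trace of that guarantee (i and a are just illustrative names):
int a[3] = {0, 0, 0};
int i = 0;
a[++i] = i; // C++17: the right operand i is evaluated first (it reads 0),
            // then the left operand a[++i] runs, making i == 1 and selecting a[1],
            // and finally 0 is stored, so afterwards i == 1 and a[1] == 0.
            // Under C++11/C++03 this same statement was undefined behaviour.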
However several other similar cases still lead to undefined behavior.
In N4140:
i = i++ + 1; // the behavior is undefined
But in N4659
i = i++ + 1; // the value of i is incremented
i = i++ + i; // the behavior is undefined
Of course, using a C++17 compliant compiler does not necessarily mean that one should start writing such expressions.
I am guessing there is a fundamental reason for the change; it isn't merely cosmetic, to make the old interpretation clearer. That reason is concurrency. An unspecified order of evaluation is merely the selection of one of several possible serial orderings; this is quite different from before/after orderings, because if there is no specified ordering, concurrent evaluation is possible: not so with the old rules. For example, in:
f(a, b)
previously either a was evaluated then b, or b then a. Now, a and b can be evaluated with instructions interleaved, or even on different cores.
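As a minimal illustration (with hypothetical helper functions), the relative order in which the two argument expressions run is not fixed in a way the program can rely on, so this program may print A then B, or B then A, depending on the implementation:
#include <cstdio>

int side(const char* tag) { std::puts(tag); return 0; }
int f(int, int) { return 0; }

int main() { f(side("A"), side("B")); }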
In C99 (ISO/IEC 9899:TC3), which seems absent from this discussion thus far, the following statements are made regarding the order of evaluation:
[...] the order of evaluation of subexpressions and the order in which side effects take place are both unspecified. (Section 6.5, p. 67)
The order of evaluation of the operands is unspecified. If an attempt is made to modify the result of an assignment operator or to access it after the next sequence point, the behavior [sic] is undefined. (Section 6.5.16, p. 91)

GLSL: Why is a random write in local array significantly slower than a looped write?

Let's look at a simplified example function in GLSL:
void foo() {
    vec2 localData[16];
    // ...
    int i = ...;          // somehow dependent on dynamic data (not known at compile time)
    localData[i] = x;     // THE IMPORTANT LINE
}
It writes some value x to a dynamically determined index in a local array.
Now, replacing the line localData[i] = x; with
for (int j = 0; j < 16; ++j)
    if (i == j)
        localData[j] = x;
makes the code significantly faster. In several tested examples (different shaders) the execution time almost halved, and there was much more going on than this write.
For example: in an order-independent transparency shader which, among other things, fetches 16 texels, the timings are 39 ms with the direct write and 23 ms with the looped write. Nothing else changed!
The test hardware is a GTX 1080. The assembly returned by glGetProgramBinary is still too high-level. It contains one line in the first case and a loop+if surrounding an identical line in the second.
Why does this performance issue happen?
Is this true for all vendors?
Guess: localData is stored in 8 vec4 registers (the assembly does not say anything about that). Further, I assume that registers cannot be addressed with a dynamic index. If both are true, then the final binary must use some branch construct. The loop variant might be unrolled and result in a switch-like pattern which is faster. But is that common for all vendors? Why can't the compiler use whatever results from the for loop as the default for such writes?
Further experiments have shown that the reason is the use of a different memory type for the array. The (unrolled) looped variant uses registers, while the random access variant switches to local memory.
Local memory is usually placed in global memory, but is private to each thread. It is likely that accesses to this local array are going to be cached (L2?).
The experiments to verify this reasoning were the following:
Manual versions of unrolled loops (measured in an insertion sort with 16 elements over 1M pixels):
Baseline (localData[i] = x): 33 ms
For loop (for j + if i == j): 16.8 ms
Switch (switch(i) { case 0: localData[0] = x; ... }): 16.92 ms
If-else tree (splitting in halves): 16.92 ms
If list (plain manual unroll): 16.8 ms
=> All kinds of branch constructs result in more or less the same timings. So it is not bad branching behaviour, as initially guessed.
Multiple vs. one vs. no random accesses (32-element insertion sort):
2x localData[i] = x 47ms
1x localData[i] = x 45ms
0x localData[i] = x 16ms
=> As long as there is at least one random access, the performance will be bad. This means there is a global decision changing the behaviour of localData -- most likely the use of a different memory. Using more than one random access does not make things much worse, because of caching.

How to optimize for time?

I'm trying to understand if there is a difference in speed when executing the following lines of code in a computer program:
(1) myarray[1] = 5; return myarray[1];
(2) myarray[0] = 5; return myarray[0];
(3) x = 5; return x;
(4) x = 5; y = x; return y;
(5) return 5;
From what I understand, arrays are basically pointers (variables that store the memory addresses of other variables). Therefore (1) and (2) should be the same speed, but slower than (3), (4) and (5).
(5) should be the fastest, (3) should be slower than (5) because there is an equal sign, and (4) should be slower than (3) because there are two equal signs that need to be handled.
Would this be right?
You don't give any context for what myarray, x and y are. Without that context, the question cannot be answered in any meaningful way. The extra assignments may well have no observable effects at all, in which case they can be optimised away entirely.
Basically, looking at speed optimisation at this elementary level is completely pointless. If you want to look at speed, you need code that is substantial enough that the execution time can be measured. You cannot measure the time of one or two simple statements on a modern processor.
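To make that last point concrete, here is a hedged sketch of the kind of measurement the answer has in mind: time a loop that does enough work to register, rather than a single statement. The volatile sink is only there to keep the work from being optimised away.
#include <chrono>
#include <iostream>

int main() {
    volatile long long sink = 0;
    auto start = std::chrono::steady_clock::now();
    for (long long i = 0; i < 100000000; ++i)
        sink = sink + i;   // enough repeated work to be measurable
    auto stop = std::chrono::steady_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
    std::cout << ms << " ms\n";
}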

Iteration 3u invokes undefined behavior

#include <iostream>

int main()
{
    for (int i = 0; i < 4; ++i)
        std::cout << i * 1000000000 << std::endl;
}
I get the following warning from gcc whenever I compile this:
warning: iteration 3u invokes undefined behavior [-Waggressive-loop-optimizations]
std::cout << i * 1000000000 << std::endl;
What's the cause of this warning?
Signed integer overflow (strictly speaking, there is no such thing as "unsigned integer overflow") means undefined behaviour. The Standard says:
Unsigned integers, declared unsigned, shall obey the laws of arithmetic modulo 2^n where n is the number of bits in the value representation of that particular size of integer.
I suspect that it's something like: (1) because every iteration with i of any value larger than 2 has undefined behavior -> (2) we can assume that i <= 2 for optimization purposes -> (3) the loop condition is always true -> (4) it's optimized away into an infinite loop.
What is going on is a case of strength reduction, more specifically, induction variable elimination. The compiler eliminates the multiplication by emitting code that instead adds 1e9 to a running counter each iteration (and changes the loop condition accordingly). This is a perfectly valid optimization under the "as if" rule, as this program could not observe the difference were it well-behaved. Alas, it's not, and the optimization "leaks".
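A hedged sketch (hand-written, not actual gcc output) of the strength-reduction step described above: the multiplication i * 1000000000 is replaced by a counter p that grows by 1000000000 per iteration.
for (int i = 0, p = 0; i < 4; ++i, p += 1000000000) // computing p for iteration 3 yields
    std::cout << p << std::endl;                    // 3000000000, which overflows int: UB
The further step, eliminating i entirely and rewriting the loop condition in terms of p, is where things visibly go wrong: the rewritten bound (4 * 1000000000) does not fit in an int, and since the compiler may assume the overflow never happens, the loop condition effectively never becomes false. That is what the warning is telling you.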

set RNG state with openMP and Rcpp

I have a clarification question.
It is my understanding, that sourceCpp automatically passes on the RNG state, so that set.seed(123) gives me reproducible random numbers when calling Rcpp code. When compiling a package, I have to add a set RNG statement.
Now how does this all work with openMP either in sourceCpp or within a package?
Consider the following Rcpp code
#include <Rcpp.h>
#include <omp.h>

// [[Rcpp::depends("RcppArmadillo")]]

// [[Rcpp::export]]
Rcpp::NumericVector rnormrcpp1(int n, double mu, double sigma) {
    Rcpp::NumericVector out(n);
    for (int i = 0; i < n; i++) {
        out(i) = R::rnorm(mu, sigma);
    }
    return out;
}

// [[Rcpp::export]]
Rcpp::NumericVector rnormrcpp2(int n, double mu, double sigma, int cores = 1) {
    omp_set_num_threads(cores);
    Rcpp::NumericVector out(n);
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n; i++) {
        out(i) = R::rnorm(mu, sigma);
    }
    return out;
}
And then run
set.seed(123)
a1=rnormrcpp1(100,2,3)
set.seed(123)
a2=rnormrcpp1(100,2,3)
set.seed(123)
a3=rnormrcpp2(100,2,3,2)
set.seed(123)
a4=rnormrcpp2(100,2,3,2)
all.equal(a1,a2)
all.equal(a3,a4)
While a1 and a2 are identical, a3 and a4 are not. How can I adjust the RNG state with the openMP loop? Can I?
To expand on what Dirk Eddelbuettel has already said, it is next to impossible to both generate the same PRN sequence in parallel and have the desired speed-up. The root of this is that generation of PRN sequences is essentially a sequential process where each state depends on the previous one and this creates a backward dependence chain that reaches back as far as the initial seeding state.
There are two basic solutions to this problem. One of them requires a lot of memory and the other one requires a lot of CPU time and both are actually more like workarounds than true solutions:
pregenerated PRN sequence: One thread generates sequentially a huge array of PRNs and then all threads access this array in a manner that would be consistent with the sequential case. This method requires lots of memory in order to store the sequence. Another option would be to have the sequence stored into a disk file that is later memory-mapped. The latter method has the advantage that it saves some compute time, but generally I/O operations are slow, so it only makes sense on machines with limited processing power or with small amounts of RAM.
prewound PRNGs: This one works well in cases when work is being statically distributed among the threads, e.g. with schedule(static). Each thread has its own PRNG and all PRNGs are seeded with the same initial seed. Then each thread draws as many dummy PRNs as its starting iteration, essentially prewinding its PRNG to the correct position. For example:
thread 0: draws 0 dummy PRNs, then draws 100 PRNs and fills out(0:99)
thread 1: draws 100 dummy PRNs, then draws 100 PRNs and fills out(100:199)
thread 2: draws 200 dummy PRNs, then draws 100 PRNs and fills out(200:299)
and so on. This method works well when each thread does a lot of computations besides drawing the PRNs since the time to prewind the PRNG could be substantial in some cases (e.g. with many iterations).
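Here is a hedged sketch of the prewinding idea just described, using a standard C++ engine (std::mt19937 with std::normal_distribution) instead of R::rnorm, because R's RNG must not be called from multiple threads. The function name rnorm_prewound and the chunking scheme are purely illustrative.
#include <algorithm>
#include <random>
#include <vector>
#include <omp.h>

std::vector<double> rnorm_prewound(int n, double mu, double sigma, unsigned seed) {
    std::vector<double> out(n);
    #pragma omp parallel
    {
        const int nth   = omp_get_num_threads();
        const int tid   = omp_get_thread_num();
        const int chunk = (n + nth - 1) / nth;
        const int begin = tid * chunk;
        const int end   = std::min(n, begin + chunk);

        std::mt19937 eng(seed);                     // every thread starts from the same seed
        std::normal_distribution<double> dist(mu, sigma);
        for (int k = 0; k < begin; ++k) dist(eng);  // prewind: throw away the earlier draws
        for (int i = begin; i < end; ++i)           // then produce this thread's block
            out[i] = dist(eng);
    }
    return out;
}
The result is identical to the single-threaded sequence, at the cost of the wasted prewinding draws, which is exactly the trade-off described above.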
A third option exists for the case when there is a lot of data processing besides drawing a PRN. This one uses OpenMP ordered loops (note that the iteration chunk size is set to 1):
#pragma omp parallel for ordered schedule(static,1)
for (int i = 0; i < n; i++) {
    #pragma omp ordered
    {
        rnum = R::rnorm(mu, sigma);
    }
    out(i) = lots of processing on rnum;
}
Although loop ordering essentially serialises the computation, it still allows for lots of processing on rnum to execute in parallel and hence parallel speed-up would be observed. See this answer for a better explanation as to why so.
Yes, sourceCpp() etc. use an instantiation of RNGScope so the RNGs are left in a proper state.
And yes, one can use OpenMP. But inside an OpenMP segment you cannot control the order in which the threads execute, so you no longer get the same sequence. I have the same problem with a package under development where I would like to have reproducible draws yet use OpenMP. But it seems you can't.
