D / DLang / GDC - Alignment of SIMD fields in struct - gcc

struct vec_struct {
alias field this;
bool b;
int8 field; // ymm
}
In this code, when you look at the x64 code generated by GDC, it seems to be doing a nice job: the offset of the 256-bit YMM 'field' is correctly aligned.
Q: Does D automatically propagate the alignment restrictions on the field to the allocation of static structs or structs on the stack?
In this case -
struct vec_struct {
bool b2;
struct {
alias field this;
bool b;
int8 field; // ymm
}
}
it appears that the offset to 'field' is no longer aligned correctly - offset is 40 bytes in GDC. I don't suppose the compiler will use solely unaligned instructions? In any event, I could take the address of field and then pass that to someone expecting to pick up something with guaranteed correct alignment, if I have understood the D docs. Q: Is this correct - a seriously bad problem, or am I being daft?
Please don't bite. I'm both new to D and I hope I have understood the x86 SIMD instructions' docs. (Very experienced professional asm and C programmer, but v out-of-date.)
Noob q: I notice that the GDC opcodes look a bit odd, for example the compiler generates a 256-bit unaligned fetch followed by an aligned binary operation (I think), eg a movdqu followed by a vpaddd r, ymm ptr blah - is the latter aligned-only? Apologies if I have got this wrong, need to read up. Would someone be kind enough to sanity-check me?

Does D automatically propagate the alignment restrictions on the field to the allocation of static structs or structs on the stack?
I think it's supposed to and GDC/LDC should already support proper stack alignment. DMD probably has some bugs in this regard:
https://issues.dlang.org/show_bug.cgi?id=16098
it appears that the offset to 'field' is no longer aligned correctly - offset is 40 bytes in GDC.
As Iain already answered in the D.learn thread this is a bug in the shared DMD/GDC codebase.
Bug report: https://issues.dlang.org/show_bug.cgi?id=17237
DMD bugfix: https://github.com/dlang/dmd/pull/6582
Backport for GDC: https://github.com/D-Programming-GDC/GDC/pull/408
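The expected layout can be checked mechanically. The sketch below is a C++ analogue rather than D: `Vec8i`, `Inner`, `Outer` and `field_offset_in_outer` are names made up for illustration, and `alignas(32)` stands in for D's `int8` vector type. It shows what a correct compiler is expected to do in the nested case, namely propagate the 32-byte requirement of the vector field outward to every enclosing struct.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a 256-bit vector of eight 32-bit ints.
struct alignas(32) Vec8i { std::int32_t lane[8]; };

struct Inner { bool b; Vec8i field; };   // the inner struct from the question
struct Outer { bool b2; Inner inner; };  // the wrapper that exposed the bug

// The alignment of the most-aligned member becomes the struct's alignment.
static_assert(alignof(Inner) == 32, "Inner must inherit the vector alignment");
static_assert(alignof(Outer) == 32, "Outer must inherit it from Inner");

// Offset of the vector field measured from the start of the outer struct.
constexpr std::size_t field_offset_in_outer() {
    return offsetof(Outer, inner) + offsetof(Inner, field);
}

static_assert(field_offset_in_outer() % 32 == 0, "field must stay 32-byte aligned");
```

An offset of 40, as reported for the buggy GDC build, would fail the last `static_assert`.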

Related

Will compiler optimize away called-once functions

I've been reading "Clean Code" by Robert C Martin, and one of the pieces of advice is to use more but smaller functions; i.e. instead of
int main(){
// do one thing
// ... 10 lines of code
// do another thing
// ... another 10 lines of code
// one last thing
// ... another 10 lines
return 0;
}
you ought to
int main(){
doOneThing();
doAnotherThing();
oneLastThing();
}
void doOneThing(){
... 10 lines of code
}
// ... you get the idea
However, from my understanding of low-level languages, I know that when a function is called, variables are pushed onto the stack and the stack pointer is adjusted, etc., whereas for the continuous code there is no need to do this.
On the other hand, compiler optimisations can do cool things like inline class methods. Assuming doOneThing is called exactly once, could a compiler analyse this code, deduce that the call can be inlined into main(), and eliminate the function call and its associated runtime overhead altogether?
A previous version of the question was about C++ or C. I am convinced that it needs the context of a language to discuss this topic in a meaningful way, so I chose to stay with C++.
C++ is not really "low level". Code you write is not instructions for your CPU. Between your code and what actually happens during runtime there is one of the most sophisticated pieces of software: Your compiler. When you turn on optimizations then your compiler will analyze your code and try to produce the most efficient executable that behaves at runtime as your code would if it was taken literally (i.e. no optimizations).
This is governed by the so-called as-if rule. Calling a function is not observable behavior. If the function is small, the compiler will inline the call to the function.
On the other hand, calling a function does of course have a cost, but that cost is comparatively small. You only need to start worrying about this overhead when the function is very small, like only 1 or 2 lines, and (this is important!) for some reason the compiler cannot inline the call. This can be due to the function being virtual, for example.
You are asking whether a compiler can optimize it, so I pick just one example. GCC with -O3 produces the following output for the following code:
int foo() { return 42;}
int bar() { return 0;}
int moo() { return 1;}
int main() {
return foo() + bar() + moo();
}
output:
foo():
mov eax, 42
ret
bar():
xor eax, eax
ret
moo():
mov eax, 1
ret
main:
mov eax, 43
ret
You can see that in main no function is called. The compiler examined the expression foo() + bar() + moo() and realized it always equals 43. No function has to be called to return 43.
This is a silly example, but in the general case, if you want to see what the compiler did, you need to do the same: look at the compiler's output.
And to do that you need to write the code first. It is of no use to speculate beforehand about which code would be more or less efficient. Because you need to write the code anyhow, you can write clean, simple, human-comprehensible code. That's the code your compiler understands best and knows how to optimize, because that's also what other programmers write.
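As a slightly less silly experiment, here is a minimal sketch with helpers that are each called exactly once, next to a hand-merged version. The function names are invented for illustration. Whether the calls are actually inlined depends on the compiler and flags, so the only way to know is to compile with something like `g++ -O2 -S` and compare the assembly emitted for the two functions.

```cpp
#include <cassert>

// Small helpers, each called exactly once. `static` gives them internal
// linkage, so the compiler knows no other translation unit can call them.
static int sumTo(int n) {
    int s = 0;
    for (int i = 1; i <= n; ++i) s += i;
    return s;
}
static int square(int x) { return x * x; }
static int addOne(int x) { return x + 1; }

// "Clean Code" style: three small named steps.
int computeSplit(int n) { return addOne(square(sumTo(n))); }

// The same work written as one continuous block.
int computeMerged(int n) {
    int s = 0;
    for (int i = 1; i <= n; ++i) s += i;
    s = s * s;
    return s + 1;
}
```

Both functions have the same observable behavior, so under the as-if rule an optimizing compiler is free to generate identical code for them.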

GCC inline assembly read value from array

While learning gcc inline assembly I was playing a bit with memory access. I'm trying to read a value from an array using a value from a different array as index.
Both arrays are initialized to something.
Initialization:
uint8_t* index = (uint8_t*)malloc(256);
memset(index, 33, 256);
uint8_t* data = (uint8_t*)malloc(256);
memset(data, 44, 256);
Array access:
unsigned char read(void *index,void *data) {
unsigned char value;
asm __volatile__ (
" movzb (%1), %%edx\n"
" movzb (%2, %%edx), %%eax\n"
: "=r" (value)
: "c" (index), "c" (data)
: "%eax", "%edx");
return value;
}
This is how I use the function:
unsigned char value = read(index, data);
Now I would expect it to return 44, but it actually returns some random value. Am I reading from uninitialized memory? Also, I'm not sure how to tell the compiler that it should assign the value from eax to the variable value.
You told the compiler you were going to put the output in %0, and it could pick any register for that "=r". But instead you never write %0 in your template.
And you use two temporaries for no apparent reason when you could have used %0 as the temporary.
As usual, you can debug your inline asm by adding comments like # 0 = %0 and looking at the compiler's asm output (not disassembly, just gcc -S, to see what it fills in), e.g. # 0 = %ecx. You didn't use an early-clobber "=&r", so it can pick the same register as one of the inputs.
Also, this has 2 other bugs:
It doesn't compile. Requesting 2 different operands in ECX with "c" constraints can't work unless the compiler can prove at compile time that they have the same value, so that %1 and %2 can be the same register. https://godbolt.org/z/LgR4xS
You dereference pointer inputs without telling the compiler you're reading the pointed-to memory. Use a "memory" clobber or dummy memory operands; see the question "How can I indicate that the memory *pointed* to by an inline ASM argument may be used?"
Or better https://gcc.gnu.org/wiki/DontUseInlineAsm because it's useless for this; just let GCC emit the movzb loads itself. unsigned char* is safe from strict-aliasing UB so you can safely cast any pointer to unsigned char* and dereference it, without even having to use memcpy or other hacks to fight against language rules for wider unaligned or type-punned accesses.
But if you insist on inline asm, read manuals and tutorials, links at https://stackoverflow.com/tags/inline-assembly/info. You can't just throw code at the wall until it sticks with inline asm: you must understand why your code is safe to have any hope of it being safe. There are many ways for inline asm to happen to work but actually be broken, or be waiting to break with different surrounding code.
This is a safe and not totally terrible version (other than the unavoidable optimization-defeating parts of inline asm). You do still want a movzbl load for both loads, even though the return value is only 8 bits. movzbl is the natural efficient way to load a byte, replacing instead of merging with the old contents of a full register.
unsigned char read(void *index, void *data)
{
uintptr_t value;
asm (
" movzb (%[idx]), %k[out] \n\t"
" movzb (%[arr], %[out]), %k[out]\n"
: [out] "=&r" (value) // early-clobber output
: [idx] "r" (index), [arr] "r" (data)
: "memory" // we deref some inputs as pointers
);
return value;
}
Note the early-clobber on the output: this stops gcc from picking the same register for output as one of the inputs. It would be safe for it to destroy the [idx] register with the first load, but I don't know how to tell GCC that in one asm statement. You could split your asm statement into two separate ones, each with their own input and output operands, connecting the output of the first to the input of the 2nd via a local variable. Then neither one would need early-clobber because they're just wrapping single instructions like GNU C inline asm syntax is designed to do nicely.
Godbolt with test caller to see how it inlines / optimizes when called twice, with i386 clang and x86-64 gcc. e.g. asking for index in a register forces an LEA, instead of letting the compiler see the deref and letting it pick an addressing mode for *index. Also the extra movzbl %al, %eax done by the compiler when adding to unsigned sum because we used a narrow return type.
I used uintptr_t value so this can compile for 32-bit and 64-bit x86. There's no harm in making the output from the asm statement wider than the return value of the function, and that saves us from having to use size modifiers like movzbl (%1), %k0 to get GCC to print the 32-bit register name (like EAX) if it chose AL for an 8-bit output variable, for example.
I did decide to actually use %k[out] for the benefit of 64-bit mode: we want movzbl (%rdi), %eax, not movzb (%rdi), %rax (wasting a REX prefix).
You might as well declare the function to return unsigned int or uintptr_t, though, so the compiler knows that it doesn't have to redo zero-extension. OTOH sometimes it can help the compiler to know that the value range is only 0..255. You could tell it that you produce a correctly zero-extended value using if(retval>255) __builtin_unreachable() or something. Or you could just not use inline asm.
You don't need asm volatile. (Assuming you want to let it optimize away if the result is unused, or be hoisted out of loops for constant inputs). You only need a "memory" clobber so if it does get used, the compiler knows that it reads memory.
(A "memory" clobber counts as all memory being an input, and all memory being an output. So it can't CSE, e.g. hoist out of a loop, because as far as the compiler knows one invocation might read something a previous one wrote. So in practice a "memory" clobber is about as bad as asm volatile. Even two back-to-back calls to this function without touching the input array force the compiler to emit the instructions twice.)
You could avoid this with dummy memory-input operands so the compiler knows this asm block doesn't modify memory, only read it. But if you actually care about efficiency, you shouldn't be using inline asm for this.
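A minimal sketch of the dummy-memory-input idea, under several assumptions: a GNU-compatible compiler, an x86 target, and a `data` array of at least 256 bytes. The function name `read_indexed` is invented here. On any other compiler or target the sketch falls back to plain C++, which is what you should be using anyway.

```cpp
#include <cassert>
#include <cstdint>

// Reads data[*index]. Assumes `data` points to at least 256 bytes.
inline unsigned char read_indexed(const unsigned char* index,
                                  const unsigned char* data) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    std::uintptr_t out;
    asm("movzbl %[idx], %k[out]\n\t"            // out = *index, zero-extended
        "movzbl (%[arr], %[out]), %k[out]"      // out = data[out]
        : [out] "=&r"(out)                      // early-clobber: written before arr is read
        : [idx] "m"(*index),                    // real memory input: the compiler picks the addressing mode
          [arr] "r"(data),
          "m"(*(const unsigned char (*)[256])data)  // dummy input: "this asm reads these 256 bytes"
    );                                          // no "memory" clobber, no volatile
    return static_cast<unsigned char>(out);
#else
    return data[*index];                        // portable fallback
#endif
}
```

Because the statement only declares memory inputs, the compiler may still merge two back-to-back calls with unchanged arrays, yet it must redo the load after a store to either array.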
But like I said there is zero reason to use inline asm:
This will do exactly the same thing in 100% portable and safe ISO C:
// safe from strict-aliasing violations
// because unsigned char* can alias anything
inline
unsigned char read(void *index, void *data) {
unsigned idx = *(unsigned char*)index;
unsigned char * dp = data;
return dp[idx];
}
You could cast one or both pointers to volatile unsigned char* if you insist on the access happening every time and not being optimized away.
Or maybe even to atomic<unsigned char> * depending on what you're doing. (That's a hack, prefer C++20 atomic_ref to atomically load/store on objects that are normally not atomic.)
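For the volatile variant mentioned above, a C++ sketch could look like this. The name `read_volatile` is hypothetical, chosen so it does not collide with the POSIX `read` function.

```cpp
#include <cassert>

// Same lookup as the portable version, but both loads go through
// volatile-qualified pointers, so the compiler must perform each access
// every time the function is evaluated and may not cache or drop them.
inline unsigned char read_volatile(const void* index, const void* data) {
    unsigned idx = *static_cast<const volatile unsigned char*>(index);
    const volatile unsigned char* dp =
        static_cast<const volatile unsigned char*>(data);
    return dp[idx];
}
```

Accessing any object through an unsigned char pointer is still allowed by the aliasing rules; volatile only adds the requirement that the accesses actually happen.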

What are the limitations on the use of output registers in avr-gcc inline assembly?

Output register in inline assembly must be declared with the "=" constraint, meaning "write-only" [1]. What exactly does this mean - is it truly forbidden to read and modify them within the assembly? For example, consider this code:
uint8_t one ()
{
uint8_t res;
asm("ldi %[res],0\n"
"inc %[res]\n"
: [res] "=r" (res)
);
return res;
}
The assembly sets the output register to 0 then increments it. Is this breaking the "write-only" constraint?
UPDATE
I'm seeing problems where my inline asm breaks when I change it to work directly on an output register, as opposed to using r16 for the computation and finally mov'ing r16 into the output register. The code is here: http://ideone.com/JTpYma . It prints results to serial, you just need to define F_CPU and BAUD. The problem appears only when using gcc-4.8.0 and not using gcc-4.7.2.
[1] http://www.nongnu.org/avr-libc/user-manual/inline_asm.html
The compiler doesn't care whether you read it or not, it just won't put the initial value of the variable into the register. Your example is entirely legal, but people often wrongly expect to get result 2 from this code:
uint8_t one ()
{
uint8_t res = 1;
asm("inc %[res]\n"
: [res] "=r" (res)
);
return res;
}
Since it's only an output constraint, the initial value of res is not guaranteed to be loaded into the register. In fact, the initializer may even be optimized away on the assumption that the asm block will overwrite it anyway. The above code is compiled to this by my version of avr-gcc:
inc r24
ret
As you can see, the compiler did indeed remove the load of 1 into res (and hence into r24), so the result is undefined.
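If the asm is meant to read the old value as well as write the new one, the GCC constraint modifier for that is "+" (a read-write operand) instead of "=". A minimal sketch follows; the function name `increment` is made up, and the x86 branch is an assumption added only so the example can be tried on a desktop compiler (the AVR branch mirrors the code above).

```cpp
#include <cassert>
#include <cstdint>

std::uint8_t increment(std::uint8_t x) {
    std::uint8_t res = x;
#if defined(__AVR__)
    // "+r": res is loaded into the register before the asm runs
    // and read back afterwards.
    asm("inc %[res]" : [res] "+r"(res));
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    // x86 stand-in: "q" asks for a register that has a byte-sized name.
    asm("incb %[res]" : [res] "+q"(res) : : "cc");
#else
    ++res;  // fallback for other targets and compilers
#endif
    return res;
}
```

With "+r" the initializer can no longer be optimized away, because the compiler knows the asm statement consumes it.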
Update
The problem with the updated program in the question is that it also has an input register operand. By default the compiler assumes that all inputs are consumed before the outputs are assigned so it's safe to allocate overlapping registers. That's clearly not the case for your example. You should use an "early clobber" modifier (&) for the output. This is what the manual has to say about that:
& Means (in a particular alternative) that this operand is an
earlyclobber operand, which is modified before the instruction is
finished using the input operands. Therefore, this operand may not lie
in a register that is used as an input operand or as part of any
memory address.
Nobody said gcc inline asm was easy :D

LLVM equivalent of gcc's __BIGGEST_ALIGNMENT__?

GCC provides a __BIGGEST_ALIGNMENT__ pre-defined macro which is the largest alignment ever used for any data type on the target machine you are compiling for. I cannot seem to find an LLVM's equivalent for this. Is there any? If not, what is the best way to figure it out (preferably with pre-processor)?
This isn't accessible from the preprocessor, but __attribute__((aligned)) or __attribute__((__aligned__)) (with the alignment value omitted) will give the alignment you want. This is supposed to give the largest alignment of any built-in type, which is 16 on x86 and ARM.
For example:
$ cat align.c
struct foo {
char c;
} __attribute__((aligned)) var;
$ clang align.c -S -o - -emit-llvm
...
@var = global %struct.foo zeroinitializer, align 16
This is used by unwind.h for _Unwind_Exception:
struct _Unwind_Exception
{
_Unwind_Exception_Class exception_class;
_Unwind_Exception_Cleanup_Fn exception_cleanup;
_Unwind_Word private_1;
_Unwind_Word private_2;
/* ### The IA-64 ABI says that this structure must be double-word aligned.
Taking that literally does not make much sense generically. Instead we
provide the maximum alignment required by any type for the machine. */
} __attribute__((__aligned__));
This is in llvm internals as TargetData::PointerABIAlign, but it doesn't appear to be exposed to code. I'd just hard code to 16 bytes, as it seems like it'd be a while before we see any more aligned types or instruction sets.
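The attribute trick can be turned into a compile-time constant without the preprocessor. This is a sketch, assuming a GNU-compatible compiler for the attribute branch; `MaxAligned` and `kBiggestAlignment` are names invented here, and the fallback to `std::max_align_t` is my own substitution for other compilers, not something the answer above prescribes.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__GNUC__)
// `aligned` with no value asks for the target's default maximum alignment.
struct MaxAligned { char c; } __attribute__((aligned));
#else
// Assumed fallback: the strictest alignment of the standard scalar types.
struct MaxAligned { alignas(std::max_align_t) char c; };
#endif

constexpr std::size_t kBiggestAlignment = alignof(MaxAligned);
```

`kBiggestAlignment` can then be used in `alignas(...)` or in a `static_assert`, although not in `#if` directives.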

likely/unlikely equivalent for MSVC

The GCC compiler supports the __builtin_expect built-in, which is used to define the likely and unlikely macros.
eg.
#define likely(expr) (__builtin_expect(!!(expr), 1))
#define unlikely(expr) (__builtin_expect(!!(expr), 0))
Is there an equivalent statement for the Microsoft Visual C compiler, or something equivalent ?
C++20 standard will include [[likely]] and [[unlikely]] branch prediction attributes.
The latest revision of attribute proposal can be found from http://wg21.link/p0479
The original attribute proposal can be found from http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0479r0.html
Programmers should prefer PGO. Attributes can easily reduce performance if they are applied incorrectly or if they later become incorrect when the program changes.
According to http://www.akkadia.org/drepper/cpumemory.pdf (page 57), it still makes sense to use static branch prediction even if CPU predicts correctly dynamically.
The reason for that is that L1i cache will be used even more efficiently if static prediction was done right.
I say just punt
There is nothing like it. There is __assume(), but don't use it, it's a different kind of optimizer directive.
Really, the reason the gnu builtin is wrapped in a macro is so you can just get rid of it automatically if __GNUC__ is not defined. There isn't anything the least bit necessary about those macros and I bet you will not notice the run time difference.
Summary
Just get rid of (null out) *likely on non-GNU. You won't miss it.
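Nulling the macros out might look like the following sketch. The upper-case macro names and the `clamp_non_negative` helper are illustrative only.

```cpp
#include <cassert>

#if defined(__GNUC__) || defined(__clang__)
#  define LIKELY(expr)   (__builtin_expect(!!(expr), 1))
#  define UNLIKELY(expr) (__builtin_expect(!!(expr), 0))
#else
   // No hint available: keep the condition, drop the annotation.
#  define LIKELY(expr)   (!!(expr))
#  define UNLIKELY(expr) (!!(expr))
#endif

// Example use: the error path is marked as the rare one.
int clamp_non_negative(int x) {
    if (UNLIKELY(x < 0)) return 0;
    return x;
}
```

The `!!` in the fallback keeps the macro's value the same (0 or 1) on every compiler, so code using it behaves identically with or without the hint.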
According to Branch and Loop Reorganization to Prevent Mispredicts document from Intel:
In order to effectively write your code to take advantage of these
rules, when writing if-else or switch statements, check the most
common cases first and work progressively down to the least common.
Unfortunately you cannot write something like
#define if_unlikely(cond) if (!(cond)); else
because the MSVC optimizer, as of VS10, ignores such a "hint".
As I prefer to deal with errors first in my code, I seem to write less efficient code.
Fortunately, the second time the CPU encounters the branch it will use its statistics instead of a static hint.
__assume should be similar.
However, if you want to do this really well you should use Profile Guided Optimization rather than static hints.
I know this question is about Visual Studio, but I'm going to try to answer for as many compilers as I can (including Visual Studio)…
A decade later there is progress! As of Visual Studio 2019 MSVC still doesn't support anything like this (even though it's the most popular builtin/intrinsic), but as Pauli Nieminen mentioned above C++20 has likely / unlikely attributes which can be used to create likely/unlikely macros and MSVC usually adds support for new C++ standards pretty quickly (unlike C) so I expect Visual Studio 2021 to support them.
Currently (2019-10-14) only GCC supports these attributes, and even then only applied to labels, but it is sufficient to at least do some basic testing. Here is a quick implementation which you can test on Compiler Explorer:
#define LIKELY(expr) \
( \
([](bool value){ \
switch (value) { \
[[likely]] case true: \
return true; \
[[unlikely]] case false: \
return false; \
} \
}) \
(expr))
#define UNLIKELY(expr) \
( \
([](bool value){ \
switch (value) { \
[[unlikely]] case true: \
return true; \
[[likely]] case false: \
return false; \
} \
}) \
(expr))
Edit (2022-05-02): MSVC 2022 supports C++20, including [[likely]]/[[unlikely]], but generates absolutely terrible code for this (see the comments on this post)... don't use it there.
You'll probably want to #ifdef around it to support compilers that can't handle it, but luckily most compilers support __builtin_expect:
GCC 3.0
clang
ICC since at least 13, probably much longer.
Oracle Development Studio 12.6+, but only in C++ mode.
ARM 4.1
IBM XL C/C++ since at least 10.1, probably longer.
TI since 6.1
TinyCC since 0.9.27
GCC 9+ also supports __builtin_expect_with_probability. It's not available anywhere else, but hopefully one day… It takes a lot of the guesswork out of trying to figure out whether to use likely/unlikely or not: you just set the probability and the compiler (theoretically) does the right thing.
Also, clang supports a __builtin_unpredictable (since 3.8, but test for it with __has_builtin(__builtin_unpredictable)). Since a lot of compilers are based on clang these days it probably works in them, too.
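A sketch of wrapping both builtins behind `__has_builtin`, falling back to the bare condition when they are missing. The macro and function names are invented for illustration. One caveat: `__has_builtin` itself only appeared in GCC 10, so on GCC 9 this particular test falls through to the no-op definitions.

```cpp
#include <cassert>

#if defined(__has_builtin)
#  if __has_builtin(__builtin_expect_with_probability)
     // p is the probability (a constant in [0, 1]) that expr is true.
#    define PREDICT_TRUE(expr, p) \
         (__builtin_expect_with_probability(!!(expr), 1, (p)))
#  endif
#  if __has_builtin(__builtin_unpredictable)
#    define UNPREDICTABLE(expr) (__builtin_unpredictable(!!(expr)))
#  endif
#endif
#ifndef PREDICT_TRUE
#  define PREDICT_TRUE(expr, p) (!!(expr))
#endif
#ifndef UNPREDICTABLE
#  define UNPREDICTABLE(expr) (!!(expr))
#endif

// Mostly-ASCII input: tell the compiler the test passes about 90% of the time.
bool is_ascii(unsigned char ch) { return PREDICT_TRUE(ch < 0x80, 0.9); }

// Parity of arbitrary data is close to a coin flip, so mark it unpredictable.
int count_even(const int* v, int n) {
    int c = 0;
    for (int i = 0; i < n; ++i) {
        if (UNPREDICTABLE((v[i] & 1) == 0)) ++c;
    }
    return c;
}
```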
If you want this all wrapped up and ready to go, you might be interested in one of my projects, Hedley. It's a single public-domain C/C++ header which works on pretty much all compilers and contains lots of useful macros, including HEDLEY_LIKELY, HEDLEY_UNLIKELY, HEDLEY_UNPREDICTABLE, HEDLEY_PREDICT, HEDLEY_PREDICT_TRUE, and HEDLEY_PREDICT_FALSE. It doesn't have the C++20 version quite yet, but it should be there soon…
Even if you don't want to use Hedley in your project, you might want to check the implementations there instead of relying on the lists above; I'll probably forget to update this answer with new information, but Hedley should always be up-to-date.
Microsoft now says it has implemented the likely/unlikely attributes.
But in practice I see no difference between using [[likely]] and not using it.
I compiled both of the following programs and they produce the same result.
int main()
{
int i = rand() % 2;
if (i) [[likely]]
{
printf("Hello World!\n");
}
else
{
printf("Hello World2%d!\n",i);
}
}
int main()
{
int i = rand() % 2;
if (i)
{
printf("Hello World!\n");
}
else [[likely]]
{
printf("Hello World2%d!\n",i);
}
}
int pdb._main (int argc, char **argv, char **envp);
0x00401040 push ebp
0x00401041 mov ebp, esp
0x00401043 push ecx
0x00401044 call dword [rand] ; pdb.__imp__rand
; 0x4020c4
0x0040104a and eax, 0x80000001
0x0040104f jns 0x401058
0x00401051 dec eax
0x00401052 or eax, 0xfffffffe ; 4294967294
0x00401055 add eax, 1
0x00401058 je 0x40106d
0x0040105a push str.Hello_World ; pdb.___C__0O_NFOCKKMG_Hello_5World__CB_6
; 0x402108 ; const char *format
0x0040105f call pdb._printf ; int printf(const char *format)
0x00401064 add esp, 4
0x00401067 xor eax, eax
0x00401069 mov esp, ebp
0x0040106b pop ebp
0x0040106c ret
0x0040106d push 0
0x0040106f push str.Hello_World2_d ; pdb.___C__0BB_DODJFBPJ_Hello_5World2__CFd__CB_6
; 0x402118 ; const char *format
0x00401074 call pdb._printf ; int printf(const char *format)
0x00401079 add esp, 8
0x0040107c xor eax, eax
0x0040107e mov esp, ebp
0x00401080 pop ebp
0x00401081 ret
As the question is old, the answers saying there's no [[likely]] / [[unlikely]] in MSVC, or that there's no impact are obsolete.
Latest MSVC supports [[likely]] / [[unlikely]] in /std:c++20 and /std:c++latest modes.
See demo on Godbolt's compiler explorer that shows the difference.
As can be seen from the link above, one visible effect on x86/x64 for an if-else statement is that the forward conditional jump is taken for the unlikely branch. Before C++20 and a supporting VS version, the same could be achieved by placing the likely branch in the if part and the unlikely branch in the else part, negating the condition as needed.
Note that the effect of such optimization is minimal. For frequently called code in a tight loop, the dynamic branch prediction would do the right thing anyway.
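For code that has to build both with and without the C++20 attributes, one option is to hide them behind a feature test. This is only a sketch: `ATTR_LIKELY`, `ATTR_UNLIKELY` and `parse_digit` are made-up names, and on compilers that lack the attributes the macros expand to nothing.

```cpp
#include <cassert>

#if defined(__has_cpp_attribute)
#  if __has_cpp_attribute(likely) && __has_cpp_attribute(unlikely)
#    define ATTR_LIKELY   [[likely]]
#    define ATTR_UNLIKELY [[unlikely]]
#  endif
#endif
#ifndef ATTR_LIKELY
#  define ATTR_LIKELY
#  define ATTR_UNLIKELY
#endif

// Returns the digit value, or -1 for a non-digit (assumed to be rare).
int parse_digit(char c) {
    if (c >= '0' && c <= '9') ATTR_LIKELY {
        return c - '0';
    } else ATTR_UNLIKELY {
        return -1;
    }
}
```

The attribute goes after the condition, on the statement it describes, which is why it is written after `if (...)` and after `else`.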
