Retrieving the ZF in GCC inline assembly - gcc

I need to use some x86 instructions that have no GCC intrinsics, such as BSF and BSR.
With GCC inline assembly, I can write something like the following
__INTRIN_INLINE unsigned char bsf64(unsigned long* const index, const uint64_t mask)
{
__asm__("bsf %[mask], %[index]" : [index] "=r" (*index) : [mask] "mr" (mask));
return mask ? 1 : 0;
}
Code like if (bsf64(x, y)) { /* use x */ } is translated by GCC to something like
0x000000010001bf04 <bsf64+0>: bsf %rax,%rdx
0x000000010001bf08 <bsf64+4>: test %rax,%rax
0x000000010001bf0b <bsf64+7>: jne 0x10001bf44 <...>
However if mask is zero, BSF already sets the ZF flag, so the test after bsf is redundant.
Instead of returning mask ? 1 : 0, is it possible to retrieve the ZF flag and return it, so that GCC does not generate the test?
EDIT: made the if example more clear
EDIT: In response to Damon, __builtin_ffsl generates even less optimal code. If I use the following code
int b = __builtin_ffsl(mask);
if (b) {
*index = b - 1;
return true;
} else {
return false;
}
GCC generates this assembly
0x000000000044736d <+1101>: bsf %r14,%r14
0x0000000000447371 <+1105>: cmove %r12,%r14
0x0000000000447375 <+1109>: add $0x1,%r14d
0x0000000000447379 <+1113>: je 0x4471c0 <...>
0x000000000044737f <+1119>: lea -0x1(%r14),%ecx
So the test is gone, but redundant conditional move, increment and decrement are generated.

A couple of remarks:
This is an "anti-optimization". You're trying to do a micro-optimization on something that the compiler already supports.
Your code does not generate the bsf instruction at all with my version of gcc with all optimization switches turned on. Looking at the code, that is not surprising, because you return mask, which is the source operand, not the destination operand (gcc uses AT&T syntax!). The asm statement is not volatile, so when its output is unused the compiler is intelligent enough to figure this out and drops the assembler code (which doesn't do anything) altogether.
There is an intrinsic function __builtin_ffsl which does essentially the same as your inline assembly, but correctly (it returns one plus the index of the lowest set bit, or zero if no bit is set). An intrinsic is no less portable than inline assembler, but easier for the compiler to optimize.
Using the intrinsic function results in a bsf cmov sequence on my compiler (assuming the calling code forces it to actually emit the instruction), which shows that the compiler uses the zero-flag just fine without an additional test instruction.
Returning a char when you want a bool is not the best possible hint for the compiler, though it will probably figure it out anyway most of the time. However, telling the compiler to use a bitscan instruction when you are really only interested in "zero or not zero" is certainly sub-optimal. if(x) and if(!x) work perfectly well for that matter. It would be different if you returned the result as reference, so you could reuse it in another place, but as it is, your code is only a very complicated way of writing if(x).
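On the original question of reusing ZF: one option that the remarks above do not cover is GCC's flag output operands (the "=@cc<cond>" constraints, available on x86 since GCC 6 and advertised by the __GCC_ASM_FLAG_OUTPUTS__ macro). The following is only a minimal sketch under that assumption; the name bsf64_flag is hypothetical, and the fallback branch uses __builtin_ctzll where flag outputs are not available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical variant of bsf64 from the question: returns 1 and stores the
   index of the lowest set bit when mask != 0, otherwise returns 0 and leaves
   *index untouched. */
static inline unsigned char bsf64_flag(unsigned long *index, uint64_t mask)
{
    assert(index != NULL);
#if defined(__x86_64__) && defined(__GCC_ASM_FLAG_OUTPUTS__)
    unsigned char found;
    uint64_t idx;
    /* "=@ccnz" asks the compiler to read the "ZF clear" condition as an
       output. BSF sets ZF exactly when the source operand is zero. */
    __asm__("bsf %[mask], %[idx]"
            : [idx] "=r" (idx), [found] "=@ccnz" (found)
            : [mask] "rm" (mask));
    if (found)
        *index = (unsigned long)idx;
    return found;
#else
    /* Portable fallback (assumption: a GCC-compatible compiler). */
    if (mask == 0)
        return 0;
    *index = (unsigned long)__builtin_ctzll(mask);
    return 1;
#endif
}
```

Used as if (bsf64_flag(&x, y)) { /* use x */ }, the compiler is then free to branch on the flag directly instead of emitting a separate test, although whether it actually does so depends on the compiler version and optimization level.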

Related

What do the GCC compiler messages tell me to focus on to get the loop to vectorize?

My problem: I am trying to get GCC to vectorize a nested loop.
Compiler flags I added to the basic flags:
-fopenmp
-march=native
-msse2 -mfpmath=sse
-ffast-math
-funsafe-math-optimizations
-ftree-vectorize
-fopt-info-vec-missed
Variables:
// local variables
float RTP, RPE, RR=0;
int SampleLoc, EN;
float THr;
long TN;
// global variables
extern float PR[MaxX - MinX][MaxY - MinY][MaxZ - MinZ];
extern float SampleOffset;
extern float SampleInt;
// global defines
Speed
Loops start at line 500. The code is parallelized but not vectorized. Here is the code:
#pragma omp parallel for
for (TN = 0; TN < NumXmtrFoci; TN++) {
for (XR = XlBound; XR <= XuBound; XR++) {
for (YR = YlBound; YR <= YuBound; YR++) {
for (ZR = ZlBound; ZR <= ZuBound; ZR++) {
for (EN = 0; EN < NUM_RCVR_ELE; EN++) {
RPE = REP[XR - MinX][YR - MinY][ZR - MinZ][EN];
RTP = RPT[XR - MinX][YR - MinY][ZR - MinZ][TN];
RR = RPE + RTP + ZT[TN];
SampleLoc = (int)(floor(RR/(SampleInt*Speed) + SampleOffset));
THr = TimeHistory[SampleLoc][EN];
PR[XR-MinX][YR-MinY][ZR-MinZ] += THr;
} /*for EN*/
} /*for ZR*/
} /*for YR*/
} /*for XR*/
} /*for TN*/
The loop bounds are all #define and range from -64 to 128. Loop iterator variables are of type int. Inner loop variables are of type float.
A sampling of the GCC 'NOT VECTORIZED' messages relevant to this loop (some are repeated many times at many places in the code):
502|note: not vectorized: multiple nested loops.|
505|note: not vectorized: not suitable for gather load _61 = TimeHistory[_59][_85];|
500|note: not vectorized: no grouped stores in basic block.|
500|note: not vectorized: not enough data-refs in basic block.|
MY QUESTION IS: What do the GCC compiler messages tell me to focus on to get the loop to vectorize?
I do not understand the messages well enough, and searching online has not provided answers so far. I thought multiple nested loops were not a problem. Also, what are gather loads, grouped stores, and data-refs?
The main point of the GCC report is that the expression TimeHistory[SampleLoc][EN] cannot be easily vectorized unless gather instructions are used. Indeed, SampleLoc is a variable that can contain non-linear values, so consecutive iterations read non-contiguous memory. Gather instructions are only available in the AVX2/AVX-512 instruction sets. Your compilation flags indicate that you use SSE2 and possibly a more advanced instruction set available on the target platform through -march=native (but the platform is not specified). Without AVX2/AVX-512, GCC cannot vectorize the loops because of this non-contiguous access pattern. In fact, AVX2 gather instructions are a bit limited compared to general reads, so the compiler may not be able to use them even when they are available. The full list of gather intrinsics/instructions is in Intel's intrinsics reference.
Additionally, the compiler needs to be sure TimeHistory is not modified in the loop. This seems trivial here, but in practice arrays can theoretically alias, so the compiler may fear possible aliasing and refuse to vectorize the gather loads because of a possible dependence between each read and the following writes. Replicating the last loop may help. Using the restrict keyword and const can also help.
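The restrict/const advice can be sketched as follows. This is only an illustration under stated assumptions, not the asker's code: sum_samples, history, rr, idx, scale and offset are hypothetical names, the history is flattened to one dimension, and scale stands for 1/(SampleInt*Speed). The index computation is split into its own loop over contiguous data, which is the part a compiler can vectorize without gather support; the indexed reads are kept in a second loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper. Stage 1 computes the sample indices from contiguous
   data; stage 2 performs the indexed (gather-like) reads. The restrict and
   const qualifiers promise the compiler that the arrays do not alias and that
   history is only read. */
float sum_samples(const float *restrict history,
                  const float *restrict rr,
                  int *restrict idx,
                  size_t n, float scale, float offset)
{
    assert(history != NULL && rr != NULL && idx != NULL);

    /* Stage 1: contiguous loads and stores only. */
    for (size_t i = 0; i < n; i++) {
        float x = rr[i] * scale + offset;
        int t = (int)x;          /* truncates toward zero */
        if ((float)t > x)        /* adjust so the result is floor(x) */
            t--;
        idx[i] = t;
    }

    /* Stage 2: non-contiguous reads; needs gather instructions to vectorize. */
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += history[idx[i]];
    return acc;
}
```

Whether either loop is actually vectorized still depends on the target flags and GCC version; -fopt-info-vec-missed will report what remains.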

scanf not working as expected in Frama-C

In the program below, function dec uses scanf to read an arbitrary input from the user.
dec is called from main and depending on the input it returns 1 or 0 and accordingly an operation will be performed. However, the value analysis indicates that y is always 0, even after the call to scanf. Why is that?
Note: the comments below apply to versions earlier than Frama-C 15 (Phosphorus, 20170501); in Frama-C 15, the Variadic plugin is enabled by default (and its short name is now -variadic).
Solution
Enable Variadic (-va) before running the value analysis (-val). This eliminates the warning, and the program behaves as expected.
Detailed explanation
Strictly speaking, Frama-C itself (the kernel) only does the parsing; it's up to the plug-ins themselves (e.g. Value/EVA) to evaluate the program.
From your description, I believe you must be using Value/EVA to analyze a program. I do not know exactly which version you are using, so I'll describe the behavior with Frama-C Silicon.
One limitation of ACSL (the specification language used by Frama-C) is that it is not currently possible to specify contracts for variadic functions such as scanf. Therefore, the specifications shipped with the Frama-C standard library are insufficient. You can notice this in the following program:
#include <stdio.h>
int d;
int main() {
scanf("%d", &d);
Frama_C_show_each(d);
return 0;
}
Running frama-c -val file.c will output, among other things:
...
[value] using specification for function scanf
FRAMAC_SHARE/libc/stdio.h:150:[value] warning: no \from part for clause 'assigns *__fc_stdin;' of function scanf
[value] Done for function scanf
[value] Called Frama_C_show_each({0})
...
That warning means that the specification is incorrect, which explains the odd behavior.
The solution in this case is to use the Variadic plug-in (-va, or -va-help for more details), which will specialize variadic calls and add specifications to them, thus avoiding the warning and behaving as expected. Here's the resulting code (-print) after running the Variadic plug-in on the example above:
$ frama-c -va file.c -print
[... lots of definitions from stdio.h ...]
/*@ requires valid_read_string(format);
requires \valid(param0);
ensures \initialized(param0);
assigns \result, *__fc_stdin, *param0;
assigns \result
\from (indirect: *__fc_stdin), (indirect: *(format + (0 ..)));
assigns *__fc_stdin
\from (indirect: *__fc_stdin), (indirect: *(format + (0 ..)));
assigns *param0
\from (indirect: *__fc_stdin), (indirect: *(format + (0 ..)));
*/
int scanf_0(char const *format, int *param0);
int main(void)
{
int __retres;
scanf_0("%d",& d);
Frama_C_show_each(d);
__retres = 0;
return __retres;
}
In this example, scanf was specialized to scanf_0, with a proper ACSL annotation. Running EVA on this program will not emit any warnings and produce the expected output:
$ frama-c -va file.c -val
...
[value] Done for function scanf_0
[value] Called Frama_C_show_each([-2147483648..2147483647])
...
Note: the GUI in Frama-C 14 (Silicon) does not allow the Variadic plug-in to be enabled (even after ticking it in the Analyses panel), so you must use the command line in this case to obtain the expected result and avoid the warning. Starting from Frama-C 15 (Phosphorus), this is no longer necessary: Variadic is enabled by default, so your example works from the start.

Can evaluation of functions happen during compile time?

Consider the below function,
public static int foo(int x){
return x + 5;
}
Now, let us call it,
int in = /*Input taken from the user*/;
int x = foo(10); // ... (1)
int y = foo(in); // ... (2)
Here, can the compiler change
int x = foo(10); // ... (1)
to
int x = 15; // ... (1)
by evaluating the function call at compile time, since the input to the function is available at compile time?
I understand this is not possible during the call marked (2) because the input is available only during run time.
I do not want to know a way of doing it in any specific language. I would like to know why this can or can not be a feature of a compiler itself.
C++ does have a method for this:
Read up on the constexpr keyword in C++11; it allows compile-time evaluation of functions.
In C++11 there is a limitation: the function body must consist of a single return statement (not multiple statements), although it can call other constexpr functions. C++14 relaxes this limitation.
static constexpr int foo(int x){
return x + 5;
}
EDIT:
Why a compiler might not evaluate a function (just my guess):
It might not be appropriate to remove a function by evaluating it without being told.
The function could be used in different compilation units, and with static/dynamic inputs: thus evaluating it in some circumstances and adding a call in other places.
This use would provide inconsistent execution times (especially on a deterministic platform like AVR) where timing may be important, or at least need to be predictable.
Also interrupts (and how the compiler interacts with them) may come into play here.
EDIT:
constexpr is actually stronger -- when the result is used where a constant expression is required, the compiler must evaluate the call at compile time. The compiler is free to fold away functions without constexpr, but the programmer can't rely on it doing so.
Can you give an example in the case where the user would have benefited from this but the compiler chose not to do it ?
inline functions may or may not resolve to constant expressions that can be optimized into the end result.
However, a constexpr guarantees it. An inline function cannot be used as a compile time constant whereas constexpr can allow you to formulate compile time functions and more so, objects.
A basic example where constexpr makes a guarantee that inline cannot.
constexpr int foo( int a, int b, int c ){
return a+b+c;
}
int array[ foo(1, 2, 3) ];
And the same as a simple object.
struct Foo{
constexpr Foo( int a, int b, int c ) : val(a+b+c){}
int val;
};
constexpr Foo foo( 1,2,4 );
int array[ foo.val ];
Unless foo.val is a compile time constant, the code above will not compile.
Even as just a function, an inline function has no guarantee. And the linker can also do inlining over multiple compilation units, after the syntax has been compiled (array bounds checked for integer constants).
This is kind of like meta-programming, but without the templates. Of course these examples do not do the topic justice, however very complex solutions would benefit from the ability to use objects and functional programming to achieve a result.
Yes, evaluation can happen during compile time. This comes under the heading of constant folding and function inlining, both of which are common optimizations for optimizing compilers.
Many languages do not have a strong distinction between "compile time" and "run time", but the general rule is that the language defines an "execution model" which defines the behavior of any particular program with any particular input (or specifies that it is undefined). The compiler must produce an executable that can read any input and produce the corresponding output as defined by the execution model. What happens inside the executable doesn't matter -- as long as the externally viewed behavior is correct.
Here "input", "output" and "behavior" includes all possible interactions with the environment that are defined in the execution model, including timing effects.
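The distinction between optional folding and a guaranteed compile-time constant can be sketched in C as well. This is a minimal illustration, assuming a C11 compiler; FOO_OF_10, call_with_constant and call_with_runtime are hypothetical names. The enum constant must be evaluated by the compiler, while the call foo(10) may or may not be folded to 15 depending on the optimizer; either way the observable result is the same.

```c
#include <assert.h>

/* Same function as in the question, written in C. */
static inline int foo(int x)
{
    return x + 5;
}

/* Guaranteed compile-time evaluation: an enum constant must be an integer
   constant expression, so a function call is not allowed here. */
enum { FOO_OF_10 = 10 + 5 };
static_assert(FOO_OF_10 == 15, "evaluated by the compiler");

/* Case (1): the argument is a constant. An optimizing compiler is allowed to
   inline foo and emit "return 15", but nothing in the language requires it. */
int call_with_constant(void)
{
    return foo(10);
}

/* Case (2): the argument is only known at run time, so the addition has to
   happen in the executable. */
int call_with_runtime(int in)
{
    return foo(in);
}
```

Comparing the assembly of call_with_constant at -O0 and -O2 is a quick way to see whether a given compiler performed the folding.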

NEON inline assembly - store query

I am trying to learn how to utilize NEON using gcc and inline assembly.
While it is confusing and slow going, I am making some progress (it's been 10 years since I last tried writing assembly).
My simple program loads a (small) vector, saturation sums it, and stores it. The problem I am having is that I cannot seem to store the result in the place I want.
When I use an unused array pointer (r) in my output list, I get an error "impossible constraint in asm". If I then create a second pointer to it (rptr), it assembles, but it re-uses an input register r2 which is a, effectively overwriting the input.
(I know my arrays are 32 elements in size and that I'm only processing one element, I plan to try to loop, or try load more registers for parallel processing next)
void vecSum()
{
//two input arrays of 32 bit types, one output
int32_t a[32];
int32_t b[32];
int32_t r[32];
//initialize
for(int cnt = 0; cnt < 32; cnt++)
{
a[cnt] = 0x33333333;
b[cnt] = 0x11111111;
r[cnt] = 0;
}
void *rptr = r;
__asm__ volatile(
"vld1.32 {d0},[%[ina]]!\n" //load the neon register with our data at a, post increment the reg
"vld1.32 {d1},[%[inb]]!\n"
"vqadd.s32 d0,d1\n" //perform the sat
"vst1.32 d0,[%[result]]\n" //store the answer
: [result]"=r" (rptr) /*r*/
: [ina] "r" (a), [inb] "r" (b)
: /*"d0", "d1", "d2"*/);
for(int g=0; g < 32; g++)
{
printf("0x[%d]%x ",g,a[g]);
}
}
Objdump:
for(int cnt = 0; cnt < 32; cnt++)
780: e3530080 cmp r3, #128 ; 0x80
784: 1afffff7 bne 768 <_Z8vecSum32v+0x28>
"vld1.32 {d1},[%[inb]]!\n"
"vqadd.s32 d0,d1\n" //perform the sat
"vst1.32 d0,[%[result]]\n"
: [result]"=r" (rptr)
: [ina] "r" (a), [inb] "r" (b)
: /*"d0", "d1", "d2"*/);
788: f422078f vld1.32 {d0}, [r2]
78c: f421178d vld1.32 {d1}, [r1]!
790: f2200011 vqadd.s32 d0, d0, d1
794: f402078f vst1.32 {d0}, [r2]
In summary, if I try vst1.32 d0,[%[result]] where result is the array pointer r, I get a compilation error. If I use rptr (another pointer to r), it compiles, but uses r2 (the array a) as the output.
Can anybody explain why I get the error outputting to r? And why the ptr to r is a?
rptr is declared as an output when it should be an input and "memory" is missing from the clobber list.
Alternatively you may put the arrays in structs and use the structs (rather than pointers) as arguments to the asm statement.
Consider if the asm contained add %[result], %[ina], %[inb]. There's no harm whatsoever in allocating r2 for both result and ina there. Since GCC doesn't go analysing the contents of the asm statement, its default assumption is that it contains a single instruction like that, so if yours is more complicated then you need to say so in order for things to work as expected.
Specifically, to prevent the problematic overlapping register allocation here, you need to be honest about the fact that your asm modifies the input registers - most simply via the + modifier (which then actually makes them outputs as far as GCC is concerned). Another unpleasant side effect of not doing that is that the compiler would assume that e.g. r1 still holds the address of b afterwards, and may generate later code relying on that, which will then go horribly wrong thanks to what the asm actually did.
Furthermore, you don't modify the result pointer, and only use its value as an input, so saying it's a write-only output operand is very wrong.
As for the issue with r, well, by specifying it as an output operand, you're saying that the asm writes a value back to that variable. Except you can't do that with an array variable in C (<languagelawyer> arrays are not modifiable lvalues) - you need to give the asm a variable which holds the address of the array and can be assigned back to, i.e. a pointer variable. The reason you can use the arrays directly as input operands, is because input operands are expressions, not variables, and an expression that evaluates to an array is automatically converted to a pointer to first element of that array (but is still not an lvalue </languagelawyer>).
All in all then, with appropriate pointer variables for a and b, suitable operands and constraints for this code as-is would look more like this:
: [ina] "+r" (aptr), [inb] "+r" (bptr)
: [result] "r" (r)
: "d0", "d1", "memory" /* getting clobbers right is also important */
Side note: if you just want to get to grips with NEON instructions rather than fighting with GCC, intrinsics are an alternative to consider.
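As a rough sketch of the intrinsics route mentioned in the side note, the load / saturating add / store sequence from the question could look like the following. The function name vec_sat_add2 is hypothetical; the NEON path assumes <arm_neon.h> and a target that defines __ARM_NEON, and a plain C fallback is included so the same file builds elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Saturating add of two int32 lanes: r[i] = sat(a[i] + b[i]) for i = 0, 1.
   This matches one 64-bit d register, as in the question's asm. */
void vec_sat_add2(const int32_t *a, const int32_t *b, int32_t *r)
{
    assert(a != NULL && b != NULL && r != NULL);
#if defined(__ARM_NEON)
    int32x2_t va = vld1_s32(a);          /* like vld1.32 {d0}, [a] */
    int32x2_t vb = vld1_s32(b);          /* like vld1.32 {d1}, [b] */
    vst1_s32(r, vqadd_s32(va, vb));      /* vqadd.s32, then vst1.32 to r */
#else
    /* Fallback without NEON: widen, add, clamp to the int32 range. */
    for (int i = 0; i < 2; i++) {
        int64_t s = (int64_t)a[i] + (int64_t)b[i];
        if (s > INT32_MAX) s = INT32_MAX;
        if (s < INT32_MIN) s = INT32_MIN;
        r[i] = (int32_t)s;
    }
#endif
}
```

With intrinsics the compiler handles register allocation and knows which memory is read and written, so no constraints or clobber lists are needed.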

Can pretty variable names be used for registers in GCC inline assembly?

I have some inline assembly. I want GCC to have total freedom in choosing GP registers to allocate. I also want to use pretty names for registers inside the assembly for ease of comprehension for future maintainers. I think I did this previously (10+ years ago) for ARM 5te but am now scratching my head while writing some AArch64 code.
In a simpler example, this is what I want:
uint32_t arg1 = 1, arg2 = 2, result;
asm volatile(
"add %result, %arg1, %arg2\n"
// Outputs:
: ???
// Inputs:
: ???
// Clobbered:
: ???
);
I figure I need the right voodoo to go where I have written "???" above.
Is it possible?
Yes.
[arg1] "r" (arg1)
For example. The two names ([arg1] and (arg1) above) can be different.
Inside the assembly code, you'd use:
add %[result], %[arg1], %[arg2]
See the GCC documentation on Extended Asm (symbolic operand names).
Here's your whole example reworked (case changed for the assembly variables just to illustrate that they needn't be the same):
uint32_t arg1 = 1, arg2 = 2, result;
asm volatile(
"add %[RESULT], %[ARG1], %[ARG2]\n"
: [RESULT]"=r"(result) /* output */
: [ARG1]"r"(arg1), [ARG2]"r"(arg2) /* inputs */
: /* no clobbers */
);
