Does ARM GCC have a builtin function for the assembly 'REV' instruction? - gcc

It's pretty common for compilers to have builtin intrinsic functions for processor features, but I'm having trouble finding them. Is there one to get at the 'REV' (reverse byte order of a word) instruction in ARM?
Where can I find the list of builtin functions?

Is there one to get at the 'REV' (reverse byte order of a word) instruction in ARM?
There is a more 'portable' form that is available on all architectures. It is __builtin_bswap32. For example the compiler explorer has,
unsigned int foo(unsigned int a)
{
return __builtin_bswap32(a);
}
Giving,
foo(unsigned int):
rev r0, r0
bx lr
This is better than a __builtin_rev would be, since such a builtin would only be available on certain ARM targets (and certainly only on ARM CPUs). You can use __builtin_bswap32 even on PowerPC, x86, etc.
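As a minimal sketch of the point above, the builtin can sit next to a plain shift-and-mask fallback for compilers that lack it. The function names swap32_builtin and swap32_portable are made up for this illustration; only __builtin_bswap32 itself comes from GCC/Clang.

```c
#include <stdint.h>

/* Byte swap through the GCC/Clang builtin. On ARM targets that have REV,
   the compiler is expected to emit that single instruction. */
uint32_t swap32_builtin(uint32_t x)
{
    return __builtin_bswap32(x);
}

/* Hypothetical fallback written in plain C for compilers without the builtin. */
uint32_t swap32_portable(uint32_t x)
{
    return (x >> 24)
         | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u)
         | (x << 24);
}
```

Both functions return the same value for every input, so the fallback can be selected with a preprocessor check without changing callers.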

Related

What C instructions do I need to use to get gcc's x86-64 autovectorizer to output pshufb opcodes?

I'd like gcc's autovectorization (i.e. not intrinsics) to convert 0xPQ to the 64-bit value 0xPQPQPQPQPQPQPQPQ using the ssse3 opcode pshufb. However, even though I can see pshufb opcodes being output by gcc for other uses (so the compiler is definitely able to output it), I can't work out the series of C instructions needed to trigger it for this particular case.
Any suggestions? Thanks!
I doubt that pshufb will be the most efficient solution, unless you intend to have the result in the lower part of an xmm register. If you do, provide an actual usage example.
If you write something like:
long long foo(char x)
{
long long ret;
std::memset(&ret, x, sizeof ret);
return ret;
}
Both gcc and clang essentially just multiply x by 0x0101010101010101 which is as fast as a pshufb (assuming you have that value in a register already). However, with imul you have the result already in a general purpose register (and no additional movq is required).
Godbolt compilation results: https://godbolt.org/z/dTvcsM (the -msse3 makes no difference, nor do other compilation options, as long as it is at least -O1).
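To make the equivalence concrete, here is a small C sketch (the answer's snippet is C++) showing that the memset form and an explicit multiplication by 0x0101010101010101 produce the same 64-bit value. The names broadcast_memset and broadcast_mul are invented for this example.

```c
#include <stdint.h>
#include <string.h>

/* Fill all eight bytes of the result with x, as in the memset version above. */
uint64_t broadcast_memset(uint8_t x)
{
    uint64_t ret;
    memset(&ret, x, sizeof ret);
    return ret;
}

/* The same value computed explicitly: each byte lane of the constant is 0x01,
   so the product places a copy of x in every byte. */
uint64_t broadcast_mul(uint8_t x)
{
    return (uint64_t)x * UINT64_C(0x0101010101010101);
}
```

Since x is at most 0xFF, no byte lane carries into the next one, which is why the multiplication is exact.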

How to instrument specific assembly instructions and get their arguments [closed]

Given any C/C++ source.c compiled with gcc
int func()
{
// bunch of code
...
}
will result in some assembly (example). . .
func():
str fp, [sp, #-4]!
add fp, sp, #0
sub sp, sp, #12
mov r3, #0
str r3, [fp, #-8]
mov r3, #55
sub sp, fp, #0
ldr fp, [sp], #4
bx lr
. . . which eventually gets turned into a binary source.obj
What I want is the ability to specify: before each assembly instruction X, call my custom function and pass as arguments the arguments of instruction X
I'm really only interested in whether a given assembly instruction executes. If I say I care about mult, I'm not necessarily saying I care whether a multiplication occurred in the original source. I understand that multiply by 2^N will result in a shift instruction. I get it.
Let's say I specify mov as the asm of interest.
The resulting assembly would be changed to the following
func():
str fp, [sp, #-4]!
add fp, sp, #0
sub sp, sp, #12
// custom code inserted here:
// I want to call another function with the arguments of **mov**
mov r3, #0
str r3, [fp, #-8]
// custom code inserted here:
// I want to call another function with the arguments of **mov**
mov r3, #55
sub sp, fp, #0
ldr fp, [sp], #4
bx lr
I understand that the custom code may have to push/pop any registers it uses depending how much gcc "knows" about it with respect to the registers it uses. The custom function may be a naked function
WHY
To toggle a pin to do real-time profiling every time instruction X is executed.
To record every time the arguments of X meet certain criteria.
Your question is unclear (even with the additional edit; the -finstrument-functions is not transforming assembler code, it is changing the way the compiler works, during optimizations and code generation; it works on intermediate compiler representations - probably at the GIMPLE level, not at the assembler or RTL level).
Perhaps you could code some GCC plugin which would work at the GIMPLE level (by adding an optimization pass transforming the appropriate GIMPLE; BTW the -finstrument-functions option is adding more passes). This could take months of work (you need to understand the internals of GCC), and you'll add your own instrumentation generating pass in the compiler.
Perhaps you are using some asm in your code. Then you might use some preprocessor macro to insert some code around it.
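A rough sketch of that macro idea, assuming you control the source that contains the asm statement: the macro calls a hook with the operand and then emits the asm. Everything here (HOOKED_ASM, before_asm_hook, the counters) is hypothetical, and the empty asm template is only a stand-in for a real instruction.

```c
/* Hypothetical bookkeeping updated by the hook. */
static unsigned long hook_calls;
static long last_operand;

/* Hypothetical hook, called before every wrapped asm statement. */
static void before_asm_hook(long operand)
{
    hook_calls++;
    last_operand = operand;
}

/* Call the hook with the operand, then emit the asm statement itself.
   The variadic part is passed through unchanged as the asm body. */
#define HOOKED_ASM(operand, ...)                 \
    do {                                         \
        before_asm_hook((long)(operand));        \
        __asm__ volatile(__VA_ARGS__);           \
    } while (0)

long load_constant(void)
{
    long value = 55;
    /* Empty template: stands in for a real instruction. The "+r" constraint
       tells the compiler that value is read and written in a register. */
    HOOKED_ASM(value, "" : "+r"(value));
    return value;
}
```

This only instruments asm statements you wrote yourself; it does nothing for instructions the compiler generates from ordinary C code.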
Perhaps you want to change your ABI or calling conventions (or the way GCC is generating assembler code). Then you need to patch the compiler itself (and implement a new target in it). This might require more than a year of work.
Be aware of various optimizations done by GCC. Sometimes you might want volatile asm instead of just asm.
My documentation page of GCC MELT gives many slides and links which should help you.
Is it possible to do this with any compiler?
Both GCC and Clang are free software, so you can study their source code and improve it for your needs. But both are very complex (many millions of lines of source code), and you'll need several years of work to fork them. By the time you did that, they would evolve significantly.
what I’d like to do is choose a set of assembly instructions - like { add, jump } - and tell the compiler to insert a snippet of my own custom assembly code just before any instruction in that set
You should read some book on compilers (e.g. the Dragon Book) and read another book on Instruction Set Architecture and Computer Architecture. You can't just insert arbitrary instructions into the assembler code generated by the compiler (because what you insert requires some processor resources that the compiler manages, e.g. through register allocation etc...)
after edition
// I want to call another function with the arguments of mov
mov r3, #0
This is not possible (or very difficult) in general. Because calling that other function will use r3 and spoil its content.
gcc -c source.c -o source.obj
is the wrong way to use GCC. You want optimization (specially for production binaries). If you care about assembler code, use gcc -O -Wall -fverbose-asm -S source.c (perhaps -O2 -march=native instead of -O ...) then look into source.s
Let's say I specify mul as the asm of interest.
Again, that is the wrong approach. You care about multiplication in the source code, or in some intermediate representation. Perhaps mul might be emitted for x*3 without -O but probably not with -O2
think and work at the GIMPLE level not at the assembler level.
examples
First, look into the source code of GCC. It is free software. If you want to understand how -finstrument-functions really works, take a few months to read about GCC internals (I gave links and references), study the actual source code of GCC, and ask on gcc@gcc.gnu.org after that.
Now, imagine you want to count and instrument how many multiplications are done (which is not the same as how many IMUL instructions, e.g. because 8*x will probably be optimized into a shift machine code instruction). Of course it depends upon the optimizations enabled, and you'll work at the GIMPLE level. You'll probably increment some counter at the end of every GCC basic block. So after each BB exit you'll insert an additional GIMPLE statement. Such a simple instrumentation could need months of work.
Or imagine that you want to instrument loads to detect, when possible, undefined behavior or addressing issues. This is what the address sanitizer is doing. It took several years of work.
Things are much more complex than what you believe.
(it is not in vain that GCC has about ten million lines of source code; C compilers need to be complex today.)
If you don't care about the C source code, you should not care about GCC. The assembler code could be produced by Bones, by Clang, by a JVM implementation, by ocamlopt etc (and all these don't even use GCC). Or could be produced by some other version of GCC (not the one you are instrumenting).
So spend a few weeks reading more about compilers, then ask another question. That question should mention what kind of binary or assembler code you want to instrument. Instrumenting assembler code (or a binary executable) is a lot harder than instrumenting GCC, and it does not use textual techniques at all: it first extracts an abstracted form of the control flow graph, then refines and reasons on it.
BTW, you'll find lots of textbooks and conferences on both source instrumentation and binary instrumentation (these topics are different, even if in relation). Spend a few months reading them. Your naive textual approaches have some 1960-s smells which won't scale and won't work on today's software.
See also this talk (and video): Matt Godbolt “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” CppCon 2017

Why do x86-64 Linux system calls work with 6 registers set?

I'm writing a freestanding program in C that depends only on the Linux kernel.
I studied the relevant manual pages and learned that on x86-64 the Linux system call entry point receives the system call number and six arguments through the seven registers rax, rdi, rsi, rdx, r10, r8, and r9.
Does this mean that every system call accepts six arguments?
I researched the source code of several libc implementations in order to find out how they perform system calls. Interestingly, musl contains two distinct approaches to system calls:
src/internal/x86_64/syscall.s
This assembly source file defines one __syscall function that moves the system call number and exactly six arguments to the registers defined in the ABI. The generic name of the function hints that it can be used with any system call, despite the fact it always passes six arguments to the kernel.
arch/x86_64/syscall_arch.h
This C header file defines seven separate __syscallN functions, with N specifying their arity. This suggests that the benefit of passing only the exact number of arguments that the system call requires surpasses the cost of having and maintaining seven nearly identical functions.
So I tried it myself:
long
system_call(long number,
long _1, long _2, long _3, long _4, long _5, long _6)
{
long value;
register long r10 __asm__ ("r10") = _4;
register long r8 __asm__ ("r8") = _5;
register long r9 __asm__ ("r9") = _6;
__asm__ volatile ( "syscall"
: "=a" (value)
: "a" (number), "D" (_1), "S" (_2), "d" (_3), "r" (r10), "r" (r8), "r" (r9)
: "rcx", "r11", "cc", "memory");
return value;
}
int main(void) {
static const char message[] = "It works!" "\n";
/* system_call(write, standard_output, ...); */
system_call(1, 1, (long) message, sizeof message, 0, 0, 0);
return 0;
}
I ran this program and verified that it does write It works!\n to standard output. This left me with the following questions:
Why can I pass more parameters than the system call takes?
Is this reasonable, documented behavior?
What am I supposed to set the unused registers to?
Is 0 okay?
What will the kernel do with the registers it doesn't use?
Will it ignore them?
Is the seven function approach faster by virtue of having less instructions?
What happens to the other registers in those functions?
System calls accept up to 6 arguments, passed in registers (almost the same registers as the SysV x64 C ABI, with r10 replacing rcx but they are callee preserved in the syscall case), and "extra" arguments are simply ignored.
Some specific answers to your questions below.
The src/internal/x86_64/syscall.s is just a "thunk" which shifts all the arguments into the right place. That is, it converts from a C-ABI function which takes the syscall number and 6 more arguments, into a "syscall ABI" function with the same 6 arguments and the syscall number in rax. It works "just fine" for any number of arguments - the additional register movement will simply be ignored by the syscall if those arguments aren't used.
Since in the C-ABI all the argument registers are considered scratch (i.e., caller-save), clobbering them is harmless if you assume this __syscall method is called from C. In fact the kernel makes stronger guarantees about clobbered registers, clobbering only rcx and r11 so assuming the C calling convention is safe but pessimistic. In particular, the code calling __syscall as implemented here will unnecessarily save any argument and scratch registers per the C ABI, despite the kernel's promise to preserve them.
The arch/x86_64/syscall_arch.h file is pretty much the same thing, but in a C header file. Here, you want all seven versions (for zero to six arguments) because modern C compilers will warn or error if you call a function with the wrong number of arguments. So there is no real option to have "one function to rule them all" as in the assembly case. This also has the advantage of doing less work for syscalls that take fewer than 6 arguments.
Your listed questions, answered:
Why can I pass more parameters than the system call takes?
Because the calling convention is mostly register-based and caller cleanup. You can always pass more arguments in this situation (including in the C ABI) and the other arguments will simply be ignored by the callee. Since the syscall mechanism is generic at the C and .asm level, there is no real way the compiler can ensure you are passing the right number of arguments - you need to pass the right syscall id and the right number of arguments. If you pass less, the kernel will see garbage, and if you pass more, they will be ignored.
Is this reasonable, documented behavior?
Yes, sure - because the whole syscall mechanism is a "generic gate" into the kernel. 99% of the time you aren't going to use that: glibc wraps the vast majority of interesting syscalls in C ABI wrappers with the correct signature so you don't have to worry about it. Those are the ways that syscall access happens safely.
What am I supposed to set the unused registers to?
You don't set them to anything. If you use the C prototypes arch/x86_64/syscall_arch.h the compiler just takes care of it for you (it doesn't set them to anything) and if you are writing your own asm, you don't set them to anything (and you should assume they are clobbered after the syscall).
What will the kernel do with the registers it doesn't use?
It is free to use all the registers it wants, but will adhere to the kernel calling convention which is that on x86-64 all registers other than rax, rcx and r11 are preserved (which is why you see rcx and r11 in the clobber list in the C inline asm).
Is the seven function approach faster by virtue of having less instructions?
Yes, but the difference is very small since reg-reg mov instructions usually have zero latency and high throughput (up to 4/cycle) on recent Intel architectures. So moving an extra 6 registers perhaps takes something like 1.5 cycles for a syscall that is usually going to take at least 50 cycles even if it does nothing. So the impact is small, but probably measurable (if you measure very carefully!).
What happens to the other registers in those functions?
I'm not sure what you mean exactly, but the kernel can use the other registers just like any GP registers, as long as it preserves their values (e.g., by pushing them on the stack and then popping them later).
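A minimal sketch of the "extra arguments are ignored" point, assuming x86-64 Linux and GCC or Clang inline asm. The wrapper names raw_syscall0 and raw_syscall6 are invented here (they are not musl's), and getpid is used only because it takes no arguments and has no side effects.

```c
#include <sys/syscall.h>   /* SYS_getpid */
#include <unistd.h>        /* getpid, for comparison */

/* Zero-argument wrapper: only rax is set up before the syscall instruction. */
static long raw_syscall0(long number)
{
    long ret;
    __asm__ volatile ("syscall"
                      : "=a" (ret)
                      : "a" (number)
                      : "rcx", "r11", "memory");
    return ret;
}

/* Six-argument wrapper: all argument registers are loaded, whether or not
   the chosen system call looks at them. */
static long raw_syscall6(long number, long a1, long a2, long a3,
                         long a4, long a5, long a6)
{
    long ret;
    register long r10 __asm__ ("r10") = a4;
    register long r8  __asm__ ("r8")  = a5;
    register long r9  __asm__ ("r9")  = a6;
    __asm__ volatile ("syscall"
                      : "=a" (ret)
                      : "a" (number), "D" (a1), "S" (a2), "d" (a3),
                        "r" (r10), "r" (r8), "r" (r9)
                      : "rcx", "r11", "memory");
    return ret;
}
```

Calling raw_syscall6(SYS_getpid, 11, 22, 33, 44, 55, 66) should return the same process ID as raw_syscall0(SYS_getpid) and as the libc getpid(), because getpid never reads the argument registers.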

Subtract and detect underflow, most efficient way? (x86/64 with GCC)

I'm using GCC 4.8.1 to compile C code and I need to detect if underflow occurs in a subtraction on x86/64 architecture. Both operands are UNSIGNED. I know this is very easy in assembly, but I'm wondering if I can do it in C code and have GCC optimize it properly, because I can't find a way. This is a very heavily used function (or low-level, is that the term?) so I need it to be efficient, but GCC seems to be too dumb to recognize this simple operation? I tried so many ways to give it hints in C, but it always uses two registers instead of just a sub and a conditional jump. And to be honest I get annoyed seeing such stupid code written so MANY times (function is called a lot).
My best approach in C seemed to be the following:
if((a-=b)+b < b) {
// underflow here
}
Basically, subtract b from a, and if result underflows detect it and do some conditional processing (which is unrelated to a's value, for example, it brings an error, etc).
GCC seems too dumb to reduce the above to just a sub and a conditional jump, and believe me I tried so many ways to do it in C code, and tried a lot of command line options (-O3 and -Os included of course). What GCC does is something like this (Intel syntax assembly):
mov rax, rcx ; 'a' is in rcx
sub rcx, rdx ; 'b' is in rdx
cmp rax, rdx ; useless comparison since sub already sets flags
jc underflow
Needless to say the above is stupid, when all it needs is this:
sub rcx, rdx
jc underflow
This is so annoying because GCC does understand that sub modifies flags that way, since if I typecast it into a "int" it will generate the exact above except it uses "js" which is jump with sign, instead of carry, which will not work if the unsigned values difference is high enough to have the high bit set. Nevertheless it shows it is aware of the sub instruction affecting those flags.
Now, maybe I should give up on trying to make GCC optimize this properly and do it with inline assembly which I have no problems with. Unfortunately, this requires "asm goto" because I need a conditional JUMP, and asm goto is not very efficient with an output because it's volatile.
I tried something but I have no idea if it is "safe" to use or not. asm goto can't have outputs for some reason. I do not want to make it flush all registers to memory, that would kill the entire point I'm doing this which is efficiency. But if I use empty asm statements with outputs set to the 'a' variable before and after it, will that work and is it safe? Here's my macro:
#define subchk(a,b,g) { typeof(a) _a=a; \
asm("":"+rm"(_a)::"cc"); \
asm goto("sub %1,%0;jc %l2"::"r,m,r"(_a),"r,r,m"(b):"cc":g); \
asm("":"+rm"(_a)::"cc"); }
and using it like this:
subchk(a,b,underflow)
// normal code with no underflow
// ...
underflow:
// underflow occurred here
It's a bit ugly but it works just fine. On my test scenario, it compiles just FINE without volatile overhead (flushing registers to memory) without generating anything bad, and it seems it works ok, however this is just a limited test, I can't possibly test this everywhere I use this function/macro as I said it is used A LOT, so I'd like to know if someone is knowledgeable, is there something unsafe about the above construct?
Particularly, the value of 'a' is NOT NEEDED if underflow occurs, so with that in mind are there any side effects or unsafe stuff that can happen with my inline asm macro? If not I'll use it without problems till they optimize the compiler so I can replace it back after I guess.
Please don't turn this into a debate about premature optimizations or what not, stay on topic of the question, I'm fully aware of that, so thank you.
I probably miss something obvious, but why isn't this good?
extern void underflow(void) __attribute__((noreturn));
unsigned foo(unsigned a, unsigned b)
{
unsigned r = a - b;
if (r > a)
{
underflow();
}
return r;
}
I have checked, gcc optimizes it to what you want:
foo:
movl %edi, %eax
subl %esi, %eax
jb .L6
rep
ret
.L6:
pushq %rax
call underflow
Of course you can handle underflow however you want, I have just done this to keep the asm simple.
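For comparison, here is a sketch that puts the r > a check next to __builtin_sub_overflow. Note the assumption: that builtin exists only in GCC 5 and later (and recent Clang), so it is not available in the GCC 4.8.1 mentioned in the question. The function names are made up for this example.

```c
#include <stdbool.h>

/* Portable check, same idea as foo above: an unsigned difference that is
   larger than the minuend means the subtraction wrapped around. */
bool sub_check_portable(unsigned long a, unsigned long b, unsigned long *diff)
{
    *diff = a - b;
    return *diff > a;
}

/* Same result through the builtin (assumes GCC >= 5 or a recent Clang).
   It stores the wrapped difference and returns true on wraparound. */
bool sub_check_builtin(unsigned long a, unsigned long b, unsigned long *diff)
{
    return __builtin_sub_overflow(a, b, diff);
}
```

Both functions report underflow exactly when b is greater than a, and both leave the wrapped difference in *diff.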
How about the following assembly code (you can wrap it into GCC format):
sub rcx, rdx ; assuming operands are in rcx, rdx
setc al ; capture carry bit into AL (see Intel "setxx" instructions)
; return AL as boolean to compiler
Then you invoke/inline the assembly code, and branch on the resulting boolean.
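One way this might be wrapped in GCC extended asm is sketched below, assuming an x86-64 target and the default AT&T operand order (so sub %2, %0 computes %0 -= %2). The function name sub_with_carry is hypothetical.

```c
/* Subtract b from *a in place and return the carry flag (1 on underflow).
   Assumes x86-64 and GCC/Clang extended asm in AT&T syntax. */
static inline unsigned char sub_with_carry(unsigned long *a, unsigned long b)
{
    unsigned char carry;
    __asm__ ("sub %2, %0\n\t"     /* *a -= b, sets CF on unsigned underflow */
             "setc %1"            /* copy CF into a byte register or memory */
             : "+r" (*a), "=qm" (carry)
             : "r" (b)
             : "cc");
    return carry;
}
```

A caller would then write if (sub_with_carry(&a, b)) { ... }. The compiler still has to test the returned byte before branching, so this is not guaranteed to beat the plain C comparison.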
Have you tested whether this is actually faster? Modern x86 microarchitectures decode assembly instructions into sequences of simpler micro-operations. Some of them also do macro-op fusion, in which a sequence of assembly instructions is turned into a single micro-op. In particular, sequences like test %reg, %reg; jcc target are fused, probably because global processor flags are a bane of performance.
If cmp %reg, %reg; jcc target is macro-fused, gcc might use that to get faster code. In my experience, gcc is very good at scheduling and similar low-level optimizations.

How to force gcc to use all SSE (or AVX) registers?

I'm trying to write some computationally intensive code for Windows x64 target, with SSE or the new AVX instructions, compiling in GCC 4.5.2 and 4.6.1, MinGW64 (TDM GCC build, and some custom build). My compiler options are -O3 -mavx. (-m64 is implied)
In short, I want to perform some lengthy computation on 4 3D vectors of packed floats. That requires 4x3=12 xmm or ymm registers for storage, and 2 or 3 registers for temporary results. This should IMHO fit snugly in the 16 available SSE (or AVX) registers available for 64bit targets. However, GCC produces a very suboptimal code with register spilling, using only registers xmm0-xmm10 and shuffling data from and onto the stack. My question is:
Is there a way to convince GCC to use all the registers xmm0-xmm15?
To fix ideas, consider the following SSE code (for illustration only):
void example(vect<__m128> q1, vect<__m128> q2, vect<__m128>& a1, vect<__m128>& a2) {
for (int i=0; i < 10; i++) {
vect<__m128> v = q2 - q1;
a1 += v;
// a2 -= v;
q2 *= _mm_set1_ps(2.);
}
}
Here vect<__m128> is simply a struct of 3 __m128, with natural addition and multiplication by scalar. When the line a2 -= v is commented out, i.e. we need only 3x3 registers for storage since we are ignoring a2, the produced code is indeed straightforward with no moves, everything is performed in registers xmm0-xmm10. When I remove the comment a2 -= v, the code is pretty awful with a lot of shuffling between registers and stack. Even though the compiler could just use registers xmm11-xmm13 or something.
I actually haven't seen GCC use any of the registers xmm11-xmm15 anywhere in all my code yet. What am I doing wrong? I understand that they are callee-saved registers, but this overhead is completely justified by simplifying the loop code.
Two points:
First, you're making a lot of assumptions. Register spilling is pretty cheap on x86 CPUs (due to fast L1 caches and register shadowing and other tricks), and the 64-bit only registers are more costly to access (in terms of larger instructions), so it may just be that GCC's version is as fast, or faster, than the one you want.
Second, GCC, like any compiler, does the best register allocation it can. There's no "please do better register allocation" option, because if there was, it'd always be enabled. The compiler isn't trying to spite you. (Register allocation is an NP-complete problem, as I recall, so the compiler will never be able to generate a perfect solution. The best it can do is to approximate.)
So, if you want better register allocation, you basically have two options:
write a better register allocator, and patch it into GCC, or
bypass GCC and rewrite the function in assembly, so you can control exactly which registers are used when.
Actually, what you see aren't spills, it is gcc operating on a1 and a2 in memory because it can't know if they are aliased. If you declare the last two parameters as vect<__m128>& __restrict__ GCC can and will register allocate a1 and a2.
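The aliasing point can be sketched in plain C with the standard restrict qualifier (the C counterpart of __restrict__ on C++ references). This is a simplified stand-in, not the original code: vec3 uses scalar floats instead of __m128, and both the type and the function are made up for illustration.

```c
/* Simplified stand-in for vect<__m128>: three scalar floats. */
typedef struct { float x, y, z; } vec3;

/* restrict promises the compiler that a1 and a2 never alias each other,
   so it may keep both accumulators in registers across the loop instead
   of storing and reloading them on every iteration. */
void example(vec3 q1, vec3 q2, vec3 *restrict a1, vec3 *restrict a2)
{
    for (int i = 0; i < 10; i++) {
        vec3 v = { q2.x - q1.x, q2.y - q1.y, q2.z - q1.z };
        a1->x += v.x;  a1->y += v.y;  a1->z += v.z;
        a2->x -= v.x;  a2->y -= v.y;  a2->z -= v.z;
        q2.x *= 2.0f;  q2.y *= 2.0f;  q2.z *= 2.0f;
    }
}
```

The promise is the caller's responsibility: passing the same object for a1 and a2 would be undefined behavior.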
