GCC extended Asm - Understanding clobbers and scratch registers usage - gcc

From GCC documentation regarding Extended ASM - Clobbers and Scratch Registers I find it difficult to understand the following explanation and example:
Here is a fictitious sum of squares instruction, that takes two
pointers to floating point values in memory and produces a floating
point register output. Notice that x, and y both appear twice in the
asm parameters, once to specify memory accessed, and once to specify a
base register used by the asm.
Ok, first part understood, now the sentence continues:
You won’t normally be wasting a
register by doing this as GCC can use the same register for both
purposes. However, it would be foolish to use both %1 and %3 for x in
this asm and expect them to be the same. In fact, %3 may well not be a
register. It might be a symbolic memory reference to the object
pointed to by x.
Lost me.
The example:
asm ("sumsq %0, %1, %2"
: "+f" (result)
: "r" (x), "r" (y), "m" (*x), "m" (*y));
What does the example and the second part of the sentence tells us? whats is the added value of this code in compare to another? does this code will result in a more efficient memory flush (as explained earlier in the chapter)?

Whats is the added value of this code in compare to another?
Which "another"? The way I see it, if you have this kind of instruction, you pretty much have to use this code (the only alternative being a generic memory clobber).
The situation is, you need to pass in registers to the instruction but the actual operands are in memory. Hence you need to ensure two separate things, namely:
That your instruction gets the operands in registers. This is what the r constraints do.
That the compiler knows your asm block reads memory so the values of *x and *y have to be flushed prior to the block. This is what the m constraints are used for.
You can't use %3 instead of %1, because the m constraint uses memory reference syntax. E.g. while %1 may be something like %r0 or %eax, %3 may be (%r0), (%eax) or even just some_symbol. Unfortunately your hypothetical instruction doesn't accept these kinds of operands.

Related

How to know which register(s) is read from as it executed?

I have a question about these kinds of quizzes. What is the theory behind this?
Given the following instruction, which register(s) is read from as it is executed? (Select all that apply)
and $sp, $gp, $s4
A. $gp(answer)
B. $s4(answer)
C. $sp
D. None of these.
lb $sp, 7472($v1)
A. $v1(answer)
B. Program Counter
C. $sp
D. None of these.
This following manual is pretty good for MIPS assembly language.  It relates the instruction assembly form to a register transfer notation that describes what the processor does with that assembly instruction, for example, the first instruction is sll $rd, $rt, shamt, which does for its operation: R[$rd] ← R[$rt] << shamt.
A register is a target if the execution of the instruction assigns the register a value (which most likely changes the value held by the register, but doesn't have to; the old value held by the register is lost).  When there is a target, the register transfer notation will show how the register is updated, i.e. how the new value is computed.
You can determine which registers are sources vs. targets by looking at where they are in relation to the ← that represents assignment.  When on the left as in R[$rd] ←, that is the target of an assignment, hence register $rd is a target, whereas when they appear on the right of assignment, that is a source register, as in ← R[$rt] << shamt.
(As you may know, the $ is commonly used to prefix register names with the MIPS assemblers / assembly languages.)
The MIPS green sheet is also pretty good but is oriented toward machine code rather than assembly language, so you have to know the order of the machine code operands vs. the assembly form of the same instruction (which you can see from the first link).  In MIPS assembly language the target register, when present, is always the first operand, despite that in machine code the target register (when present) is always the last register field.
In the green sheet, that same MIPS instruction has the following definition:
Description
Mnemonic
Format
Operation
Shift Left Logical
sll
R
R[rd] = R[rt] << shamt
Not all instructions have target registers, for example, the load instructions have a target register, but the store instructions do not — the true target of store instructions is some memory location, so there is no target register.
Every instruction also informs the processor what instruction to run next.  Most instructions tell the processor to advance the program counter by 4 (the size in bytes of one MIPS instruction), which has the effect of saying that the next instruction is the one immediately following in memory from the currently executing instruction; this achieves the normal sequential execution of one instruction after the other.  (This behavior is so fundamental that most coursework and instruction manuals assume and gloss this aspect of instruction execution, noting for example, only when the PC is updated in manner other than sequential.)
Branch instructions interact with the program counter either in the normal way (to advance by 4 for sequential execution) or to move it backwards (e.g. to accomplish a loop) or forwards (e.g. to exit a loop or skip a then or else part).  Branch instructions also do not have a target register — their effect is solely with the program counter.

What is the purpose of the 40h REX opcode in ASM x64?

I've been trying to understand the purpose of the 0x40 REX opcode for ASM x64 instructions. Like for instance, in this function prologue from Kernel32.dll:
As you see they use push rbx as:
40 53 push rbx
But using just the 53h opcode (without the prefix) also produces the same result:
According to this site, the layout for the REX prefix is as follows:
So 40h opcode seems to be not doing anything. Can someone explain its purpose?
the 04xh bytes (i.e. 040h, 041h... 04fh) are indeed REX bytes. Each bit in the lower nibble has a meaning, as you listed in your question. The value 040h means that REX.W, REX.R, REX.X and REX.B are all 0. That means that adding this byte doesn't do anything to this instruction, because you're not overriding any default REX bits, and it's not an 8-bit instruction with AH/BH/CH/DH as an operand.
Moreover, the X, R and B bits all correspond to some operands. If your instruction doesn't consume these operands, then the corresponding REX bit is ignored.
I call this a dummy REX prefix, because it does nothing before a push or pop. I wondered whether it is allowed and your experience show that it is.
It is there because the people at Microsoft apparently generated the above code. I'd speculate that for the extra registers it is needed, so they generate it always and didn't bother to remove it when it is not needed. Another possibility is that the lengthening of the instruction has a subtle effect on scheduling and or aligning and can make the code faster. This of course requires detailed knowledge of the particular processor.
I'm working at an optimiser that looks at machine code. Dummy prefixes are helpful because they make the code more uniform; there are less cases to consider. Then as a last step superfluous prefixes can be removed among other things.

How to write asm code "bsrl" in golang

I need to write some asm code in golang. I read this question Is it possible to include inline assembly in Google Go code?, but not see how to write it.
Could anyone help me? thanks.
asm ("bsrl %1, %0;"
:"=r"(bits) /* output */
:"r"(value) ); /* input */
All the answers on the question you found say it's not possible to use inline-asm in Go, with any syntax. GNU C inline-asm syntax isn't going to help.
But fortunately, you don't need inline asm for bsr (which finds the bit-index of the highest set bit). Go 1.9 has an intrinsic / built-in function for bitwise operations that are close enough that they should compile efficiently.
Use math.bits.LeadingZeros32 to get lzcnt(x), which is 31-bsr(x) for non-zero x. This may cost extra instructions, especially on CPUs which only support bsr, not lzcnt (e.g. Intel pre-Haswell).
Or use Len32(x) - 1
Len32(x) returns the number of bits required to represent x. It returns 0 for x=0, and presumably it returns 1 for x=1, so it's bsr(x) + 1, with defined behaviour for 0 (thus potentially costing extra instructions). Hopefully Len32(x) - 1 can compile directly to a bsr.
Of course, if what you really wanted was lzcnt, then use LeadingZeros32 in the first place.
Note that bsr leaves the destination register unmodified for input = 0. Intel's docs only say with an undefined value, so compilers probably don't take advantage of this guarantee that AMD documents and Intel does provide in hardware.
At least in theory, though, Len32(x) - 1 could compile to a single bsr instruction if the compiler can prove that x is non-zero.

Why do x86-64 Linux system calls work with 6 registers set?

I'm writing a freestanding program in C that depends only on the Linux kernel.
I studied the relevant manual pages and learned that on x86-64 the Linux system call entry point receives the system call number and six arguments through the seven registers rax, rdi, rsi, rdx, r10, r8, and r9.
Does this mean that every system call accepts six arguments?
I researched the source code of several libc implementations in order to find out how they perform system calls. Interestingly, musl contains two distinct approaches to system calls:
src/internal/x86_64/syscall.s
This assembly source file defines one __syscall function that moves the system call number and exactly six arguments to the registers defined in the ABI. The generic name of the function hints that it can be used with any system call, despite the fact it always passes six arguments to the kernel.
arch/x86_64/syscall_arch.h
This C header file defines seven separate __syscallN functions, with N specifying their arity. This suggests that the benefit of passing only the exact number of arguments that the system call requires surpasses the cost of having and maintaining seven nearly identical functions.
So I tried it myself:
long
system_call(long number,
long _1, long _2, long _3, long _4, long _5, long _6)
{
long value;
register long r10 __asm__ ("r10") = _4;
register long r8 __asm__ ("r8") = _5;
register long r9 __asm__ ("r9") = _6;
__asm__ volatile ( "syscall"
: "=a" (value)
: "a" (number), "D" (_1), "S" (_2), "d" (_3), "r" (r10), "r" (r8), "r" (r9)
: "rcx", "r11", "cc", "memory");
return value;
}
int main(void) {
static const char message[] = "It works!" "\n";
/* system_call(write, standard_output, ...); */
system_call(1, 1, message, sizeof message, 0, 0, 0);
return 0;
}
I ran this program and verified that it does write It works!\n to standard output. This left me with the following questions:
Why can I pass more parameters than the system call takes?
Is this reasonable, documented behavior?
What am I supposed to set the unused registers to?
Is 0 okay?
What will the kernel do with the registers it doesn't use?
Will it ignore them?
Is the seven function approach faster by virtue of having less instructions?
What happens to the other registers in those functions?
System calls accept up to 6 arguments, passed in registers (almost the same registers as the SysV x64 C ABI, with r10 replacing rcx but they are callee preserved in the syscall case), and "extra" arguments are simply ignored.
Some specific answers to your questions below.
The src/internal/x86_64/syscall.s is just a "thunk" which shifts all the all the arguments into the right place. That is, it converts from a C-ABI function which takes the syscall number and 6 more arguments, into a "syscall ABI" function with the same 6 arguments and the syscall number in rax. It works "just fine" for any number of arguments - the additional register movement will simply be ignored by the syscall if those arguments aren't used.
Since in the C-ABI all the argument registers are considered scratch (i.e., caller-save), clobbering them is harmless if you assume this __syscall method is called from C. In fact the kernel makes stronger guarantees about clobbered registers, clobbering only rcx and r11 so assuming the C calling convention is safe but pessimistic. In particular, the code calling __syscall as implemented here will unnecessarily save any argument and scratch registers per the C ABI, despite the kernel's promise to preserve them.
The arch/x86_64/syscall_arch.h file is pretty much the same thing, but in a C header file. Here, you want all seven versions (for zero to six arguments) because modern C compilers will warn or error if you call a function with the wrong number of arguments. So there is no real option to have "one function to rule them all" as in the assembly case. This also has the advantage of doing less work syscalls that take less than 6 arguments.
Your listed questions, answered:
Why can I pass more parameters than the system call takes?
Because the calling convention is mostly register-based and caller cleanup. You can always pass more arguments in this situation (including in the C ABI) and the other arguments will simply be ignored by the callee. Since the syscall mechanism is generic at the C and .asm level, there is no real way the compiler can ensure you are passing the right number of arguments - you need to pass the right syscall id and the right number of arguments. If you pass less, the kernel will see garbage, and if you pass more, they will be ignored.
Is this reasonable, documented behavior?
Yes, sure - because the whole syscall mechanism is a "generic gate" into the kernel. 99% of the time you aren't going to use that: glibc wraps the vast majority of interesting syscalls in C ABI wrappers with the correct signature so you don't have to worry about. Those are the ways that syscall access happens safely.
What am I supposed to set the unused registers to?
You don't set them to anything. If you use the C prototypes arch/x86_64/syscall_arch.h the compiler just takes care of it for you (it doesn't set them to anything) and if you are writing your own asm, you don't set them to anything (and you should assume they are clobbered after the syscall).
What will the kernel do with the registers it doesn't use?
It is free to use all the registers it wants, but will adhere to the kernel calling convention which is that on x86-64 all registers other than rax, rcx and r11 are preserved (which is why you see rcx and r11 in the clobber list in the C inline asm).
Is the seven function approach faster by virtue of having less instructions?
Yes, but the difference is very small since the reg-reg mov instructions are usually have zero latency and have high throughput (up to 4/cycle) on recent Intel architectures. So moving an extra 6 registers perhaps takes something like 1.5 cycles for a syscall that is usually going to take at least 50 cycles even if it does nothing. So the impact is small, but probably measurable (if you measure very carefully!).
What happens to the other registers in those functions?
I'm not sure what you mean exactly, but the other registers can be used just like all GP registers, if the kernel wants to preserve their values (e.g., by pushing them on the stack and then poping them later).

How Does Assembly Code Generation Work?

I've been studying compiler design a lot recently. I've managed to get a strong grasp of the parsing stage, but am having a bit of trouble understanding how code generation works.
From what I've read, there seems to be 3 major steps in the code generation phase:
Instruction Selection (Greedy Tiling)
Instruction Scheduling
Register Allocation
Now, instruction scheduling is a little beyond what I'm trying to do at the moment, and I think with a bit more studying and prototyping, I can probably wrap my mind around the graph coloring algorithm for register allocation.
What stumps me is the first step, instruction selection. From what I've read about it, each instruction in a target machine language is represented by a tile; and the goal is to find the instructions that match the largest parts of the tree (hence the nickname, greedy tiling).
The thing I'm confused about is, how do you select instructions when they don't actually correspond 1:1 with the syntax tree?
Take for example, accumulator-based architectures like the Z80 or MIPs single instruction architecture. Performing even 16-bit integer arithmetic on a Z80 may require the use of the accumulator or shadow registers.
There are also some instructions that can only be used on certain registers despite them being general purpose.
Would I be right in assuming the following?
a) A tile may consist of a sequence of instructions that match a syntax tree pattern, rather than just a 1:1 match.
b) The code generator generates code for a stack-based architecture (or an architecture with infinite temporary registers) first and expands and substitutes instructions as necessary somehow during the register allocation phase.
a) A tile can emit arbitrary number of instructions. For example, if you have instruction like %x <- %y + %z, but the target machine has only two-address instructions, then a matching tile might emit the assembly sequence (destination is first operand)
mov %x, %y
add %x, %z
b) what kind of register (or const, or mem reference) is allowed as an operand to an instruction is determined by the instruction itself, hence the instruction selection phase has to work on representation with symbolic register names (pseudo registers). Register allocation phase may indeed emit addition instructions, e.g. spill/load code when a register of a required class is not available for allocation.
Check this
Survey on Instruction Selection: an Extensive and Modern Literature Review

Resources