Why did additional pointer arguments disappear in assembly? - gcc

C Code:
void PtrArg1(int* a,int* b,int* c, int* d, int* e, int* f)
{
return;
}
void PtrArg2(int* a,int* b,int* c, int* d, int* e, int* f, int* g, int* h)
{
return;
}
Compiling with
gcc -c -m64 -o basics basics.c -O0
Running
objdump -d basics -M intel -r
then results in the following disassembly (Intel syntax):
000000000000000b <PtrArg1>:
b: f3 0f 1e fa endbr64
f: 55 push rbp
10: 48 89 e5 mov rbp,rsp
13: 48 89 7d f8 mov QWORD PTR [rbp-0x8],rdi
17: 48 89 75 f0 mov QWORD PTR [rbp-0x10],rsi
1b: 48 89 55 e8 mov QWORD PTR [rbp-0x18],rdx
1f: 48 89 4d e0 mov QWORD PTR [rbp-0x20],rcx
23: 4c 89 45 d8 mov QWORD PTR [rbp-0x28],r8
27: 4c 89 4d d0 mov QWORD PTR [rbp-0x30],r9
2b: 90 nop
2c: 5d pop rbp
2d: c3 ret
000000000000002e <PtrArg2>:
2e: f3 0f 1e fa endbr64
32: 55 push rbp
33: 48 89 e5 mov rbp,rsp
36: 48 89 7d f8 mov QWORD PTR [rbp-0x8],rdi
3a: 48 89 75 f0 mov QWORD PTR [rbp-0x10],rsi
3e: 48 89 55 e8 mov QWORD PTR [rbp-0x18],rdx
42: 48 89 4d e0 mov QWORD PTR [rbp-0x20],rcx
46: 4c 89 45 d8 mov QWORD PTR [rbp-0x28],r8
4a: 4c 89 4d d0 mov QWORD PTR [rbp-0x30],r9
4e: 90 nop
4f: 5d pop rbp
50: c3 ret
The number of arguments differs for PtrArg1 and PtrArg2, but the assembly instructions are the same for both. Why?

This is due to the calling convention (System V AMD64 ABI, current version 1.0). The first six parameters are passed in integer registers, all others are pushed onto the stack.
After executing till location PtrArg2+0x4e, you get the following stack layout:
+----------+-----------------+
| offset | content |
+----------+-----------------+
| rbp-0x30 | f |
| rbp-0x28 | e |
| rbp-0x20 | d |
| rbp-0x18 | c |
| rbp-0x10 | b |
| rbp-0x8 | a |
| rbp+0x0 | saved rbp value |
| rbp+0x8 | return address |
| rbp+0x10 | g |
| rbp+0x18 | h |
+----------+-----------------+
Since g and h are pushed by the caller, you get the same disassembly for both functions. For the caller
void Caller()
{
PtrArg2(1, 2, 3, 4, 5, 6, 7, 8);
}
(I ommitted the necessary casts for clarity) we would get the following disassembly:
Caller():
push rbp
mov rbp, rsp
push 8
push 7
mov r9d, 6
mov r8d, 5
mov ecx, 4
mov edx, 3
mov esi, 2
mov edi, 1
call PtrArg2
add rsp, 16
nop
leave
ret
(see compiler explorer)
The parameters h = 8 and g = 7 are pushed onto the stack, before calling PtrArg2.

Disappear? What did you expect the function to do with them that the compiler would emit asm instructions to implement?
You literally return; as the only statement in a void function so there's nothing the function needs to do other than ret. If you compile with a normal level of optimization like -O2, that's all you'll get. Debug-mode code is usually not interesting to look at, and is full of redundant / useless stuff.
How to remove "noise" from GCC/clang assembly output?
The only reason you're seeing any instructions for some args is that you compiled in debug mode, i.e. the default optimization level of -O0, anti-optimized debug mode. Every C object (except register locals) has a memory address, and debug mode makes sure that the value is actually there in memory before/after every C statement. This means spilling register args to the stack on function entry. Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?
The x86-64 System V ABI's calling convention passes the first 6 integer args in registers, the rest on the stack. The stack args already have memory addresses; the compiler doesn't emit code to copy them down next to other local vars below the return address; that would be pointless. The callee "owns" its own stack args, i.e. it can store new values to the stack space where the caller wrote the args, so that space can be the true address of args even if the function were to modify them.

Related

instruction repeated twice when decoded into machine language,

Am basically learning how to make my own instruction in the X86 architecture, but to do that I am understanding how they are decoded and and interpreted to a low level language,
By taking an example of a simple mov instruction and using the .byte notation I wanted to understand in detail as to how instructions are decoded,
My simple code is as follows:
#include <stdio.h>
#include <iostream>
int main(int argc, char const *argv[])
{
int x{5};
int y{0};
// mov %%eax, %0
asm (".byte 0x8b,0x45,0xf8\n\t" //mov %1, eax
".byte 0x89, 0xC0\n\t"
: "=r" (y)
: "r" (x)
);
printf ("dst value : %d\n", y);
return 0;
}
and when I use objdump to analyze how it is broken down to machine language, i get the following output:
000000000000078a <main>:
78a: 55 push %ebp
78b: 48 dec %eax
78c: 89 e5 mov %esp,%ebp
78e: 48 dec %eax
78f: 83 ec 20 sub $0x20,%esp
792: 89 7d ec mov %edi,-0x14(%ebp)
795: 48 dec %eax
796: 89 75 e0 mov %esi,-0x20(%ebp)
799: c7 45 f8 05 00 00 00 movl $0x5,-0x8(%ebp)
7a0: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%ebp)
7a7: 8b 45 f8 mov -0x8(%ebp),%eax
7aa: 8b 45 f8 mov -0x8(%ebp),%eax
7ad: 89 c0 mov %eax,%eax
7af: 89 45 fc mov %eax,-0x4(%ebp)
7b2: 8b 45 fc mov -0x4(%ebp),%eax
7b5: 89 c6 mov %eax,%esi
7b7: 48 dec %eax
7b8: 8d 3d f7 00 00 00 lea 0xf7,%edi
7be: b8 00 00 00 00 mov $0x0,%eax
7c3: e8 78 fe ff ff call 640 <printf#plt>
7c8: b8 00 00 00 00 mov $0x0,%eax
7cd: c9 leave
7ce: c3 ret
With regard to this output of objdump why is the instruction 7aa: 8b 45 f8 mov -0x8(%ebp),%eax repeated twice, any reason behind it or am I doing something wrong while using the .byte notation?
One of those is compiler-generated, because you asked GCC to have the input in its choice of register for you. That's what "r"(x) means. And you compiled with optimization disabled (the default -O0) so it actually stored x to memory and then reloaded it before your asm statement.
Your code has no business assuming anything about the contents of memory or where EBP points.
Since you're using 89 c0 mov %eax,%eax, the only safe constraints for your asm statement are "a" explicit-register constraints for input and output, forcing the compiler to pick that. If you compile with optimization enabled, your code totally breaks because you lied to the compiler about what your code actually does.
// constraints that match your manually-encoded instruction
asm (".byte 0x89, 0xC0\n\t"
: "=a" (y)
: "a" (x)
);
There's no constraint to force GCC to pick a certain addressing mode for a "m" source or "=m" dest operand so you need to ask for inputs/outputs in specific registers.
If you want to encode your own mov instructions differently from standard mov, see which MOV instructions in the x86 are not used or the least used, and can be used for a custom MOV extension - you might want to use a prefix in front of regular mov opcodes so you can let the assembler encode registers and addressing modes for you, like .byte something; mov %1, %0.
Look at the compiler-generate asm output (gcc -S, not disassembly of the .o or executable). Then you can see which instructions come from the asm statement and which are emitted by GCC.
If you don't explicitly reference some operands in the asm template but still want to see what the compiler picked, you can use them in asm comments like this:
asm (".byte 0x8b,0x45,0xf8 # 0 = %0 1 = %1 \n\t"
".byte 0x89, 0xC0\n\t"
: "=r" (y)
: "r" (x)
);
and gcc will fill it in for you so you can see what operands it expects you to be reading and writing. (Godbolt with g++ -m32 -O3). I put your code in void foo(){} instead of main because GCC -m32 thinks it needs to re-align the stack at the top of main. This makes the code a lot harder to follow.
# gcc-9.2 -O3 -m32 -fverbose-asm
.LC0:
.string "dst value : %d\n"
foo():
subl $20, %esp #,
movl $5, %eax #, tmp84
## Notice that GCC hasn't set up EBP at all before it runs your asm,
## and hasn't stored x in memory.
## It only put it in a register like you asked it to.
.byte 0x8b,0x45,0xf8 # 0 = %eax 1 = %eax # y, tmp84
.byte 0x89, 0xC0
pushl %eax # y
pushl $.LC0 #
call printf #
addl $28, %esp #,
ret
Also note that if you were compiling as 64-bit, it would probably pick %esi as a register because printf will want its 2nd arg there. So the "a" instead of "r" constraint would actually matter.
You could get 32-bit GCC to use a different register if you were assigning to a variable that has to survive across a function call; then GCC would pick a call-preserved reg like EBX instead of EAX.

Understanding GCC's alloca() alignment and seemingly missed optimization

Consider the following toy example that allocates memory on the stack by means of the alloca() function:
#include <alloca.h>
void foo() {
volatile int *p = alloca(4);
*p = 7;
}
Compiling the function above using gcc 8.2 with -O3 results in the following assembly code:
foo:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
leaq 15(%rsp), %rax
andq $-16, %rax
movl $7, (%rax)
leave
ret
Honestly, I would have expected a more compact assembly code.
16-byte alignment for allocated memory
The instruction andq $-16, %rax in the code above results in rax containing the (only) 16-byte-aligned address between the addresses rsp and rsp + 15 (both inclusive).
This alignment enforcement is the first thing I don't understand: Why does alloca() align the allocated memory to a 16-byte boundary?
Possible missed optimization?
Let's consider anyway that we want the memory allocated by alloca() to be 16-byte aligned. Even so, in the assembly code above, keeping in mind that GCC assumes the stack to be aligned to a 16-byte boundary at the moment of performing the function call (i.e., call foo), if we pay attention to the status of the stack inside foo() just after pushing the rbp register:
Size Stack RSP mod 16 Description
-----------------------------------------------------------------------------------
------------------
| . |
| . |
| . |
------------------........0 at "call foo" (stack 16-byte aligned)
8 bytes | return address |
------------------........8 at foo entry
8 bytes | saved RBP |
------------------........0 <----- RSP is 16-byte aligned!!!
I think that by taking advantage of the red zone (i.e., no need to modify rsp) and the fact that rsp already contains a 16-byte aligned address, the following code could be used instead:
foo:
pushq %rbp
movq %rsp, %rbp
movl $7, -16(%rbp)
leave
ret
The address contained in the register rbp is 16-byte aligned, therefore rbp - 16 will also be aligned to a 16-byte boundary.
Even better, the creation of the new stack frame can be optimized away, since rsp is not modified:
foo:
movl $7, -8(%rsp)
ret
Is this just a missed optimization or I am missing something else here?
This is (partially) missed optimization in gcc. Clang does it as expected.
I said partially because if you know you will be using gcc you can use builtin functions (use conditional compilation for gcc and other compilers to have portable code).
__builtin_alloca_with_align is your friend ;)
Here is an example (changed so the compiler will not reduce function call to single ret):
#include <alloca.h>
volatile int* p;
void foo()
{
p = alloca(4) ;
*p = 7;
}
void zoo()
{
// aligment is 16 bits, not bytes
p = __builtin_alloca_with_align(4,16) ;
*p = 7;
}
int main()
{
foo();
zoo();
}
Disassembled code (with objdump -d -w --insn-width=12 -M intel)
Clang will produce the following code (clang -O3 test.c) - both functions look alike
0000000000400480 <foo>:
400480: 48 8d 44 24 f8 lea rax,[rsp-0x8]
400485: 48 89 05 a4 0b 20 00 mov QWORD PTR [rip+0x200ba4],rax # 601030 <p>
40048c: c7 44 24 f8 07 00 00 00 mov DWORD PTR [rsp-0x8],0x7
400494: c3 ret
00000000004004a0 <zoo>:
4004a0: 48 8d 44 24 fc lea rax,[rsp-0x4]
4004a5: 48 89 05 84 0b 20 00 mov QWORD PTR [rip+0x200b84],rax # 601030 <p>
4004ac: c7 44 24 fc 07 00 00 00 mov DWORD PTR [rsp-0x4],0x7
4004b4: c3 ret
GCC this one (gcc -g -O3 -fno-stack-protector)
0000000000000620 <foo>:
620: 55 push rbp
621: 48 89 e5 mov rbp,rsp
624: 48 83 ec 20 sub rsp,0x20
628: 48 8d 44 24 0f lea rax,[rsp+0xf]
62d: 48 83 e0 f0 and rax,0xfffffffffffffff0
631: 48 89 05 e0 09 20 00 mov QWORD PTR [rip+0x2009e0],rax # 201018 <p>
638: c7 00 07 00 00 00 mov DWORD PTR [rax],0x7
63e: c9 leave
63f: c3 ret
0000000000000640 <zoo>:
640: 48 8d 44 24 fc lea rax,[rsp-0x4]
645: c7 44 24 fc 07 00 00 00 mov DWORD PTR [rsp-0x4],0x7
64d: 48 89 05 c4 09 20 00 mov QWORD PTR [rip+0x2009c4],rax # 201018 <p>
654: c3 ret
As you can see zoo now looks like expected and similar to clang code.
The x86-64 System V ABI requires VLAs (C99 Variable Length Arrays) to be 16-byte aligned, same for automatic / static arrays that are >= 16 bytes.
It looks like gcc is treating alloca as a VLA, and failing to do constant-propagation into an alloca that only runs once per function call. (Or that it internally uses alloca for VLAs.)
A generic alloca / VLA can't use the red-zone, in case the runtime value is larger than 128 bytes. GCC also makes a stack frame with RBP instead of saving the allocation size and doing an add rsp, rdx later.
So the asm looks exactly like what it would if the size was a function arg or other runtime variable instead of a constant. That's what led me to this conclusion.
Also alignof(maxalign_t) == 16 , but alloca and malloc can satisfy the requirement to return memory usable for any object without 16-byte alignment for objects smaller than 16 bytes. None of the standard types have alignment requirements wider than their size in x86-64 SysV.
You're right, it should be able to optimize it to this:
void foo() {
alignas(16) int dummy[1];
volatile int *p = dummy; // alloca(4)
*p = 7;
}
and compile it to the movl $7, -8(%rsp) ; ret you suggested.
The alignas(16) might be optional here for alloca.
If you really need gcc to emit better code when constant propagation makes the arg to alloca a compile-time constant, you could consider simply using a VLA in the first place. GNU C++ supports C99-style VLAs in C++ mode, but ISO C++ (and MSVC) don't.
Or possibly use if(__builtin_constant_p(size)) { VLA version } else { alloca version }, but scoping of VLAs means you can't return a VLA from the scope of an if that detects that we're being inlined with a compile-time constant size. So you'd have to duplicate the code that needs the pointer.

gcc likely() unlikely() macros and assembly code

I'm trying to see how gcc's likely() and unlikely() branch prediction macros has effect on assembly code. In the following piece of code I don't see any difference in the generated assembly code regardless of which macro i use. Any pointers on what's happening?
0 int main() {
1 volatile int x;
2 unlikely(x)?x++:x--;
3 }
Asm code:
0 0000000000000014 <main>:
1 int main() {
2 14: 55 push rbp
3 15: 48 89 e5 mov rbp,rsp
4 volatile int x;
5 likely(x)?x++:x--;
6 18: 8b 45 fc mov eax,DWORD PTR [rbp-0x4]
7 1b: 85 c0 test eax,eax
8 1d: 0f 95 c0 setne al
9 20: 0f b6 c0 movzx eax,al
10 23: 48 85 c0 test rax,rax
11 26: 74 0b je 33 <main+0x1f>
12 28: 8b 45 fc mov eax,DWORD PTR [rbp-0x4]
13 2b: 83 c0 01 add eax,0x1
14 2e: 89 45 fc mov DWORD PTR [rbp-0x4],eax
15 31: eb 09 jmp 3c <main+0x28>
16 33: 8b 45 fc mov eax,DWORD PTR [rbp-0x4]
17 36: 83 e8 01 sub eax,0x1
18 39: 89 45 fc mov DWORD PTR [rbp-0x4],eax
19 }
20 3c: 5d pop rbp
21 3d: c3 ret
It looks like you compiled without optimization. Basic block reordering is an optimization, so without it, __builtin_expect does not have this effect. With optimization, I observe that the sense of the branch is inverted when switching the expected result.
Note that whether this has any effect on current x86 processors is difficult to say.

Is tooling available to 'assemble' WebAssembly to x86-64 native code?

I am guessing that a Wasm binary is usually JIT-compiled to native code, but given a Wasm source, is there a tool to see the actual generated x86-64 machine code?
Or asked in a different way, is there a tool that consumes Wasm and outputs native code?
The online WasmExplorer compiles C code to both WebAssembly and FireFox x86, using the SpiderMonkey compiler. Given the following simple function:
int testFunction(int* input, int length) {
int sum = 0;
for (int i = 0; i < length; ++i) {
sum += input[i];
}
return sum;
}
Here is the x86 output:
wasm-function[0]:
sub rsp, 8 ; 0x000000 48 83 ec 08
cmp esi, 1 ; 0x000004 83 fe 01
jge 0x14 ; 0x000007 0f 8d 07 00 00 00
0x00000d:
xor eax, eax ; 0x00000d 33 c0
jmp 0x26 ; 0x00000f e9 12 00 00 00
0x000014:
xor eax, eax ; 0x000014 33 c0
0x000016: ; 0x000016 from: [0x000024]
mov ecx, dword ptr [r15 + rdi] ; 0x000016 41 8b 0c 3f
add eax, ecx ; 0x00001a 03 c1
add edi, 4 ; 0x00001c 83 c7 04
add esi, -1 ; 0x00001f 83 c6 ff
test esi, esi ; 0x000022 85 f6
jne 0x16 ; 0x000024 75 f0
0x000026:
nop ; 0x000026 66 90
add rsp, 8 ; 0x000028 48 83 c4 08
ret
You can view this example online.
WasmExplorer compiles code into wasm / x86 via a service - you can see the scripts that are run on Github - you should be able to use these to construct a command-line tool yourself.

How can I insert repeated NOP statements using Visual C++'s inline assembler?

Visual C++, using Microsoft's compiler, allows us to define inline assembly code using:
__asm {
nop
}
What I need is a macro that makes possible to multiply such instruction n times like:
ASM_EMIT_MULT(op, times)
for example:
ASM_EMIT_MULT(0x90, 160)
Is that possible? How could I do this?
With MASM, this is very simple to do. Part of the installation is a file named listing.inc (since everyone gets MASM as part of Visual Studio now, this will be located in your Visual Studio root directory/VC/include). This file defines a series of npad macros that take a single size argument and expand to an appropriate sequence of non-destructive "padding" opcodes. If you only need one byte of padding, you use the obvious nop instruction. But rather than using a long series of nops until you reach the desired length, Intel actually recommends other non-destructive opcodes of the appropriate length, as do other vendors. These pre-defined npad macros free you from having to memorize that table, not to mention making the code much more readable.
Unfortunately, inline assembly is not a full-featured assembler. There are a lot of things missing that you would expect to find in real assemblers like MASM. Macros (MACRO) and repeats (REPEAT/REPT) are among the things that are missing.
However, ALIGN directives are available in inline assembly. These will generate the required number of nops or other non-destructive opcodes to enforce alignment of the next instruction. Using this is drop-dead simple. Here is a very stupid example, where I've taken working code and peppered it with random aligns:
unsigned long CountDigits(unsigned long value)
{
__asm
{
mov edx, DWORD PTR [value]
bsr eax, edx
align 4
xor eax, 1073741792
mov eax, DWORD PTR [4 * eax + kMaxDigits+132]
align 16
cmp edx, DWORD PTR [4 * eax + kPowers-4]
sbb eax, 0
align 8
}
}
This generates the following output (MSVC's assembly listings use npad x, where x is the number of bytes, just as you'd write it in MASM):
PUBLIC CountDigits
_TEXT SEGMENT
_value$ = 8
CountDigits PROC
00000 8b 54 24 04 mov edx, DWORD PTR _value$[esp-4]
00004 0f bd c2 bsr eax, edx
00007 90 npad 1 ;// enforcing the "align 4"
00008 35 e0 ff ff 3f xor eax, 1073741792
0000d 8b 04 85 84 00
00 00 mov eax, DWORD PTR _kMaxDigits[eax*4+132]
00014 eb 0a 8d a4 24
00 00 00 00 8d
49 00 npad 12 ;// enforcing the "align 16"
00020 3b 14 85 fc ff
ff ff cmp edx, DWORD PTR _kPowers[eax*4-4]
00027 83 d8 00 sbb eax, 0
0002a 8d 9b 00 00 00
00 npad 6 ;// enforcing the "align 8"
00030 c2 04 00 ret 4
CountDigits ENDP
_TEXT ENDS
If you aren't actually wanting to enforce alignment, but just want to insert an arbitrary number of nops (perhaps as filler for later hot-patching?), then you can use C macros to simulate the effect:
#define NOP1 __asm { nop }
#define NOP2 NOP1 NOP1
#define NOP4 NOP2 NOP2
#define NOP8 NOP4 NOP4
#define NOP16 NOP8 NOP8
// ...
#define NOP64 NOP16 NOP16 NOP16 NOP16
// ...etc.
And then pepper your code as desired:
unsigned long CountDigits(unsigned long value)
{
__asm
{
mov edx, DWORD PTR [value]
bsr eax, edx
NOP8
xor eax, 1073741792
mov eax, DWORD PTR [4 * eax + kMaxDigits+132]
NOP4
cmp edx, DWORD PTR [4 * eax + kPowers-4]
sbb eax, 0
}
}
to produce the following output:
PUBLIC CountDigits
_TEXT SEGMENT
_value$ = 8
CountDigits PROC
00000 8b 54 24 04 mov edx, DWORD PTR _value$[esp-4]
00004 0f bd c2 bsr eax, edx
00007 90 npad 1 ;// these are, of course, just good old NOPs
00008 90 npad 1
00009 90 npad 1
0000a 90 npad 1
0000b 90 npad 1
0000c 90 npad 1
0000d 90 npad 1
0000e 90 npad 1
0000f 35 e0 ff ff 3f xor eax, 1073741792
00014 8b 04 85 84 00
00 00 mov eax, DWORD PTR _kMaxDigits[eax*4+132]
0001b 90 npad 1
0001c 90 npad 1
0001d 90 npad 1
0001e 90 npad 1
0001f 3b 14 85 fc ff
ff ff cmp edx, DWORD PTR _kPowers[eax*4-4]
00026 83 d8 00 sbb eax, 0
00029 c2 04 00 ret 4
CountDigits ENDP
_TEXT ENDS
Or, even cooler, we can use a bit of template meta-programming magic to get the same effect in style. Just define the following template function and its specialization (important to prevent infinite recursion):
template <size_t N> __forceinline void npad()
{
npad<N-1>();
__asm { nop }
}
template <> __forceinline void npad<0>() { }
And use it like this:
unsigned long CountDigits(unsigned long value)
{
__asm
{
mov edx, DWORD PTR [value]
bsr eax, edx
}
npad<8>();
__asm
{
xor eax, 1073741792
mov eax, DWORD PTR [4 * eax + kMaxDigits+132]
}
npad<4>();
__asm
{
cmp edx, DWORD PTR [4 * eax + kPowers-4]
sbb eax, 0
}
}
That'll produce the desired output (exactly the same as the one just above) in all optimized builds—whether you optimize for size (/O1) or speed (/O2)—…but not in debugging builds. If you need it in debug builds, you'll have to resort to the C macros. :-(
Base on Cody Gray Answer and code example for metaprogramming using template recursion and inline or forceinline as stated on the code before
template <size_t N> __forceinline void npad()
{
npad<N-1>();
__asm { nop }
}
template <> __forceinline void npad<0>() { }
It won't work on visual studio, without setting some options and is not a guarantee it will work
Although __forceinline is a stronger indication to the compiler than
__inline, inlining is still performed at the compiler's discretion, but no heuristics are used to determine the benefits from inlining this function.
You can read more about this here https://learn.microsoft.com/en-us/cpp/error-messages/compiler-warnings/compiler-warning-level-4-c4714?view=vs-2019

Resources