Understanding Hello World's lea macOS x86-64 [duplicate] - macos

Consider the following variable reference in x64 Intel assembly, where the variable a is declared in the .data section:
mov eax, dword ptr [rip + _a]
I have trouble understanding how this variable reference works. Since a is a symbol corresponding to the runtime address of the variable (with relocation), how can [rip + _a] dereference the correct memory location of a? Indeed, rip holds the address of the current instruction, which is a large positive integer, so the addition results in an incorrect address of a?
Conversely, if I use x86 syntax (which is very intuitive):
mov eax, dword ptr [_a]
I get the following error: 32-bit absolute addressing is not supported in 64-bit mode.
Any explanation?
int a = 5;

int main() {
    int b = a;
    return b;
}
Compilation: gcc -S -masm=intel abs_ref.c -o abs_ref:
    .section __TEXT,__text,regular,pure_instructions
    .build_version macos, 10, 14
    .intel_syntax noprefix
    .globl _main                    ## -- Begin function main
    .p2align 4, 0x90
_main:                              ## @main
    .cfi_startproc
## %bb.0:
    push rbp
    .cfi_def_cfa_offset 16
    .cfi_offset rbp, -16
    mov rbp, rsp
    .cfi_def_cfa_register rbp
    mov dword ptr [rbp - 4], 0
    mov eax, dword ptr [rip + _a]
    mov dword ptr [rbp - 8], eax
    mov eax, dword ptr [rbp - 8]
    pop rbp
    ret
    .cfi_endproc
                                    ## -- End function
    .section __DATA,__data
    .globl _a                       ## @a
    .p2align 2
_a:
    .long 5                         ## 0x5

.subsections_via_symbols

GAS syntax for RIP-relative addressing looks like symbol + current_address (RIP), but it actually means symbol with respect to RIP.
There's an inconsistency with numeric literals:
[rip + 10] or AT&T 10(%rip) means 10 bytes past the end of this instruction
[rip + a] or AT&T a(%rip) means to calculate a rel32 displacement to reach a, not RIP + symbol value. (The GAS manual documents this special interpretation)
[a] or AT&T a is an absolute address, using a disp32 addressing mode. This isn't supported on OS X, where the image base address is always outside the low 32 bits. (Or for mov to/from al/ax/eax/rax, a 64-bit absolute moffs encoding is available, but you don't want that).
Linux position-dependent executables do put static code/data in the low 31 bits (2GiB) of virtual address space, so you can/should use mov edi, sym there, but on OS X your best option is lea rdi, [sym+RIP] if you need an address in a register. See: Unable to move variables in .data to registers with Mac x86 Assembly.
(In OS X, the convention is that C variable/function names are prepended with _ in asm. In hand-written asm you don't have to do this for symbols you don't want to access from C.)
NASM is much less confusing in this respect:
[rel a] means RIP-relative addressing for [a]
[abs a] means [disp32].
default rel or default abs sets what's used for [a]. The default is (unfortunately) default abs, so you almost always want a default rel.
Example with .set symbol values vs. a label
.intel_syntax noprefix
mov dword ptr [sym + rip], 0x11111111
sym:
.equ x, 8
inc byte ptr [x + rip]
.set y, 32
inc byte ptr [y + rip]
.set z, sym
inc byte ptr [z + rip]
gcc -nostdlib foo.s && objdump -drwC -Mintel a.out (on Linux; I don't have OS X):
0000000000001000 <sym-0xa>:
1000: c7 05 00 00 00 00 11 11 11 11 mov DWORD PTR [rip+0x0],0x11111111 # 100a <sym> # rel32 = 0; it's from the end of the instruction not the end of the rel32 or anywhere else.
000000000000100a <sym>:
100a: fe 05 08 00 00 00 inc BYTE PTR [rip+0x8] # 1018 <sym+0xe>
1010: fe 05 20 00 00 00 inc BYTE PTR [rip+0x20] # 1036 <sym+0x2c>
1016: fe 05 ee ff ff ff inc BYTE PTR [rip+0xffffffffffffffee] # 100a <sym>
(Disassembling the .o with objdump -dr will show you that there aren't any relocations for the linker to fill in, they were all done at assemble time.)
Notice that only .set z, sym resulted in a with-respect-to calculation. x and y originally came from plain numeric literals, not labels, so even though the instruction itself used [x + RIP], we still got [RIP + 8].
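The rel32 values in the disassembly above can be checked by hand. The following is only a sketch of the arithmetic, not of how GAS is implemented; rel32_for and its parameter names are made up for illustration. The displacement is measured from the end of the instruction, so it is the target address minus (instruction address + instruction length).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: compute the rel32 displacement an assembler would
   encode for a RIP-relative operand. RIP-relative offsets are taken from
   the address of the NEXT instruction, i.e. insn_addr + insn_len. */
static int32_t rel32_for(uint64_t insn_addr, unsigned insn_len, uint64_t target)
{
    int64_t delta = (int64_t)target - (int64_t)(insn_addr + insn_len);
    assert(delta >= INT32_MIN && delta <= INT32_MAX); /* must fit in 32 bits */
    return (int32_t)delta;
}
```

With the addresses from the listing, rel32_for(0x1000, 10, 0x100a) gives 0 for the first mov, and rel32_for(0x1016, 6, 0x100a) gives -18, which is 0xffffffee as an unsigned 32-bit value and matches the ee ff ff ff bytes of the last inc.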
(Linux non-PIE only): To address absolute 8 wrt. RIP, you'd need AT&T syntax incb 8-.(%rip). I don't know how to write that in GAS intel_syntax; [8 - . + RIP] is rejected with Error: invalid operands (*ABS* and .text sections) for '-'.
Of course you can't do that anyway on OS X, except maybe for absolute addresses that are in range of the image base. But there's probably no relocation that can hold the 64-bit absolute address to be calculated for a 32-bit rel32.
Related:
How to load address of function or label into register AT&T version of this
32-bit absolute addresses no longer allowed in x86-64 Linux? PIE vs. non-PIE executables, when you have to use position-independent code.

Related

printf in Windows x86-64 shellcode, strings on stack, and alignment

From x64 shellcode on Windows 10, I want to call printf("String: %s\n", "Hello World") with the string "Hello World" defined on the stack. I ran into problems that I’ve somewhat fixed, but I understand neither the problem nor my solution. I read that the x64 ABI mandates the stack to be 16-byte aligned before call instructions, but this does not seem to explain all of it.
After some playing around with nasm, I arrived at the following code, which works:
bits 64
default rel
segment .data
msg db "String: %s", 0xd, 0xa, 0
segment .text
global main
extern ExitProcess
extern printf
main:
push rbp
mov rbp, rsp
sub rsp, 64 ; relevant number 1
mov rdx, rsp
add rdx, 32 ; relevant number 2
mov dword [rdx], "Hell"
mov dword [rdx + 4], "o Wo"
mov dword [rdx + 8], "rld"
lea rcx, [msg]
call printf
xor rcx, rcx
call ExitProcess
Curiously, there are certain combinations of values for the rsp and the position rdx of the string on the stack that work and certain combinations that don’t. However, I don’t see the general rule.
Number 1   Number 2   Works?   Output
---------------------------------------------
64         0          no
64         8          no       x¨/(☻
64         16         no       Pf{tè☻
64         24         no       Ϩ;í
64         32         yes      Hello World
64         40         yes      Hello World
64         48         yes      Hello World
64         56         yes      Hello World
64         64         yes      Hello World
32         0          no
32         8          no       °²øRg
32         16         no       ákG×│☺
32         24         no       ↑°wd¢
32         32         yes      Hello World
What’s going on? Why do we need to store the string at least 32 bytes away from the rsp? This seems unrelated to the 16-byte alignment requirement.

What's the purpose of many int3's in a row in WinDbg disassembly of ntdll code? [duplicate]

This question already has answers here:
Visual C++ appends 0xCC (int3) bytes at the end of functions
Several int3 in a row
I'm learning assembly and after assembly of:
format PE64 NX GUI 6.0
entry start
section '.text' code readable executable
start:
int3
ret
running in my debugger (at the end of the OS loader code and also ) I see
...
00007fff`bc78070d 4889442428 mov qword ptr [rsp+28h], rax
00007fff`bc780712 488364242000 and qword ptr [rsp+20h], 0
00007fff`bc780718 e8cf90f9ff call ntdll!RtlStringCbPrintfExW (00007fff`bc7197ec)
00007fff`bc78071d 488b8c24e0010000 mov rcx, qword ptr [rsp+1E0h]
00007fff`bc780725 4833cc xor rcx, rsp
00007fff`bc780728 e813bbfbff call ntdll!_security_check_cookie (00007fff`bc73c240)
00007fff`bc78072d 4881c4f0010000 add rsp, 1F0h
00007fff`bc780734 5b pop rbx
00007fff`bc780735 c3 ret
00007fff`bc780736 cc int 3
00007fff`bc780737 cc int 3
00007fff`bc780738 cc int 3
00007fff`bc780739 cc int 3
00007fff`bc78073a cc int 3
00007fff`bc78073b cc int 3
00007fff`bc78073c cc int 3
00007fff`bc78073d cc int 3
00007fff`bc78073e cc int 3
00007fff`bc78073f cc int 3
ntdll!LdrpDoDebuggerBreak:
00007fff`bc780740 4883ec38 sub rsp, 38h
00007fff`bc780744 488364242000 and qword ptr [rsp+20h], 0
00007fff`bc78074a 41b901000000 mov r9d, 1
00007fff`bc780750 4c8d442440 lea r8, [rsp+40h]
00007fff`bc780755 418d5110 lea edx, [r9+10h]
00007fff`bc780759 48c7c1feffffff mov rcx, 0FFFFFFFFFFFFFFFEh
00007fff`bc780760 e84bcbfcff call ntdll!NtQueryInformationThread (00007fff`bc74d2b0)
00007fff`bc780765 85c0 test eax, eax
00007fff`bc780767 780a js ntdll!LdrpDoDebuggerBreak+0x33 (00007fff`bc780773)
00007fff`bc780769 807c244000 cmp byte ptr [rsp+40h], 0
00007fff`bc78076e 7503 jne ntdll!LdrpDoDebuggerBreak+0x33 (00007fff`bc780773)
00007fff`bc780770 cc int 3
...
Can someone explain what the purpose of multiple int3's in a row is? It reminds me of a nop slide, but I can't imagine why you'd need to do such a thing with a debug command. Or is this just bad disassembly?

Instruction repeated twice when decoded into machine language

I am learning how to make my own instruction for the x86 architecture, and to do that I want to understand how instructions are decoded and interpreted at a low level.
Taking a simple mov instruction as an example and using the .byte notation, I wanted to understand in detail how instructions are decoded.
My simple code is as follows:
#include <stdio.h>
#include <iostream>
int main(int argc, char const *argv[])
{
    int x{5};
    int y{0};
    // mov %%eax, %0
    asm (".byte 0x8b,0x45,0xf8\n\t" //mov %1, eax
         ".byte 0x89, 0xC0\n\t"
         : "=r" (y)
         : "r" (x)
        );
    printf ("dst value : %d\n", y);
    return 0;
}
When I use objdump to see how it is broken down into machine language, I get the following output:
000000000000078a <main>:
78a: 55 push %ebp
78b: 48 dec %eax
78c: 89 e5 mov %esp,%ebp
78e: 48 dec %eax
78f: 83 ec 20 sub $0x20,%esp
792: 89 7d ec mov %edi,-0x14(%ebp)
795: 48 dec %eax
796: 89 75 e0 mov %esi,-0x20(%ebp)
799: c7 45 f8 05 00 00 00 movl $0x5,-0x8(%ebp)
7a0: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%ebp)
7a7: 8b 45 f8 mov -0x8(%ebp),%eax
7aa: 8b 45 f8 mov -0x8(%ebp),%eax
7ad: 89 c0 mov %eax,%eax
7af: 89 45 fc mov %eax,-0x4(%ebp)
7b2: 8b 45 fc mov -0x4(%ebp),%eax
7b5: 89 c6 mov %eax,%esi
7b7: 48 dec %eax
7b8: 8d 3d f7 00 00 00 lea 0xf7,%edi
7be: b8 00 00 00 00 mov $0x0,%eax
7c3: e8 78 fe ff ff call 640 <printf@plt>
7c8: b8 00 00 00 00 mov $0x0,%eax
7cd: c9 leave
7ce: c3 ret
Given this objdump output, why does the instruction 7aa: 8b 45 f8 mov -0x8(%ebp),%eax appear twice? Is there a reason for it, or am I doing something wrong with the .byte notation?
One of those is compiler-generated, because you asked GCC to have the input in its choice of register for you. That's what "r"(x) means. And you compiled with optimization disabled (the default -O0) so it actually stored x to memory and then reloaded it before your asm statement.
Your code has no business assuming anything about the contents of memory or where EBP points.
Since you're using 89 c0 mov %eax,%eax, the only safe constraints for your asm statement are "a" explicit-register constraints for input and output, forcing the compiler to pick that. If you compile with optimization enabled, your code totally breaks because you lied to the compiler about what your code actually does.
// constraints that match your manually-encoded instruction
asm (".byte 0x89, 0xC0\n\t"
: "=a" (y)
: "a" (x)
);
There's no constraint to force GCC to pick a certain addressing mode for a "m" source or "=m" dest operand so you need to ask for inputs/outputs in specific registers.
If you want to encode your own mov instructions differently from standard mov, see which MOV instructions in the x86 are not used or the least used, and can be used for a custom MOV extension - you might want to use a prefix in front of regular mov opcodes so you can let the assembler encode registers and addressing modes for you, like .byte something; mov %1, %0.
Look at the compiler-generated asm output (gcc -S, not disassembly of the .o or executable). Then you can see which instructions come from the asm statement and which are emitted by GCC.
If you don't explicitly reference some operands in the asm template but still want to see what the compiler picked, you can use them in asm comments like this:
asm (".byte 0x8b,0x45,0xf8 # 0 = %0 1 = %1 \n\t"
".byte 0x89, 0xC0\n\t"
: "=r" (y)
: "r" (x)
);
and gcc will fill it in for you so you can see what operands it expects you to be reading and writing. (Godbolt with g++ -m32 -O3). I put your code in void foo(){} instead of main because GCC -m32 thinks it needs to re-align the stack at the top of main. This makes the code a lot harder to follow.
# gcc-9.2 -O3 -m32 -fverbose-asm
.LC0:
.string "dst value : %d\n"
foo():
subl $20, %esp #,
movl $5, %eax #, tmp84
## Notice that GCC hasn't set up EBP at all before it runs your asm,
## and hasn't stored x in memory.
## It only put it in a register like you asked it to.
.byte 0x8b,0x45,0xf8 # 0 = %eax 1 = %eax # y, tmp84
.byte 0x89, 0xC0
pushl %eax # y
pushl $.LC0 #
call printf #
addl $28, %esp #,
ret
Also note that if you were compiling as 64-bit, it would probably pick %esi as a register because printf will want its 2nd arg there. So the "a" instead of "r" constraint would actually matter.
You could get 32-bit GCC to use a different register if you were assigning to a variable that has to survive across a function call; then GCC would pick a call-preserved reg like EBX instead of EAX.

Understanding GCC's alloca() alignment and seemingly missed optimization

Consider the following toy example that allocates memory on the stack by means of the alloca() function:
#include <alloca.h>
void foo() {
volatile int *p = alloca(4);
*p = 7;
}
Compiling the function above using gcc 8.2 with -O3 results in the following assembly code:
foo:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
leaq 15(%rsp), %rax
andq $-16, %rax
movl $7, (%rax)
leave
ret
Honestly, I would have expected a more compact assembly code.
16-byte alignment for allocated memory
The instruction andq $-16, %rax in the code above results in rax containing the (only) 16-byte-aligned address between the addresses rsp and rsp + 15 (both inclusive).
This alignment enforcement is the first thing I don't understand: Why does alloca() align the allocated memory to a 16-byte boundary?
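The lea/and pair can be read as the usual round-up-to-a-multiple idiom. Here is a minimal C sketch of that arithmetic, assuming a power-of-two alignment; align_up is a hypothetical helper name, not something GCC or alloca() exposes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: round addr up to the next multiple of align.
   For align == 16 this is the same arithmetic as
   leaq 15(%rsp), %rax ; andq $-16, %rax in the listing above. */
static uintptr_t align_up(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0); /* power of two only */
    return (addr + (align - 1)) & ~(align - 1);
}
```

For example, align_up(0x1001, 16) and align_up(0x100f, 16) both yield 0x1010, while an already aligned 0x1000 stays at 0x1000, which is why exactly one address in the range rsp to rsp + 15 is selected.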
Possible missed optimization?
Let's consider anyway that we want the memory allocated by alloca() to be 16-byte aligned. Even so, in the assembly code above, keeping in mind that GCC assumes the stack to be aligned to a 16-byte boundary at the moment of performing the function call (i.e., call foo), if we pay attention to the status of the stack inside foo() just after pushing the rbp register:
Size Stack RSP mod 16 Description
-----------------------------------------------------------------------------------
------------------
| . |
| . |
| . |
------------------........0 at "call foo" (stack 16-byte aligned)
8 bytes | return address |
------------------........8 at foo entry
8 bytes | saved RBP |
------------------........0 <----- RSP is 16-byte aligned!!!
I think that by taking advantage of the red zone (i.e., no need to modify rsp) and the fact that rsp already contains a 16-byte aligned address, the following code could be used instead:
foo:
pushq %rbp
movq %rsp, %rbp
movl $7, -16(%rbp)
leave
ret
The address contained in the register rbp is 16-byte aligned, therefore rbp - 16 will also be aligned to a 16-byte boundary.
Even better, the creation of the new stack frame can be optimized away, since rsp is not modified:
foo:
movl $7, -8(%rsp)
ret
Is this just a missed optimization, or am I missing something else here?
This is (partially) a missed optimization in gcc. Clang does it as expected.
I said partially because if you know you will be using gcc you can use builtin functions (use conditional compilation for gcc and other compilers to have portable code).
__builtin_alloca_with_align is your friend ;)
Here is an example (changed so the compiler will not reduce function call to single ret):
#include <alloca.h>
volatile int* p;
void foo()
{
    p = alloca(4);
    *p = 7;
}
void zoo()
{
    // alignment is 16 bits, not bytes
    p = __builtin_alloca_with_align(4, 16);
    *p = 7;
}
int main()
{
    foo();
    zoo();
}
Disassembled code (with objdump -d -w --insn-width=12 -M intel)
Clang will produce the following code (clang -O3 test.c) - both functions look alike
0000000000400480 <foo>:
400480: 48 8d 44 24 f8 lea rax,[rsp-0x8]
400485: 48 89 05 a4 0b 20 00 mov QWORD PTR [rip+0x200ba4],rax # 601030 <p>
40048c: c7 44 24 f8 07 00 00 00 mov DWORD PTR [rsp-0x8],0x7
400494: c3 ret
00000000004004a0 <zoo>:
4004a0: 48 8d 44 24 fc lea rax,[rsp-0x4]
4004a5: 48 89 05 84 0b 20 00 mov QWORD PTR [rip+0x200b84],rax # 601030 <p>
4004ac: c7 44 24 fc 07 00 00 00 mov DWORD PTR [rsp-0x4],0x7
4004b4: c3 ret
GCC produces this one (gcc -g -O3 -fno-stack-protector)
0000000000000620 <foo>:
620: 55 push rbp
621: 48 89 e5 mov rbp,rsp
624: 48 83 ec 20 sub rsp,0x20
628: 48 8d 44 24 0f lea rax,[rsp+0xf]
62d: 48 83 e0 f0 and rax,0xfffffffffffffff0
631: 48 89 05 e0 09 20 00 mov QWORD PTR [rip+0x2009e0],rax # 201018 <p>
638: c7 00 07 00 00 00 mov DWORD PTR [rax],0x7
63e: c9 leave
63f: c3 ret
0000000000000640 <zoo>:
640: 48 8d 44 24 fc lea rax,[rsp-0x4]
645: c7 44 24 fc 07 00 00 00 mov DWORD PTR [rsp-0x4],0x7
64d: 48 89 05 c4 09 20 00 mov QWORD PTR [rip+0x2009c4],rax # 201018 <p>
654: c3 ret
As you can see, zoo now looks as expected and is similar to the clang code.
The x86-64 System V ABI requires VLAs (C99 Variable Length Arrays) to be 16-byte aligned, same for automatic / static arrays that are >= 16 bytes.
It looks like gcc is treating alloca as a VLA, and failing to do constant-propagation into an alloca that only runs once per function call. (Or that it internally uses alloca for VLAs.)
A generic alloca / VLA can't use the red-zone, in case the runtime value is larger than 128 bytes. GCC also makes a stack frame with RBP instead of saving the allocation size and doing an add rsp, rdx later.
So the asm looks exactly like what it would if the size was a function arg or other runtime variable instead of a constant. That's what led me to this conclusion.
Also alignof(max_align_t) == 16, but alloca and malloc can satisfy the requirement to return memory usable for any object without 16-byte alignment for objects smaller than 16 bytes. None of the standard types have alignment requirements wider than their size in x86-64 SysV.
You're right, it should be able to optimize it to this:
#include <stdalign.h>
void foo() {
    alignas(16) int dummy[1];
    volatile int *p = dummy; // alloca(4)
    *p = 7;
}
and compile it to the movl $7, -8(%rsp) ; ret you suggested.
The alignas(16) might be optional here for alloca.
If you really need gcc to emit better code when constant propagation makes the arg to alloca a compile-time constant, you could consider simply using a VLA in the first place. GNU C++ supports C99-style VLAs in C++ mode, but ISO C++ (and MSVC) don't.
Or possibly use if(__builtin_constant_p(size)) { VLA version } else { alloca version }, but scoping of VLAs means you can't return a VLA from the scope of an if that detects that we're being inlined with a compile-time constant size. So you'd have to duplicate the code that needs the pointer.

Weird SSE assembler instructions for double negation

GCC and Clang compilers seem to employ some dark magic. The C code just negates the value of a double, but the assembler instructions involve bit-wise XOR and the instruction pointer. Can somebody explain what is happening and why it is an optimal solution? Thank you.
Contents of test.c:
void function(double *a, double *b) {
*a = -(*b); // This line.
}
The resulting assembler instructions:
(gcc)
0000000000000000 <function>:
0: f2 0f 10 06 movsd xmm0,QWORD PTR [rsi]
4: 66 0f 57 05 00 00 00 xorpd xmm0,XMMWORD PTR [rip+0x0] # c <function+0xc>
b: 00
c: f2 0f 11 07 movsd QWORD PTR [rdi],xmm0
10: c3 ret
(clang)
0000000000000000 <function>:
0: f2 0f 10 06 movsd xmm0,QWORD PTR [rsi]
4: 0f 57 05 00 00 00 00 xorps xmm0,XMMWORD PTR [rip+0x0] # b <function+0xb>
b: 0f 13 07 movlps QWORD PTR [rdi],xmm0
e: c3 ret
The assembler instruction at address 0x4 represents "This line", however I can't understand how it works. The xorpd/xorps instructions are supposed to be bit-wise XOR and PTR [rip] is the instruction pointer.
I suspect that at the moment of execution rip is pointing somewhere near the 0f 57 05 00 00 00 0f strip of bytes, but I can't quite figure out, how is this working and why do both compilers choose this approach.
P.S. I should point out that this is compiled using -O3
For me, the output of gcc with the -S -O3 options for the same code is:
.file "test.c"
.text
.p2align 4,,15
.globl function
.type function, @function
function:
.LFB0:
.cfi_startproc
movsd (%rsi), %xmm0
xorpd .LC0(%rip), %xmm0
movsd %xmm0, (%rdi)
ret
.cfi_endproc
.LFE0:
.size function, .-function
.section .rodata.cst16,"aM",@progbits,16
.align 16
.LC0:
.long 0
.long -2147483648
.long 0
.long 0
.ident "GCC: (Ubuntu 6.3.0-12ubuntu2) 6.3.0 20170406"
.section .note.GNU-stack,"",@progbits
Here the xorpd instruction uses instruction-pointer-relative addressing; the offset points to the .LC0 label, whose low 64 bits hold the value 0x8000000000000000 (only bit 63 is set).
.LC0:
.long 0
.long -2147483648
On a big-endian target these two lines would be swapped.
XORing the double value with 0x8000000000000000 flips the sign bit (bit 63), which negates the value.
clang uses the xorps instruction for the same purpose: it performs the same 128-bit bitwise XOR as xorpd, but its encoding is one byte shorter because it has no 66 prefix.
If you run objdump with the -r option, it will show you the relocations that have to be applied to the program before it runs.
objdump -d test.o -r
test.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <function>:
0: f2 0f 10 06 movsd (%rsi),%xmm0
4: 66 0f 57 05 00 00 00 xorpd 0x0(%rip),%xmm0 # c <function+0xc>
b: 00
8: R_X86_64_PC32 .LC0-0x4
c: f2 0f 11 07 movsd %xmm0,(%rdi)
10: c3 retq
Disassembly of section .text.startup:
0000000000000000 <main>:
0: 31 c0 xor %eax,%eax
2: c3 retq
Here the displacement field of the xorpd instruction, at offset 0x8, has a relocation of type R_X86_64_PC32.
PS: I'm using gcc 6.3.0
xorps xmm0,XMMWORD PTR [rip+0x0]
Any part of an instruction surrounded by [] is an indirect reference to memory.
In this case it is a reference to the memory at address RIP+0.
(It is not really RIP+0 at run time: in an unlinked object file the displacement is a placeholder that the linker fills in.)
The x86-64 instruction set adds instruction-pointer-relative addressing. This means you can have (usually read-only) data in your program that you can address easily even if the program is moved around in memory.
An XOR xmm0,Y inverts all bits in xmm0 that are set in Y.
Negation involves inverting the sign bit, so that's why XOR is used. xorpd and xorps are the packed-double and packed-single forms of it; both perform the same bitwise XOR.
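The sign-bit trick can be reproduced in portable C. This is a minimal sketch, assuming a 64-bit IEEE 754 double; negate_by_xor is a hypothetical name used only for illustration, and the mask is the same 0x8000000000000000 constant the compilers place at .LC0.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(double) == sizeof(uint64_t), "expects a 64-bit double");

/* Hypothetical helper: negate a double by flipping bit 63 (the sign bit),
   which is what xorpd/xorps do with the mask loaded from memory. */
static double negate_by_xor(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);        /* reinterpret without aliasing UB */
    bits ^= UINT64_C(0x8000000000000000);  /* flip only the sign bit */
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

Calling negate_by_xor(1.5) returns -1.5, and applying it twice returns the original value, since XOR with the same mask is its own inverse.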
