Understanding GCC's alloca() alignment and seemingly missed optimization

Understanding GCC's alloca() alignment and seemingly missed optimization - gcc

Consider the following toy example that allocates memory on the stack by means of the alloca() function:
#include <alloca.h>
void foo() {
volatile int *p = alloca(4);
*p = 7;
}
Compiling the function above using gcc 8.2 with -O3 results in the following assembly code:
foo:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
leaq 15(%rsp), %rax
andq $-16, %rax
movl $7, (%rax)
leave
ret
Honestly, I would have expected a more compact assembly code.
16-byte alignment for allocated memory
The instruction andq $-16, %rax in the code above results in rax containing the (only) 16-byte-aligned address between the addresses rsp and rsp + 15 (both inclusive).
This alignment enforcement is the first thing I don't understand: Why does alloca() align the allocated memory to a 16-byte boundary?
Possible missed optimization?
Let's consider anyway that we want the memory allocated by alloca() to be 16-byte aligned. Even so, in the assembly code above, keeping in mind that GCC assumes the stack to be aligned to a 16-byte boundary at the moment of performing the function call (i.e., call foo), if we pay attention to the status of the stack inside foo() just after pushing the rbp register:
Size Stack RSP mod 16 Description
-----------------------------------------------------------------------------------
------------------
| . |
| . |
| . |
------------------........0 at "call foo" (stack 16-byte aligned)
8 bytes | return address |
------------------........8 at foo entry
8 bytes | saved RBP |
------------------........0 <----- RSP is 16-byte aligned!!!
I think that by taking advantage of the red zone (i.e., no need to modify rsp) and the fact that rsp already contains a 16-byte aligned address, the following code could be used instead:
foo:
pushq %rbp
movq %rsp, %rbp
movl $7, -16(%rbp)
leave
ret
The address contained in the register rbp is 16-byte aligned, therefore rbp - 16 will also be aligned to a 16-byte boundary.
Even better, the creation of the new stack frame can be optimized away, since rsp is not modified:
foo:
movl $7, -8(%rsp)
ret
Is this just a missed optimization or I am missing something else here?

This is (partially) missed optimization in gcc. Clang does it as expected.
I said partially because if you know you will be using gcc you can use builtin functions (use conditional compilation for gcc and other compilers to have portable code).
__builtin_alloca_with_align is your friend ;)
Here is an example (changed so the compiler will not reduce function call to single ret):
#include <alloca.h>
volatile int* p;
void foo()
{
p = alloca(4) ;
*p = 7;
}
void zoo()
{
// aligment is 16 bits, not bytes
p = __builtin_alloca_with_align(4,16) ;
*p = 7;
}
int main()
{
foo();
zoo();
}
Disassembled code (with objdump -d -w --insn-width=12 -M intel)
Clang will produce the following code (clang -O3 test.c) - both functions look alike
0000000000400480 <foo>:
400480: 48 8d 44 24 f8 lea rax,[rsp-0x8]
400485: 48 89 05 a4 0b 20 00 mov QWORD PTR [rip+0x200ba4],rax # 601030 <p>
40048c: c7 44 24 f8 07 00 00 00 mov DWORD PTR [rsp-0x8],0x7
400494: c3 ret
00000000004004a0 <zoo>:
4004a0: 48 8d 44 24 fc lea rax,[rsp-0x4]
4004a5: 48 89 05 84 0b 20 00 mov QWORD PTR [rip+0x200b84],rax # 601030 <p>
4004ac: c7 44 24 fc 07 00 00 00 mov DWORD PTR [rsp-0x4],0x7
4004b4: c3 ret
GCC this one (gcc -g -O3 -fno-stack-protector)
0000000000000620 <foo>:
620: 55 push rbp
621: 48 89 e5 mov rbp,rsp
624: 48 83 ec 20 sub rsp,0x20
628: 48 8d 44 24 0f lea rax,[rsp+0xf]
62d: 48 83 e0 f0 and rax,0xfffffffffffffff0
631: 48 89 05 e0 09 20 00 mov QWORD PTR [rip+0x2009e0],rax # 201018 <p>
638: c7 00 07 00 00 00 mov DWORD PTR [rax],0x7
63e: c9 leave
63f: c3 ret
0000000000000640 <zoo>:
640: 48 8d 44 24 fc lea rax,[rsp-0x4]
645: c7 44 24 fc 07 00 00 00 mov DWORD PTR [rsp-0x4],0x7
64d: 48 89 05 c4 09 20 00 mov QWORD PTR [rip+0x2009c4],rax # 201018 <p>
654: c3 ret
As you can see zoo now looks like expected and similar to clang code.

The x86-64 System V ABI requires VLAs (C99 Variable Length Arrays) to be 16-byte aligned, same for automatic / static arrays that are >= 16 bytes.
It looks like gcc is treating alloca as a VLA, and failing to do constant-propagation into an alloca that only runs once per function call. (Or that it internally uses alloca for VLAs.)
A generic alloca / VLA can't use the red-zone, in case the runtime value is larger than 128 bytes. GCC also makes a stack frame with RBP instead of saving the allocation size and doing an add rsp, rdx later.
So the asm looks exactly like what it would if the size was a function arg or other runtime variable instead of a constant. That's what led me to this conclusion.
Also alignof(maxalign_t) == 16 , but alloca and malloc can satisfy the requirement to return memory usable for any object without 16-byte alignment for objects smaller than 16 bytes. None of the standard types have alignment requirements wider than their size in x86-64 SysV.
You're right, it should be able to optimize it to this:
void foo() {
alignas(16) int dummy[1];
volatile int *p = dummy; // alloca(4)
*p = 7;
}
and compile it to the movl $7, -8(%rsp) ; ret you suggested.
The alignas(16) might be optional here for alloca.
If you really need gcc to emit better code when constant propagation makes the arg to alloca a compile-time constant, you could consider simply using a VLA in the first place. GNU C++ supports C99-style VLAs in C++ mode, but ISO C++ (and MSVC) don't.
Or possibly use if(__builtin_constant_p(size)) { VLA version } else { alloca version }, but scoping of VLAs means you can't return a VLA from the scope of an if that detects that we're being inlined with a compile-time constant size. So you'd have to duplicate the code that needs the pointer.

Related

Why did additional pointer arguments disappear in assembly?

C Code:
void PtrArg1(int* a,int* b,int* c, int* d, int* e, int* f)
{
return;
}
void PtrArg2(int* a,int* b,int* c, int* d, int* e, int* f, int* g, int* h)
{
return;
}
Compiling with
gcc -c -m64 -o basics basics.c -O0
Running
objdump -d basics -M intel -r
then results in the following disassembly (Intel syntax):
000000000000000b <PtrArg1>:
b: f3 0f 1e fa endbr64
f: 55 push rbp
10: 48 89 e5 mov rbp,rsp
13: 48 89 7d f8 mov QWORD PTR [rbp-0x8],rdi
17: 48 89 75 f0 mov QWORD PTR [rbp-0x10],rsi
1b: 48 89 55 e8 mov QWORD PTR [rbp-0x18],rdx
1f: 48 89 4d e0 mov QWORD PTR [rbp-0x20],rcx
23: 4c 89 45 d8 mov QWORD PTR [rbp-0x28],r8
27: 4c 89 4d d0 mov QWORD PTR [rbp-0x30],r9
2b: 90 nop
2c: 5d pop rbp
2d: c3 ret
000000000000002e <PtrArg2>:
2e: f3 0f 1e fa endbr64
32: 55 push rbp
33: 48 89 e5 mov rbp,rsp
36: 48 89 7d f8 mov QWORD PTR [rbp-0x8],rdi
3a: 48 89 75 f0 mov QWORD PTR [rbp-0x10],rsi
3e: 48 89 55 e8 mov QWORD PTR [rbp-0x18],rdx
42: 48 89 4d e0 mov QWORD PTR [rbp-0x20],rcx
46: 4c 89 45 d8 mov QWORD PTR [rbp-0x28],r8
4a: 4c 89 4d d0 mov QWORD PTR [rbp-0x30],r9
4e: 90 nop
4f: 5d pop rbp
50: c3 ret
The number of arguments differs for PtrArg1 and PtrArg2, but the assembly instructions are the same for both. Why?

This is due to the calling convention (System V AMD64 ABI, current version 1.0). The first six parameters are passed in integer registers, all others are pushed onto the stack.
After executing till location PtrArg2+0x4e, you get the following stack layout:
+----------+-----------------+
| offset | content |
+----------+-----------------+
| rbp-0x30 | f |
| rbp-0x28 | e |
| rbp-0x20 | d |
| rbp-0x18 | c |
| rbp-0x10 | b |
| rbp-0x8 | a |
| rbp+0x0 | saved rbp value |
| rbp+0x8 | return address |
| rbp+0x10 | g |
| rbp+0x18 | h |
+----------+-----------------+
Since g and h are pushed by the caller, you get the same disassembly for both functions. For the caller
void Caller()
{
PtrArg2(1, 2, 3, 4, 5, 6, 7, 8);
}
(I ommitted the necessary casts for clarity) we would get the following disassembly:
Caller():
push rbp
mov rbp, rsp
push 8
push 7
mov r9d, 6
mov r8d, 5
mov ecx, 4
mov edx, 3
mov esi, 2
mov edi, 1
call PtrArg2
add rsp, 16
nop
leave
ret
(see compiler explorer)
The parameters h = 8 and g = 7 are pushed onto the stack, before calling PtrArg2.

Disappear? What did you expect the function to do with them that the compiler would emit asm instructions to implement?
You literally return; as the only statement in a void function so there's nothing the function needs to do other than ret. If you compile with a normal level of optimization like -O2, that's all you'll get. Debug-mode code is usually not interesting to look at, and is full of redundant / useless stuff.
How to remove "noise" from GCC/clang assembly output?
The only reason you're seeing any instructions for some args is that you compiled in debug mode, i.e. the default optimization level of -O0, anti-optimized debug mode. Every C object (except register locals) has a memory address, and debug mode makes sure that the value is actually there in memory before/after every C statement. This means spilling register args to the stack on function entry. Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?
The x86-64 System V ABI's calling convention passes the first 6 integer args in registers, the rest on the stack. The stack args already have memory addresses; the compiler doesn't emit code to copy them down next to other local vars below the return address; that would be pointless. The callee "owns" its own stack args, i.e. it can store new values to the stack space where the caller wrote the args, so that space can be the true address of args even if the function were to modify them.

instruction repeated twice when decoded into machine language,

Am basically learning how to make my own instruction in the X86 architecture, but to do that I am understanding how they are decoded and and interpreted to a low level language,
By taking an example of a simple mov instruction and using the .byte notation I wanted to understand in detail as to how instructions are decoded,
My simple code is as follows:
#include <stdio.h>
#include <iostream>
int main(int argc, char const *argv[])
{
int x{5};
int y{0};
// mov %%eax, %0
asm (".byte 0x8b,0x45,0xf8\n\t" //mov %1, eax
".byte 0x89, 0xC0\n\t"
: "=r" (y)
: "r" (x)
);
printf ("dst value : %d\n", y);
return 0;
}
and when I use objdump to analyze how it is broken down to machine language, i get the following output:
000000000000078a <main>:
78a: 55 push %ebp
78b: 48 dec %eax
78c: 89 e5 mov %esp,%ebp
78e: 48 dec %eax
78f: 83 ec 20 sub $0x20,%esp
792: 89 7d ec mov %edi,-0x14(%ebp)
795: 48 dec %eax
796: 89 75 e0 mov %esi,-0x20(%ebp)
799: c7 45 f8 05 00 00 00 movl $0x5,-0x8(%ebp)
7a0: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%ebp)
7a7: 8b 45 f8 mov -0x8(%ebp),%eax
7aa: 8b 45 f8 mov -0x8(%ebp),%eax
7ad: 89 c0 mov %eax,%eax
7af: 89 45 fc mov %eax,-0x4(%ebp)
7b2: 8b 45 fc mov -0x4(%ebp),%eax
7b5: 89 c6 mov %eax,%esi
7b7: 48 dec %eax
7b8: 8d 3d f7 00 00 00 lea 0xf7,%edi
7be: b8 00 00 00 00 mov $0x0,%eax
7c3: e8 78 fe ff ff call 640 <printf#plt>
7c8: b8 00 00 00 00 mov $0x0,%eax
7cd: c9 leave
7ce: c3 ret
With regard to this output of objdump why is the instruction 7aa: 8b 45 f8 mov -0x8(%ebp),%eax repeated twice, any reason behind it or am I doing something wrong while using the .byte notation?

One of those is compiler-generated, because you asked GCC to have the input in its choice of register for you. That's what "r"(x) means. And you compiled with optimization disabled (the default -O0) so it actually stored x to memory and then reloaded it before your asm statement.
Your code has no business assuming anything about the contents of memory or where EBP points.
Since you're using 89 c0 mov %eax,%eax, the only safe constraints for your asm statement are "a" explicit-register constraints for input and output, forcing the compiler to pick that. If you compile with optimization enabled, your code totally breaks because you lied to the compiler about what your code actually does.
// constraints that match your manually-encoded instruction
asm (".byte 0x89, 0xC0\n\t"
: "=a" (y)
: "a" (x)
);
There's no constraint to force GCC to pick a certain addressing mode for a "m" source or "=m" dest operand so you need to ask for inputs/outputs in specific registers.
If you want to encode your own mov instructions differently from standard mov, see which MOV instructions in the x86 are not used or the least used, and can be used for a custom MOV extension - you might want to use a prefix in front of regular mov opcodes so you can let the assembler encode registers and addressing modes for you, like .byte something; mov %1, %0.
Look at the compiler-generate asm output (gcc -S, not disassembly of the .o or executable). Then you can see which instructions come from the asm statement and which are emitted by GCC.
If you don't explicitly reference some operands in the asm template but still want to see what the compiler picked, you can use them in asm comments like this:
asm (".byte 0x8b,0x45,0xf8 # 0 = %0 1 = %1 \n\t"
".byte 0x89, 0xC0\n\t"
: "=r" (y)
: "r" (x)
);
and gcc will fill it in for you so you can see what operands it expects you to be reading and writing. (Godbolt with g++ -m32 -O3). I put your code in void foo(){} instead of main because GCC -m32 thinks it needs to re-align the stack at the top of main. This makes the code a lot harder to follow.
# gcc-9.2 -O3 -m32 -fverbose-asm
.LC0:
.string "dst value : %d\n"
foo():
subl $20, %esp #,
movl $5, %eax #, tmp84
## Notice that GCC hasn't set up EBP at all before it runs your asm,
## and hasn't stored x in memory.
## It only put it in a register like you asked it to.
.byte 0x8b,0x45,0xf8 # 0 = %eax 1 = %eax # y, tmp84
.byte 0x89, 0xC0
pushl %eax # y
pushl $.LC0 #
call printf #
addl $28, %esp #,
ret
Also note that if you were compiling as 64-bit, it would probably pick %esi as a register because printf will want its 2nd arg there. So the "a" instead of "r" constraint would actually matter.
You could get 32-bit GCC to use a different register if you were assigning to a variable that has to survive across a function call; then GCC would pick a call-preserved reg like EBX instead of EAX.

Why do I find some never called instructions nopl, nopw after ret or jmp in GCC compiled code? [duplicate]

I've been working with C for a short while and very recently started to get into ASM. When I compile a program:
int main(void)
{
int a = 0;
a += 1;
return 0;
}
The objdump disassembly has the code, but nops after the ret:
...
08048394 <main>:
8048394: 55 push %ebp
8048395: 89 e5 mov %esp,%ebp
8048397: 83 ec 10 sub $0x10,%esp
804839a: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%ebp)
80483a1: 83 45 fc 01 addl $0x1,-0x4(%ebp)
80483a5: b8 00 00 00 00 mov $0x0,%eax
80483aa: c9 leave
80483ab: c3 ret
80483ac: 90 nop
80483ad: 90 nop
80483ae: 90 nop
80483af: 90 nop
...
From what I learned nops do nothing, and since after ret wouldn't even be executed.
My question is: why bother? Couldn't ELF(linux-x86) work with a .text section(+main) of any size?
I'd appreciate any help, just trying to learn.

First of all, gcc doesn't always do this. The padding is controlled by -falign-functions, which is automatically turned on by -O2 and -O3:
-falign-functions
-falign-functions=n
Align the start of functions to the next power-of-two greater than n, skipping up to n bytes. For instance,
-falign-functions=32 aligns functions to the next 32-byte boundary, but -falign-functions=24 would align to the next 32-byte boundary only
if this can be done by skipping 23 bytes or less.
-fno-align-functions and -falign-functions=1 are equivalent and mean that functions will not be aligned.
Some assemblers only support this flag when n is a power of two; in
that case, it is rounded up.
If n is not specified or is zero, use a machine-dependent default.
Enabled at levels -O2, -O3.
There could be multiple reasons for doing this, but the main one on x86 is probably this:
Most processors fetch instructions in aligned 16-byte or 32-byte blocks. It can be
advantageous to align critical loop entries and subroutine entries by 16 in order to minimize
the number of 16-byte boundaries in the code. Alternatively, make sure that there is no 16-byte boundary in the first few instructions after a critical loop entry or subroutine entry.
(Quoted from "Optimizing subroutines in assembly
language" by Agner Fog.)
edit: Here is an example that demonstrates the padding:
// align.c
int f(void) { return 0; }
int g(void) { return 0; }
When compiled using gcc 4.4.5 with default settings, I get:
align.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <f>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: b8 00 00 00 00 mov $0x0,%eax
9: c9 leaveq
a: c3 retq
000000000000000b <g>:
b: 55 push %rbp
c: 48 89 e5 mov %rsp,%rbp
f: b8 00 00 00 00 mov $0x0,%eax
14: c9 leaveq
15: c3 retq
Specifying -falign-functions gives:
align.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <f>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: b8 00 00 00 00 mov $0x0,%eax
9: c9 leaveq
a: c3 retq
b: eb 03 jmp 10 <g>
d: 90 nop
e: 90 nop
f: 90 nop
0000000000000010 <g>:
10: 55 push %rbp
11: 48 89 e5 mov %rsp,%rbp
14: b8 00 00 00 00 mov $0x0,%eax
19: c9 leaveq
1a: c3 retq

This is done to align the next function by 8, 16 or 32-byte boundary.
From “Optimizing subroutines in assembly language” by A.Fog:
11.5 Alignment of code
Most microprocessors fetch code in aligned 16-byte or 32-byte blocks. If an importantsubroutine entry or jump label happens to be near the end of a 16-byte block then themicroprocessor will only get a few useful bytes of code when fetching that block of code. Itmay have to fetch the next 16 bytes too before it can decode the first instructions after thelabel. This can be avoided by aligning important subroutine entries and loop entries by 16.
[...]
Aligning a subroutine entry is as simple as putting as many
NOP
's as needed before thesubroutine entry to make the address divisible by 8, 16, 32 or 64, as desired.

As far as I remember, instructions are pipelined in cpu and different cpu blocks (loader, decoder and such) process subsequent instructions. When RET instructions is being executed, few next instructions are already loaded into cpu pipeline. It's a guess, but you can start digging here and if you find out (maybe the specific number of NOPs that are safe, share your findings please.

Why does the assembly encoding of objdump vary?

I was reading this article about Position Independent Code and I encountered this assembly listing of a function.
0000043c <ml_func>:
43c: 55 push ebp
43d: 89 e5 mov ebp,esp
43f: e8 16 00 00 00 call 45a <__i686.get_pc_thunk.cx>
444: 81 c1 b0 1b 00 00 add ecx,0x1bb0
44a: 8b 81 f0 ff ff ff mov eax,DWORD PTR [ecx-0x10]
450: 8b 00 mov eax,DWORD PTR [eax]
452: 03 45 08 add eax,DWORD PTR [ebp+0x8]
455: 03 45 0c add eax,DWORD PTR [ebp+0xc]
458: 5d pop ebp
459: c3 ret
0000045a <__i686.get_pc_thunk.cx>:
45a: 8b 0c 24 mov ecx,DWORD PTR [esp]
45d: c3 ret
However, on my machine (gcc-7.3.0, Ubuntu 18.04 x86_64), I got slightly different result below:
0000044d <ml_func>:
44d: 55 push %ebp
44e: 89 e5 mov %esp,%ebp
450: e8 29 00 00 00 call 47e <__x86.get_pc_thunk.ax>
455: 05 ab 1b 00 00 add $0x1bab,%eax
45a: 8b 90 f0 ff ff ff mov -0x10(%eax),%edx
460: 8b 0a mov (%edx),%ecx
462: 8b 55 08 mov 0x8(%ebp),%edx
465: 01 d1 add %edx,%ecx
467: 8b 90 f0 ff ff ff mov -0x10(%eax),%edx
46d: 89 0a mov %ecx,(%edx)
46f: 8b 80 f0 ff ff ff mov -0x10(%eax),%eax
475: 8b 10 mov (%eax),%edx
477: 8b 45 0c mov 0xc(%ebp),%eax
47a: 01 d0 add %edx,%eax
47c: 5d pop %ebp
47d: c3 ret
The main difference I found was that the semantic of mov instruction. In the upper listing, mov ebp,esp actually moves esp to ebp, while in the lower listing, mov %esp,%ebp does the same thing, but the order of operands are different.
This is quite confusing, even when I have to code hand-written assembly. To summarize, my questions are (1) why I got different assembly representations for the same instructions and (2) which one I should use, when writing assembly code (e.g. with __asm(:::);)

obdjump defaults to -Matt AT&T syntax (like your 2nd code block). See att vs. intel-syntax. The tag wikis have some info about the syntax differences: https://stackoverflow.com/tags/att/info vs. https://stackoverflow.com/tags/intel-syntax/info
Either syntax has the same limitations, imposed by what the machine itself can do, and what's encodeable in machine code. They're just different ways of expressing that in text.
Use objdump -d -Mintel for Intel syntax. I use alias disas='objdump -drwC -Mintel' in my .bashrc, so I can disas foo.o and get the format I want, with relocations printed (important for making sense of a non-linked .o), without line-wrapping for long instructions, and with C++ symbol names demangled.
In inline asm, you can use either syntax, as long as it matches what the compiler is expecting. The default is AT&T, and that's what I'd recommend using for compatibility with clang. Maybe there's a way, but clang doesn't work the same way as GCC with -masm=intel.
Also, AT&T is basically standard for GNU C inline asm on x86, and it means you don't need special build options for your code to work.
But you can use gcc -masm=intel to compile source files that use Intel syntax in their asm statements. This is fine for your own use if you don't care about clang.
If you're writing code for a header, you can make it portable between AT&T and Intel syntax using dialect alternatives, at least for GCC:
static inline
void atomic_inc(volatile int *p) {
// use __asm__ instead of asm in headers, so it works even with -std=c11 instead of gnu11
__asm__("lock {addl $1, %0 | add %0, 1}": "+m"(*p));
// TODO: flag output for return value?
// maybe doesn't need to be asm volatile; compilers know that modifying pointed-to memory is a visible side-effect unless it's a local that fully optimizes away.
// If you want this to work as a memory barrier, use a `"memory"` clobber to stop compile-time memory reordering. The lock prefix provides a runtime full barrier
}
source+asm outputs for gcc/clang on the Godbolt compiler explorer.
With g++ -O3 (default or -masm=att), we get
atomic_inc(int volatile*):
lock addl $1, (%rdi) # operand-size is from my explicit addl suffix
ret
With g++ -O3 -masm=intel, we get
atomic_inc(int volatile*):
lock add DWORD PTR [rdi], 1 # operand-size came from the %0 expansion
ret
clang works with the AT&T version, but fails with -masm=intel (or the -mllvm --x86-asm-syntax=intel which that implies), because that apparently only applies to code emitted by LLVM, not for how the front-end fills in the asm template.
The clang error message is:
<source>:4:13: error: unknown use of instruction mnemonic without a size suffix
__asm__("lock {addl $1, %0 | add %0, 1}": "+m"(*p));
^
<inline asm>:1:2: note: instantiated into assembly here
lock add (%rdi), 1
^
1 error generated.
It picked the "Intel" syntax alternative, but still filled in the template with an AT&T memory operand.

Reversed Mach-O 64-bit x86 Assembly analysis

This question is for Intel x86 assembly experts to answer. Thanks for your effort in advance!
Problem Specification
I am analysing a binary file, which match Mach-O 64-bit x86 assembly. I am currently using MacOS 64 OS. The assembly comes from objdump.
The problem is that when I am learning assembly, I can see variable name "$xxx", I can see string value in ascii and I can also see the callee name like "call _printf"
But in this assembly, I can get nothing above:
no main function:
Disassembly of section __TEXT,__text:
__text:
100000c90: 55 pushq %rbp
100000c91: 48 89 e5 movq %rsp, %rbp
100000c94: 48 83 ec 10 subq $16, %rsp
100000c98: 48 8d 3d bf 02 00 00 leaq 703(%rip), %rdi
100000c9f: b0 00 movb $0, %al
100000ca1: e8 68 02 00 00 callq 616
100000ca6: 89 45 fc movl %eax, -4(%rbp)
100000ca9: 48 83 c4 10 addq $16, %rsp
100000cad: 5d popq %rbp
100000cae: c3 retq
100000caf: 90 nop
100000cb0: 55 pushq %rbp
...
The above is codes frame will be executed, but I have no idea where it is executed.
Also, I newbie of AT&T assemble. Hence, could you tell me what is the meaning of instruction:
0000000100000c90 pushq %rbp
0000000100000c98 leaq 0x2bf(%rip), %rdi ## literal pool for: "xxxx\n"
...
0000000100000cd0 callq 0x100000c90
Is it a loop? I am not sure but it seems to be. And why we they use %rip and %rdi register. In intel x86 I know that EIP represents current caller address, but I don't understand the meaning here.
call integer:
No matter what call convention they used, I had never seen code pattern like "call 616":
"100000cd0: e8 bb ff ff ff callq -69 <__mh_execute_header+C90>"
After ret:
Ret in intel x86, means delete stack frame and return control flow to caller. It should be an independent function. However, after this, we can see codes like
100000cae: c3 retq
100000caf: 90 nop
/* new function call */
100000cb0: 55 pushq %rbp
...
It is ridiculous!
ASCII string lost:
I have already viewed the binary in Hexadecimal format, and recognise some ascii string before reverse it to asm file.
However, in this file no ascii string occurrences!
Total architecture review:
Disassembly of section __TEXT,__text:
__text:
from address 10000c90 to 100000ef6 of 145 lines
Disassembly of section __TEXT,__stubs:
__stubs:
from address 100000efc to 100000f14 of 5 lines asm codes:
100000efc: ff 25 16 01 00 00 jmp qword ptr [rip + 278]
100000f02: ff 25 18 01 00 00 jmp qword ptr [rip + 280]
100000f08: ff 25 1a 01 00 00 jmp qword ptr [rip + 282]
100000f0e: ff 25 1c 01 00 00 jmp qword ptr [rip + 284]
100000f14: ff 25 1e 01 00 00 jmp qword ptr [rip + 286]
Disassembly of section __TEXT,__stub_helper:
__stub_helper:
...
Disassembly of section __TEXT,__cstring:
__cstring:
...
Disassembly of section __TEXT,__unwind_info:
__unwind_info:
...
Disassembly of section __DATA,__nl_symbol_ptr:
__nl_symbol_ptr:
...
Disassembly of section __DATA,__got:
__got:
...
Disassembly of section __DATA,__la_symbol_ptr:
__la_symbol_ptr:
...
Disassembly of section __DATA,__data:
__data:
...
Since it might be a virus, I cannot execute it. How should I analyse it ?
Update on May 21
I have already identified where is the output, and if I totally understand the data flow pipeline represented in this programme, I might be able to figure out the possible solutions.
I am appreciated if someone can give me the detailed explanation. Thank you !
Update on May 22
I installed a MacOS in VirtualBox and after chmod privileges , I executed the programme but nothing special except for two lines of output happened. And the result hiding in the binary file.

You don't need a main if you are not using C. The binary header contains the entry point address.
Nothing special about call 616, it's just that you don't have (all) symbols. It's somewhat strange that objdump didn't calculate the address for you, but it should be 0x100000ca6+616.
Not sure what you find ridiculous there. One function ends, another starts.
That's not a question. Yes, you can create strings at runtime so you won't have them in the image. Possibly they are encrypted.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Understanding GCC's alloca() alignment and seemingly missed optimization - gcc

Related

Why did additional pointer arguments disappear in assembly?

instruction repeated twice when decoded into machine language,

Why do I find some never called instructions nopl, nopw after ret or jmp in GCC compiled code? [duplicate]

Why does the assembly encoding of objdump vary?

Reversed Mach-O 64-bit x86 Assembly analysis

Categories

Resources