How can jmpq end up at the wrong place? - debugging

I am debugging some problems with the relocation of code loaded by llvm's runtime linker. One odd thing that happens is a segfault after calling a function that has been relocated via R_X86_64_PLT32. In that case, llvm creates an entry in a GOT (global offset table?) and makes an indirect jump. The callq instruction ends up here:
0x7ffff7f12021 jmpq *0xded21(%rip) # 0x7ffff7ff0d48
So I'd expect the next instruction to be at 0x7ffff7ff0d48. Now that location does not look like real code to me, but that's another story. What bothers me is that after a single step, I end up here:
0x3415b20 adc %bl,0x323(%rsi)
This is a completely different location, and the source of a segfault that even seems to kill gdb. How did jmpq end up here? Is there some truncation going on?
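(For what it's worth: jmpq *0xded21(%rip) is a memory-indirect jump, so the operand names the 64-bit slot to read, not the destination itself. The slot address works out as
0x7ffff7f12021 + 6 (length of the jmpq) + 0xded21 = 0x7ffff7ff0d48
and the CPU then jumps to whatever 8-byte value is stored at 0x7ffff7ff0d48. If that GOT slot was never filled in, or was only partially written, control lands at whatever address those bytes happen to encode, which is one plausible way to end up somewhere like 0x3415b20.)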

Related

Accessing global variables in ARM64 position independent assembly code

I'm writing some ARM64 assembly code for macOS, and it needs to access a global variable.
I tried to use the solution in this SO answer, and it works fine if I just call the function as is. However, my application needs to patch some instructions of this function, and the way I'm doing it, the function gets moved somewhere else in memory in the process. Note the adrp/ldr pair is untouched during patching.
However, if I try to run the function after moving it elsewhere in memory, it no longer returns correct results. This happens even if I just memcpy() the code as is, without patching. After tracing with a debugger, I isolated the issue to the address of the global variable being incorrectly loaded by the adrp/ldr pair (and weirdly, the ldr is assembled as an add, as seen with objdump straight after compiling the binary -- not sure if it's somehow related to the issue here).
What would be the correct way to load a global variable, so that it survives the function being copied somewhere else and run from there?
Note the adrp/ldr pair is untouched during patching.
There's the issue. adrp computes its target page relative to the address of the adrp instruction itself, so once the code runs from a different address, the same instruction bits resolve to a different page. If you rip code out of the binary it's in, then you effectively need to re-link it.
There are two ways of dealing with this:
If you have complete control over the segment layout, then you could have one executable segment with all of your assembly in it, and right next to it one segment with all addresses that code needs, and make sure the assembly ONLY has references to things on that page. Then wherever you copy your assembly, you'd also copy the data page next to it. This would enable you to make use of static addresses that get rebased by the dynamic linker at the time your binary is loaded. This might look something like:
.section __ASM,__asm,regular
.globl _asm_stub
.p2align 2
_asm_stub:
adrp x0, _some_ref#PAGE
ldr x0, [x0, _some_ref#PAGEOFF]
ret
.section __REF,__ref
.globl _some_ref
.p2align 3
_some_ref:
.8byte _main
Compile that with -Wl,-segprot,__ASM,rx,rx and you'll get an executable __ASM and a writeable __REF segment. Those two would have to maintain their relative position to each other when they get copied around.
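A rough sketch of the copying step (my own illustration; the names and the way you obtain the addresses are assumptions, not part of the answer): given the start address and total size of the two adjacent segments and a destination mapping that is already executable (e.g. obtained via mmap with MAP_JIT on macOS), something like this keeps their relative layout intact:
#include <stdint.h>
#include <string.h>

typedef long (*stub_fn)(void);

/* asm_start/total_size are assumed to cover the __ASM and __REF segments
   as one contiguous block, and stub_offset is the offset of _asm_stub
   inside that block. */
static stub_fn copy_stub(void *dst, const void *asm_start, size_t total_size,
                         size_t stub_offset)
{
    memcpy(dst, asm_start, total_size);   /* both segments move together */
    /* the instruction cache would also have to be invalidated before calling */
    return (stub_fn)((uint8_t *)dst + stub_offset);
}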
(Note that on arm64 macOS you cannot put symbol references into executable segments for the dynamic linker to rebase, because it will fault and crash while trying to do so, and even if it were able to do that, it would invalidate the code signature.)
The second way is to act as a linker yourself, scanning for PC-relative instructions and re-linking them as you go. The list of PC-relative instructions in arm64 is quite short, so it should be a feasible amount of work:
adr and adrp
b and bl
b.cond (and bc.cond with FEAT_HBC)
cbz and cbnz
tbz and tbnz
ldr and ldrsw (literal)
ldr (SIMD & FP literal)
prfm (literal)
(You can look for the string PC[] in the ARMv8 Reference Manual to find all uses.)
For each of those you'd have to check whether their target address lies within the range that's being copied or not. If it does, then you'd leave the instruction alone (unless you copy the code to a different offset within the 4K page than it was at before, in which case you have to fix up adrp instructions). If it doesn't, then you'll have to recalculate the offset and emit a new instruction. Some of the instructions have a really low maximum offset (tbz/tbnz ±32KiB). But usually the only instructions that reference addresses across function boundaries are adr, adrp, b, bl and ldr. If all code on the page is written by you, then you can do adrp+add instead of adr and adrp+ldr instead of just ldr, and if you have compiler-generated code on there, then all adrs and ldrs will have a nop before or after, which you can use to turn them into an adrp combo. That should get your maximum reference range up to ±128MiB.
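To make the second approach concrete, here is a hedged sketch in C (my own, not the answerer's code) that handles only the b/bl case: it decodes the 26-bit PC-relative immediate, recomputes the offset against the instruction's new address, and re-encodes it if the target still fits in the ±128 MiB reach. The other instruction classes listed above follow the same pattern with different immediate fields.
#include <stdbool.h>
#include <stdint.h>

/* b/bl share the layout opc[31:26] | imm26[25:0], where the branch offset
   is the sign-extended imm26 times 4, relative to the instruction itself. */
static bool is_b_or_bl(uint32_t insn)
{
    return (insn & 0x7c000000u) == 0x14000000u;   /* 0b000101 = b, 0b100101 = bl */
}

/* old_addr: where the instruction used to live; new_addr: where it lives now.
   Returns false if it is not a b/bl or the new offset is out of range. */
static bool relink_b_or_bl(uint32_t *insn, uint64_t old_addr, uint64_t new_addr)
{
    uint32_t word = *insn;
    if (!is_b_or_bl(word))
        return false;

    int64_t imm26 = (int32_t)(word << 6) >> 6;           /* sign-extend 26 bits */
    uint64_t target = old_addr + (uint64_t)(imm26 * 4);  /* original branch target */

    int64_t new_off = (int64_t)(target - new_addr);
    if (new_off < -(128 * 1024 * 1024) || new_off >= 128 * 1024 * 1024)
        return false;                                    /* beyond the ±128 MiB reach */

    *insn = (word & 0xfc000000u) | (((uint64_t)new_off >> 2) & 0x03ffffffu);
    return true;
}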

Setting a random address breakpoint in gdb

I am learning about the mechanism of breakpoints and I learned that 'in x86, there exists an instruction called int3 for the debugger to interrupt the CPU, and the CPU will then interrupt the running program with a signal'.
For example:
8048e20: 55 push %ebp
8048e21: 89 e5 mov %esp,%ebp
When the user inputs
b *0x8048e21
The first byte of the instruction will be replaced by int3 (opcode 0xcc), and it becomes this:
8048e20: 55 push %ebp
8048e21: cc e5 mov %esp,%ebp
And it will stop at the right place.
Then comes the question:
What would happen if I set the breakpoint not at the beginning of an instruction? I.e., if I input:
b *0x8048e22
will the debugger still replace the e5 with cc? So I wrote a simple example and ran it with gdb.
As you can see above, I set two breakpoints, and the second is in the middle of an instruction. I input r and stopped at the first breakpoint, then input c and the program ran to the end.
So it seems that gdb ignores the second breakpoint. (If it really replaced the byte with an int3, the program would be totally wrong.)
Question: What happens to the second breakpoint? More specifically, how does gdb deal with it (or is what I learned wrong)?
Edit: #dbrank already gave a great example of altering the data field of an instruction; I will try to make it more comprehensive with a similar example (this one seems to alter the register).
(Any reference about mechanism of breakpoint is appreciated!)
Inserting a breakpoint in the middle of an instruction will alter that instruction.
See this example of a program, where inserting a breakpoint overwrites the original value assigned to a variable (42, i.e. 0x2a) with the breakpoint opcode (0xcc, i.e. 204).
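For illustration, here is a minimal sketch (my own, not GDB's actual code) of how a debugger plants a software breakpoint on Linux/x86-64 with ptrace. Note that it simply rewrites the byte at whatever address it is given with 0xcc; it has no idea whether that address is the start of an instruction:
#include <errno.h>
#include <stdint.h>
#include <sys/ptrace.h>
#include <sys/types.h>

struct breakpoint {
    uintptr_t addr;   /* address requested by the user */
    long saved;       /* original word read from the target process */
};

static int set_breakpoint(pid_t pid, struct breakpoint *bp)
{
    errno = 0;
    bp->saved = ptrace(PTRACE_PEEKTEXT, pid, (void *)bp->addr, NULL);
    if (bp->saved == -1 && errno != 0)
        return -1;
    /* replace only the low byte (the byte at addr on x86-64) with int3 */
    long patched = (bp->saved & ~0xffL) | 0xcc;
    return (int)ptrace(PTRACE_POKETEXT, pid, (void *)bp->addr, (void *)patched);
}

static int clear_breakpoint(pid_t pid, const struct breakpoint *bp)
{
    /* restore the original bytes before letting the program run past addr */
    return (int)ptrace(PTRACE_POKETEXT, pid, (void *)bp->addr, (void *)bp->saved);
}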
You can find more about how breakpoints work here.
You can also look into GDB sources (breakpoint.c & infrun.c mostly).

Why is an empty function not just a return

If I compile an empty C function
void nothing(void)
{
}
using gcc -O2 -S (and clang) on MacOS, it generates:
_nothing:
pushq %rbp
movq %rsp, %rbp
popq %rbp
ret
Why does gcc not remove everything but the ret? It seems like an easy optimisation to make unless it really does something (seems not to, to me). This pattern (push/move at the beginning, pop at the end) is also visible in other non-empty functions where rbp is otherwise unused.
On Linux using a more recent gcc (4.4.5) I see just
nothing:
rep
ret
Why the rep ? The rep is absent in non-empty functions.
Why the rep ?
The reasons are explained in this blog post. In short, jumping directly to a single-byte ret instruction would mess up the branch prediction on some AMD processors. And rather than adding a nop before the ret, a meaningless prefix byte was added to save instruction decoding bandwidth.
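To make the "meaningless prefix byte" concrete (my own illustration, not from the blog post), the two encodings are:
c3       ret
f3 c3    rep ret
The f3 (rep) prefix has no defined effect on ret here; it only turns the return into the two-byte near-return form that the workaround calls for.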
The rep is absent in non-empty functions.
To quote from the blog post I linked to: "[rep ret] is preferred to the simple ret either when it is the target of any kind of branch, conditional (jne/je/...) or unconditional (jmp/call/...)".
In the case of an empty function, the ret would have been the direct target of a call. In a non-empty function, it wouldn't be.
Why does gcc not remove everything but the ret?
It's possible that some compilers won't omit the frame-pointer code even if you've specified -O2. At least with gcc, you can explicitly tell the compiler to omit it by using the -fomit-frame-pointer option; with -O2 -fomit-frame-pointer the push/mov/pop should disappear, leaving only the return instruction.
As explained here: http://support.amd.com/us/Processor_TechDocs/25112.PDF, a two-byte near-return instruction (i.e. rep ret) is used because a single-byte return can be mispredicted on some amd64 processors in some situations such as this one.
If you fiddle around with the processor targeted by gcc you may find that you can get it to generate a single-byte ret. -mtune=nocona worked for me.
I suspect, as johnfound says, that your last piece of code (the rep) is a bug. The first piece of code appears because every C compiler must always follow the _cdecl calling convention, which inside a function means (in Intel syntax; sorry, I don't know the AT&T syntax):
Function Definition
_functionA:
push rbp
mov rbp, rsp
;Some function
pop rbp
ret
In caller :
call _functionA
add rsp, 0 ; caller removes the arguments; since it is zero, some compilers can strip it
Why does GCC always follow the _cdecl calling convention even when following it is pointless? Because the compiler isn't smarter than an advanced assembly programmer, so it follows _cdecl at all costs.
That is because even so-called "optimizing compilers" are too dumb to always generate good machine code.
They can't generate better code than their creators made them to generate.
Since an empty function is nonsense, they probably simply didn't bother to optimize it or even to detect this very special case.
Although a lone "rep" prefix is probably a bug. It does nothing when used without a string instruction, but on some newer CPUs it theoretically can cause an exception (and IMHO it should).

gcc, start symbol and push $0x00

So I've searched the interwebs for weeks and cannot find the answer to my questions.
I can see that the start symbol added by gcc sets up the initial two arguments (int argc, char *argv[]) and, I believe, the third environment argument, but I have a couple of questions about this. Why does it add all of this if the main() function is defined to have no arguments? Wouldn't it save space (and, technically, processing time) if it just called _main without any arguments?
Second, what does push $0x0 do? I have done tests, and if you try to iterate through the command-line arguments the way the default start symbol does, then you need the push $0x0 at the beginning; otherwise I get misaligned-stack errors if I do something like the following:
push $0x00
call _main
mov %eax, %edi
call _exit
Also, in my investigations I have found that the start symbol is added by the linker when you link against crt1.10.6.o.
Any explanations or documentation would be greatly appreciated.
The code for _start, as you have found, comes from an object file named crt1.o or similar. That object file normally comes from source code that's part of the C library. It is compiled in ignorance of how you have declared your main function, and so it has to assume you wish to make use of all three potential arguments. It does no harm to set up the arguments even if main won't use them (other than a trivial amount of space and time, as you mention).
I deduce from mov %eax,%edi that this is the x86-64 ELF ABI, which means that there are no frame pointers, so the push $0x00 is indeed just to keep the stack aligned to a 16-byte boundary, as you speculate. (In the x86-32 ELF ABI, it would also have served as the terminator for the linked list of frame pointers.)

grdb not working variables

I know this is kind of a silly problem, but I just can't figure it out. I'm debugging this:
xor eax,eax
mov ah,[var1]
mov al,[var2]
call addition
stop: jmp stop
var1: db 5
var2: db 6
addition:
add ah,al
ret
The values that I find at the addresses of var1 and var2 are 0x0E and 0x07. I know it's not segmented, but that's no reason for it to do such escapades, because the call to addition works just fine. Could you please explain to me where my mistake is?
I see the problem, though I don't know how to fix it yet. The thing is, for some reason the instruction pointer starts at 0x100 and all the segment registers at 0x1628. To address an instruction, the combination used is, I guess, [cs:ip] (one of the segment registers plus the instruction pointer, for sure). The offset to var1 is 0x10 (probably because it's the 0x10th byte from the beginning of the code). I tried to examine the memory and what I got was:
1628:100 8 bytes
1628:108 8 bytes
1628:110 <- wtf? (assume another 8 bytes)
1628:118 ...
Whatever tricks are going on in memory, [cs:var1] points somewhere other than into my code, which is probably where the .data label would usually make ds point... probably. I don't know what is supposed to be at 1628:10.
OK, I found out what caused the problem and wasted my whole day on it. The behaviour described above is correct; the code is fully functional. What I didn't know is that the grdb debugger for some reason sets the beginning address to 0x100... The solution is to insert the directive ORG 0x100 on the first line, and that's the whole thing. The code was working because the instruction pointer has the right address of the first instruction and steps through one by one, but your assembler doesn't know what effective address your program will be loaded at, so everything stays relative to the first line of the code, which means all the variables (if not using a label for a data section) will point as if the code started at 0x0. That, of course, doesn't work with DOS, and grdb apparently emulates some DOS features... Thanks everyone for the effort; I hope this will spare someone some time if they run into the same problem.
Heh, at least now I know the reason to use a .data section. :)
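In other words (a worked illustration of why ORG 0x100 fixes it): DOS loads a .COM image at offset 0x100 in its segment, with the PSP occupying offsets 0x00-0xFF. Without ORG 0x100 the assembler computes var1's offset as if the code started at 0x0 (hence the 0x10 observed above), so [var1] reads a byte out of the PSP instead of the variable; with ORG 0x100 the same label becomes 0x10 + 0x100 = 0x110, which matches where the byte actually sits once the program is loaded.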
Assuming that is x86 assembly, var1 and var2 must reside in the .data section.
Explanation: I'm not going to explain exactly how the executable file is structured (not to mention this is platform-specific), but here's a general idea as to why what you're doing is not working.
Assembly code must be divided into sections because each section corresponds directly (or almost directly) to a specific part of the binary/executable file. All global variables must be defined in the .data section, since it has a corresponding location in the binary file, which is where all global data resides.
Defining a global variable (or a globally accessed part of the memory) inside the code section will lead to undefined behavior. Some x86 assemblers might even throw an error on this.

Resources