Does a linker generate absolute virtual addresses when linking - macos

Assume a simple hello world in C which, compiled using gcc -c to an object file and disassembled using objdump, looks like this:
_main:
0: 55 pushq %rbp
1: 48 89 e5 movq %rsp, %rbp
4: c7 45 fc 00 00 00 00 movl $0, -4(%rbp)
b: c7 45 f8 05 00 00 00 movl $5, -8(%rbp)
12: 8b 05 00 00 00 00 movl (%rip), %eax
As you can see, the memory addresses are 0, 1, 4, and so on. They are not actual addresses.
Linking the object file and disassembling it looks like this:
_main:
100000f90: 55 pushq %rbp
100000f91: 48 89 e5 movq %rsp, %rbp
100000f94: c7 45 fc 00 00 00 00 movl $0, -4(%rbp)
100000f9b: c7 45 f8 05 00 00 00 movl $5, -8(%rbp)
100000fa2: 8b 05 58 00 00 00 movl 88(%rip), %eax
My question is, is 100000f90 an actual address of a byte of virtual memory or is it an offset?
How can the linker give an actual address prior to execution? What if that memory address isn't available when executing? What if I execute it on another machine with much less memory (maybe paging kicks in here).
Isn't it the job of the loader to assign actual addresses?
Is the linker generating actual addresses for the final executable file?

(The following answers assume that the linker is not creating a position-independent executable.)
My question is, is 100000f90 an actual address of a byte of virtual memory or is it an offset?
It's the actual virtual address. Strictly speaking, it is the offset from the base of the code segment, but since modern operating systems always set the base of the code segment to 0, it is effectively the actual virtual address.
How can the linker give an actual address prior to execution? What if that memory address isn't available when executing? What if I execute it on another machine with much less memory (maybe paging kicks in here).
Each process gets its own separate virtual address space. Because it is virtual memory, the amount of physical memory in the machine doesn't matter. Paging is the mechanism by which virtual addresses get mapped to physical addresses.
Isn't it the job of the loader to assign actual addresses?
Yes, when creating a process, the operating system loader allocates physical page frames for the process and maps the pages into the process's virtual address space. But the virtual addresses are those assigned by the linker.

Does a linker generate absolute virtual addresses when linking
It depends upon the linker settings and the input. For general programming, linkers usually strive to create position-independent code.
My question is, is 100000f90 an actual address of a byte of virtual memory or is it an offset?
It is most likely an offset.
How can the linker give an actual address prior to execution?
Think about the loader for an operating system. It expects things to be in specific address locations. Any decent linker will allow the programmer to specify absolute addresses some way.
What if that memory address isn't available when executing? What if I execute it on another machine with much less memory (maybe paging kicks in here).
That's the problem with position-dependent code.
Isn't it the job of the loader to assign actual addresses?
The job of the loader is to follow the instructions given to it in the executable file. In creating the executable, the linker can specify addresses or defer to the loader in some cases.

Related

What does "nop dword ptr [rax+rax]" x64 assembly instruction do?

I'm trying to understand the x64 assembly optimization that is done by the compiler.
I compiled a small C++ project as Release build with Visual Studio 2008 SP1 IDE on Windows 8.1.
And one of the lines contained the following assembly code:
B8 31 00 00 00 mov eax,31h
0F 1F 44 00 00 nop dword ptr [rax+rax]
As far as I know, nop by itself does nothing, but I've never seen it with an operand like that.
Can someone explain what it does?
In a comment elsewhere on this page, Michael Petch points to a web page which describes the Intel x86 multi-byte NOP opcodes. The page has a table of useful information, but unfortunately the HTML is messed up so you can't read it. Here is some information from that page, plus that table presented in a readable form:
Multi-Byte NOP (http://www.felixcloutier.com/x86/NOP.html)
The one-byte NOP instruction is an alias mnemonic for the XCHG (E)AX, (E)AX instruction.
The multi-byte NOP instruction performs no operation on supported processors and generates an undefined opcode exception on processors that do not support the multi-byte NOP instruction.
The memory operand form of the instruction allows software to create a byte sequence of “no operation” as one instruction.
For situations where multiple-byte NOPs are needed, the recommended operations (32-bit mode and 64-bit mode) are: [my edit: in 64-bit mode, write rax instead of eax.]
Length   Assembly                                    Byte Sequence
-------  ------------------------------------------  --------------------------
1 byte   nop                                         90
2 bytes  66 nop                                      66 90
3 bytes  nop dword ptr [eax]                         0F 1F 00
4 bytes  nop dword ptr [eax + 00h]                   0F 1F 40 00
5 bytes  nop dword ptr [eax + eax*1 + 00h]           0F 1F 44 00 00
6 bytes  66 nop word ptr [eax + eax*1 + 00h]         66 0F 1F 44 00 00
7 bytes  nop dword ptr [eax + 00000000h]             0F 1F 80 00 00 00 00
8 bytes  nop dword ptr [eax + eax*1 + 00000000h]     0F 1F 84 00 00 00 00 00
9 bytes  66 nop word ptr [eax + eax*1 + 00000000h]   66 0F 1F 84 00 00 00 00 00
Note that the technique for selecting the right byte sequence--and thus the desired total size--may differ according to which assembler you are using.
For example, the following two lines of assembly taken from the table are ostensibly similar:
nop dword ptr [eax + 00h]
nop dword ptr [eax + 00000000h]
These differ only in the number of leading zeros, and some assemblers may make it hard to disable their "helpful" feature of always encoding the shortest possible byte sequence, which could make the second expression inaccessible.
For the multi-byte NOP situation, you don't want this "help" because you need to make sure that you actually get the desired number of bytes. So the issue is how to specify an exact combination of mod and r/m bits that ends up with the desired disp size--but via instruction mnemonics alone. This topic is complex, and certainly beyond the scope of my knowledge, but Scaled Indexing, MOD+R/M and SIB might be a starting place.
Now as I know you were just thinking, if you find it difficult or impossible to coerce your assembler's cooperation via instruction mnemonics you can always just resort to db ("define bytes") as a simple no-fuss alternative which is, um, guaranteed to work.
As pointed out in the comments, it is a multi-byte NOP usually used to align the subsequent instruction to a 16-byte boundary, when that instruction is the first instruction in a loop.
Such alignment can help with instruction fetch bandwidth, because instruction fetch often happens in units of 16 bytes, so aligning the top of a loop gives the greatest chance that the decoding occurs without bottlenecks.
Such alignment is arguably less important than it once was, with the introduction of the loop buffer and the uop cache, which are less sensitive to alignment. In some cases this optimization may even be a pessimization, especially when the loop executes very few times.
This code alignment is done where jump instructions jump from higher addresses to lower ones (0EBh XX - jmp short, and 0E9h XX XX XX XX - jmp near), where XX in both cases is a negative signed number. So the compiler aligns the chunk of code that is the jump target to a 10h-byte boundary. This is an optimization that speeds up code execution.

Does FTRACE invalidate the CPU instruction cache after it has modified the code instructions in memory?

As is well known, the kernel uses "mcount" as a placeholder to redirect CPU instruction execution during FTRACE operation. Eg:
c1003000 <run_init_process>:
c1003000: 55 push %ebp
c1003001: 89 e5 mov %esp,%ebp
c1003003: 83 ec 04 sub $0x4,%esp
c1003006: e8 21 e2 5c 00 call c15d122c <mcount>
c100300b: b9 80 4f 83 c1 mov $0xc1834f80,%ecx
c1003010: 64 8b 15 90 cf 95 c1 mov %fs:0xc195cf90,%edx
c1003017: a3 20 50 83 c1 mov %eax,0xc1835020
From the above, the instruction "call mcount" will be dynamically replaced with some other instruction during FTRACE operation.
The question is how safe the instruction replacement in kernel memory is, given that the CPU always preloads a certain number of instructions into its cache before execution. It may happen that, after the instruction has been loaded, the FTRACE operation replaces it in memory. But the CPU will still be executing the cached version, right? Or does FTRACE trigger a CPU instruction/data cache invalidation immediately after modifying the memory content? (Please provide a kernel source code reference.)
Thanks.
PS: Reference: http://people.redhat.com/srostedt/ftrace-tutorial.odp (slide 36 and 37 showed the instructions operation in memory when FTRACE is enabled on the function)
As briefly mentioned here:
http://lwn.net/Articles/556186/
FTRACE uses the "stop_machine" mechanism. In this mode, while one CPU is modifying the memory of the code area, all tasks are kept far away from executing it, so the CPU cache is unlikely to hold the code about to be executed, and it is therefore fine to modify the code in memory.

Omitting processor cache

I have a question I was given a while ago during a job interview; I was wondering about the processor's data cache. The question itself was connected with volatile variables: how can we avoid optimizing the memory accesses for those variables? From my understanding, when we read a volatile variable we need to bypass the processor cache. And this is what my question is about. What happens in such cases: is the entire cache flushed when such a variable is accessed? Or is there some register setting that says caching should be omitted for a memory region? Or is there a function for reading memory without looking in the cache? Or is it architecture dependent?
Thanks in advance for your time and answers.
There is some confusion here - the memory your program uses (through the compiler), is in fact an abstraction, maintained together by the OS and the processor. As such, you don't "need" to worry about paging, swapping, physical address space and performance.
Wait, before you jump and yell at me for talking nonsense - that was not to say you shouldn't care about them. When optimizing your code you might want to know what actually happens, so you have a set of tools to assist you (SW prefetches, for example), as well as a rough idea of how the system works (cache sizes and hierarchy), allowing you to write optimized code.
However, as I said, you don't have to worry about this, and if you don't - it's guaranteed to work "under the hood", to an extent. The cache, for example, is guaranteed to maintain coherency even when working with shared data (that's maintained through a set of pretty complicated HW protocols), and even in cases of virtual address aliases (multiple virtual addresses pointing to the same physical one). But here comes the "to an extent" part - in some cases you have to make sure you use it correctly. If you want to do memory-mapped IO, for example, you should define it properly so that the processor knows it shouldn't be cached. The compiler isn't likely to do this for you implicitly; it probably won't even know.
Now, volatile lives at an upper level; it's part of the contract between the programmer and the compiler. It means the compiler isn't allowed to do all sorts of optimizations with this variable that would be unsafe for the program even within the memory model abstraction. These are basically cases where the value can be modified externally at any point (through an interrupt, MMIO, other threads, ...). Keep in mind that the compiler still lives above the memory abstraction: if it decides to write something to memory or read it, then aside from possible hints it relies completely on the processor to do whatever is needed to keep this chunk of memory close at hand while maintaining correctness. However, a compiler is allowed much more freedom than the HW - it could decide to move reads/writes or eliminate variables altogether, something which the CPU in most cases isn't allowed to do, so you need to prevent that from happening if it's unsafe. Some nice examples of when that happens can be found here - http://www.barrgroup.com/Embedded-Systems/How-To/C-Volatile-Keyword
So while a volatile hint limits the freedom of the compiler inside the memory model, it doesn't necessarily limit the underlying HW. You probably don't want it to - say you have a volatile variable that you want to expose to other threads - if the compiler made it uncacheable it would ruin the performance (and needlessly). If on top of that you also want to protect the memory model from unsafe caching (which is just a subset of the cases where volatile might come in handy), you'll have to do so explicitly.
EDIT:
I felt bad for not adding any example, so to make it clearer - consider the following code:
#include <stdio.h>

int main() {
    int n = 20;
    int sum = 0;
    int x = 1;
    /*volatile */ int* px = &x;
    while (sum < n) {
        sum += *px;
        printf("%d\n", sum);
    }
    return 0;
}
This would count from 1 to 20 in jumps of x, which is 1. Let's see how gcc -O3 writes it:
0000000000400440 <main>:
400440: 53 push %rbx
400441: 31 db xor %ebx,%ebx
400443: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
400448: 83 c3 01 add $0x1,%ebx
40044b: 31 c0 xor %eax,%eax
40044d: be 3c 06 40 00 mov $0x40063c,%esi
400452: 89 da mov %ebx,%edx
400454: bf 01 00 00 00 mov $0x1,%edi
400459: e8 d2 ff ff ff callq 400430 <__printf_chk@plt>
40045e: 83 fb 14 cmp $0x14,%ebx
400461: 75 e5 jne 400448 <main+0x8>
400463: 31 c0 xor %eax,%eax
400465: 5b pop %rbx
400466: c3 retq
Note the add $0x1,%ebx - since the variable is considered "safe" enough by the compiler (volatile is commented out here), it allows itself to treat it as loop invariant. In fact, if I had not printed something on each iteration, the entire loop would have been optimized away, since gcc can tell the final outcome pretty easily.
However, uncommenting the volatile keyword, we get -
0000000000400440 <main>:
400440: 53 push %rbx
400441: 31 db xor %ebx,%ebx
400443: 48 83 ec 10 sub $0x10,%rsp
400447: c7 04 24 01 00 00 00 movl $0x1,(%rsp)
40044e: 66 90 xchg %ax,%ax
400450: 8b 04 24 mov (%rsp),%eax
400453: be 4c 06 40 00 mov $0x40064c,%esi
400458: bf 01 00 00 00 mov $0x1,%edi
40045d: 01 c3 add %eax,%ebx
40045f: 31 c0 xor %eax,%eax
400461: 89 da mov %ebx,%edx
400463: e8 c8 ff ff ff callq 400430 <__printf_chk@plt>
400468: 83 fb 13 cmp $0x13,%ebx
40046b: 7e e3 jle 400450 <main+0x10>
40046d: 48 83 c4 10 add $0x10,%rsp
400471: 31 c0 xor %eax,%eax
400473: 5b pop %rbx
400474: c3 retq
400475: 90 nop
Now the add operand is being read from the stack, as the compiler is led to suspect someone might change it. It's still cached, and as normal writeback-type memory it would catch any attempt to modify it from another thread or DMA, and the memory system would provide the new value (most likely the cache line would be snooped and invalidated, forcing the CPU to fetch the new value from whichever core owns it now). However, as I said, if x were not a normal cacheable memory address, but rather meant to be some MMIO or something else that might change silently beneath the memory system, then the cached value would be wrong (that's why MMIO shouldn't be cached), and the compiler would never know that even though it's considered volatile.
By the way - using volatile int x and adding it directly would produce the same result. Then again - making x or px global variables would also do that, the reason being that the compiler would suspect that someone might have access to it, and therefore would take the same precautions as with an explicit volatile hint. Interestingly enough, the same goes for making x local but copying its address into a global pointer (while still using x directly in the main loop). The compiler is quite cautious.
That is not to say it's 100% foolproof: you could in theory keep x local, have the compiler do the optimizations, and then "guess" the address somewhere from the outside (another thread, for example). This is when volatile does come in handy.
volatile variable, how can we not optimize the memory access for those variables.
Yes. volatile on a variable tells the compiler that the variable can be read or written in ways the programmer may foresee but that happen outside the program's scope and cannot be seen by the compiler. This means the compiler cannot perform optimizations on the variable that would alter the intended functionality, such as caching its value in a register and using the register copy on each iteration to avoid a memory access.
entire cache being flushed when the access for such variable is executed?
No. Ideally the compiler accesses the variable at its storage location, which doesn't flush the existing cache entries between the CPU and memory.
Or there is some register setting that caching should be omitted for a memory region?
Apparently, when the register is in uncached memory space, accessing that memory variable will give you the up-to-date value rather than one from cache memory. Again, this should be architecture dependent.

Machine Code Jump Destination Calculation

Ok, so I need to hook a program, but to do this I am going to copy the instructions E8 <Pointer to Byte Array that contains other code>. The problem with this is, that when I assemble Call 0x100 I get E8 FD, We know the E8 is the call instruction, so FD must be the destination, so how does the assembler take the destination from 0x100 into FD? Thanks, Bradley - Imcept
There is a plethora of jump/call opcodes and some of them are relative. I'd say you in fact got not E8 FD but E8 FD FF. E8 seems to be "call 16-bit relative" and 0x100 is the place where instructions are placed by default.
So you put call 0x100 at address 0x100, and the generated code is "do the call instruction, and jump -3 from the current instruction pointer". It is -3 because the displacement is computed from the position after the instruction has been read, which in the case of E8 FD FF is 0x103. That is why the displacement is FD FF, little-endian for 0xfffd, which is 16-bit -3.
http://wwwcsif.cs.ucdavis.edu/~davis/50/8086 Opcodes.htm
E8 is a 16 bit relative call. So for instance E8 00 10 means call the address at the PC+0x1000.

Absolute addressing for runtime code replacement in x86_64

I'm currently using a code replacement scheme in 32-bit mode where the code that is moved to another position reads variables and a class pointer. Since x86_64 does not support absolute addressing, I have trouble getting the correct addresses for the variables at the new position of the code. The problem in detail is that, because of RIP-relative addressing, the instruction pointer address is different from what it was at compile time.
So is there a way to use absolute addressing in x86_64 or another way to get addresses of variables not instruction pointer relative?
Something like: leaq variable(%%rax), %%rbx would also help. I only want to have no dependency on the instruction pointer.
Try using the large code model for x86_64. In gcc this can be selected with -mcmodel=large. The compiler will use 64 bit absolute addressing for both code and data.
You could also add -fno-pic to disallow the generation of position independent code.
Edit: I built a small test app with -mcmodel=large and the resulting binary contains sequences like
400b81: 48 b9 f0 30 60 00 00 movabs $0x6030f0,%rcx
400b88: 00 00 00
400b8b: 49 b9 d0 09 40 00 00 movabs $0x4009d0,%r9
400b92: 00 00 00
400b95: 48 8b 39 mov (%rcx),%rdi
400b98: 41 ff d1 callq *%r9
which is a load of an absolute 64 bit immediate (in this case an address) followed by an indirect call or an indirect load. The instruction sequence
movabs $variable, %rbx
addq %rax, %rbx
is the equivalent to a "leaq offset64bit(%rax), %rbx" (which doesn't exist), with some side effects like flag changing etc.
What you're asking about is doable, but not very easy.
One way to do it is to compensate for the code move in its instructions. You need to find all the instructions that use RIP-relative addressing (they have a ModRM byte of 05h, 0dh, 15h, 1dh, 25h, 2dh, 35h or 3dh) and adjust their disp32 field by the amount of the move (the move is therefore limited to +/- 2GB in the virtual address space, which may not be guaranteed given that the 64-bit address space is bigger than 4GB).
You can also replace those instructions with their equivalents, most likely replacing every original instruction with more than one, for example:
; These replace the original instruction and occupy exactly as many bytes as the original instruction:
JMP Equivalent1
NOP
NOP
Equivalent1End:
; This is the code equivalent to the original instruction:
Equivalent1:
Equivalent subinstruction 1
Equivalent subinstruction 2
...
JMP Equivalent1End
Both methods will require at least some rudimentary x86 disassembly routines.
The former may require the use of VirtualAlloc() on Windows (or some equivalent on Linux) to ensure the memory that contains the patched copy of the original code is within +/- 2GB of that original code. And allocation at specific addresses can still fail.
The latter will require more than just primitive disassembling: it also needs full instruction decoding and generation.
There may be other quirks to work around.
Instruction boundaries may also be found by setting the TF flag in the RFLAGS register to make the CPU generate the single-step debug interrupt at the end of execution of every instruction. A debug exception handler will need to catch those and record the value of RIP of the next instruction. I believe this can be done using Structured Exception Handling (SEH) in Windows (never tried with the debug interrupts), not sure about Linux. For this to work you'll have to make all of the code execute, every instruction.
Btw, there's absolute addressing in 64-bit mode, see, for example the MOV to/from accumulator instructions with opcodes from 0A0h through 0A3h.
