Bytes at the end of functions in objdump - gcc

When I disassemble a binary (compiled with g++) with objdump, I often see "random" bytes at the end of the contained functions, such as:
4005a5: 66 66 2e 0f 1f 84 00 data32 nopw %cs:0x0(%rax,%rax,1)
4005ac: 00 00 00 00
What are those bytes? Why does the compiler put them there?
EDIT:
apparently those bytes represent a long NOP instruction put there by the compiler to keep functions 16-byte aligned. The weird thing is that the only function which is not 16-byte aligned is the main function. Are there any reasons?
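The padding length in the listing above can be checked with a little arithmetic. This is only a minimal sketch in Python; the helper name padding_to_alignment is made up for illustration, and 16 is assumed as the alignment because that is what the edit above describes.

```python
def padding_to_alignment(address: int, alignment: int = 16) -> int:
    """Return how many filler bytes are needed so that the next
    function starts on a multiple of `alignment`."""
    return (-address) % alignment


# Address of the first padding byte in the objdump listing above.
end_of_function = 0x4005A5

pad = padding_to_alignment(end_of_function)
next_function = end_of_function + pad

# 7 bytes on the first objdump line plus 4 on the second make 11.
print(pad, hex(next_function))  # 11 0x4005b0
```

The 11 bytes shown by objdump (7 on the first line, 4 on the second) are exactly what is needed to reach 0x4005b0, the next multiple of 16.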

Related

Does a linker generate absolute virtual addresses when linking

Assume a simple hello world in C, compiled with gcc -c to an object file. Disassembled with objdump, it looks like this:
_main:
0: 55 pushq %rbp
1: 48 89 e5 movq %rsp, %rbp
4: c7 45 fc 00 00 00 00 movl $0, -4(%rbp)
b: c7 45 f8 05 00 00 00 movl $5, -8(%rbp)
12: 8b 05 00 00 00 00 movl (%rip), %eax
As you can see the memory addresses are 0, 1, 4, .. and so on. They are not actual addresses.
Linking the object file and disassembling it looks like this:
_main:
100000f90: 55 pushq %rbp
100000f91: 48 89 e5 movq %rsp, %rbp
100000f94: c7 45 fc 00 00 00 00 movl $0, -4(%rbp)
100000f9b: c7 45 f8 05 00 00 00 movl $5, -8(%rbp)
100000fa2: 8b 05 58 00 00 00 movl 88(%rip), %eax
My question is, is 100000f90 an actual address of a byte of virtual memory or is it an offset?
How can the linker give an actual address prior to execution? What if that memory address isn't available when executing? What if I execute it on another machine with much less memory (maybe paging kicks in here).
Isn't it the job of the loader to assign actual addresses?
Is the linker generating actual addresses for the final executable file?
(The following answers assume that the linker is not creating a position-independent executable.)
My question is, is 100000f90 an actual address of a byte of virtual memory or is it an offset?
It's the actual virtual address. Strictly speaking, it is the offset from the base of the code segment, but since modern operating systems always set the base of the code segment to 0, it is effectively the actual virtual address.
How can the linker give an actual address prior to execution? What if that memory address isn't available when executing? What if I execute it on another machine with much less memory (maybe paging kicks in here).
Each process gets its own separate virtual address space. Because it is virtual memory, the amount of physical memory in the machine doesn't matter. Paging is the process by which virtual addresses get mapped to physical addresses.
Isn't it the job of the loader to assign actual addresses?
Yes, when creating a process, the operating system loader allocates physical page frames for the process and maps the pages into the process's virtual address space. But the virtual addresses are those assigned by the linker.
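As a rough illustration of how the linked listing in the question already encodes a complete virtual address, here is a small Python sketch. The function name rip_relative_target is hypothetical. The instruction address, its length (6 bytes: 8b 05 58 00 00 00) and the displacement 88 (0x58) are read straight from the disassembly in the question.

```python
def rip_relative_target(insn_address: int, insn_length: int, displacement: int) -> int:
    """Address referenced by a RIP-relative operand.

    RIP holds the address of the *next* instruction when the
    displacement is applied, hence the added instruction length.
    """
    return insn_address + insn_length + displacement


# 100000fa2: 8b 05 58 00 00 00   movl 88(%rip), %eax
target = rip_relative_target(0x100000FA2, 6, 0x58)
print(hex(target))  # 0x100001000
```

In the object file the same instruction carries the displacement 00 00 00 00, a placeholder that the linker fills in once it has decided where the code and the data will live.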
Does a linker generate absolute virtual addresses when linking
It depends upon the linker setting and the input source. For general programming, linkers usually strive to create position independent code.
My question is, is 100000f90 an actual address of a byte of virtual memory or is it an offset?
It is most likely an offset.
How can the linker give an actual address prior to execution?
Think about the loader for an operating system. It expects things to be in specific address locations. Any decent linker will allow the programmer to specify absolute addresses some way.
What if that memory address isn't available when executing? What if I execute it on another machine with much less memory (maybe paging kicks in here).
That's the problem with position-dependent code.
Isn't it the job of the loader to assign actual addresses?
The job of the loader is to follow the instructions given to it in the executable file. In creating the executable, the linker can specify addresses or defer to the loader in some cases.

What does "nop dword ptr [rax+rax]" x64 assembly instruction do?

I'm trying to understand the x64 assembly optimization that is done by the compiler.
I compiled a small C++ project as Release build with Visual Studio 2008 SP1 IDE on Windows 8.1.
And one of the lines contained the following assembly code:
B8 31 00 00 00 mov eax,31h
0F 1F 44 00 00 nop dword ptr [rax+rax]
As far as I know, nop by itself does nothing, but I've never seen it with an operand like that.
Can someone explain what it does?
In a comment elsewhere on this page, Michael Petch points to a web page which describes the Intel x86 multi-byte NOP opcodes. The page has a table of useful information, but unfortunately the HTML is messed up so you can't read it. Here is some information from that page, plus that table presented in a readable form:
Multi-Byte NOP: http://www.felixcloutier.com/x86/NOP.html
The one-byte NOP instruction is an alias mnemonic for the XCHG (E)AX, (E)AX instruction.
The multi-byte NOP instruction performs no operation on supported processors and generates undefined opcode exception on processors that do not support the multi-byte NOP instruction.
The memory operand form of the instruction allows software to create a byte sequence of “no operation” as one instruction.
For situations where multiple-byte NOPs are needed, the recommended operations (32-bit mode and 64-bit mode) are: [my edit: in 64-bit mode, write rax instead of eax.]
Length   Assembly                                     Byte Sequence
-------  -------------------------------------------  --------------------------
1 byte   nop                                          90
2 bytes  66 nop                                       66 90
3 bytes  nop dword ptr [eax]                          0F 1F 00
4 bytes  nop dword ptr [eax + 00h]                    0F 1F 40 00
5 bytes  nop dword ptr [eax + eax*1 + 00h]            0F 1F 44 00 00
6 bytes  66 nop word ptr [eax + eax*1 + 00h]          66 0F 1F 44 00 00
7 bytes  nop dword ptr [eax + 00000000h]              0F 1F 80 00 00 00 00
8 bytes  nop dword ptr [eax + eax*1 + 00000000h]      0F 1F 84 00 00 00 00 00
9 bytes  66 nop word ptr [eax + eax*1 + 00000000h]    66 0F 1F 84 00 00 00 00 00
Note that the technique for selecting the right byte sequence--and thus the desired total size--may differ according to which assembler you are using.
For example, the following two lines of assembly taken from the table are ostensibly similar:
nop dword ptr [eax + 00h]
nop dword ptr [eax + 00000000h]
These differ only in the number of leading zeros, and some assemblers may make it hard to disable their "helpful" feature of always encoding the shortest possible byte sequence, which could make the second expression inaccessible.
For the multi-byte NOP situation, you don't want this "help" because you need to make sure that you actually get the desired number of bytes. So the issue is how to specify an exact combination of mod and r/m bits that ends up with the desired disp size--but via instruction mnemonics alone. This topic is complex, and certainly beyond the scope of my knowledge, but Scaled Indexing, MOD+R/M and SIB might be a starting place.
Now as I know you were just thinking, if you find it difficult or impossible to coerce your assembler's cooperation via instruction mnemonics you can always just resort to db ("define bytes") as a simple no-fuss alternative which is, um, guaranteed to work.
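To make the db route a bit more concrete, here is a small Python sketch that builds padding of an exact length from the byte sequences in the table above and prints it as a db line. The names RECOMMENDED_NOPS, nop_padding and as_db_line are invented for this example, the "longest NOP first" strategy is just one simple choice, and the 0x.. literal style is an assumption that may need adjusting for your assembler.

```python
# Byte sequences copied from the table above, keyed by length in bytes.
RECOMMENDED_NOPS = {
    1: bytes.fromhex("90"),
    2: bytes.fromhex("6690"),
    3: bytes.fromhex("0f1f00"),
    4: bytes.fromhex("0f1f4000"),
    5: bytes.fromhex("0f1f440000"),
    6: bytes.fromhex("660f1f440000"),
    7: bytes.fromhex("0f1f8000000000"),
    8: bytes.fromhex("0f1f840000000000"),
    9: bytes.fromhex("660f1f840000000000"),
}


def nop_padding(length: int) -> bytes:
    """Return exactly `length` bytes of padding, using as few NOP
    instructions as possible (longest sequence first)."""
    if length < 0:
        raise ValueError("length must be non-negative")
    out = bytearray()
    while length > 0:
        chunk = min(length, 9)  # 9 bytes is the longest entry in the table
        out += RECOMMENDED_NOPS[chunk]
        length -= chunk
    return bytes(out)


def as_db_line(code: bytes) -> str:
    """Format raw bytes as a 'db' directive (literal style is an assumption)."""
    return "db " + ", ".join(f"0x{b:02X}" for b in code)


print(as_db_line(nop_padding(5)))  # db 0x0F, 0x1F, 0x44, 0x00, 0x00
```

The 5-byte result is the same 0F 1F 44 00 00 sequence that appears in the question as nop dword ptr [rax+rax].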
As pointed out in the comments, it is a multi-byte NOP usually used to align the subsequent instruction to a 16-byte boundary, when that instruction is the first instruction in a loop.
Such alignment can help with instruction fetch bandwidth, because instruction fetch often happens in units of 16 bytes, so aligning the top of a loop gives the greatest chance that the decoding occurs without bottlenecks.
Such alignment is arguably less important than it once was, with the introduction of the loop buffer and the uop cache, which are less sensitive to alignment. In some cases this optimization may even be a pessimization, especially when the loop executes very few times.
This code alignment is done when there are jump instructions that jump from higher addresses to lower ones (0EBh XX - jmp short, and 0E9h XX XX XX XX - jmp near), where XX in both cases is a negative signed displacement. So the compiler aligns the chunk of code that is the target of the jump to a 10h-byte boundary. This can speed up code execution.

Does FTRACE invalidate the CPU instruction cache after it has modified the code instructions in memory?

As is well known, the kernel uses "mcount" as a placeholder to redirect CPU instruction execution during FTRACE operation. E.g.:
c1003000 <run_init_process>:
c1003000: 55 push %ebp
c1003001: 89 e5 mov %esp,%ebp
c1003003: 83 ec 04 sub $0x4,%esp
c1003006: e8 21 e2 5c 00 call c15d122c <mcount>
c100300b: b9 80 4f 83 c1 mov $0xc1834f80,%ecx
c1003010: 64 8b 15 90 cf 95 c1 mov %fs:0xc195cf90,%edx
c1003017: a3 20 50 83 c1 mov %eax,0xc1835020
In the listing above, the instruction "call mcount" will be dynamically replaced with some other instruction during FTRACE operation.
The question is how safe this instruction replacement in kernel memory is, given that the CPU always preloads a certain number of instructions into its cache before execution. It may happen that, after the instruction has been loaded, the FTRACE operation replaces it in memory. But the CPU will still be executing the cached version, right? Or does FTRACE trigger a CPU instruction/data cache invalidation immediately after modifying the memory content? (Please provide a kernel source code reference.)
Thanks.
PS: Reference: http://people.redhat.com/srostedt/ftrace-tutorial.odp (slide 36 and 37 showed the instructions operation in memory when FTRACE is enabled on the function)
As briefly mentioned here:
http://lwn.net/Articles/556186/
FTRACE uses the "stop_machine" mechanism. In this mode, while the CPU is modifying the memory of the tasks' code area, all tasks are kept well away from executing that code, so the CPU cache is unlikely to hold the code that is being changed, and it is fine to modify the code in memory.

Machine Code Jump Destination Calculation

Ok, so I need to hook a program, and to do this I am going to write the instruction E8 <Pointer to Byte Array that contains other code>. The problem is that when I assemble Call 0x100 I get E8 FD. We know E8 is the call instruction, so FD must be the destination. So how does the assembler turn the destination 0x100 into FD? Thanks, Bradley - Imcept
There is a plethora of jump/call opcodes and some of them are relative. I'd say you in fact got not E8 FD but E8 FD FF. E8 seems to be "call 16-bit relative" and 0x100 is the place where instructions are placed by default.
So you put call 0x100 at address 0x100, and the generated code is "do the jump instruction, and jump -3 from the actual instruction pointer". It is -3 because the shift is computed from the position after the instruction has been read, which in the case of E8 FD FF is 0x103. That is why the shift is FD FF, little-endian for 0xfffd, which is 16-bit -3.
http://wwwcsif.cs.ucdavis.edu/~davis/50/8086 Opcodes.htm
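E8 is a 16 bit relative call. So for instance E8 00 10 means call the address at PC+0x1000, where PC is the address of the instruction following the call and the two displacement bytes are stored little-endian.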
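The calculation described in these answers can be written out as a short Python sketch. It assumes the 16-bit form of E8 discussed here (opcode plus a 2-byte displacement, 3 bytes in total); the function names encode_call_rel16 and decode_call_rel16 are made up for illustration.

```python
import struct

CALL_REL16_LENGTH = 3  # E8 opcode + two displacement bytes


def encode_call_rel16(call_address: int, target: int) -> bytes:
    """Encode 'call target' placed at call_address as E8 + rel16."""
    next_insn = call_address + CALL_REL16_LENGTH
    # The displacement is relative to the instruction after the call,
    # wrapped to 16 bits and stored little-endian.
    rel = (target - next_insn) & 0xFFFF
    return b"\xE8" + struct.pack("<H", rel)


def decode_call_rel16(call_address: int, code: bytes) -> int:
    """Return the target address of an E8 rel16 call at call_address."""
    (rel,) = struct.unpack("<h", code[1:3])  # signed 16-bit, little-endian
    return (call_address + CALL_REL16_LENGTH + rel) & 0xFFFF


print(encode_call_rel16(0x100, 0x100).hex(" "))               # e8 fd ff
print(hex(decode_call_rel16(0x100, bytes.fromhex("e80010"))))  # 0x1103
```

For hooking, the same arithmetic applies in the other direction: pick the address where the call will live and the address of the replacement code, and the displacement is their difference minus the length of the call instruction. In 32-bit code E8 takes a 4-byte displacement instead, so the length would be 5.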

What does the beginning of process memory mean

I am trying to learn more about how to read process memory. So I opened the "entire memory" of the Firefox process in WinHex and saw the following hex values starting at offset 10000.
00 00 00 00 00 00 00 00 EC 6B 3F 80 0C 6D 00 01 EE FF EE FF 01 00 00 00
My question is
Is it possible for a human to interpret this without further knowledge? Are these pointers or values? Is there anything, which is common for different programs created with different compilers with regards to the process memory apart from things like endianness? Why does it start with lots of zeroes, isn't that a very odd way to start using space?
Obviously, you can't do anything "without further knowledge". But we already know a whole lot from the fact that it's Windows. For starters, we know that the executable gets its own view of memory, and in that virtual view the executable is loaded at its preferred starting address (as stated in the PE header of the EXE).
The start at 0x00010000 is a compatibility thing with MS-DOS (yes, that 16 bit OS) - the first 64KB are reserved and are never valid addresses. The pages up to 0x00400000 (4MB) are reserved for the OS, and in general differ between OS versions.
A common data structure in that range is the Process Environment Block. With the WinDBG tool and the Microsoft Symbol Server, you can figure out whether the Process Environment Block is indeed located at offset 0x10000, and what its contents mean.
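As a first step before reaching for a debugger, the bytes from the question can at least be regrouped into little-endian 32-bit values, which is how x86 Windows stores integers and pointers. This is only a sketch in Python: it shows the regrouping and makes no claim about what the values mean.

```python
import struct

# The 24 bytes quoted in the question, starting at offset 10000.
dump = bytes.fromhex(
    "00 00 00 00 00 00 00 00 EC 6B 3F 80 0C 6D 00 01 "
    "EE FF EE FF 01 00 00 00"
)

# Interpret the dump as six unsigned little-endian 32-bit words.
dwords = struct.unpack("<6I", dump)

for offset, value in zip(range(0, len(dump), 4), dwords):
    print(f"+0x{offset:02X}: 0x{value:08X}")
```

Whether any of these words is a pointer, a flag field or a signature is exactly the kind of question the symbol information mentioned above can answer.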
