Fully understanding how an .exe file is executed - Windows
Goal
I want to understand how executables work. I hope that understanding one very specific example in full detail will enable me to do so. My final (perhaps too ambitious) goal is to take a hello-world .exe file (compiled with a C compiler and linked) and understand in full detail how it is loaded into memory and executed by an x86 processor. If I succeed in doing so, I want to write an article and/or make a video about it, since I have not found something like this on the internet.
Specific questions I want to ask are marked in bold. Of course any further suggestions and sources doing something similar are very welcome. Thanks a lot in advance for any help!
What I need
This Answer gives an overview of the process that C code goes through until it gets into physical memory as a program. I'm not sure yet how much I want to look into how the C code is compiled. Is there a way to view the assembly code a C compiler generates before assembling it? I may decide it's worth the effort to understand the processes of loading and linking. In the meantime the most important parts I need to understand are
the PE executable file format
the relation between assembler code and x86 byte-code
the process of loading (i.e. how the process RAM is prepared for execution using information from the executable file).
I have a very basic understanding of the PE format (this understanding will be outlined in the section "What I have learned so far") and I think the sources given there should be sufficient. I just need to look into it some more until I know enough to understand a basic Hello-World program. Further sources on this topic are of course very welcome.
The translation of byte-code into assembler code (disassembly) seems to be quite difficult for x86. Nonetheless, I would love to learn more about it. How would you go about disassembling a short byte code segment?
I'm still looking for a way to view the contents of a process' memory (the virtual memory assigned to it). I've already looked into Windows kernel32.dll functions such as ReadProcessMemory but couldn't get it to work yet. Also it's strange to me that there don't seem to be (free) tools available for this. Together with an understanding of loading, I may then be able to understand how a process is run from RAM. I'm also looking for debugging tools for assembly programmers that allow viewing the entire virtual memory contents of a process. My current starting point of this search is this question. Do you have further advice on how I can see and understand loading and process execution from RAM?
What I have learned so far
The rest of this StackOverflow question describes what I have learned so far in some detail and gives various sources. It is meant to be reproducible and to help anyone trying to understand this. However, I still have some questions about the example I have looked at so far.
PE format
In Windows, an executable file follows the PE format. The official documentation and this article give a good overview of the format. The format describes what the individual bytes in an .exe file mean. The beginning is a DOS program (included for legacy reasons) that I will not worry about. Then comes a bunch of headers, which give information about the executable. The actual file contents are split into sections that have names, such as '.rdata'. After the file headers, there are also section headers, which tell you which parts of the file belong to which section and what each section does (e.g. if it contains executable code).
The headers and sections can be parsed using tools such as dumpbin (a Microsoft tool to look at binary files). For comparison with dumpbin output, the hex code of a file can be viewed directly with a hex editor or even using PowerShell (command Format-Hex -Path <Path to file>).
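To make "the format describes what the individual bytes mean" concrete, here is a minimal Python sketch that reads the first headers by hand. It assumes the layout given in the PE format documentation: the DOS header starts with "MZ", the 4-byte field at offset 0x3C holds the file offset of the "PE\0\0" signature, and a 20-byte COFF file header follows the signature. The helper names (parse_pe_headers, make_demo_image, timestamp_utc) are made up for this sketch, and the demo buffer is a synthetic header-only blob filled with the values from the dumpbin output further down, not a runnable executable.

```python
import struct
from datetime import datetime, timezone


def parse_pe_headers(data):
    """Return the COFF file header fields from the bytes of a PE file."""
    if data[:2] != b"MZ":
        raise ValueError("missing DOS 'MZ' signature")
    # e_lfanew: little-endian file offset of the PE signature, stored at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    # The 20-byte COFF file header directly follows the 4-byte signature.
    fields = struct.unpack_from("<HHIIIHH", data, pe_offset + 4)
    names = ("machine", "number_of_sections", "time_date_stamp",
             "pointer_to_symbol_table", "number_of_symbols",
             "size_of_optional_header", "characteristics")
    return dict(zip(names, fields))


def timestamp_utc(value):
    """Interpret the time date stamp as seconds since 1970-01-01 UTC."""
    return datetime.fromtimestamp(value, tz=timezone.utc)


def make_demo_image():
    """Synthetic header-only buffer (assumption: for illustration only)."""
    dos = bytearray(0x80)
    dos[0:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 0x80)  # PE signature placed at 0x80
    coff = struct.pack("<HHIIIHH", 0x14C, 2, 0x5E96C000, 0, 0, 0xE0, 0x102)
    return bytes(dos) + b"PE\0\0" + coff


headers = parse_pe_headers(make_demo_image())
print({name: hex(value) for name, value in headers.items()})
print(timestamp_utc(headers["time_date_stamp"]))
```

To try it on a real file, replace make_demo_image() with open("test.exe", "rb").read(). The time stamp decodes to 08:04:16 UTC, while dumpbin prints 10:04:16, presumably because it shows local time.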
Specific example
I performed these steps for a very simple program, which does nothing. This is the code:
; NASM assembler program. Does nothing. Stores string in code section.
; Adapted from stackoverflow.com/a/1029093/9988487
global _main
section .text
_main:
hlt
db 'Hello, World', 10
I assembled it with NASM (command nasm -fwin32 filename.asm) and linked it with the linker that comes with VS2019 (link /subsystem:console /nodefaultlib /entry:main test.obj). This is adapted from this answer, which demonstrates how to make a hello-world program for Windows using a WinAPI call. The program runs on Windows 10 and terminates with no output. It takes about 2 sec to run, which seems very long and makes me think there may be some error somewhere?
I then looked at the dumpbin output:
D:\ASM>dumpbin test.exe /ALL
Microsoft (R) COFF/PE Dumper Version 14.22.27905.0
Copyright (C) Microsoft Corporation. All rights reserved.
Dump of file test.exe
PE signature found
File Type: EXECUTABLE IMAGE
FILE HEADER VALUES
14C machine (x86)
2 number of sections
5E96C000 time date stamp Wed Apr 15 10:04:16 2020
0 file pointer to symbol table
0 number of symbols
E0 size of optional header
102 characteristics
Executable
32 bit word machine
OPTIONAL HEADER VALUES
10B magic # (PE32)
14.22 linker version
200 size of code
200 size of initialized data
0 size of uninitialized data
1000 entry point (00401000)
1000 base of code
2000 base of data
400000 image base (00400000 to 00402FFF)
1000 section alignment
200 file alignment
<further header values omitted ...>
SECTION HEADER #1
.text name
E virtual size
1000 virtual address (00401000 to 0040100D)
200 size of raw data
200 file pointer to raw data (00000200 to 000003FF)
0 file pointer to relocation table
0 file pointer to line numbers
0 number of relocations
0 number of line numbers
60000020 flags
Code
Execute Read
RAW DATA #1
00401000: F4 48 65 6C 6C 6F 2C 20 57 6F 72 6C 64 0A ôHello, World.
SECTION HEADER #2
.rdata name
58 virtual size
2000 virtual address (00402000 to 00402057)
200 size of raw data
400 file pointer to raw data (00000400 to 000005FF)
0 file pointer to relocation table
0 file pointer to line numbers
0 number of relocations
0 number of line numbers
40000040 flags
Initialized Data
Read Only
RAW DATA #2
00402000: 00 00 00 00 00 C0 96 5E 00 00 00 00 0D 00 00 00 .....À.^........
00402010: 3C 00 00 00 1C 20 00 00 1C 04 00 00 00 00 00 00 <.... ..........
00402020: 00 10 00 00 0E 00 00 00 2E 74 65 78 74 00 00 00 .........text...
00402030: 00 20 00 00 1C 00 00 00 2E 72 64 61 74 61 00 00 . .......rdata..
00402040: 1C 20 00 00 3C 00 00 00 2E 72 64 61 74 61 24 7A . ..<....rdata$z
00402050: 7A 7A 64 62 67 00 00 00 zzdbg...
Debug Directories
Time Type Size RVA Pointer
-------- ------- -------- -------- --------
5E96C000 coffgrp 3C 0000201C 41C
Summary
1000 .rdata
1000 .text
The file header field "characteristics" is a combination of flags. In particular 102h = 1 0000 0010b and the two set flags (according to the PE format doc) are IMAGE_FILE_EXECUTABLE_IMAGE and IMAGE_FILE_BYTES_REVERSED_HI. The latter has description
IMAGE_FILE_BYTES_REVERSED_HI:
Big endian: the MSB precedes the LSB in memory. This flag is deprecated and should be zero.
I ask myself: Why does a modern assembler and a modern linker produce a deprecated flag?
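One way to double-check a flag reading like this is to decode the value mechanically. The sketch below uses the flag values from the "Characteristics" table of the PE format documentation; the function name decode_characteristics is made up for this example. With these values, 102h decodes to IMAGE_FILE_EXECUTABLE_IMAGE (0x0002) and IMAGE_FILE_32BIT_MACHINE (0x0100), which matches the "Executable" and "32 bit word machine" lines that dumpbin printed, while IMAGE_FILE_BYTES_REVERSED_HI is 0x8000 and would not be set.

```python
# Flag values as listed in the PE format documentation (COFF file header,
# "Characteristics" field).
IMAGE_FILE_FLAGS = {
    0x0001: "IMAGE_FILE_RELOCS_STRIPPED",
    0x0002: "IMAGE_FILE_EXECUTABLE_IMAGE",
    0x0004: "IMAGE_FILE_LINE_NUMS_STRIPPED",
    0x0008: "IMAGE_FILE_LOCAL_SYMS_STRIPPED",
    0x0010: "IMAGE_FILE_AGGRESSIVE_WS_TRIM",
    0x0020: "IMAGE_FILE_LARGE_ADDRESS_AWARE",
    0x0080: "IMAGE_FILE_BYTES_REVERSED_LO",
    0x0100: "IMAGE_FILE_32BIT_MACHINE",
    0x0200: "IMAGE_FILE_DEBUG_STRIPPED",
    0x0400: "IMAGE_FILE_REMOVABLE_RUN_FROM_SWAP",
    0x0800: "IMAGE_FILE_NET_RUN_FROM_SWAP",
    0x1000: "IMAGE_FILE_SYSTEM",
    0x2000: "IMAGE_FILE_DLL",
    0x4000: "IMAGE_FILE_UP_SYSTEM_ONLY",
    0x8000: "IMAGE_FILE_BYTES_REVERSED_HI",
}


def decode_characteristics(value):
    """Return the names of all flags set in a characteristics value."""
    return [name for bit, name in sorted(IMAGE_FILE_FLAGS.items())
            if value & bit]


print(decode_characteristics(0x102))
```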
There are 2 sections in the file. The section .text was defined in the assembler code (and is the only one containing executable code, as specified in its header). I don't know what the second section '.rdata' (name seems to refer to "readable data") is or does here. Why was it created? How could I find out?
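The section headers are what connect addresses in memory to offsets in the file. The following Python sketch hard-codes the two section headers from the dumpbin output above and converts a relative virtual address (RVA) to a file offset. The names Section, rva_to_file_offset and align_up are assumptions made for this example, not part of any tool.

```python
from collections import namedtuple

Section = namedtuple(
    "Section", "name virtual_address virtual_size raw_pointer raw_size")

# Values copied from SECTION HEADER #1 and #2 in the dumpbin output.
SECTIONS = [
    Section(".text", 0x1000, 0x0E, 0x200, 0x200),
    Section(".rdata", 0x2000, 0x58, 0x400, 0x200),
]
IMAGE_BASE = 0x400000


def rva_to_file_offset(rva, sections=SECTIONS):
    """Map an RVA to a file offset, or None if it has no bytes in the file."""
    for s in sections:
        if s.virtual_address <= rva < s.virtual_address + s.virtual_size:
            offset = rva - s.virtual_address
            if offset >= s.raw_size:
                return None  # exists only in memory (zero-filled)
            return s.raw_pointer + offset
    raise ValueError("RVA %#x is not inside any section" % rva)


def align_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


entry_rva = 0x401000 - IMAGE_BASE            # entry point as an RVA
print(hex(rva_to_file_offset(entry_rva)))    # file offset of the hlt byte
print(hex(rva_to_file_offset(0x201C)))       # debug directory data
print(hex(align_up(0x0E, 0x200)), hex(align_up(0x0E, 0x1000)))
```

The second lookup reproduces the "RVA 0000201C, Pointer 41C" pair from the Debug Directories table, and the two alignments show why 0xE bytes of code occupy 0x200 bytes in the file (file alignment) and 0x1000 bytes in the summary (section alignment).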
Disassembly
I used dumpbin to disassemble the .exe file (command dumpbin test.exe /DISASM). It gets the hlt correct, but the 'Hello, World.' string is (perhaps unfortunately) interpreted as executable instructions. I guess the disassembler can hardly be blamed for this. However, if I understand correctly (I have no practical experience in assembly programming), putting data into a code section is not unheard of (it was done in several examples that I found while looking into assembly programming). Is there a better way to disassemble this that would reproduce my assembly code more faithfully? Also, do compilers sometimes put data into code sections in this way?
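To see why the string turns into instructions, here is a toy linear sweep in Python over the 14 bytes of the .text section shown in RAW DATA #1. The opcode table is deliberately tiny: it only covers the one-byte opcodes and two segment prefixes that happen to occur in these bytes (32-bit mode), so it is a sketch of the idea and not a real disassembler. The names ONE_BYTE, PREFIXES and linear_sweep are invented for this example.

```python
# Partial table for 32-bit mode: opcode -> (mnemonic, immediate bytes).
ONE_BYTE = {
    0xF4: ("hlt", 0),
    0x48: ("dec eax", 0),
    0x57: ("push edi", 0),
    0x6C: ("insb", 0),
    0x6F: ("outsd", 0),
    0x2C: ("sub al, imm8", 1),
    0x72: ("jb rel8", 1),
}
PREFIXES = {0x64: "fs", 0x65: "gs"}  # segment override prefix bytes

TEXT_BYTES = bytes.fromhex("F4 48 65 6C 6C 6F 2C 20 57 6F 72 6C 64 0A")


def linear_sweep(code):
    """Decode from offset 0 onwards, treating every byte as code."""
    out = []
    i = 0
    while i < len(code):
        start = i
        prefix = ""
        while i < len(code) and code[i] in PREFIXES:
            prefix += PREFIXES[code[i]] + " "
            i += 1
        if i >= len(code) or code[i] not in ONE_BYTE:
            # Outside this sketch's table (or truncated): stop here.
            out.append((start, code[start:].hex(), "(not decoded here)"))
            break
        name, imm = ONE_BYTE[code[i]]
        i += 1 + imm
        out.append((start, code[start:i].hex(), prefix + name))
    return out


for offset, raw, text in linear_sweep(TEXT_BYTES):
    print("%02x: %-6s %s" % (offset, raw, text))
```

Only the first line (hlt) corresponds to the source. Everything after it is the ASCII text being read as opcodes, for example 'H' (48) as dec eax. A disassembler has no way to know this from the bytes alone, because nothing in the section marks where code ends and data begins.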
In some respects this is a massively broad question that may not survive for that reason. The information is all out there on the internet, keep looking, it is not complicated, and not worthy of a paper or video.
So you have a rough idea that a compiler takes a program written in one language and converts it to another language be that assembly language or machine code or whatever.
Then there are file formats and there are many different ones that we all use the term "binary" for but again, different formats. Ideally they contain, using some form of encoding, the machine code and data or information about the data.
Going to use ARM for now, fixed length instructions easy to disassemble and read, etc.
#define ONE 1
unsigned int x;
unsigned int y = 5;
const unsigned int z = 7;
unsigned int fun ( unsigned int a )
{
return(a+ONE);
}
and GNU gcc/binutils because it is very well known, widely used, and you can use it to make programs on your wintel machine. I run Linux so you will see elf not exe, but that is just a file format for what you are asking.
arm-none-eabi-gcc -O2 -c so.c -save-temps -o so.o
This toolchain (chain of tools that are linked for example compiler -> assembler -> linker) is Unix style and modular. You are going to have an assembler for a target so not sure why you would want to re-invent that, and it is so much easier to debug a compiler by looking at the assembly output than trying to go straight to machine code. But there are folks that like to climb the mountain just because it is there rather than go around and some tools go straight for machine code just because its there.
This specific compiler has this save temps feature, gcc itself is a front end program that preps for the real compiler then if asked for (if you don't say not to) will call the assembler and linker.
cat so.i
# 1 "so.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "so.c"
unsigned int x;
unsigned int y = 5;
const unsigned int z = 7;
unsigned int fun ( unsigned int a )
{
return(a+1);
}
So at this point defines and includes are taken care of and it's one big file to be sent to the compiler.
The compiler does its thing and turns it into assembly language
cat so.s
.cpu arm7tdmi
.eabi_attribute 20, 1
.eabi_attribute 21, 1
.eabi_attribute 23, 3
.eabi_attribute 24, 1
.eabi_attribute 25, 1
.eabi_attribute 26, 1
.eabi_attribute 30, 2
.eabi_attribute 34, 0
.eabi_attribute 18, 4
.file "so.c"
.text
.align 2
.global fun
.arch armv4t
.syntax unified
.arm
.fpu softvfp
.type fun, %function
fun:
# Function supports interworking.
# args = 0, pretend = 0, frame = 0
# frame_needed = 0, uses_anonymous_args = 0
# link register save eliminated.
add r0, r0, #1
bx lr
.size fun, .-fun
.global z
.global y
.comm x,4,4
.section .rodata
.align 2
.type z, %object
.size z, 4
z:
.word 7
.data
.align 2
.type y, %object
.size y, 4
y:
.word 5
.ident "GCC: (GNU) 9.3.0"
which then gets put into an object file, in this case, binutils, linux default, etc
file so.o
so.o: ELF 32-bit LSB relocatable, ARM, EABI5 version 1 (SYSV), not stripped
It is using an elf file format which is easy to find info on, easy to write programs to parse, etc.
I can disassemble this. Note that because I am telling the disassembler to do everything, it tries to disassemble everything even if it isn't machine code. Sticking to 32 bit ARM instructions, it can grind through that, and where there are real instructions they are shown. These instructions are aligned and fixed length, so you can disassemble linearly. With a variable length instruction set (like x86) you cannot do that and have a hope of success; you need to disassemble in execution order, and then you often miss some code due to the nature of the program.
arm-none-eabi-objdump -D so.o
so.o: file format elf32-littlearm
Disassembly of section .text:
00000000 <fun>:
0: e2800001 add r0, r0, #1
4: e12fff1e bx lr
Disassembly of section .data:
00000000 <y>:
0: 00000005 andeq r0, r0, r5
Disassembly of section .rodata:
00000000 <z>:
0: 00000007 andeq r0, r0, r7
Disassembly of section .comment:
00000000 <.comment>:
0: 43434700 movtmi r4, #14080 ; 0x3700
4: 4728203a ; <UNDEFINED> instruction: 0x4728203a
8: 2029554e eorcs r5, r9, lr, asr #10
c: 2e332e39 mrccs 14, 1, r2, cr3, cr9, {1}
10: Address 0x0000000000000010 is out of bounds.
Disassembly of section .ARM.attributes:
00000000 <.ARM.attributes>:
0: 00002941 andeq r2, r0, r1, asr #18
4: 61656100 cmnvs r5, r0, lsl #2
8: 01006962 tsteq r0, r2, ror #18
c: 0000001f andeq r0, r0, pc, lsl r0
10: 00543405 subseq r3, r4, r5, lsl #8
14: 01080206 tsteq r8, r6, lsl #4
18: 04120109 ldreq r0, [r2], #-265 ; 0xfffffef7
1c: 01150114 tsteq r5, r4, lsl r1
20: 01180317 tsteq r8, r7, lsl r3
24: 011a0119 tsteq r10, r9, lsl r1
28: Address 0x0000000000000028 is out of bounds.
And yes, the tool put extra stuff in there, but note primarily what I created: some code, some uninitialized read/write data, some initialized read/write data and some initialized read only data. The toolchain authors can use whatever names they want, they don't even have to use the term section. But from decades of history and communication and terminology, .text is generally used for code (as in read only machine code AND related data), .bss for zeroed read/write data (although I have seen other names), .data for initialized read/write data, and, in this generation of this tool, .rodata for read only initialized data (technically that could land in .text).
And note that they all have an address of zero. they are not linked yet.
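The "instructions" objdump printed for the .comment section above are a good illustration of a disassembler decoding data. As a small Python sketch (the name words_to_bytes is made up here), writing those four 32-bit words back out in little-endian byte order recovers the start of the string that the compiler emitted with .ident:

```python
import struct

# The four words objdump showed for .comment at offsets 0x0 to 0xc.
COMMENT_WORDS = [0x43434700, 0x4728203A, 0x2029554E, 0x2E332E39]


def words_to_bytes(words):
    """Serialize 32-bit words the way they sit in a little-endian file."""
    return struct.pack("<%dI" % len(words), *words)


text = words_to_bytes(COMMENT_WORDS)
print(text)
```

This prints b'\x00GCC: (GNU) 9.3.', so movtmi, eorcs and mrccs are just what those ASCII characters happen to look like as ARM encodings. The remaining characters of the string presumably do not fill a whole word, which would explain the "out of bounds" line at offset 0x10.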
Now this is ugly but to avoid adding any more code and if the tool lets me do it, let's link it to make a completely unusable binary (no bootstrap, etc, etc):
arm-none-eabi-ld -Ttext=0x1000 -Tdata=0x2000 so.o -o so.elf
arm-none-eabi-ld: warning: cannot find entry symbol _start; defaulting to 0000000000001000
arm-none-eabi-objdump -D so.elf
so.elf: file format elf32-littlearm
Disassembly of section .text:
00001000 <fun>:
1000: e2800001 add r0, r0, #1
1004: e12fff1e bx lr
Disassembly of section .data:
00002000 <y>:
2000: 00000005 andeq r0, r0, r5
Disassembly of section .rodata:
00001008 <z>:
1008: 00000007 andeq r0, r0, r7
Disassembly of section .bss:
00002004 <x>:
2004: 00000000 andeq r0, r0, r0
And now it is linked. The read only items .text and .rodata landed in the .text address space in the order found in the file. The read/write items landed in the .data address space in the order found in the file.
Yes, where was .bss in the object? It is in there, it has no actual data as in bytes that are part of the object, instead it has a name and size and that it is .bss. And for whatever reason the tool does show it from the linked binary.
So back on the term binary. The so.elf binary has the bytes that go in memory that make up the program, but also file format infrastructure plus a symbol table to make the disassembly and debugging easier plus other stuff. Elf is a flexible file format gnu can use it and you get one result some other tool or version of a tool can use it and have a different file. And obviously two compilers can generate different machine code from the same source program not just due to optimizations, the job is to make a functional program in the target language and functional is the opinion of the compiler/tool author.
What about a memory image type file:
arm-none-eabi-objcopy so.elf so.bin -O binary
hexdump -C so.bin
00000000 01 00 80 e2 1e ff 2f e1 07 00 00 00 00 00 00 00 |....../.........|
00000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00001000 05 00 00 00 |....|
00001004
Now how the objcopy tool works is that it starts with the first defined loadable (or whatever term you want to use) byte and ends with the last one, and uses (zero) padding in between so that the file matches the memory image from an address perspective. The asterisk means essentially 0 padding. We linked .text at 0x1000 and .data at 0x2000, so the first byte of this file (offset 0) is the beginning of .text, and 0x1000 bytes later (offset 0x1000 in the file, but we know it goes to 0x2000 in memory) is the read/write stuff. Also note that the bss zeros are not in the output. The bootstrap is expected to zero those.
There is no information like where in memory this data from this file goes, etc. And if you think a bit about it what if I have one byte at a section I define goes to 0x00000000 and one byte at a section I define goes to 0x80000000 and output this file, yes that is a 0x80000001 byte file even though there are only two useful bytes of relevant information. A 2GB file to hold two bytes. This is why you don't want to output this file format until you have sorted out your linker script and tools.
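The padding behaviour described above can be modelled in a few lines of Python. This is a sketch of the idea only, not of how objcopy is implemented; the function names flat_image and image_size are invented, and the byte values are the ones from the hexdump.

```python
def flat_image(segments):
    """segments: list of (load_address, data). Returns (base, image bytes)."""
    base = min(addr for addr, _ in segments)
    end = max(addr + len(data) for addr, data in segments)
    image = bytearray(end - base)  # gaps between segments stay zero
    for addr, data in segments:
        image[addr - base:addr - base + len(data)] = data
    return base, bytes(image)


def image_size(spans):
    """spans: list of (load_address, length). Size without building it."""
    base = min(addr for addr, _ in spans)
    end = max(addr + length for addr, length in spans)
    return end - base


TEXT_AND_RODATA = bytes.fromhex("010080e2 1eff2fe1 07000000")  # at 0x1000
DATA = bytes.fromhex("05000000")                               # at 0x2000

base, image = flat_image([(0x1000, TEXT_AND_RODATA), (0x2000, DATA)])
print(hex(base), hex(len(image)))
print(hex(image_size([(0x00000000, 1), (0x80000000, 1)])))
```

The first result is a 0x1004 byte image, which matches the final offset 00001004 in the hexdump. The second is the 0x80000001 byte file for two useful bytes. Note that the base address 0x1000 is returned separately because the image itself does not record it.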
Same data and two other equally old school formats with a little history of intel vs motorola
arm-none-eabi-objcopy so.elf so.hex -O ihex
cat so.hex
:08100000010080E21EFF2FE158
:0410080007000000DD
:0420000005000000D7
:0400000300001000E9
:00000001FF
arm-none-eabi-objcopy so.elf so.srec -O srec
cat so.srec
S00A0000736F2E7372656338
S10B1000010080E21EFF2FE154
S107100807000000D9
S107200005000000D3
S9031000EC
now these contain the relevant bytes, plus addresses, but not much other information, takes more than two bytes for every byte of data, but compared to a huge file with padding, a worthy trade-off. Both of these formats can be found in use today, not as much as the old days but still there.
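Both formats are simple enough to check by hand. The Python sketch below verifies the checksums of the records shown above and pulls the fields out of an Intel HEX record. It assumes the usual definitions: in Intel HEX all bytes of a record including the checksum sum to 0 modulo 256, and in S-records the count, address, data and checksum bytes sum to 0xFF modulo 256. The function names are made up for this example.

```python
def ihex_checksum_ok(record):
    """Intel HEX: every byte after ':' sums to 0 modulo 256."""
    raw = bytes.fromhex(record.strip()[1:])
    return sum(raw) % 256 == 0


def srec_checksum_ok(record):
    """S-record: every byte after the 'Sx' type sums to 0xFF modulo 256."""
    raw = bytes.fromhex(record.strip()[2:])
    return sum(raw) % 256 == 0xFF


def ihex_fields(record):
    """Split an Intel HEX record into (count, address, type, data)."""
    raw = bytes.fromhex(record.strip()[1:])
    count = raw[0]
    address = int.from_bytes(raw[1:3], "big")
    record_type = raw[3]
    return count, address, record_type, raw[4:4 + count]


IHEX = [":08100000010080E21EFF2FE158", ":0410080007000000DD",
        ":0420000005000000D7", ":0400000300001000E9", ":00000001FF"]
SREC = ["S00A0000736F2E7372656338", "S10B1000010080E21EFF2FE154",
        "S107100807000000D9", "S107200005000000D3", "S9031000EC"]

print(all(ihex_checksum_ok(r) for r in IHEX))
print(all(srec_checksum_ok(r) for r in SREC))
count, address, record_type, data = ihex_fields(IHEX[0])
print(count, hex(address), record_type, data.hex())
```

The first data record carries 8 bytes for address 0x1000, which are the two instructions of fun. The data field of the S0 header record is the ASCII text "so.srec".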
There are countless other binary file formats, and a tool like objcopy has a decent list of formats it can generate, as do other linkers and/or tools out there.
What is relevant about all of this is that there is a binary file format of some form that contains the bytes we need to run the program.
What format and what addresses you might ask...That is part of the operating system or the system design. In the case of Windows there are specific file formats and variations perhaps of those formats that are supported by the windows operating system, the specific version you are using. Windows has determined what the address space looks like. Operating systems like this take advantage of the MMU both for virtualizing addresses and protection. Having a virtual address space means every program can live in the same space. All programs can have an address that is zero based for example....
test.c
int main ( void )
{
return 1;
}
hello.c
int main ( void )
{
return 2;
}
gcc test.c -o test
objdump -D test
Disassembly of section .text:
00000000004003e0 <_start>:
4003e0: 31 ed xor %ebp,%ebp
4003e2: 49 89 d1 mov %rdx,%r9
4003e5: 5e pop %rsi
...
gcc hello.c -o hello
objdump -D hello
Disassembly of section .text:
00000000004003e0 <_start>:
4003e0: 31 ed xor %ebp,%ebp
4003e2: 49 89 d1 mov %rdx,%r9
Same address, how is that possible, won't they sit on top of each other? No: virtual memory. Each program gets its own virtual address space. And note this is built for a specific Linux on a specific day, etc. The toolchain has a default linker script (notice I didn't specify how to link) for this platform, chosen when the compiler was built for this target/platform.
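A toy model may help to see why two programs linked at the same address do not collide. In this Python sketch each process has its own page table, modelled as a dict from virtual page number to physical frame number. The frame numbers, the 4 KiB page size and all names are assumptions for illustration; a real MMU and operating system do considerably more.

```python
PAGE_SIZE = 0x1000  # assumed 4 KiB pages


class PageFault(Exception):
    """Raised when a virtual address has no mapping in this toy model."""


def translate(page_table, virtual_address):
    """Translate a virtual address using one process's page table."""
    page, offset = divmod(virtual_address, PAGE_SIZE)
    if page not in page_table:
        raise PageFault(hex(virtual_address))
    return page_table[page] * PAGE_SIZE + offset


# Hypothetical mappings: both processes use virtual page 0x400, but the
# operating system backed it with different physical frames.
test_pages = {0x400: 0x1A2B}
hello_pages = {0x400: 0x0C0D}

START = 0x4003E0  # address of _start in both binaries
print(hex(translate(test_pages, START)))
print(hex(translate(hello_pages, START)))
```

Both programs see _start at 0x4003e0, yet the two translations land in different physical memory, so neither program can see or overwrite the other.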
arm-none-eabi-gcc -O2 test.c -c -o test.o
arm-none-eabi-ld test.o -o test.elf
arm-none-eabi-ld: warning: cannot find entry symbol _start; defaulting to 0000000000008000
arm-none-eabi-objdump -D test.elf
test.elf: file format elf32-littlearm
Disassembly of section .text:
00008000 <main>:
8000: e3a00001 mov r0, #1
8004: e12fff1e bx lr
Same source code, same compiler, but built for a different target and system, so a different address.
So for Windows there are definitely going to be rules for the supported binary formats and rules for address spaces that can be used, how to define those spaces in the file.
Then it is a simple matter of the operating system's loader reading the binary file and putting the loadable items into memory at those addresses (in the virtual space that the OS has created for this specific program). It is very possible that a feature of the loader is to zero .bss for you, since the information is there. The low level programmer needs to know that, to possibly deal with zeroing .bss or not.
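As a rough model of what such a loader does, here is a Python sketch. Each loadable item is described by a virtual address, the bytes stored in the file, and the size it should occupy in memory. When the memory size is larger than the file data, the remainder is zero-filled, which is how .bss can cost nothing in the file. The tuple layout and function names are invented for this example; the values are those of the so.elf linked with -Ttext=0x1000 -Tdata=0x2000 earlier.

```python
def load_segments(segments):
    """segments: list of (virtual_address, file_bytes, memory_size).

    Returns a dict mapping each base address to its bytearray region.
    """
    memory = {}
    for vaddr, file_bytes, mem_size in segments:
        if mem_size < len(file_bytes):
            raise ValueError("memory size smaller than file data")
        region = bytearray(mem_size)           # zeroed, which covers .bss
        region[:len(file_bytes)] = file_bytes  # bytes that came from the file
        memory[vaddr] = region
    return memory


def read_u32(memory, address):
    """Read a little-endian 32-bit word from the loaded image."""
    for base, region in memory.items():
        if base <= address and address + 4 <= base + len(region):
            start = address - base
            return int.from_bytes(region[start:start + 4], "little")
    raise ValueError("unmapped address %#x" % address)


memory = load_segments([
    (0x1000, bytes.fromhex("010080e21eff2fe107000000"), 12),  # .text, .rodata
    (0x2000, bytes.fromhex("05000000"), 8),                   # .data + .bss
])
print(read_u32(memory, 0x1008), read_u32(memory, 0x2000),
      read_u32(memory, 0x2004))
```

This prints z, y and x as 7, 5 and 0. The 0 for x was never stored in the file; it comes from the zero fill.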
If not you will see and may need to create a solution, unfortunately this is where you get deeper into tool specific items. While C may be somewhat standardized there are tool specific things that are not or at least are standardized by the tool/authors but no reason to assume those cross over to other tools.
.globl _start
_start:
ldr sp,sp_init
bl fun
b .
.word __bss_start__
.word __bss_end__
sp_init:
.word 0x8000
Everything about assembly language is tool specific, the mnemonics for sanity reasons no doubt will resemble the ip/processor vendors documentation which uses syntax that the tool they paid to have developed uses. But beyond that assembly language is wholly defined by the tool not the target, x86 because of its age and other things is really bad about that and this is not the Intel vs AT&T thing, just in general. Gnu assembler is well known for I would assume perhaps intentionally not making compatible languages with other assembly languages. The above is gnu assembler for arm.
Using the fun() function above, C says it should be main() but the tool doesn't care I am already typing enough here.
add a simple ram based linker script
MEMORY
{
ram : ORIGIN = 0x1000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > ram
.rodata : { *(.rodata*) } > ram
.bss : {
__bss_start__ = .;
*(.bss*)
} > ram
__bss_end__ = .;
}
build it all
arm-none-eabi-as start.s -o start.o
arm-none-eabi-gcc -O2 -c so.c -o so.o
arm-none-eabi-ld -T sram.ld start.o so.o -o so.elf
examine
arm-none-eabi-nm so.elf
0000102c B __bss_end__
00001028 B __bss_start__
00001018 T fun
00001014 t sp_init
00001000 T _start
00001028 B x
00001024 D y
00001020 R z
arm-none-eabi-objdump -D so.elf
so.elf: file format elf32-littlearm
Disassembly of section .text:
00001000 <_start>:
1000: e59fd00c ldr sp, [pc, #12] ; 1014 <sp_init>
1004: eb000003 bl 1018 <fun>
1008: eafffffe b 1008 <_start+0x8>
100c: 00001028 andeq r1, r0, r8, lsr #32
1010: 0000102c andeq r1, r0, r12, lsr #32
00001014 <sp_init>:
1014: 00008000 andeq r8, r0, r0
00001018 <fun>:
1018: e2800001 add r0, r0, #1
101c: e12fff1e bx lr
Disassembly of section .rodata:
00001020 <z>:
1020: 00000007 andeq r0, r0, r7
Disassembly of section .data:
00001024 <y>:
1024: 00000005 andeq r0, r0, r5
Disassembly of section .bss:
00001028 <x>:
1028: 00000000 andeq r0, r0, r0
So now it is possible to add a memory zeroing loop to the bootstrap, based on the start and end addresses (do not use C/memset; to avoid chicken-and-egg problems you write the bootstrap in asm).
Fortunately or unfortunately, the linker script is tool specific and assembly language is tool specific, and they need to work together if you are letting the tools do the work for you (the sane way to do it, have fun figuring out where .bss is otherwise).
This can be done on an operating system, but when you get into, say, microcontrollers, it all has to be in non-volatile storage (flash). Well, it is possible to have firmware that is downloaded from elsewhere into RAM (like your mouse firmware sometimes, sometimes keyboard, etc.), but assume flash. So how do you deal with .data?
MEMORY
{
rom : ORIGIN = 0x0000, LENGTH = 0x1000
ram : ORIGIN = 0x1000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > rom
.rodata : { *(.rodata*) } > rom
.data : {
*(.data*)
} > ram AT > rom
.bss : {
__bss_start__ = .;
*(.bss*)
} > ram
__bss_end__ = .;
}
With gnu ld this basically says that .data's home is in ram, but the output binary formats will put it in flash/rom
arm-none-eabi-objcopy so.elf so.srec -O srec
cat so.srec
S00A0000736F2E7372656338
S11300000CD09FE5030000EBFEFFFFEA04100000A4
S11300100810000000800000010080E21EFF2FE1B4
S107002007000000D1 <- z variable at address 0020
S107002405000000CF <- y variable at 0024
S9030000FC
and you have to play with the linker script more to get the tool to tell you both the ram and flash starting addresses and ending addresses or length. then add code in the bootstrap (asm not C) to copy .data from flash to ram.
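The logic of that bootstrap work (copy .data from flash to RAM, then zero .bss) can be modelled in Python before writing it in assembly. This is only a simulation of the two loops: the flash contents are the bytes from the srec above, the .bss bounds are read from the two pool words in flash, and the .data addresses (load address 0x24, run address 0x1000 to 0x1004) are inferred from that output. The parameter names such as data_load are hypothetical, since the linker script shown does not export such symbols yet.

```python
RAM_ORIGIN = 0x1000

# Flash image as described by the S1 records (addresses 0x00 to 0x27).
flash = bytes.fromhex(
    "0cd09fe5030000ebfeffffea04100000"
    "0810000000800000010080e21eff2fe1"
    "07000000" "05000000")


def bootstrap(flash, ram, data_load, data_start, data_end,
              bss_start, bss_end):
    """Word-by-word copy and zero loops, as the asm bootstrap would do."""
    src, dst = data_load, data_start
    while dst < data_end:                       # copy .data: flash -> ram
        ram[dst - RAM_ORIGIN:dst - RAM_ORIGIN + 4] = flash[src:src + 4]
        src += 4
        dst += 4
    addr = bss_start
    while addr < bss_end:                       # zero .bss
        ram[addr - RAM_ORIGIN:addr - RAM_ORIGIN + 4] = b"\x00" * 4
        addr += 4


# __bss_start__ and __bss_end__ as stored in the pool after "b ."
bss_start = int.from_bytes(flash[0x0C:0x10], "little")
bss_end = int.from_bytes(flash[0x10:0x14], "little")

ram = bytearray(b"\xAA" * 0x1000)  # pretend RAM holds garbage at power-on
bootstrap(flash, ram, data_load=0x24, data_start=0x1000, data_end=0x1004,
          bss_start=bss_start, bss_end=bss_end)
print(hex(bss_start), hex(bss_end), ram[:8].hex())
```

After the call, y (at 0x1000) holds 5 and x (at 0x1004) holds 0, while the rest of RAM is untouched.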
Also note here per another one of your many questions.
.word __bss_start__
.word __bss_end__
sp_init:
.word 0x8000
These items are technically data, but they live in .text first and foremost because they were defined in the code that was assumed to be .text (I didn't need to state that in the asm, but could have). You will see this in x86 as well, but for fixed length instruction sets like ARM, MIPS, RISC-V, etc., where you can't put any old immediate/constant/linked value you want in the instruction itself, you put it nearby in a "pool" and do a pc relative read to get it. You will see this for linking externals too:
extern unsigned int x;
int main ( void )
{
return x;
}
arm-none-eabi-gcc -O2 -c test.c -o test.o
arm-none-eabi-objdump -D test.o
test.o: file format elf32-littlearm
Disassembly of section .text.startup:
00000000 <main>:
0: e59f3004 ldr r3, [pc, #4] ; c <main+0xc>
4: e5930000 ldr r0, [r3]
8: e12fff1e bx lr
c: 00000000 andeq r0, r0, r0 <--- the code gets the address of the
variable from here and then reads it from memory
once linked
Disassembly of section .text:
00008000 <main>:
8000: e59f3004 ldr r3, [pc, #4] ; 800c <main+0xc>
8004: e5930000 ldr r0, [r3]
8008: e12fff1e bx lr
800c: 00018010 andeq r8, r1, r0, lsl r0
Disassembly of section .data:
00018010 <x>:
18010: 00000005 andeq r0, r0, r5
for x86
gcc -c -O2 test.c -o test.o
dwelch-desktop so # objdump -D test.o
test.o: file format elf64-x86-64
Disassembly of section .text.startup:
0000000000000000 <main>:
0: 8b 05 00 00 00 00 mov 0x0(%rip),%eax # 6 <main+0x6>
6: c3 retq
00000000004003e0 <main>:
4003e0: 8b 05 4a 0c 20 00 mov 0x200c4a(%rip),%eax # 601030 <x>
4003e6: c3 retq
If you squint is it really different? there is data nearby that the processor reads to load into a register and or use. either way, due to the nature of the instruction sets the linker modifies the instruction or nearby pool data or both.
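The address arithmetic behind both listings can be checked with a short Python sketch. It relies on two facts: in ARM state the pc reads as the address of the current instruction plus 8, and on x86-64 a rip-relative displacement is added to the address of the next instruction. The function names are made up for this example.

```python
def arm_pc_relative_target(instruction_address, offset):
    """Address read by 'ldr rX, [pc, #offset]' in ARM state."""
    return instruction_address + 8 + offset


def x86_64_rip_relative_target(instruction_address, instruction_length,
                               disp32):
    """Address used by an instruction with a rip-relative operand."""
    return instruction_address + instruction_length + disp32


# ARM: 'ldr r3, [pc, #4]' at 0x8000 reads the pool word at 0x800c.
print(hex(arm_pc_relative_target(0x8000, 4)))

# x86-64: '8b 05 4a 0c 20 00' at 0x4003e0, 6 bytes long.
disp = int.from_bytes(bytes.fromhex("4a0c2000"), "little", signed=True)
print(hex(x86_64_rip_relative_target(0x4003E0, 6, disp)))
```

The results are 0x800c and 0x601030, the same addresses objdump printed in its comments. In the unlinked objects the same formulas give 0xc and 0x6, because the pool word and the displacement are still zero and waiting for the linker to fill them in.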
last one:
arm-none-eabi-gcc -S test.c
cat test.s
.cpu arm7tdmi
.eabi_attribute 20, 1
.eabi_attribute 21, 1
.eabi_attribute 23, 3
.eabi_attribute 24, 1
.eabi_attribute 25, 1
.eabi_attribute 26, 1
.eabi_attribute 30, 6
.eabi_attribute 34, 0
.eabi_attribute 18, 4
.file "test.c"
.text
.align 2
.global main
.arch armv4t
.syntax unified
.arm
.fpu softvfp
.type main, %function
main:
# Function supports interworking.
# args = 0, pretend = 0, frame = 0
# frame_needed = 1, uses_anonymous_args = 0
# link register save eliminated.
str fp, [sp, #-4]!
add fp, sp, #0
ldr r3, .L3
ldr r3, [r3]
mov r0, r3
add sp, fp, #0
# sp needed
ldr fp, [sp], #4
bx lr
.L4:
.align 2
.L3:
.word x
.size main, .-main
.ident "GCC: (GNU) 9.3.0"
So can you see the assembly language, yes some tools will let you save the intermediate files and/or let you generate the assembly output of the file when compiling.
Can you have data in the code, yes there are times and reasons to have data values in the .text area not just target specific you will see this for various reasons and some toolchains put read only data there.
There are many file formats the ones used by modern operating systems have features not just for defining the bytes that make up the machine code and data values but also will include symbols and other debug information.
The file format and memory space for a program is operating system specific not language nor even target specific (Linux, Windows, MacOS on the same laptop are not expected to have the same rules despite the exact same target computer). A native toolchain for that platform has a default linker script and whatever other information required to build usable/working programs for that target. Including the supported file format.
The machine code and data items can be represented in different file formats in different ways, whether or not the operating system or loader of the target system can use that format depends on that target system.
Programs have bugs and nuances. File formats have versions and inconsistencies, you might find some elf file format reader only to find it doesn't work or prints out strange stuff when fed a perfectly good elf file that works on some system. Why are some flags being set? Perhaps those bytes got re-used or the flag got repurposed or the data structure changed or a tool is using it differently or in a non-standard way (think mov 20h,ax) and another tool that is not compatible can't understand or gets lucky and gets close enough.
Asking "why" questions at Stack Overflow is not very useful, the odds of finding the individual that wrote the thing are very low, better odds of asking the place you got the tool from and following that hoping the person is still alive and willing to be bothered. And 99.999(lots of 9s)% there is no global set of godly rules that the thing was written under/for. Generally it was some dude who just felt like it, that is why they did what they did: no real reason, laziness, a bug, intentionally trying to break someone else's tool. All the way up to a large committee of people with an opinion voted on it on a particular day in a particular room and that's why (and we know what we get when we design by committee or try to write specs that nobody conforms to).
I know you are on Windows and I don't have a Windows machine handy and am on Linux. But the gnu/binutils and clang/llvm tools are readily available and have a rich set of tools like readelf, nm, objdump, etc. that assist in examining things. A good tool is going to have that at least internally for the developers so they can debug the output of the tool to a certain quality level. The gnu folks made tools and made them available for everyone, and while it takes time to sort through them and their features, they are very powerful for the things you are trying to understand.
You are NOT going to find a good x86 disassembler, they are all crap simply because of the nature of the beast. It is a variable length instruction set, so by definition unless you are executing you can't sort it out correctly. You must disassemble in execution order from a known good entry point to have half a chance, and then for various reasons there are code paths you cannot see that way (think jump tables for example, or dlls or so files). The BEST solution is to have a very accurate/perfect emulator/simulator and run the code and perform all the actions/gyrations you need to do to get it to cover all the code paths, and have that tool record which bytes are instructions and which are data, and where each instruction or each linear section without a branch is located.
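The difference between disassembling linearly and disassembling in execution order can be shown with a toy variable length instruction set. Everything in this Python sketch is invented for illustration (the four opcodes, their lengths, the five byte program); it is not x86. The program jumps over one data byte, and that byte happens to look like the opcode of a three byte instruction.

```python
# Toy ISA: opcode -> (name, total instruction length in bytes).
OPCODES = {0x00: ("halt", 1), 0x01: ("nop", 1),
           0x02: ("jmp", 2), 0x03: ("movi", 3)}

# jmp +1 | data byte 0x03 | nop | halt
PROGRAM = bytes([0x02, 0x01, 0x03, 0x01, 0x00])


def linear_sweep(code):
    """Decode byte after byte from offset 0, as if everything were code."""
    found, offset = [], 0
    while offset < len(code) and code[offset] in OPCODES:
        name, length = OPCODES[code[offset]]
        found.append((offset, name))
        offset += length
    return found


def follow_execution(code, entry=0):
    """Decode in execution order, following jumps from the entry point."""
    found, offset, seen = [], entry, set()
    while (offset < len(code) and offset not in seen
           and code[offset] in OPCODES):
        seen.add(offset)
        name, length = OPCODES[code[offset]]
        found.append((offset, name))
        if name == "halt":
            break
        if name == "jmp":
            # Operand: unsigned forward displacement from the next
            # instruction (a simplification of this toy ISA).
            offset = offset + length + code[offset + 1]
        else:
            offset += length
    return found


print(linear_sweep(PROGRAM))
print(follow_execution(PROGRAM))
```

The linear sweep reports jmp followed by a movi that swallows the real nop and halt, while following execution finds jmp, nop, halt and leaves the data byte alone. With conditional branches, jump tables or computed targets even the second approach can miss code.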
The good side of this is that a lot of code is compiled today using tools that are not trying to hide anything. In the old days for various reasons you would see hand written asm that intentionally tried to prevent disassembly or due to other factors (hand editing a binary rom image for a video game the day before the trade show, go disassemble some of the classic roms).
mov r0,#0
cmp r0,#0
jz somewhere
.word 0x12345678
A disassembler isn't going to figure this out, some might add a case for that then
mov r0,#0
nop
nop
xor r0,#1
nop
nop
xor r0,#3
xor r0,#2
cmp r0,#0
jz somewhere
.word 0x12345678
and it thinks that data is an instruction. For variable length instruction sets that is super hard for a disassembler to resolve. A decent one will at least detect collisions, where the non-opcode part of an instruction is branched to and/or the opcode part of an instruction shows up later as additional bytes in some other instruction. The tool can't resolve it, a human has to.
The same problem shows up even with ARM and MIPS having 32 and 16 bit instructions, RISC-V with variable sized instructions, etc.
Very often gnu's disassembler will get tripped up with x86.
I don't think I'll be able to answer everything. I am a beginner too, so some things I say may not be exact. But I'll try my best and I think I can contribute a few things.
No, compilers do not put data in code sections (correct me if I am wrong). There is the section .data (for initialized data) and the section .bss (for uninitialized data).
I think it's better to show you an example of a program that prints hello world. It is for Linux, because that's much simpler and I don't know how to do it on Windows. It is in x64, but that is like x86; only the names of the syscalls and the registers are different (x64 is for 64 bits and x86 for 32 bits).
BITS 64 ;not obligatory but I prefer
section .data
msg db "hello world" ;the message
len equ $-msg ;the length of msg
section .text
global _start
_start: ;the entry point
mov rax, 1 ;syscall 1 to print something
mov rdi, 1 ;1 for stdout
mov rsi, msg ;the message
mov rdx, len ;length in rdx
syscall
mov rax, 60 ;exit syscall
mov rdi, 0 ;exit with 0
syscall
(You can use https://tio.run/#assembly-nasm if you don't want to use a VM. If you are on Windows, I advise you to look at WSL + VS Code: you will have Linux inside Windows, and VS Code has an extension that gives access to the files from Windows.)
If you want to disassemble the code or look at the memory, you can use gdb or radare2 on Linux. For Windows, there are other tools such as Ghidra, IDA and OllyDbg.
I don't know of any way to make the compiler generate better assembly code, but that doesn't mean one doesn't exist.
I have never made anything for Windows. However, to link my object file, I use ld (I don't know if that will be helpful).
ld object.o -o compiledprogram
I don't have time right now to continue writing so I can't advise you any courses right now.. I'll see later.
Hope it has helped you.
Answers to questions in your text:
1. You can watch process execution step by step, and inspect process memory, with a debugger. I used OllyDbg for learning assembly; it's a free and powerful debugger.
2. The process is loaded by the Windows kernel after NtCreateUserProcess is called, so I think you would need kernel debugging to see how it is done.
3. Code that is debugged in OllyDbg is automatically disassembled.
4. You can put read-only data in the ".text" section. You can change the section flags to make it writable; then code and data can be mixed. Some compilers may merge the ".text" and ".rdata" sections.
I would recommend that you read about PE imports, exports, relocations and resources, in that order. If you want to see the simplest possible i386 PE hello world, you can check my hello_world_pe_i386_dynamic.exe program here: https://github.com/pajacol/hello-world. I wrote it entirely in a binary file editor. It contains only the required data structures. The executable is position independent and can be loaded at any address without relocations.
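Before getting to imports and relocations, it may help to see how little is needed to find the entry point in a PE32 file. The sketch below is only an illustration: `parse_pe_headers` and `make_fake_pe32` are names made up for this example, and the input is a synthetic header buffer built in memory, not a loadable executable. The field offsets used (e_lfanew at 0x3C, the 20-byte COFF header, AddressOfEntryPoint at offset 16 and ImageBase at offset 28 of the PE32 optional header) follow the published PE format.

```python
import struct

def parse_pe_headers(data: bytes) -> dict:
    """Read a few PE32 header fields (sketch; handles 32-bit PE32 only)."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ (DOS header) signature")
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)       # file offset of "PE\0\0"
    if data[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    coff = e_lfanew + 4
    machine, n_sections, _time, _symptr, _nsyms, opt_size, _flags = struct.unpack_from(
        "<HHIIIHH", data, coff)                               # 20-byte COFF file header
    opt = coff + 20
    (magic,) = struct.unpack_from("<H", data, opt)
    if magic != 0x10B:
        raise ValueError("not a PE32 optional header")
    (entry_rva,) = struct.unpack_from("<I", data, opt + 16)   # AddressOfEntryPoint
    (image_base,) = struct.unpack_from("<I", data, opt + 28)  # ImageBase
    return {
        "machine": machine,
        "sections": n_sections,
        "optional_header_size": opt_size,
        "entry_rva": entry_rva,
        "image_base": image_base,
        # Address of the first instruction if the image is loaded at its preferred base
        "entry_va": image_base + entry_rva,
    }

def make_fake_pe32(entry_rva: int, image_base: int) -> bytes:
    """Build a synthetic header-only buffer for the demo (hypothetical helper, not a valid .exe)."""
    buf = bytearray(0x58 + 0xE0)
    buf[0:2] = b"MZ"
    struct.pack_into("<I", buf, 0x3C, 0x40)
    buf[0x40:0x44] = b"PE\0\0"
    # Machine 0x014C = i386, one section, optional header size 0xE0
    struct.pack_into("<HHIIIHH", buf, 0x44, 0x014C, 1, 0, 0, 0, 0xE0, 0x0102)
    struct.pack_into("<H", buf, 0x58, 0x10B)
    struct.pack_into("<I", buf, 0x58 + 16, entry_rva)
    struct.pack_into("<I", buf, 0x58 + 28, image_base)
    return bytes(buf)

info = parse_pe_headers(make_fake_pe32(entry_rva=0x1000, image_base=0x400000))
print({k: hex(v) for k, v in info.items()})
```

To try it on a real file, pass `open(path, "rb").read()` instead of the synthetic buffer; a 64-bit executable (PE32+) has a different magic and field layout and is rejected by this sketch.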
Related
What is the format of the content in the .text generated by the linker?
During the compilation process, the linker maps our code text content into the .text in the code memory section. I would like to know what is the meaning of the text content, does it mean the actual code in text or in assembly? Thanks a lot!
does it mean the actual code in text or in assembly?
Neither: it's actual code in machine instructions. For example:
$ cat > t.c
int foo() { return 42; }
$ gcc -c t.c
$ objdump -d t.o
t.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <foo>:
   0: 55              push %rbp
   1: 48 89 e5        mov %rsp,%rbp
   4: b8 2a 00 00 00  mov $0x2a,%eax
   9: 5d              pop %rbp
   a: c3              retq
The contents of the .text section is the following 11 bytes:
554889e5b82a0000005dc3
Update: is it correct to say that it contains the assembly code which converted into machine readable binary format?
You could say that, but it's not very clear. Perhaps "contains machine readable binary instructions, produced by compiling and assembling the program source, and applying relocations". (That last part is what the linker does, not demonstrated in the example above.)
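As a quick check of the byte string above, the following sketch splits it at the instruction offsets shown in the objdump listing and pulls the constant 42 out of the mov instruction. The offsets and mnemonics are copied from that listing; nothing is decoded automatically here.

```python
text = bytes.fromhex("554889e5b82a0000005dc3")
print(len(text), "bytes in .text")

# (start offset, mnemonic) pairs copied from the objdump output above
listing = [(0x0, "push %rbp"), (0x1, "mov %rsp,%rbp"),
           (0x4, "mov $0x2a,%eax"), (0x9, "pop %rbp"), (0xA, "retq")]

# Each instruction ends where the next one starts
ends = [start for start, _ in listing[1:]] + [len(text)]
for (start, mnemonic), end in zip(listing, ends):
    print(f"{start:2x}: {text[start:end].hex(' '):<15} {mnemonic}")

# b8 is followed by a 32-bit little-endian immediate: the 42 from "return 42"
imm32 = int.from_bytes(text[5:9], "little")
print("immediate:", imm32)
```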
cannot access memory at address 0x10084 when trying to set breakpoint via gdb
I wrote this simple assembly program (based on a tutorial, only slightly changed):
# p = q + r + s
# let q=2, r=4, s=5
# this version of the simple-equation stores in memory
p: .space 4 @reserve 4 bytes in memory for variable p
q: .word 2 @create 32-bit variable q with initial value of 2
r: .word 4
s: .word 5
.global _start
_start:
ldr r1,q @load r1 with q
ldr r2,r @load r2 with r
ldr r3,s @load r3 with s
add r0,r1,r2
add r0,r0,r3
mov r7,#1 @syscall to terminate the program
svc 0
.end
I assemble the program using
as -g -o main.o main.s
Then I link the object file using
ld main.o -o main
Then I execute
gdb main
Now, when trying to insert a breakpoint at any line number, I get the error that is the title of this post (cannot access memory at address 0x10084). As this program's code is based on a tutorial, and the teacher in the tutorial uses a Code::Blocks project and .global main / main: instead of .global _start / _start:, I assume that this is where my error might come from (although I don't understand how this results in not being able to set a breakpoint via gdb, while not getting any error while assembling and linking). I would be very grateful if anyone could shed some light on this for me. Thanks in advance!
edit: having been asked what the output of objdump -d main might look like, I add the output of the command here:
main: file format elf32-littlearm
Disassembly of section .text:
00010054 <p>:
   10054: 00000000 .word 0x00000000
00010058 <q>:
   10058: 00000002 .word 0x00000002
0001005c <r>:
   1005c: 00000004 .word 0x00000004
00010060 <s>:
   10060: 00000005 .word 0x00000005
00010064 <_start>:
   10064: e51f1014 ldr r1, [pc, #-20] ; 10058 <q>
   10068: e51f2014 ldr r2, [pc, #-20] ; 1005c <r>
   1006c: e51f3014 ldr r3, [pc, #-20] ; 10060 <s>
   10070: e0810002 add r0, r1, r2
   10074: e0800003 add r0, r0, r3
   10078: e3a07001 mov r7, #1
   1007c: ef000000 svc 0x00000000
readelf -a main told me (among other things) that my entry point is 0x10064.
I already tried to use main instead of _start; however, then during assembling and linking I get an error telling me that no entry point has been found.
edit: Given the address of the entry point, I ran the program again using gdb, then set a breakpoint at the specified address. It did so without complaining, and when running, execution indeed stops at the breakpoint. So the issue seems to be that the address 0x10084 that gdb wants to use for my break linenum command just doesn't correspond to the addresses that the instructions at the corresponding lines really have. Using the gdb command info line 'linenumber' just confirms my assumption. It prints out memory addresses and I can indeed set breakpoints at the printed addresses, but when I try to set a breakpoint by specifying the line number, gdb always wants to use 0x10084 and fails. Does anybody have an idea how this behaviour comes about, and what might be ways to fix it?
gcc x86-32 stack alignment and calling printf
To the best of my knowledge, x86-64 requires the stack to be 16-byte aligned before a call, while gcc with -m32 doesn't require this for main. I have the following testing code:
.data
intfmt: .string "int: %d\n"
testint: .int 20
.text
.globl main
main:
mov %esp, %ebp
push testint
push $intfmt
call printf
mov %ebp, %esp
ret
Build with as --32 test.S -o test.o && gcc -m32 test.o -o test. I am aware that syscall write exists, but to my knowledge it cannot print ints and floats the way printf can.
After entering main, a 4 byte return address is on the stack. Then, interpreting this code naively, the two push instructions each put 4 bytes on the stack, so the call needs another 4 byte value pushed to be aligned.
Here is the objdump of the binary generated by gas and gcc:
0000053d <main>:
 53d: 89 e5              mov %esp,%ebp
 53f: ff 35 1d 20 00 00  pushl 0x201d
 545: 68 14 20 00 00     push $0x2014
 54a: e8 fc ff ff ff     call 54b <main+0xe>
 54f: 89 ec              mov %ebp,%esp
 551: c3                 ret
 552: 66 90              xchg %ax,%ax
 554: 66 90              xchg %ax,%ax
 556: 66 90              xchg %ax,%ax
 558: 66 90              xchg %ax,%ax
 55a: 66 90              xchg %ax,%ax
 55c: 66 90              xchg %ax,%ax
 55e: 66 90              xchg %ax,%ax
I am very confused about the push instructions generated.
If two 4 byte values are pushed, how is alignment achieved?
Why is 0x2014 pushed instead of 0x14? What is 0x201d?
What does call 54b even achieve? Output of hd matches objdump. Why is this different in gdb? Is this the dynamic linker?
B+>│0x5655553d <main>     mov %esp,%ebp │
   │0x5655553f <main+2>   pushl 0x5655701d │
   │0x56555545 <main+8>   push $0x56557014 │
   │0x5655554a <main+13>  call 0xf7e222d0 <printf> │
   │0x5655554f <main+18>  mov %ebp,%esp │
   │0x56555551 <main+20>  ret
Resources on what goes on when a binary is actually executed are appreciated, since I don't know what's actually going on and the tutorials I've read don't cover it. I'm in the process of reading through How programs get run: ELF binaries.
The i386 System V ABI does guarantee / require 16 byte stack alignment before a call, like I said at the top of my answer that you linked. (Unless you're calling a private helper function, in which case you can make up your own rules for alignment, arg-passing, and which registers are clobbered for that function.)
Functions are allowed to crash or misbehave if you violate this ABI requirement, but are not required to. e.g. scanf in x86-64 Ubuntu glibc (as compiled by recent gcc) only recently started doing that: scanf Segmentation faults when called from a function that doesn't change RSP
Functions can depend on stack alignment for performance (to align a double or array of doubles to avoid cache-line splits when accessing them).
Usually the only case where a function depends on stack alignment for correctness is when compiled to use SSE/SSE2, so it can use 16-byte alignment-required loads/stores to copy a struct or array (movaps or movdqa), or to actually auto-vectorize a loop over a local array.
I think Ubuntu doesn't compile their 32-bit libraries with SSE (except functions like memcpy that use runtime dispatching), so they can still work on ancient CPUs like Pentium II. Multiarch libraries on an x86-64 system should assume SSE2, but with 4-byte pointers it's less likely that 32-bit functions would have 16 byte structs to copy.
Anyway, whatever the reason, obviously printf in your 32-bit build of glibc doesn't actually depend on 16-byte stack alignment for correctness, so it doesn't fault even when you misalign the stack.
Why is 0x2014 pushed instead of 0x14? What is 0x201d?
0x14 (decimal 20) is the value in memory at that location. It will be loaded at runtime, because you used push r/m32, not push $20 (or an assemble time constant like .equ testint, 20 or testint = 20).
You used gcc -m32 to make a PIE (Position Independent Executable), which is relocated at runtime, because that's the default on Ubuntu's gcc.
0x2014 is the offset relative to the start of the file. If you disassemble at runtime after running the program, you'll see a real address.
Same for call 54b. It's presumably a call to the PLT (which is near the start of the file / text segment, hence the low address).
If you disassembled with objdump -drwC, you'd see symbol relocation info. (I like -Mintel as well, but beware it's MASM-like, not NASM.)
You can link with gcc -m32 -no-pie to make classic position-dependent executables. I'd definitely recommend that especially for 32-bit code, and especially if you're compiling C, use gcc -m32 -no-pie -fno-pie to get non-PIE code-gen as well as linking into a non-PIE executable. (See 32-bit absolute addresses no longer allowed in x86-64 Linux? for more about PIEs.)
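For reference, here is how the printed target "54b" falls out of the five bytes e8 fc ff ff ff in the question's listing. This is only a sketch of the arithmetic for the E8 (call rel32) encoding; `call_target` is a name made up for the example, and it says nothing about where the call ends up after relocation.

```python
def call_target(insn_addr: int, insn_bytes: bytes) -> int:
    """Target of an x86 'call rel32' (opcode E8): end of the instruction plus a signed 32-bit offset."""
    if len(insn_bytes) != 5 or insn_bytes[0] != 0xE8:
        raise ValueError("expected a 5-byte E8 call")
    rel32 = int.from_bytes(insn_bytes[1:5], "little", signed=True)
    return insn_addr + len(insn_bytes) + rel32

# Address and bytes copied from the objdump listing in the question
target = call_target(0x54A, bytes.fromhex("e8fcffffff"))
print(hex(target))
```

The displacement field holds -4, so the computed target is 4 bytes before the end of the call instruction, which is inside the instruction itself. That is why the static listing looks meaningless until the relocation information (objdump -dr) or the runtime disassembly is consulted.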
Linking two .o files together
I have two .asm files, one that calls a function inside the other. My files look like:
mainProg.asm:
global main
extern factorial
section .text
main:
;---snip---
push rcx
call factorial
pop rcx
;---snip---
ret
factorial.asm:
section .text
factorial:
cmp rdi, 0
je l2
mov rax, 1
l1:
mul rdi
dec rdi
jnz l1
ret
l2:
mov rax, 1
ret
(Yes, there's some things I could improve with the implementation.) I tried to compile them according to the steps at How to link two nasm source files:
$ nasm -felf64 -o factorial.o factorial.asm
$ nasm -felf64 -o mainProg.o mainProg.asm
$ gcc -o mainProg mainProg.o factorial.o
The first two commands work without issue, but the last fails with
mainProg.o: In function `main':
mainProg.asm:(.text+0x22): undefined reference to `factorial'
collect2: error: ld returned 1 exit status
Changing the order of the object files doesn't change the error. I tried searching for solutions to link two .o files, and I found the question C Makefile given two .o files. As mentioned there, I ran objdump -S factorial.o and got
factorial.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <factorial>:
   0: 48 83 ff 00     cmp $0x0,%rdi
   4: 74 0e           je 14 <l2>
   6: b8 01 00 00 00  mov $0x1,%eax
000000000000000b <l1>:
   b: 48 f7 e7        mul %rdi
   e: 48 ff cf        dec %rdi
  11: 75 f8           jne b <l1>
  13: c3              retq
0000000000000014 <l2>:
  14: b8 01 00 00 00  mov $0x1,%eax
  19: c3              retq
which is pretty much identical to the source file. It clearly contains the factorial function, so why doesn't ld detect it? Is there a different method to link two .o files?
You need a global factorial assembler directive in factorial.asm. Without that, it's still in the symbol table, but the linker won't consider it for linking between objects.
A label like factorial: is half way between a global/external symbol and a local label like .loop1: would make (not present in the object file at all). Local labels are a good way to get less messy disassembly, with one block per function instead of a separate block starting after every branch target.
Non-global symbols are only useful for disassembly and stuff like that, AFAIK. I think they would get stripped, along with debug information, by strip.
Also, note that imul rax, rdi runs faster, because it doesn't have to store the high half of the result in %rdx, or even calculate it.
Also note that you can objdump -Mintel -d to get intel-syntax disassembly. Agner Fog's objconv is also very nice, but it's more typing because the output doesn't go to stdout by default. (Although a shell wrapper function or script can solve that.)
Anyway, this would be better:
global factorial
factorial:
mov eax, 1 ; depending on the assembler, might save a REX prefix
; early-out branch after setting rax, instead of duplicating the constant
test rdi, rdi ; test is shorter than compare-against-zero
jz .early_out
.loop: ; local label won't appear in the object file
imul rax, rdi
dec rdi
jnz .loop
.early_out:
ret
Why does main push/pop rcx? If you're writing functions that follow the standard ABI (definitely a good idea unless there's a large performance gain), and you want something to survive a call, keep it in a call-preserved register like rbx.
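The rule about global versus non-global symbols can be modelled in a few lines. This is a deliberately simplified toy, not how ld is implemented: object files are plain dictionaries, and `link` is a hypothetical function that only checks whether every undefined reference is satisfied by a global definition in some other object.

```python
def link(objects):
    """Toy symbol resolution: only 'global' definitions can satisfy undefined references."""
    exported = {}
    for obj in objects:
        for symbol, binding in obj["defined"].items():
            if binding == "global":
                exported[symbol] = obj["file"]
    errors = []
    for obj in objects:
        for symbol in obj["undefined"]:
            if symbol not in exported:
                errors.append(f"{obj['file']}: undefined reference to `{symbol}'")
    return errors

main_o = {"file": "mainProg.o", "defined": {"main": "global"}, "undefined": ["factorial"]}

# factorial.asm without "global factorial": the label is in the symbol table, but local
fact_local = {"file": "factorial.o",
              "defined": {"factorial": "local", "l1": "local", "l2": "local"},
              "undefined": []}

# factorial.asm with "global factorial"
fact_global = {"file": "factorial.o",
               "defined": {"factorial": "global", "l1": "local", "l2": "local"},
               "undefined": []}

print(link([main_o, fact_local]))
print(link([main_o, fact_global]))
```

The first call reports the same kind of undefined-reference error as in the question, even though factorial.o does contain a symbol named factorial; the second call reports none.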
x86_64: Is it possible to "in-line substitute" PLT/GOT references?
I'm not sure what a good subject line for this question is, but here we go: In order to force code locality/compactness for a critical section of code, I'm looking for a way to call a function in an external (dynamically-loaded) library through a "jump slot" (an ELF R_X86_64_JUMP_SLOT relocation) directly at the call site - what the linker ordinarily puts into PLT / GOT, but have these inlined right at the call site. If I emulate the call like:
#include <stdio.h>
int main(int argc, char **argv)
{
asm ("push $1f\n\t"
"jmp *0f\n\t"
"0: .quad %P0\n"
"1:\n\t"
: : "i"(printf), "D"("Hello, World!\n"));
return 0;
}
to get the space for a 64bit word, the call itself works (please, no comments about this being lucky coincidence as this breaks certain ABI rules - these are not the subject of this question and can, for my case, be worked around/addressed in other ways; I'm trying to keep this example brief). It creates the following assembly:
0000000000000000 <main>:
   0: bf 00 00 00 00        mov $0x0,%edi
        1: R_X86_64_32 .rodata.str1.1
   5: 68 00 00 00 00        pushq $0x0
        6: R_X86_64_32 .text+0x19
   a: ff 24 25 00 00 00 00  jmpq *0x0
        d: R_X86_64_32S .text+0x11
  ...
        11: R_X86_64_64 printf
  19: 31 c0                 xor %eax,%eax
  1b: c3                    retq
But (due to using printf as the immediate, I guess ... ?) the target address here is still that of the PLT hook - the same R_X86_64_64 reloc. Linking the object file against libc into an actual executable results in:
0000000000400428 <printf@plt>:
  400428: ff 25 92 04 10 00     jmpq *1049746(%rip) # 5008c0 <_GLOBAL_OFFSET_TABLE_+0x20>
[ ... ]
0000000000400500 <main>:
  400500: bf 0c 06 40 00        mov $0x40060c,%edi
  400505: 68 19 05 40 00        pushq $0x400519
  40050a: ff 24 25 11 05 40 00  jmpq *0x400511
  400511: [ .quad 400428 ]
  400519: 31 c0                 xorl %eax, %eax
  40051b: c3                    retq
[ ... ]
DYNAMIC RELOCATION RECORDS
OFFSET           TYPE               VALUE
[ ... ]
00000000005008c0 R_X86_64_JUMP_SLOT printf
I.e. this still gives the two-step redirection: first transfer execution to the PLT hook, then jump into the library entry point.
Is there a way I can instruct the compiler/assembler/linker to - in this example - "inline" the jump slot target at address 0x400511? I.e. replace the "local" (resolved at program link time by ld) R_X86_64_64 reloc with the "remote" (resolved at program load time by ld.so) R_X86_64_JUMP_SLOT one (and force non-lazy-load for this section of code)? Maybe linker mapfiles might make this possible - if so, how?
Edit: To make this clear, the question is about how to achieve this in a dynamically-linked executable / for an external function that's only available in a dynamic library. Yes, it's true static linking resolves this in a simpler way, but:
There are systems (like Solaris) where static libraries are generally not shipped by the vendor
There are libraries that aren't available as either source code or static versions
Hence static linking is not helpful here :(
Edit2: I've found that in some architectures (SPARC, notably; see the section on SPARC relocations in the GNU as manual), GNU as is able to create certain types of relocation references for the linker in-place using modifiers. The quoted SPARC one would use %gdop(symbolname) to make the assembler emit instructions to the linker stating "create that relocation right here". Intel's assembler on Itanium knows the @fptr(symbol) link-relocation operator for the same kind of thing (see also section 4 in the Itanium psABI). But does an equivalent mechanism - something to instruct the assembler to emit a specific linker relocation type at a specific position in the code - exist for x86_64?
I've also found that the GNU assembler has a .reloc directive which supposedly is to be used for this purpose; still, if I try:
#include <stdio.h>
int main(int argc, char **argv)
{
asm ("push %%rax\n\t"
"lea 1f(%%rip), %%rax\n\t"
"xchg %%rax, (%%rsp)\n\t"
"jmp *0f\n\t"
".reloc 0f, R_X86_64_JUMP_SLOT, printf\n\t"
"0: .quad 0\n"
"1:\n\t"
: : "D"("Hello, World!\n"));
return 0;
}
I get an error from the linker (note that 7 == R_X86_64_JUMP_SLOT):
error: /tmp/cc6BUEZh.o: unexpected reloc 7 in object file
The assembler creates an object file for which readelf says:
Relocation section '.rela.text.startup' at offset 0x5e8 contains 2 entries:
Offset           Info             Type               Symbol's Value   Symbol's Name + Addend
0000000000000001 000000050000000a R_X86_64_32        0000000000000000 .rodata.str1.1 + 0
0000000000000017 0000000b00000007 R_X86_64_JUMP_SLOT 0000000000000000 printf + 0
This is what I want - but the linker doesn't take it. The linker does accept just using R_X86_64_64 instead above; doing that creates the same kind of binary as in the first case ... redirecting to printf@plt, not the "resolved" one.
This optimization has since been implemented in GCC. It can be enabled with the -fno-plt option and the noplt function attribute: Do not use the PLT for external function calls in position-independent code. Instead, load the callee address at call sites from the GOT and branch to it. This leads to more efficient code by eliminating PLT stubs and exposing GOT loads to optimizations. On architectures such as 32-bit x86 where PLT stubs expect the GOT pointer in a specific register, this gives more register allocation freedom to the compiler. Lazy binding requires use of the PLT; with -fno-plt all external symbols are resolved at load time. Alternatively, the function attribute noplt can be used to avoid calls through the PLT for specific external functions. In position-dependent code, a few targets also convert calls to functions that are marked to not use the PLT to use the GOT instead.
In order to inline the call you would need a code (.text) relocation whose result is the final address of the function in the dynamically loaded shared library. No such relocation exists (and modern static linkers don't allow them) on x86_64 using a GNU toolchain for GNU/Linux, therefore you cannot inline the entire call as you wish to do. The closest you can get is a direct call through the GOT (avoids PLT):
.section .rodata
.LC0:
.string "Hello, World!\n"
.text
.globl main
.type main, @function
main:
pushq %rbp
movq %rsp, %rbp
movl $.LC0, %eax
movq %rax, %rdi
call *printf@GOTPCREL(%rip)
nop
popq %rbp
ret
.size main, .-main
This should generate a R_X86_64_GLOB_DAT relocation against printf in the GOT to be used by the sequence above. You need to avoid C code because in general the compiler may use any number of caller-saved registers in the prologue and epilogue, and this forces you to save and restore all such registers around the asm function call or risk corrupting those registers for later use in the wrapper function. Therefore it is easier to write the wrapper in pure assembly.
Another option is to compile with -Wl,-z,now -Wl,-z,relro which ensures the PLT and PLT-related GOT entries are resolved at startup to increase code locality and compactness. With full RELRO you'll only have to run code in the PLT and access data in the GOT, two things which should already be somewhere in the cache hierarchy of the logical core. If full RELRO is enough to meet your needs then you wouldn't need wrappers and you would have added security benefits.
The best options are really static linking or LTO if they are available to you.
You can statically link the executable. Just add -static to the final link command, and all your indirect jumps will be replaced by direct calls.