How can I find the main function in a PE executable file? Should I find the entry point address and start from there, or look for three pushes onto the stack in case the PE was written in C?
There are not going to be 3 pushes because main is not the real entry point on Windows. The compiler will insert extra code that initializes things and then calls main/WinMain. There is probably too much code between the real start and main to automate finding main. You would have to consider multiple versions of Visual Studio and MinGW. And some exe files do not use the C run-time at all and execute directly from the real entrypoint.
The entry point is a function that takes zero arguments. Its address is the load address of the .exe (Starting with MZ) + IMAGE_OPTIONAL_HEADER.AddressOfEntryPoint.
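As a rough illustration of that address calculation, here is a minimal Python sketch that reads AddressOfEntryPoint from raw PE bytes and adds it to a load address. The function names and the `make_fake_image` helper are hypothetical; the helper builds only enough header bytes to exercise the parser and is not a loadable executable.

```python
import struct


def entry_point_rva(image: bytes) -> int:
    """Read IMAGE_OPTIONAL_HEADER.AddressOfEntryPoint from raw PE bytes."""
    if image[:2] != b"MZ":
        raise ValueError("not an MZ image")
    # e_lfanew at offset 0x3C of the DOS header points at the PE signature.
    (e_lfanew,) = struct.unpack_from("<I", image, 0x3C)
    if image[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    # Skip the 4-byte signature and the 20-byte IMAGE_FILE_HEADER.
    optional_header = e_lfanew + 4 + 20
    # AddressOfEntryPoint sits 16 bytes into the optional header (PE32 and PE32+).
    (rva,) = struct.unpack_from("<I", image, optional_header + 16)
    return rva


def entry_point_va(load_address: int, image: bytes) -> int:
    """Entry point = load address of the image + AddressOfEntryPoint."""
    return load_address + entry_point_rva(image)


def make_fake_image(entry_rva: int) -> bytes:
    """Hypothetical helper: header-only stand-in used for the demo below."""
    image = bytearray(0x40 + 4 + 20 + 20)
    image[0:2] = b"MZ"
    struct.pack_into("<I", image, 0x3C, 0x40)                     # e_lfanew
    image[0x40:0x44] = b"PE\0\0"
    struct.pack_into("<I", image, 0x40 + 4 + 20 + 16, entry_rva)  # AddressOfEntryPoint
    return bytes(image)


image = make_fake_image(0x1234)
print(hex(entry_point_va(0x400000, image)))  # prints: 0x401234
```

On a real file you would pass the bytes read from disk and the actual load address of the module; what you get back is the run-time startup code, not main itself.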
Today, purely for testing purposes, I came up with the following idea: create and compile a naive source file in CodeBlocks, using the Release target to remove the unnecessary debugging code, with a main function containing only three nop operations, so that I can find the entry point of main more quickly.
CodeBlocks sample naive program:
Using the IDA disassembler, I saw something strange: the OS can apparently add additional machine code calls to the main function (added implicitly), namely a call to a system function that resides in kernel32.dll and is used for OS thread handling.
IDA program view:
For testing, the three "nop" opcodes (90) in the machine code were replaced by "and esp, 0FFFFFFF0h" and the program was re-patched, which is why the "no operation" opcodes are not visible in the view.
Observed behaviour:
It seems logical to create a new thread for each process that is opened. As we can see in Task Manager, a process runs in its own thread, which would be the reason the compiler adds this code (the implicit default thread).
My questions:
How does the compiler know where to "inject" this call automatically?
Why is this call not made earlier, in the upper function (sub_401B8C) that routes to the main function's entry point?
To quote the gcc manual:
If no init section is available, when GCC compiles any function called
main (or more accurately, any function designated as a program entry
point by the language front end calling expand_main_function), it
inserts a procedure call to __main as the first executable code after
the function prologue. The __main function is defined in libgcc2.c and
runs the global constructors.
I've read this tutorial: http://www.bravegnu.org/gnu-eprog/c-startup.html
I could follow the guide and run the code, but I have questions.
1) Why do we need both a load address and a run-time address? As I understand it, it is because we have put .data in flash too; so why don't we run the app from there, instead of needing start-up code to copy it into RAM?
2) Why do we need a linker script and start-up code here? Can I not just build the C source as below and run it with qemu?
arm-none-eabi-gcc -nostdlib -o sum_array.elf sum_array.c
Many thanks
Your first question was answered in the guide.
When you load a program on an operating system your .data section, basically non-zero globals, are loaded from the "binary" into the right offset in memory for you, so that when your program starts those memory locations that represent your variables have those values.
unsigned int x=5;
unsigned int y;
As a C programmer you write the above code and you expect x to be 5 when you first start using it, yes? Well, if you are booting from flash, bare metal, you don't have an operating system to copy that value into RAM for you; somebody has to do it. Further, all of the .data stuff has to be in flash: that number 5 has to be somewhere in flash so that it can be copied to RAM. So you need a flash address for it and a RAM address for it. Two addresses for the same thing.
And that begins to answer your second question. For every line of C code you write you assume things, for example that any function can call any other function. You would like to be able to call functions, yes? And you would like to be able to have local variables, and you would like the variable x above to be 5, and you might assume that y will be zero, although, thankfully, compilers are starting to warn about that. The startup code, at a minimum for generic C, sets up the stack pointer, which allows you to call other functions, have local variables, and have functions more than one or two lines of code long; it zeros the .bss so that the y variable above is zero; and it copies the value 5 over to RAM so that x is ready to go when your entry point C function runs.
If you don't have an operating system then you have to have code to do this, and yes, there are many, many sandboxes and toolchains set up for various platforms that already have the startup code and linker script, so that you can just
gcc -o myprog.elf myprog.c
Now that doesn't mean you can make system calls without a...system...printf, fopen, etc. But if you download one of these toolchains it does mean that you don't actually have to write the linker script or the bootstrap.
But it is still valuable information. Note that the startup code and linker script are required for operating-system-based programs too; it is just that native compilers for your operating system assume you are mostly going to write programs for that operating system, and as a result they provide a linker script and startup code in that toolchain.
1) The .data section contains variables. Variables are, well, variable -- they change at run time. The variables need to be in RAM so that they can be easily changed at run time. Flash, unlike RAM, is not easily changed at run time. The flash contains the initial values of the variables in the .data section. The startup code copies the .data section from flash to RAM to initialize the run-time variables in RAM.
2) Linker-script: The object code created by your compiler has not been located into the microcontroller's memory map. This is the job of the linker and that is why you need a linker script. The linker script is input to the linker and provides some instructions on the location and extent of the system's memory.
Startup code: Your C program that begins at main does not run in a vacuum but makes some assumptions about the environment. For example, it assumes that the initialized variables are already initialized before main executes. The startup code is necessary to put in place all the things that are assumed to be in place when main executes (i.e., the "run-time environment"). The stack pointer is another example of something that gets initialized in the startup code, before main executes. And if you are using C++ then the constructors of static objects are called from the startup code, before main executes.
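To make the two memory jobs of the startup code concrete, here is a small Python model of them, assuming flash and RAM are plain byte buffers. The `run_startup` function and the offsets are hypothetical; real startup code does the same thing in assembly or C using addresses exported by the linker script.

```python
def run_startup(flash: bytes, ram: bytearray, data_load: int, data_run: int,
                data_size: int, bss_run: int, bss_size: int) -> None:
    """Mimic the two memory jobs of bare-metal startup code."""
    # Copy the initial values of .data from the load address (flash)
    # to the run-time address (RAM).
    ram[data_run:data_run + data_size] = flash[data_load:data_load + data_size]
    # Zero .bss so that uninitialized globals start out as 0.
    ram[bss_run:bss_run + bss_size] = bytes(bss_size)


# Flash holds 16 bytes of "code" followed by the initial value of x
# (5 as a little-endian 32-bit integer).
flash = b"\xAA" * 16 + (5).to_bytes(4, "little")
ram = bytearray(b"\xFF" * 16)   # RAM contents are arbitrary at power-up

run_startup(flash, ram, data_load=16, data_run=0, data_size=4,
            bss_run=4, bss_size=4)

x = int.from_bytes(ram[0:4], "little")   # unsigned int x = 5;
y = int.from_bytes(ram[4:8], "little")   # unsigned int y;
print(x, y)  # prints: 5 0
```

The value 5 exists at two addresses: offset 16 in flash (load address) and offset 0 in RAM (run-time address), which is the pair the linker script has to describe.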
1) Why do we need both load-address and run-time address.
While it is in most cases possible to run code from memory-mapped ROM, code will often execute faster from RAM. In some cases there may also be much more RAM than ROM, and the application code may be compressed in ROM, so the executable code is not simply copied from ROM but also decompressed, allowing a much larger application than the available ROM.
In situations where the code is stored on non-memory mapped mass-storage media such as NAND flash, it cannot be executed directly in any case and must be loaded into RAM by some sort of bootloader.
2) Why we need linker script and start-up code here. Can I not just build C source as below and run it with qemu?
The linker script defines the memory layout of your target and application. Since this tutorial is for bare-metal programming, there is no OS to handle that for you. Similarly, the start-up code is required to at least set an initial stack pointer, initialise static data, and jump to main. On an embedded system it is also necessary to initialise various hardware such as the PLL, memory controllers, etc.
Low-level details on linking and loading of (PE) programs in Windows.
I'm looking for an answer or tutorial that clarifies how a Windows program is linked and loaded into memory after it has been assembled.
Especially, I'm uncertain about the following points:
After the program is assembled, some instructions may reference memory within the .DATA section. How are these references translated when the program is loaded into memory starting at some arbitrary address? Do RVAs and relative memory references take care of these issues (the BaseOfCode and BaseOfData RVA fields of the PE header)?
Is the program always loaded at the address specified in ImageBase header field? What if a loaded (DLL) module specifies the same base?
First I'm going to answer your second question:
No, a module (be it an exe or a dll) is not always loaded at its base address. This can happen for two reasons: either some other module is already loaded and there is no space for loading it at the base address contained in the headers, or because of ASLR (Address Space Layout Randomization), which means modules are loaded at random slots for exploit mitigation purposes.
To address the first question (it is related to the second one):
The way a memory location is referred to can be relative or absolute. Usually jumps and function calls are relative (though they can be absolute), which says: "go this many bytes from the current instruction pointer". Regardless of where the module is loaded, relative jumps and calls will work.
When it comes to addressing data, references are usually absolute, that is, "access this 4-byte datum at this address". And a full virtual address is specified: not an RVA but a VA.
If a module is not loaded at its base address, all absolute references will be broken: they no longer point to the place the linker assumed they would. Let's say the ImageBase is 0x04000000 and you have a variable at RVA 0x000000F4; its VA will be 0x040000F4. Now imagine the module is loaded not at its ImageBase but at 0x05000000. Everything is moved 0x01000000 bytes forward, so the VA of your variable is actually 0x050000F4, but the machine code that accesses the data still has the old address hardcoded, so the program is corrupted. In order to fix this, linkers store in the executable where these absolute references are, so they can be fixed by adding to them how much the executable has been displaced: the delta offset, the difference between where the image is loaded and the image base contained in the headers of the executable file. In this case it's 0x01000000. This process is called base relocation and is performed at load time by the operating system, before the code starts executing.
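The fixup arithmetic can be sketched in a few lines of Python. This is a simplification under stated assumptions: the fixup locations are given as a flat list of offsets (a hypothetical stand-in for the real .reloc section, which is organized as per-page blocks), and every fixup is a 32-bit absolute address.

```python
import struct


def apply_base_relocations(image: bytearray, fixup_offsets, preferred_base: int,
                           actual_base: int) -> int:
    """Add the load delta to every 32-bit absolute address listed in fixup_offsets."""
    delta = actual_base - preferred_base
    for offset in fixup_offsets:
        (va,) = struct.unpack_from("<I", image, offset)
        struct.pack_into("<I", image, offset, (va + delta) & 0xFFFFFFFF)
    return delta


PREFERRED_BASE = 0x04000000   # ImageBase from the headers
ACTUAL_BASE = 0x05000000      # where the loader actually mapped the module

# Toy "code": opcode byte A1 (mov eax, [addr32]) followed by the absolute VA
# of a variable at RVA 0xF4.
image = bytearray(b"\xA1" + struct.pack("<I", PREFERRED_BASE + 0xF4))

delta = apply_base_relocations(image, [1], PREFERRED_BASE, ACTUAL_BASE)
(patched_va,) = struct.unpack_from("<I", image, 1)
print(hex(delta), hex(patched_va))  # prints: 0x1000000 0x50000f4
```

Only the four address bytes change; the opcode and the RVA of the variable stay the same.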
Sometimes a module has no relocations, so it can't be loaded anywhere else but at its base address. See How do I determine if an EXE (or DLL) participate in ASLR, i.e. is relocatable?
For more information on ASLR: https://insights.sei.cmu.edu/cert/2014/02/differences-between-aslr-on-windows-and-linux.html
There is another way to move the executable in memory and still have it run correctly. There exists something called Position Independent Code. Code crafted in such a way that it will run anywhere in memory without the need for the loader to perform base relocations.
This is very common in Linux shared libraries and it is done addressing data relatively (access this data item at this distance from the instruction pointer).
To do this, in the x64 architecture there is RIP-relative addressing, in x86 a trick is used to emulate it: get the content of the instruction pointer and then calculate the VA of a variable by adding to it a constant offset.
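A tiny Python sketch of why instruction-pointer-relative addressing needs no fixups: the displacement between the instruction and the data is fixed at link time, so the computed address follows the module wherever it is loaded. The RVAs below are hypothetical.

```python
def ip_relative_reference(next_instruction_va: int, displacement: int) -> int:
    """An IP-relative reference resolves to instruction pointer + fixed displacement."""
    return next_instruction_va + displacement


CODE_RVA = 0x1000                    # hypothetical RVA of the instruction after the access
DATA_RVA = 0x20F4                    # hypothetical RVA of the variable
DISPLACEMENT = DATA_RVA - CODE_RVA   # constant encoded in the instruction at link time

for base in (0x04000000, 0x05000000):
    target = ip_relative_reference(base + CODE_RVA, DISPLACEMENT)
    # The same encoded displacement finds the variable at either load address.
    print(hex(base), hex(target), target == base + DATA_RVA)
```

An absolute reference, by contrast, would still hold 0x040020F4 at the second base and would need a base relocation.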
This is very well explained here:
https://www.technovelty.org/linux/plt-and-got-the-key-to-code-sharing-and-dynamic-libraries.html
I don't think PIC is common on Windows. More often than not, Windows modules contain base relocations to fix up absolute addresses when the module is loaded somewhere other than its preferred base address, although I'm not exactly sure about this last paragraph, so take it with a grain of salt.
More info:
http://opensecuritytraining.info/LifeOfBinaries.html
How are windows DLL actually shared? (a bit confusing because I didn't explain myself well when asking the question).
https://www.iecc.com/linker/
I hope I've helped :)
Address fixup for calls to DLL functions is a multistage process: the linker directs the call instruction to an indirect jump instruction, and the indirect jump instruction to a word of memory in the import table in the .rdata section where the Windows program loader will place the address of the function when the DLL is loaded at runtime.
The indirect jump instruction must be generated by the linker because the compiler doesn't know the function will turn out to be in a DLL. Program file size is minimized by generating only one indirect jump instruction for each function, no matter how many places it's called from.
Given that, the obvious way to do it is to gather all the indirect jump instructions at the end of the text section, after all the compiler-generated code in all the object files, and that does seem to be what happens when I try a simple test case with the Microsoft linker /nodefaultlib switch (which generates a small enough executable that I can understand the full disassembly).
When I link a small program in the normal way with the C standard library, the resulting executable is large enough that I can't follow all of the disassembly, but as far as I can see, the indirect jump instructions seem to be scattered throughout the code in small groups of maybe three at a time.
Is there a reason for this that I'm missing?
The indirect jump instruction must be generated by the linker because
the compiler doesn't know the function will turn out to be in a DLL.
Actually, this is not always the case. If you mark the function with __declspec(dllimport), the compiler does know it will be a DLL import and in that case it can generate an indirect call:
; HMODULE = LoadLibrary("mylib");
push offset $SG66630
call [__imp__LoadLibraryA@4]
(__imp__LoadLibraryA@4 is the pointer to the import in the IAT)
If you do not use dllimport then the compiler generates a relative function call:
push offset $SG66630
call _LoadLibraryA@4
And in such a case the linker has to generate a jump stub:
_LoadLibraryA@4 proc near
jmp [__imp__LoadLibraryA@4]
_LoadLibraryA@4 endp
And, in fact, it does group such jump stubs together (though possibly by compile unit and/or imported DLL, not 100% sure here).
Note: in the past, the linker did not explicitly generate jump stubs but took them from the import libraries. They contained complete object files with both the stubs and the structures necessary for generating the PE import directory. See this article for how it all worked: https://www.microsoft.com/msj/0498/hood0498.aspx
These days the import libraries have only the API and DLL names and the linker knows how to generate the necessary code and metadata for importing them.
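The two call paths can be modelled loosely in Python, treating the IAT as a table of slots that the loader fills in. Everything here is a hypothetical stand-in (including `fake_load_library_a`); it only illustrates the indirection, not real machine code.

```python
# The import address table: one slot per imported function, empty until load time.
iat = {"LoadLibraryA": None}


def load_library_a_stub(*args):
    """The single jump stub: jump through the IAT slot to the real function."""
    return iat["LoadLibraryA"](*args)


def call_site_without_dllimport(name):
    # The compiler emitted a plain call; the linker pointed it at the stub.
    return load_library_a_stub(name)


def call_site_with_dllimport(name):
    # The compiler knew this was an import and emitted an indirect call
    # through the IAT slot, so no stub is involved.
    return iat["LoadLibraryA"](name)


def fake_load_library_a(name):
    """Stand-in for the function exported by the DLL (hypothetical)."""
    return f"handle:{name}"


# The program loader fills in the slot when the DLL is mapped.
iat["LoadLibraryA"] = fake_load_library_a

print(call_site_without_dllimport("mylib"), call_site_with_dllimport("mylib"))
```

However many call sites use the stub, there is still only one stub and one IAT slot per imported function.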
Using identical source files for a Fortran .dll I can compile them with Compaq Visual Fortran 6.6C or Intel Visual Fortran 12.1.3.300 (IA-32). The problem is that the execution fails on the Intel binary, but works well with Compaq. I am compiling 32-bit on a Windows 7 64-bit system. The .dll calling driver is written in C#.
The failure message comes from the dreaded _chkstk() call when an internal subroutine is called (called from the .dll entry routine). (SO answer on chkstk())
The procedure in question is declared as (pardon the fixed file format)
SUBROUTINE SRF(den, crpm, icrpm, inose, qeff, rev,
& qqmax, lvtyp1, lvtyp2, avespd, fridry, luin,
& luout, lurtpo, ludiag, ndiag, n, nzdepth,
& unit, unito, ier)
INTEGER*4 lvtyp1, lvtyp2, luin, luout, lurtpo, ludiag, ndiag, n,
& ncp, inose, icrpm, ier, nzdepth
REAL*8 den, crpm, qeff, rev, qqmax, avespd, fridry
CHARACTER*2 unit, unito
and called like this:
CALL SRF(den, crpm(i), i, inose, qeff(i), rev(i),
& qqmax(i), lvtyp1, lvtyp2, avespd, fridry,
& luin, luout, lurtpo, ludiag, ndiag, n, nzdepth,
& unit, unito, ier)
with similar variable specifications, except that crpm, qeff, rev and qqmax are arrays of which only the i-th element is used for each SRF() call.
I understand possible stack issues if the arguments are more than 8kb in size, but in this case we have 7 x real(64) + 11 x int(32) + 2 x 2 x char(8) = 832 bits only in passed arguments.
I have worked really hard to move arguments (especially arrays) into a module, but I keep getting the same error.
The disassembly from the Intel .dll is
The disassembly from the Compaq .dll is
Can anyone offer any suggestions on what is causing the SO, or how to debug it?
PS. I have increased the reserved stack space to hundreds of MB and the problem persists. I have tried skipping the chkstk() call in the disassembler but it crashes the program. The stack check starts from address 0x354000 and iterates down to 0x2D2000 where it crashes accessing a guard page. The stack bottom address is 0x282000.
You are shooting the messenger. The Compaq generated code also calls _chkstk(), the difference is that it inlined it. A common optimization. The key difference between the two snippets is:
mov eax, 0D3668h
vs
sub esp, 233E4h
The values you see used here are the amount of stack space required by the function. The Intel code requires 0xd3668 bytes = 865896 bytes. The Compaq code requires 0x233e4 = 144356. Big difference. In both cases that's rather a large amount, but the Intel one is getting critical: a program normally has a one megabyte stack. Gobbling up 0.86 megabytes of it is pushing it very close; nest a couple of function calls and you're looking at this site's name.
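The two frame sizes are easy to double-check. A quick Python sketch, assuming the usual 1 MiB default stack reservation:

```python
intel_frame = 0xD3668     # operand of "mov eax, 0D3668h" ahead of the _chkstk call
compaq_frame = 0x233E4    # operand of "sub esp, 233E4h"
default_stack = 1 << 20   # 1 MiB, assumed default stack reservation

print(intel_frame, compaq_frame)   # prints: 865896 144356

# Share of a 1 MiB stack taken by each frame, and what is left for callers.
print(round(intel_frame / default_stack, 3))
print(round(compaq_frame / default_stack, 3))
print(default_stack - intel_frame)
```

With the Intel build, a single call to this function leaves well under 200 KB of a default-sized stack for everything above and below it.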
What you need to find out, I can't help because it is not in your snippet, is why the Intel generated function needs so much space for its local variables. Workarounds are to use the free store to find space for large arrays. Or use the linker's /STACK option to ask for more stack space (guessing at the option name).
The problem wasn't at the function call where the stack overflow occurred.
Earlier in the code, some global matrices were initialized and placed on the stack, and due to a bug in the code they were still in scope and had already almost filled the stack. When the function call happened, the program tried to store the return address on the stack and crashed.
The solution was to make the global matrices allocatable and also to make sure the "Heap Arrays" option was set to an appropriate value.
Quite the rabbit hole this was, when it was 100% my buggy code that caused the issue.