The question is about loading portable executable images to a random address.
Let's take kernel32.dll as an example, loaded at 0x75A00000.
I can see that at offset 0x10e15 from the image, there is an assembler instruction, which depends on where the image is located.
address:
75A10E13
bytes:
8B 35 18 03 AE 75
command:
MOV ESI,DWORD PTR DS:[75AE0318]
It turns out that by launching the executable file, we must tell the system that we need to relocation to this address.
The system looks at the relocation table, which is in the executable file, and sees the following:
base relocation table
To get the absolute address of the first element to be moved, I do the following: add the virtual address to the address of the image, and then I add the first element of the block to the resulting number.
0x75A00000 + 0x10000 + 0x3E15 = 75A10E15
it's a good number, but always 0x3000 more than I expect. i just subtract 0x3000 and it works. Please, help me find the answer, where does 0x3000 for x86 come from?
Relocation in Portable Executables were resolved when the file was linked. The base relocation table, which you are referring, has a different function: it is used by Windows loader when the PE could not be loaded at the prefered ImageBase address specified by the linker, usually 0x0040_0000.
Dynamically Loaded Libraries shipped with MS Windows are linked to ImageBase addresses different for each core DLL and chosen not to colide with one another, so an executable which imports usual combination of libraries doesn't have to relocate them.
You misinterpreted the format of base relocation section .reloc.
Those 16bit words TypeOrOffset which follow PageRVA and BlockSize have their Base Relocation Type encoded in four most significant bits.
For instance the first TypeOrOffset entry in you dump 0x3E15 has type IMAGE_REL_BASED_HIGHLOW (3) and offset 0x0E15, which is the number to be added to PageRVA.
Related
I have a (C++) program where, in one of its dll's, the following is done:
if (m_Map.GetMaxValue() >= MAX_CLASSES) {
I have two binaries of this program (compiled with various versions of Visual Studio), one where MAX_CLASSES was #define'd to 50, and one where it was 75. These binaries were made from different branches of the code and the other functionality is different as well. What I need is a version of the binary where the MAX_CLASSES was defined as 50, except with the higher limit i.e. 75.
So a sane person would change the constant in the source code of the branch I need, rebuild and go home. But, building this software is complex because it's old, the dependencies and tooling are old, etc.; plus I have issues with building the installers, and data and so on. So, I thought, how about I just patch this binary so that this one constant is changed directly in the DLL. I have vague recollections of doing similar things in the 1990's for, eh, probably 'educational' purposes.
But times have changed and I barely remember doing it, let alone how I did things back then. I opened the DLL (one where the limit is set to 75, this is the binary I have at hand - I will have to re-do this as soon as I have the actual binary with the 50 limit, so the following references 75 i.e. 0x4b for illustrating the principle) in Ghidra and after some poking around, I found the following:
18005160e 3c 4b CMP AL,0x4b
180051610 0f 82 19 JC LAB_18005172f
01 00 00
Which in the decompiler window I could link back to
if (bVar3 < 0x4b)
and some operations after that that I can map to the source code of the function I have.
Now my questions are:
how do I interpret the values above (the Ghidra output) wrt to the binary layout of the dll? When I hover over the first column value ('18005160e') in Ghidra, I get values for 'imagebase offset', 'memory block offset', 'function offset' and 'byte source offset'. Is this 'byte source offset' the physical address from the start of the dll where these instructions start? The actual value in this hover balloon is 50a0eh - is that Ghidra's notation for 0x50a0e ? I.e. does the trailing 'h' denote 'hex'?
I then tried to open the dll in a regular hex editor ('Hex Editor Neo' which I like to use to view/edit binary data files), and went to offset 0x50a0e, and looked for the values '3c 4b' around there which I didn't find. I searched for this byte sequence in the whole file, and found 7 occurrences, none of which are around 0x50a0e, leading me to think I'm misinterpreting Ghidra's 'byte source offset' here.
how do I make a 'patcher' for this? I would think what I need is a program that only does
FILE* fh = fopen('mydll.dll);
fseek(fh, 0x[magic constant]);
fwrite(fh, 0x4b);
fclose(fh);
where '0x[magic constant]' is hopefully just the value I got from Ghidra in 'byte source offset'? Or is there anything else I need to consider here? Is there any software tool that can generate a patcher program?
Thanks.
18005160e is a VA, a Virtual Address.
It is the sum of a Base Address (most likely 180000000) and an RVA, a Relative Virtual Address.
Find the Base Address of the DLL with any PE inspecting tool (e.g. CFF Explorer) or Ghidra itself.
Subtract the base address from 18005160e to the RVA. Let's say the result is 5160e.
Now you need to find which section this RVA lies in. Again use an PE inspecting tool to find the list of the sections and their RVA/Virtual start and RVA/Virtual size.
Say the RVA lies in the .text section with start at the RVA 1000.
Subtract this start RVA from the result above: 5160e - 1000 = 4160e.
This is the offset of the instruction in the .text section.
To find the offset in the file, just add the raw/offset start of the section (again you can find this with a PE inspecting tool).
Say the .text section starts at the offset 400, then 4160e + 400 = 41a0e is the offset corresponding to the VA 18005160e.
This is all PE 101.
I noticed a potential bug in some code i'm writing.
I though that if I used mov ax, seg segment_name, the program might be non-portable and only work on one machine in a specific configuration since the load location can vary from machine to machine.
So I decided to disassemble a program containing just that one instruction on two different machines running DOS and I found that the problem was magically solved.
Output of debug on machine one: 0C7A:014C B8BB0C MOV AX,0CBB
Output of debug on machine two: 06CA:014C B80B07 MOV AX,070B
After hex dumping the program I found that the unaltered bytes are actually B84200.
Manually inserting those bytes back into the program results in mov ax, 0042
So does the PE format store references to those instructions and update them at runtime?
As Peter Cordes noted, MS-DOS doesn't use the PECOFF executable format that Windows uses. It has it's own "MZ" executable format, named after the first two bytes of the executable that identify as being in this format.
The MZ format supports the use of multiple segments through a relocation table containing relocations. These relocations are just simple segment:offset values that indicate the location of 16-bit segment values that need to be adjusted based on where the executable was loaded in memory. MS-DOS performs these adjustments by simply adding the actual load segment of the program to the value contained in the executable. This means that without relocations applied the executable would only work if loaded at segment 0, which happens to be impossible.
Note this isn't just necessary for a program to work on multiple machines, it's also necessary for the same program to work reliably on the same machine. The load address can change based on what various configuration details, was well as other programs and drivers that have already been loaded in memory, so the load address of an MS-DOS executable is essentially unpredictable.
Working backwards from your example, we can tell where your example program was loaded into memory on both machines. Since 0042h was relocated into 0CBBh on the first machine and into 070Bh on the second machine, we know MS-DOS loaded your program on the two machines at segments 0C79h and 06C9h respectively:
0CBB - 0042 = 0C79
070B - 0042 = 06C9
From that we can determine that your example executable has the entry 0001:014D, or equivalent segment:offset value, in it's relocation table:
0C7A:014D - 0C79:0000 = 0001:014D
06CA:014D - 06C9:0000 = 0001:014D
This entry indicates the unrelocated location of the 16-bit immediate operand of the mov ax, seg segname instruction that needs adjusting.
I wonder why some Windows executables do have relocations. Why is there a need for it when an executable always can be loaded at any virtual address, unlike a DLL?
yes, relocation in EXE is optional and can be stripped. but if we want /DYNAMICBASE - generate an executable image that can be randomly rebased at load time by using the address space layout randomization (ASLR) - we need relocs. so this i be say only for security reasons. like security cookies in stack, Control Flow Guard and etc.. - all this is optional but used
that because of pointers and address reference look to this code :
int i;
int *ptr = &i;
If the linker assumed an image base of 0x10000, the address of the variable i will end up containing something like 0x12004. At the memory used to hold the pointer "ptr", the linker will have written out 0x12004, since that's the address of the variable i. If the loader for whatever reason decided to load the file at a base address of 0x70000, the address of i would be 0x72004. The .reloc section is a list of places in the image where the difference between the linker assumed load address and the actual load address needs to be factored in.
According to the ld manual on Output Section Description:
section [address] [(type)] :
[AT(lma)]
[ALIGN(section_align) | ALIGN_WITH_INPUT]
[SUBALIGN(subsection_align)]
[constraint]
{
output-section-command
output-section-command
...
} [>region] [AT>lma_region] [:phdr :phdr ...] [=fillexp] [,]
The address or >region stand for the VMA, i.e. the Virtual Memory Address of the output section.
The AT() or AT>lma_region stand for the LMA, i.e. the Load Memory Address of the output section.
And I decide get a close view with readelf -e to dump the section headers and program headers of a helloworld elf file. The result is below:
My questions are:
Why there's no LMA in the dumped headers? How is LMA represented in ELF file?
What does the Addr column in the red rectangle mean? VMA?
What does the PhysAddr in the green rectangle mean?
ADD 1
So far, It seems the PhysAddr is the LMA.
Why there's no LMA in the dumped headers? How is the LMA represented
in an ELF file
Firstly there is no LMA header within an elf file, it is actually quiet simple, multiple sections in an ELF file are mapped into segments, if the sections mapped into segments have a LOAD flag for example (PROFBITS) is a loadable section type, and the segment they are mapped into is also a load type segment (INTERP and LOAD) for example are also loadable segments, that means every section within that segment within that elf file would be loaded into memory. where? simply to the VMA they were given, so no there is no LMA in an elf file, a LMA is represented by a VMA given that the section should be loaded which is a specified type / flag.
What does the addr column in the red rectangle mean?
This has a direct correlation to your previous question, Yes! it does mean a VMA, in order to have this properly explained we need to understand that an ELF format was designed for architectures that support some memory protection / memory segmentation.
you might want to give some section special permissions, instead of giving every section it's own memory protection, you'll map multiple sections into a segment and give that sole segment it's own memory protections.
This causes the need to map sections into segment, how would the OS loader know how to map each section into segment and by that give it the appropriate memory protection? by it's address.
Each section is also given an address and by those addresses / offsets / sizes they are mapped into a segment which in overall would be allocated into memory and given some memory protection rules that would apply to all sections.
The only way that the OS could know how to map these is by address so yes if the section is of a loadable type it's ADDR means VMA
( at least for modern systems that use Virtual Memory and dont abuse the elf file )
What does the PhysAddr mean?
As much as I know, PhysAddr is only relevant to old fashioned architectures in which physical addressing is relevant to user-space programs, this section should hold the actual physical address the segment would sit in, yet in most modern systems this is simply ignored...
I suggest you read this http://flint.cs.yale.edu/cs422/doc/ELF_Format.pdf,
personally back in the day when learning this, it helped me a lot and gave me a lot of knowledge regarding ELF files
hopefully I've helped you some how! :)
Why is there a need for relocation table when every element in an exe is at a relative offset from the base of the image?? I mean even if the image gets dispacled by a positive offset of say 0X60000, why is there for relocation table, as we would anyways be using RVA's which would be relative to the new base??
The point is that the code doesn't access the globals (global variables and function addresses) via RVA or whats-or-ever. They're accessed by their absolute address. And this address should be changed in case the executable was not loaded at its preferred address.
The relocation table consists exactly of those places. It's a table of all the places that should be adjusted by the difference of the actual base address and the preferred one.
BTW, EXEs, in contrast to DLLs usually don't contain relocation tables. This is because they're the first module to be mapped into the address space, hence they may always be loaded at their preferred address. The situation is different for DLLs, which usually do contains relocation tables.
P.S. In Windows 7 EXE may contain relocation table in case they prefer to be loaded at random address. It's a security feature (pitiful IMHO)
Edit:
Should be mentioned that function addresses are not always accessed by their absolute value. On x86 branching instructions (such as jmp, call and etc.) have a "short" format which works with the relative offset. Such places don't neet to be mentioned in relocation table.
For an EXE file there is no need for the relocation table because the executable image is always loaded at its preferred address. The relocation table can safely be stripped.