DOS inserting segment addresses at runtime - dos

I noticed a potential bug in some code i'm writing.
I though that if I used mov ax, seg segment_name, the program might be non-portable and only work on one machine in a specific configuration since the load location can vary from machine to machine.
So I decided to disassemble a program containing just that one instruction on two different machines running DOS and I found that the problem was magically solved.
Output of debug on machine one: 0C7A:014C B8BB0C MOV AX,0CBB
Output of debug on machine two: 06CA:014C B80B07 MOV AX,070B
After hex dumping the program I found that the unaltered bytes are actually B84200.
Manually inserting those bytes back into the program results in mov ax, 0042
So does the PE format store references to those instructions and update them at runtime?

As Peter Cordes noted, MS-DOS doesn't use the PECOFF executable format that Windows uses. It has it's own "MZ" executable format, named after the first two bytes of the executable that identify as being in this format.
The MZ format supports the use of multiple segments through a relocation table containing relocations. These relocations are just simple segment:offset values that indicate the location of 16-bit segment values that need to be adjusted based on where the executable was loaded in memory. MS-DOS performs these adjustments by simply adding the actual load segment of the program to the value contained in the executable. This means that without relocations applied the executable would only work if loaded at segment 0, which happens to be impossible.
Note this isn't just necessary for a program to work on multiple machines, it's also necessary for the same program to work reliably on the same machine. The load address can change based on what various configuration details, was well as other programs and drivers that have already been loaded in memory, so the load address of an MS-DOS executable is essentially unpredictable.
Working backwards from your example, we can tell where your example program was loaded into memory on both machines. Since 0042h was relocated into 0CBBh on the first machine and into 070Bh on the second machine, we know MS-DOS loaded your program on the two machines at segments 0C79h and 06C9h respectively:
0CBB - 0042 = 0C79
070B - 0042 = 06C9
From that we can determine that your example executable has the entry 0001:014D, or equivalent segment:offset value, in it's relocation table:
0C7A:014D - 0C79:0000 = 0001:014D
06CA:014D - 06C9:0000 = 0001:014D
This entry indicates the unrelocated location of the 16-bit immediate operand of the mov ax, seg segname instruction that needs adjusting.

Related

Can gcc be configured to compile position-independent code for the code but position-dependent code for the data?

I'm trying to build bootable code for an ARM M7-based embedded system that is able to execute in place at two different locations in the QSPI, so that if one version gets corrupted, the backup version of the image can be executed in a different place.
Compiling with -fpic seems to produce a relocatable code image that is (nearly) able to execute in both places fine. However, the problem is that the data/bss the code refers to is also getting offset by the same amount - that is, the compiler is assuming that the .data and .bss segments live immediately after the .text segment, which isn't true for XIP embedded systems (where the RAM is separate).
As a result, if the original binary was linked to run at 0x60000000 (and using a fixed ram area at 0x20000000) but is then executed in place at 0x60100000 instead , the ram addresses will be shifted by 0x100000 as well (i.e. to 0x20100000), which isn't what I want at all.
Clearly, what I'd like to do is to modify gcc's behaviour so that references to the code (executing in place in two different places in the QSPI) are position-independent, while references to the .data/bss segments (in a fixed position in RAM) are position-dependent (as per normal).
Is this something that gcc can be tweaked to achieve (e.g. by some obscure linker attribute flag)? Or is this just out of its reach? Thanks!

IDA Pro disassembly shows ? instead of hex or plain ascii in .data?

I am using IDA Pro to disassemble a Windows DLL file. At one point I have a line of code saying
mov esi, dword_xxxxxxxx
I need to know what the dword is, but double-clicking it brings me to the .data page and everything is in question marks.
How do I get the plain text that is supposed to be there?
If you see question marks in IDA, this means that there's no physical data at this location on the file (on your disk drive).
Sections in PE files have a physical size (given by the SizeOfRawData field of the section header). This physical size (on disk) might be different from the size of the section once it is mapped onto the process memory by the Windows' loader (this size is given by the VirtualSize field of the section header).
So, if the VirtualSize field is bigger than the SizeOfRawData field, a part of the section has no physical existence and it exists only in memory (once the file is mapped onto the process address space).
On most case, at program entry point, you can assume this memory is filled with 0 (but some parts of the memory might be written by the windows loader).
To get the locations where the data is being written, read or loaded you can use cross-references (xref). Here's an example :
Click on the name of the data from which you want the xref :
Then press 'x', you'll be shown all known (to ida) location where the data is used :
The second column indicates how the data is used:
r means it is read
w means it is written
o means it is loaded as a pointer

x86 segmentation, DOS, MZ file format, and disassembling

I'm disassembling "Test Drive III". It's a 1990 DOS game. The *.EXE has MZ format.
I've never dealt with segmentation or DOS, so I would be grateful if you answered some of my questions.
1) The game's system requirements mention 286 CPU, which has protected mode. As far as I know, DOS was 90% real mode software, yet some applications could enter protected mode. Can I be sure that the app uses the CPU in real mode only? IOW, is it guaranteed that the segment registers contain actual offset of the segment instead of an index to segment descriptor?
2) Said system requirements mention 1 MB of RAM. How is this amount of RAM even meant to be accessed if the uppermost 384 KB of the address space are reserved for stuff like MMIO and ROM? I've heard about UMBs (using holes in UMA to access RAM) and about HMA, but it still doesn't allow to access the whole 1 MB of physical RAM. So, was precious RAM just wasted because its physical address happened to be reserved for UMA? Or maybe the game uses some crutches like LIM EMS or XMS?
3) Is CS incremented automatically when the code crosses segment boundaries? Say, the IP reaches 0xFFFF, and what then? Does CS switch to the next segment before next instruction is executed? Same goes for SS. What happens when SP goes all the way down to 0x0000?
4) The MZ header of the executable looks like this:
signature 23117 "0x5a4d"
bytes_in_last_block 117
blocks_in_file 270
num_relocs 0
header_paragraphs 32
min_extra_paragraphs 3349
max_extra_paragraphs 65535
ss 11422
sp 128
checksum 0
ip 16
cs 8385
reloc_table_offset 30
overlay_number 0
Why does it have no relocation information? How is it even meant to run without address fixups? Or is it built as completely position-independent code consisting from program-counter-relative instructions? The game comes with a cheat utility which is also an MZ executable. Despite being much smaller (8448 bytes - so small that it fits into a single segment), it still has relocation information:
offset 1
segment 0
offset 222
segment 0
offset 272
segment 0
This allows IDA to properly disassemble the cheat's code. But the game EXE has nothing, even though it clearly has lots of far pointers.
5) Is there even such thing as 'sections' in DOS? I mean, data section, code (text) section etc? The MZ header points to the stack section, but it has no information about data section. Is data and code completely mixed in DOS programs?
6) Why even having a stack section in EXE file at all? It has nothing but zeroes. Why wasting disk space instead of just saying, "start stack from here"? Like it is done with BSS section?
7) MZ header contains information about initial values of SS and CS. What about DS? What's its initial value?
8) What does an MZ executable have after the exe data? The cheat utility has whole 3507 bytes in the end of the executable file which look like
__exitclean.__exit.__restorezero._abort.DGROUP#.__MMODEL._main._access.
_atexit._close._exit._fclose._fflush._flushall._fopen._freopen._fdopen
._fseek._ftell._printf.__fputc._fputc._fputchar.__FPUTN.__setupio._setvbuf
._tell.__MKNAME._tmpnam._write.__xfclose.__xfflush.___brk.___sbrk._brk._sbrk
.__chmod.__close._ioctl.__IOERROR._isatty._lseek.__LONGTOA._itoa._ultoa.
_ltoa._memcpy._open.__open._strcat._unlink.__VPRINTER.__write._free._malloc
._realloc.__REALCVT.DATASEG#.__Int0Vector.__Int4Vector.__Int5Vector.
__Int6Vector.__C0argc.__C0argv.__C0environ.__envLng.__envseg.__envSize
Is this some kind of debugging symbol information?
Thank you in advance for your help.
Re. 1. No, you can't be sure until you prove otherwise to yourself. One giveaway would be the presence of MOV CR0, ... in the code.
Re. 2. While marketing materials aren't to be confused with an engineering specification, there's a technical reason for this. A 286 CPU could address more than 1M of physical address space. The RAM was only "wasted" in real mode, and only if an EMM (or EMS) driver wasn't used. On 286 systems, the RAM past 640kb was usually "pushed up" to start at the 1088kb mark. The ISA and on-board peripherals' memory address space was mapped 1:1 into the 640-1024kb window. To use the RAM from the real mode needed an EMM or EMS driver. From protected mode, it was simply "there" as soon as you set up the segment descriptor correctly.
If the game actually needed the extra 384kb of RAM over the 640kb available in the real mode, it's a strong indication that it either switched to protected mode or required the services or an EMM or EMS driver.
Re. 3. I wish I remembered that. On reflection, I wish not :) Someone else please edit or answer separately. Hah, I did know it at some point in time :)
Re. 4. You say "[the code] has lots of instructions like call far ptr 18DCh:78Ch". This implies one of three things:
Protected mode is used and the segment part of the address is a selector into the segment descriptor table.
There is code there that relocates those instructions without DOS having to do it.
There is code there that forcibly relocates the game to a constant position in the address space. If the game doesn't use DOS to access on-disk files, it can remove DOS completely and take over, gaining lots of memory in the process. I don't recall whether you could exit from the game back to the command prompt. Some games where "play until you reboot".
Re. 5. The .EXE header does not "point" to any stack, there is no stack section you imply, the concept of sections doesn't exist as far as the .EXE file is concerned. The SS register value is obtained by adding the segment the executable was loaded at with the SS value from the header.
It's true that the linker can arrange sections contiguously in the .EXE file, but such sections' properties are not included in the .EXE header. They often can be reverse-engineered by inspecting the executable.
Re. 6. The SS and SP values in the .EXE header are not file pointers. The EXE file might have a part that maps to the stack, but that's entirely optional.
Re. 7. This is already asked and answered here.
Re. 8. This looks like a debug symbol list. The cheat utility was linked with the debugging information left in. You can have completely arbitrary data there - often it'd various resources (graphics, music, etc.).

8086 Assembly / MS-DOS, passing file name from the command line

Say I have PROGRAM.ASM - I have the following in the data segment:
.data
Filename db 'file.txt', 0
Fhndl dw ?
Buffer db ?
I want 'file.txt' to be dynamic I guess? Once compiled, PROGRAM.exe needs to be able to accept a file name via the command line:
c:\> PROGRAM anotherfile.txt
EXECUTION GOES HERE
How do I enable this? Thank you in advance.
DOS stores the command line in a legacy structure called the Program Segment Prefix ("PSP"). And I do mean legacy. This structure was designed to be backwards-compatible with programs ported from CP/M.
Where's the PSP?
You know how programs built as .COM files always start with ORG 100h? The reason for that is precisely that - for .COM programs - the PSP is always stored at the beginning of the code segment (at CS:0h). The PSP is 0FFh bytes long, and the actual program code starts right after that (that is, at CS:100h).
The address is also conveniently available at DS:00h and ES:00h, since the key characteristic of the .COM format is that all the segment registers start with the same value (and a COM program typically never changes them).
To read the command line from a .COM program, you can pick its length at CS:80h (or DS:80h, etc. as long as you haven't changed those registers). The Command Line starts at CS:81h and takes the rest of PSP, ending with a Carriage Return (0Dh) as a terminator, so the command line is never more than 126 bytes long.
(and that is why the command line has been 126 bytes in DOS forever, despite the fact we all wished for years it could be made longer. Since WinNT uses provides a different mechanism to access the command line, the WinNT/XP/etc. command line doesn't suffer from this size limitation).
For an .EXE program, you can't rely on CS:00h because the startup code segment can be just about anywhere in memory. However, when the program starts, DOS always stores the PSP at the base of the default data segment. So, at startup, DS:00h and ES:00h will always point to the PSP, for both .EXE and .COM programs.
If you didn't keep track of PSP address at the beginning of the program, and you change both DS and ES, you can always ask DOS to provide the segment value at any time, via INT 21h, function 62h. The segment portion of the PSP address will be returned in BX (the offset being of course 0h).

What are possible reasons for not mapping Win32 Portable Executable images at offset 0?

I've been looking into Window's PE format lately and I have noticed that in most examples,
people tend to set the ImageBase offset value in the optional header to something unreasonably high like 0x400000.
What could make it unfavorable not to map an image at offset 0x0?
First off, that's not a default of Windows or the PE file format, it is the default for the linker's /BASE option when you use it to link an EXE. The default for a DLL is 0x10000000.
Selecting /BASE:0 would be bad choice, no program can ever run at that base address. The first 64 KB of the address space is reserved and can never be mapped. Primarily to catch null pointer dereference bugs. And expanded to 64KB to catch pointer bugs in programs that started life in 16-bits and got recompiled to 32-bits.
Why 0x40000 and not 0x10000 is the default is a historical accident as well and goes back to at least Windows 95. Which reserved the first 4 megabytes of the address space for the "16-bit/MS-DOS Compatibility Arena". I don't remember much about it, Windows 9x had a very different 16-bit VM implementation from NT. You can read some more about it in this ancient KB article. It certainly isn't relevant anymore these days, a 64-bit OS will readily allocate heap memory in the space between 0x010000 and 0x400000.
There isn't any point in changing the /BASE option for an EXE. However, there's lots of point changing it for a DLL. They are much more effective if they don't overlap and thus don't have to be relocated, they won't take any space in the paging file and can be shared between processes. There's even an SDK tool for it so you can change it after building, rebase.exe
Practically, the impact of setting /BASE to 0 depends on the Address Space Layout Randomization (ASLR) setting of your image (which is also put by the Linker - /DYNAMICBASE:NO).
Should your image have /BASE:0 and ASLR is on (/DYNAMICBASE:YES), then your image will start and run because the loader will automatically load it at a "valid" address.
Should your image have /BASE:0 and ASLR is off (/DYNAMICBASE:NO), then your image will NOT start because the loader will NOT load it at the desired based address (which is, as explained above, unvalid/reserved).
If you map it to address 0 then that means the code expects to be running starting at address zero.
For the OS, address zero is NULL, which is an invalid address.
(Not "fundamentally", but for modern-day OSes, it is.)
Also, in general you don't want anything in the lower 16 MiB of memory (even virtual), for numerous reasons.
But what's the alternative? It has to be mapped somewhere, so they chose 0x400000... no particular reason for that particular address, probably. It was probably just handy.
Microsoft chose that address as the default starting address specified by the linker at which the PE file will be memory mapped. The linker assumes this address and and can optimize the executable with that assumption. When the file is memory mapped at that address the code can be run without needing to modify any internal offsets.
If for some reason the file cannot be loaded to that location (another exe/dll already loaded there) relocations will need to occur before the executable can run which will increase load times.
Lower memory addresses are usually assumed to contain low level system routines and are generally left alone. The only real requirement for the ImageBase address is that it is a multiple of 0x10000.
Recommended reading:
http://msdn.microsoft.com/en-us/library/ms809762.aspx

Resources