I am confused about PE file relocations - winapi

Wikipedia says this about relocations:
PE files normally do not contain position-independent code. Instead
they are compiled to a preferred base address, and all addresses
emitted by the compiler/linker are fixed ahead of time. If a PE file
cannot be loaded at its preferred address (because it's already taken
by something else), the operating system will rebase it. This involves
recalculating every absolute address and modifying the code to use the
new values. The loader does this by comparing the preferred and actual
load addresses, and calculating a delta value. This is then added to
the preferred address to come up with the new address of the memory
location. Base relocations are stored in a list and added, as needed,
to an existing memory location.
I am confused as to why there would be anything else at 0x00400000 (the default preferred base address) besides the image of the process itself. It is my understanding that with virtual memory, the process sees an empty address space in which it is the only thing that exists. With this in mind, how could anything already be there before the process itself is loaded?

As a matter of fact, in most cases there is no issue with the preferred base address when a process starts. In some situations, such as "Process Hollowing" (a technique where an application replaces another one in memory), the preferred base address is an important issue that must be handled. See the following link for more low-level technical details about this issue related to the preferred address.
Introduction to Process Hollowing
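For reference, the rebasing step described in the quoted text boils down to walking the .reloc section and adding the delta to every recorded address. The following is only an illustrative sketch of such a fixup loop (the function name and parameters are invented for the example; this is not the actual Windows loader), using the IMAGE_BASE_RELOCATION layout and relocation-type constants from winnt.h:

#include <windows.h>   /* IMAGE_BASE_RELOCATION, IMAGE_REL_BASED_* */
#include <stdint.h>

/* Hypothetical fixup loop: 'new_base' is where the image actually got mapped,
   'preferred_base' is the ImageBase from the optional header, and
   'reloc'/'reloc_size' describe the .reloc section of the mapped image. */
static void apply_base_relocations(uint8_t *new_base,
                                   uint64_t preferred_base,
                                   IMAGE_BASE_RELOCATION *reloc,
                                   size_t reloc_size)
{
    /* The delta between the actual and the preferred load address. */
    int64_t delta = (int64_t)((uint64_t)(uintptr_t)new_base - preferred_base);
    uint8_t *end = (uint8_t *)reloc + reloc_size;

    while ((uint8_t *)reloc < end && reloc->SizeOfBlock != 0) {
        /* Each block covers one page and is followed by WORD entries whose
           high 4 bits are the type and low 12 bits the offset in the page. */
        WORD *entry = (WORD *)(reloc + 1);
        size_t count = (reloc->SizeOfBlock - sizeof(*reloc)) / sizeof(WORD);

        for (size_t i = 0; i < count; i++) {
            uint8_t *where = new_base + reloc->VirtualAddress + (entry[i] & 0x0FFF);
            switch (entry[i] >> 12) {
            case IMAGE_REL_BASED_HIGHLOW:   /* 32-bit absolute address */
                *(uint32_t *)where += (uint32_t)delta;
                break;
            case IMAGE_REL_BASED_DIR64:     /* 64-bit absolute address */
                *(uint64_t *)where += (uint64_t)delta;
                break;
            case IMAGE_REL_BASED_ABSOLUTE:  /* padding entry, nothing to do */
                break;
            }
        }
        reloc = (IMAGE_BASE_RELOCATION *)((uint8_t *)reloc + reloc->SizeOfBlock);
    }
}

If the image does load at its preferred base, the delta is zero and the loop changes nothing, which is why rebasing usually costs nothing in the common case.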

Related

What is the primary reason for using PLT & GOT tables for shared libraries?

I'm reading Ian Lance Taylor's essay on Linkers: http://inai.de/documents/Linkers.pdf
When discussing shared objects around page 9, he mentions that since shared libraries can be loaded into a process at an unpredictable virtual address, a dynamic linker would need to process a large number of relocations once the address is known, which would slow down loading. To avoid the dynamic linker having to do this large number of relocations, the program linker turns function references into PC-relative calls into the PLT, and global/static variable references into references into the GOT. Then, at load time, the dynamic linker only needs to relocate the entries in the PLT/GOT rather than process relocations throughout the entire binary.
However, this focus on load-time optimization is confusing me, because there seems to be a much more glaring issue here, and speeding up loading is beside the point. The whole point of shared objects is that a single shared object loaded into physical memory can now be mapped into the virtual address space of each process that needs it. This can be done quickly by changing some page tables, and avoids loading a new copy of the library from disk.
So if the dynamic linker did any relocations in the main body of the shared library, these changes would appear in every other process that also has that shared library mapped, and they would break the library if it appears at a different virtual address.
And it's for this reason that we have a GOT and PLT. The program linker turns all references into position-independent references into the GOT and PLT. The dynamic linker then relocates the entries in the GOT and PLT separately for each process. The main contents of the shared library are shared across processes, but the GOT and PLT are unique to each process and are not shared.
Is this understanding of the PLT and GOT correct? I've inferred some of the mechanisms here based on my understanding, but I don't see any other way that it could work.
You appear to be missing or not understanding the concept of copy-on-write (CoW) pages.
Two processes could mmap the same file on disk at distinct virtual addresses, and the OS can use a single physical page of RAM for both mappings (that is, the processes share a single physical memory page). But as soon as one process changes the memory, a copy is made for that process, and the change does not appear in the other process (the physical memory pages are no longer shared).
So if the dynamic linker did any relocations in the main body of the shared library, these changes would appear in every other process that also has that shared library mapped,
Not if the memory is CoW.
And it's for this reason that we have a GOT and PLT
No, the reason is optimization (far fewer pages have to be copied), not correctness as your (mis)understanding implies.
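To see copy-on-write in action, here is a minimal sketch (assuming a Linux-like system; /etc/hostname is just an arbitrary readable file) in which a child process writes into a MAP_PRIVATE file mapping, much like a dynamic linker patching a page, and the parent continues to see the original bytes:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/etc/hostname", O_RDONLY);      /* any readable file */
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE, fd, 0);            /* private mapping => CoW */
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    if (fork() == 0) {                             /* child */
        p[0] = '#';                                /* this write triggers the copy */
        printf("child sees:  %c\n", p[0]);         /* prints '#' */
        _exit(0);
    }
    wait(NULL);
    printf("parent sees: %c\n", p[0]);             /* still the original byte */
    return 0;
}

The kernel gives the writing process its own physical copy of the touched page; every untouched page stays shared, which is exactly the "far fewer pages have to be copied" optimization above.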

UNIX system call to unset the reference bit of a specific page in page table?

I'm trying to count hits to a specific set of pages by hacking the reference bits in the page table. Is there any system call, or any other way, to clear reference bits (in UNIX-like systems)?
A page table is the data structure used by a virtual memory system in a computer operating system to store the mapping between virtual addresses and physical addresses. (https://en.wikipedia.org/wiki/Page_table)
In UNIX-like systems there is a bit associated with each page table entry, called the "reference" bit, which indicates whether the page has been accessed since the bit was last cleared.
The Linux kernel clears these reference bits periodically and checks them again a while later to see which pages have been accessed, in order to detect "hot" pages. But this information is very coarse-grained and imprecise, since it says nothing about the number of accesses or when they occurred.
I want to count accesses to specific pages over shorter epochs by clearing the reference bits and then checking, after a short time, whether the pages have been accessed.
Therefore, I was wondering whether there is any system call or CPU interrupt that provides a means to clear the "reference bits". Otherwise, I will need to dive deep into the kernel to see what goes on down there.
There is no API for resetting the page reference bits. Page management is a very twitchy aspect of kernel tuning, and no one wants to upset it. Of course, you could modify the kernel to suit your needs.
Instead, you might look into Valgrind, which is a debugging and profiling tool for running a single program. Ordinarily it detects subtle memory errors, such as use of a dynamically allocated memory block after it has been freed.
If you need page management information for the system as a whole, I think the most expedient solution is hacking the kernel.
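To make the clear-then-sample idea from the question concrete, here is a purely conceptual user-space sketch; the struct pte below is a toy stand-in, not a real page-table entry or any kernel interface. It only illustrates the cycle the kernel itself performs: clear every reference bit, wait an epoch, then count which pages had their bit set again by the hardware.

#include <stdbool.h>
#include <stdio.h>

#define NPAGES 8

struct pte { bool referenced; };              /* toy page-table entry */

static void clear_reference_bits(struct pte *pt, int n)
{
    for (int i = 0; i < n; i++)
        pt[i].referenced = false;             /* start a new epoch */
}

static int count_referenced(const struct pte *pt, int n)
{
    int hot = 0;
    for (int i = 0; i < n; i++)
        if (pt[i].referenced)                 /* touched during the epoch */
            hot++;
    return hot;
}

int main(void)
{
    struct pte pt[NPAGES] = {{ false }};

    clear_reference_bits(pt, NPAGES);
    /* ...in reality the MMU would set the bit on each access; here we
       just pretend pages 2 and 5 were touched during the epoch... */
    pt[2].referenced = true;
    pt[5].referenced = true;

    printf("pages touched this epoch: %d\n", count_referenced(pt, NPAGES));
    return 0;
}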

How absolute code gets generated

I have gone through the memory management concepts in Galvin's Operating System Concepts, and I came across this statement:
If you know at compile time where the process will reside in memory, then absolute code can be generated.
How does the compiler know at compile time at which location in main memory the process is going to reside?
Can someone explain what exactly it means to know at compile time where the process will reside in memory,
given that memory is allocated only when the program moves from the ready to the running state?
Generally, machine code isn't position-independent. In order to be able to load it at an arbitrary starting address and run it there, one needs some extra information about the machine code (e.g. where it contains addresses of the various parts of itself), so those addresses can be adjusted for the arbitrary position.
OTOH, if the code is always going to be loaded at the same fixed address, you don't need any of that extra information and processing.
By absolute he means fixed + final, already adjusted to the appropriate address.
The processor does not "know" anything. You "tell" it.
I don't know exactly what he means by "absolute code". Depending on which operating system you use, the program with its code and data will be loaded at a virtual address and executed from there.
Besides this, it is not the compiler but the linker that sets the address at which the program will be loaded.
Modern operating systems like Linux use Address Space Layout Randomization (ASLR) to avoid having one static address at which every program is loaded, and to make software flaws harder to exploit.
If you're writing your own operating system, the osdev.org wiki could be a good resource for you. If you can read German, I recommend lowlevel.eu as well.
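To see the contrast with absolute code for yourself, here is a tiny sketch: build it as a position-independent executable (the default on most current Linux toolchains, or explicitly with gcc -pie -fPIE) and run it several times. With ASLR active, the printed addresses change on every run, which is exactly why such code cannot have fixed absolute addresses baked in at compile time.

#include <stdio.h>

int global = 42;

int main(void)
{
    int local = 0;
    /* With ASLR and a PIE build, all three addresses differ between runs. */
    printf("main   is at %p\n", (void *)&main);
    printf("global is at %p\n", (void *)&global);
    printf("stack  is at %p\n", (void *)&local);
    return 0;
}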

Why are memory-mapped files always mapped at page boundaries?

This is my first question here; I'm not sure if it is off-topic.
While self-studying, I have found the following statement regarding Operating Systems:
Operating systems that allow memory-mapped files always require files to be mapped at page boundaries. For example, with 4-KB page, a file can be mapped in starting at virtual address 4096, but not starting at virtual address 5000.
This statement is explained in the following way:
If a file could be mapped into the middle of a page, a single virtual page would
need two partial pages on disk to map it. The first page, in particular, would
be mapped onto a scratch page and also onto a file page. Handling a page
fault for it would be a complex and expensive operation, requiring copying of
data. Also, there would be no way to trap references to unused parts of pages.
For these reasons, it is avoided.
I would like to ask for help to understand this answer. Particularly, what does it mean to say that "a single virtual page would need two partial pages on disk to map it"? From what I found about memory-mapped files, virtual pages are mapped to files on disk, and not to a paging file. Is this what is meant by "partial page"?
Also, what is meant by "scratch page" here? I've tried to look up this term on books (Tanenbaum's "Modern Operating Systems" and "Structured Computer Organization") and on the Web, but haven't found it.
First of all, when reading books and documentation, always try to look critically at what you see. Sometimes authors tend to use language like "there is no other way" just to promote the solution they are describing. Other ways are always possible.
Now to the matter. Modern operating systems keep a disk location for every allocated memory page. This makes sense: once it becomes necessary to evict a page from memory, it is already clear where to write it if it is 'dirty', or it can simply be discarded if it has not been modified. This strategy is widely accepted, although alternative policies are possible.
The disk location can be either the paging file or a memory-mapped file. The most common use of memory-mapped files is for executables and DLLs. They are (almost) never modified: if a page of code is not used for some time, it is discarded; if control comes back to it, it is reread from the file.
In the passage you quoted, they say a single virtual page "would need two partial pages on disk to map it. The first page, in particular, would be mapped onto a scratch page." They present the situation as if there were only one solution. In fact, it is possible to allocate a page in the paging file for such a combined page and handle the appropriate data copying. It is also possible to have nothing in the paging file for such a page and to assemble it from the files using a transient page. In 99% of cases the disk controller can read/write only at page boundaries, which means you would read from the first file into the memory page, read from the second file into the transient page, then copy the data out of the transient page and immediately discard it.
As you can see, it is perfectly possible to combine several files in one page; there is no problem in principle here, although the algorithms handling it would be more complex and consume more CPU cycles. Reconstructing such a page (if it is discarded) would require reading from several different files. These days 4 KB is a rather small quantity, and saving 2 KB is not a huge gain. In my opinion, weighing the benefits against the cost, the benefits are not significant enough.
Virtual address pages (on every machine I've ever heard of) are aligned on page-sized boundaries. This is simply because it makes the math very easy. On x86, the page size is 4096 bytes, which is exactly 12 bits. To figure out which virtual page an address refers to, you simply shift the address right by 12. If you were to map a disk block (assume 4096 bytes) at address 5000, it would start on page #1 (5000 >> 12 == 1) and end on page #2 (9095 >> 12 == 2).
Memory-mapped files work by mapping a chunk of virtual address space to the file, but the data is loaded on demand (indeed, the file could be far larger than physical memory and might not fit). When you first access the virtual address, if the data isn't there (i.e. it's not in physical memory), the processor will fault and the OS has to fetch the data. When you fetch the data, you need to fetch all of the data for the page, or else you wouldn't be able to clear the fault. If the addresses weren't aligned, you'd have to bring in multiple disk blocks to fill a single page. You can certainly do this, it's just messy and inefficient.
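As a small illustration of the alignment rule described above, the following sketch (assuming a Linux-like system; /bin/sh is only used as a convenient file longer than one page) first maps a file at a page-aligned fixed address and then asks for an address 904 bytes into a page, analogous to "virtual address 5000"; the kernel rejects the unaligned request with EINVAL:

#define _DEFAULT_SOURCE        /* for MAP_ANONYMOUS */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    long pg = sysconf(_SC_PAGESIZE);                /* typically 4096 */
    int fd = open("/bin/sh", O_RDONLY);

    /* Reserve a page-aligned region so MAP_FIXED has a legal target. */
    char *base = mmap(NULL, 2 * pg, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    /* Page-aligned address: accepted. */
    void *ok = mmap(base, pg, PROT_READ, MAP_PRIVATE | MAP_FIXED, fd, 0);
    printf("aligned address:   %s\n",
           ok == MAP_FAILED ? strerror(errno) : "mapped");

    /* Address 904 bytes into a page (like virtual address 5000): EINVAL. */
    void *bad = mmap(base + 904, pg, PROT_READ, MAP_PRIVATE | MAP_FIXED, fd, 0);
    printf("unaligned address: %s\n",
           bad == MAP_FAILED ? strerror(errno) : "mapped");
    return 0;
}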

freebsd shpgperproc: what is it responsible for?

I've googled a lot about what "page share factor per proc" is responsible for and found nothing. It's just interesting to me; I have no current problem with it, I'm just curious (I want to know more). In sysctl it is:
vm.pmap.shpgperproc
Thanks in advance
The first thing to note is that shpgperproc is a loader tunable, so it can only be set at boot time with an appropriate directive in loader.conf, and it's read-only after that.
The second thing to note is that it's defined in <arch>/<arch>/pmap.c, which handles the architecture-dependent portions of the vm subsystem. In particular, it's actually not present in the amd64 pmap.c - it was removed fairly recently, and I'll discuss this a bit below. However, it's present for the other architectures (i386, arm, ...), and it's used identically on each architecture; namely, it appears as follows:
void
pmap_init(void)
{
...
TUNABLE_INT_FETCH("vm.pmap.shpgperproc", &shpgperproc);
pv_entry_max = shpgperproc * maxproc + cnt.v_page_count;
and it's not used anywhere else. pmap_init() is called only once: at boot time, as part of the vm subsystem initialization. maxproc is just the maximum number of processes that can exist (i.e. kern.maxproc), and cnt.v_page_count is just the number of physical pages of memory available (i.e. vm.stats.vm.v_page_count).
A pv_entry is basically just a virtual mapping of a physical page (or, more precisely, of a struct vm_page), so if two processes share a page and both have it mapped, there will be a separate pv_entry structure for each mapping. Thus, given a page (struct vm_page) that needs to be dirtied or paged out or something else requiring a hardware page table update, the list of corresponding mapped virtual pages can easily be found by looking at the corresponding list of pv_entrys (as an example, take a look at i386/i386/pmap.c:pmap_remove_all()).
The use of pv_entrys makes certain VM operations more efficient, but the current implementation (for i386 at least) seems to allocate a static amount of space (see pv_maxchunks, which is set based on pv_entry_max) for pv_chunks, which are used to manage pv_entrys. If the kernel can't allocate a pv_entry after deallocating inactive ones, it panics.
Thus we want to set pv_entry_max based on how many pv_entrys we want space for; clearly we'll want at least as many as there are pages of RAM (which is where cnt.v_page_count comes from). Then we'll want to allow for the fact that many pages will be multiply-virtually-mapped by different processes, since a pv_entry will need to be allocated for each such mapping. Thus shpgperproc - which has a default value of 200 on all arches - is just a way to scale this. On a system where many pages will be shared among processes (say on a heavily-loaded web server running apache), it's apparently possible to run out of pv_entrys, so one will want to bump it up.
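To put some numbers on the formula above (illustrative values only): with the default shpgperproc of 200, kern.maxproc at, say, 10000, and about 1,048,576 physical pages (4 GB of RAM with 4 KB pages), pv_entry_max comes out to 200 * 10000 + 1,048,576 = 3,048,576. Since it's a loader tunable, bumping it means adding a line like the following to /boot/loader.conf and rebooting (400 here is just an arbitrary doubling of the default); sysctl can then be used to confirm the value in effect:

# /boot/loader.conf
vm.pmap.shpgperproc="400"

# after reboot
sysctl vm.pmap.shpgperproc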
I don't have a FreeBSD machine close to me right now, but it seems this parameter is defined and used in pmap.c: http://fxr.watson.org/fxr/ident?v=FREEBSD7&im=bigexcerpts&i=shpgperproc
