How does Windows execute a Win32 process? [closed] - windows

So when you open a PE (.exe) or call CreateProcess (from the Win32 API), the following procedure is followed:
The file header, the image sections, and also the DLLs which the exe links against are mapped into the process's own virtual memory.
The CPU begins execution at the program's start address (entry point).
So here comes my question: all the instructions in the PE image use addresses relative to the process's own private address space (virtual memory), which begins at 0. Also, sometimes this memory is paged out by Windows to secondary storage (HDD). How does the CPU find the real physical address in RAM? Also, how does Windows switch from one thread to another according to priority, to support multi-threading, and issue idle instructions when the CPU is not fully used? After all these discoveries I'm starting to think that the machine code stored in PE files isn't really executed directly by the CPU but instead in some Windows-managed environment. Can this be true, and if so, doesn't that slow down execution?
EDIT: OK, so the question should be rewritten as follows: "Are Windows processes executed inside some host program or directly on the CPU?". I got the answer I wanted, so the question is solved.

A complete answer would fill an entire book, but in short:
From a high-level view, finding the physical address is done by dividing the virtual address by some constant (typically 4096) to get its corresponding "page" number, and looking up that page in a table, which points to the index of the real, physical memory page, if one exists. Some or all of that may be done automatically by the CPU without anyone noticing, depending on the situation.
If a page does not exist, the OS will have to read the page from disk prior to letting the code that tried to access the page continue -- and not necessarily always into the same physical page.
In reality it's much more complex, as the table is really an entire hierarchy of tables, and in addition there is a small cache (typically around 50 entries) inside the CPU to do this task automatically for recently accessed pages, without firing an interrupt and running special kernel code.
So, depending on the situation, things might happen fully automatically and invisibly, or the OS kernel may be called, traversing an entire hierarchy of tables, and finally resorting to loading data from disk (and I haven't even considered that pages may have protections that prevent them from being accessed, or protections that will cause them being copied when written to, etc. etc.).
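As a concrete illustration of the page-number arithmetic and the table hierarchy, here is a minimal C sketch, assuming the standard x86-64 4-level layout with 4096-byte pages (the example address is arbitrary):

```c
#include <stdint.h>
#include <stdio.h>

/* Split a virtual address into the indices a 4-level x86-64 page walk uses.
   Each table index is 9 bits; the page offset is the low 12 bits (4096-byte pages). */
int main(void)
{
    uint64_t vaddr = 0x00007ffdeadbeef0ULL;   /* arbitrary example address */

    uint64_t offset = vaddr & 0xFFF;          /* byte within the 4 KiB page   */
    uint64_t pt     = (vaddr >> 12) & 0x1FF;  /* page-table index             */
    uint64_t pd     = (vaddr >> 21) & 0x1FF;  /* page-directory index         */
    uint64_t pdpt   = (vaddr >> 30) & 0x1FF;  /* page-directory-pointer index */
    uint64_t pml4   = (vaddr >> 39) & 0x1FF;  /* top-level (PML4) index       */

    printf("PML4=%llu PDPT=%llu PD=%llu PT=%llu offset=%llu\n",
           (unsigned long long)pml4, (unsigned long long)pdpt,
           (unsigned long long)pd, (unsigned long long)pt,
           (unsigned long long)offset);
    return 0;
}
```

Each of those indices selects an entry in one level of the table hierarchy mentioned above; the final entry (if present) gives the physical page, to which the 12-bit offset is added.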
Multi-threading is "relatively simple" in comparison. It's done by having a timer periodically fire an interrupt every so often (under Windows typically around 16 milliseconds, but this can be adjusted), and running some code (the "scheduler") inside the interrupt handler which decides whether to return to the current thread or to switch to another thread's context and run that one instead.
In the particular case of Windows, the scheduler will always run the highest-priority tasks first, and only consider lower-priority tasks when no non-blocked higher-priority tasks are left.
If no other tasks are running, the idle task (which has the lowest priority) runs. The idle task may perform tasks such as zeroing reclaimed memory pages "for free", or it may throttle down the CPU (or both).
Further, when a thread blocks (e.g. when reading a file or a socket), the scheduler runs even without a timer interrupt. This ensures that the CPU can be used for something useful during the time the blocked thread can't do anything.
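As a rough illustration of that priority-plus-blocking behaviour, here is a small Win32 sketch (the workload is hypothetical; the APIs are the standard CreateThread, SetThreadPriority, Sleep and WaitForSingleObject calls). A thread that sleeps has entered the waiting state, so the scheduler is free to run whatever ready thread has the highest priority:

```c
#include <windows.h>
#include <stdio.h>

/* Worker that "blocks" by sleeping; while it sleeps the scheduler
   is free to run any other ready thread. */
static DWORD WINAPI worker(LPVOID arg)
{
    (void)arg;
    printf("worker: doing a little work, then blocking\n");
    Sleep(1000);                      /* thread enters the waiting state */
    printf("worker: woke up\n");
    return 0;
}

int main(void)
{
    HANDLE h = CreateThread(NULL, 0, worker, NULL, 0, NULL);
    if (h == NULL)
        return 1;

    /* Hint to the scheduler that this thread is less important;
       ready higher-priority threads will be picked first. */
    SetThreadPriority(h, THREAD_PRIORITY_BELOW_NORMAL);

    WaitForSingleObject(h, INFINITE); /* main thread blocks too */
    CloseHandle(h);
    return 0;
}
```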

Related

Is there any scenario when we would resort to a process instead of a goroutine? [closed]

I understand that goroutines are very lightweight and we can spawn thousands of them, but I want to know if there is some scenario when we should spawn a process instead of a goroutine (like hitting some kind of process boundary in terms of resources or something else). Can spawning a new process in some scenario be beneficial in terms of resource utilization or some other dimension?
To get things started, here are three reasons. I'm sure there are more.
Reason #1
In a perfect world, CPUs would be busy doing the most important work they can (and not wasted doing the less important work while more important work waits).
To do this, whatever controls what work a CPU does (the scheduler) has to know how important each piece of work is. This is normally done with (e.g.) thread priorities. When there are 2 or more processes that are isolated from each other, whatever controls what work a CPU does can't be part of either process. Otherwise you get a situation where one process is consuming CPU time doing unimportant work because it can't know that there's a different process that wants the CPU for more important work.
This is why things like "goroutines" are broken (inferior to plain old threads). They simply can't do the right thing (unless there's never more than one process that wants CPU time).
Processes (combined with "process priorities") can fix that problem (while adding multiple other problems).
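To make the "process priorities" point concrete, here is a minimal POSIX C sketch (not Go, and purely illustrative): because both processes register their priority with the kernel's scheduler, the more important one is favoured even though neither process knows about the other's work.

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/resource.h>
#include <sys/wait.h>

int main(void)
{
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return 1;
    }

    if (pid == 0) {
        /* Child: lower its priority (higher nice value) so the kernel
           scheduler favours more important processes over it. */
        if (setpriority(PRIO_PROCESS, 0, 10) != 0)
            perror("setpriority");
        /* ... do the unimportant background work here ... */
        _exit(0);
    }

    /* Parent keeps the default priority and does the important work. */
    waitpid(pid, NULL, 0);
    return 0;
}
```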
Reason #2
In a perfect world, software would never crash. The reality is that sometimes processes do crash (and sometimes the reason has nothing to do with software, e.g. a hardware flaw). Specifically, when one process crashes, there's often no sane way to tell how much damage was done within that process, so the entire process typically gets terminated. To deal with this problem, people use some form of redundancy (multiple redundant processes).
Reason #3
In a perfect world, all CPUs and all memory would be equal. In reality things don't scale up like that, so you get things like ccNUMA, where a CPU can access memory in the same NUMA domain quickly, but the same CPU can't access memory in a different NUMA domain as quickly. To cope with that, ideally (when allocating memory) you'd want to tell the OS "this memory needs low latency more than bandwidth" (and the OS would allocate memory from the fastest/closest NUMA domain only) or you'd tell the OS "this memory needs high bandwidth more than low latency" (and the OS would allocate memory from all NUMA domains). Sadly, every language I've ever seen has "retro joke memory management" (without any kind of "bandwidth vs. latency vs. security" hints), which means that the only control you get is the choice between "one process spread across all NUMA domains" vs. "one process for each NUMA domain".
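For what it's worth, on Linux the libnuma C library exposes roughly this kind of hint at the allocation level. A minimal sketch (assumes libnuma is installed; link with -lnuma):

```c
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not supported on this system\n");
        return 1;
    }

    size_t len = 64 * 1024 * 1024;

    /* "Latency matters": allocate pages on the local NUMA node only. */
    void *local = numa_alloc_local(len);

    /* "Bandwidth matters": interleave pages across all NUMA nodes. */
    void *spread = numa_alloc_interleaved(len);

    printf("local=%p interleaved=%p\n", local, spread);

    numa_free(local, len);
    numa_free(spread, len);
    return 0;
}
```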

Does C Code enjoy the Go GC's fragmentation prevention strategies? [closed]

Corrected the false implications:
Golang's GC uses fragmentation-prevention strategies, which enables a program to run for a very long time (if not forever).
But it seems C code (cgo or SWIG) has no means of getting benefit from these strategies.
Is that true? Won't C code benefit from Golang's fragmentation prevention, and will it eventually become fragmented?
If that's false, how?
Also, what happens to any DLL code loaded by C code (e.g. Windows DLLs)?
(The question is updated to correct my wrong assumptions)
I'm afraid you might be confusing things on multiple levels here.
First, calling into C from production-grade Go code is usually a no-go right from the start: it is slow, roughly as slow as making a system call, because for the most part it really works like one: you need to switch from the Go stack to the C stack, and the OS thread that happened to be executing the Go code which made the cgo call stays locked to that call even if something on the C side blocks.
That is not to say you must avoid calling out to C, but it means you need to think this through up front and measure. Maybe set up a pool of worker goroutines onto which to fan out the tasks which need to make C calls.
Second, your memory concerns might be well unfounded; let me explain.
Fragmenting virtual memory should be a non-issue on contemporary systems
usually used to run Go programs (I mean amd64 and the like).
That is pretty much because allocating virtual memory does not force the OS
to actually allocate physical memory pages — the latter happens only
when the virtual memory gets used (that is, accessed at an address
happening to point into an allocated virtual memory region).
So, whether you want it or not, you do have that physical memory fragmentation problem
anyway, and it is getting sorted out
at the OS and CPU level using multiple-layered address translation
tables (and TLB-caches).
Third, you appear to be falling into a common trap of speculating about
how things will perform under load instead of writing a highly simplified
model program and inspecting how it behaves under the estimated production
load. That is, you think a problem with allocating C memory will occur
and then fancy the whole thing will not work.
I would say your worries are unfounded — given the amount of production
code written in C and C++ and working under hardcore loads.
And finally, C and C++ programmers trod the pathways to high-performance
memory management a long time ago. A typical solution is using custom
pool allocators for the objects which exhibit the most
allocation/deallocation churn under the typical load. With this approach,
the memory allocated on your C side is mostly stable for the lifetime
of your program.
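A minimal sketch of such a pool allocator (illustrative only; a production version would add alignment guarantees, growth, and thread safety):

```c
#include <stddef.h>
#include <stdlib.h>

/* A trivial fixed-size-object pool: one malloc up front, then
   constant-time alloc/free from a free list, no fragmentation churn. */
typedef struct node { struct node *next; } node;

typedef struct {
    void *block;     /* the single backing allocation           */
    node *free_list; /* singly linked list of free slots        */
} pool;

static int pool_init(pool *p, size_t obj_size, size_t count)
{
    if (obj_size < sizeof(node)) obj_size = sizeof(node);
    p->block = malloc(obj_size * count);
    if (!p->block) return -1;
    p->free_list = NULL;
    for (size_t i = 0; i < count; i++) {
        node *n = (node *)((char *)p->block + i * obj_size);
        n->next = p->free_list;
        p->free_list = n;
    }
    return 0;
}

static void *pool_alloc(pool *p)
{
    node *n = p->free_list;          /* NULL if the pool is exhausted */
    if (n) p->free_list = n->next;
    return n;
}

static void pool_free(pool *p, void *ptr)
{
    node *n = ptr;                   /* slot goes back on the free list */
    n->next = p->free_list;
    p->free_list = n;
}

static void pool_destroy(pool *p)
{
    free(p->block);
    p->block = NULL;
    p->free_list = NULL;
}
```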
TL;DR
Write a model program, put the estimated load on it and see how it behaves.
Then analyze what the problems with the memory are, if any, and
only then start attacking them.
It depends.
The memory which the C code needs can be allocated by Go and its pointer passed to the C code. In that case, the C code will benefit from Go's fragmentation-prevention strategies.
The same goes for DLL code, so if DLL functions don't allocate their working memory on their own, this can be done for them as well.

Control Block Processes

Whenever a process is moved into the waiting state, I understand that the CPU moves on to another process. But while a process is in the waiting state, if it still needs to make a request to another I/O resource, doesn't that computation require processing? I'm assuming there is a small part of the processor that is dedicated to helping with the computation of the I/O request, moving data back and forth?
I hope this question makes sense lol.
IO operations are actually tasks for peripheral devices to do some work. Usually you set up the task by writing data to special areas of memory which belong to the device. The device monitors changes in that small area and starts executing the task. So the CPU does not need to do anything while the operation is in progress and can switch to another program. When the IO is completed, usually an interrupt is triggered. This is a special hardware mechanism which pauses the currently executing program at an arbitrary place and switches to a special subprogram, which decides what to do next. There can be other designs; for example, the device may set a special flag somewhere in its memory region and the OS must check it from time to time.
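A schematic C fragment of the "write a command, then wait for a flag" style described above (the register addresses and names are made up for illustration; real values come from the device's documentation, and on a hosted OS you would never touch such addresses directly):

```c
#include <stdint.h>

/* Hypothetical device registers exposed at fixed addresses.
   'volatile' stops the compiler from caching or reordering the accesses. */
#define DEV_CMD     ((volatile uint32_t *)0x40001000u)  /* command register */
#define DEV_STATUS  ((volatile uint32_t *)0x40001004u)  /* status register  */
#define STATUS_DONE 0x1u

static void device_start_and_wait(uint32_t command)
{
    *DEV_CMD = command;                 /* hand the task to the device */

    /* Polling variant: the CPU spins until the device sets the flag.
       An interrupt-driven design would let the CPU run other code and
       be notified by the device instead. */
    while ((*DEV_STATUS & STATUS_DONE) == 0) {
        /* busy-wait */
    }
}
```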
The problem is that these IO operations are usually quite small, such as sending 1 byte over a COM port, so the CPU would have to be interrupted too often. You can't achieve high speed with them. Here is where DMA comes in handy. This is a special coprocessor (or part of the peripheral device) which has direct access to RAM and can feed big blocks of memory into devices. So it can process megabytes of data without interrupting the CPU.

Same addresses pointing to different values - fork system call

When a fork is called, the stack and heap are both copied from the parent process to the child process. Before using the fork system call, I malloc() some memory; let's say its address was A. After using the fork system call, I print the address of this memory in both parent and child processes. I see both are printing the same address: A. The child and parent processes are capable of writing any value to this address independently, and modification by one process is not reflected in the other process. To my knowledge, addresses are globally unique within a machine.
My question is: Why is it that the same address location A stores different values at the same time, even though the heap is copied?
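A minimal POSIX C sketch of the experiment described above:

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    int *a = malloc(sizeof *a);   /* address "A" in the question */
    *a = 1;

    pid_t pid = fork();
    if (pid == 0) {
        *a = 42;                  /* child writes its own copy */
        printf("child : addr=%p value=%d\n", (void *)a, *a);
        _exit(0);
    }

    wait(NULL);
    /* Same virtual address prints in both, but the parent's value is
       untouched because the processes' pages are separate after the fork. */
    printf("parent: addr=%p value=%d\n", (void *)a, *a);
    free(a);
    return 0;
}
```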
There is a difference between the "real" memory address, and the memory address you usually work with, i.e. the "virtual" memory address. Virtual memory is basically just an abstraction from the Operating System in order to manage different pages, which allows the OS to switch pages from RAM into HDD (page file) and vice versa.
This allows the OS to continue operating even when RAM capacity has been reached, and to put the relevant page into an arbitrary location inside RAM without changing your program's logic (otherwise, a pointer pointing to 0x1234 would suddenly point to 0x4321 after a page switch had occurred).
What happens when you fork your process is basically just a copy of the parent's page tables, which, I assume, allows for smarter algorithms to take place, such as copying a page only when one of the processes actually modifies it (copy-on-write).
One important aspect to mention is that forking should not change any memory addresses, since (e.g. in C) there can be quite a bit of pointer logic in your application, relying on the consistency of the memory you allocated. If the addresses were to suddenly change after forking, it would break most, if not all, of this pointer logic.
You can read more on this here: http://en.wikipedia.org/wiki/Virtual_memory or, if you're truly interested, I recommend reading "Operating Systems - Internals and Design Principles" by William Stallings, which should cover most things, including why and how virtual memory is used.

Memory mapping of files vs CreateFile/ReadFile [closed]

What are the drawbacks (if any) of using a memory-mapped file to read (regular-sized) files over doing the same using the CreateFile/ReadFile combination?
With ReadFile/WriteFile you have deterministic error handling semantics. When you use memory mapped files, errors are returned by throwing an exception.
In addition, if the memory mapped file has to hit the disk (or even worse, the network) your memory read may take several seconds (or even minutes) to complete. Depending on your application, this can cause unexpected stalls.
If you use ReadFile/WriteFile you can use asynchronous variants of the API to allow you to control this behavior.
You also have more deterministic performance if you use ReadFile, especially if your I/O pattern is predictable: memory-mapped I/O is often random, whereas ReadFile is almost always sequential (since ReadFile reads at the current file position and advances the current file position).
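For comparison, a bare-bones sketch of both approaches ("data.bin" is a placeholder file name and error handling is trimmed). Note that ReadFile reports failure through its return value, while touching the mapped view is where any disk I/O, and any failure, actually happens:

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* 1) Classic CreateFile/ReadFile: explicit, sequential, errors via return value. */
    HANDLE f = CreateFileA("data.bin", GENERIC_READ, FILE_SHARE_READ,
                           NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (f == INVALID_HANDLE_VALUE) return 1;

    char buf[4096];
    DWORD read = 0;
    if (!ReadFile(f, buf, sizeof buf, &read, NULL)) {
        printf("ReadFile failed: %lu\n", GetLastError());
    }

    /* 2) Memory mapping: the file contents appear as ordinary memory;
       pages are faulted in when first touched. */
    HANDLE map = CreateFileMappingA(f, NULL, PAGE_READONLY, 0, 0, NULL);
    if (map != NULL) {
        const char *view = MapViewOfFile(map, FILE_MAP_READ, 0, 0, 0);
        if (view != NULL) {
            char first = view[0];     /* may trigger disk I/O right here */
            printf("first byte: %d\n", first);
            UnmapViewOfFile(view);
        }
        CloseHandle(map);
    }

    CloseHandle(f);
    return 0;
}
```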
A big advantage of file mapping is that it doesn't influence the system cache. If your application does excessive I/O by means of ReadFile, your system cache will grow, consuming more and more physical memory. If your OS is 32-bit and you have much more than 1 GB of memory, then you're lucky, since on 32-bit Windows the size of the system cache is limited to 1 GB. Otherwise the system cache will consume all available physical memory, and the memory manager will soon start purging pages of other processes to disk, intensifying disk operations instead of actually lessening them. The effect is especially noticeable on 64-bit Windows, where the cache size is limited only by available physical memory. File mapping, on the other hand, doesn't lead to overgrowth of the system cache and at the same time doesn't degrade performance.
You'll need more complex code for establishing the file mapping than for just opening and reading the file. File mapping is intended for random access to a section of a file. If you don't need that, just don't bother with file mapping.
Also, if you ever need to port your code to another platform, you'll do it much more easily and quickly if you don't use file mapping.
