This image and others like it have been bothering me for a while now. When I use malloc, the allocation should be part of the dynamic data, the heap. However, the heap seems to be bounded from above by the stack, which seems very inefficient. A program cannot predict how much memory I plan on allocating, so how does it judge how far up to put the stack? It seems as though all of the memory in the middle is wasted, and I would like to know how this works for programs that could range from a small service program that doesn't use any dynamic memory to a video game that could potentially allocate huge sections of memory.
Say, for example, I open up Microsoft Paint. If I paste a high-resolution picture into it, the memory allocation of Paint skyrockets. Where was this memory taken from? What I would truly like is a snapshot of my entire RAM stick labelled as above, to visualize how the many programs on a computer partition the computer's memory as a whole, but I can only find diagrams like this one for a single process and a single section of RAM.
Your picture is not of the RAM, but of the virtual address space of some process. The kernel configures the MMU to manage virtual memory, provide each process with its virtual address space, handle paging, and manage the page cache.
BTW, it is not the compiler which grows the stack (so your picture is wrong). The compiler generates machine code which may push or pop things on the call stack. For the malloc-allocated heap, the C standard library implementation provides the malloc function, built on top of operating system primitives or system calls that allocate pages of virtual memory (e.g. mmap(2) on Linux).
On Linux, a process can ask for its address space to be changed with mmap(2), and also munmap and mprotect. When a program is started with execve(2), the kernel sets up its initial address space. See also /proc/ (see proc(5) and try cat /proc/$$/maps....). BTW, mmap is often used to implement malloc(3) and dlopen(3) (runtime loading of plugins), both heavily used in the RefPerSys project (a free software artificial intelligence project for Linux).
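For illustration, here is a minimal, hedged sketch of how an allocator can obtain a page of anonymous virtual memory directly from the kernel with mmap(2) (real malloc implementations are far more involved; flags and error handling are simplified):

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 4096;   /* one page on most Linux configurations */
    /* ask the kernel to add an anonymous, private page to our address space */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    ((char *)p)[0] = 42;   /* the page is now usable memory */
    munmap(p, len);        /* remove it from the address space again */
    return 0;
}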
Since most Linux systems are open source, I suggest diving into the implementation details by downloading and then looking inside the source code of GNU libc or musl-libc: both implement malloc and dlopen on top of mmap and other syscalls(2).
Windows has similar facilities, but I don't know Windows; refer to the documentation of the WinAPI.
Read also Operating Systems: Three Easy Pieces, and, if you code in C, some good book such as Modern C and some C reference website. Be sure to read the documentation of your C compiler (e.g. GCC). See also the OSDEV website.
Be aware that modern C compilers are permitted to make extensive optimizations. Read what every C programmer should know about undefined behavior. See also this draft report.
In modern systems, a technique called virtual memory is used to give each program its own memory space. The heap and the stack sit at specific locations in virtual memory. The kernel then takes care of mapping locations between virtual memory and physical memory. Your program may have a chunk of memory allocated at 0x80000000, but that chunk might actually be stored at physical location 0x49BA5400. Your actual RAM stick would be a jumbled mess of chunks of all of those sections in seemingly random locations.
I am studying memory management. In particular, I am studying the MMU and the mapping between a process's logical address space pages and the RAM frames.
My question is: what about low-end embedded systems? If I'm correct, an MMU can't be used in these systems due to their smaller memory. So how can computers with less memory available avoid the problem of shared memory between processes?
For embedded systems, the kind of MMU you speak of is only present in high-end microcontrollers like PowerPC or Cortex-A.
Low-end to mid-range microcontrollers do often have some simpler form of MMU, though. Not as advanced as one used to create virtual memory sections, but a simpler kind which allows remapping of RAM, flash, registers and so on. Similarly, they often have various mechanisms for protecting certain parts of memory from accidental writes. They may or may not be smart enough to make an "MMU-like" detection that code is executing from data memory or that data access is happening in code memory. Harvard vs von Neumann architecture also matters here.
As for multiple processes in an RTOS, it can't be compared with multiple processes on a desktop computer. Each process in an RTOS typically gets its own stack, but that's about it - the MMU isn't involved in that; it's handled by the RTOS. Code in embedded systems is typically executed directly from flash, so it doesn't make sense to assign chunks of RAM for executable code like on a PC. Several processes will simply execute code from flash, and it might be the same code or different code between processes depending on whether they share common code or not.
Similarly, it is senseless to use heap allocation in embedded systems (see Why should I not use dynamic memory allocation in embedded systems?), so we don't need to create a RAM image for that purpose either. The only things left as unique per process are the stack and separate parts of .data/.bss.
When the operating system loads a program into main memory, it also attaches the static data along with the stack and heap memory. I googled what is present in the static data, and the answers said it contains the global variables and static variables. But I am confused: both of these are already present in the text file of the program, so why do we add them separately?
The data in the executable is often referred to as the data segment. The CPU doesn't interact with the hard disk, only with RAM. The data segment must thus be loaded into RAM before the CPU can access it. The file of the executable is not really a text file. It is an executable, so it has a different extension. Text files usually refer to actual files with a .txt extension.
With that said, you also asked another question not long ago (If the amount of stack memory provided to a program is fixed then why does it grow downwards in the process architecture? Or am I getting it wrong?) so I will try to give some insight for both of these in this same answer.
I don't know much about caching and low-level inner CPU workings but, today, the CPU mostly doesn't even operate on RAM directly. It will load a bunch of RAM chunks into the cache, operate on them, and keep RAM-cache consistency by implementing complex mechanisms. The OS also has its role to play in RAM-cache consistency but, like I said, I am far from an expert here. Other than that, caching is mostly transparent to the OS. The CPU handles it, and the OS simply provides instructions to the CPU, which executes them.
Today, you have paging used by most OSes and implemented on most CPU architectures. With paging, every process sees a full contiguous virtual address space. The virtual address space is accessed contiguously, and the hardware MMU translates those addresses to physical ones automatically by walking the page tables. The OS is responsible for making sure the page tables are consistent, and the MMU does the rest of the job (for more info read: What is paging exactly? OSDEV). If you understand paging well, things become much clearer.
For a process, there are mostly three types of memory: the stack (often called automatic storage), the heap, and the static/global data. I will attempt to give some detail on each of these to paint a global picture.
The stack is given a maximum size when the process begins. The OS handles that, creates the page tables, and places the proper address in the stack pointer register so that stack accesses reach the proper region of physical memory. The stack is automatic storage, which means that it isn't handled manually by the high-level programmer. For example, in C/C++, the stack is managed by the compiler which, at the entry of a function, will create a stack frame and place offsets from the stack base pointer in the instructions. Every local variable (within a function) will be accessed with a negative offset relative to the stack base pointer. What the compiler needs to do is create a stack frame of the proper size so that there will be enough room for all local variables of a particular function (for more info on the stack see: Each program allocates a fixed stack size? Who defines the amount of stack memory for each application running?).
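As a hedged illustration (the exact layout depends on the compiler, the architecture, and the optimization level), consider how a compiler might place the locals of a small C function at fixed offsets from the frame pointer:

int sum3(int a, int b, int c)
{
    /* On x86-64 with frame pointers enabled, a compiler typically emits a
       prologue like "push rbp; mov rbp, rsp; sub rsp, 16" and then refers
       to these locals as [rbp-4], [rbp-8], [rbp-12] (offsets illustrative). */
    int x = a + b;
    int y = x + c;
    return x + y;
}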
For the heap, the OS reserves a very big amount of virtual memory. Today, virtual memory is very big (2^48 bytes or more). The amount of heap available to each process is often limited only by the amount of physical memory available to back virtual memory allocations. For example, a process could use malloc() to allocate 4KB of memory in C. The libc library, which is an implementation of the C standard library, will make a system call into the OS. The OS will then reserve a page of the virtual memory available for the heap and change the page tables so that accessing that portion of virtual memory translates to somewhere in RAM (probably somewhere no other process was already using).
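To see this in practice on Linux, one hedged way (assuming a glibc-based system, where large requests are typically satisfied with a fresh anonymous mmap mapping) is to dump /proc/self/maps before and after a big allocation and compare the output:

#include <stdio.h>
#include <stdlib.h>

static void dump_maps(const char *tag)
{
    printf("---- %s ----\n", tag);
    /* /proc/self/maps lists every mapping in this process's address space */
    FILE *f = fopen("/proc/self/maps", "r");
    if (!f)
        return;
    int c;
    while ((c = fgetc(f)) != EOF)
        putchar(c);
    fclose(f);
}

int main(void)
{
    dump_maps("before");
    void *p = malloc(16 * 1024 * 1024);   /* 16 MiB: large enough that glibc
                                             will likely create a new mapping */
    dump_maps("after");
    free(p);
    return 0;
}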
The static/global data are simply placed in the executable's data segment. The data segment is loaded into virtual memory alongside the text segment. The text segment will thus be able to access this data, often using RIP-relative addressing.
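As a small hedged illustration (x86-64, AT&T syntax; the exact instructions depend on the compiler, its options, and whether the code is position independent):

int counter = 42;   /* ends up in the data segment */

int read_counter(void)
{
    /* a typical x86-64 compiler emits something like
       "movl counter(%rip), %eax": the global is addressed
       relative to the instruction pointer (RIP-relative) */
    return counter;
}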
I have a sequential user space program (some kind of memory intensive search data structure). The program's performance, measured as number of CPU cycles, depends on memory layout of the underlying data structures and data cache size (LLC).
So far my user space program is tuned to death; now I am wondering if I can get a performance gain by moving the user space code into the kernel (as a kernel module). I can think of the following factors that might improve the performance in kernel space...
No system call overhead (how many CPU cycles are gained per system call?). This is less critical as I barely use any system calls in my program, except for allocating memory, and that only when the program starts.
Control over scheduling, I can create a kernel thread and make it run on a given core without being thrown away.
I can use kmalloc for memory allocation and thus have more control over the allocated memory; I may also be able to control cache coloring more precisely by controlling the memory allocated. Is it worth trying?
My questions to the kernel experts...
Have I missed any factors in the above list that can improve performance further?
Is it worth trying, or is it already known that I will NOT get much performance improvement?
If a performance gain is possible in the kernel, is there any estimate of how much it could be (any theoretical guess)?
Thanks.
Regarding point 1: kernel threads can still be preempted, so unless you're making lots of syscalls (which you aren't) this won't buy you much.
Regarding point 2: you can pin a thread to a specific core by setting its affinity, using sched_setaffinity() on Linux.
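For example, here is a hedged sketch of pinning the calling process to logical CPU 0 on Linux (the CPU_* macros and sched_setaffinity() are Linux/glibc-specific):

#define _GNU_SOURCE        /* for cpu_set_t, CPU_* macros, sched_setaffinity() */
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                                  /* only logical CPU 0 */
    if (sched_setaffinity(0, sizeof set, &set) != 0)   /* 0 = this process   */
        perror("sched_setaffinity");
    /* ... run the memory-intensive workload here ... */
    return 0;
}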
Regarding point 3: What extra control are you expecting? You can already allocate page-aligned memory from user space using mmap(). This already lets you control for the cache's set associativity, and you can use inline assembly or compiler intrinsics for any manual prefetching hints or non-temporal writes. The main difference between memory allocated in the kernel and in user space is that kmalloc() allocates wired (non-pageable) memory. I don't see how this would help.
I suspect you'll see much better ROI on parallelising using SIMD, multithreading or making further algorithmic or memory optimisations.
Create a dedicated cpuset for your program and move all other processes out of it. Then bump your process's priority to real-time with the FIFO scheduling policy using something like:
#include <sched.h>   /* sched_setscheduler(), sched_get_priority_max() */

struct sched_param schedparams;
// Be portable - don't just set priority to 99 :)
schedparams.sched_priority = sched_get_priority_max(SCHED_FIFO);
sched_setscheduler(0, SCHED_FIFO, &schedparams);   // 0 = the calling process
Don't do that on a single-core system!
Reserve large enough stack space with alloca(3) and touch all of the allocated stack memory, map more than enough heap space and then use mlock(2) or mlockall(2) to pin process memory.
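A hedged sketch of the locking step (Linux; it needs sufficient privileges or a large enough RLIMIT_MEMLOCK):

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    /* lock everything mapped now and everything mapped in the future,
       so the pager never evicts these pages mid-run */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        perror("mlockall");
    /* ... touch the stack, map and touch the heap, then run the workload ... */
    return 0;
}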
Even if your program is a sequential one, if run on a multisocket Nehalem or post-Nehalem Intel system or an AMD64 system, NUMA effects can slow your program down. Use API functions from numa(3) to allocate and keep memory as close to the NUMA node where your program executes as possible.
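A hedged sketch using libnuma (link with -lnuma; see numa(3) for the full API):

#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    size_t len = 64UL * 1024 * 1024;
    /* allocate 64 MiB on the NUMA node the calling thread is running on */
    void *buf = numa_alloc_local(len);
    /* ... build and query the search structure in buf ... */
    numa_free(buf, len);
    return 0;
}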
Try other compilers - some of them might optimise better than the compiler that you are currently using. Intel's compiler, for example, is very aggressive at laying out instructions so as to benefit from out-of-order execution, pipelining and branch prediction.
I've seen this question asked many times, but I couldn't find a reasonable answer. What is actually the limit of virtual memory?
Is it the maximum addressable size of the CPU? For example, if the CPU is 32-bit, is the maximum 4G?
Some texts also relate it to the hard disk area, but I couldn't find that to be a good explanation. Some say it's the CPU-generated address.
Are all the addresses we see virtual addresses? For example, the memory locations we see when debugging a program using GDB.
What is the historical reason behind the CPU generating virtual addresses? Some texts use virtual address and logical address interchangeably. How do they differ?
Unfortunately, the answer is "it depends". You didn't mention an operating system, but you implied Linux when you mentioned GDB. I will try to be completely general in my answer.
There are basically three different "address spaces".
The first is logical address space. This is the range of a pointer. Modern CPUs (386 or better) have memory management units that allow an operating system to make your actual (physical) memory appear at arbitrary addresses. For a typical desktop machine, this is done in 4KB chunks. When a program accesses memory at some address, the CPU will look up which physical address corresponds to that logical address, and cache that in a TLB (translation lookaside buffer). This allows three things: first, it allows an operating system to give each process as much address space as it likes (up to the entire range of a pointer - or beyond if there are APIs to allow programs to map/unmap sections of their address space). Second, it allows it to isolate different programs entirely, by switching to a different memory mapping, making it impossible for one program to corrupt the memory of another program. Third, it provides developers with a debugging aid - random corrupt pointers may point to some address that hasn't been mapped at all, leading to "segmentation fault" or "invalid page fault" or whatever; terminology varies by OS.
The second address space is physical memory. It is simply your RAM - you have a finite quantity of RAM. There may also be hardware that has memory mapped I/O - devices that LOOK like RAM, but it's really some hardware device like a PCI card, or perhaps memory on a video card, etc.
The third type of address is virtual address space. If you have less physical memory (RAM) than the programs need, the operating system can simulate having more RAM by giving the program the illusion of having a large amount of RAM, with only a portion of that actually being RAM and the rest being in a "swap file". For example, say your machine has 2MB of RAM and a program allocates 4MB. What would happen is that the operating system would reserve 4MB of address space. The operating system will try to keep the most recently/frequently accessed pieces of that 4MB in actual RAM. Any sections that are not frequently/recently accessed are copied to the "swap file". Now if the program touches a part of that 4MB that isn't actually in memory, the CPU will generate a "page fault". The operating system will find some physical memory that hasn't been accessed recently and "page in" that page. It might have to write the content of that memory page out to the page file before it can page in the data being accessed. This is why it is called a swap file - typically, when it reads something in from the swap file, it probably has to write something out first, effectively swapping something in memory with something on disk.
Typical MMU (memory management unit) hardware keeps track of what addresses are accessed (i.e. read), and modified (i.e. written). Typical paging implementations will often leave the data on disk when it is paged in. This allows it to "discard" a page if it hasn't been modified, avoiding writing out the page when swapping. Typical operating systems will periodically scan the page tables and keep some kind of data structure that allows it to intelligently and quickly choose what piece of physical memory has not been modified, and over time builds up information about what parts of memory change often and what parts don't.
Typical operating systems will often gently page out pages that don't change often (gently because they don't want to generate too much disk I/O which would interfere with your actual work). This allows it to instantly discard a page when a swapping operation needs memory.
Typical operating systems will try to use all the "unused" memory space to "cache" (keep a copy of) pieces of files that are accessed. Memory is thousands of times faster than disk, so if something gets read often, having it in RAM is drastically faster. Typically, a virtual memory implementation will be coupled with this "disk cache" as a source of memory that can be quickly reclaimed for a swapping operation.
Writing an effective virtual memory manager is extremely difficult. It needs to dynamically adapt to changing needs.
Typical virtual memory implementations feel awfully slow. When a machine starts to use far more memory than it has RAM, overall performance gets really, really bad.
When my task manager (top, ps, taskmgr.exe, or Finder) says that a process is using XXX KB of memory, what exactly is it counting, and how does it get updated?
In terms of memory allocation, does an application written in C++ "appear" different to an operating system from an application that runs as a virtual machine (managed code like .NET or Java)?
And finally, if memory is so transparent - why is garbage collection not a function-of or service-provided-by the operating system?
As it turns out, what I was really interested in asking is WHY the operating system could not do garbage collection and defrag memory space - which I see as a step above "simply" allocating address space to processes.
These answers help a lot! Thanks!
This is a big topic that I can't hope to adequately answer in a single answer here. I recommend picking up a copy of Windows Internals, it's an invaluable resource. Eric Lippert had a recent blog post that is a good description of how you can view memory allocated by the OS.
Memory that a process is using is basically just address space that is reserved by the operating system that may be backed by physical memory, the page file, or a file. This is the same whether it is a managed application or a native application. When the process exits, the operating system deletes the memory that it had allocated for it - the virtual address space is simply deleted and the page file or physical memory backings are free for other processes to use. This is all the OS really maintains - mappings of address space to some physical resource. The mappings can shift as processes demand more memory or are idle - physical memory contents can be shifted to disk and vice versa by the OS to meet demand.
What a process is using according to those tools can actually mean one of several things - it can be total address space allocated, total memory allocated (page file + physical memory) or memory a process is actually using that is resident in memory. Task Manager has a separate column for each of these possibilities.
The OS can't do garbage collection since it has no insight into what that memory actually contains - it just sees allocated pages of memory, it doesn't see objects which may or may not be referenced.
Whereas the OS handles allocations at the virtual address level, within the process itself there are other memory managers which take these large, page-sized chunks and break them up into something useful for the application to use. Windows will return memory allocated on 64k boundaries, but then the heap manager breaks it up into smaller chunks for use by each individual allocation done by the program via new. In .NET applications, the CLR hands out new objects from the garbage-collected heap and, when that heap reaches its limits, performs a garbage collection.
I can't speak to your question about the differences in how the memory appears in C++ vs. virtual machines, etc., but I can say that applications are typically given a certain memory range to use upon initialization. Then, if the application ever requires more, it will request it from the operating system, and the operating system will (generally) grant it. There are many implementations of this - in some operating systems, other applications' memory is moved away so as to give yours a larger contiguous block, and in others, your application gets various chunks of memory. There may even be some virtual memory handling involved. It's all up to an abstracted implementation. In any case, the memory is still treated as contiguous from within your program - the operating system will handle that much at least.
With regard to garbage collection, the operating system knows the bounds of your memory, but not what is inside. Furthermore, even if it did look at the memory used by your application, it would have no idea what blocks are used by out-of-scope variables and such that the GC is looking for.
The primary difference is application management. Microsoft distinguishes this as Managed and Unmanaged. When objects are allocated in memory they are stored at a specific address. This is true for both managed and unmanaged applications. In the managed world the "address" is wrapped in an object handle or "reference".
When memory is garbage collected by a VM, it can safely suspend the application, move objects around in memory and update all the "references" to point to their new locations in memory.
In a Win32 style app, pointers are pointers. There's no way for the OS to know if it's an object, an arbitrary block of data, a plain-old 32-bit value, etc. So it can't make any inferences about pointers between objects, so it can't move them around in memory or update pointers between objects.
Because of the way references are handled, the OS can't take over the GC process; instead it's left up to the VM to manage the memory used by the application. For that reason, VM applications appear exactly the same to the OS. They simply request blocks of memory for use and the OS gives it to them. When the VM performs GC and compacts its memory, it is able to free memory back to the OS for use by another app.