Today we have a (batch) application that scales horizontally, running multiple processes on multiple servers. All processes on the same physical machine share a large amount (~100 GB) of read-only memory holding customer- and contract-related data, which we load upfront and only once into memory. Each physical server loads the same data into memory. We load the data into memory for performance reasons; we tried Redis and the like, but performance dropped.
Since all data are read-only, we are thinking of memory-mapping the files instead. Each process would theoretically load the same files, but because they are memory-mapped and read-only, the OS should load them lazily on demand, only once, and based on free physical memory. An additional advantage would be that each process is completely decoupled from any other process, which has huge benefits in K8s/container environments.
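For reference, a minimal sketch of the kind of mapping described above (the file path is a hypothetical placeholder): each process opens the read-only data file and maps it with PROT_READ and MAP_SHARED, so pages are faulted in lazily from the kernel page cache and every process on the host ends up referencing the same physical pages.

```c
/* Minimal sketch: map a large read-only data file instead of loading it.
 * The path is a hypothetical placeholder. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/data/customers.bin";   /* hypothetical data file */

    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* PROT_READ + MAP_SHARED: pages come from the kernel page cache, are
     * faulted in lazily on first access, and are shared by every process
     * on the same host that maps the same file. */
    const char *base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }
    close(fd);   /* the mapping stays valid after closing the descriptor */

    /* Touch one byte: only the page containing it is actually read in. */
    printf("first byte: 0x%02x, file size: %lld bytes\n",
           (unsigned char)base[0], (long long)st.st_size);

    munmap((void *)base, st.st_size);
    return 0;
}
```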
The next step is to put the processes (which would then use mmap) into containers. But we haven't yet found a way to prove beyond doubt that the mmapped files are loaded only once, even though the processes now run in different containers (on the same server). To our current knowledge, containers don't change the behavior of mmapping files, but we want to be sure.
Since different Linux distros and versions seem to have different means: we are using a fairly recent version of Red Hat, but Ubuntu would be fine as well. We found several Linux tools and means to list information about mmapped files, e.g. which process is mmapping which file. Maybe we are just paranoid, but none of the information we found so far clearly says: this file (or block of the file) is loaded at this physical location, assigned to processes X, Y, and Z, and accessible via virtual addresses A, B, ... within those processes.
Any insight on whether containers change mmap behavior, and on how to prove that mmapped files are loaded only once, would be very welcome.
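One way to check the "loaded only once" claim directly is to compare physical frame numbers via /proc/&lt;pid&gt;/pagemap. The sketch below assumes the process may read frame numbers from pagemap (CAP_SYS_ADMIN is required on recent kernels; unprivileged reads return a zeroed frame number), and the data file path is again a placeholder. Run one instance per container and compare the printed PFNs: if they match for the same file offset, that page exists exactly once in physical memory.

```c
/* Sketch: translate one mapped virtual address into its physical frame
 * number via /proc/self/pagemap (assumes CAP_SYS_ADMIN on recent kernels). */
#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Return the physical frame number backing vaddr, or 0 if not present/visible. */
static uint64_t pfn_of(const void *vaddr)
{
    long page_size = sysconf(_SC_PAGESIZE);
    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0) { perror("open pagemap"); exit(1); }

    uint64_t entry = 0;
    off_t offset = ((uintptr_t)vaddr / page_size) * sizeof(entry);
    if (pread(fd, &entry, sizeof(entry), offset) != (ssize_t)sizeof(entry)) {
        perror("pread pagemap");
        exit(1);
    }
    close(fd);

    int present = (entry >> 63) & 1;                     /* bit 63: in RAM   */
    return present ? (entry & ((1ULL << 55) - 1)) : 0;   /* bits 0-54: PFN   */
}

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/data/customers.bin"; /* placeholder */
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    const char *base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    unsigned char first = base[0];            /* fault the first page in */
    printf("pid %d: first byte 0x%02x of %s -> PFN 0x%" PRIx64 "\n",
           getpid(), first, path, pfn_of(base));

    pause();   /* keep the mapping alive so other instances can compare PFNs */
    return 0;
}
```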
Related
I have two programs. The first program (let's call it A) creates a huge chunk of data and saves it to disk; the second program (let's call it B) reads the data from disk and performs data processing. The old workflow is: run program A, save the data to disk, then run program B, load the data from disk, and process it. However, this is very time-consuming, since we need two rounds of disk I/O for large data.
One trivial way to solve this problem is to simply merge the two programs. However, I do NOT want to do this (imagine that, with a single dataset, we want multiple data processing programs running in parallel on the same node, which makes it necessary to separate the two programs). I was told that there is a technique called memory-mapped files, which allows multiple processes to communicate and share memory. I found some reference at https://man7.org/linux/man-pages/man3/shm_unlink.3.html.
However, in the example shown there, the execution of the two programs (processes) overlaps, and the two processes communicate with each other in a "bouncing" fashion. In my case, I am not allowed to have such a communication pattern. For some reason I have to make sure that program B is executed only after program A has finished (a serial workflow). I just wonder whether mmap can still be used in my case. I know it seems weird, since at some point there would be memory allocated by program A while no program is running (between A and B), which might lead to a memory leak, but if this optimization is possible, it would be a huge improvement. Thanks!
Memory mapped files and shared memory are two different concepts.
The former enables you to map a file into memory, so that reads from the memory location read the file and writes to the memory location write into the file. This kind of operation is very useful for abstracting I/O accesses as basic memory reads/writes. It is especially useful for big-data applications (or just to reuse code so it can operate on files directly).
The latter is typically used by multiple running programs to communicate with each other while being in different processes. For example, programs like the Chrome/Chromium browser use it to communicate between tabs, which are separate processes (for the sake of security). It is also used in HPC for fast MPI communication between processes lying on the same computing node.
Linux also lets you use pipes, so that one process can send data to another. The pipe is closed when the process emitting data ends. This is useful for dataflow-based processing (e.g. text filtering using grep).
In your case, it seems like one process runs and the other starts only when the first process has finished. This means the data must be stored in a file; shared memory cannot be used here. That being said, it does not mean the file has to be stored on a storage device. On Linux, for example, you can store files in RAM using a RAM-backed filesystem such as ramfs. Note that files stored in such a filesystem are not saved anywhere when the machine is shut down (accidentally or deliberately), so it should not be used for critical data unless you can be sure the machine will not crash or be shut down. Such filesystems have limited space, and AFAIK configuring them requires root privileges. A sketch of this idea is shown below.
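As a sketch of that idea, assuming a RAM-backed mount is available (on most distributions /dev/shm is a tmpfs mounted by default, which avoids having to set up ramfs yourself as root): program A writes its output as an ordinary file under /dev/shm and exits, and program B later maps the same path. The path and payload are placeholders.

```c
/* Sketch: pass data from A to B via a file on a RAM-backed filesystem. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define RAM_FILE "/dev/shm/dataset.bin"   /* hypothetical RAM-backed file */

/* "Program A": produce the data and store it in the RAM-backed file. */
static int produce(void)
{
    int fd = open(RAM_FILE, O_CREAT | O_TRUNC | O_WRONLY, 0600);
    if (fd < 0) { perror("open"); return 1; }
    const char payload[] = "huge chunk of data would go here";
    if (write(fd, payload, sizeof(payload)) < 0) { perror("write"); return 1; }
    close(fd);
    return 0;
}

/* "Program B": run after A has exited, map the data and process it. */
static int consume(void)
{
    int fd = open(RAM_FILE, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }
    const char *data = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }
    printf("B read: %s\n", data);             /* "processing" placeholder */
    munmap((void *)data, st.st_size);
    close(fd);
    return 0;
}

int main(int argc, char **argv)
{
    /* ./ramfile A    ...then, once A has finished...    ./ramfile B */
    return (argc > 1 && argv[1][0] == 'A') ? produce() : consume();
}
```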
An alternative solution is to create a mediator process (M) with one purpose: receiving data from one process and handing it to other processes. Shared memory can be used in this case, since A and B each communicate with M and each pair of processes is alive simultaneously: A can write directly into the memory shared by M once it is set up, and B can read it later. M needs to be created before A/B and finish after A/B.
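A rough sketch of the mediator's side of that setup, using POSIX shared memory (shm_open; link with -lrt on older glibc). The object name and size are placeholders, and the getchar() stands in for whatever real synchronisation M would use to know that A (the writer) and B (the reader) are done.

```c
/* Sketch of the mediator M: create a POSIX shared memory segment, keep it
 * alive while A and B use it, and unlink it only afterwards. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME "/dataset_shm"           /* hypothetical object name */
#define SHM_SIZE (64UL * 1024 * 1024)     /* hypothetical size: 64 MiB */

int main(void)
{
    /* Create the segment before A starts... */
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, SHM_SIZE) < 0) { perror("ftruncate"); return 1; }

    void *mem = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED) { perror("mmap"); return 1; }

    /* ...keep it alive while A writes into it and B reads from it
     * (A and B call shm_open(SHM_NAME, ...) and mmap it the same way)... */
    puts("segment ready; press Enter once B has finished");
    getchar();

    /* ...and tear it down only after both have finished. */
    munmap(mem, SHM_SIZE);
    close(fd);
    shm_unlink(SHM_NAME);
    return 0;
}
```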
I have an application where I have 3 different processes that need to run concurrently, in 3 different languages, on Windows:
A "data gathering" process, which interfaces with a sensor array. The developers of the sensor array have been kind enough to provide us with their C# source code, which I can modify. This should be generating raw data and shoving it into shared memory
A "post-processing" process. This is C++ code that uses CUDA to get the processing done as fast as possible. This should be taking raw data, moving it to the GPU, then taking the results from the GPU and communicating it to--
A feedback controller written in Matlab, which takes the results of the post-processing and uses it to make decisions on how to control a mechanical system.
I've done coursework on parallel programming, but that coursework was all on Linux, where I used mmap (sys/mman.h) for coordination between multiple processes. This makes sense to me--you ask the OS for a page in virtual memory to be mapped to shared physical memory addresses, and the OS gives you some shared memory.
Googling around, it seems like the preferred way to set up shared memory between processes in Windows (in fact, the only "easy" way to do it in Matlab) is to use memory-mapped files, but this seems completely bonkers to me. If I'm understanding correctly, a memory-mapped file grabs some disk space and maps it to the physical address space, which is then mapped into the virtual address space for any process that accesses the same memory-mapped file.
This seems about three times more complex than it needs to be just to get multiple processes to map pages in their virtual address space to the same physical memory. I don't feel like I should be doing anything remotely related to disk I/O for what I'm trying to accomplish, especially since performance is a big issue for me (ideally I should be able to process 1000 sets of data per second, though that's not a hard limit). Is this really the right way to coordinate my processes?
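For what it's worth, the Win32 "memory-mapped file" route does not have to involve a file on disk at all: passing INVALID_HANDLE_VALUE to CreateFileMapping creates a section backed by the system paging file, which is how plain interprocess shared memory is usually set up on Windows. A minimal sketch (the section name and size are placeholders):

```c
/* Minimal sketch of Win32 shared memory via a "memory-mapped file" that never
 * touches a file on disk: INVALID_HANDLE_VALUE means the section is backed by
 * the system paging file. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    const DWORD size = 1 << 20;   /* 1 MiB of shared memory (placeholder) */

    /* Named section backed by the paging file, not by a user file. */
    HANDLE hMap = CreateFileMappingA(INVALID_HANDLE_VALUE, NULL,
                                     PAGE_READWRITE, 0, size,
                                     "Local\\SensorSharedMemory");  /* hypothetical name */
    if (hMap == NULL) { printf("CreateFileMapping failed: %lu\n", GetLastError()); return 1; }

    /* Each cooperating process opens the same name and maps a view of it. */
    unsigned char *view = MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS, 0, 0, size);
    if (view == NULL) { printf("MapViewOfFile failed: %lu\n", GetLastError()); return 1; }

    view[0] = 42;   /* visible to every process that maps the same section */

    UnmapViewOfFile(view);
    CloseHandle(hMap);
    return 0;
}
```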
Question
Are there any notable differences between context switching between processes running the same executable (for example, two separate instances of cat) vs processes running different executables?
Background
I already know that having the same executable means that it can be cached in the same place in memory and in any of the CPU caches that might be available, so I know that when you switch from one process to another, if they're both executing the same executable, your odds of having a cache miss are smaller (possibly zero, if the executable is small enough or they're executing in roughly the same "spot", and the kernel doesn't do anything in the meantime that could cause the relevant memory to be evicted from the cache). This of course applies "all the way down", to memory still being in RAM vs. having been paged out to swap/disk.
I'm curious if there are other considerations that I'm missing? Anything to do with virtual memory mappings, perhaps, or if there are any kernels out there which are able to somehow get more optimal performance out of context switches between two processes running the same executable binary?
Motivation
I've been thinking about the Unix philosophy of small programs that do one thing well, and how, taken to its logical conclusion, it leads to lots of small executables being forked and executed many times. (For example, 30-something runsv processes getting started up nearly simultaneously on Void Linux boot - note that runsv is only a good example during startup, because the processes mostly spend their time blocked waiting for events once they start their child service, so outside of early boot there isn't much context switching between them happening. But we could easily imagine numerous cat or /bin/sh instances running at once, or whatever.)
The context switching overhead is the same. That is usually done with a single (time-consuming) instruction.
There are some more advanced operating systems (i.e. not eunuchs) that support installed shared programs. These have reduced overhead when more than one process accesses them, e.g. only one copy of the read-only data is loaded into physical memory.
I do not quite understand the benefit of "multiple independent virtual addresses which point to the same physical address", even though I have read many books and posts.
E.g., in a similar question, Difference between physical addressing and virtual addressing concept, the post claims that programs will not crash into each other, and that "in general, a particular physical page only maps to one application's virtual space".
Well, in http://tldp.org/LDP/tlk/mm/memory.html, in the section "Shared Virtual Memory", it says:
"For example there could be several processes in the system running the bash command shell. Rather than have several copies of bash, one in each processes virtual address space, it is better to have only one copy in physical memory and all of the processes running bash share it."
If one physical address (e.g., the shell program) is mapped to two independent virtual addresses, how can this not crash? Wouldn't it be the same as using physical addressing?
What does virtual addressing provide that is not possible or convenient with physical addressing? If no virtual memory existed, i.e., the two processes pointed directly to the same physical memory, I think it could still work with some coordination mechanism. So why bother with virtual addressing, the MMU, virtual memory, and all that?
There are two main uses of this feature.
First, you can share memory between processes, which can then communicate via the shared pages. In fact, shared memory is one of the simplest forms of IPC.
But shared read-only pages can also be used to avoid useless duplication: most of the time, the code of a program does not change after it has been loaded into memory, so its memory pages can be shared among all the processes that are running that program. Obviously only the code is shared; the memory pages containing the stack, the heap and in general the data (or, if you prefer, the state) of the program are not shared.
This trick is improved with "copy on write". The code of executables usually doesn't change while running, but there are programs that are actually self-modifying (they were quite common in the past, when most development was still done in assembly); to support this, the operating system does read-only sharing as explained above, but, if it detects a write to one of the shared pages, it disables the sharing for that page, creating an independent copy of it and letting the program write there.
This trick is particularly useful in situations in which there's a good chance that the data won't change, but it may happen.
Another case in which this technique is used is when a process forks: instead of copying every memory page (which would be completely useless if the child process immediately does an exec), the new process shares all its memory pages with the parent in copy-on-write mode, allowing quick process creation while still "faking" the "classic" fork behavior.
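A tiny sketch of that copy-on-write behaviour after fork(): the child's write forces the kernel to give it a private copy of the page, so the parent's value is unchanged.

```c
/* Copy-on-write after fork(): parent and child start out sharing the same
 * physical pages, but the child's write triggers a private copy. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int value = 1;            /* lives in a page shared (COW) after fork */

    pid_t pid = fork();
    if (pid < 0) { perror("fork"); return 1; }

    if (pid == 0) {           /* child */
        value = 2;            /* write fault -> kernel copies the page */
        printf("child  sees value = %d\n", value);
        _exit(0);
    }

    waitpid(pid, NULL, 0);    /* parent */
    printf("parent sees value = %d\n", value);   /* still 1 */
    return 0;
}
```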
"If one physical address (e.g., shell program) mapped to two independent virtual addresses"
Multiple processes can be built to share a piece of memory; e.g. with one acting as a server that writes to the memory, the other as a client reading from it, or with both reading and writing. This is a very fast way of doing inter-process communication (IPC). (Other solutions, such as pipes and sockets, require copying data to the kernel and then to the other process, which shared memory skips.) But, as with any IPC solution, the programs must coordinate their reads and writes to the shared memory region by some messaging protocol.
Also, the "several processes in the system running the bash command shell" from the example will be sharing the read-only part of their address spaces, which includes the code. They can execute the same in-memory code concurrently, and won't kill each other since they can't modify it.
In the quote
in general, a particular physical page only maps to one application's virtual space
the "in general" part should really be "typically": memory pages are not shared unless you set them up to be, or unless they are read-only.
When developing a Windows application for x64, the user address space on Windows Vista and Windows 7 x64 is 8 TB.
Let's say that I have an application which consumes significantly less than the available physical memory (500 MB-1 GB) in its normal working set, and which in addition has much more than that (say 3 GB-4 GB) in distinct chunks (each much smaller than the remaining memory, say 100 MB) which are to be loaded exclusively. Of course, while technically I could easily fit an extra 4 GB into the address space, the reality is that most of it would have to be paged out, except on higher-end computers which have 6-8 GB of RAM.
The question is, am I going to destroy the computer's performance by exhausting the page file, i.e. by consuming very large amounts of paged memory for a single application? Or, equivalently, what would be an appropriate maximum for the amount of memory that I can put into the page file?
In addition, would this actually increase my performance on the higher end of machines, as opposed to just loading the data manually from the associated files at the appropriate time?
If your distinct chunks are already stored as files, then treat them as memory-mapped files. By doing so, your application doesn't have to manage reading/writing the data. Furthermore (and germane to your question), the data is backed by your files on disk and not by the system's page file.
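A minimal sketch of what that looks like with the Win32 API, assuming each chunk lives in its own file (the path is a placeholder): the mapping is read-only and backed by the chunk's file, so pages are faulted in on demand and can be discarded without touching the page file.

```c
/* Sketch: map one "distinct chunk" as a read-only memory-mapped file, backed
 * by its own file on disk rather than the page file. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    HANDLE hFile = CreateFileA("C:\\data\\chunk_001.bin",      /* hypothetical chunk */
                               GENERIC_READ, FILE_SHARE_READ, NULL,
                               OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (hFile == INVALID_HANDLE_VALUE) { printf("CreateFile: %lu\n", GetLastError()); return 1; }

    /* Size 0/0 means "map the whole file"; the file itself is the backing store. */
    HANDLE hMap = CreateFileMappingA(hFile, NULL, PAGE_READONLY, 0, 0, NULL);
    if (hMap == NULL) { printf("CreateFileMapping: %lu\n", GetLastError()); return 1; }

    const unsigned char *data = MapViewOfFile(hMap, FILE_MAP_READ, 0, 0, 0);
    if (data == NULL) { printf("MapViewOfFile: %lu\n", GetLastError()); return 1; }

    printf("first byte of chunk: 0x%02x\n", data[0]);   /* faulted in on demand */

    UnmapViewOfFile(data);
    CloseHandle(hMap);
    CloseHandle(hFile);
    return 0;
}
```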
It is up to the operating system to manage the system's resources. Current page files are usually only constrained by available disk space. The OS will manage the system's performance by balancing the physical allocations given to each process. Physical memory allocation and not page file use is more likely to cause performance issues in other running applications.
You may consider providing a setting to toggle how much memory your application will use in case any customers see adverse performance effects.