Memory allocation in MPI programs - memory-management

How is memory allocated on slave nodes for the execution of MPI programs? How do slave nodes know the amount of memory to reserve? What happens when a slave node can't find the data that it wants to access?
This is not a homework problem, but a question that came up in my mind and that I couldn't find an answer to by googling.

With a non-specific question, the best answer you can expect will also be non-specific.
When programming with MPI you typically write a single program which is launched (via mpirun/mpiexec, or some batch system, e.g. Torque) on a set of nodes.
The master-slave model is but one approach.
Memory allocation is typically under program control: just as you would allocate memory as needed in any application, so too in your MPI program.
As to finding the data, it is often provided to the slaves, directly or indirectly, by the master process if the master-slave model is used. If each MPI instance instead has to "search" for the data it is to process, then, as with any program that is unable to find what it requires, it should send a suitable error message/status back to the caller (or the master process).
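As a minimal sketch of that point (assuming a simple master-slave layout; the per-slave sizes here are invented for illustration), the master tells each slave how much data is coming, and the slave allocates exactly that much with an ordinary malloc before receiving:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0) {
        /* master: decides how much work each slave gets and sends it */
        for (int dest = 1; dest < nprocs; ++dest) {
            int count = 1000 * dest;                  /* invented per-slave size */
            double *chunk = malloc(count * sizeof(double));
            for (int i = 0; i < count; ++i) chunk[i] = i;
            MPI_Send(&count, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
            MPI_Send(chunk, count, MPI_DOUBLE, dest, 1, MPI_COMM_WORLD);
            free(chunk);
        }
    } else {
        /* slave: learns the size first, then allocates exactly that much */
        int count;
        MPI_Recv(&count, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        double *buf = malloc(count * sizeof(double)); /* ordinary heap allocation */
        MPI_Recv(buf, count, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* ... process buf ... */
        free(buf);
    }

    MPI_Finalize();
    return 0;
}

If a slave cannot obtain or locate what it needs (for example, if malloc fails), it would report an error status back to the master rather than continue.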

Related

Can I use a memory-mapped file to remove or reduce the disk IO time in a data-generating/processing workflow?

I have two programs. The first program (let's call it A) creates a huge chunk of data and saves it to disk; the second program (let's call it B) reads the data from disk and performs data processing. The old workflow is: run program A, save the data to disk, then run program B, load the data from disk, and process it. However, this is very time-consuming, since we need two rounds of disk IO for large data.
One trivial way to solve this problem is to simply merge the two programs. However, I do NOT want to do this (imagine that, with a single dataset, we want to have multiple data-processing programs running in parallel on the same node, which makes it necessary to separate the two programs). I was told that there is a technique called memory-mapped files, which allows multiple processes to communicate and share memory. I found some reference at https://man7.org/linux/man-pages/man3/shm_unlink.3.html.
However, in the example shown there, the execution of the two programs (processes) is overlapped, and the two processes communicate with each other in a "bouncing" fashion. In my case, I am not allowed to have such a communication pattern. For some reason I have to make sure that program B is executed only after program A has finished (a serial workflow). I just wonder if mmap can still be used in my case? I know it seems weird, since at some point there would be memory allocated by program A while no program is running (between A and B), which might lead to a memory leak, but if this optimization is possible, it would be a huge improvement. Thanks!
Memory mapped files and shared memory are two different concepts.
The former enables you to map a file into memory so that reads from the memory location read the file and writes to the memory location write into the file. This kind of operation is very useful for abstracting IO accesses as basic memory reads/writes. It is especially useful for big-data applications (or just to reuse code so as to operate on files directly).
The latter is typically used for multiple running programs to communicate with each other while being in different processes. For example, programs like the Chrome/Chromium browser use it to communicate between tabs that are different processes (for the sake of security). It is also used in HPC for fast MPI communication between processes residing on the same computing node.
Linux also enables you to use pipes so that one process can send data to another. The pipe is closed when the process emitting the data ends. This is useful for dataflow-based processing (e.g. text filtering using grep).
In your case, it seems like one process runs and then the other starts only when the first process has finished. This means the data must be stored in a file; shared memory cannot be used here. That being said, this does not mean the file has to be stored on a storage device. On Linux, for example, you can store files in RAM using RAMFS, a filesystem kept in RAM. Note that files stored in such a filesystem are not saved anywhere when the machine is shut down (accidentally or deliberately), so it should not be used for critical data unless you can be sure the machine will not crash or be shut down. RAMFS has limited space, and AFAIK configuring such a filesystem requires root privileges.
An alternative solution is to create a mediator process (M) with one purpose: receiving data from one process and handing it to other processes. Shared memory can be used in this case, since A and B each communicate with M while both members of the pair are alive simultaneously. A can write directly into the memory shared by M once it is set up, and B can read it later. M needs to be created before A/B and finished after A/B.
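A minimal sketch of the file-in-RAM idea, assuming a RAM-backed path such as /dev/shm (a tmpfs mount present on most Linux systems; the filename and size are invented): program A writes the data with ordinary file IO, and program B, run only after A has finished, maps it read-only with mmap.

/* --- program A (writer) --- */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const size_t n = 1 << 20;
    double *data = malloc(n * sizeof(double));
    for (size_t i = 0; i < n; ++i) data[i] = (double)i;

    FILE *f = fopen("/dev/shm/dataset.bin", "wb");  /* lives in RAM, survives A's exit */
    fwrite(data, sizeof(double), n, f);
    fclose(f);
    free(data);
    return 0;
}

/* --- program B (reader), run after A has finished --- */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(void)
{
    int fd = open("/dev/shm/dataset.bin", O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    double *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);

    /* ... process data[0 .. st.st_size/sizeof(double) - 1] ... */
    printf("first element: %f\n", data[0]);

    munmap(data, st.st_size);
    unlink("/dev/shm/dataset.bin");   /* free the RAM once done */
    return 0;
}

Because the file lives in RAM, neither step touches the disk, yet the data survives the gap between A and B until B unlinks it (or the machine reboots).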

MPI: Ensure exclusive access to shared memory (RMA)

I would like to know the best way to ensure exclusive access to a shared resource (such as a memory window) among n processes in MPI. I've tried MPI_Win_lock & MPI_Win_fence but they don't seem to work as expected, i.e. I can see multiple processes entering a critical region (code between MPI_Win_lock & MPI_Win_unlock that contains MPI_Get and/or MPI_Put) at the same time.
I would appreciate your suggestions. Thanks.
In MPI-2 you cannot truly do atomic read-modify-write operations; these were introduced in MPI-3 with MPI_Fetch_and_op. This is why your critical data gets modified concurrently.
Furthermore, take care with MPI_Win_lock. As described here:
The name of this routine is misleading. In particular, this routine need not block, except when the target process is the calling process.
The actual blocking call is MPI_Win_unlock, meaning that only after returning from this procedure can you be sure that the values from put and get are correct. Perhaps this is better described here:
MPI passive target operations are organized into access epochs that are bracketed by MPI_Win_lock and MPI_Win_unlock calls. Clever MPI implementations [10] will combine all the data movement operations (puts, gets, and accumulates) into one network transaction that occurs at the unlock.
This same document also provides a solution to your problem (critical data not being written atomically): it uses a mutex, a mechanism that ensures only one process can access the data at a time.
I recommend you read this document; the solution they propose is not difficult to implement.
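As a rough sketch of the underlying idea (a test-and-set lock built on MPI_Fetch_and_op, so it requires an MPI-3 library; the window layout and the busy-wait loop are illustrative, not the exact mutex from the document):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Rank 0 exposes a single int used as the lock word (0 = free, 1 = taken). */
    int *lock_mem = NULL;
    MPI_Win win;
    MPI_Aint size = (rank == 0) ? sizeof(int) : 0;
    MPI_Win_allocate(size, sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &lock_mem, &win);
    if (rank == 0) {
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
        *lock_mem = 0;
        MPI_Win_unlock(0, win);
    }
    MPI_Barrier(MPI_COMM_WORLD);

    /* Acquire: atomically swap 1 into the lock word until we read back 0. */
    int one = 1, prev = 1;
    do {
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
        MPI_Fetch_and_op(&one, &prev, MPI_INT, 0, 0, MPI_REPLACE, win);
        MPI_Win_unlock(0, win);   /* the unlock completes the fetch-and-op */
    } while (prev != 0);

    /* ---- critical section: only one rank at a time gets here ---- */
    printf("rank %d is in the critical section\n", rank);

    /* Release: atomically write 0 back. */
    int zero = 0;
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
    MPI_Fetch_and_op(&zero, &prev, MPI_INT, 0, 0, MPI_REPLACE, win);
    MPI_Win_unlock(0, win);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}

Ranks that fail to acquire the lock simply retry, so this busy-waits; it is fine for a handful of processes but scales poorly under heavy contention.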

Finding the amount of execution time spent on each processor in a Beowulf cluster

I have downloaded an LU decomposition program from the following link http://www.cs.nyu.edu/wanghua/course...el/h3/mpi_lu.c and the program is running very well. The reason for writing this thread is: can anyone help me get the execution time spent on the processors of the nodes connected in the cluster, so that I can gather statistics from my cluster?
Kindly help me, as I don't know much about MPI programming; all I want is the amount of time spent on each processor of the nodes in the cluster for the above program.
There are at least 2 ways of getting the times you seek, or at least a close approximation to them.
If you have a job management system installed on your cluster (if you don't, you should) then I expect that it will log the time spent on each node by each process involved in your computation. Certainly Grid Engine keeps this data in its accounting file and provides the utility qacct for inspecting that file. I'd be very surprised to learn that the other widely used job management systems don't offer similar data and functions.
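For instance, with Grid Engine something along these lines prints the accounting record (wallclock, user and system CPU time, memory) for a finished job; the job id is a placeholder:
shell$ qacct -j <job_id>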
You could edit your program and insert MPI_Wtime calls at the critical points. Of course, like all MPI routines, it can only be called after MPI_Init and before MPI_Finalize; you would have to make other arrangements for timing the parts of your code which lie outside the scope of MPI. (On most MPI implementations that do not support clock synchronisation, calls to MPI_Wtime are possible before MPI_Init and after MPI_Finalize, since there MPI_Wtime is simply a wrapper around the system timer routines, but that's not guaranteed to be portable.)
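A minimal sketch of the second approach, where do_work is a placeholder for whatever part of mpi_lu.c you want to time, and each rank's elapsed time is gathered and printed by rank 0:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static void do_work(void) { /* ... the section you want to time ... */ }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_Barrier(MPI_COMM_WORLD);          /* start everyone together */
    double t0 = MPI_Wtime();
    do_work();
    double elapsed = MPI_Wtime() - t0;    /* seconds spent on this rank */

    /* Collect every rank's time on rank 0 and print them */
    double *times = (rank == 0) ? malloc(nprocs * sizeof(double)) : NULL;
    MPI_Gather(&elapsed, 1, MPI_DOUBLE, times, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        for (int i = 0; i < nprocs; ++i)
            printf("rank %d: %.6f s\n", i, times[i]);
        free(times);
    }

    MPI_Finalize();
    return 0;
}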

What does it mean to configure MPI for shared memory?

I have a research-related question.
I have just finished implementing a structured skeleton framework based on MPI (specifically using openmpi 6.3). The framework is supposed to be used on a single machine.
Now I am comparing it with other, previous skeleton implementations (such as scandium, fast-flow, ...).
One thing I have noticed is that the performance of my implementation is not as good as the other implementations.
I think this is because my implementation is based on MPI (thus two-sided communication that requires matching send and receive operations),
while the other implementations I am comparing with are based on shared memory (... but still I have no good explanation for that, and it is part of my question).
There is a big difference in completion time between the two categories.
Today I was also introduced to the configuration of Open MPI for shared memory here => openmpi-sm
and here comes my question.
1st: what does it mean to configure MPI for shared memory? I mean, MPI processes live in their own virtual memory; what does a flag like the one in the following command really do?
(I thought that in MPI every communication is done by explicitly passing messages, and no memory is shared between processes.)
shell$ mpirun --mca btl self,sm,tcp -np 16 ./a.out
2nd: why is the performance of MPI so much worse compared to the other skeleton implementations developed for shared memory? At least I am also running it on a single multi-core machine.
(I suppose it is because the other implementations use thread-based parallel programming, but I have no convincing explanation for that.)
Any suggestion or further discussion is very welcome.
Please let me know if I have to clarify my question further.
Thank you for your time!
Open MPI is very modular. It has its own component model called Modular Component Architecture (MCA). This is where the name of the --mca parameter comes from - it is used to provide runtime values to MCA parameters, exported by the different components in the MCA.
Whenever two processes in a given communicator want to talk to each other, MCA finds suitable components that are able to transmit messages from one process to the other. If both processes reside on the same node, Open MPI usually picks the shared memory BTL component, known as sm. If the processes reside on different nodes, Open MPI walks the available network interfaces and chooses the fastest one that can reach the other node. It puts some preference on fast networks like InfiniBand (via the openib BTL component), but if your cluster doesn't have InfiniBand, TCP/IP is used as a fallback, provided the tcp BTL component is in the list of allowed BTLs.
By default you do not need to do anything special in order to enable shared memory communication. Just launch your program with mpiexec -np 16 ./a.out. What you have linked to is the shared memory part of the Open MPI FAQ which gives hints on what parameters of the sm BTL could be tweaked in order to get better performance. My experience with Open MPI shows that the default parameters are nearly optimal and work very well, even on exotic hardware like multilevel NUMA systems. Note that the default shared memory communication implementation copies the data twice - once from the send buffer to shared memory and once from shared memory to the receive buffer. A shortcut exists in the form of the KNEM kernel device, but you have to download it and compile it separately as it is not part of the standard Linux kernel. With KNEM support, Open MPI is able to perform "zero-copy" transfers between processes on the same node - the copy is done by the kernel device and it is a direct copy from the memory of the first process to the memory of the second process. This dramatically improves the transfer of large messages between processes that reside on the same node.
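If you do want to inspect or tweak those knobs, Open MPI ships the ompi_info tool. The exact output varies with the Open MPI version, but something along these lines lists the sm BTL's MCA parameters, and the second command restricts a single-node run to the shared memory and self (loopback) transports:
shell$ ompi_info --param btl sm
shell$ mpirun --mca btl self,sm -np 16 ./a.out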
Another option is to completely forget about MPI and use shared memory directly. You can use the POSIX memory management interface (see here) to create a shared memory block and have all processes operate on it directly. If data is stored in shared memory, it could be beneficial as no copies would be made. But watch out for NUMA issues on modern multi-socket systems, where each socket has its own memory controller and accessing memory on remote sockets of the same board is slower. Process pinning/binding is also important - pass --bind-to-socket to mpiexec to have it bind each MPI process to a separate socket.
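A hedged sketch of that direct route, assuming Linux/POSIX shared memory; the object name /my_block and the size are invented, and synchronisation between writer and readers is deliberately left out:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define BLOCK_NAME "/my_block"       /* invented name; backed by /dev/shm */
#define BLOCK_SIZE (1 << 20)

int main(int argc, char **argv)
{
    int creator = (argc > 1);        /* run one instance with any argument to create */

    int fd = shm_open(BLOCK_NAME, creator ? (O_CREAT | O_RDWR) : O_RDWR, 0600);
    if (creator)
        ftruncate(fd, BLOCK_SIZE);   /* size the block once */

    double *block = mmap(NULL, BLOCK_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
    close(fd);

    if (creator)
        block[0] = 42.0;             /* visible to every process mapping the block */
    else
        printf("read %f from shared block\n", block[0]);

    munmap(block, BLOCK_SIZE);
    if (!creator)
        shm_unlink(BLOCK_NAME);      /* last user removes the object */
    return 0;
}

Link with -lrt on older glibc versions; in a real setup you would add a semaphore (or keep MPI just for coordination) so that readers do not race the writer.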

What makes VxWorks so deterministic and fast?

I worked on VxWorks 5.5 a long time back and it was the best experience, working on the world's best real-time OS. Since then I have never had a chance to work on it again. But a question keeps popping into my mind: what makes it so fast and deterministic?
I have not been able to find many references for this question via Google.
So, I just tried thinking about what makes a regular OS non-deterministic:
Memory allocation/de-allocation: Wikipedia says RTOSes use fixed-size blocks so that these blocks can be directly indexed, but this causes internal fragmentation, and I am sure that is not at all desirable on mission-critical systems where memory is already limited.
Paging/segmentation: this is kind of linked to point 1.
Interrupt handling: not sure how VxWorks implements it, as this is something VxWorks handles very well.
Context switching: I believe that in VxWorks 5.5 all the processes used to execute in the kernel address space, so context switching only involved saving register values and nothing about the PCB (process control block), but I am still not 100% sure.
Process scheduling algorithms: if Windows implements preemptive scheduling (priority/round-robin), will process scheduling be as fast as in VxWorks? I don't think so. So how does VxWorks handle scheduling?
Please correct my understanding wherever required.
I believe the following would account for lots of the difference:
No Paging/Swapping
A deterministic RTOS simply can't swap memory pages to disk. This would kill the determinism, since at any moment you could have to swap memory in or out.
vxWorks requires that your application fit entirely in RAM
No Processes
In vxWorks 5.5 there are tasks, but no processes like in Windows or Linux. The tasks are more akin to threads, and switching context between them is a relatively inexpensive operation. In Linux/Windows, switching between processes is quite expensive.
Note that in vxWorks 6.x a process model was introduced, which adds some overhead, mainly related to transitioning from user mode to supervisor mode. The task switching time is not necessarily directly affected by the new model.
Fixed Priority
In vxWorks, task priorities are set by the developer and are system-wide. The highest-priority ready task at any given time will be the one running. You can thus design your system to ensure that the task with the tightest deadline always executes before the others.
In Linux/Windows, generally speaking, while you have some control over the priority of processes, the scheduler will eventually let lower-priority processes run even if higher-priority processes are still active.
