Fast inter-process (inter-thread) communication (IPC) on a large multi-CPU system - algorithm

What would be the fastest portable bi-directional mechanism for inter-process communication where threads from one application need to communicate with multiple threads in another application on the same computer, and the communicating threads can be on different physical CPUs?
I assume it would involve shared memory, a circular buffer, and shared synchronization mechanisms.
But shared mutexes are very expensive to synchronize with (and there is a limited number of them) when the threads are running on different physical CPUs.

You probably want to start by looking at the existing libraries such as MPI and OpenMP. They tend to be tuned fairly well.
If you're willing to entertain more cutting-edge approaches, then you can try what Barrelfish is doing, see http://www.barrelfish.org/barrelfish_sosp09.pdf .

If you are going to use C++, Boost has a portable, fairly low-level IPC library (Boost.Interprocess). It allows you to share memory and synchronize between processes.
http://www.boost.org/doc/libs/1_42_0/doc/html/interprocess.html
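For illustration, here is a minimal sketch of the kind of shared-memory circular buffer the question describes, built with Boost.Interprocess. The segment and object names ("ipc_demo_segment", "ring") are made up for the example, and a real implementation would also need the consumer side, error handling, and cleanup (shared_memory_object::remove):

```cpp
#include <boost/interprocess/managed_shared_memory.hpp>
#include <boost/interprocess/sync/interprocess_mutex.hpp>
#include <boost/interprocess/sync/interprocess_condition.hpp>
#include <boost/interprocess/sync/scoped_lock.hpp>
#include <cstddef>

namespace bip = boost::interprocess;

// One fixed-size ring buffer living entirely inside the shared segment.
struct RingBuffer {
    static constexpr std::size_t capacity = 1024;
    bip::interprocess_mutex     mutex;      // usable across processes, unlike std::mutex
    bip::interprocess_condition not_full;
    bip::interprocess_condition not_empty;
    std::size_t head = 0, tail = 0, count = 0;
    int         data[capacity];
};

int main() {
    // Both processes open (or create) the same named segment and look up the buffer.
    bip::managed_shared_memory segment(bip::open_or_create, "ipc_demo_segment", 1 << 20);
    RingBuffer *rb = segment.find_or_construct<RingBuffer>("ring")();

    // Producer side: push one item, blocking while the buffer is full.
    bip::scoped_lock<bip::interprocess_mutex> lock(rb->mutex);
    while (rb->count == RingBuffer::capacity)
        rb->not_full.wait(lock);
    rb->data[rb->tail] = 42;
    rb->tail = (rb->tail + 1) % RingBuffer::capacity;
    ++rb->count;
    rb->not_empty.notify_one();
    return 0;
}
```

The consumer does the mirror image (wait on not_empty, pop, notify not_full). Whether a mutex/condition pair like this is fast enough, or a lock-free single-producer/single-consumer ring is needed, depends on exactly the cross-CPU contention the question is worried about.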

Related

How to run multiple OSs simultaneously on different cores of ARMv8

I have an ARM Cortex-A53 based embedded system which has 4 cores. It is not implemented with ARM TrustZone.
Is it possible to run the following OSs simultaneously?
Core0: some type of RTOS
Core1: some type of RTOS
Core2 and Core3: Linux
All of them use some shared memory space to exchange data.
The boot sequence, up to loading the images (the monolithic RTOS and the Linux kernel) into DDR, is handled by an external chip.
Do I need to use a hypervisor, or just treat all cores as independent logical CPUs?
I am not familiar with ARMv8; should I pay additional attention to setting up the MMU, GIC, etc. in my case?
That's a very vague question, so the answer is going to be of the same sort.
The relevant background is the ARMv8 exception-level model (EL0 through EL3, with the hypervisor at EL2 and secure firmware at EL3).
Is it possible to run the following OSs simultaneously?
Yes, there is nothing that prevents that.
All of them use some shared memory space to exchange data.
Yes, you could map the same region of physical memory into all of them. How to synchronize access to that shared memory from the different OSs (i.e., from environments isolated from each other) is the more important question, though.
Boot sequences until loading image(monolithic RTOS and Linux kernel)
into DDR are processed by external chip.
In any case you need the OS image in memory before passing control to the kernel entry point, so this should be done from EL3 or EL2.
Do I need to use a hypervisor, or just treat all cores as independent
logical CPUs?
Yes, you do need a hypervisor; that is probably the best way to organise interaction between the different OSs.
should I pay additional attentions in setting MMU, GIC, etc. in my
case?
Each exception level has its own MMU configuration. The EL0/EL1 translation tables (OS/kernel) organise isolation between applications within the same OS, while the EL2 (hypervisor) stage-2 translation organises isolation between the different OSs. But all in all, probably nothing special.
As for the GIC, that depends on how you are going to organise interrupts. It is possible to route interrupts to all cores or only to a particular one, and to use them to change EL and select which OS will handle them. So yes, the GIC might need quite a bit of attention.

Is the last level cache shared by the sockets on a multiple socket machine?

I am trying to understand the architecture of a multi-socket machine.
I read that the LLC, or last-level cache, is shared by all the cores in a multicore machine. Now, if a machine supports multiple sockets, will a single last-level cache be shared by the multiple sockets, or does each socket have its own LLC?
Thank you in advance.
I'm not sure whether "last-level cache" is a widely-used term. I suspect that it is not.
References that I could find defined it as an on-chip cache used for SOC (System On Chip).
If that's the meaning that you're referencing, and if I'm correct to interpret "socket" in your question to mean the sockets that the chips are plugged into... then the answer would be "no". On-chip cache would not be shared between different chips.
I think you should look at the system architecture of the specific targets you are interested in to answer the question.
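On Linux, one practical way to check this for a specific machine is to read the sysfs cache topology. A minimal sketch, assuming the usual /sys/devices/system/cpu/.../cache files are present:

```cpp
#include <fstream>
#include <iostream>
#include <string>

int main() {
    // Walk the cache levels reported for cpu0 and show which CPUs share each one.
    for (int index = 0; ; ++index) {
        std::string base = "/sys/devices/system/cpu/cpu0/cache/index" + std::to_string(index);
        std::ifstream level_file(base + "/level");
        std::ifstream shared_file(base + "/shared_cpu_list");
        if (!level_file || !shared_file)
            break;                                  // no more cache indices
        std::string level, shared_cpus;
        std::getline(level_file, level);
        std::getline(shared_file, shared_cpus);
        std::cout << "L" << level << " cache is shared by CPUs " << shared_cpus << "\n";
    }
    return 0;
}
```

If the highest level (typically L3) lists only the CPUs belonging to one socket, then each socket has its own last-level cache, which matches the answer above: on-chip cache is per chip.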

How to run an OpenMP program on clusters with multiple nodes? [duplicate]

I want to know if it would be possible to run an OpenMP program on multiple hosts. So far I have only heard of programs that can be executed on multiple threads, but all within the same physical computer. Is it possible to execute a program on two (or more) clients? I don't want to use MPI.
Yes, it is possible to run OpenMP programs on a distributed system, but I doubt it is within the reach of every user around. ScaleMP offers vSMP - an expensive commercial hypervisor software that allows one to create a virtual NUMA machine on top of many networked hosts, then run a regular OS (Linux or Windows) inside this VM. It requires a fast network interconnect (e.g. InfiniBand) and dedicated hosts (since it runs as a hypervisor beneath the normal OS). We have an operational vSMP cluster here and it runs unmodified OpenMP applications, but performance is strongly dependent on data hierarchy and access patterns.
NICTA used to develop a similar SSI hypervisor named vNUMA, but its development also stopped. Besides, their solution was IA64-specific (IA64 is Intel Itanium, not to be confused with Intel 64, which is their current generation of x86 CPUs).
Intel used to develop Cluster OpenMP (ClOMP; not to be confused with the similarly named project to bring OpenMP support to Clang), but it was abandoned due to "general lack of interest among customers and fewer cases than expected where it showed a benefit" (from here). ClOMP was an Intel extension to OpenMP and it was built into the Intel compiler suite, e.g. you couldn't use it with GCC (this request to start ClOMP development for GCC went into limbo). If you have access to old versions of Intel compilers (versions 9.1 to 11.1), you would have to obtain a (trial) ClOMP license, which might be next to impossible given that the product is dead and old (trial) licenses have already expired. Then again, starting with version 12.0, Intel compilers no longer support ClOMP.
Other research projects exist (just search for "distributed shared memory"), but only vSMP (the ScaleMP solution) seems to be mature enough for production HPC environments (and it's priced accordingly). It seems like most efforts now go into the development of PGAS (partitioned global address space) languages (Co-Array Fortran, Unified Parallel C, etc.) instead. I would suggest that you have a look at Berkeley UPC or invest some time in learning MPI, as it is definitely not going away in the years to come.
Previously, there was Cluster OpenMP.
Cluster OpenMP, was an implementation of OpenMP that could make use of multiple SMP machines without resorting to MPI. This advance had the advantage of eliminating the need to write explicit messaging code, as well as not mixing programming paradigms. The shared memory in Cluster OpenMP was maintained across all machines through a distributed shared-memory subsystem. Cluster OpenMP is based on the relaxed memory consistency of OpenMP, allowing shared variables to be made consistent only when absolutely necessary. source
Performance Considerations for Cluster OpenMP
Some memory operations are much more expensive than others. To achieve good performance with Cluster OpenMP, the number of accesses to unprotected pages must be as high as possible, relative to the number of accesses to protected pages. This means that once a page is brought up-to-date on a given node, a large number of accesses should be made to it before the next synchronization. In order to accomplish this, a program should have as little synchronization as possible, and re-use the data on a given page as much as possible. This translates to avoiding fine-grained synchronization, such as atomic constructs or locks, and having high data locality source.
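As an illustration of that advice (which applies to ordinary shared-memory OpenMP too, just less dramatically than on a software DSM), here is a sketch contrasting per-iteration atomic updates with a single end-of-region reduction:

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<double> a(n, 1.0);
    double sum = 0.0;

    // Fine-grained version (avoid on a DSM system): every iteration
    // synchronizes on the shared variable.
    //
    //   #pragma omp parallel for
    //   for (int i = 0; i < n; ++i) {
    //       #pragma omp atomic
    //       sum += a[i];
    //   }

    // Coarse-grained version: each thread accumulates privately and the
    // threads synchronize once, in the reduction at the end of the loop.
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < n; ++i)
        sum += a[i];

    std::printf("sum = %f\n", sum);
    return 0;
}
```

The second form touches each element once per thread and pays for synchronization only at the end of the parallel loop, which is the "little synchronization, high data re-use" pattern described above.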
Another option for running OpenMP programs on multiple hosts is the remote offloading plugin in the LLVM OpenMP runtime.
https://openmp.llvm.org/design/Runtimes.html#remote-offloading-plugin
The big issue with running OpenMP programs on distributed memory is data movement. Coincidentally, that is also one of the major issues in programming GPUs. Extending OpenMP to handle GPU programming has given rise to OpenMP directives that describe data transfer. Programming GPUs has also forced programmers to think more carefully about building programs that consider data movement.
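For reference, the data-transfer directives mentioned above look like this in a plain OpenMP target region; with the remote-offloading plugin the "device" can be another host rather than a GPU, but the source-level directives are the same. This assumes an offload-capable compiler (e.g. a recent Clang with the LLVM OpenMP runtime):

```cpp
#include <cstdio>

int main() {
    const int n = 1 << 20;
    double *a = new double[n];
    for (int i = 0; i < n; ++i) a[i] = 1.0;
    double sum = 0.0;

    // The map() clauses spell out exactly which data moves to and from the
    // device; that explicit description of data movement is the point here.
    #pragma omp target teams distribute parallel for \
        map(to : a[0:n]) map(tofrom : sum) reduction(+ : sum)
    for (int i = 0; i < n; ++i)
        sum += a[i];

    std::printf("sum = %f\n", sum);
    delete[] a;
    return 0;
}
```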

Difference between multi-process programming with fork and MPI

Is there a difference in performance or otherwise between creating a multi-process program using the Linux fork and using the functions available in the MPI library?
Or is it just easier to do it in MPI because of the ready-to-use functions?
They don't solve the same problem. Note the difference between parallel programming and distributed-memory parallel programming.
The fork/join model you mentioned is usually for parallel programming on the same physical machine. You generally don't distribute your work to other connected machines (with the exception of some of the models mentioned in the comments).
MPI is for distributed-memory parallel programming. Instead of using a single processor, you use a group of machines (even hundreds of thousands of processors) to solve a problem. While these are sometimes considered one large logical machine, they are usually made up of lots of processors. The MPI functions are there to simplify communication between these processes on distributed machines to avoid having to do things like manually open TCP sockets between all of your processes.
So there's not really a way to compare their performance unless you're only running your MPI program on a single machine, which isn't really what it's designed to do. Yes, you can run MPI on a single machine and people do that all the time for small test codes or small projects, but that's not the biggest use case.
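For context, here is a minimal sketch of the MPI side of that comparison: the library handles process launch and message delivery, so the program never opens sockets or resolves hostnames itself. Build and launch commands vary by MPI distribution; something like `mpicxx hello.cpp -o hello && mpirun -np 2 ./hello` is typical:

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        // Rank 0 sends one integer to rank 1 (tag 0).
        int payload = 42;
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int payload = 0;
        MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("rank 1 received %d\n", payload);
    }

    MPI_Finalize();
    return 0;
}
```

Whether those two ranks end up on the same machine or on different nodes is decided by the launcher, not by the code, which is the key difference from a fork-based design.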

Sandboxing vs. Virtualisation

Maybe I am missing something, but aren't sandboxing and virtualisation exactly the same concept, i.e., separating the memory space of applications running in parallel? So I am wondering why they have different names; are there perhaps differences in the way they are employed?
Many thanks,
Simon
These concepts address different problems: When we virtualize, we are hiding physical limitations of the machine. Sandboxing, on the other hand, sets artificial limits on access across a machine. Consider memory as a representative analogy.
Virtualization of memory is to allow every program to access every address in a 32- or 64-bit space, even when there isn't that much physical RAM.
Sandboxing of memory is to prevent one program from seeing another's data, even though they might occupy neighboring cells in memory.
The two concepts are certainly related in the common implementation of virtual memory. However, this is a convenient artifact of the implementation, since the hardware page table is only accessible by the kernel.
Consider how to implement them separately, on an x86 machine: You could isolate programs' memory using page tables without ever swapping to disk (sandboxing without virtualization). Alternatively, you could implement full virtual memory, but also give application-level access to the hardware page table so they could see whatever they wanted (virtualization without sandboxing).
There are actually three concepts that you are muddling up here. The first and foremost is what the OS provides: it separates the memory spaces of applications running in parallel, and it is called virtual memory.
In virtual memory systems, the OS maps the memory addresses seen by applications onto real physical memory. Thus the memory spaces of applications can be separated so that they never collide.
The second is sandboxing. It is any technique you, the programmer, use to run untrusted code. If you, the programmer, are writing the OS, then from your point of view the virtual memory system you are writing is a sandboxing mechanism. If you, the programmer, are writing a web browser, then the virtual memory system in itself is not a sandboxing mechanism (different perspectives, you see). Instead it is a potential mechanism for you to use to implement your sandbox for browser plug-ins. Google Chrome is an example of a program that uses the OS's virtual memory mechanism to implement its sandboxing mechanism.
But virtual memory is not the only way to implement sandboxing. The Tcl programming language, for example, allows you to instantiate slave interpreters via the interp command. A slave interpreter is often used to implement a sandbox since it runs in a separate global space. From the OS's point of view the two interpreters run in the same memory space in a single process. But because, at the C level, the two interpreters never share data structures (unless explicitly programmed to), they are effectively separated.
Now, the third concept is virtualization, which is again separate from both virtual memory and sandboxing. Whereas virtual memory is a mechanism that, from the OS's perspective, sandboxes processes from each other, virtualisation is a mechanism that sandboxes operating systems from each other. Examples of software that does this include VMware, Parallels Desktop, Xen, and KVM (the kernel virtual machine).
Sandboxing means isolation only, while virtualization usually means simulating some sort of hardware (a virtual machine). Virtualization can happen with or without sandboxing.
Sandboxing is limiting access by a particular program. Virtualization is a mechanism that can be used to help do this, but sandboxing is achieved with other mechanisms as well, and likewise virtualization has uses besides sandboxing. Sandboxing is a "what", virtualization is a "how".