Difference between multi-process programming with fork and MPI

Is there a difference, in performance or otherwise, between creating a multi-process program using the Linux fork() call and using the functions available in the MPI library?
Or is it just easier to do it in MPI because of the ready-to-use functions?

They don't solve the same problem. Note the difference between parallel programming and distributed-memory parallel programming.
The fork/join model you mentioned is usually for parallel programming on a single physical machine. You generally don't distribute your work to other connected machines (with a few exceptions).
MPI is for distributed-memory parallel programming. Instead of using a single machine, you use a group of machines (potentially hundreds of thousands of processors) to solve a problem. While these are sometimes treated as one large logical machine, they are really many separate hosts, and the MPI functions are there to simplify communication between processes on those distributed machines, so you avoid having to do things like manually open TCP sockets between all of your processes.
So there's not really a way to compare their performance unless you're only running your MPI program on a single machine, which isn't what it's designed for. Yes, you can run MPI on a single machine, and people do that all the time for small test codes and small projects, but that's not its biggest use case.
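A minimal sketch of the difference in programming model, assuming an MPI implementation such as Open MPI or MPICH is installed (the program is illustrative): each rank is a separate process, possibly on a different machine, and the MPI runtime handles the transport, so no sockets are opened by hand.

    // Canonical two-rank MPI exchange.
    // Build: mpicxx hello.cpp    Run: mpirun -np 2 ./a.out
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        int rank = 0, size = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // this process's id in the job
        MPI_Comm_size(MPI_COMM_WORLD, &size);  // total number of processes

        if (rank == 0 && size > 1) {
            int msg = 42;
            MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   // to rank 1
        } else if (rank == 1) {
            int msg = 0;
            MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            std::printf("rank 1 of %d received %d\n", size, msg);
        }

        MPI_Finalize();
        return 0;
    }

The exact same binary works whether the two ranks are on one machine or on two nodes of a cluster; a fork-based program would need explicit sockets (or similar) to cross the machine boundary.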

Related

How to run an OpenMP program on clusters with multiple nodes?

I want to know if it would be possible to run an OpenMP program on multiple hosts. So far I have only heard of programs that can be executed on multiple threads, but all within the same physical computer. Is it possible to execute a program on two (or more) clients? I don't want to use MPI.
Yes, it is possible to run OpenMP programs on a distributed system, but I doubt it is within the reach of every user. ScaleMP offers vSMP, an expensive commercial hypervisor that allows one to create a virtual NUMA machine on top of many networked hosts and then run a regular OS (Linux or Windows) inside this VM. It requires a fast network interconnect (e.g. InfiniBand) and dedicated hosts (since it runs as a hypervisor beneath the normal OS). We have an operational vSMP cluster here and it runs unmodified OpenMP applications, but performance is strongly dependent on the data hierarchy and access patterns.
NICTA used to develop a similar SSI hypervisor named vNUMA, but its development has also stopped. Besides, their solution was IA64-specific (IA64 is Intel Itanium, not to be confused with Intel 64, the current generation of x86 CPUs).
Intel used to develop Cluster OpenMP (ClOMP; not to be confused with the similarly named project to bring OpenMP support to Clang), but it was abandoned due to "general lack of interest among customers and fewer cases than expected where it showed a benefit" (according to Intel). ClOMP was an Intel extension to OpenMP built into the Intel compiler suite, i.e. you couldn't use it with GCC (a request to start ClOMP development for GCC went into limbo). If you have access to old versions of the Intel compilers (versions 9.1 to 11.1), you would have to obtain a (trial) ClOMP license, which might be next to impossible given that the product is dead and old trial licenses have already expired. Starting with version 12.0, the Intel compilers no longer support ClOMP at all.
Other research projects exist (just search for "distributed shared memory"), but only vSMP (the ScaleMP solution) seems mature enough for production HPC environments (and it is priced accordingly). It seems that most effort now goes into the development of PGAS languages (Co-Array Fortran, Unified Parallel C, etc.) instead. I would suggest that you have a look at Berkeley UPC, or invest some time in learning MPI, as it is definitely not going away in the years to come.
Earlier, there was Cluster OpenMP.
Cluster OpenMP was an implementation of OpenMP that could make use of multiple SMP machines without resorting to MPI. It had the advantage of eliminating the need to write explicit messaging code, as well as not mixing programming paradigms. The shared memory in Cluster OpenMP was maintained across all machines through a distributed shared-memory subsystem. Cluster OpenMP relied on the relaxed memory consistency of OpenMP, allowing shared variables to be made consistent only when absolutely necessary. (source)
Performance Considerations for Cluster OpenMP
Some memory operations are much more expensive than others. To achieve good performance with Cluster OpenMP, the number of accesses to unprotected pages must be as high as possible relative to the number of accesses to protected pages. This means that once a page is brought up to date on a given node, a large number of accesses should be made to it before the next synchronization. To accomplish this, a program should have as little synchronization as possible and re-use the data on a given page as much as possible. This translates to avoiding fine-grained synchronization, such as atomic constructs or locks, and having high data locality. (source)
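Although Cluster OpenMP itself is gone, that advice is ordinary OpenMP tuning guidance, so a hedged sketch in plain OpenMP can illustrate it: summing an array with a per-element atomic is exactly the fine-grained synchronization to avoid, while a reduction synchronizes only once per thread.

    // Two ways to sum an array with OpenMP. The atomic version synchronizes
    // on every element -- the fine-grained pattern the text advises against
    // (and which was especially costly on Cluster OpenMP's paged distributed
    // shared memory). The reduction version keeps private partial sums and
    // merges them once per thread.
    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<double> data(1 << 20, 1.0);
        double slow = 0.0, fast = 0.0;

        #pragma omp parallel for
        for (long i = 0; i < (long)data.size(); ++i) {
            #pragma omp atomic          // per-element synchronization: avoid
            slow += data[i];
        }

        #pragma omp parallel for reduction(+ : fast)
        for (long i = 0; i < (long)data.size(); ++i)
            fast += data[i];            // private partial sums, merged once

        std::printf("%f %f\n", slow, fast);
        return 0;
    }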
Another option for running OpenMP programs on multiple hosts is the remote offloading plugin in the LLVM OpenMP runtime.
https://openmp.llvm.org/design/Runtimes.html#remote-offloading-plugin
The big issue with running OpenMP programs on distributed memory is data movement. Coincidentally, that is also one of the major issues in programming GPUs. Extending OpenMP to handle GPU programming has given rise to OpenMP directives that describe data transfer. Programming GPUs has also forced programmers to think more carefully about building programs with data movement in mind.
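For a flavor of those data-transfer directives, here is a minimal sketch using the standard OpenMP target and map clauses; it assumes a compiler built with offloading support (e.g. clang with -fopenmp and an -fopenmp-targets flag), and the same host-side directives are what the remote-offloading plugin is designed to forward to another host.

    // OpenMP 4.x+ target directives make data movement explicit: the map
    // clauses say which arrays travel to and from the device.
    #include <cstdio>

    int main() {
        const int n = 1000;
        double a[n], b[n];
        for (int i = 0; i < n; ++i) { a[i] = i; b[i] = 0.0; }

        // Copy 'a' to the device, run the loop there, copy 'b' back.
        #pragma omp target map(to : a[0:n]) map(from : b[0:n])
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            b[i] = 2.0 * a[i];

        std::printf("b[10] = %f\n", b[10]);
        return 0;
    }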

Ruby - how to thread across cores / processors

I'm (re)writing a socket server in Ruby in hopes of simplifying it. Reading about Ruby sockets, I ran across a site that says multithreaded Ruby apps only use one core/processor in a machine.
Questions:
Is this accurate?
Do I care? Each thread in this server will potentially run for several minutes and there will be lots of them. Is the OS (CentOS 6.5) smart enough to share the load?
Is this any different from threading in C++ (the language of the current socket server)? I.e., do pthreads use multiple cores automatically?
What if I fork instead of thread?
CRuby has a global interpreter lock, so it cannot run threads in parallel. JRuby and some other implementations can do it, but CRuby will never run Ruby code in parallel. This means that, no matter how smart your OS is, it can never share the load.
This is different from threading in C++. pthreads create real OS threads, and the kernel's scheduler will run them on multiple cores at the same time. Technically Ruby uses pthreads as well, but the GIL prevents them from running in parallel.
fork creates a new process, and your OS's scheduler will almost certainly be smart enough to run it on a separate core. If you need parallelism in Ruby, either use an implementation without a GIL, or use fork.
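To make the C++ contrast concrete, here is a small sketch with raw pthreads: four CPU-bound kernel threads that, with no interpreter lock in the way, the scheduler is free to spread across cores (compile with -pthread and watch top/htop).

    // pthreads create real kernel threads; nothing serializes them.
    #include <pthread.h>
    #include <cstdio>

    static void* busy(void* arg) {
        long id = (long)arg;
        volatile double x = 0.0;
        for (long i = 0; i < 100000000L; ++i)   // CPU-bound work; watch it
            x += i * 0.5;                       // occupy several cores
        std::printf("thread %ld done\n", id);
        return nullptr;
    }

    int main() {
        pthread_t t[4];
        for (long i = 0; i < 4; ++i)
            pthread_create(&t[i], nullptr, busy, (void*)i);
        for (int i = 0; i < 4; ++i)
            pthread_join(t[i], nullptr);
        return 0;
    }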
There is a very nice gem called parallel which allows data processing with parallel threads, or with multiple processes by forking (working around the GIL of the current CRuby implementation).
Due to the GIL in YARV, Ruby is not thread-friendly. If you want to write multithreaded Ruby, use JRuby or Rubinius. It would be even better to use a functional language with the actor model, such as Erlang or Elixir, and let the virtual machine handle the threads while you only manage the Erlang processes.
Threading
If you want multi-core threading, you need to use an interpreter that can actually use multiple cores. MRI Ruby, as of 2.1.3, is still single-core for Ruby code; JRuby and Rubinius allow access to multiple cores.
Threading Alternatives
Alternatives to changing your interpreter include:
DRb with multiple Ruby processes.
A queuing system with multiple workers.
Socket programming with multiple interpreters.
Forking processes, if the underlying platform supports the fork(2) system call (sketched below).
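A minimal sketch of that last alternative (in C++ here; Ruby's Process.fork bottoms out in the same fork(2) call on POSIX systems):

    // Each child is a separate process with its own address space (and, in
    // Ruby's case, its own interpreter), so the kernel can schedule the
    // children on separate cores regardless of any GIL.
    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        for (int i = 0; i < 4; ++i) {
            pid_t pid = fork();
            if (pid == 0) {             // child: do one slice of the work
                std::printf("worker %d in pid %d\n", i, (int)getpid());
                _exit(0);               // leave without continuing the loop
            }
        }
        while (wait(nullptr) > 0) {}    // parent: reap all children
        return 0;
    }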

Mathematica Parallel computing with VMware Client

I have access to a virtual machine via VMware client and want to do parallel computing with Mathematica (there is no direct web access on the VM).
But Mathematica is not installed on the VM, and I don't want to buy an additional license.
So I want to have the interface on my laptop and transfer as much of the computation as possible to the VM.
Following questions:
- Is it possible?
- How does it work?
Thanks,
Andreas
The use of Mathematica's native parallel computing capabilities does kind of depend on there being a Mathematica kernel running on each processing element (core, CPU, functional unit, whatever the heck they're called these days) in the gang. So, without going outside Mathematica, you're dead in the water; it ain't possible.
On the other hand, you could use Mathematica's recently enhanced capabilities to play nicely with programs written in other languages (as long as those other languages are C) to write a back-end that runs on the VM and call it from the Mathematica installation on your laptop. And, of course, once you start writing in C, you can do all the parallelisation you want.
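A hedged sketch of what such a back-end could look like, using Mathematica's LibraryLink C API with an OpenMP-parallelized kernel. The function name sum_squares is made up for illustration, and the glue for reaching the remote VM (ssh, a small socket server, etc.) is left open, just as in the answer.

    // C-callable back-end compiled as a shared library against the
    // LibraryLink headers that ship with Mathematica; the kernel itself is
    // spread across cores with OpenMP.
    #include "WolframLibrary.h"

    extern "C" {

    DLLEXPORT mint WolframLibrary_getVersion() { return WolframLibraryVersion; }
    DLLEXPORT int WolframLibrary_initialize(WolframLibraryData libData) {
        return LIBRARY_NO_ERROR;
    }

    // Sum of squares over a packed real vector.
    DLLEXPORT int sum_squares(WolframLibraryData libData, mint Argc,
                              MArgument* Args, MArgument Res) {
        MTensor t = MArgument_getMTensor(Args[0]);
        mint n = libData->MTensor_getFlattenedLength(t);
        double* d = libData->MTensor_getRealData(t);
        double s = 0.0;
        #pragma omp parallel for reduction(+ : s)
        for (mint i = 0; i < n; ++i)
            s += d[i] * d[i];
        MArgument_setReal(Res, s);
        return LIBRARY_NO_ERROR;
    }

    }  // extern "C"

On the Mathematica side this would be attached with something like LibraryFunctionLoad["libbackend", "sum_squares", {{Real, 1}}, Real] (the library name is again illustrative).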

high performance runtime

This is the first time I have submitted a question in this forum.
I'm posting a general question; I don't have to develop an application for a specific purpose.
After a lot of "googling" I still haven't found a language/runtime/script engine/virtual machine that matches these 5 requirements:
- memory allocation of variables/values or objects cleaned up at run time (e.g. à la C++'s delete keyword, or free in C)
- the language (and consequently the program) is a script or pseudo-compiled à la byte code, portable across the main operating systems (Windows, Linux, *BSD, Solaris) and platforms (32/64-bit)
- native use of multicore (engine/runtime)
- no limit on heap usage
- a networking library
I am agnostic about the programming language used to build applications that run on this engine (the paradigm is not important).
I hope this post won't stir up a holy war; I'd just like to focus on the engine's behavior during program execution.
Sorry for my bad English.
Luke
I think Erlang might fit your requirements:
most data is either allocated in local scopes, and therefore immediately deleted after use, or contained in library-powered permanent storage like ETS, DETS or Mnesia. There is garbage collection, though, but the paradigm of the language makes the need for it less pressing.
the Erlang compiler compiles the source code to byte code for the BEAM virtual machine which, unlike the JVM's, is register-based and thus faster to interpret. The VM is available for:
Solaris (including 64 bit)
BSD
Linux
OSX
TRU64
Windows NT/2000/2003/XP/Vista/7
VxWorks
Erlang has been designed for distributed systems, concurrency and reliability from day one.
Erlang's heap grows with your demand for it; it is initially limited and expanded automatically (there are numerous tweaks you can use to configure this on a per-VM basis).
Erlang comes from a networking background and provides tons of libraries, from IP up to higher-level protocols.

Fast inter-process (inter-thread) communication (IPC) on a large multi-CPU system

What would be the fastest portable bi-directional inter-process communication mechanism where threads from one application need to communicate with multiple threads in another application on the same computer, and the communicating threads can be on different physical CPUs?
I assume that it would involve shared memory, a circular buffer, and shared synchronization mechanisms.
But shared mutexes are very expensive for synchronization when the threads are running on different physical CPUs (and there is a limited number of them, too).
You probably want to start by looking at the existing libraries such as MPI and OpenMP. They tend to be tuned fairly well.
If you're willing to entertain more cutting-edge approaches, then you can try what Barrelfish is doing, see http://www.barrelfish.org/barrelfish_sosp09.pdf .
If you are going to use C++, Boost has a portable, fairly low-level IPC library. It allows you to synchronize and share memory between processes.
http://www.boost.org/doc/libs/1_42_0/doc/html/interprocess.html
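A hedged sketch of the shared-memory approach from the question on top of Boost.Interprocess: a fixed-size ring buffer plus a process-shared mutex and condition variable, all living inside a named shared-memory segment. The names "demo_shm" and "ring" and the sizes are illustrative; error handling and cleanup are omitted.

    #include <boost/interprocess/managed_shared_memory.hpp>
    #include <boost/interprocess/sync/interprocess_condition.hpp>
    #include <boost/interprocess/sync/interprocess_mutex.hpp>
    #include <boost/interprocess/sync/scoped_lock.hpp>
    #include <cstddef>

    namespace bip = boost::interprocess;

    struct RingBuffer {
        bip::interprocess_mutex mutex;               // shared across processes
        bip::interprocess_condition not_empty, not_full;
        int data[1024];
        std::size_t head = 0, tail = 0, count = 0;

        void push(int v) {
            bip::scoped_lock<bip::interprocess_mutex> lock(mutex);
            while (count == 1024) not_full.wait(lock);
            data[tail] = v; tail = (tail + 1) % 1024; ++count;
            not_empty.notify_one();
        }
        int pop() {
            bip::scoped_lock<bip::interprocess_mutex> lock(mutex);
            while (count == 0) not_empty.wait(lock);
            int v = data[head]; head = (head + 1) % 1024; --count;
            not_full.notify_one();
            return v;
        }
    };

    int main() {
        // Producer side; a consumer process would open the same segment,
        // find the same "ring" object and call pop().
        bip::managed_shared_memory shm(bip::open_or_create, "demo_shm", 65536);
        RingBuffer* rb = shm.find_or_construct<RingBuffer>("ring")();
        for (int i = 0; i < 100; ++i) rb->push(i);
        return 0;
    }

If the mutex cost raised in the question turns out to dominate, a common refinement is to replace the locked ring with a lock-free single-producer/single-consumer queue built on atomics, trading generality for lower cross-CPU synchronization overhead.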
