MPI collective operations implementation - parallel-processing

Are the MPI collective operations built on top of point-to-point operations? Is the implementation of point-to-point operations improving?

This is implementation and interconnect dependent.
A simple implementation is likely based on point-to-point operations.
That said, intra-node collectives can use shared memory and hence might not require any point-to-point operations.
Also, some hardware can offload collective operations and might be accessed via a library such as Portals4.
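To make that concrete, a naive implementation layered purely on point-to-point calls might look something like the sketch below. This is illustrative only; real MPI libraries use tree or pipeline algorithms, shared memory within a node, or hardware offload instead.

```cpp
// Hypothetical sketch of a naive broadcast built only from point-to-point
// calls, to show what a "simple implementation" could look like.
#include <mpi.h>

void naive_bcast(void *buf, int count, MPI_Datatype type, int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == root) {
        // Root sends the buffer to every other rank, one at a time.
        for (int dst = 0; dst < size; ++dst)
            if (dst != root)
                MPI_Send(buf, count, type, dst, /*tag=*/0, comm);
    } else {
        // Everyone else waits for the root's message.
        MPI_Recv(buf, count, type, root, /*tag=*/0, comm, MPI_STATUS_IGNORE);
    }
}
```

The linear loop above is exactly why real implementations switch to binomial trees or shared-memory paths: the root becomes a bottleneck as the communicator grows.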

Related

Dynamic allocation algorithm for NOR Flash memories

I need to implement a dynamic allocation algorithm for a NOR flash memory.
The idea is to implement a memory management algorithm and an abstraction mechanism that lets you dynamically allocate rewritable blocks smaller than the erase block size.
I would like to know whether there are existing case studies or known algorithms to take inspiration from.
Thanks
Dynamic allocation is related to how you write your program: you want to allocate memory at runtime depending on the execution context, not once and for all when your program boots. It requires purely software mechanisms: basically, reserve a dedicated area of memory and implement allocation/free functions.
It is not really related to the kind of memory you are using.
I guess you could implement malloc/free functions for flash memory. However, flash is not designed to endure many erase/program cycles.
If you really need that, you should have a look at the concept of a Flash Translation Layer (FTL). It is a kind of library providing a virtual memory space on top of the flash. It aims to reduce flash wear and to improve write time.
But again, you should really question whether you need dynamic allocation in flash at all.
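For illustration only, the core idea behind such a layer can be sketched as a tiny log-structured allocator: records smaller than an erase block are appended inside the block, and "rewriting" a record means appending a new version and updating a RAM-side map. All names, sizes and types here are hypothetical, and a real driver would also need wear levelling, garbage collection and power-loss safety.

```cpp
// Minimal sketch of a sub-erase-block allocation scheme (not a real FTL).
#include <cstdint>
#include <cstring>
#include <map>

constexpr std::size_t ERASE_BLOCK_SIZE = 4096;   // assumed NOR sector size
constexpr std::size_t RECORD_SIZE      = 64;     // fixed sub-block allocation unit

struct FlashBlock {
    std::uint8_t data[ERASE_BLOCK_SIZE];         // stand-in for one NOR sector
    std::size_t write_ptr = 0;                   // next free offset in the log
};

class TinyFtl {
public:
    // "Rewrite" a logical record: append a new copy, remap the logical id.
    bool write(std::uint32_t logical_id, const std::uint8_t *payload) {
        if (block_.write_ptr + RECORD_SIZE > ERASE_BLOCK_SIZE)
            return false;                        // block full: a real FTL would GC/erase here
        std::memcpy(block_.data + block_.write_ptr, payload, RECORD_SIZE);
        map_[logical_id] = block_.write_ptr;     // RAM map: logical id -> physical offset
        block_.write_ptr += RECORD_SIZE;
        return true;
    }

    const std::uint8_t *read(std::uint32_t logical_id) const {
        auto it = map_.find(logical_id);
        return it == map_.end() ? nullptr : block_.data + it->second;
    }

private:
    FlashBlock block_;                           // a single erase block, for illustration
    std::map<std::uint32_t, std::size_t> map_;   // rebuilt from record headers on a real device
};
```

The point of the sketch is only to show why an FTL reduces wear: updates become appends, and the whole block is erased once, when full, instead of on every rewrite.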

Memory barriers through programming

Can memory barriers be achieved through code (without using CAS or other primitives like volatile, the atomic classes, etc.)?
I believe the Disruptor is able to achieve it without actually resorting to any of the locking primitives.
Any pointers or references for understanding this would be helpful.
Suggestions on other programmatic approaches (preferably in Java) are also appreciated.
The notion of a memory barrier is orthogonal to CAS and other locking primitives. For example, C++11 allows a CAS operation to have no memory barrier at all if specified with memory_order_relaxed. Some hardware, notably x86, always associates a memory barrier with an atomic read-modify-write operation.
The best example of an algorithm that requires a memory barrier, but no CAS or locking, is Dekker's protocol. Section 1 of "Location-Based Memory Fences" gives a good overview of the protocol.
See my blog Volatile: Almost Useless for Multi-Threaded Programming for why volatile is useless as a memory barrier.
C++-specific information: In C++11, use std::atomic_thread_fence. The preceding link has a nice example of using it without locking. If dealing with older C++ compilers, you'll need to resort to vendor-specific routines. One way is to use Intel Threading Building Blocks' tbb::atomic_fence(). It's a wrapper around whatever platform-specific fence we could find.
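As a concrete illustration of a fence used without CAS or locks, here is a minimal C++11 sketch of the flag-announcement step at the heart of Dekker-style mutual exclusion, built on std::atomic_thread_fence and otherwise relaxed operations. It is a single-shot "try" variant rather than the full protocol with the turn variable, and the names are purely illustrative.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<bool> flag0{false};   // "thread 0 wants to enter"
std::atomic<bool> flag1{false};   // "thread 1 wants to enter"
int data = 0;                     // the variable the two threads contend for

void thread0()
{
    flag0.store(true, std::memory_order_relaxed);          // announce intent
    std::atomic_thread_fence(std::memory_order_seq_cst);   // full store->load fence
    if (!flag1.load(std::memory_order_relaxed))             // has the other announced?
        data = 42;                                          // no: safe to write
}

void thread1()
{
    flag1.store(true, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    if (!flag0.load(std::memory_order_relaxed))
        data = 7;
}

int main()
{
    std::thread a(thread0), b(thread1);
    a.join();
    b.join();
    // Both threads may have backed off, but they can never both have written.
    std::printf("data = %d\n", data);
    return 0;
}
```

The seq_cst fences prevent each thread's flag store from being reordered after its load of the other thread's flag, so the two threads can never both observe "false" and both write; remove the fences and that interleaving becomes possible on weakly ordered hardware.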

Hybrid OpenMP + OpenMPI for mixed distributed & shared memory?

I am developing a code to perform a few computations that are very large by my standards. Based on single-CPU estimates, the expected run-time is ~10 CPU-years, and the memory requirement is ~64 GB. Little to no I/O is required. My serial version of the code in question (written in C) is working well enough, and I have to start thinking about how best to parallelize it.
I have access to clusters with ~64 GB RAM and 16 cores per node. I will probably limit myself to using e.g. <= 8 nodes. I'm imagining a setup where memory is shared between threads on a single node, with separate memory used on different nodes and relatively little communication between nodes.
From what I've read so far, the solution I have come up with is to use a hybrid OpenMP + OpenMPI design, using OpenMP to manage threads on individual compute nodes, and OpenMPI to pass information between nodes, like this:
https://www.rc.colorado.edu/crcdocs/openmpi-openmp
My question is whether this is the "best" way to implement this parallelization. I'm an experienced C programmer but have very limited experience in parallel programming (a little bit with OpenMP, none with OpenMPI; most of my jobs in the past were embarrassingly parallel). As an alternative suggestion, is it possible with OpenMPI to efficiently share memory on a single host? If so then I could avoid using OpenMP, which would make things slightly simpler (one API instead of two).
Hybrid OpenMP and MPI coding is most appropriate for problems where one can clearly identify two separate levels of parallelism: a coarse-grained one and a fine-grained one nested inside each coarse subdomain. Fine-grained parallelism requires lots of communication when implemented with message passing, so it doesn't scale well: the communication overhead can become comparable to the amount of work being done. As OpenMP is a shared-memory paradigm, no data communication is necessary, only access synchronisation, which makes it more appropriate for finer-grained parallel tasks. OpenMP also benefits from data sharing between threads (and the corresponding cache sharing on modern multi-core CPUs with a shared last-level cache) and usually requires less memory than the equivalent message-passing code, where some of the data might need to be replicated in all processes. MPI, on the other hand, can run across nodes and is not limited to a single shared-memory system.
Your words suggest that your parallelisation is very coarse grained, or belongs to the so-called embarrassingly parallel problems. If I were you, I would go hybrid. If you only employ OpenMP pragmas and don't use runtime calls (e.g. omp_get_thread_num()), your code can be compiled as either pure MPI (i.e. with non-threaded MPI processes) or as hybrid, depending on whether you enable OpenMP or not (you can also provide a dummy OpenMP runtime so the code compiles as serial). This gives you the benefits of both OpenMP (data sharing, cache reuse) and MPI (transparent networking, scalability, easy job launching), with the added option of switching off OpenMP and running in MPI-only mode. And as an added bonus, you will be ready for the future, which looks like bringing us interconnected many-many-core CPUs.
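For what it's worth, one possible minimal hybrid skeleton along these lines is sketched below; it assumes MPI_THREAD_FUNNELED support and uses a placeholder loop body in place of the real computation.

```cpp
// Hybrid MPI + OpenMP sketch: one MPI rank per node, OpenMP threads inside it.
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv)
{
    int provided;
    // FUNNELED: only the master thread makes MPI calls.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double local_sum = 0.0;

    // Coarse level: each rank owns a slice of the iteration space.
    // Fine level: OpenMP threads share that slice's memory.
    #pragma omp parallel for reduction(+ : local_sum)
    for (long i = rank; i < 100000000L; i += nranks)
        local_sum += 1.0 / (1.0 + static_cast<double>(i));   // placeholder work

    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("result = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}
```

Compile with the MPI wrapper and the OpenMP flag (e.g. mpicxx -fopenmp); drop the flag and the same source builds as a pure MPI program, which is exactly the switch-off option mentioned above.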

MPI vs GPU vs Hadoop, what are the major difference between these three parallelism?

I know that some machine learning algorithms, like random forest, should by nature be implemented in parallel. I did some homework and found these three parallel programming frameworks, so I am interested in knowing what the major differences between these three types of parallelism are.
Especially, if someone can point me to a study comparing the differences between them, that would be perfect!
Please list the pros and cons for each kind of parallelism, thanks.
MPI is a message-passing paradigm of parallelism. Here, you have a root machine which spawns programs on all the machines in its MPI world. All the processes in the system are independent, so the only way they can communicate is through messages over the network. The network bandwidth and throughput are among the most crucial factors in an MPI implementation's performance. Idea: if there is just one MPI process per machine and the machine has many cores, you can use the OpenMP shared-memory paradigm to solve subsets of your problem on that machine.
CUDA is a SIMT paradigm of parallelism. It uses the state-of-the-art GPU architecture to provide parallelism. A GPU contains blocks of (sets of) cores working on the same instruction in lock-step fashion (this is similar to the SIMD model). Hence, if all the threads in your system do a lot of the same work, you can use CUDA. But the amount of shared memory and global memory in a GPU is limited, so you should not use just one GPU to solve a huge problem.
Hadoop is used for solving large problems on commodity hardware using the MapReduce paradigm. Hence, you do not have to worry about distributing data or managing corner cases. Hadoop also provides a file system, HDFS, for storing data on the compute nodes.
Hadoop, MPI and CUDA are completely orthogonal to each other. Hence, it may not be fair to compare them.
Though, you can always use CUDA + MPI to solve a problem using a cluster of GPUs. You still need a regular CPU core to perform the communication part of the problem.
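As a rough sketch of the MPI "root distributes subsets" idea described above, the toy program below scatters an array from the root, lets each process work on its own chunk, and gathers the partial results back; the chunk size and the summing loop are placeholders for whatever per-machine computation you actually have.

```cpp
// Scatter / compute / gather sketch of the message-passing work distribution.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int chunk = 1000;                       // items per process (illustrative)
    std::vector<double> all;                      // only meaningful on the root
    if (rank == 0)
        all.assign(static_cast<std::size_t>(chunk) * nranks, 1.0);

    // Root scatters one chunk of the data to every process.
    std::vector<double> mine(chunk);
    MPI_Scatter(all.data(), chunk, MPI_DOUBLE,
                mine.data(), chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    // Each process works on its own subset independently.
    double partial = 0.0;
    for (double x : mine) partial += x;

    // Root gathers the partial results back over the network.
    std::vector<double> partials(rank == 0 ? nranks : 0);
    MPI_Gather(&partial, 1, MPI_DOUBLE,
               partials.data(), 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        double total = 0.0;
        for (double p : partials) total += p;
        std::printf("total = %f\n", total);
    }

    MPI_Finalize();
    return 0;
}
```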

Boost.MPI vs Boost.Asio

Good day!
What is the difference between these libraries?
I have read the MPI docs and have a little experience with Asio. To me they are different
implementations of network communication and nothing more.
But each of them introduces different abstractions (I'm not sure they sit at the same
level), which lead to different application designs.
When should I use one library or the other? What do I need to know to make the right
decision in each situation?
Yes, Asio is good for several nodes (and is a very generic framework in general), but why is MPI worse for such tasks? I don't think that a dependency on the MPI C library is restrictive, or that MPI is hard to understand, and what about scalability? With Asio we can implement things like broadcasting and more, and on the other hand MPI doesn't forbid writing simple network applications. Is it conceptually difficult to rewrite Asio-specific logic with MPI if needed?
What about socket-like communication: if it's mandatory, we can encapsulate it in a module built on Asio or any other framework and still use MPI for the other communication.
To me, sockets and the MPI standard are different network services, and it's not clear which is fundamental in the real world, where the distance from a simple client-server pair to medium-sized computations is just one step. Also, I don't think MPI has notable overhead compared with Asio.
Maybe it's a bad question, and all we need is something like ICE (Internet Communications Engine)? It supports different languages and, again (as ZeroC assures), offers great performance.
And, of course, I have never seen in any documentation a topic like "don't use this library for this!".
I simply can't accept such disunity: in one case it's sockets, in another asynchronous messages, and finally a heavyweight middleware platform. Where is the clarity in the development lifecycle? Maybe it's not a fair question, but to start reducing this zoo we need some reference point.
Each library solves different problems; they don't really overlap. It also depends on what you are trying to solve and on the communication patterns of your application. Use Boost.MPI for scalability, such as scaling to thousands, or tens of thousands, of nodes. Depending on the underlying network architecture, MPI also excels at collective operations: gather, scatter, broadcast, etc.
Use Boost.Asio for a socket abstraction layer if you only need a handful of nodes, such as a single server and some clients. I'd suggest using Boost.Asio if you aren't already using an MPI distribution in some fashion.
I haven't used both of them, but Boost.Asio is more of an abstraction layer for low-level networking, whereas Boost.MPI implements the MPI standard, which lets you create distributed computing systems.
So if you need some, say, socket-like communication, I'd go with ASIO. If you want to do distributed computing and maybe even interoperate with MPI programs written in other languages/for other platforms, go with Boost.MPI.
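To make the contrast concrete, here is roughly what a collective looks like with Boost.MPI: a few lines, with process launch and message routing handled by the MPI runtime, whereas an equivalent Boost.Asio program would have to manage connections and message framing itself. This sketch assumes an MPI implementation plus the boost_mpi and boost_serialization libraries are installed.

```cpp
#include <boost/mpi.hpp>
#include <iostream>
#include <string>

namespace mpi = boost::mpi;

int main(int argc, char *argv[])
{
    mpi::environment env(argc, argv);   // wraps MPI_Init / MPI_Finalize
    mpi::communicator world;

    std::string msg;
    if (world.rank() == 0)
        msg = "hello from rank 0";

    // One call delivers the value to every rank, whatever the cluster size.
    mpi::broadcast(world, msg, 0);

    std::cout << "rank " << world.rank() << " received: " << msg << std::endl;
    return 0;
}
```

Build with the MPI compiler wrapper (e.g. mpic++), link against boost_mpi and boost_serialization, and launch with mpirun; the number of processes is chosen at launch time, not in the code.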

Resources