I am studying the differences between parallel and distributed systems. I have been told that the division is blurring. Also, clusters can be viewed both as parallel and as distributed systems (depending on context, whatever that means).
How is this possible? Shouldn't clusters be distributed systems only?
Parallel computing:
The same application/process is split and run concurrently on multiple cores/GPUs so that tasks are processed in parallel (the parallelism can be at the bit, instruction, data, or task level).
Resources are tightly coupled - memory is shared across all the cores/GPUs within the system and is used to exchange information (so only minimal communication is required for synchronization).
It improves the performance of the system, since the main focus is on using the processing power of multiple cores/GPUs in parallel.
There are various parallel systems.
Multiprocessor parallel systems:
The processors have direct access to shared memory (UMA model). Processors are placed close together and connected by an interconnection network; inter-process communication is done through read and write operations on shared memory, or through message-passing primitives such as those provided by MPI. Typically the processors are of the same type (and run the same OS) and sit within the same computer/device with shared memory. Hardware and software are very tightly coupled.
Multicomputer parallel systems:
Here, the processors do not have direct access to shared memory, and the memories of the processors may or may not form a common address space (when they do, access times are non-uniform: the NUMA model). Processors are placed close together (they do not have a common clock) and are connected by an interconnection network, communicating either through the common address space or by message passing.
Distributed computing:
The program/problem is divided, and the components of the larger program are distributed so that the tasks run across multiple computers (computing devices), which are typically separate but connected by a network.
Resources are loosely coupled - memory is distributed (private to each computer), and messaging mechanisms are used between the computers because the tasks can be of varied nature and require IPC during execution. The computers can have different processors / different OSes and still co-operate with one another. Typically they have no common clock and no shared common memory. (Processors typically communicate over a network; they can be geographically far apart on a WAN, or sit on a LAN.)
It improves the scalability, reliability / availability, and heterogeneity of the system.
Shouldn't clusters be distributed systems only?
Typically, a cluster comprises many separate systems that do not share memory but are uniformly networked together. However, within a typical cluster, applications are parallelized to improve the performance of the cluster. It should also be noted that a parallel algorithm can be implemented on a shared-memory system or on a distributed system (using message passing).
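To make that last point concrete, here is a minimal sketch of the same parallel sum written in both styles. It is only an illustration: Python threads stand in for processors, a `queue.Queue` stands in for the network, and the function names are made up for this example.

```python
import queue
import threading


def sum_shared_memory(data, workers=4):
    """Shared-memory style: every worker writes into one shared list."""
    partial = [0] * workers  # memory visible to all workers

    def work(i):
        # Each worker owns slot i, so no lock is needed here.
        partial[i] = sum(data[i::workers])

    threads = [threading.Thread(target=work, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partial)


def sum_message_passing(data, workers=4):
    """Message-passing style: workers hold private data and send one message."""
    inbox = queue.Queue()  # stands in for the network (an assumption of this sketch)

    def work(private_chunk):
        inbox.put(sum(private_chunk))  # the only interaction is a message

    threads = [
        threading.Thread(target=work, args=(list(data[i::workers]),))
        for i in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(inbox.get() for _ in range(workers))
```

The algorithm (split, sum the pieces, combine) is identical in both functions; only the way the partial results travel differs.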
As you mentioned, it depends on the context. There are two major contexts:
How the cluster internally handles its tasks (for instance, to maintain a consistent cluster state).
How applications use the cluster.
Internal algorithms are by their nature distributed. Think about master election and membership algorithms as an example (of course clusters have considerably more tasks; this doesn't mean that there are no parallel ones).
On the other hand, applications very often parallelize their workloads to run on clusters. Clusters very often provide APIs or components like schedulers to enable that functionality. Another example is Hadoop-type workloads and their APIs. Parallelism is also used by databases that use parallel query to execute complex queries concurrently on more than one node.
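As a rough illustration of the internal, distributed side (membership and master election), here is a toy sketch. The `Membership` class, the heartbeat timeout and the "highest live id wins" rule are assumptions made for this example. A real cluster also needs the nodes to agree on this view among themselves, which the sketch ignores.

```python
class Membership:
    """Toy membership table as seen from ONE node (hypothetical, for illustration)."""

    def __init__(self, node_ids, timeout=3.0):
        self.timeout = timeout
        # Time of the last heartbeat received from each known node.
        self.last_seen = {node_id: 0.0 for node_id in node_ids}

    def heartbeat(self, node_id, now):
        """Record that node_id was heard from at time `now`."""
        self.last_seen[node_id] = now

    def alive(self, now):
        """Nodes whose last heartbeat is no older than the timeout."""
        return sorted(n for n, t in self.last_seen.items() if now - t <= self.timeout)

    def master(self, now):
        """Elect the live node with the highest id, or None if nobody is alive."""
        live = self.alive(now)
        return max(live) if live else None
```

If node 3 stops sending heartbeats, `master()` falls back to node 2 once the timeout has passed.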
I am now studying parallel computing and algorithms, and I am a little bit confused about the terms concurrent execution and simultaneous execution.
What is the difference between these terms? When do we have to use concurrent and when do we have to use simultaneous in parallel computing?
Simultaneous execution is about utilizing multiple resources (cores, HW threads, etc.) in order to perform multiple tasks at the same time. The tasks don't have to interact in any way; you may have two different applications running simultaneously on two different cores, for example, or on the same core.
The art of designing systems to be able to perform multiple tasks at the same time can be said to deal with simultaneous execution. Hyper-threading, for example, is also called "SMT", simultaneous multi-threading, since it deals with the ability to run two threads with their full contexts at the same time on a single core (this is Intel's approach; AMD has a slightly different solution, see: Difference between Intel and AMD multithreading).
Concurrency is a term residing on a higher level of abstraction, relating to the OS world. It's a property of your execution environment in which you have multiple tasks that may be executed over time, while you have no control over the order or even the form of interleaving in which they're performed. It doesn't really matter if they operate simultaneously on multiple cores, on one core with SMT, or even on a single-threaded core with some preemption mechanism and some scheduling algorithm that breaks the tasks into chunks and constantly swaps between them. The important thing here is that concurrency forces you to design your tasks in a way that guarantees correctness (especially if they interact or share data) on any type of system with any order or interleaving.
If the task is designed correctly (with proper locking, barriers, semaphores, and anything guaranteeing correct data flow) and the OS does its job properly (saving states on context switch for example or clearing caches and shooting down TLB entries when needed), then it can run with any form of execution model "under the hood".
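A minimal sketch of what "designed correctly" means in practice: the shared counter below is protected by a lock, so the result is the same whatever interleaving the scheduler picks. The function name and parameters are illustrative only.

```python
import threading


def run_counter(num_threads=4, increments=10_000):
    """Increment one shared counter from several threads, guarded by a lock."""
    counter = 0
    lock = threading.Lock()

    def work():
        nonlocal counter
        for _ in range(increments):
            # The read-modify-write must be atomic with respect to other threads.
            with lock:
                counter += 1

    threads = [threading.Thread(target=work) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

Without the lock, the final value would depend on how the increments happen to interleave, which is exactly the kind of dependence a concurrent design has to rule out.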
Since you're referring to parallel algorithms, the proper term for you is probably concurrent execution.
There are quite a lot of examples in this thread, with additional links to sources (I won't copy them here, to avoid plagiarism): What is the difference between concurrency and parallelism?
I am developing a code to perform a few very large computations by my standards. Based on single-CPU estimates, expected run-time is ~10 CPU years, and memory requirements are ~64 GB. Little to no IO is required. My serial version of the code in question (written in C) is working well enough and I have to start thinking about how to best parallelize the code.
I have access to clusters with ~64 GB RAM and 16 cores per node. I will probably limit myself to using e.g. <= 8 nodes. I'm imagining a setup where memory is shared between threads on a single node, with separate memory used on different nodes and relatively little communication between nodes.
From what I've read so far, the solution I have come up with is to use a hybrid OpenMP + OpenMPI design, using OpenMP to manage threads on individual compute nodes, and OpenMPI to pass information between nodes, like this:
https://www.rc.colorado.edu/crcdocs/openmpi-openmp
My question is whether this is the "best" way to implement this parallelization. I'm an experienced C programmer but have very limited experience in parallel programming (a little bit with OpenMP, none with OpenMPI; most of my jobs in the past were embarrassingly parallel). As an alternative suggestion, is it possible with OpenMPI to efficiently share memory on a single host? If so then I could avoid using OpenMP, which would make things slightly simpler (one API instead of two).
Hybrid OpenMP and MPI coding is most appropriate for problems where one can clearly identify two separate levels of parallelism: a coarse-grained one, and a fine-grained one nested inside each coarse subdomain. Since fine-grained parallelism requires lots of communication when implemented with message passing, it doesn't scale, because the communication overhead can become comparable to the amount of work being done. As OpenMP is a shared-memory paradigm, no data communication is necessary, only access synchronisation, and it is more appropriate for finer-grained parallel tasks. OpenMP also benefits from data sharing between threads (and the corresponding cache sharing on modern multi-core CPUs with a shared last-level cache) and usually requires less memory than the equivalent message-passing code, where some of the data might need to be replicated in all processes. MPI, on the other hand, can run across nodes and is not limited to running on a single shared-memory system.
Your words suggest that your parallelisation is very coarse grained or belongs to the so-called embarrassingly parallel problems. If I were you, I would go hybrid. If you only employ OpenMP pragmas and don't use runtime calls (e.g. omp_get_thread_num()), your code can be compiled either as pure MPI (i.e. with non-threaded MPI processes) or as hybrid, depending on whether you enable OpenMP or not (you can also provide a dummy OpenMP runtime to enable the code to be compiled as serial). This will give you the benefits of both OpenMP (data sharing, cache reuse) and MPI (transparent networking, scalability, easy job launching), with the added option to switch off OpenMP and run in an MPI-only mode. And as an added bonus, you will be ready for the future, which looks like bringing us interconnected many-many-core CPUs.
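The two-level structure can be sketched as follows. This is a structural illustration only, not real OpenMP or MPI: the outer loop over subdomains stands in for MPI ranks, the inner thread pool stands in for OpenMP threads, and the `use_threads` flag plays the role of compiling with OpenMP switched off. All names are invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def process_subdomain(cells, use_threads=True, inner_workers=4):
    """Fine-grained level: work on the cells of one subdomain (here: sum of squares)."""
    if use_threads:
        # Stand-in for an OpenMP parallel loop over shared data.
        with ThreadPoolExecutor(max_workers=inner_workers) as pool:
            return sum(pool.map(lambda x: x * x, cells))
    # Same loop with the fine-grained level switched off.
    return sum(x * x for x in cells)


def run_hybrid(data, num_domains=2, use_threads=True):
    """Coarse-grained level: split the data into independent subdomains."""
    size = max(1, -(-len(data) // num_domains))  # ceiling division, at least 1
    domains = [data[i:i + size] for i in range(0, len(data), size)]
    # In a real code each subdomain would live in its own MPI process.
    partials = [process_subdomain(d, use_threads) for d in domains]
    return sum(partials)  # stand-in for a final reduction across processes
```

Both settings of `use_threads` give the same result, which mirrors the point above: the fine-grained level can be switched off without touching the coarse-grained decomposition.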
I know that some machine learning algorithms, like random forest, should by nature be implemented in parallel. I did some homework and found these three parallel programming frameworks (Hadoop, MPI and CUDA), so I am interested in knowing what the major differences between these three types of parallelism are.
In particular, if someone can point me to a study comparing them, that would be perfect!
Please list the pros and cons of each kind of parallelism. Thanks.
MPI is a message-passing paradigm of parallelism. Here, you have a root machine which launches programs on all the machines in its MPI world (MPI_COMM_WORLD). All the processes in the system are independent, and hence the only way for them to communicate is through messages over the network. Network bandwidth and throughput are among the most crucial factors in an MPI implementation's performance. Idea: if there is just one process per machine and you have many cores on it, you can use the OpenMP shared-memory paradigm for solving subsets of your problem on one machine.
CUDA is a SIMT (single instruction, multiple threads) paradigm of parallelism. It uses the GPU architecture to provide parallelism. A GPU contains blocks of sets of cores that work on the same instruction in lock-step fashion (this is similar to the SIMD model). Hence, if all the threads in your system do a lot of the same work, you can use CUDA. But the amounts of shared memory and global memory in a GPU are limited, and hence you should not use just one GPU for solving a huge problem.
Hadoop is used for solving large problems on commodity hardware using the MapReduce paradigm. Hence, you do not have to worry about distributing data or managing corner cases. Hadoop also provides a file system, HDFS, for storing data on compute nodes.
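To give a feel for the MapReduce paradigm itself, here is a single-process word-count sketch. It does not use the Hadoop API; the function names are made up, and the map, shuffle and reduce steps that Hadoop would distribute across nodes simply run one after the other.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    """Run map over every document, then shuffle and reduce."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle(pairs))
```

Because each `map_phase` call looks at one document only and each key is reduced independently, a framework is free to run those calls on different machines.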
Hadoop, MPI and CUDA are completely orthogonal to each other. Hence, it may not be fair to compare them.
Though, you can always use (CUDA + MPI) to solve a problem using a cluster of GPUs. You still need a CPU core to perform the communication part of the problem.
What is the difference between a Cluster and MPP supercomputer architecture?
In a cluster, each machine is largely independent of the others in terms of memory, disk, etc. They are interconnected using some variation on normal networking. The cluster exists mostly in the mind of the programmer and how s/he chooses to distribute the work.
In a Massively Parallel Processor, there really is only one machine with thousands of CPUs tightly interconnected. MPPs have exotic memory architectures to allow extremely high speed exchange of intermediate results with neighboring processors.
The major variants are SIMD (Single Instruction, Multiple Data) and MIMD (Multiple Instruction, Multiple Data). In a SIMD system, every processor is executing the same instruction at the same time, only on different pieces of data. Essentially, there is only one Program Counter. In a MIMD machine, each CPU has its own PC.
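The "one program counter versus many" distinction can be sketched with a toy simulation. This is sequential Python, so nothing actually runs in parallel, and the helper names are invented for illustration. It only shows which instruction each processor sees at each step.

```python
def inc(x):
    """Toy instruction: add one."""
    return x + 1


def dbl(x):
    """Toy instruction: double."""
    return x * 2


def run_simd(instructions, lanes):
    """SIMD: one instruction stream, applied to every data lane in lock-step."""
    for op in instructions:  # the single, shared program counter
        lanes = [op(x) for x in lanes]  # same instruction, different data
    return lanes


def run_mimd(programs, data):
    """MIMD: every processor has its own program and its own program counter."""
    results = []
    for program, x in zip(programs, data):
        for op in program:  # private program counter of this processor
            x = op(x)
        results.append(x)
    return results
```

For example, `run_simd([inc, dbl], [1, 2, 3])` applies `inc` and then `dbl` to all three values, whereas `run_mimd([[inc], [dbl], [inc, dbl]], [1, 2, 3])` lets each value follow a different program.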
MPPs can be a bitch to program and are of use only on algorithms that are embarrassingly parallel (that's actually what they call it). However, if you have such a problem, then an MPP can be shockingly fast. They are also incredibly expensive.
The TOP500 list uses a slightly different distinction between an MPP and a cluster, as explained in the Dongarra et al. paper:
[a cluster is a] parallel computer system comprising an integrated collection of independent nodes, each of which is a system in its own right, capable of independent operation and derived from products developed and marketed for other stand-alone purposes
Compared to a cluster, a modern MPP (such as the IBM Blue Gene) is more tightly-integrated: individual nodes cannot run on their own and they are connected by a custom network (like a multidimensional torus). But, similarly to a cluster, there is no single, shared memory spanning all the nodes (note: an MPP might be hierarchical and shared memory might be used inside a single node (NUMA), or between a handful of nodes).
I'd thus be extremely careful about using the terms SIMD and MIMD in this context, as they usually describe shared-memory architectures (SMP).
Update:
MPP can have nodes that use shared memory internally; but the whole MPP memory is not shared.
A cluster is a bunch of machines, usually with an Ethernet interconnect (read: network), each running its own separate copy of an OS, which happen to serve a single purpose.
An MPP supercomputer usually implies a proprietary, very fast interconnect (e.g. SGI NUMAlink) that supports either Distributed Shared Memory (run processes on different MPP nodes that use shared memory over the fast interconnect to share data as if they were running on a single computer) or even a Single System Image (a single instance of an operating system, mostly Linux, running on all the nodes at the same time as if on a single machine - e.g. "ps aux" on any node will show you all the processes running on the MPP).
As you can see, the definition is quite fluid; it's more a question of scale than of clear-cut differences.
I've searched a lot of HPC literature and couldn't find a concrete definition of MPP. There is quite a consensus that a cluster consists of multiple interconnected regular personal computers or workstations, usually coupled with standard technologies (like Ethernet or open-source operating systems). The term MPP is usually applied to more proprietary approaches to building distributed-memory computers, usually involving proprietary technologies.
For example: Tianhe-2 is considered a cluster because it uses x86-64 nodes and a regular operating system (Kylin Linux). Sunway TaihuLight is considered an MPP because its nodes use their own particular architecture, SW26010, and run their own operating system, called Sunway Raise OS.
The most concrete explanation of this matter I found was in Sourcebook of Parallel Computing (Dongarra et al.):
We note that the term cluster can be applied both broadly (any system built with a significant number of commodity components) or narrowly (only commodity components and open-source software). In fact, there is no precise definition of a cluster. Some of the issues that are used to argue that a system is a massively parallel processor (MPP) instead of a cluster include proprietary interconnects (...), particularly ones designed for a specific parallel computer, and special software that treats the entire system as a single machine, particularly for the system administrators. Clusters may be built from personal computers or workstations (either single processors or symmetric multiprocessors (SMPs)) and may run either open-source or proprietary operating systems.
I think the topic says it all. What's the difference, if any, between parallel and multicore programming? Thanks.
Multi-core is a kind of parallel programming. In particular, it is a kind of MIMD setup where the processing units aren't distributed, but rather share a common memory area, and can even share data like a MISD setup if need be. I believe it is even distinct from multi-processing, in that a multi-core setup can share some level of caches, and thus cooperate more efficiently than cores on different CPUs.
General parallel programming would also include SIMD systems (like your GPU), and distributed systems.
The difference isn't in approach, just in the hardware the software runs on. Parallel programming is taking a problem and splitting the workload into smaller pieces that can be processed in parallel (divide-and-conquer type problems, etc.) or into functions that can run independently of each other. Place that software on a multi-core piece of hardware and the OS will schedule its threads across the different cores. This gives better performance, because the threads you create to do concurrent work no longer have to compete for the cycles of a single processor/core.
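Here is a minimal sketch of "splitting the workload into smaller pieces": a prime count over a range, cut into independent sub-ranges that are handed to a pool of workers. The function names are illustrative. Note that for CPU-bound work like this, CPython threads will not actually run on several cores at once, so a process pool (or another language) would be needed for a real speed-up. The decomposition itself stays the same.

```python
from concurrent.futures import ThreadPoolExecutor


def is_prime(n):
    """Trial division; fine for a small illustration."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def count_primes(lo, hi):
    """Count primes in the half-open range [lo, hi)."""
    return sum(1 for n in range(lo, hi) if is_prime(n))


def count_primes_parallel(limit, pieces=4):
    """Split [0, limit) into independent pieces and combine the partial counts."""
    step = max(1, -(-limit // pieces))  # ceiling division, at least 1
    bounds = [(lo, min(lo + step, limit)) for lo in range(0, limit, step)]
    with ThreadPoolExecutor(max_workers=pieces) as pool:
        return sum(pool.map(lambda b: count_primes(*b), bounds))
```

The pieces share nothing and can be processed in any order, which is what makes the problem easy to hand to however many cores the hardware offers.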
Multicore systems are a subset of parallel systems. Different systems will have different memory architectures, each with its own set of challenges: how does the system deal with cache coherency? Is NUMA involved? And so on.