Can we run OpenMP parallelized code on several nodes? - parallel-processing

Assume we have four 16-core nodes (node1, node2, node3, node4). How can I run one big parallelized program on node1, node2, and node3 at the same time? Or even use 16 cores in total, allocated as 7 cores on node1 + 8 cores on node2 + 1 core on node3 (the remaining cores being occupied)?
Is MPI the common way to do this? Does OpenMP alone suffice? I haven't learned MPI, but I have used OpenMP within a single node.

You can use a combination of OpenMP and MPI if required.
While MPI does utilize every core on each node, and is optimized to exploit locality of reference when it detects that communicating tasks are on the same machine, adopting it means the code base has to change a lot if it has already been developed.
With OpenMP, by contrast, you can parallelize your code incrementally, so you might want to orchestrate a hybrid in which each MPI task uses OpenMP to utilize the cores of its node.
So: number of nodes = number of MPI tasks
number of cores per machine = number of OpenMP threads per MPI task
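As a rough sketch of that hybrid layout (a hypothetical minimal program; the build and launch commands vary by MPI implementation):

/* hybrid.c - one MPI rank per node, OpenMP threads inside each rank.
 * Build (for example):  mpicc -fopenmp hybrid.c -o hybrid
 * Launch (for example): mpirun -np 3 --map-by node ./hybrid
 */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* Request an MPI library that tolerates OpenMP threads. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Each rank fans out across its node's cores with OpenMP. */
    #pragma omp parallel
    printf("rank %d of %d, thread %d of %d\n",
           rank, nranks, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}

An uneven allocation such as 7 + 8 + 1 cores does not require code changes: you can give each rank a different OMP_NUM_THREADS, for example via a per-rank wrapper script or your launcher's rank-file mechanism.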

Is this strategy parallel computing or distributed computing? MPI

I have a function that calculates a fitness value, say func(). In my implementation, I have used MPI for parallelization.
There are 3 machines in the MPI cluster, connected via LAN. The machines share files via NFS; memory is not shared among them.
The main while loop runs 500 times.
Inside this while loop, I use MPI to parallelize the 9 func() calls: func() is called 9 times per iteration, and each of the 3 nodes performs 3 of those calls and returns its results to the master node.
(The original question included two diagrams: one of the MPI workflow and one of what happens inside each node.)
This repeats for all 500 iterations of the while loop; in each iteration, the 9 func() calls are parallelized again.
Is this strategy called parallel computing or distributed computing?
Considering the definitions, parallel computing means running multiple tasks in parallel, while distributed computing means distributing a single task over multiple nodes that share a common goal. I feel it's parallel computing.
But here I am executing on different machines, so should I consider it distributed computing?
Please clear up this doubt.
If you use distributed computing to solve a single problem, then it is also parallel computing: you are using multiple computers (or processors) to solve a single problem, which satisfies the simple definition of parallel computing.
Parallel computing uses two or more processors (cores, computers) in combination to solve a single problem.
But not all parallel computing is distributed. You can perform parallel tasks to solve a problem using shared memory (with programming models like OpenMP), in which case you use only a single computer.
Personal opinion: you can also use MPI to solve a problem on a single computer (each MPI process still has its own memory space and communicates via message passing), but that remains plain parallel computing; by the broad definition of distributed computing, there should be multiple computers involved for it to count as distributed.
A distributed computer system consists of multiple software components that are on multiple computers, but run as a single system.
In your case it is both distributed and parallel. As Gilles Gouaillardet pointed out in comments:
Your program is MPI, so it is both parallel (several tasks collaborate to achieve a goal) and distributed (each task has its own memory space and communicates with other tasks via messages - i.e. no shared memory)
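For concreteness, here is a hypothetical minimal C sketch of the pattern the question describes; func(), its double-valued inputs, and the generated input data are placeholders for the real code:

/* 9 fitness evaluations per iteration, spread over 3 MPI ranks
 * (3 calls each), results gathered on rank 0, for 500 iterations. */
#include <mpi.h>

#define CALLS 9
#define ITERS 500

static double func(double x) { return x * x; }  /* placeholder fitness */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* run with 3 ranks */

    int per_rank = CALLS / size;             /* 9 / 3 = 3 calls each */
    double inputs[CALLS], results[CALLS];
    double my_in[CALLS], my_out[CALLS];

    for (int iter = 0; iter < ITERS; iter++) {
        if (rank == 0)                        /* master prepares inputs */
            for (int i = 0; i < CALLS; i++) inputs[i] = iter + i;

        /* Distribute 3 inputs to each rank, evaluate, gather results. */
        MPI_Scatter(inputs, per_rank, MPI_DOUBLE,
                    my_in,  per_rank, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        for (int i = 0; i < per_rank; i++)
            my_out[i] = func(my_in[i]);
        MPI_Gather(my_out,  per_rank, MPI_DOUBLE,
                   results, per_rank, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        /* rank 0 now holds all 9 results for this iteration */
    }

    MPI_Finalize();
    return 0;
}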

What is the difference between parallelism and parallel computing in Flink?

I'm confused about the number of tasks that can work in parallel in Flink.
Can someone explain to me:
What is parallelism in a distributed system, and what is its relation to Flink terminology?
In Flink, does a parallelism of 2 mean that 2 tasks work in parallel?
In Flink, if 2 operators work separately but each has a parallelism of 1, does that count as parallel computation?
Is it true that in a KeyedStream, the maximum parallelism is the number of keys?
Is the current CEP engine in Flink able to work in more than 1 task?
Thank you.
Flink uses the term parallelism in a pretty standard way -- it refers to running multiple copies of the same computation simultaneously on multiple processors, but with different data. When we speak of parallelism with respect to Flink, it can apply to an operator that has parallel instances, or it can apply to a pipeline or job (composed of several operators).
In Flink it is possible for several operators to work separately and concurrently. E.g., in this job
source ---> map ---> sink
the source, map, and sink could all be running simultaneously in separate processors, but we wouldn't call that parallel computation. (Distributed, yes.)
In a typical Flink deployment, the number of task slots equals the parallelism of the job, and each slot is executing one complete parallel slice of the application. Each parallel instance of an operator chain will correspond to a task. So in the simple example above, the source, map, and sink can all be chained together and run in a single task. If you deploy this job with a parallelism of two, then there will be two tasks. But you could disable the chaining, and run each operator in its own task, in which case you'd be using six tasks to run the job with a parallelism of two.
Yes, with a KeyedStream, the number of distinct keys is an upper bound on the parallelism.
CEP can run in parallel if it is operating on a KeyedStream (in which case, the pattern matching is being done independently for each key).
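To make the task counting above concrete, here is a hypothetical minimal sketch in the DataStream API (the class name, elements, and operator bodies are placeholders, not taken from the question):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ParallelismDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(2);  // two parallel slices of the pipeline

        env.fromElements("a", "b", "c")  // demo source (non-parallel)
           .map(new MapFunction<String, String>() {
               @Override
               public String map(String value) {
                   return value.toUpperCase();  // placeholder transformation
               }
           })
           .disableChaining()  // force the map into its own task
           .print();           // sink, runs at parallelism 2

        env.execute("parallelism demo");
    }
}

Fully chained, the map and sink here would share tasks; with chaining disabled, the map occupies its own tasks. With a parallel source, a parallelism of two, and chaining disabled everywhere, you would get the six tasks described above.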

H2O cluster uneven distribution of performance usage

I set up a cluster with a 4-core (2 GHz) and a 16-core (1.8 GHz) virtual machine. Creating and connecting to the cluster works without problems. But now I want to do some deep learning on the cluster, and I see an uneven distribution of CPU usage across the two virtual machines: the 4-core machine is always at 100% CPU usage while the 16-core machine is idle most of the time.
Do I have to do additional configuration during cluster creation? It seems odd to me that the stronger of the two machines is idle while the weaker one does all the work.
Best regards,
Markus
Two things to keep in mind here.
Your data needs to be large enough to take advantage of data parallelism. In particular, the number of chunks per column needs to be large enough for all the cores to have work to do. See this answer for more details: H2O not working on parallel
H2O-3 assumes your nodes are symmetric. It doesn't try to load balance work across the cluster based on capability of the nodes. Faster nodes will finish their work first and wait idle for the slower nodes to catch up. (You can see this same effect if you have two symmetric nodes but one of them is busy running another process.)
Asymmetry is a bigger problem for memory (where smaller nodes can run out of memory and fail entirely) than it is for CPU (where some nodes are just waiting around). So always make sure to start each H2O node with the same value of -Xmx.
You can limit the number of cores H2O uses with the -nthreads option. So you can try giving each of your two nodes -nthreads 4 and see if they behave more symmetrically with each using roughly four cores. In the case you describe, that would mean the smaller machine is roughly 100% utilized and the larger machine is roughly 25% utilized. (But since the two machines probably have different chips, the cores are probably not identical and won't balance perfectly, which is OK.)
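For example (a sketch; the heap size, cluster name, and thread count are placeholders to adapt to your setup), each node could be started with the same limits:

java -Xmx8g -jar h2o.jar -name mycluster -nthreads 4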
[I'm ignoring the virtualization aspect completely, but CPU shares could also come into the picture depending on the configuration of your hypervisor.]

About the number of nodes in Flink

I'm developing a Flink toy application on my local machine before deploying the real one on a real cluster.
Now I have to determine how many nodes the cluster needs.
But I'm still a bit confused about how many nodes I have to consider to execute my application.
For example if I have the following code (from the doc):
DataStream<String> lines = env.addSource(new FlinkKafkaConsumer<>()...);
DataStream<Event> events = lines.map((line)->parse(line));
DataStream<Statistics> stats = events
.keyBy("id")
.timeWindow(Time.seconds(10))
.apply(new MyWindowAggregationFunction());
stats.addSink(new RollingSink(path));
Does this mean that operations "on the same line" are executed on the same node? (It sounds a bit strange to me.)
Some things I'd like to confirm:
If the answer to the previous question is yes, and I set parallelism to 1, can I establish how many nodes I need by counting how many operations I have to perform?
If I set parallelism to N but have fewer than N nodes available, will Flink automatically scale the work across the available nodes?
My throughput and data load are not relevant, I think; they are not heavy.
If you haven't already, I recommend reading https://ci.apache.org/projects/flink/flink-docs-release-1.3/concepts/runtime.html, which explains how the Flink runtime is organized.
Each task manager (worker node) has some number of task slots (at least one), and a Flink cluster needs exactly as many task slots as the highest parallelism used in the job. So if the entire job has a parallelism of one, then a single node is sufficient. If the parallelism is N and fewer than N task slots are available, the job can't be executed.
The Flink community is working on dynamic rescaling, but as of version 1.3, it's not yet available.
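For example, slots are configured per TaskManager in flink-conf.yaml; a sketch for running a parallelism-4 job on a single worker node (the values are illustrative):

taskmanager.numberOfTaskSlots: 4
parallelism.default: 4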

Run a Map-Reduce application on multiple cores on the same machine

I want to run map reduce tasks on a single machine and I want to use all the cores of my machine. What is the best approach? If I install Hadoop in pseudo-distributed mode, is it possible to use all the cores?
You can make use of the properties mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum to increase the number of mappers/reducers spawned simultaneously on a TaskTracker, as per your hardware specs. By default each is set to 2, hence a maximum of 2 maps and 2 reduces will run at a given instant. But keep in mind that if your input is very small, the framework will decide it's not worth parallelizing the execution; in that case you need to handle it by tweaking the default split size through mapred.max.split.size.
Having said that, based on my personal experience I have noticed that MR jobs are normally I/O-bound (and sometimes memory-bound). So CPU does not really become a bottleneck under normal circumstances, and as a result you might find it difficult to fully utilize all the cores on one machine at a time for a job.
I would suggest devising some strategy to decide the proper number of mappers/reducers to carry out the processing efficiently, making sure that you properly utilize the CPU, since mappers/reducers take up slots on each node. One approach: take the number of cores, multiply it by 0.75, and then set the number of mappers and reducers as per your needs. For example, if you have 12 physical cores (24 virtual cores), you could have 24 * 0.75 = 18 slots. Based on your needs, you can then decide whether to use 9 mappers + 9 reducers, 12 mappers + 6 reducers, or something else.
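As a sketch of the classic MR1 TaskTracker properties mentioned above (the values are illustrative, matching the 12 mappers + 6 reducers split; adjust them to your own core count), in mapred-site.xml:

<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>12</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>6</value>
  </property>
</configuration>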
I'm reposting my answer from this question: Hadoop and map-reduce on multicore machines
For Apache Hadoop 2.7.3, my experience has been that enabling YARN will also enable multi-core support. Here is a simple guide for enabling YARN on a single node:
https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/SingleCluster.html#YARN_on_a_Single_Node
The default configuration seems to work pretty well. If you want to tune your core usage, then perhaps look into setting 'yarn.scheduler.minimum-allocation-vcores' and 'yarn.scheduler.maximum-allocation-vcores' within yarn-site.xml (https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml)
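For example (a sketch; the vcore bounds are placeholders to adapt to your machine), in yarn-site.xml:

<configuration>
  <property>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>4</value>
  </property>
</configuration>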
Also, see here for instructions on how to configure a simple Hadoop sandbox with multicore support: https://bitbucket.org/aperezrathke/hadoop-aee
