OpenMPI: computation speed is different between different nodes

I am somewhat new to parallel computing and have run into a problem using OpenMPI. I have an application on the main node that sends strings out to another node. The sending speed is much faster than the computation speed on the receiving node. Should I create a queue in a thread on the second node to store the incoming data, or is there a better solution?
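One common pattern for this kind of producer/consumer imbalance is to buffer the incoming strings in a bounded queue on the receiving node and let a worker thread drain it. The sketch below uses mpi4py and Python's standard queue module; the ranks, tag value, "DONE" sentinel, and the process() stub are illustrative assumptions, not taken from the original application.

```python
# Sketch: rank 0 sends strings, rank 1 buffers them in a bounded queue while a
# worker thread does the computation. A bounded queue also gives some
# back-pressure: once it is full, the receiver stops posting receives.
from mpi4py import MPI
import queue
import threading

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def process(item):
    pass  # placeholder for the real (slow) computation

if rank == 0:
    for line in ["job-1", "job-2", "job-3"]:
        comm.send(line, dest=1, tag=11)
    comm.send("DONE", dest=1, tag=11)            # sentinel so rank 1 can stop
elif rank == 1:
    work = queue.Queue(maxsize=1000)             # bounded buffer

    def worker():
        while True:
            item = work.get()
            if item == "DONE":
                break
            process(item)

    t = threading.Thread(target=worker)
    t.start()
    while True:
        msg = comm.recv(source=0, tag=11)        # blocking receive in the main thread
        work.put(msg)                            # blocks when the queue is full
        if msg == "DONE":
            break
    t.join()
```

A queue only hides a temporary mismatch; if the second node is consistently slower than the sender, the buffer will keep growing (or keep the sender blocked), and the real fix is to spread the strings across more compute ranks.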

Related

Data not being partitioned in an optimal way on executors when using Spark's ALS MLlib algorithm

I am using the ALS algorithm in Spark MLlib. Whenever I run my code, the data is cached on just one or two executors. I have 8 executors running, but caching on just one node drastically decreases the performance of my jobs. The cached data is what ALS uses internally, i.e. itemBlocks, userBlocks, userFactors, etc. I have set the number of partitions and am using the implicit variant of the model.
My data size is quite small in some use cases, e.g. 40-50 MB, but I still can't understand why it would cache on just one node. Also, sometimes 3-4 executors are used and sometimes only 1-2; the partitioning appears to be quite random.
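Without seeing the code it is hard to be sure, but with a 40-50 MB input the default split often yields only one or two partitions, and ALS then builds its internal blocks from that, which matches the symptom of caching on one or two executors. Below is a minimal sketch of what you could check, using the pyspark.mllib API; the file path, partition counts, and hyperparameters are placeholders, not your actual values.

```python
# Sketch: make sure the ratings RDD is spread over enough partitions before
# training, and pass an explicit block count to ALS instead of the default.
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext(appName="als-partitioning-check")

raw = sc.textFile("hdfs:///path/to/ratings.csv", minPartitions=16)
ratings = (raw.map(lambda line: line.split(","))
              .map(lambda f: Rating(int(f[0]), int(f[1]), float(f[2])))
              .repartition(16))                  # e.g. 2 partitions per executor

print(ratings.getNumPartitions())                # verify before training

# blocks controls how ALS partitions users/items internally; the default (-1)
# auto-configures it and can end up smaller than the number of executors.
model = ALS.trainImplicit(ratings, rank=10, iterations=10,
                          lambda_=0.01, blocks=16, alpha=0.01)
```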

Running multiple "light" mapreduce or a single "heavy" mapreduce

I am writing a MapReduce program that will run on AWS EMR.
My program calculates probabilities from the Google Ngram corpus.
I was wondering whether there is a difference between running a single MapReduce job that handles all the calculations at once and running multiple MapReduce jobs that each handle one calculation at a time.
Both are done without using any data structures (arrays, lists, ...).
Is there a difference in terms of efficiency or network communication?
Both do exactly the same work in the same manner; I only separate the calculations into the reducer of each job.
Yes, there will be a difference between them, but the magnitude of the difference depends on your MapReduce program.
The reason for the difference is that when you run multiple light MapReduce jobs, there is the overhead of starting and executing multiple mappers and reducers: every MapReduce job, when it starts, requires the allocation of containers, for which the ApplicationMaster has to communicate back and forth with the ResourceManager and the NodeManagers; new log files are generated; network communication between the NameNode and DataNodes is required; and there are many other overheads as well. So a single heavy MapReduce job is better than several light ones, provided your program is not too large.
But if your single MapReduce program is so large and complex that it clogs the JVM and memory (which in my opinion is highly unlikely unless your cluster hardware is very minimal), then multiple small MapReduce jobs are more feasible.
From your question I get the impression that your MapReduce job is not that large, so I would suggest you go ahead with a single heavy MapReduce job.
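To make the "single heavy job" option concrete, here is a minimal Hadoop Streaming style reducer in plain Python that performs two calculations per key in one pass; the tab-separated input format and the particular aggregates (sum and count) are placeholders for the actual ngram probability formulas.

```python
#!/usr/bin/env python
# Sketch of a streaming reducer that does two calculations in one job instead
# of two separate jobs. Hadoop delivers "key<TAB>value" lines grouped by key.
import sys

def flush(key, values):
    if key is None:
        return
    # both calculations share the same pass over the group
    print("%s\tsum\t%d" % (key, sum(values)))
    print("%s\tcount\t%d" % (key, len(values)))

current_key, buffered = None, []
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current_key:
        flush(current_key, buffered)
        current_key, buffered = key, []
    buffered.append(int(value))
flush(current_key, buffered)
```

Splitting this into two jobs would mean reading and shuffling the same corpus twice, which is exactly the startup and network overhead described above.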

What is the principle of "code moving to data" rather than data to code?

In a recent discussion about distributed processing and streaming I came across the concept of 'code moving to data'. Can someone please help explain it? A reference for this phrase is MapReduceWay.
In terms of Hadoop, it is stated in a question, but I still could not figure out an explanation of the principle in a technology-agnostic way.
The basic idea is simple: if the code and the data are on different machines, one of them must be moved to the other machine before the code can be executed on the data. If the code is smaller than the data, it is better to send the code to the machine holding the data than the other way around, assuming all the machines are equally fast and code-compatible. (Arguably you can send the source and JIT-compile it as needed.)
In the world of Big Data, the code is almost always smaller than the data.
On many supercomputers, the data is partitioned across many nodes, and all the code for the entire application is replicated on all nodes, precisely because the entire application is small compared to even the locally stored data. Then any node can run the part of the program that applies to the data it holds. No need to send the code on demand.
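A back-of-the-envelope calculation makes the asymmetry concrete; the sizes and link speed below are illustrative assumptions, not measurements.

```python
# Illustrative arithmetic: time to ship the data vs. time to ship the code
# over the same link. All numbers are made-up round figures.
link_bytes_per_s = 1e9 / 8               # 1 Gbit/s link = 125 MB/s

data_bytes = 1e12                        # 1 TB of input data
code_bytes = 50e6                        # 50 MB application binary/jar

print("move data: %.0f s" % (data_bytes / link_bytes_per_s))   # 8000 s (~2.2 h)
print("move code: %.1f s" % (code_bytes / link_bytes_per_s))   # 0.4 s
```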
I also just came across the sentence “Moving Computation is Cheaper than Moving Data” (from the Apache Hadoop documentation) and after some reading I think this refers to the principle of data locality.
Data locality is a strategy for task scheduling aimed at optimizing performance based on the observation that moving data across a network is costly, so when choosing which task to prioritize whenever a computing/data node is free, preference will be given to the task that's going to operate on the data in the free node or in its proximity.
This passage (from Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling, Zaharia et al., 2010) explains it clearly:
Hadoop’s default scheduler runs jobs in FIFO order, with five priority levels. When the scheduler receives a heartbeat indicating that a map or reduce slot is free, it scans through jobs in order of priority and submit time to find one with a task of the required type. For maps, Hadoop uses a locality optimization as in Google’s MapReduce [18]: after selecting a job, the scheduler greedily picks the map task in the job with data closest to the slave (on the same node if possible, otherwise on the same rack, or finally on a remote rack).
Note that the fact that Hadoop replicates data across nodes improves fair scheduling of tasks (the higher the replication, the higher the probability that a task has its data on the next free node and hence gets picked to run next).
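The locality preference in that quote can be illustrated in a few lines; the sketch below is a toy illustration (the node/rack topology and task list are invented), not Hadoop's actual scheduler code.

```python
# Toy illustration: given a node that just freed up, pick a pending map task
# whose input block is node-local, else rack-local, else fall back to any task.
def pick_task(free_node, rack_of, pending_tasks):
    """pending_tasks: list of (task_id, block_locations) in priority order."""
    best, best_level = None, 3
    for task_id, locations in pending_tasks:
        if free_node in locations:
            return task_id                      # node-local: best possible match
        level = 1 if rack_of[free_node] in {rack_of[n] for n in locations} else 2
        if level < best_level:                  # remember the best rack-local/remote task
            best, best_level = task_id, level
    return best

rack_of = {"n1": "r1", "n2": "r1", "n3": "r2"}
pending = [("t1", ["n3"]), ("t2", ["n2"]), ("t3", ["n1"])]
print(pick_task("n1", rack_of, pending))        # "t3": its block lives on the free node
```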

Task scheduling with Spark

I am running a fairly large task on my 4-node cluster. I am reading around 4 GB of filtered data from a single table and running Naïve Bayes training and prediction. I have an HBase region server running on a single machine that is separate from the Spark cluster, which runs in fair-scheduling mode, although HDFS is running on all machines.
While executing, I see a strange task distribution in terms of the number of active tasks on the cluster. I observe that only one active task, or at most two, is running on one or two machines at any point in time, while the others sit idle. My expectation was that the data in the RDD would be divided and processed on all the nodes for operations like count and distinct. Why are all nodes not being used for large tasks of a single job? Does having HBase on a separate machine have anything to do with this?
Some things to check:
Presumably you are reading in your data using hadoopFile() or hadoopRDD(): consider setting the [optional] minPartitions parameter to make sure the number of partitions is equal to the number of nodes you want to use.
As you create other RDDs in your application, check the number of partitions of those RDDs and how evenly the data is distributed across them. (Sometimes an operation can create an RDD with the same number of partitions but make the data within it badly unbalanced.) You can check this by calling the glom() method, printing the number of elements of the resulting RDD (the number of partitions), and then looping through it and printing the number of elements in each of the arrays (see the sketch after this answer). This introduces extra communication, so don't leave it in your production code.
Many of the API calls on RDD have optional parameters for setting the number of partitions, and there are calls like repartition() and coalesce() that can change the partitioning. Use them to fix any problems you find with the above technique (although sometimes it will expose the need to rethink your algorithm).
Check that you're actually using RDDs for all your large data, and haven't accidentally ended up with some big data structure on the master.
All of these assume that you have data skew problems rather than something more sinister. That's not guaranteed to be true, but you need to check your data skew situation before looking for something complicated. It's easy for data skew to creep in, especially given Spark's flexibility, and it can make a real mess.
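Concretely, the partition check mentioned above could look like the sketch below (PySpark; the example RDD is artificial and stands in for your filtered data). Remember that glom() ships per-partition counts back to the driver, so keep it out of production code.

```python
# Debugging sketch: see how evenly an RDD's data is spread across partitions,
# then rebalance it. The parallelize() call is a stand-in for the real input.
from pyspark import SparkContext

sc = SparkContext(appName="partition-check")

def describe_partitions(rdd, name="rdd"):
    sizes = rdd.glom().map(len).collect()        # element count per partition
    print("%s: %d partitions, sizes=%s" % (name, rdd.getNumPartitions(), sizes))

rdd = sc.parallelize(range(100000), 2)           # artificially few partitions
describe_partitions(rdd, "before")
rdd = rdd.repartition(8)                         # e.g. 4 nodes x 2 cores each
describe_partitions(rdd, "after")
```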

Map Job Performance on cluster

Suppose I have 15 blocks of data and two clusters. The first cluster has 5 nodes and a replication factor of 1, while the second one has a replication factor of 3. If I run my map job, should I expect any change in the performance or the execution time of the map job?
In other words, how does replication affect the performance of the mapper on a cluster?
When the JobTracker assigns a job to a TaskTracker on HDFS, a task is assigned to a particular node based on the locality of its data (the preference is the same node first, then the same network switch/frame). With a lower replication factor, you limit the JobTracker's ability to assign a node local to the data (it will still assign the task to some node, but without the benefits of locality). The effect is to restrict the number of TaskTracker nodes that are local to the data (either data on the task node or data on the same switch/frame), thus affecting the performance of your task (reducing parallelization).
Your smaller cluster likely has a single switch, so data is local to the network/frame, and the only bottleneck you might experience would be data transfer from one TaskTracker to another, as the JobTracker is likely to assign jobs to all available TaskTrackers.
But with a larger Hadoop cluster, a replication factor of 1 would limit the number of TaskTracker nodes local to the data and thus able to operate efficiently on your data.
There are several papers which support data locality, http://web.eecs.umich.edu/~michjc/papers/tandon_hpdic_minimizeRemoteAccess.pdf, this paper which you cited also supports data locality, http://assured-cloud-computing.illinois.edu/sites/default/files/PID1974767.pdf, and this one, http://www.eng.auburn.edu/~xqin/pubs/hcw10.pdf (which tested a 5 node cluster, same as the OP).
This paper quotes significant benefits to data locality, http://grids.ucs.indiana.edu/ptliupages/publications/InvestigationDataLocalityInMapReduce_CCGrid12_Submitted.pdf, and observes that an increase in replication factor gives better locality.
Note that this paper claims little difference (8%) between network throughput and local disk access, http://www.cs.berkeley.edu/~ganesha/disk-irrelevant_hotos2011.pdf, but reports orders-of-magnitude differences in performance between local memory access and either disk or network access. Furthermore, the paper reports that a large fraction of jobs (64%) find their data cached in memory, "in large part due to the heavy-tailed nature of the workload", since most jobs "access only a small fraction of the blocks".
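To put a rough number on the question's 5-node scenario, here is an illustrative calculation; it assumes replicas land on distinct nodes and that the next free map slot appears on a uniformly random node, which is a simplification of what the scheduler actually sees.

```python
# Illustrative arithmetic: chance that a pending map task finds a replica of
# its block on whichever of the 5 nodes frees up next.
nodes = 5
for replication in (1, 3):
    p_local = replication / float(nodes)
    print("replication=%d -> P(node-local) = %.0f%%" % (replication, p_local * 100))
# replication=1 -> P(node-local) = 20%
# replication=3 -> P(node-local) = 60%
```

Whether that higher chance of locality translates into shorter wall-clock time depends on the factors discussed in the next answer (network speed, shuffle-heavy phases, stragglers).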
EDIT: This part of my answer is obsolete now that the other answer was edited: "The other answer is not entirely correct." This was meant to address the incorrect implication that fewer replicas = less parallelism. The rest of my answer (below) still applies.
Any node can execute your tasks, regardless of whether the data is located on that node or not. Hadoop will try to achieve data locality (the preference order is: node-local, then rack-local, then any node), but if it can't, it will choose any node that has available compute capacity to run your task.
Performance-wise, in a typical multi-rack installation, rack-local performs almost as well as node-local, since the bottleneck occurs when transmitting data across racks. However, with high-end networking equipment (i.e., full-bisection bandwidth), it wouldn't matter whether your computation is rack-local or not. For more details on this, read this paper.
How much performance improvement can you expect from having more replicas (and thus achieving higher data locality)? Not much: 5-20% maximum improvement. But that is an upper limit, reached when you implement additional popularity-based replication as in this and this project. NOTE: I did not just make up those performance-improvement numbers; they come from the papers I linked.
Since vanilla Hadoop does not have these mechanisms in place, I would expect your performance to improve by at most 1-5%. This is just a ballpark guess, but you can easily run some tests yourself. The reason is that more replicas could improve the performance of some of your map tasks (the ones that are now able to run with a data-local copy of the block), but it would not improve your shuffle and reduce phases. Furthermore, if even one mapper takes longer than the rest, it will determine the length of your whole map phase; so for many jobs, it is likely that increasing locality will not improve their running times at all. Finally, I/O-bound jobs can be map-input IO-bound, shuffle IO-bound (map-output heavy), or reduce-output IO-bound; only the first type (map-input IO-bound) benefits from locality. More details on MapReduce workload characterization are in this paper.
If you are further interested in this, you can also read this paper, in which they improve the running times of mappers by having the input data of ALL the mappers in memory.
