Can Hadoop tasks run in parallel on single node - hadoop

I am new to Hadoop and have the following questions.
This is what I have understood about Hadoop:
1) Whenever a file is written to Hadoop, it is stored across all the data nodes in chunks (64MB by default).
2) When we run an MR job, a split is created from each block, and the split is processed on each data node.
3) For each split, a record reader is used to generate key/value pairs on the mapper side.
Questions:
1) Can one data node process more than one split at a time? What if the data node has more capacity?
I think this was a limitation in MR1, and with MR2 (YARN) we have better resource utilization.
2) Will a split be read serially at the data node, or can it be processed in parallel to generate key/value pairs (by randomly accessing disk locations within the split)?
3) What does the term 'slot' mean in the map/reduce architecture? I was reading a blog that says YARN provides better slot utilization on the DataNode.

Let me first address the "what I have understood" part.
A file stored on the Hadoop file system is NOT stored across all data nodes. Yes, it is split into chunks (64MB by default in Hadoop 1.x; newer releases default to 128MB), but the number of DataNodes on which these chunks are stored depends on (a) file size, (b) current load on the DataNodes, (c) replication factor and (d) physical proximity. The NameNode takes these factors into account when deciding which DataNodes will store the chunks of a file.
Again, each DataNode MAY NOT process a split. Firstly, DataNodes are only responsible for managing the storage of data, not for executing jobs/tasks. The TaskTracker is the slave daemon responsible for executing tasks on individual nodes. Secondly, only those nodes which contain the data required for that particular job will process the splits, unless the load on these nodes is too high, in which case the data in the split is copied to another node and processed there.
Now coming to the questions:
1) Again, DataNodes are not responsible for processing jobs/tasks. We usually refer to the combination of DataNode + TaskTracker as a node, since they are commonly found on the same machine, handling different responsibilities (data storage and running tasks). A given node can process more than one split at a time. Usually a single split is assigned to a single Map task. This translates to multiple Map tasks running on a single node, which is possible.
2) Data from the input file is read in serial fashion.
3) A node's processing capacity is defined by its number of slots. If a node has 10 slots, it can process 10 tasks in parallel (these tasks may be Map/Reduce tasks). The cluster administrator usually configures the number of slots for each node based on its physical configuration, such as memory, physical storage, number of processor cores, etc.
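To make the slot idea concrete, here is a minimal sketch of how many sequential "waves" a set of map tasks needs when every slot runs one task at a time. The function name `map_waves` and its parameters are hypothetical, and the model assumes all tasks take about the same time; it is not Hadoop code.

```python
import math


def map_waves(num_tasks: int, slots_per_node: int, num_nodes: int = 1) -> int:
    """Return how many rounds of tasks are needed if each slot runs one task at a time.

    Simplifying assumption (not from Hadoop itself): every task takes the same
    time, so tasks are executed in neat rounds of `total_slots` tasks each.
    """
    total_slots = slots_per_node * num_nodes
    if total_slots <= 0:
        raise ValueError("need at least one slot")
    return math.ceil(num_tasks / total_slots)


# 16 map tasks on one node with 10 slots: 10 run in parallel, then the other 6.
print(map_waves(16, 10))
# The same 16 tasks on two such nodes fit in a single round.
print(map_waves(16, 10, num_nodes=2))
```

So a single node with 10 slots does run tasks in parallel, but only up to 10 at a time.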

Related

How does Hadoop framework decides the node to run Map job

As per my understanding, files stored in HDFS are divided into blocks, and each block is replicated to multiple nodes, 3 by default. How does the Hadoop framework choose the node on which to run a map task, out of all the nodes on which a particular block is replicated?
As far as I know, there will be as many map tasks as there are blocks.
See manual here.
Usually, the framework chooses nodes close to the input block in order to reduce the network bandwidth used by the map task.
That's all I know.
In MapReduce 1 it depends on how many map tasks are already running on the DataNode that hosts a replica, because the number of map slots per node is fixed in MR1. In MR2 there are no fixed slots, so it depends on the number of tasks already running on that node.
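The preference order described in these answers (a node holding a replica first, then a node close to one, then anywhere with room) can be sketched as follows. This is only an illustration under stated assumptions: `pick_node`, the `free_capacity` and `rack_of` dictionaries, and the tie-breaking rule (most free capacity wins) are all hypothetical and are not how any particular Hadoop scheduler is implemented.

```python
def pick_node(replica_hosts, free_capacity, rack_of):
    """Choose a node for a map task: data-local, then rack-local, then any node.

    replica_hosts: set of node names holding a replica of the input block
    free_capacity: dict node name -> free slots/containers (hypothetical metric)
    rack_of:       dict node name -> rack id
    Returns a node name, or None if no node has free capacity.
    """
    candidates = [n for n, free in free_capacity.items() if free > 0]
    if not candidates:
        return None

    def most_free(nodes):
        # Assumption: prefer the node with the most free capacity, name as tie-break.
        return min(nodes, key=lambda n: (-free_capacity[n], n))

    # 1) Data-local: a node that already stores a replica of the block.
    local = [n for n in candidates if n in replica_hosts]
    if local:
        return most_free(local)

    # 2) Rack-local: a node on the same rack as some replica.
    replica_racks = {rack_of.get(h) for h in replica_hosts}
    rack_local = [n for n in candidates if rack_of.get(n) in replica_racks]
    if rack_local:
        return most_free(rack_local)

    # 3) Otherwise any node with free capacity; the block is read over the network.
    return most_free(candidates)


racks = {"node1": "r1", "node2": "r1", "node3": "r2", "node4": "r2"}
print(pick_node({"node1", "node3"}, {"node1": 0, "node2": 1, "node3": 2, "node4": 4}, racks))
```

In the example, node3 is chosen even though node4 has more free capacity, because node3 holds a replica.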

Can a slave node have multiple blocks of the same file in hadoop?

Say I have a Hadoop cluster where one node is the master node and the other is a data node. The slave node is an 8-core machine, just to make sure there are enough cores to process jobs in parallel. Can I still split the file into, say, 3 blocks and have the slave node store all three blocks separately? In other words, "if we want to utilize all the slave nodes in a Hadoop cluster", is there a 1:1 relation between the number of slave nodes and the maximum number of blocks of a file? If yes, how would MapReduce work in such a case? Will the master node fire three map tasks at the slave node and have each mapper pick up one block on the slave node?
My question can be seen in a different way. If we have a 1GB file on a cluster with 3 data nodes, how is it divided into 64MB blocks and how are they distributed between the three nodes?
The second question is easier for me to understand, so I will take that first.
From HDFS Perspective:
With a 64MB block size, a 1GB file consists of 16 blocks. Blocks are placed somewhat randomly across the DataNodes (when you have more DataNodes than the replication factor), but you can expect a roughly even distribution as long as you do not load the data from one of the DNs. If you do, that DN will hold a replica of every block, and the other DNs will hold the remaining replicas, distributed more or less evenly (still randomly placed). So yes: if a file consists of 16 blocks and you have only 3 DNs with a replication factor of 3, all 3 DNs will hold all 16 blocks.
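The arithmetic behind the 16 blocks, and why 3 DataNodes with a replication factor of 3 each end up with every block, can be checked with a short sketch. The helper names are hypothetical; the only inputs are the numbers from the answer above.

```python
import math

MB = 1024 * 1024


def num_blocks(file_size: int, block_size: int = 64 * MB) -> int:
    """Number of HDFS blocks needed for a file (the last block may be partial)."""
    return math.ceil(file_size / block_size)


def average_replicas_per_node(file_size: int, replication: int, datanodes: int,
                              block_size: int = 64 * MB) -> float:
    """Average number of block replicas each DataNode stores for this file."""
    return num_blocks(file_size, block_size) * replication / datanodes


blocks = num_blocks(1024 * MB)
print(blocks)                                          # blocks in a 1GB file
print(average_replicas_per_node(1024 * MB, 3, 3))      # replicas per DN, 3 DNs, RF 3
```

A DataNode stores at most one replica of any given block, so with 16 blocks no node can hold more than 16 replicas of this file. When the average is also 16, every node must hold all 16 blocks.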
From YARN's perspective when you run the MapReduce job:
YARN tries to find a container for a mapper on a node that has the data locally. There is a configurable wait time for a free container on such nodes before YARN starts the mapper on a node that does not have the data.
YARN does not rely on physical cores directly. You can configure the number of virtual cores and the amount of memory a container uses, and based on these values YARN determines how many containers a NodeManager can host.
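As a rough sketch of that calculation: if a NodeManager advertises a certain amount of memory and a number of virtual cores, and every map container asks for a fixed share of each, the number of containers that fit is bounded by the scarcer resource. The function and parameter names below are hypothetical, and this is a simplification; the real number depends on the scheduler and its resource calculator (some configurations consider memory only).

```python
def containers_per_node(node_memory_mb: int, node_vcores: int,
                        container_memory_mb: int, container_vcores: int = 1) -> int:
    """Upper bound on identical containers that fit on one NodeManager.

    Simplifying assumption: both memory and vcores are enforced, and all
    containers request the same amount of each.
    """
    if container_memory_mb <= 0 or container_vcores <= 0:
        raise ValueError("container requests must be positive")
    by_memory = node_memory_mb // container_memory_mb
    by_vcores = node_vcores // container_vcores
    return min(by_memory, by_vcores)


# An 8-vcore node offering 8192 MB to YARN, with 1024 MB / 1 vcore map containers.
print(containers_per_node(8192, 8, 1024, 1))
# Doubling the container memory halves what fits; memory is now the limit.
print(containers_per_node(8192, 8, 2048, 1))
```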
Further reading on YARN tuning on Cloudera Engineering blog
However:
From the first part of the question, as I understand it, you want to achieve parallelism by choosing the block size used to split your data files.
MapReduce does not care about HDFS blocks; it has its own abstraction for splitting the input, called InputSplit. InputSplits are fed to the mappers by the InputFormat. An InputSplit also records where its data is available locally, so that YARN can look for a container on a node that has the split in local storage. I suggest checking the API and the available implementations of InputFormat, as they will most likely suit your needs. If they do not, you can still write your own implementation and specify it via the job configuration.

Input Splits in Hadoop

If the input file size is 200MB, there will be 4 blocks/input splits, but each data node will have a mapper running on it. If all 4 input splits are on the same data node, will only one map task be executed?
Or how does the number of map tasks depend on the input splits?
Also, will the Task Tracker run on all the data nodes and the Job Tracker on one data node in the cluster?
The number of map tasks depends entirely on the number of splits, not on the location of the blocks/splits. So in your case it will be 4.
As you say, all the blocks may be on one node, but you also have to consider that there will be replicas of those blocks on different nodes. MapReduce has a concept called 'data locality', which Hadoop tries to take advantage of. Another thing to consider is the availability of resources. So for each block (any one of its replicas, commonly 3), Hadoop looks for a DataNode where the block is present and resources are available. It may end up in the situation you described: replicas of all 4 blocks are present on one node, and that node has the resources the job needs. But there will be 4 map tasks, that is for sure.
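The count of 4 can be reproduced with a small sketch modeled on the split logic of file-based input formats as I understand it: the split size is the block size clamped between a minimum and a maximum, and a final remainder of up to 10% over the split size is folded into the last split rather than getting its own. The function names and the single-splittable-file assumption are mine, not Hadoop's API.

```python
MB = 1024 * 1024
SPLIT_SLOP = 1.1  # assumed tolerance: last split may be up to 10% larger


def compute_split_size(block_size: int, min_size: int = 1,
                       max_size: int = 2**63 - 1) -> int:
    """Clamp the block size between the configured min and max split sizes."""
    return max(min_size, min(max_size, block_size))


def split_lengths(file_size: int, block_size: int, min_size: int = 1,
                  max_size: int = 2**63 - 1) -> list:
    """Lengths of the input splits for one splittable file."""
    split_size = compute_split_size(block_size, min_size, max_size)
    lengths = []
    remaining = file_size
    # Cut full-size splits while clearly more than one split's worth remains.
    while remaining / split_size > SPLIT_SLOP:
        lengths.append(split_size)
        remaining -= split_size
    if remaining > 0:
        lengths.append(remaining)
    return lengths


# 200MB file, 64MB blocks: three 64MB splits plus one 8MB split.
print(len(split_lengths(200 * MB, 64 * MB)))
# Capping the split size at 32MB produces more, smaller splits (more mappers).
print(len(split_lengths(200 * MB, 64 * MB, max_size=32 * MB)))
```

One map task is launched per split, regardless of which nodes hold the underlying blocks.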

Does execution of Map and Reduce phase happen inside each DataNode by Node Manager?

I understand that the Resource Manager sends the MapReduce program to each Node Manager so that MapReduce gets executed on each node.
But after seeing this image, I am confused about where the actual Map and Reduce tasks are executed and how shuffling happens between DataNodes.
Isn't it time-consuming to sort and shuffle/send data across different DataNodes to perform the Reduce job? Please explain.
Also, let me know what the Map Node and Reduce Node are in this diagram.
Image Src: http://gppd-wiki.inf.ufrgs.br/index.php/MapReduce
An input split is a logical chunk of a file stored on HDFS. By default an input split represents one block of the file, and the blocks of a file may be stored on many DataNodes in the cluster.
A container is a task execution template allocated by the Resource Manager on any of the data nodes in order to execute Map/Reduce tasks.
First, the Map tasks are executed in containers that the Resource Manager allocates as close as possible to the input split's location, following the rack awareness policy (Local/Rack Local/DC Local).
The Reduce tasks are executed in containers on arbitrary data nodes, and each reducer copies its relevant data from every mapper through the Shuffle/Sort process.
The mappers prepare their results so that the output is internally partitioned and, within each partition, the records are sorted by key. The partitioner determines which reducer should fetch each partition.
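A minimal sketch of the map-side step just described: each record is assigned to a partition (one per reducer) and each partition is sorted by key. Hadoop's default HashPartitioner computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`; the sketch below substitutes `zlib.crc32` as a stand-in hash, so the partition numbers will not match what Hadoop would produce. The function names are hypothetical.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_reducers: int) -> int:
    """Pick the reducer for a key (crc32 is a stand-in for Java's hashCode)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def partition_and_sort(records, num_reducers: int) -> dict:
    """Group (key, value) pairs by partition and sort each partition by key."""
    partitions = defaultdict(list)
    for key, value in records:
        partitions[partition_for(key, num_reducers)].append((key, value))
    return {p: sorted(pairs, key=lambda kv: kv[0]) for p, pairs in partitions.items()}


map_output = [("pear", 1), ("apple", 1), ("fig", 1), ("apple", 1), ("kiwi", 1)]
for partition, pairs in sorted(partition_and_sort(map_output, 2).items()):
    print(partition, pairs)
```

Because the partition depends only on the key, every occurrence of a key from every mapper ends up at the same reducer.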
During Shuffle and Sort, each reducer copies its relevant partitions from every mapper's output over HTTP. Eventually every reducer merges and sorts the copied partitions into a single sorted file before the reduce() method is invoked.
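The reduce-side merge can be illustrated with a short sketch: the reducer holds several already-sorted runs (one per mapper), merges them into one sorted stream, and groups the values per key before calling reduce(). The function name `merge_and_group` is hypothetical, and in-memory lists stand in for the files a real reducer would merge.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_and_group(sorted_runs):
    """Merge key-sorted runs of (key, value) pairs and yield (key, [values])."""
    # heapq.merge streams the runs in key order without re-sorting everything.
    merged = heapq.merge(*sorted_runs, key=itemgetter(0))
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, [value for _, value in group]


# Two partitions fetched from two different mappers, each already sorted by key.
from_mapper_1 = [("a", 1), ("c", 1)]
from_mapper_2 = [("a", 2), ("b", 5)]
for key, values in merge_and_group([from_mapper_1, from_mapper_2]):
    print(key, values)
```

Since each run arrives sorted, the merge is cheap compared with a full sort, which is part of why the mappers sort their output first.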
The image below may clarify this further.
Image Src: http://www.ibm.com/developerworks/cloud/library/cl-openstack-deployhadoop/

hadoop node unused for map tasks

I've noticed that all map and all reduce tasks are running on a single node (node1). I tried creating a file consisting of a single HDFS block which resides on node2. When running a MapReduce job whose input consists only of this block resident on node2, the task still runs on node1. I was under the impression that Hadoop prioritizes running tasks on the nodes that contain the input data. I see no errors reported in the log files. Any idea what might be going on here?
I have a 3-node cluster running on KVMs, created by following the Cloudera CDH4 distributed installation guide.
"I was under the impression that hadoop prioritizes running tasks on the nodes that contain the input data."
Well, there might be an exceptional case :
If the node holding the data block doesn't have any free CPU slots, it won't be able to start any mappers. In that scenario, instead of waiting, the data block is moved to a nearby node and processed there. Before that, though, the framework will try to process another replica of the block locally (if RF > 1).
HTH
I don't understand what you mean by "I tried creating a file consisting of a single hdfs block which resides on node2". I don't think you can "direct" a Hadoop cluster to store a block on a specific node.
Hadoop decides the number of mappers based on the input's size. If the input size is less than the HDFS block size (the default, I think, is 64MB), it will spawn just one mapper.
You can set the job parameter "mapred.max.split.size" to whatever size you want in order to force multiple mappers to be spawned (the default should suffice in most cases).
