How containers are assigned in YARN? - hadoop

In Mapreduce 1, the Jobtracker get the Block information from the NameNode and then assign Task(most likely) to the Task Tracker that are available in the Same node as where the Datablocks is present. There by the performance can be increased.
How this is taken care in YARN? Is Application Manager responsible for getting block information from the NameNode?
If so, How the containers are assigned to those Application master? Did Resource Manager considers the DataBlock location while assigning the Container? or it randomly assign any container in a Node?

Technically speaking its the role of JobClient to compute the input splits, this splits information is placed in HDFS from where ApplicationMaster will picks it off and uses this information while requesting the containers from ResourceManager.
So, technically, Application Master while requesting the containers for all the map tasks the information about each map task's data locality is passed along to ResourceManager. The scheduler uses this information to make scheduling decisions, attempting to assign tasks local to data.

Related

When is the Block Placement Policy used?

I know that dfs.block.replicator.classname property can be used to change the BlockPlacementPolicy. I want to know when exactly is this policy used to place data? Like is it used when -copyFromLocal/-put are executed?
I think the output of a job will also be placed according to this policy.
And secondly, the property when specified in the conf file will affect the entire hadoop cluster. If I am using a shared cluster, is there a way to change the BlockPlacement policy for only jobs that are executed under my user, or is there a way to change the policy for each job?
I am using the hadoop streaming jar on a 4 node cluster.
The block placement policy is used, whenever a new block of data is written to HDFS. It could be when the data is ingested into HDFS or a job writes data into HDFS etc. It is used for optimal placement of blocks, so that there is a uniform distribution blocks in a HDFS cluster.
For e.g. the algorithm used by default block placement policy class (BlockPlacementPolicyDefault) is:
The replica placement strategy is that if the writer is on a datanode,
the 1st replica is placed on the local machine, otherwise a random datanode.
The 2nd replica is placed on a datanode that is on a different rack. The 3rd
replica is placed on a datanode which is on a different node of the rack as
the second replica.
The block placement policy is also used by the following HDFS utilities:
Balancer: Balances disk space usage on HDFS. In this case the BlockPlacementPolicy could be used for placing the blocks to other nodes, in order to re-balance the cluster
NamenodeFsck: - Utility to check the HDFS for inconsistencies. In this case BlockPlacementPolicy is used for checking the number of mis-replicated blocks.
You can have your own custom block placement class. To do that you need to extend BlockPlacementPolicy class and set the configuration parameter dfs.block.replicator.classname to your custom class name in hdfs-site.xml.
By default BlockPlacementPolicyDefault class is used for block placement:
final Class<? extends BlockPlacementPolicy> replicatorClass = conf.getClass(
DFSConfigKeys.DFS_BLOCK_REPLICATOR_CLASSNAME_KEY,
DFSConfigKeys.DFS_BLOCK_REPLICATOR_CLASSNAME_DEFAULT,
BlockPlacementPolicy.class);
You can't change the block placement policy for each job. The reason for this is, the block placement policy is instantiated once, when the NameNode comes up.
Following is the sequence of calls, to initialize BlockPlacementPolicy. These steps are executed, when the NameNode is started:
Initialize NameNode, when NameNode is started
NameNode::initialize(conf); // Initialize NameNode
NameNode::loadNamesystem(conf); // Load name system
Initialize FsNameSystem. FsNameSystem does all book keeping work on NameNode
FSNamesystem.loadFromDisk(conf); // Loads FS Image from disk
Instantiate BlockManager. This is called while instantiating FsNameSystem
this.blockManager = new BlockManager(this, conf);
InstantiateBlockPlacementPolicy. This is called by BlockManager.
blockplacement = BlockPlacementPolicy.getInstance(
conf, datanodeManager.getFSClusterStats(),
datanodeManager.getNetworkTopology(),
datanodeManager.getHost2DatanodeMap());
Since this is instantiated once, you can't change this for each job.

Hadoop Yarn - how to request fix number of containers

How can Apache Spark or Hadoop Mapreduce request a fixed number of containers?
In Spark yarn-client mode, it can be requested by setting the configuration spark.executor.instances, which is directly related to the number of YARN containers it gets. How does Spark transform this into a Yarn parameter that is understood by Yarn?
I know by default, it can depend upon number of splits and configuration values yarn.scheduler.minimum-allocation-mb, yarn.scheduler.minimum-allocation-vcores. But Spark has ability to exactly request fixed number of containers. How can any AM do that?
In Hadoop Map reduce, Number of containers for map task is decided based on number of input splits. It is based on the size of source file. for every Input split, one map container will be requested.
By default number of Reducer per job is one. It can be customized by passing arguments to mapreduce.reduce.tasks. Pig & Hive has different logic to decide number of reducers. ( this also can be customized).
One container (Reduce container, usually bigger than map container) will be requested per reducers.
Total number of mappers & reducers will be clearly defined in job config file during job submission.
I think it's by using AM api that yarn provides. AM provider can use rsrcRequest.setNumContainers(numContainers); http://hadoop.apache.org/docs/r2.5.2/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html#Writing_a_simple_Client
Here I had similar discussion on other questionl. Yarn container understanding and tuning

Mapreduce dataflow Internals

I tried to understand map reduce anatomy from various books/blogs.. But I am not getting a clear idea.
What happens when I submit a job to the cluster using this command:
..Loaded the files into hdfs already
bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output
Can anyone explain the sequence of opreations that happens right from the client and inside the cluster?
The processes goes like this :
1- The client configures and sets up the job via Job and submits it to the JobTracker.
2- Once the job has been submitted the JobTracker assigns a job ID to this job.
3- Then the output specification of the job is verified. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program.
4- Once this is done, InputSplits for the job are created(based on the InputFormat you are using). If the splits cannot be computed, because the input paths don’t exist, for example, then the job is not submitted and an error is thrown to the MapReduce program.
5- Based on the number of InputSplits, map tasks are created and each InputSplits gets processed by one map task.
6- Then the resources which are required to run the job are copied across the cluster like the the job JAR file, the configuration file etc. The job JAR is copied with a high replication factor (which defaults to 10) so that there are lots of copies across the cluster for the tasktrackers to access when they run tasks for the job.
7- Then based on the location of the data blocks, that are going to get processed, JobTracker directs TaskTrackers to run map tasks on that very same DataNode where that particular data block is present. If there are no free CPU slots on that DataNode, the data is moved to a nearby DataNode with free slots and the processes is continued without having to wait.
8- Once the map phase starts individual records(key-value pairs) from each InputSplit start getting processed by the Mapper one by one completing the entire InputSplit.
9- Once the map phase gets over, the output undergoes shuffle, sort and combine. After this the reduce phase starts giving you the final output.
Below is the pictorial representation of the entire process :
Also, I would suggest you to go through this link.
HTH

Hadoop: force 1 mapper task per node from jobconf

I want to run one task (mapper) per node on my Hadoop cluster, but I cannot modify the configuration with which the tasktracker runs (i'm just a user).
For this reason, I need to be able to push the option through the job configuration. I tried to set the mapred.tasktracker.map.tasks.maximum=1 at hadoop jar command, but the tasktracker ignores it as it has a different setting in its configuration file.
By the way, the cluster uses the Capacity Scheduler.
Is there any way I can force 1 task per node?
Edited:
Why? I have a memory-bound task, so I want each task to use all the memory available to the node.
when you set the no of mappers, either through the configuration files or by some other means, it's just a hint to the framework. it doesn't guarantee that you'll get only the specified no of mappers. the creation of mappers is actually governed by the no of Splits. and the split creation is carried out by the logic which your InputFormat holds. if you really want to have just one mapper to process the entire file, set "issplittable" to true in the InputFormat class you are using. but why would you do that?the power of hadoop actually lies in distributed parallel processing.

Does it matter where I submit hadoop jobs from?

Does it have any measurable effect on resources whether I submit a bunch of hadoop jobs from different client servers or all from the same one? I would think not since all the work is done in the cluster. Is this correct?
The only thing which is resource intensive on the client submitting to the Hadoop cluster is the calculation of the input splits. When the input data is huge or when too many jobs are submitted from the same client then because of the input split calculations, the job submission might become a bit slow.
I am not able to recall the Hadoop release or the parameter, but a configurable parameter was included to move the calculation of the input splits from the client submitting a job to the Hadoop cluster.
It really shouldn't matter where you submit your jobs from. The client itself doesn't do much, it uses RPC protocol to contact the services, and then just sits idle until the job is finished.
Also, the most important is what kind of scheduler you use to allocate resource, which is probably going to make the most significant difference and decide which resources to allocate to which job. More on job scheduling here.
I don't think you can move the input split calculation into Job Tracker in 'Classic' version. In YARN, you can move it using
"yarn.app.mapreduce.am.compute-splits-in-cluster"
I am guessing, Hadoop people didn't want to overload Job tracker with input split creation. Similar to the design decision of not assigning too much work for Namenode in HDFS.
In YARN, every job gets its own Application Master, so no worries about overloading a SPOF/bottleneck master like job tracker.
In reference to the original question, the client job would have to reach out to the namenode to get the block locations (I have see parts of code on block storage class calling data node for some meta data...not sure whether these happen during input split creation or in task tracker node) . This can become an issue if you are handling a lot of jobs on the same client node.
If you are using YARN, there would be a slight performance increase if all these communications happen inside the cluster.
Need to check how Oozie handles this issue.
Hopefully, this helps!
Arun

Resources