When is the Block Placement Policy used? - hadoop

I know that the dfs.block.replicator.classname property can be used to change the BlockPlacementPolicy. I want to know when exactly this policy is used to place data. For example, is it used when -copyFromLocal/-put are executed?
I think the output of a job will also be placed according to this policy.
And secondly, when the property is specified in the conf file, it will affect the entire Hadoop cluster. If I am using a shared cluster, is there a way to change the BlockPlacementPolicy only for jobs that are executed under my user, or is there a way to change the policy for each job?
I am using the hadoop streaming jar on a 4 node cluster.

The block placement policy is used whenever a new block of data is written to HDFS. That could be when data is ingested into HDFS (for example via -put/-copyFromLocal) or when a job writes its output to HDFS, etc. It is used for optimal placement of blocks, so that blocks are distributed uniformly across the HDFS cluster.
For example, the algorithm used by the default block placement policy class (BlockPlacementPolicyDefault) is:
The replica placement strategy is that if the writer is on a datanode,
the 1st replica is placed on the local machine, otherwise a random datanode.
The 2nd replica is placed on a datanode that is on a different rack. The 3rd
replica is placed on a datanode which is on a different node of the same rack as
the second replica.
The block placement policy is also used by the following HDFS utilities:
Balancer: balances disk space usage across HDFS. In this case the BlockPlacementPolicy can be used for placing blocks on other nodes in order to re-balance the cluster.
NamenodeFsck: utility to check HDFS for inconsistencies. In this case the BlockPlacementPolicy is used for checking the number of mis-replicated blocks.
You can have your own custom block placement class. To do that, you need to extend the BlockPlacementPolicy class and set the configuration parameter dfs.block.replicator.classname to your custom class name in hdfs-site.xml.
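A minimal hdfs-site.xml entry might look like the following sketch (com.example.MyBlockPlacementPolicy is a hypothetical class name used only for illustration):
<property>
  <name>dfs.block.replicator.classname</name>
  <value>com.example.MyBlockPlacementPolicy</value>
  <description>Custom block placement policy; the class must extend BlockPlacementPolicy.</description>
</property>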
By default, the BlockPlacementPolicyDefault class is used for block placement:
final Class<? extends BlockPlacementPolicy> replicatorClass = conf.getClass(
    DFSConfigKeys.DFS_BLOCK_REPLICATOR_CLASSNAME_KEY,
    DFSConfigKeys.DFS_BLOCK_REPLICATOR_CLASSNAME_DEFAULT,
    BlockPlacementPolicy.class);
You can't change the block placement policy for each job. The reason for this is that the block placement policy is instantiated only once, when the NameNode comes up.
Following is the sequence of calls that initializes the BlockPlacementPolicy. These steps are executed when the NameNode is started:
Initialize the NameNode, when the NameNode is started:
NameNode::initialize(conf); // Initialize NameNode
NameNode::loadNamesystem(conf); // Load the name system
Initialize FSNamesystem. FSNamesystem does all the bookkeeping work on the NameNode:
FSNamesystem.loadFromDisk(conf); // Loads the FS image from disk
Instantiate the BlockManager. This is called while instantiating FSNamesystem:
this.blockManager = new BlockManager(this, conf);
Instantiate the BlockPlacementPolicy. This is called by the BlockManager:
blockplacement = BlockPlacementPolicy.getInstance(
    conf, datanodeManager.getFSClusterStats(),
    datanodeManager.getNetworkTopology(),
    datanodeManager.getHost2DatanodeMap());
Since the policy is instantiated only once, you can't change it for each job.

Related

How are containers assigned in YARN?

In MapReduce 1, the JobTracker gets the block information from the NameNode and then assigns tasks (most likely) to TaskTrackers that are available on the same nodes where the data blocks are present, thereby improving performance.
How is this taken care of in YARN? Is the ApplicationMaster responsible for getting block information from the NameNode?
If so, how are containers assigned to that ApplicationMaster? Does the ResourceManager consider the data block locations while assigning containers, or does it randomly assign a container on any node?
Technically speaking, it is the role of the JobClient to compute the input splits. This split information is placed in HDFS, from where the ApplicationMaster picks it up and uses it while requesting containers from the ResourceManager.
So, technically, when the ApplicationMaster requests containers for all the map tasks, the information about each map task's data locality is passed along to the ResourceManager. The scheduler uses this information to make scheduling decisions, attempting to assign tasks local to their data (a sketch of such a locality-aware request follows).
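As a rough illustration of what a locality-aware container request looks like, here is a minimal sketch using the YARN client API (this is my own example, not code from MapReduce itself; the node and rack names are hypothetical, and the registerApplicationMaster()/allocate() heartbeat loop is omitted):
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class LocalityAwareRequestSketch {
  public static void main(String[] args) throws Exception {
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(new YarnConfiguration());
    rmClient.start();

    Resource capability = Resource.newInstance(1024, 1); // 1 GB, 1 vcore per container
    // Hosts/racks holding the split's HDFS blocks, taken from the input split metadata.
    String[] nodes = {"datanode1.example.com"};
    String[] racks = {"/default-rack"};
    ContainerRequest request =
        new ContainerRequest(capability, nodes, racks, Priority.newInstance(0));
    rmClient.addContainerRequest(request); // the scheduler tries to honour these locality hints
  }
}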

Hadoop Distributed File system

I have a file.txt that has 3 blocks (block a, block b, block c). How does Hadoop write these blocks into the cluster? My question is: does Hadoop write the blocks in parallel, or does block b have to wait until block a has been written to the cluster? In other words, are block a, block b and block c written to the Hadoop cluster in parallel?
When you copy a file from the local file system to HDFS or when you create a new file in HDFS: blocks are copied sequentially - first, the first block is copied to a datanode, then the second block is copied to a datanode and so on.
What is done in parallel, however, is replica placement: while a datanode receives data of the block from the client, the datanode saves the data in a file, which represents the block, and, simultaneously re-sends the data to another datanode, which is supposed to create another replica of the block.
When you copy a file from one location to another location inside an HDFS cluster, or between two HDFS clusters, you do it in parallel using DistCp.
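For example (the NameNode hosts here are hypothetical):
hadoop distcp hdfs://namenode1:8020/source/path hdfs://namenode2:8020/target/path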
When you attempt to copy a file from a local system to HDFS, or create a new file there, the blocks are written across a sequence of datanodes, one after another, much like the elements of an array: a consecutive, sequential arrangement of data blocks.
While this handshake is taking place, the moment a datanode receives the data for a block it writes it to a local file, effectively a save point, and the same process then occurs sequentially for the remaining blocks; the replication along the way makes the data redundant, and the saved state can be used for comparison.
Whereas when you copy a file from one location to another inside the same cluster, or between two different clusters, you use DistCp (Apache Hadoop DistCp).
Hadoop is designed to keep the data in a consistent, recoverable state until the write transaction has been completed.

How does hadoop configuration work

I see that we can configure different parameters in a Hadoop cluster. I am a bit confused: if we configure the master, are these configurations replicated to the client nodes, or should every node be configured separately?
For example, if the block size is set to 128MB on the master, will all client nodes use 128MB, or, since those nodes are not configured, will they fall back to the default value of 64MB? And if the master's setting is used, what about configs that depend on system parameters, such as the number of cores; how are those handled?
Configuration in Hadoop is more complex than that. Hadoop actually lets API users decide how the configuration is used.
For example, let's look at how the file block size is determined. The file block size uses the value of fs.local.block.size in the configuration.
fs.local.block.size is not set in the client-side configuration
In this situation, conf.get("fs.local.block.size"); returns null on the client side.
If you use the following code (running in your client) to create a file in HDFS,
FileSystem fs = FileSystem.get(conf);
FSDataOutputStream output = fs.create(new Path("/new/file/in/hdfs"));
// write your data to output...
then fs.local.block.size uses the default value, which is 32MB (32 * 1024 * 1024).
However, if you write a MapReduce job that outputs some files (I assume you use TextOutputFormat; some custom output formats may change the following behaviour), the file block size is determined by the configuration of the TaskTracker. So in this situation, if your configuration is inconsistent across nodes, you may find that the MapReduce output files have different block sizes.
fs.local.block.size is set in the client-side configuration
In this situation, conf.get("fs.local.block.size"); returns the configured value on the client side.
If you use the following code (running in your client) to create a file in HDFS,
FileSystem fs = FileSystem.get(conf);
FSDataOutputStream output = fs.create(new Path("/new/file/in/hdfs"));
// write your data to output...
then the block size is the value of conf.get("fs.local.block.size"), i.e. the client-side setting, which FileSystem.create uses when creating the file.
However, if you write a MapReduce job to output some files, it's a little more complex.
If, on a TaskTracker, fs.local.block.size is not marked final, the block size of the output files written on that TaskTracker will be the client-side fs.local.block.size, because the job configuration is submitted to the TaskTracker.
If, on that TaskTracker, fs.local.block.size is marked final, it cannot be overridden by the job configuration, so the block size on that TaskTracker will be the fs.local.block.size of the TaskTracker node. In this situation, if your configuration is inconsistent across nodes, you may find that the MapReduce output files have different block sizes.
The above analysis is only appropriate for fs.local.block.size. For other configuration parameters, you may need to read the related source code.
Finally, I recommend keeping all of your configurations consistent to avoid running into strange behaviours.
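As an aside, if you need a specific block size for a particular file regardless of what the cluster or client configuration resolves to, the client can pass it explicitly when creating the file. Here is a minimal sketch of that approach (my own illustration, not from the original answer; the path and the 128MB value are arbitrary):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ExplicitBlockSizeSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    long blockSize = 128L * 1024 * 1024;  // 128 MB, chosen here purely for illustration
    short replication = 3;
    int bufferSize = conf.getInt("io.file.buffer.size", 4096);
    FSDataOutputStream output = fs.create(
        new Path("/new/file/in/hdfs"), true, bufferSize, replication, blockSize);
    // write your data to output...
    output.close();
  }
}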

Mapreduce dataflow Internals

I have tried to understand the anatomy of MapReduce from various books and blogs, but I am not getting a clear idea.
What happens when I submit a job to the cluster using this command:
(The files have already been loaded into HDFS.)
bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output
Can anyone explain the sequence of operations that happens, right from the client and then inside the cluster?
The process goes like this:
1- The client configures and sets up the job via Job and submits it to the JobTracker (a minimal driver sketch is shown after this list).
2- Once the job has been submitted the JobTracker assigns a job ID to this job.
3- Then the output specification of the job is verified. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program.
4- Once this is done, InputSplits for the job are created(based on the InputFormat you are using). If the splits cannot be computed, because the input paths don’t exist, for example, then the job is not submitted and an error is thrown to the MapReduce program.
5- Based on the number of InputSplits, map tasks are created, and each InputSplit is processed by one map task.
6- Then the resources required to run the job are copied across the cluster, such as the job JAR file, the configuration file, etc. The job JAR is copied with a high replication factor (which defaults to 10) so that there are plenty of copies across the cluster for the TaskTrackers to access when they run tasks for the job.
7- Then, based on the location of the data blocks that are going to be processed, the JobTracker directs TaskTrackers to run map tasks on the very same DataNodes where those data blocks are present. If there are no free CPU slots on such a DataNode, the data is moved to a nearby DataNode with free slots and processing continues without having to wait.
8- Once the map phase starts, individual records (key-value pairs) from each InputSplit are processed by the Mapper one by one, until the entire InputSplit is consumed.
9- Once the map phase is over, the output undergoes shuffle, sort and combine. After this the reduce phase starts, giving you the final output.
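As a minimal sketch of step 1 above, here is a driver along the lines of the classic WordCount example (my own illustration of the standard Job API; it may differ from what org.myorg.WordCount actually contains):
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountSketch {

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);  // emit (word, 1) for every token
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));  // total count per word
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountSketch.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input splits are computed from this path (step 4)
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist (step 3)
    System.exit(job.waitForCompletion(true) ? 0 : 1);        // submit and wait (steps 2-9)
  }
}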
HTH

Hadoop: force 1 mapper task per node from jobconf

I want to run one task (mapper) per node on my Hadoop cluster, but I cannot modify the configuration with which the TaskTracker runs (I'm just a user).
For this reason, I need to be able to push the option through the job configuration. I tried to set mapred.tasktracker.map.tasks.maximum=1 on the hadoop jar command line, but the TaskTracker ignores it, as it has a different setting in its configuration file.
By the way, the cluster uses the Capacity Scheduler.
Is there any way I can force 1 task per node?
Edited:
Why? I have a memory-bound task, so I want each task to use all the memory available to the node.
When you set the number of mappers, whether through the configuration files or by some other means, it is just a hint to the framework; it does not guarantee that you will get only the specified number of mappers. The creation of mappers is actually governed by the number of InputSplits, and split creation is carried out by the logic in your InputFormat. If you really want just one mapper to process an entire file, make isSplitable() return false in the InputFormat class you are using (see the sketch below). But why would you do that? The power of Hadoop lies in distributed parallel processing.
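For illustration, a non-splittable input format might look like the following sketch (the class name is hypothetical; note that this gives you one mapper per file, which is not the same thing as one mapper per node):
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class NonSplittableTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;  // never split: a single mapper processes the entire file
  }
}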
