Hadoop Distributed File System - hadoop

I have a file.txt that has 3 blocks (block a, block b, block c). How does Hadoop write these blocks into the cluster? My question is: does Hadoop write the blocks in parallel? Or does block b have to wait until block a is written into the cluster? Or are block a, block b and block c written into the Hadoop cluster in parallel?

When you copy a file from the local file system to HDFS, or when you create a new file in HDFS, the blocks are copied sequentially: first the first block is copied to a datanode, then the second block, and so on.
What is done in parallel, however, is replica placement: while a datanode receives the data of a block from the client, it saves the data in a file that represents the block and simultaneously re-sends the data to another datanode, which is supposed to create another replica of the block.
When you copy a file from one location to another inside an HDFS cluster, or between two HDFS clusters, you do it in parallel using DistCp.
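For illustration, here is a minimal client-side sketch of such a copy (the paths and class name are made up for the example); the block-by-block write and the replication pipeline described above happen inside the HDFS client library, not in this code:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // loads *-site.xml configuration from the classpath
        FileSystem fs = FileSystem.get(conf);
        // Blocks of file.txt are written one after another; each block is
        // replicated through a datanode pipeline while it is being received.
        fs.copyFromLocalFile(new Path("file.txt"), new Path("/user/demo/file.txt"));
        fs.close();
    }
}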

When you attempt to copy a file, or create a new file, from a local system to HDFS, the blocks are written one after another, in a consecutive, array-like sequence rather than in parallel.
While a datanode is receiving a block from the client, it writes the data to a local file and at the same time forwards it to the next datanode in the pipeline, which repeats the process for its own replica; this is what provides the redundancy.
Whereas when you copy a file from one location to another inside the same HDFS cluster, or between two clusters, you use AHDC (Apache Hadoop DistCp), which runs in parallel.
Hadoop is designed to keep the file in a consistent state until the write has completed.

Related

When is the Block Placement Policy used?

I know that the dfs.block.replicator.classname property can be used to change the BlockPlacementPolicy. I want to know when exactly this policy is used to place data. For example, is it used when -copyFromLocal/-put are executed?
I think the output of a job will also be placed according to this policy.
And secondly, the property, when specified in the conf file, will affect the entire Hadoop cluster. If I am using a shared cluster, is there a way to change the BlockPlacementPolicy only for jobs that are executed under my user, or is there a way to change the policy per job?
I am using the hadoop streaming jar on a 4 node cluster.
The block placement policy is used whenever a new block of data is written to HDFS. This could be when data is ingested into HDFS, when a job writes data into HDFS, etc. It is used for optimal placement of blocks, so that there is a uniform distribution of blocks across the HDFS cluster.
For example, the algorithm used by the default block placement policy class (BlockPlacementPolicyDefault) is:
The replica placement strategy is that if the writer is on a datanode, the 1st replica is placed on the local machine, otherwise on a random datanode. The 2nd replica is placed on a datanode that is on a different rack. The 3rd replica is placed on a datanode which is on a different node of the same rack as the second replica.
The block placement policy is also used by the following HDFS utilities:
Balancer: balances disk space usage across HDFS. In this case the BlockPlacementPolicy can be used for placing blocks on other nodes in order to re-balance the cluster.
NamenodeFsck: utility to check HDFS for inconsistencies. In this case the BlockPlacementPolicy is used for checking the number of mis-replicated blocks.
You can have your own custom block placement class. To do that you need to extend the BlockPlacementPolicy class and set the configuration parameter dfs.block.replicator.classname to your custom class name in hdfs-site.xml.
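As a rough sketch of what such a class could look like (the package and class names below are hypothetical, and the exact base-class API and package differ between Hadoop versions), one option is to start from the default implementation and override only what you need:
package com.example.hdfs; // hypothetical package

import org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault;

// Inherits the default rack-aware behaviour; a real policy would override
// methods such as chooseTarget(...) to change where new replicas are placed.
public class MyBlockPlacementPolicy extends BlockPlacementPolicyDefault {
}
It would then be registered by setting dfs.block.replicator.classname to com.example.hdfs.MyBlockPlacementPolicy in hdfs-site.xml.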
By default BlockPlacementPolicyDefault class is used for block placement:
final Class<? extends BlockPlacementPolicy> replicatorClass = conf.getClass(
DFSConfigKeys.DFS_BLOCK_REPLICATOR_CLASSNAME_KEY,
DFSConfigKeys.DFS_BLOCK_REPLICATOR_CLASSNAME_DEFAULT,
BlockPlacementPolicy.class);
You can't change the block placement policy for each job. The reason is that the block placement policy is instantiated only once, when the NameNode comes up.
Following is the sequence of calls that initializes the BlockPlacementPolicy. These steps are executed when the NameNode is started:
1. Initialize the NameNode, when the NameNode is started:
NameNode::initialize(conf); // Initialize NameNode
NameNode::loadNamesystem(conf); // Load name system
2. Initialize FSNamesystem. FSNamesystem does all the book-keeping work on the NameNode:
FSNamesystem.loadFromDisk(conf); // Loads FS image from disk
3. Instantiate BlockManager. This is called while instantiating FSNamesystem:
this.blockManager = new BlockManager(this, conf);
4. Instantiate BlockPlacementPolicy. This is called by the BlockManager:
blockplacement = BlockPlacementPolicy.getInstance(
    conf, datanodeManager.getFSClusterStats(),
    datanodeManager.getNetworkTopology(),
    datanodeManager.getHost2DatanodeMap());
Since this is instantiated once, you can't change this for each job.

When loading a huge file into a Hadoop cluster, what happens if the client fails while transferring data to the datanodes?

For example, the file is 1280MB and the HDFS block size is 128MB. What happens when the client has only transferred 3 blocks and then fails? Does the NameNode keep a file of 3 blocks, or does it delete the 3 blocks?
No, it will not delete the 3 blocks. Here is how it works: assume block 4 is next in the queue maintained by the FSDataOutputStream. If, after writing some x bytes, a datanode fails due to a network issue, the pipeline is first closed and any unacknowledged data for that block is re-queued. The block being written is given a new identity, and this is communicated to the name node so that it can update its metadata for block 4; the failed datanode is removed from the pipeline, and the block's data is then written to the remaining good datanodes. The "Anatomy of a File Write" section in Hadoop: The Definitive Guide will give you a better understanding of how this is done.
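If you want to check what the NameNode actually recorded for a partially written file, a small sketch like the following (the path is hypothetical) lists the blocks it reports:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Ask the namenode for the file's metadata and its block locations.
        FileStatus status = fs.getFileStatus(new Path("/user/demo/huge.dat")); // hypothetical path
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        System.out.println("NameNode reports " + blocks.length + " block(s) for " + status.getPath());
    }
}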

How does Hadoop configuration work?

I see we can configure different parameters in a Hadoop cluster. I am a bit confused: if we configure the master, are these configurations replicated to the client nodes, or does every node have to be configured separately?
For example, if the block size is set to 128MB on the master, will all client nodes use 128MB, or, since those nodes are not configured, will they fall back to the default value of 64MB? And if the master's settings are used, how are configs that depend on system parameters, such as the number of cores, handled?
Configuration in Hadoop is more complex than that. Actually, Hadoop lets API users decide how the configuration is used.
For example, let's look at how the file block size is determined. The file block size uses the value of fs.local.block.size in the configuration.
fs.local.block.size is not set in the configuration on the client side
In this situation conf.get("fs.local.block.size"); returns null on the client side.
If you use the following code (which runs in your client) to create a file in HDFS,
FileSystem fs = FileSystem.get(conf);
FSDataOutputStream output = fs.create(new Path("/new/file/in/hdfs"));
// write your data to output...
then fs.local.block.size uses the default value, which is 32MB (32 * 1024 * 1024).
However, if you write a MapReduce job that outputs some files (I assume you use TextOutputFormat; some custom output formats may change the following behaviour), the file block size is determined by the configuration of the TaskTracker. So in this situation, if your configuration is inconsistent across nodes, you may find that the MapReduce output files have different block sizes.
fs.local.block.size is set in the configuration on the client side
In this situation you can use conf.get("fs.local.block.size"); to get the value of fs.local.block.size on the client side.
If you use the following code (which runs in your client) to create a file in HDFS,
FileSystem fs = FileSystem.get(conf);
FSDataOutputStream output = fs.create(new Path("/new/file/in/hdfs"));
// write your data to output...
then the block size is the value of conf.get("fs.local.block.size"), which FileSystem.create picks up from the client-side configuration.
However, if you write a MapReduce job that outputs some files, it's a little more complex.
If on a TaskTracker fs.local.block.size is not marked final, the block size of the output files written on that TaskTracker will be the client-side fs.local.block.size, because the job configuration is committed to the TaskTracker.
If on a TaskTracker fs.local.block.size is marked final, it cannot be overridden by the job configuration, so the block size on that TaskTracker will be the TaskTracker node's own fs.local.block.size. So in this situation, if your configuration is inconsistent across nodes, you may find that the MapReduce output files have different block sizes.
The above analysis only applies to fs.local.block.size. For other configuration properties, you may need to read the related source code.
Finally, I recommend keeping all of your configurations consistent, to avoid running into strange behaviour.
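If you do not want the block size to depend on whichever configuration the client happens to load, one option (a sketch with made-up values and path) is to pass the block size explicitly to FileSystem.create:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithBlockSize {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        long blockSize = 128L * 1024 * 1024; // 128MB, chosen for illustration
        short replication = 3;
        int bufferSize = 4096;
        // This overload takes the block size directly, so the resulting file's
        // block size does not depend on client-side or node-side configuration.
        FSDataOutputStream output = fs.create(
            new Path("/new/file/in/hdfs"), true, bufferSize, replication, blockSize);
        // write your data to output...
        output.close();
    }
}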

Flume NG and HDFS

I am very new to Hadoop, so please excuse the dumb questions.
I have the following knowledge:
The best use case for Hadoop is large files, which help efficiency when running MapReduce tasks.
Keeping the above in mind, I am somewhat confused about Flume NG.
Assume I am tailing a log file and logs are produced every second; the moment the log gets a new line, it will be transferred to HDFS via Flume.
a) Does this mean that Flume creates a new file for every line that is logged in the log file I am tailing, or does it append to the existing HDFS file?
b) Is append allowed in HDFS in the first place?
c) If the answer to b is yes, i.e. contents are appended constantly, how and when should I run my MapReduce application?
The above questions could sound very silly, but answers to them would be highly appreciated.
PS: I have not yet set up Flume NG or Hadoop; I am just reading articles to get an understanding of how they could add value to my company.
Flume writes to HDFS by means of the HDFS sink. When Flume starts and begins to receive events, the sink opens a new file and writes events into it. At some point the previously opened file has to be closed, and until then the data in the current block being written is not visible to other readers.
As described in the documentation, the Flume HDFS sink has several file-closing strategies (a configuration sketch follows the list):
each N seconds (specified by rollInterval option)
after writing N bytes (rollSize option)
after writing N received events (rollCount option)
after N seconds of inactivity (idleTimeout option)
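For instance, a minimal HDFS sink configuration might look like this (a sketch only: the agent, sink and channel names are hypothetical, while the hdfs.* option names come from the Flume documentation):
# hypothetical agent a1 with sink k1 and channel c1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/logs
# close the file every 300 seconds
a1.sinks.k1.hdfs.rollInterval = 300
# ...or after roughly 128 MB of data
a1.sinks.k1.hdfs.rollSize = 134217728
# 0 disables rolling based on the number of events
a1.sinks.k1.hdfs.rollCount = 0
# close the file after 60 seconds of inactivity
a1.sinks.k1.hdfs.idleTimeout = 60
# prefix in-progress files with a dot so MR jobs ignore them
a1.sinks.k1.hdfs.inUsePrefix = .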
So, to your questions:
a) Flume writes events to the currently open file until it is closed (and a new file is opened).
b) Append is allowed in HDFS, but Flume does not use it. After a file is closed, Flume does not append any more data to it.
c) To hide the currently open file from your MapReduce application, use the inUsePrefix option: all files whose names start with . are not visible to MR jobs.

Chain of events when running a MapReduce job

I'm looking for some specific information regarding the chain of events when running a MapReduce job on a Hadoop cluster.
Let's assume that my Reduce tasks are on the verge of completion. After my last reducer has written its output to the output file, how many replicas of the output file are there?
What exactly happens after the last reducer has finished writing to the output file? When does the NameNode request the respective DataNodes to replicate the output file? And how is the NameNode informed that the output file is ready? Who conveys that information to the NameNode?
Thank you!
The reduce tasks write their output to HDFS. They do this by first communicating with the name node to request a block. The name node then tells the reducer which data nodes to write to, and the reducer sends the data directly to the first data node, which then sends it to the second data node, which sends it to the third. Typically the name node will keep things local, so the first data node is probably the same machine that is running the reduce task.
Once the reducer has finished writing its output and the data nodes have confirmed this, the reducer itself will tell the job tracker that it has finished, via its periodic heartbeat communication.
To understand the basics of HDFS replication, have a read over the replica placement section in the HDFS architecture document. In a nutshell, the NameNode will try to use the same rack to minimize latency.
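If you want to verify the replication of the job's output yourself, a small sketch like this (the output path is hypothetical) asks the NameNode directly:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Print each output file and the replication factor the namenode reports for it.
        for (FileStatus status : fs.listStatus(new Path("/user/demo/job-output"))) {
            System.out.println(status.getPath() + " -> replication " + status.getReplication());
        }
    }
}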
