Data division on addition of a node to a distributed system - Hadoop

Suppose I have a distributed network of computers with, say, 1000 storage nodes.
If a new node is added, what should be done?
Should the data now be divided equally across the 1001 nodes?
Also, would the answer change if there were 10 nodes instead of 1000?

The client machine first splits the file into blocks, say Block A and Block B. The client then asks the NameNode where to place these blocks, and the NameNode returns a list of datanodes to write the data to. The NameNode generally chooses the datanode nearest to the client on the network.
The client then picks the first datanode from that list and writes the first block to it, and that datanode replicates the block to the other datanodes. The NameNode keeps the information about files and their associated blocks.
HDFS will not move blocks from old datanodes to new datanodes to balance the cluster when a datanode is added. To do this, you need to run the balancer.
The balancer program is a Hadoop daemon that redistributes blocks by moving them from overutilized datanodes to underutilized datanodes, while adhering to the block replica placement policy that makes data loss unlikely by placing block replicas on different racks. It moves blocks until the cluster is deemed to be balanced, which means that the utilization of every datanode (ratio of used space on the node to total capacity of the node) differs from the utilization of the cluster (ratio of used space on the cluster to total capacity of the cluster) by no more than a given threshold percentage.
Reference: Hadoop: The Definitive Guide, 3rd edition, page 350.
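The "balanced" condition quoted above can be written down as a small check. This is only a sketch of the definition, not the balancer's actual code; the node names, the sizes, and the 10 percent threshold are assumptions made for illustration.

```python
def utilization(used: float, capacity: float) -> float:
    """Used space as a percentage of total capacity."""
    return 100.0 * used / capacity


def is_balanced(nodes: dict, threshold: float = 10.0) -> bool:
    """Return True if every node is within `threshold` points of the cluster.

    `nodes` maps a (hypothetical) datanode name to a (used, capacity) pair.
    """
    total_used = sum(used for used, _ in nodes.values())
    total_capacity = sum(capacity for _, capacity in nodes.values())
    cluster_util = utilization(total_used, total_capacity)
    return all(
        abs(utilization(used, capacity) - cluster_util) <= threshold
        for used, capacity in nodes.values()
    )


# Three partly full nodes plus one freshly added, empty node (sizes in GB).
before = {"dn1": (600, 1000), "dn2": (620, 1000), "dn3": (580, 1000), "dn4": (0, 1000)}
# The same data after blocks have been moved onto the new node.
after = {"dn1": (450, 1000), "dn2": (450, 1000), "dn3": (450, 1000), "dn4": (450, 1000)}

print(is_balanced(before))  # the new node is far below the cluster average
print(is_balanced(after))
```

Adding an empty node lowers the cluster-wide utilization, so the old nodes end up above the average and the new one far below it. That is why the cluster counts as unbalanced until blocks are moved.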
As a Hadoop admin, you should schedule a balancer job once a day to keep the blocks balanced across the cluster.
Useful links related to the balancer:
http://www.swiss-scalability.com/2013/08/hadoop-hdfs-balancer-explained.html
http://www.cloudera.com/content/cloudera/en/documentation/cdh4/latest/CDH4-Installation-Guide/cdh4ig_balancer.html

Related

Can a slave node have multiple blocks of the same file in hadoop?

Say I have a Hadoop cluster where one node is the master node and the other is a data node. The slave node is an 8-core machine, so there are enough cores to process jobs in parallel. Can I still split the file into, say, 3 blocks and have the slave node store all three blocks separately? In other words, if we want to utilize all the slave nodes in a Hadoop cluster, is there a 1:1 relation between the number of slave nodes and the maximum number of blocks of a file? If yes, how would MapReduce work in such a case? Will the master node fire three map jobs at the slave node and have each mapper pick up one block on the slave node?
My question can be seen in a different way. If we have a 1GB file on a cluster with 3 data nodes then how do the 64 MB blocks get divided and how are they distributed between the three nodes?
The second question is easier for me to understand, so I will take that first.
From the HDFS perspective:
With a 64 MB block size, a 1 GB file consists of 16 blocks. Blocks are stored somewhat randomly across the DataNodes if you have more of them than the replication factor, but you can expect an even distribution between the nodes as long as you do not load the data from one of the DNs. If you do, that DN will hold a replica of every block, and the other DNs will hold the remaining replicas, distributed roughly evenly (still randomly placed). So yes: if you have a file consisting of 16 blocks and only 3 DNs with a replication factor of 3, all 3 DNs will hold all 16 blocks.
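The arithmetic behind these numbers is short enough to spell out. The sketch below uses the sizes from the question; the function names are made up for illustration.

```python
MB = 1024 * 1024
GB = 1024 * MB


def block_count(file_size: int, block_size: int) -> int:
    """Number of HDFS blocks for a file; the last block may be partial."""
    return (file_size + block_size - 1) // block_size  # integer ceiling


def total_replicas(file_size: int, block_size: int, replication: int) -> int:
    """Total number of block replicas stored across the cluster."""
    return block_count(file_size, block_size) * replication


blocks = block_count(1 * GB, 64 * MB)
replicas = total_replicas(1 * GB, 64 * MB, replication=3)
print(blocks, replicas)
```

With 3 DataNodes and a replication factor of 3, the 48 replicas have to be spread so that no node holds the same block twice. That leaves exactly one replica of each of the 16 blocks on every node.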
From YARN's perspective when you run the MapReduce job:
YARN tries to find a container for each mapper on a node that has the data locally. There is a configurable wait time for a free container on such nodes before YARN starts the mapper on a node that does not have the data.
YARN does not rely on physical cores directly. You can configure the number of virtual cores and the amount of memory a container uses, and based on these values YARN determines how many containers are available on a NodeManager.
Further reading on YARN tuning is available on the Cloudera Engineering blog.
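As a rough illustration of the container arithmetic described above, the sketch below takes the smaller of the memory-based and vcore-based limits. It is a simplification under stated assumptions: the function name and the numbers are hypothetical, and which resources the scheduler really enforces depends on how it is configured.

```python
def max_containers(node_memory_mb: int, node_vcores: int,
                   container_memory_mb: int, container_vcores: int) -> int:
    """Upper bound on same-sized containers that fit on one NodeManager."""
    by_memory = node_memory_mb // container_memory_mb
    by_vcores = node_vcores // container_vcores
    return min(by_memory, by_vcores)


# Hypothetical NodeManager offering 16 GB and 8 vcores to YARN.
small = max_containers(16384, 8, container_memory_mb=2048, container_vcores=1)
large = max_containers(16384, 8, container_memory_mb=4096, container_vcores=1)
print(small, large)
```

In this example the 8-core slave node from the question could run up to 8 mappers at once with 2 GB containers, but only 4 with 4 GB containers, regardless of how many blocks it stores.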
However:
From the first part of the question, as I understand it, you want to achieve parallelism by choosing the block size used to split your data files.
MapReduce does not care about HDFS blocks; it has its own abstraction for splitting the input, called InputSplit. InputSplits are fed to the mappers by the InputFormat. Each InputSplit also reports where its data is available locally, so that YARN can find a container on a node that has the split in local storage. I suggest checking the API and the available implementations of InputFormat, as they most likely suit your needs; if they do not, you can still write your own implementation and specify it via the job configuration.
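To make the InputSplit idea concrete, here is a sketch of how a file-based InputFormat might derive split lengths from the block size plus minimum and maximum split settings. It is written from memory of how FileInputFormat behaves, so treat the formula, the 1.1 slack factor, and all names as assumptions rather than the library's exact code.

```python
MB = 1024 * 1024
GB = 1024 * MB
SPLIT_SLOP = 1.1  # assumed slack: a short tail is merged into the last split


def compute_split_size(block_size: int, min_size: int, max_size: int) -> int:
    """Clamp the block size between the configured min and max split size."""
    return max(min_size, min(max_size, block_size))


def split_lengths(file_size: int, split_size: int) -> list:
    """Cut a file of `file_size` bytes into split lengths."""
    lengths = []
    remaining = file_size
    while remaining / split_size > SPLIT_SLOP:
        lengths.append(split_size)
        remaining -= split_size
    if remaining:
        lengths.append(remaining)  # final split, possibly slightly oversized
    return lengths


default_splits = split_lengths(1 * GB, compute_split_size(64 * MB, 1, 1 << 62))
smaller_splits = split_lengths(1 * GB, compute_split_size(64 * MB, 1, 32 * MB))
print(len(default_splits), len(smaller_splits))
```

With default-like settings the split size equals the block size, so the 1 GB file gives one mapper per block. Lowering the maximum split size yields more, smaller splits without touching the HDFS block size.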

What will Hadoop do after one of the datanodes goes down?

I have a Hadoop cluster with 10 data nodes and 2 name nodes, configured with a replication factor of 3. If one of the data nodes goes down, will Hadoop try to regenerate the lost replicas on the other live nodes, or just do nothing (since there are still 2 replicas left)?
Also, what if the down data node comes back after a while? Can Hadoop recognize the data on that node? Thanks!
Will Hadoop try to regenerate the lost replicas on the other live nodes, or just do nothing (since there are still 2 replicas left)?
Yes, Hadoop will recognize this and make copies of that data on other nodes. When the NameNode stops receiving heartbeats from a data node, it assumes that the data node is lost. To keep all data at the defined replication factor, it makes copies on other data nodes.
Also, what if the down data node comes back after a while? Can Hadoop recognize the data on that node?
Yes. When a data node comes back with all its data, the NameNode will delete the extra copies of the data. In the next heartbeat exchange with the data node, the NameNode sends the instruction to remove the extra data and free up the disk space.
Snippet from Apache HDFS documentation:
Each DataNode sends a Heartbeat message to the NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of a Heartbeat message. The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them. Any data that was registered to a dead DataNode is not available to HDFS any more. DataNode death may cause the replication factor of some blocks to fall below their specified value. The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The necessity for re-replication may arise due to many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased.
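The bookkeeping described in this answer can be sketched as a comparison between each block's live replicas and the target replication factor. This is a toy model, not NameNode code; the block and node names are hypothetical.

```python
def replication_report(block_locations: dict, live_nodes: set, replication: int = 3):
    """Return (under, over): blocks mapped to missing or excess replica counts."""
    under, over = {}, {}
    for block, nodes in block_locations.items():
        live = len(nodes & live_nodes)  # replicas on nodes that still heartbeat
        if live < replication:
            under[block] = replication - live
        elif live > replication:
            over[block] = live - replication
    return under, over


locations = {"blk_A": {"dn1", "dn2", "dn3"}, "blk_B": {"dn2", "dn3", "dn4"}}

# dn1 stops sending heartbeats, so blk_A is short by one replica.
under_after_failure, _ = replication_report(locations, {"dn2", "dn3", "dn4", "dn5"})

# Suppose blk_A was re-replicated to dn5 and dn1 later returns with its data.
locations["blk_A"].add("dn5")
_, over_after_return = replication_report(locations, {"dn1", "dn2", "dn3", "dn4", "dn5"})

print(under_after_failure, over_after_return)
```

The first report shows what needs re-replicating after the failure. The second shows the excess replica that can be deleted once the node is back.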

Data replication in hadoop cluster

I am a beginner learning Hadoop. Is it possible that 2 different data blocks from the same file could be stored in the same data node? For example: blk-A and blk-B from file "file.txt" could be placed in the same data node (datanode 1).
Here is the documentation that explains the block placement policy. Currently, the HDFS replication factor is 3 by default, which means there are 3 replicas of each block. They are placed as follows:
The first replica is placed on a datanode on one rack.
The second replica is placed on a datanode on a different rack.
The third replica is placed on a different datanode on the same rack as the second replica.
This policy helps in events such as a datanode dying or a block becoming corrupted.
Is it possible?
Unless you make changes in the source code, there is no property you can set that will place two replicas of the same block on the same datanode. Two different blocks of the same file, such as blk-A and blk-B, can share a datanode; the restriction applies only to replicas of the same block.
In my opinion, placing two replicas of a block on the same datanode defeats the purpose of HDFS. Blocks are replicated so that HDFS can recover from the events described above. If two replicas are placed on the same datanode and that datanode dies, you lose two replicas instead of one.
The answer depends on the cluster topology. Hadoop tries to distribute data among data centers and data nodes. But what if you have only one data center, or a single-node cluster (a pseudo-distributed cluster)? In those cases the optimal distribution cannot happen, and it is possible that all blocks end up on the same data node. In production it is recommended to have more than one data center (physically, not only in configuration) and at least as many data nodes as the replication factor.

HADOOP HDFS imbalance issue

I have a Hadoop cluster of 8 machines, and all 8 machines are data nodes.
A program running on one machine (say machine A) continuously creates sequence files (each about 1 GB) in HDFS.
Here's the problem: all 8 machines have the same hardware and the same capacity, yet while the other machines still have about 50% of their HDFS disk space free, machine A has only 5% left.
I checked the block info and found that almost every block has one replica on machine A.
Is there any way to balance the replicas?
Thanks.
This is the default placement policy. It works well for the typical M/R pattern, where each HDFS node is also a compute node and the writer machines are uniformly distributed.
If you don't like it, there is HDFS-385, "Design a pluggable interface to place replicas of blocks in HDFS". You need to write a class that implements the BlockPlacementPolicy interface and then set that class as dfs.block.replicator.classname in hdfs-site.xml.
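To see why the default policy fills the writer's machine first, the following toy simulation may help. It assumes a single rack, a replication factor of 3, and that the first replica always lands on the writing node while the other two go to random distinct nodes; the machine names and block count are made up.

```python
import random


def simulate_writes(nodes: list, writer: str, num_blocks: int,
                    replication: int = 3, seed: int = 0) -> dict:
    """Count replicas per node when every block is written from `writer`."""
    rng = random.Random(seed)
    others = [n for n in nodes if n != writer]
    counts = {n: 0 for n in nodes}
    for _ in range(num_blocks):
        counts[writer] += 1  # first replica stays on the writer's node
        for node in rng.sample(others, replication - 1):
            counts[node] += 1  # remaining replicas go to other nodes
    return counts


machines = ["A", "B", "C", "D", "E", "F", "G", "H"]
counts = simulate_writes(machines, writer="A", num_blocks=7000)
print(counts)
```

Under these assumptions machine A stores one replica of every block, while each of the other seven machines stores roughly two sevenths as many. That matches the pattern in the question, where machine A fills up long before the others.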
There is a way: you can use the hadoop command-line balancer tool.
HDFS data might not always be placed uniformly across the DataNodes. The balancer can be used to spread HDFS data uniformly across the DataNodes in the cluster.
hadoop balancer [-threshold <threshold>]
where threshold is a percentage of disk capacity.
See the following links for details:
http://hadoop.apache.org/docs/r1.0.4/commands_manual.html
http://hadoop.apache.org/docs/r1.0.4/hdfs_user_guide.html#Rebalancer

HDFS replication factor

When I upload a file to HDFS with the replication factor set to 1, will the file splits reside on one single machine, or will they be distributed to multiple machines across the network?
hadoop fs -D dfs.replication=1 -copyFromLocal file.txt /user/ablimit
According to Hadoop: The Definitive Guide:
Hadoop’s default strategy is to place the first replica on the same node as the client (for
clients running outside the cluster, a node is chosen at random, although the system
tries not to pick nodes that are too full or too busy). The second replica is placed on a
different rack from the first (off-rack), chosen at random. The third replica is placed on
the same rack as the second, but on a different node chosen at random. Further replicas
are placed on random nodes on the cluster, although the system tries to avoid placing
too many replicas on the same rack.
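The quoted strategy can be sketched in a few lines. This is a simplified model rather than the HDFS implementation: it ignores how full or busy nodes are, does not try to limit replicas per rack for the fourth replica onward, and uses a hypothetical two-rack topology.

```python
import random


def place_replicas(topology: dict, writer=None, replication: int = 3, rng=None) -> list:
    """Pick nodes for one block; `topology` maps rack name -> list of nodes."""
    rng = rng or random.Random()
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    chosen = []

    def pick(preferred):
        # Prefer the given candidates; fall back to any node not yet used.
        pool = [n for n in preferred if n not in chosen]
        if not pool:
            pool = [n for n in rack_of if n not in chosen]
        if pool:
            chosen.append(rng.choice(pool))

    # First replica: the writer's node if it is a datanode, else a random node.
    pick([writer] if writer in rack_of else list(rack_of))
    # Second replica: a node on a different rack from the first.
    if replication >= 2 and chosen:
        pick([n for n in rack_of if rack_of[n] != rack_of[chosen[0]]])
    # Third replica: a different node on the same rack as the second.
    if replication >= 3 and len(chosen) == 2:
        pick(topology[rack_of[chosen[1]]])
    # Further replicas: any remaining nodes, chosen at random.
    while len(chosen) < min(replication, len(rack_of)):
        pick([])
    return chosen


racks = {"rack1": ["a", "b"], "rack2": ["c", "d"]}
three = place_replicas(racks, writer="a", replication=3, rng=random.Random(1))
one = place_replicas(racks, writer="a", replication=1)
print(three, one)
```

In this model, a replication factor of 1 with the client on a datanode keeps every block on that node. A client outside the cluster would get a random node for each block instead.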
This logic makes sense, as it decreases the network chatter between the different nodes. However, the book was published in 2009, and there have been a lot of changes in the Hadoop framework since then.
I think it depends on whether the client is itself a Hadoop node. If the client is a Hadoop node, all the splits will be on that same node. This does not provide any better read/write throughput, in spite of the cluster having multiple nodes. If the client is not a Hadoop node, a node is chosen at random for each split, so the splits are spread across the nodes in the cluster. This provides better read/write throughput.
One advantage of writing to multiple nodes is that even if one of the nodes goes down, only some of the splits are lost, and at least some data can be recovered from the remaining splits.
If you set replication to be 1, then the file will be present only on the client node, that is the node from where you are uploading the file.
If your cluster has a single node, then when you upload a file it is split according to the block size and stays on that single machine.
If your cluster has multiple nodes, then when you upload a file it is split according to the block size and distributed to different datanodes in your cluster via a pipeline, and the NameNode decides where the data should be placed in the cluster.
The HDFS replication factor controls how many copies of the data are kept; for example, if your replication factor is 2, all the data you upload to HDFS will have a second copy.
A replication factor of 1 is typical of a single-node cluster, which has only one client node (see http://commandstech.com/replication-factor-in-hadoop/). Files you upload are then stored and used on that single node.

Resources