Datanode: one disk volume failure - hadoop

One of the disks on my Hadoop cluster datanode has become read-only. I am not sure what caused this problem.
Will removing this volume from the datanode cause data loss?
How should I handle this if I do face data loss?

If your Hadoop cluster has a replication factor of more than 1 (by default it is 3 for a multi-node cluster), your data must have been replicated on multiple datanodes. You can check your replication factor value (dfs.replication) in hdfs-site.xml.
So if you now remove this read-only datanode from your cluster and you have a replication factor of more than 1, you will not face any data loss, because the cluster holds a corresponding replica on another datanode. To rebalance the replicas, the resulting under-replicated blocks will be handled by HDFS automatically, and the cluster will subsequently be stable again.
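Before removing anything, you can verify the replication factor and the cluster's block health; a minimal sketch, assuming a standard HDFS installation with the hdfs client on the PATH:
# Show the configured default replication factor (dfs.replication from hdfs-site.xml)
hdfs getconf -confKey dfs.replication
# Report overall block health, including any under-replicated block counts
hdfs fsck / | grep -i 'replicated'
# List datanodes and their state (live/dead, per-node capacity)
hdfs dfsadmin -report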

Related

In Cloudera Manager, how to migrate data from an excluded datanode

I have excluded datanode host "dn001" via "dfs_hosts_exclude.txt", and it works. How do I also migrate the data from "dn001" to the other datanodes?
You shouldn't have to do anything. Hadoop's HDFS should re-replicate any data lost on your data node.
From HDFS Architecture - Data Disk Failure, Heartbeats and Re-Replication
Each DataNode sends a Heartbeat message to the NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of a Heartbeat message. The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them. Any data that was registered to a dead DataNode is not available to HDFS any more. DataNode death may cause the replication factor of some blocks to fall below their specified value. The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The necessity for re-replication may arise due to many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased.
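If you want to confirm the exclusion and watch the re-replication progress, a minimal sketch (assuming your exclude file is the one referenced by dfs.hosts.exclude, as in the question):
# Ask the NameNode to re-read the dfs.hosts / dfs.hosts.exclude files
hdfs dfsadmin -refreshNodes
# Watch dn001 move from "Decommission in progress" to "Decommissioned"
hdfs dfsadmin -report
# Any blocks still waiting to be re-replicated
hdfs fsck / | grep -i 'under-replicated'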

How to reduce the replication factor of an HDFS directory, and its impact

We are using Hortonworks HDP 2.1 (HDFS 2.4), with replication factor 3.
We have recently decommissioned a datanode and that left a lot of under replicated blocks in the cluster.
The cluster is now trying to satisfy the replication factor by distributing the under-replicated blocks among the other nodes.
How do I stop that process? I am OK with some files being replicated only twice. If I change the replication factor to 2 on that directory, will that process be terminated?
What's the impact of setting the replication factor to 2 for a directory which has files with 3 copies? Will the cluster start another process to remove the excess copy for each file with 3 copies?
Appreciate your help on this. Kindly share the references too.
Thanks.
Sajeeva.
We have recently decommissioned a datanode and that left a lot of under replicated blocks in the cluster.
If the DataNode was gracefully decommissioned, then it should not have resulted in under-replicated blocks. As an edge case though, if decommissioning a node brings the total node count under the replication factor set on a file, then by definition that file's blocks will be under-replicated. (For example, consider an HDFS cluster with 3 DataNodes. Decommissioning a node results in 2 DataNodes remaining, so now files with a replication factor of 3 have under-replicated blocks.)
During decommissioning, HDFS re-replicates (copies) the blocks hosted on that DataNode over to other DataNodes in the cluster, so that the desired replication factor is maintained. More details on this are here:
How do I correctly remove nodes in Hadoop?
Decommission DataNodes
How do I stop that process? I am OK with some files being replicated only twice. If I change the replication factor to 2 on that directory, will that process be terminated?
There is no deterministic way to terminate this process as a whole. However, if you lower the replication factor to 2 on some of the under-replicated files, then the NameNode will stop scheduling re-replication work for the blocks of those files. This means that for the blocks of those files, HDFS will stop copying new replicas across different DataNodes.
The typical replication factor of 3 is desirable from a fault tolerance perspective. You might consider setting the replication factor on those files back to 3 later.
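As an illustration (a minimal sketch; /path/to/directory is a placeholder for the directory from the question), lowering the replication factor is a single command:
# Recursively sets replication to 2 for every file under the directory;
# add -w if you want the command to wait until the change is applied
hdfs dfs -setrep 2 /path/to/directory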
What's the impact of setting the replication factor to 2 for a directory which has files with 3 copies? Will the cluster start another process to remove the excess copy for each file with 3 copies?
Yes, the NameNode will flag these files as over-replicated. In response, it will schedule block deletions at DataNodes to restore the desired replication factor of 2. These block deletions are dispatched to the DataNodes asynchronously, in response to their heartbeats. Within the DataNode, the block deletion executes asynchronously to clean the underlying files from the disk.
More details on this are described in the Apache Hadoop Wiki.
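To verify the result afterwards (again a sketch with a placeholder path), fsck can print the per-file replication and flag any remaining over- or under-replicated blocks:
# Each block line shows its stored replication factor as repl=N
hdfs fsck /path/to/directory -files -blocks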

How does the NameNode recognize that a specific file's replication factor is an explicitly set value rather than the configured replication of 3?

hdfs-site.xml:
dfs.replication is configured to 3.
Assuming that I set the replication of a specific file to 2:
./bin/hadoop dfs -setrep -w 2 /path/to/file.txt
When the NameNode receives a heartbeat from the DataNode, will it consider the specified file
/path/to/file.txt to be under-replicated as per the configured replication factor or not?
If not, how does it work?
First, I would like to attempt to restate your question for clarity, to make sure I understand:
Will the NameNode consider a file that has been manually set to a replication factor lower than the default (dfs.replication) to be under-replicated?
No. The NameNode stores the replication factor of each file separately in its metadata, even if the replication factor was not set explicitly by calling -setrep. By default, the metadata for each file will copy the replication factor as specified in dfs.replication (3 in your example). It may be overridden, such as by calling -setrep. When the NameNode checks if a file is under-replicated, it checks the exact replication factor stored in the metadata for that individual file, not dfs.replication. If the file's replication factor is 2, and there are 2 replicas of each of its blocks, then this is fine, and the NameNode will not consider it to be under-replicated.
Your question also makes mention of heartbeating from the DataNodes, which I think means you're interested in how interactions between the DataNodes and NameNodes relate to replication. There is also another form of communication between DataNodes and NameNodes called block reports. The block reports are the means by which DataNodes tell the NameNodes which block replicas they store. The NameNode analyzes block reports from all DataNodes to determine if a block is either under-replicated or over-replicated. If a block is under-replicated (e.g. replication factor is 2, but there is only one replica), then the NameNode schedules re-replication work so that another DataNode makes a copy of the replica. If a block is over-replicated (e.g. replication factor is 3, but there are 4 replicas), then the NameNode schedules one of the replicas to be deleted, and eventually one of the DataNodes will delete it locally.
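If you want to see the per-file replication factor the NameNode has stored (a small sketch using the file from the question), either of these works:
# Prints just the stored replication factor of the file
hdfs dfs -stat %r /path/to/file.txt
# The second column of the listing is also the replication factor
hdfs dfs -ls /path/to/file.txt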

What will Hadoop do after one of the datanodes goes down?

I have a Hadoop cluster with 10 datanodes and 2 namenodes, with replication configured to 3. I was wondering: if one of the datanodes goes down, will Hadoop try to regenerate the lost replicas on the other live nodes, or do nothing (since there are still 2 replicas left)?
Also, what if the down datanode comes back after a while, can Hadoop recognize the data on that node? Thanks!
Will Hadoop try to regenerate the lost replicas on the other live nodes, or do nothing (since there are still 2 replicas left)?
Yes, Hadoop will recognize the failure and make copies of that data on some other nodes. When the NameNode stops receiving heartbeats from a datanode, it assumes that the datanode is lost. To keep all data at the defined replication factor, it will make copies on other datanodes.
What if the down datanode comes back after a while, can Hadoop recognize the data on that node?
Yes, when a datanode comes back with all its data, the NameNode will remove/delete the extra copies of that data. In the next heartbeat to the datanode, the NameNode will send the instruction to remove the extra data and free up the space on disk.
Snippet from Apache HDFS documentation:
Each DataNode sends a Heartbeat message to the NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of a Heartbeat message. The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them. Any data that was registered to a dead DataNode is not available to HDFS any more. DataNode death may cause the replication factor of some blocks to fall below their specified value. The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The necessity for re-replication may arise due to many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased.
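If you want to observe this behaviour on your own cluster, a hedged sketch (the -dead filter assumes a reasonably recent Hadoop release; otherwise plain -report lists dead nodes in its output as well):
# Lists datanodes whose heartbeats have stopped and that were marked dead
hdfs dfsadmin -report -dead
# Counts the blocks currently waiting for re-replication
hdfs fsck / | grep -i 'under-replicated'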

What happens if I set the replication factor to 3 on a pseudo-distributed mode cluster?

I tried changing the replication to 3, and I can see the replication is changed to 3 for the file I loaded into HDFS, but I cannot see the other 2 copies. Could someone explain what happens in this scenario?
You won't see any additional replicas, since you don't have other nodes to create them on. A replica can't be created on the same node. But on your NameNode you will see the Number of Under-Replicated Blocks metric become non-zero. If you later attach a new datanode to your cluster, the under-replicated blocks should start replicating automatically (obviously that implies configuring a full cluster instead of the pseudo cluster).
You can see the Number of Under-Replicated Blocks metric in the NameNode web UI: http://localhost:50070/dfshealth.html#tab-overview (the default address in a pseudo-cluster configuration).
It is recommended to set dfs.replication to "1"; otherwise, when running a single datanode or pseudo-distributed mode, HDFS can't replicate blocks to the specified number of datanodes and it will warn about blocks being under-replicated.
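For reference, a minimal hdfs-site.xml snippet for pseudo-distributed mode (this is the same dfs.replication property discussed above, set to 1 to match the single datanode):
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>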
