How HDFS removes over-replicated blocks - Hadoop

For example, I wrote a file into HDFS using replication factor 2. The node I was writing to now has all the blocks of the file. The other copies of the file's blocks are scattered across the remaining nodes in the cluster. That's the default HDFS policy.
What exactly happens if I lower the replication factor of the file to 1?
How does HDFS decide which blocks to delete from which nodes? I hope it tries to delete blocks from the nodes that hold the most blocks of the file.
Why I'm asking: if it does, it would make sense, because it would ease processing of the file. If there were only one copy of every block and all those blocks sat on the same node, it would be harder to process the file with MapReduce, because the data would have to be transferred to other nodes in the cluster.

When a block becomes over-replicated, the name node chooses a replica to remove. The name node will prefer not to reduce the number of racks that host replicas, and secondly prefer to remove a replica from the data node with the least amount of available disk space. This may help rebalancing the load over the cluster.
Source: The Architecture of Open Source Applications

Over-replicated blocks are removed by HDFS from different nodes at random and the data is rebalanced, meaning the excess replicas are not simply removed from the current node.
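If you want to see which replicas survive in practice, one way (the file path below is just a placeholder) is to lower the replication factor with setrep and then ask fsck where the remaining replica of each block lives:

hadoop fs -setrep -w 1 /user/me/file.txt
hdfs fsck /user/me/file.txt -files -blocks -locations

The -w flag waits until the target replication is reached, and the fsck output lists the datanode hosting each remaining block, so you can check whether the deletions favoured the node that held the most blocks of the file.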

Related

How to reduce the replication factor in an HDFS directory and its impact

We are using Hortonworks HDP 2.1 (HDFS 2.4), with replication factor 3.
We have recently decommissioned a datanode and that left a lot of under replicated blocks in the cluster.
The cluster is now trying to satisfy the replication factor by re-replicating the under-replicated blocks onto other nodes.
How do I stop that process? I am OK with some files being replicated only twice. If I change the replication factor to 2 on that directory, will that process be terminated?
What's the impact of changing the replication factor to 2 for a directory that has files with 3 copies? Will the cluster start another process to remove the excess copy of each file that has 3 copies?
Appreciate your help on this. Kindly share the references too.
We have recently decommissioned a datanode and that left a lot of under replicated blocks in the cluster.
If the DataNode was gracefully decommissioned, then it should not have resulted in under-replicated blocks. As an edge case though, if decommissioning a node brings the total node count under the replication factor set on a file, then by definition that file's blocks will be under-replicated. (For example, consider an HDFS cluster with 3 DataNodes. Decommissioning a node results in 2 DataNodes remaining, so now files with a replication factor of 3 have under-replicated blocks.)
During decommissioning, HDFS re-replicates (copies) the blocks hosted on that DataNode over to other DataNodes in the cluster, so that the desired replication factor is maintained. More details on this are here:
How do I correctly remove nodes in Hadoop?
Decommission DataNodes
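As a rough sketch of the graceful decommission flow referenced above (the exclude file path below is an assumption and depends on how dfs.hosts.exclude is configured on your cluster): add the DataNode's hostname to the exclude file and ask the NameNode to re-read it:

echo 'dn5.example.com' >> /etc/hadoop/conf/dfs.exclude
hdfs dfsadmin -refreshNodes

The node then shows as "Decommission In Progress" in the NameNode web UI while its blocks are re-replicated elsewhere, and flips to "Decommissioned" once that copying finishes.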
How do I stop that process? I am OK with some files being replicated only twice. If I change the replication factor to 2 on that directory, will that process be terminated?
There is no deterministic way to terminate this process as a whole. However, if you lower the replication factor to 2 on some of the under-replicated files, then the NameNode will stop scheduling re-replication work for the blocks of those files. This means that for the blocks of those files, HDFS will stop copying new replicas across different DataNodes.
The typical replication factor of 3 is desirable from a fault-tolerance perspective. You might consider setting the replication factor on those files back to 3 later.
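For example (the directory path is a placeholder), lowering the factor looks like this; applying setrep to a directory changes the replication factor of every file under it, and the optional -w flag blocks until the new factor is satisfied:

hadoop fs -setrep -w 2 /data/mydir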
What's the impact of changing the replication factor to 2 for a directory that has files with 3 copies? Will the cluster start another process to remove the excess copy of each file that has 3 copies?
Yes, the NameNode will flag these files as over-replicated. In response, it will schedule block deletions at DataNodes to restore the desired replication factor of 2. These block deletions are dispatched to the DataNodes asynchronously, in response to their heartbeats. Within the DataNode, the block deletion executes asynchronously to clean the underlying files from the disk.
More details on this are described in the Apache Hadoop Wiki.
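If you want to watch this happen, the fsck summary reports the over- and under-replicated block counts, which should drop back towards zero as the deletions are processed (the path is a placeholder):

hdfs fsck /data/mydir | egrep 'Over-replicated|Under-replicated'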

Data replication in a Hadoop cluster

I am a beginner learning Hadoop. Is it possible that 2 different data blocks from the same file could be stored in the same data node? For example: blk-A and blk-B from file "file.txt" could be placed in the same data node (datanode 1).
Here is the documentation that explains the block placement policy. Currently, HDFS replication is 3 by default, which means there are 3 replicas of each block. They are placed as follows:
The first replica is placed on a datanode on one rack.
The second replica is placed on a datanode on a different rack.
The third replica is placed on a different datanode on the same rack as the second replica.
This policy helps when a datanode dies, a block gets corrupted, and so on.
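If rack awareness is configured, you can check which racks and datanodes the NameNode actually knows about; without a topology script everything appears under the single /default-rack:

hdfs dfsadmin -printTopology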
Is it possible?
Unless you make changes in the source code, there is no property you can change that will allow you to place two blocks on the same datanode.
My opinion is that placing two blocks on the same datanode defeats the purpose of HDFS. Blocks are replicated so that HDFS can recover from the failures described above. If two blocks are placed on the same datanode and that datanode dies, you lose two blocks instead of one.
The answer depends on the cluster topology. Hadoop tries to distribute data among data centers and data nodes. But what if you only have one data center, or only a single-node (pseudo-distributed) cluster? In those cases the optimal distribution doesn't happen, and it is possible that all blocks end up on the same data node. In production it is recommended to have more than one data center (physically, not only in configuration) and at least as many data nodes as the replication factor.

Metadata storage by Namenode for all file blocks

While reading the book Hadoop: The Definitive Guide, I came across this page with the following line:
The namenode also knows the datanodes on which all the blocks for a given file are located, however, it does not store block locations persistently, since this information is reconstructed from datanodes when the system starts.
I am struggling to understand how this works. Let's say I copy a 1 GB file onto an 8-node cluster with a replication factor of 3. So each datanode will have 1 block, and these blocks will be replicated on other nodes, bringing the total number of blocks on each node effectively to 3. Now the namenode is supposed to keep an index containing the location of each block. But according to the text, if the namenode does not store block locations persistently, how are they reconstructed after the cluster is shut down and restarted? There would be no way of telling which block belongs to which file. Can someone please explain this to me?
The namenode does preserve some state about the files (name, path, size, block size, block IDs, etc.), just not the physical locations of the blocks.
When the data nodes start up, they effectively tree-walk the dfs data directory, discovering all the block files they have, and once complete, each one reports to the name node the blocks that it hosts.
The namenode builds up a map of the files to block locations from the reports from each data node.
This is one of the reasons it sometimes takes a few minutes to come out of safe mode when the cluster first starts up - if you have lots of files, it can take a few moments for each data node to tree walk and discover the blocks it hosts.
Each fsimage file contains a serialized form of all the directory and file inodes in the filesystem. Each inode is an internal representation of a file or directory’s metadata and contains such information as the file’s replication level, modification and access times, access permissions, block size, and the blocks the file is made up of. For directories, the modification time, permissions, and quota metadata are stored. An fsimage file does not record the datanodes on which the blocks are stored. Instead, the namenode keeps this mapping in memory, which it constructs by asking the datanodes for their block lists when they join the cluster and periodically afterward to ensure the namenode’s block mapping is up to date.
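You can verify this yourself with the offline image viewer (the fsimage file name below is a placeholder; the real file sits under the namenode's dfs.namenode.name.dir with a transaction-ID suffix). The XML dump contains inodes and block IDs, but no datanode hostnames:

hdfs oiv -p XML -i fsimage_0000000000000000042 -o fsimage.xml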

HDFS replication factor

When I'm uploading a file to HDFS, if I set the replication factor to 1, will the file's splits reside on one single machine, or will the splits be distributed to multiple machines across the network?
hadoop fs -D dfs.replication=1 -copyFromLocal file.txt /user/ablimit
According to Hadoop: The Definitive Guide:
Hadoop’s default strategy is to place the first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy). The second replica is placed on a different rack from the first (off-rack), chosen at random. The third replica is placed on the same rack as the second, but on a different node chosen at random. Further replicas are placed on random nodes on the cluster, although the system tries to avoid placing too many replicas on the same rack.
This logic makes sense as it decreases the network chatter between the different nodes. But the book was published in 2009, and there have been a lot of changes in the Hadoop framework since then.
I think it depends on whether the client is also a Hadoop node or not. If the client is a Hadoop node, then all the splits will be on the same node. This doesn't provide any better read/write throughput in spite of having multiple nodes in the cluster. If the client is not a Hadoop node, then a node is chosen at random for each split, so the splits are spread across the nodes in the cluster; this provides better read/write throughput.
One advantage of writing to multiple nodes is that even if one of the nodes goes down, a couple of splits might be lost, but at least some data can be recovered from the remaining splits.
If you set replication to 1, then the file will be present only on the client node, that is, the node from which you are uploading the file.
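A quick way to check what you actually ended up with (reusing the path from the question's command): -stat %r prints the replication factor recorded for the file, and -ls shows it in the second column of the listing:

hadoop fs -stat %r /user/ablimit/file.txt
hadoop fs -ls /user/ablimit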
If your cluster is a single node, then when you upload a file it will be split according to the block size and it remains on that single machine.
If your cluster is multi-node, then when you upload a file it will be split according to the block size and distributed to different datanodes in your cluster via the write pipeline; the NameNode decides where the data should be placed in the cluster.
The HDFS replication factor is used to make copies of the data, i.e. if your replication factor is 2 then all the data you upload to HDFS will have one extra copy.
If you set the replication factor to 1, that typically means a single-node cluster, where there is only the one client node (http://commandstech.com/replication-factor-in-hadoop/) onto which you can upload files and then use them.

How can I be sure that data is distributed evenly across the hadoop nodes?

If I copy data from the local system to HDFS, can I be sure that it is distributed evenly across the nodes?
PS: HDFS guarantees that each block will be stored on 3 different nodes. But does this mean that all blocks of my files will be stored on the same 3 nodes? Or will HDFS select the nodes at random for each new block?
If your replication is set to 3, it will be put on 3 separate nodes. The number of nodes it's placed on is controlled by your replication factor. If you want greater distribution then you can increase the replication number by editing the $HADOOP_HOME/conf/hadoop-site.xml and changing the dfs.replication value.
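For reference, the property looks like the snippet below (on current distributions the file is usually hdfs-site.xml rather than hadoop-site.xml, and changing it only affects files written after the change; existing files keep their old factor unless you run setrep on them):

<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>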
I believe new blocks are placed almost randomly. There is some consideration for distribution across different racks (when Hadoop is made aware of racks). There is an example (can't find the link): if you have replication of 3 and 2 racks, 2 replicas will be placed in one rack and the third replica in the other rack. I would guess that there is no preference shown for which node in the rack gets the blocks.
I haven't seen anything indicating or stating a preference to store blocks of the same file on the same nodes.
If you are looking for ways to force balancing data across nodes (with replication at whatever value) a simple option is $HADOOP_HOME/bin/start-balancer.sh which will run a balancing process to move blocks around the cluster automatically.
This and a few other balancing options can be found in the Hadoop FAQs.
Hope that helps.
You can open the HDFS Web UI on port 50070 of your namenode. It will show you information about the data nodes. One thing you will see there is the used space per node.
If you do not have the UI, you can look at the space used in the HDFS data directories of the data nodes.
If you have data skew, you can run the balancer, which will fix it gradually.
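A sketch of kicking that off from the command line (the threshold is the allowed percentage deviation of a node's utilisation from the cluster average; 10 is the default):

hdfs balancer -threshold 10

This runs the same tool as the start-balancer.sh script mentioned above, just in the foreground.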
Now, with the Hadoop-385 patch, we can choose the block placement policy, so as to place all blocks of a file on the same node (and similarly for the replicated nodes). Read this blog about the topic, and look at the comments section.
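If you go down that route, the hook is a custom BlockPlacementPolicy implementation configured in hdfs-site.xml. The property name below is the one used in Hadoop 2.x, and the class name is a placeholder for your own implementation:

<property>
  <name>dfs.block.replicator.classname</name>
  <value>com.example.MyBlockPlacementPolicy</value>
</property>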
Yes, Hadoop distributes data per block, so each block would be distributed separately.
