HDFS behavior when a file gets corrupted - hadoop

I found some sample questions for the Cloudera exam. I believe the answer is D. Agree?
Question 1
You use the hadoop fs -put command to add sales.txt to HDFS. This file is small enough that it fits into a single block, which is replicated to three nodes within your cluster. When and how will the cluster handle replication following the failure of one of these nodes?
A. The cluster will make no attempt to re-replicate this block.
B. This block will be immediately re-replicated and all other HDFS operations on the cluster will halt while this is in progress.
C. The block will remain under-replicated until the administrator manually deletes and recreates the file.
D. The file will be re-replicated automatically after the NameNode determines it is under-replicated based on the block reports it receives from the DataNodes.

Yes, it's D. When the NameNode determines that a DataNode is no longer active, it instructs one of the DataNodes that holds the given block to replicate it to another node.
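As a rough illustration of that answer (this is a made-up sketch, not Hadoop's actual code; the node and block names are invented), the NameNode's decision can be modeled as: count replicas only on live nodes, and for any block below the target, ask a surviving holder to make more copies.

```python
# Hypothetical sketch of under-replication detection from block reports.
TARGET_REPLICATION = 3

def find_replication_work(block_reports, live_nodes):
    """block_reports: {node: set(block_ids)} as last reported.
    Returns {block_id: (source_node, needed_copies)} for blocks whose
    replica count on live nodes is below TARGET_REPLICATION."""
    replica_locations = {}
    for node, blocks in block_reports.items():
        if node not in live_nodes:          # a dead node's replicas don't count
            continue
        for b in blocks:
            replica_locations.setdefault(b, []).append(node)

    work = {}
    for b, nodes in replica_locations.items():
        missing = TARGET_REPLICATION - len(nodes)
        if missing > 0:
            work[b] = (nodes[0], missing)   # ask any surviving holder to copy
    return work

reports = {"dn1": {"blk_1"}, "dn2": {"blk_1"}, "dn3": {"blk_1"}}
# dn3 fails: its heartbeats stop, so it is dropped from the live set.
print(find_replication_work(reports, live_nodes={"dn1", "dn2"}))
# -> {'blk_1': ('dn1', 1)}  (one more copy needed, sourced from dn1)
```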

Related

How often are blocks on HDFS replicated?

I have a question regarding HDFS block replication. Suppose a block is written to a datanode and DFS has a replication factor of 3: how long does it take for the namenode to replicate this block to other datanodes? Is it instantaneous? If not, suppose the disk on this datanode fails right after the block is written and cannot be recovered; does that mean the block is lost forever? Also, how often does the namenode check for missing/corrupt blocks?
You may want to review this article, which has a good description of HDFS writes. Replication should begin immediately, depending on how busy the cluster is:
https://data-flair.training/blogs/hdfs-data-write-operation/
What happens if DataNode fails while writing a file in the HDFS?
If a DataNode fails while data is being written to it, the following actions take place, transparently to the client writing the data:
the pipeline is closed, and packets in the ack queue are added to the front of the data queue so that DataNodes downstream of the failed node do not miss any packets.
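That recovery step can be sketched in a few lines (an illustrative model with invented names, not HDFS internals): packets that were sent but not yet acknowledged are pushed back onto the front of the data queue, so the rebuilt pipeline resends them and no downstream node misses data.

```python
from collections import deque

def recover_pipeline(data_queue, ack_queue):
    """Move unacknowledged packets to the front of the data queue,
    preserving their original order."""
    while ack_queue:
        data_queue.appendleft(ack_queue.pop())  # pop newest first so
    return data_queue                           # original order is kept

data_q = deque(["pkt4", "pkt5"])   # not yet sent
ack_q = deque(["pkt2", "pkt3"])    # sent, awaiting acks when the node failed
print(list(recover_pipeline(data_q, ack_q)))
# -> ['pkt2', 'pkt3', 'pkt4', 'pkt5']
```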

Deleting HDFS Block Pool

I am running Spark on a Hadoop cluster. I tried running a Spark job and noticed some issues; by looking at the logs of the data node I eventually realised that the filesystem of one of the datanodes was full.
I looked at hdfs dfsadmin -report to identify this. The DFS Remaining category shows 0 B because the non-DFS used space is massive (155 GB of 193 GB configured capacity).
When I looked at the file system on this data node I could see most of this comes from the /usr/local/hadoop_work/ directory. There are three block pools there and one of them is very large (98GB). When I look on the other data node in the cluster it only has one block pool.
What I am wondering is can I simply delete two of these block pools? I'm assuming (but don't know enough about this) that the namenode (I have only one) will be looking at the most recent block pool which is smaller in size and corresponds to the one on the other data node.
As outlined in the comment above, eventually I did just delete the two block pools. I did this based on the fact that these block pool IDs didn't exist on the other data node, and by looking through the local filesystem I could see that the files under these IDs hadn't been updated for a while.

Data replication in hadoop cluster

I am a beginner learning Hadoop. Is it possible that 2 different data blocks from the same file could be stored in the same data node? For example: blk-A and blk-B from file "file.txt" could be placed in the same data node (datanode 1).
Here is the documentation that explains the block placement policy. The default HDFS replication factor is 3, which means there are 3 replicas of each block. They are placed as follows:
The first replica is placed on a datanode on one rack.
The second replica is placed on a datanode on a different rack.
The third replica is placed on a different datanode on the same rack as the second replica.
This policy helps in events such as a datanode dying or a block getting corrupted.
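The placement policy just described can be sketched as follows (a hedged toy model: the rack names, node names, and cluster layout are invented for illustration; real HDFS placement also weighs node load and the writer's location).

```python
import random

def place_replicas(racks, first_rack):
    """racks: {rack_name: [node, ...]}. Returns 3 chosen nodes:
    replica 1 on first_rack, replicas 2 and 3 on one other rack."""
    r1 = random.choice(racks[first_rack])
    other_rack = random.choice([r for r in racks if r != first_rack])
    r2 = random.choice(racks[other_rack])
    r3 = random.choice([n for n in racks[other_rack] if n != r2])
    return [r1, r2, r3]

cluster = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas(cluster, "rack1"))
# replicas 2 and 3 share a rack that differs from replica 1's rack
```

Note how losing a whole rack still leaves at least one replica elsewhere, which is the point of the policy.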
Is it possible?
Unless you make changes in the source code, there is no property you can change that will allow you to place two blocks on the same datanode.
My opinion is that placing two blocks on the same datanode defeats the purpose of HDFS. Blocks are replicated so HDFS can recover from the failures described above. If two blocks are placed on the same datanode and that datanode dies, you lose two blocks instead of one.
The answer depends on the cluster topology. Hadoop tries to distribute data among data centers and data nodes. But what if you only have one data center? Or only a single-node cluster (pseudo-distributed)? In those cases the optimal distribution doesn't happen, and it is possible that all blocks end up on the same data node. In production it is recommended to have more than one data center (physically, not only in configuration) and at least as many data nodes as the replication factor.

When I store files in HDFS, will they be replicated?

I am new to Hadoop.
When I store Excel files using the hadoop fs -put command, they are stored in HDFS.
The replication factor is 3.
My question is: does it make 3 copies and store one on each of 3 nodes?
Here is a comic explaining how HDFS works:
https://docs.google.com/file/d/0B-zw6KHOtbT4MmRkZWJjYzEtYjI3Ni00NTFjLWE0OGItYTU5OGMxYjc0N2M1/edit?pli=1
Does it make 3 copies and store one on each of 3 nodes?
The answer is: not quite — the client does not write all 3 copies itself.
Replication is done by pipelining:
the client copies part of the file to datanode1, datanode1 copies it to datanode2, and datanode2 copies it to datanode3.
http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Replication+Pipelining
see here for Replication Pipelining
Your HDFS Client (hadoop fs in this case) will be given the block names and datanode locations (the first being the closest location if the NameNode can determine this from the rack awareness script) of where to store these files by the NameNode.
The client then copies the blocks to the closest data node. That data node is then responsible for copying the block to a second datanode (preferably on another rack), and finally the second copies it to the third (on the same rack as the second).
So your client will only copy data to one of the data nodes, and the framework will take care of the replication between datanodes.
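That chain can be modeled in a few lines (an illustrative sketch with invented names, not the HDFS wire protocol): the client hands a packet only to the first DataNode, and each node persists it, then forwards it downstream.

```python
def forward(packet, pipeline, storage):
    """pipeline: ordered list of node names; storage: {node: [packets held]}."""
    if not pipeline:
        return storage
    head, rest = pipeline[0], pipeline[1:]
    storage[head].append(packet)           # this node persists the packet...
    return forward(packet, rest, storage)  # ...then forwards it downstream

store = {"dn1": [], "dn2": [], "dn3": []}
forward("blk_0_pkt0", ["dn1", "dn2", "dn3"], store)
print(store)
# -> {'dn1': ['blk_0_pkt0'], 'dn2': ['blk_0_pkt0'], 'dn3': ['blk_0_pkt0']}
```

The key design point: the client pays the network cost of one copy, and the inter-datanode hops happen in parallel with the ongoing write.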
It will store the original file as one block (or more, in the case of large files). These blocks will be replicated to two other nodes.
Edit: My answer applies to Hadoop 2.2.0. I have no experience with prior versions.
Yes, it will be replicated to 3 nodes (at most 3 nodes).
The Hadoop client is going to break the data file into smaller blocks and place those blocks on different machines throughout the cluster. The more blocks you have, the more machines will be able to work on this data in parallel. At the same time, these machines may be prone to failure, so it is safest to ensure that every block of data is on multiple machines at once to avoid data loss.
So each block will be replicated in the cluster as it is loaded. The standard setting for Hadoop is to keep three copies of each block in the cluster. This can be configured with the dfs.replication parameter in the hdfs-site.xml file.
And replicating data is not a drawback of Hadoop at all; in fact, it is an integral part of what makes Hadoop effective. Not only does it provide a good degree of fault tolerance, it also helps run your map tasks close to the data to avoid putting extra load on the network (read about data locality).
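For reference, an explicit dfs.replication override in hdfs-site.xml looks like this (the value 3 is also the shipped default; the description text here is paraphrased):

```xml
<property>
  <name>dfs.replication</name>
  <value>3</value>
  <description>Default number of replicas kept for each block.</description>
</property>
```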
Yes, it makes n (the replication factor) copies in HDFS.
Use this command to find the location of a file, which racks its blocks are stored on, and the block names:
hadoop fsck /path/to/your/directory -files -blocks -locations -racks
Use this command to load data into hdfs with replication
hadoop fs -Ddfs.replication=1 -put big.file /tmp/test1.file
With -Ddfs.replication=1 you can define the number of replica copies created while loading data into HDFS.

Adding a new volume to a pseudo-distributed Hadoop node failing silently

I'm attempting to add a new volume to a Hadoop pseudo-distributed node by adding the location of the volume to dfs.name.dir in hdfs-site.xml, and I can see the lock file in this location. But try as I might, when I load files (using Hive) these locations are hardly used (even though the lock files and some sub-folders appear, so Hadoop clearly has access to them). When the main volume comes close to running out of space, I get the following exception:
Failed with exception java.io.IOException: File /tmp/hive-ubuntu/hive_2011-02-24_15-39-15_997_1889807000233475717/-ext-10000/test.csv could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1417)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:643)
Any pointers on how to add new volumes to Hadoop? FWIW, I'm using EC2.
There are a few things you can do, according to the FAQ:
1. Manually copy files in HDFS to a new name, delete the old files, then rename the new files to be what they were originally.
2. Increase the replication factor temporarily, setting it back once blocks have balanced out between nodes.
3. Remove the full node, wait for its blocks to replicate to the other nodes, then bring it back up. This doesn't really help because your full node is still full when you bring it back online.
4. Run the rebalancer script on the head node.
I'd try running #4 first, then #2.
When adding new disks / capacity to a data node Hadoop does not guarantee that the disks will be load balanced fairly (Ex: It will not put more blocks on drives with more free space). The best way I have solved this is to increase the replication factor (Ex: From 2 to 3).
hadoop fs -setrep 3 -R /<path>
Watch the 'under replicated blocks' report on the name node. As soon as this reaches 0, decrease the replication factor (Ex: From 3 to 2). This will randomly delete replicas from the system which should balance out the local node.
hadoop fs -setrep 2 -R /<path>
It's not going to be 100% balanced, but it should be in a lot better shape than it was before. This is covered in the Hadoop wiki to some extent. If you are running pseudo-distributed and have no other data nodes, the balancer script will not help you.
http://wiki.apache.org/hadoop/FAQ#If_I_add_new_DataNodes_to_the_cluster_will_HDFS_move_the_blocks_to_the_newly_added_nodes_in_order_to_balance_disk_space_utilization_between_the_nodes.3F
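The raise-then-lower setrep trick can be illustrated with a toy model (this is not Hadoop code; the node names, block names, and the "least-loaded node" heuristic are invented for the sketch): raising the factor forces new replicas onto the emptier node, and lowering it again deletes arbitrary replicas, leaving blocks spread more evenly.

```python
import random

def setrep(block_map, nodes, factor):
    """block_map: {block: set(holding nodes)}. Adjust each block to
    `factor` replicas, adding on the least-loaded node and dropping
    an arbitrary replica when shrinking."""
    for block, holders in block_map.items():
        while len(holders) < factor:
            load = {n: sum(n in h for h in block_map.values()) for n in nodes}
            candidates = [n for n in nodes if n not in holders]
            holders.add(min(candidates, key=lambda n: load[n]))
        while len(holders) > factor:
            holders.discard(random.choice(sorted(holders)))
    return block_map

# dn3 is a newly added, empty node; every block lives on dn1 and dn2.
blocks = {f"blk_{i}": {"dn1", "dn2"} for i in range(4)}
setrep(blocks, ["dn1", "dn2", "dn3"], 3)  # raise: dn3 receives a copy of each block
setrep(blocks, ["dn1", "dn2", "dn3"], 2)  # lower: random replicas are deleted
print(sum("dn3" in h for h in blocks.values()), "blocks now have a replica on dn3")
```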
