What will happen if the HDFS client fails to upload the file? - hadoop

In my understanding, uploading a file to HDFS (the filesystem for Apache Hadoop) follows the procedure below:
1. The client (hdfs shell) asks the NameNode which DataNodes to put the data chunks on.
2. The NameNode answers and records the file's location and some metadata in itself.
3. The client puts the data chunks on the given DataNodes.
Suppose that in step 1 the NameNode returned a DataNode to store the data on, but afterwards that DataNode became unavailable for some reason (e.g. a network failure or a machine outage). The data then cannot be saved on the DataNode, yet the metadata is already stored in the NameNode, so the data ends up in an inconsistent state.
Can someone explain how HDFS avoids this situation? I tried to read the Hadoop source code but finally gave up because it's huge.
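As an aside, the client-side behaviour in this failure case is governed by a handful of configuration properties. A minimal way to check them on your own cluster (property names are as of Hadoop 2.x; consult your version's hdfs-default.xml):

    # whether the client replaces a failed DataNode in an ongoing write pipeline
    hdfs getconf -confKey dfs.client.block.write.replace-datanode-on-failure.enable
    # the policy used when deciding to pick a replacement DataNode
    hdfs getconf -confKey dfs.client.block.write.replace-datanode-on-failure.policy
    # the minimum number of replicas a block needs before the write is considered successful
    hdfs getconf -confKey dfs.namenode.replication.min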

Related

Corrupted block in hdfs cluster

The screenshot below shows the output of hdfs fsck /. It shows that the "/" directory is corrupted. This is the master node of my Hadoop cluster. What should I do?
If you are using Hadoop 2, you can run a standby NameNode to achieve high availability. Without that, your cluster's master is a single point of failure.
You cannot recover the NameNode's data from anywhere else, since it is different from the ordinary data you store. If your NameNode goes down, your blocks and files will still be there, but you won't be able to access them, because the corresponding metadata in the NameNode is gone.
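To dig into the corruption itself (commands assume a Hadoop 2.x shell; the file path is a made-up example), fsck can list the affected files and, as a last resort, remove them:

    # list only the files that have corrupt or missing blocks
    hdfs fsck / -list-corruptfileblocks
    # block-level detail for one suspect file (hypothetical path)
    hdfs fsck /user/test/file.txt -files -blocks -locations
    # last resort: delete the corrupted files so the namespace is healthy again
    hdfs fsck / -delete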

Why does the HDFS client cache the file data in a temporary local file?

Why can't the HDFS client send the data directly to the DataNode?
What's the advantage of the HDFS client cache?
An application request to create a file does not reach the NameNode immediately.
In fact, initially the HDFS client caches the file data into a temporary local file.
Application writes are transparently redirected to this temporary local file.
When the local file accumulates data worth at least one HDFS block size, the client contacts the NameNode to create a file.
The NameNode then proceeds as described in the section on Create. The client flushes the block of data from the local temporary file to the specified DataNodes.
When a file is closed, the remaining un-flushed data in the temporary local file is transferred to the DataNode.
The client then tells the NameNode that the file is closed.
At this point, the NameNode commits the file creation operation into a persistent store. If the NameNode dies before the file is closed, the file is lost.
It sounds like you are referencing the Apache Hadoop HDFS Architecture documentation, specifically the section titled Staging. Unfortunately, this information is out of date and no longer an accurate description of current HDFS behavior.
Instead, the client immediately issues a create RPC call to the NameNode. The NameNode tracks the new file in its metadata and replies with a set of candidate DataNode addresses that can receive writes of block data. Then the client starts writing data to the file. As the client writes, it is writing over a socket connection to a DataNode. If the written data becomes large enough to cross a block size boundary, the client interacts with the NameNode again, issuing an addBlock RPC to allocate a new block in the NameNode's metadata and get a new set of candidate DataNode locations. There is no point at which the client writes to a local temporary file.
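As a rough illustration of this write path (the paths below are made up, and 128 MB is simply the default block size in Hadoop 2.x), uploading a file larger than one block and then running fsck on it shows the blocks the NameNode allocated and the DataNodes that received them:

    # a ~200 MB file spans two 128 MB blocks, so the client issues addBlock
    # a second time as the write crosses the block boundary
    hdfs dfs -put big.file /user/test/big.file
    # show each allocated block and the DataNodes holding its replicas
    hdfs fsck /user/test/big.file -files -blocks -locations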
Note however that alternative file systems, such as S3AFileSystem which integrates with Amazon S3, may support options for buffering to disk. (See the Apache Hadoop documentation for Integration with Amazon Web Services if you're interested in more details on this.)
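For instance (hedged: the exact property name and default depend on your Hadoop/hadoop-aws version, and the bucket is a placeholder), the S3A connector can be asked to buffer pending uploads on local disk:

    # buffer blocks on local disk rather than in memory before uploading to S3
    hadoop fs -D fs.s3a.fast.upload.buffer=disk \
              -put big.file s3a://example-bucket/big.file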
I have filed Apache JIRA HDFS-11995 to track correcting the documentation.

Hadoop: How does NameNode know which blocks correspond to a file?

The NameNode in Hadoop does not persist the block location information; it is kept in memory, and on startup the DataNodes report which blocks they hold.
If I copyFromLocal a file to HDFS, it is transferred to HDFS, because I can see it with "hadoop fs -ls".
I was wondering how Hadoop knows which filename corresponds to which blocks.
The NameNode maintains a File System Image (fsimage) which stores the mapping between files and blocks. It also stores an edit log which records any changes to the file system. The Secondary NameNode periodically reads the File System Image and the edit log from the NameNode and merges them to create a new File System Image for the NameNode.
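If you want to see these structures concretely (the file names under the NameNode's dfs.namenode.name.dir/current directory below are placeholders; use whatever exists on your node), the offline viewers can dump both to XML:

    # dump the checkpointed namespace (files and their blocks) to XML
    hdfs oiv -p XML -i /dfs/nn/current/fsimage_0000000000000000042 -o fsimage.xml
    # dump an edit-log segment (recent namespace changes) to XML
    hdfs oev -p XML -i /dfs/nn/current/edits_0000000000000000043-0000000000000000050 -o edits.xml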

Hadoop Namenode without HDFS storage

I have installed a Hadoop cluster with 3 machines in total: 2 nodes acting as DataNodes and 1 node acting as the NameNode as well as a DataNode.
I wanted to clear up certain doubts regarding Hadoop cluster installation and architecture.
Here is a list of questions I am looking for answers to:
I uploaded a data file of around 500 MB to the cluster and then checked the HDFS report.
I noticed that the NameNode I set up is also occupying 500 MB in HDFS, along with the DataNodes, with a replication factor of 2.
The problem here is that I don't want the NameNode to store any data; in short, I don't want it to work as a DataNode, since it is also storing the file I am uploading. So what is the way to make it act only as the master node and not as a DataNode?
I tried running the command hadoop-daemon.sh stop on the NameNode to stop the DataNode services on it, but it wasn't of any help.
How much metadata does a NameNode generate for a typical file size of 1 GB? Any approximations?
Go to the conf directory inside your $HADOOP_HOME directory on your master. Edit the file named slaves and remove the entry corresponding to your NameNode. This way you are asking only the other two nodes to act as slaves, and the NameNode acts only as the master.
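A sketch of those steps, assuming a Hadoop 2.x layout (in 1.x the slaves file lives under conf/ instead of etc/hadoop/, and hadoop-daemon.sh lives under bin/ instead of sbin/):

    # on the master: remove the master's hostname from the list of slave nodes
    vi $HADOOP_HOME/etc/hadoop/slaves
    # stop the DataNode daemon that is already running on the master
    $HADOOP_HOME/sbin/hadoop-daemon.sh stop datanode
    # confirm that only the two remaining DataNodes report in
    hdfs dfsadmin -report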

Writing to local file during map phase in hadoop

Hadoop writes the intermediate results to the local disk and the results of the reducer to HDFS. What does HDFS mean? What does it physically translate to?
HDFS is the Hadoop Distributed File System. Physically, it is a program running on each node of the cluster that provides a file system interface very similar to that of a local file system. However, data written to HDFS is not just stored on the local disk but rather is distributed on disks across the cluster. Data stored in HDFS is typically also replicated, so the same block of data may appear on multiple nodes in the cluster. This provides reliable access so that one node's crashing or being busy will not prevent someone from being able to read any particular block of data from HDFS.
Check out http://en.wikipedia.org/wiki/Hadoop_Distributed_File_System#Hadoop_Distributed_File_System for more information.
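A quick way to see that distinction on a running cluster:

    # the HDFS namespace, served by the NameNode (not a directory on any single local disk)
    hadoop fs -ls /
    # the physical side: which DataNodes contribute disks and how much space each holds
    hdfs dfsadmin -report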
As Chase indicated, HDFS is Hadoop Distributed File System.
If I may, I recommend this tutorial and video on how HDFS and the Map/Reduce framework work; it will serve as a guide into the world of Hadoop: http://www.cloudera.com/resource/introduction-to-apache-mapreduce-and-hdfs/
