I am new to Hadoop. I want to know the difference between a snapshot and the fsimage, both of which are used for file system state in Hadoop. I heard that both do the same work, so what is the difference between them?
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. Any change to the file system namespace or its properties is recorded by the NameNode. The NameNode uses a transaction log called the EditLog to persistently record every change that occurs to file system metadata. For example, creating a new file in HDFS or changing a file's replication factor causes the NameNode to insert a record into the EditLog. The NameNode uses a file in its local host OS file system to store the EditLog.
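For instance (these paths are just illustrative), each of the following client operations results in a new transaction being appended to the EditLog:

    # creating a directory and an empty file are both namespace changes
    hdfs dfs -mkdir -p /user/demo
    hdfs dfs -touchz /user/demo/empty.txt

    # changing the replication factor is a metadata-only change as well
    hdfs dfs -setrep 2 /user/demo/empty.txt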
FsImage and EditLog go hand in hand, which is why I started with that explanation. Now:
The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage. The FsImage is stored as a file in the NameNode’s local file system.
Snapshots store a read-only copy of the file system state at a particular instant in time. A snapshot can also be taken of the entire file system. Taking a snapshot does not involve copying data; only metadata such as file sizes and block lists is recorded under the snapshottable directory.
In plain terms, the FsImage stores the information about where the data is stored, in how many blocks, and related metadata, while a snapshot is a read-only, point-in-time image of the data/file system.
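To make that concrete, this is roughly how snapshots are used from the command line (the /data path is only an example):

    # an administrator first marks the directory as snapshottable
    hdfs dfsadmin -allowSnapshot /data

    # take a named, read-only, point-in-time snapshot; no block data is copied
    hdfs dfs -createSnapshot /data snap-before-cleanup

    # the snapshot contents are exposed under the hidden .snapshot directory
    hdfs dfs -ls /data/.snapshot/snap-before-cleanup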
I hope this explains the difference.
The fsimage doesn't store the mapping of blocks to DataNodes, right? That mapping is kept in the in-memory block map and is rebuilt from DataNode block reports every time the NameNode restarts.
I've recently been setting up Hadoop in pseudo-distributed mode, and I created data and loaded it into HDFS. Later I formatted the namenode because of a problem. Now the directories and files that were already on the datanodes don't show up anymore. (The word "formatting" makes sense, though.) But I now have a doubt: since the namenode no longer holds the metadata of the files, is access to the previously loaded files cut off? If so, how do we delete the data that is already on the datanodes?
Your previous datanode directories are now stale, yes.
You need to manually go through each datanode and delete the contents of those directories; there is no such format command in the Hadoop CLI. A command-line sketch follows below.
By default, the datanode data directory is a single folder under /tmp.
Otherwise, your XML configuration files (hdfs-site.xml, via dfs.datanode.data.dir) specify where the data is stored.
Where HDFS stores data
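A sketch of that cleanup for a default pseudo-distributed setup (adjust the path to whatever your configuration actually returns):

    # ask Hadoop where the datanode keeps its block data
    hdfs getconf -confKey dfs.datanode.data.dir

    # stop HDFS, then wipe that directory's contents on each datanode so the
    # datanodes can re-register cleanly with the freshly formatted namenode
    stop-dfs.sh
    rm -rf /tmp/hadoop-$USER/dfs/data/*
    start-dfs.sh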
Why can't HDFS client send directly to DataNode?
What's the advantage of HDFS client cache?
An application request to create a file does not reach the NameNode immediately.
In fact, initially the HDFS client caches the file data into a temporary local file.
Application writes are transparently redirected to this temporary local file.
When the local file accumulates data worth at least one HDFS block size, the client contacts the NameNode to create a file.
The NameNode then proceeds as described in the section on Create. The client flushes the block of data from the local temporary file to the specified DataNodes.
When a file is closed, the remaining un-flushed data in the temporary local file is transferred to the DataNode.
The client then tells the NameNode that the file is closed.
At this point, the NameNode commits the file creation operation into a persistent store. If the NameNode dies before the file is closed, the file is lost.
It sounds like you are referencing the Apache Hadoop HDFS Architecture documentation, specifically the section titled Staging. Unfortunately, this information is out-of-date and no longer an accurate description of current HDFS behavior.
Instead, the client immediately issues a create RPC call to the NameNode. The NameNode tracks the new file in its metadata and replies with a set of candidate DataNode addresses that can receive writes of block data. Then the client starts writing data to the file. As the client writes, it is writing over a socket connection to a DataNode. If the written data crosses a block size boundary, the client interacts with the NameNode again, issuing an addBlock RPC to allocate a new block in NameNode metadata and get a new set of candidate DataNode locations. At no point does the client write to a local temporary file.
Note, however, that alternative file systems, such as S3AFileSystem, which integrates with Amazon S3, may support options for buffering to disk. (See the Apache Hadoop documentation for Integration with Amazon Web Services if you're interested in more details on this.)
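For example (the bucket name and paths are hypothetical), S3A's upload buffering can be pointed at local disk through the fs.s3a.fast.upload.buffer option:

    # buffer pending S3 uploads on local disk instead of in memory;
    # this S3A option accepts disk, array or bytebuffer
    hadoop fs -D fs.s3a.fast.upload.buffer=disk \
              -D fs.s3a.buffer.dir=/tmp/s3a \
              -put localfile.bin s3a://my-example-bucket/data/localfile.bin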
I have filed Apache JIRA HDFS-11995 to track correcting the documentation.
I understand that the NameNode doesn't keep the block locations of files in the FSImage; it keeps all that information in RAM.
So what do the FSImage and edit log files actually contain?
thanks
basam
The FSImage is a snapshot of the cluster's metadata at some point in time, and a copy of this snapshot is kept in RAM. If you make any change to the cluster's metadata, such as creating or deleting a file in HDFS, that change is captured in the edit logs. The edit logs and the FSImage are merged periodically so that the FSImage always holds up-to-date metadata. So when you restart your cluster for any reason, the NameNode applies all the transactions from the EditLog to the in-memory representation of the FsImage.
The NameNode in Hadoop does not persist the block-location information. It is kept in memory, and on startup the DataNodes report their block lists to the NameNode.
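One way to see for yourself what these files hold is Hadoop's offline image and edits viewers (the exact transaction-numbered file names under the NameNode metadata directory will differ on your cluster):

    # locate the NameNode metadata directory; fsimage_* and edits_* files
    # live under its "current" subdirectory
    hdfs getconf -confKey dfs.namenode.name.dir

    # dump an fsimage (the namespace: inodes and per-file block lists) to XML
    hdfs oiv -p XML -i fsimage_0000000000000000042 -o fsimage.xml

    # dump an edit log segment (individual metadata transactions) to XML
    hdfs oev -p xml -i edits_0000000000000000043-0000000000000000099 -o edits.xml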
If I copyFromLocal a file to HDFS, it is transferred to HDFS, because I can see it with "hadoop fs -ls".
I was wondering how Hadoop knows which filename corresponds to which blocks.
The NameNode maintains a file system image (fsimage) that stores the mapping between files and blocks. It also stores an edit log that records any changes to the file system. The secondary NameNode periodically reads the fsimage and the edit log from the NameNode and combines them to create a new fsimage for the NameNode.
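If you want to see that mapping for a particular file, fsck prints the block IDs and the DataNodes currently holding each replica (the path below is just an example):

    # list the blocks that make up one file, plus the DataNodes storing them
    hdfs fsck /user/demo/somefile.bin -files -blocks -locations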
Hadoop writes intermediate results to the local disk and the results of the reducer to HDFS. What does HDFS mean? What does it physically translate to?
HDFS is the Hadoop Distributed File System. Physically, it is a program running on each node of the cluster that provides a file system interface very similar to that of a local file system. However, data written to HDFS is not just stored on the local disk but rather is distributed on disks across the cluster. Data stored in HDFS is typically also replicated, so the same block of data may appear on multiple nodes in the cluster. This provides reliable access so that one node's crashing or being busy will not prevent someone from being able to read any particular block of data from HDFS.
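A quick way to see the "distributed and replicated" part on a running cluster (the file path is only an example):

    # overall and per-DataNode capacity, showing data spread across the nodes
    hdfs dfsadmin -report

    # print the replication factor (%r) of a single file
    hdfs dfs -stat %r /user/demo/somefile.bin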
Check out http://en.wikipedia.org/wiki/Hadoop_Distributed_File_System#Hadoop_Distributed_File_System for more information.
As Chase indicated, HDFS is Hadoop Distributed File System.
If I may, I recommend this tutorial and video on how HDFS and the Map/Reduce framework work; it will serve as a guide into the world of Hadoop: http://www.cloudera.com/resource/introduction-to-apache-mapreduce-and-hdfs/