I am a Java programmer learning Hadoop.
I read that the NameNode in HDFS stores its information in two files, namely fsimage and editlog. On startup it reads this data from disk and performs a checkpoint operation.
But in many places I also read that the NameNode stores the data in RAM, and that is why Apache recommends a machine with lots of RAM for the NameNode server.
Please enlighten me on this.
What data does it store in RAM, and where does it store the fsimage and edit log?
Sorry if I asked anything obvious.
Let me first answer:
What data does it store in RAM, and where does it store the fsimage and edit log?
In RAM -- the file-to-block and block-to-DataNode mappings.
In persistent storage (both the edit log and the fsimage) -- file-related metadata (permissions, names, and so on).
Regarding the storage location of the fsimage and editlog, #mashuai's answer is spot on.
For a more detailed discussion you can read up on this
When the NameNode starts, it loads the fsimage from persistent storage (disk); its location is specified by the property dfs.name.dir (Hadoop 1.x) or dfs.namenode.name.dir (Hadoop 2.x) in hdfs-site.xml. The fsimage is loaded into main memory, and, as you noted, the NameNode performs a checkpoint operation during startup. The NameNode keeps the fsimage in RAM in order to serve requests quickly.
Apart from the initial checkpoint, subsequent checkpoints can be controlled by tuning the following parameters in hdfs-site.xml:
dfs.namenode.checkpoint.period # interval between checkpoints, in seconds (3600 by default)
dfs.namenode.checkpoint.txns # number of uncheckpointed NameNode transactions that triggers a checkpoint
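As an illustration (not from the original answer), these two properties would be set in hdfs-site.xml roughly like this; the values shown are the usual defaults:

```xml
<configuration>
  <!-- Checkpoint at least once an hour (value in seconds) -->
  <property>
    <name>dfs.namenode.checkpoint.period</name>
    <value>3600</value>
  </property>
  <!-- ...or sooner, once this many uncheckpointed transactions accumulate -->
  <property>
    <name>dfs.namenode.checkpoint.txns</name>
    <value>1000000</value>
  </property>
</configuration>
```

Whichever threshold is reached first triggers the checkpoint.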
It stores the fsimage and editlog in the directory given by dfs.name.dir, which is set in hdfs-site.xml. When you start the cluster, the NameNode loads the fsimage and editlog into memory.
When the NameNode starts, it goes into safe mode. It loads the fsimage from persistent storage and replays the edit log to create an updated view of HDFS storage (the FILE TO BLOCK MAPPING), then writes this updated fsimage back to persistent storage. The NameNode then waits for block reports from the DataNodes; from these reports it creates the BLOCK TO DATANODE MAPPING. Once the NameNode has received a certain threshold of block reports, it leaves safe mode and can start serving client requests. Whenever a client changes any metadata, the NameNode first writes the change to an edit log segment, with an increasing transaction ID, on persistent storage (hard disk). Only then does it update the fsimage held in its RAM.
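The startup sequence described above can be sketched as a toy model; everything here (function names, record shapes, the threshold) is illustrative, not real HDFS code:

```python
# Illustrative sketch of NameNode startup; names are hypothetical, not HDFS APIs.

def start_namenode(fsimage, edit_log, block_reports, report_threshold=0.999):
    # 1. Safe mode: load the last checkpointed namespace from disk.
    namespace = dict(fsimage)           # file path -> list of block IDs

    # 2. Replay the edit log, in transaction-ID order, on top of the fsimage.
    for txn in sorted(edit_log, key=lambda t: t["txid"]):
        if txn["op"] == "add":
            namespace[txn["path"]] = txn["blocks"]
        elif txn["op"] == "delete":
            namespace.pop(txn["path"], None)

    # 3. Build the block -> DataNode mapping from DataNode block reports.
    #    This mapping is never persisted; it lives only in RAM.
    block_locations = {}
    for datanode, blocks in block_reports.items():
        for b in blocks:
            block_locations.setdefault(b, []).append(datanode)

    # 4. Leave safe mode once enough of the expected blocks have been reported.
    expected = {b for blocks in namespace.values() for b in blocks}
    reported = expected & set(block_locations)
    in_safe_mode = len(reported) < report_threshold * len(expected)
    return namespace, block_locations, in_safe_mode
```

A real NameNode tracks far more state, but the order of operations is the point: fsimage first, edit log replay second, block reports last.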
The fsimage and editlog are stored in the directory given by dfs.name.dir, set in hdfs-site.xml.
During cluster startup, the NameNode loads the fsimage and editlog into memory (RAM).
Related
When saving a file on HDFS, the file is split into blocks, the blocks are stored accordingly, and the change is recorded in the edit log; that much is fine.
My question is: when I request a read operation from the NameNode, where does it look up the DataNode details?
From the fsimage or the edit log?
If it reads from the fsimage, note that a new fsimage is only generated at one-hour intervals.
If I request a read before that interval has elapsed, what happens?
Let's break down where each bit of information about the filesystem is stored on the NameNode.
The filesystem namespace (hierarchy of directories and files) is stored entirely in memory on the NameNode. There is no on-disk caching. Everything is in memory at all times. The FsImage is used only for persistence in case of failure. It is read only on startup. The EditLog stores changes to the FsImage; again, the EditLog is read only on startup. The active NameNode will never read an FsImage or EditLog during normal operation. However, a BackupNode or Standby NameNode (depending on your configuration) will periodically combine new EditLog entries with an old FsImage to produce a new FsImage. This is done to make startup more rapid and to reduce the size of on-disk data structures (if no compaction was done, the size of the EditLog would grow indefinitely).
The namespace discussed above includes the mapping from a file to the blocks contained within that file. This information is persisted in the FsImage / EditLog. However, the location of those blocks is not persisted into the FsImage. This information lives only transiently in the memory of the NameNode. On startup, the location of the blocks is reconstructed using the block reports received from all of the DataNodes. Each DataNode essentially tells the NameNode, "I have block ID AAA, BBB, CCC, ..." and so on, and the NameNode uses these reports to construct the location of all blocks.
To answer your question simply, when you request a read operation from the NameNode, all information is read from memory. Disk I/O is only performed on a write operation, to persist the change to the EditLog.
Primary Source: HDFS Architecture Guide; also I am a contributor to HDFS core code.
Why can't the metadata be stored in HDFS with 3x replication? Why is it stored on the local disk?
Because the NameNode would need several I/O operations for every metadata lookup, slowing down resource allocation. So it's better to keep the metadata in the NameNode's memory.
There are multiple reasons:
1. If it were stored on HDFS, there would be network I/O, which is slower.
2. The NameNode would depend on DataNodes for its own metadata.
3. Metadata about the metadata would also be required, so that the NameNode could identify where the metadata lives on HDFS.
METADATA is data about the data, such as which rack a block is stored in, so that the block can be located. If the metadata were stored in HDFS and the DataNodes holding it failed, you would lose all your data, because you would no longer know how to access the blocks where your data was stored.
Even with a higher replication factor, for every change on a DataNode the change has to be recorded both in the DataNode's replicas and in the NameNode's edit log.
And if we kept 3 replicas of the NameNode metadata, every change on a DataNode would first have to be written to:
1. Its own replica blocks
2. The NameNode and the replicas of the NameNode metadata (the edit log would be edited 3 times)
This would mean writing more data than before, but storage is not the only or even the major problem; the main problem is the time required to do all these operations.
Therefore the NameNode metadata is backed up to a remote disk, so that even if your whole cluster fails (however unlikely) you can always recover your data.
To guard against NameNode failure, Hadoop comes with:
Primary NameNode -> holds the namespace image and edit logs.
Secondary NameNode -> merges the namespace image and edit logs so that the edit logs don't grow too large.
After reading the Apache Hadoop documentation, I have some confusion about the responsibilities of the Secondary NameNode and the Checkpoint node.
I am clear on the NameNode's role and responsibilities:
The NameNode stores modifications to the file system as a log appended to a native file system file, edits. When a NameNode starts up, it reads HDFS state from an image file, fsimage, and then applies edits from the edits log file. It then writes new HDFS state to the fsimage and starts normal operation with an empty edits file. Since NameNode merges fsimage and edits files only during start up, the edits log file could get very large over time on a busy cluster. Another side effect of a larger edits file is that next restart of NameNode takes longer.
But I am confused about the respective responsibilities of the Secondary NameNode and the Checkpoint node.
Secondary NameNode:
The secondary NameNode merges the fsimage and the edits log files periodically and keeps edits log size within a limit. It is usually run on a different machine than the primary NameNode since its memory requirements are on the same order as the primary NameNode.
Check point node:
The Checkpoint node periodically creates checkpoints of the namespace. It downloads fsimage and edits from the active NameNode, merges them locally, and uploads the new image back to the active NameNode. The Checkpoint node usually runs on a different machine than the NameNode since its memory requirements are on the same order as the NameNode. The Checkpoint node is started by bin/hdfs namenode -checkpoint on the node specified in the configuration file.
The division of responsibility between the Secondary NameNode and the Checkpoint node does not seem clear. Both work on the edits file, so which one finally modifies it?
On a different note, I have filed two JIRA issues to remove the ambiguity in these concepts:
issues.apache.org/jira/browse/HDFS-8913
issues.apache.org/jira/browse/HDFS-8914
NameNode(Primary)
The NameNode stores the metadata of HDFS. The state of HDFS is stored in a file called fsimage, which is the base of the metadata. During runtime, modifications are just written to a log file called edits. On the next start-up of the NameNode, the state is read from fsimage, the changes from edits are applied on top, and the new state is written back to fsimage. After this, edits is cleared and is ready for new log entries.
Checkpoint Node
A Checkpoint Node was introduced to address a drawback of the NameNode: changes are only written to edits and are not merged into fsimage at runtime. If the NameNode runs for a while, edits gets huge and the next startup takes even longer, because more changes have to be applied to reconstruct the latest state of the metadata.
The Checkpoint Node periodically fetches fsimage and edits from the NameNode and merges them. The resulting state is called a checkpoint. After this it uploads the result to the NameNode.
There was also a similar type of node called the “Secondary NameNode”, but it doesn’t have the “upload to NameNode” feature, so the NameNode has to fetch the state from the Secondary NameNode itself. The name is also confusing because it suggests that the Secondary NameNode takes over requests if the NameNode fails, which isn’t the case.
Backup Node
The Backup Node provides the same functionality as the Checkpoint Node, but is kept synchronized with the NameNode. It doesn’t need to fetch the changes periodically because it receives a stream of file system edits from the NameNode. It holds the current state in memory and only needs to save this state to an image file to create a new checkpoint.
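A toy sketch may help contrast the two (all class and method names here are made up, not HDFS APIs): the Checkpoint Node pulls and merges on a schedule, while the Backup Node applies streamed edits and can checkpoint straight from memory.

```python
# Illustrative contrast between a Checkpoint Node and a Backup Node.
# Classes and method names are hypothetical, not real HDFS APIs.

def merge(fsimage, edits):
    """Replay edit records (path, blocks) on top of an fsimage snapshot."""
    state = dict(fsimage)
    for path, blocks in edits:
        if blocks is None:
            state.pop(path, None)   # deletion
        else:
            state[path] = blocks    # create/update
    return state

class CheckpointNode:
    """Periodically pulls fsimage + edits, merges, and pushes the result back."""
    def run_checkpoint(self, namenode):
        fsimage, edits = namenode.download_image_and_edits()
        checkpoint = merge(fsimage, edits)
        namenode.upload_image(checkpoint)   # NameNode can now truncate edits
        return checkpoint

class BackupNode:
    """Receives the edit stream live, so its in-memory state is always current."""
    def __init__(self, fsimage):
        self.state = dict(fsimage)
    def on_edit(self, path, blocks):        # called for each streamed edit
        if blocks is None:
            self.state.pop(path, None)
        else:
            self.state[path] = blocks
    def checkpoint(self):
        return dict(self.state)             # just dump memory; no download/merge
```

Both arrive at the same checkpoint; the Backup Node simply skips the download-and-merge step.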
NameNode - It is also known as the master node. The NameNode stores metadata, i.e. the number of blocks, their locations, replicas, and other details. This metadata is kept in memory on the master for faster retrieval. The NameNode maintains and manages the slave nodes and assigns tasks to them. It should be deployed on reliable hardware, as it is the centerpiece of HDFS. The NameNode holds its namespace using two files:
FsImage: FsImage is an “image file”. It contains the entire filesystem namespace, stored as a file in the NameNode’s local file system.
EditLogs: Contains all the recent modifications made to the file system since the most recent FsImage.
Checkpoint node - A Checkpoint node periodically creates checkpoints of the namespace. It first downloads fsimage and edits from the active NameNode, then merges them locally, and finally uploads the new image back to the active NameNode. The Checkpoint node stores the latest checkpoint in a directory structured in the same way as the NameNode’s directory, which keeps the checkpointed image available for reading by the NameNode.
Backup node - The Backup node provides the same checkpointing functionality as the Checkpoint node. It keeps an in-memory, up-to-date copy of the file system namespace, which is always synchronized with the active NameNode state. The Backup node does not need to download fsimage and edits files from the active NameNode to create a checkpoint, as a Checkpoint node or Secondary NameNode would, since it already has an up-to-date state of the namespace in memory. The Backup node’s checkpoint process is more efficient, as it only needs to save the namespace to the local fsimage file and reset edits. The NameNode supports one Backup node at a time, and no Checkpoint nodes may be registered while a Backup node is in use.
In the case of the NameNode, what gets stored in main memory and what gets stored in secondary memory (hard disk)?
What do we mean by "file to block mapping"?
What exactly are fsimage and edit logs?
In the case of the NameNode, what gets stored in main memory and what gets stored in secondary memory (hard disk)?
The file-to-block mapping, the locations of blocks on DataNodes, the set of active DataNodes, and a bunch of other metadata are all stored in memory on the NameNode. When you check the NameNode status website, pretty much all of that information comes from memory.
The only things stored on disk are the fsimage, the edit log, and status logs. It's interesting to note that the NameNode never really uses these files on disk, except when it starts. The fsimage and edits files pretty much only exist to bring the NameNode back up if it needs to be stopped or it crashes.
What do we mean by "file to block mapping"?
When a file is put into HDFS, it is split into blocks (of configurable size). Let's say you have a file called "file.txt" that is 201 MB and your block size is 64 MB. You will end up with three 64 MB blocks and one 9 MB block (64+64+64+9 = 201). The NameNode keeps track of the fact that "file.txt" in HDFS maps to these four blocks. DataNodes store blocks, not files, so the mapping is important for understanding where your data is and what your data is.
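The arithmetic in this example can be sketched as a tiny helper (illustrative only; real HDFS works in bytes, not megabytes, and the function name is made up):

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the block sizes HDFS would create for a file of the given size."""
    full, remainder = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * full
    if remainder:
        blocks.append(remainder)   # the last block holds the leftover data
    return blocks

# A 201 MB file with 64 MB blocks -> three full blocks plus a 9 MB block.
print(split_into_blocks(201))  # [64, 64, 64, 9]
```

The NameNode's file-to-block mapping records which block IDs make up "file.txt"; the sizes here just show how the file is carved up.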
What exactly are fsimage and edit logs?
A recent checkpoint of the memory of the NameNode is stored in the fsimage. The NameNode's state (i.e. file->block mapping, file properties, etc.) from that checkpoint can be restored from this file.
The edits file holds all the new updates since the last fsimage checkpoint; these are things like a file being deleted or added. This is important if your NameNode goes down, as edits has the most recent changes since the last checkpoint stored in fsimage. The way the NameNode comes up is that it materializes the fsimage into memory and then applies the edits in the order they appear in the edits file.
fsimage and edits exist the way they do because editing the potentially massive fsimage file every time an HDFS operation is done would be hard on the system. Instead, the edits file is simply appended to. However, for the sake of NameNode startup time and on-disk storage, rolling the edits into the fsimage every now and then is a good thing.
The SecondaryNameNode is the process that periodically takes the fsimage and edits files and merges them into a new checkpointed fsimage file. This is an important process that prevents edits from growing huge.
Hadoop is consistent and partition tolerant, i.e. it falls under the CP category of the CAP theorem.
Hadoop is not available because all the nodes depend on the NameNode. If the NameNode goes down, the cluster goes down.
But considering that an HDFS cluster has a Secondary NameNode, why can't we call Hadoop available? If the NameNode is down, couldn't the Secondary NameNode be used for writes?
What is the major difference between the NameNode and the Secondary NameNode that makes Hadoop unavailable?
Thanks in advance.
The NameNode stores the HDFS filesystem information in a file named fsimage. Updates to the file system (adding/removing blocks) do not update the fsimage file; instead they are logged to a separate file, so the I/O is fast, append-only streaming as opposed to random file writes. When restarting, the NameNode reads the fsimage and then applies all the changes from the log file to bring the filesystem state up to date in memory. This process takes time.
The Secondary NameNode's job is not to act as a secondary to the NameNode, but only to periodically read the filesystem change log and apply the changes to the fsimage file, thus bringing it up to date. This allows the NameNode to start up faster the next time.
Unfortunately, the secondarynamenode service is not a standby namenode, despite its name. Specifically, it does not offer HA for the NameNode. This is well illustrated here.
See Understanding NameNode Startup Operations in HDFS.
Note that more recent distributions (currently Hadoop 2.6) introduce NameNode High Availability using NFS (shared storage) and/or NameNode High Availability using the Quorum Journal Manager.
Things have changed over the years, especially with Hadoop 2.x. The NameNode is now highly available with a failover feature.
The Secondary NameNode is optional now, and a Standby NameNode is used for the failover process.
The Standby NameNode stays up to date with all the file system changes the Active NameNode makes.
HDFS High Availability is possible with two options, NFS and the Quorum Journal Manager, with the Quorum Journal Manager being the preferred one.
Have a look at the Apache documentation.
From Slide 8 from : http://www.slideshare.net/cloudera/hdfs-futures-world2012-widescreen
When any namespace modification is performed by the Active node, it durably logs a record of the modification to a majority of these JNs. The Standby node reads these edits from the JNs and applies them to its own namespace.
In the event of a failover, the Standby will ensure that it has read all of the edits from the JournalNodes before promoting itself to the Active state. This ensures that the namespace state is fully synchronized before a failover occurs.
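As a rough illustration of the quorum logic (this is not the actual QJM protocol; all names are invented), an edit succeeds only when a majority of JournalNodes acknowledge it, and a promoting Standby first applies every edit it can find on the JournalNodes:

```python
# Illustrative sketch of Quorum Journal Manager behaviour; not the real protocol.

class JournalNode:
    def __init__(self):
        self.edits = []
    def write(self, txn):
        self.edits.append(txn)

def log_edit(journal_nodes, txn, reachable=None):
    """Active NN durably logs txn; succeeds only with a majority of acks."""
    acks = 0
    for i, jn in enumerate(journal_nodes):
        if reachable is None or i in reachable:   # simulate unreachable JNs
            jn.write(txn)
            acks += 1
    if acks <= len(journal_nodes) // 2:
        raise IOError("could not reach a quorum of JournalNodes")

def promote_standby(journal_nodes, standby_state):
    """Standby applies every edit found on the JournalNodes before going active."""
    all_edits = {t for jn in journal_nodes for t in jn.edits}
    for txn in sorted(all_edits):
        standby_state.add(txn)    # apply to the standby's namespace
    return standby_state
```

Because every successful edit reached a majority of JournalNodes, reading all JournalNodes during promotion guarantees the Standby sees every acknowledged transaction.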
Have a look at the failover process in this related SE question:
How does Hadoop Namenode failover process works?
Regarding your queries on CAP theory for Hadoop:
It can be strongly consistent.
HDFS is almost highly available, unless you are very unlucky (if all three replicas of a block are down, you won't get the data).
It is partition tolerant.
The NameNode is the primary node, in which all the metadata is stored in the fsimage and editlog files periodically. When the NameNode goes down, the Secondary NameNode comes online, but it only has read access to the fsimage and editlog files, not write access. All operations during this period are stored in a temp folder; when the NameNode comes back online, the temp folder is copied to the NameNode, and the NameNode updates the fsimage and editlog files.
Even in HDFS High Availability, where there are two NameNodes instead of one NameNode and one SecondaryNameNode, there is no availability in the strict CAP sense. It only applies to the NameNode component, and even there, if a network partition separates the client from both of the NameNodes, the cluster is effectively unavailable.
To explain it simply, think of the NameNode as a person (working/live) and the Secondary NameNode as an ATM machine (data storage).
All functions are carried out by the NameNode (the person) only; if it goes down or fails, the Secondary NameNode is of no immediate use, but it can later be used to recover your data or logs.
When the NameNode starts, it loads the fsimage and replays the edit log to build the latest, up-to-date namespace. This process may take a long time if the edit log file is big, which increases startup time.
The job of the Secondary NameNode is to periodically read the edit log, replay it to create an updated fsimage, and store that in persistent storage. The NameNode then doesn't need to replay the whole edit log at startup to build an updated fsimage; it uses the fsimage created by the Secondary NameNode.
The NameNode is a master node that holds metadata in the form of the fsimage and also the edit log. The edit log contains recently added/removed block information in the NameNode's namespace. The fsimage file contains the metadata of the entire Hadoop system in permanent storage. To make the edit log's changes permanent in the fsimage, the NameNode has to be restarted so that the edit log information can be folded in, and that takes a lot of time.
A Secondary NameNode is used to bring the fsimage up to date. It reads the edit log and applies the changes to the fsimage permanently, so that next time the NameNode can start up faster.
Basically, the Secondary NameNode is a helper for the NameNode, performing housekeeping on its behalf.