Use of secondary namenode in Hadoop in 2.x - hadoop

As far as i know, Hadoop 1.x had secondary namenode but was used to create an image of the primary namenode and it updates the primary namenode when it fails and again starts up. But what is the use of secondary namenode in Hadoop 2.x given that we already have a hot standby present?

As far as I know the Hadoop 2.x can be done in 2 ways:
1. With HA (High Availability Cluster): if you are setting up HA cluster then you may not need to use Secondary namenode because standby namenode keep its state synchronized with the Active namenode.
The HDFS NameNode High Availability feature enables you to run redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby.Both NameNode require the same type of hardware configuration.In HA hadoop cluster Active NameNode reads and write metadata information in Separate JournalNode.
In the event of failover, standby NameNode will ensure that its namespace is completely updated according to edit logs before it is changes to active state. So there is no need of Secondary NameNode in this Cluster Setup.
2. Without HA: you can have a hadoop setup without standby node. Then the secondary NameNode will act as you already mentioned in Hadoop 1.x

When you configure HA for NameNodes, Secondary Namenode is not used. However you can still configure HDFS without HA (with NameNode and Secondary NameNode). This part didn't change much since hadoop 1.x.

Related

Can I add standby namenode into existing Hadoop cluster (with Namenode and Secondary namenode)

I have Hadoop 2.7.2 setup where Namenode and Secondary Namenode node run together with few datanodes. After namenode failure (it was just restart) I realized that Secondary namenode is not redundant namenode as I thought.
So question is, can I make my cluster high available and add Standby namenode without deleting existing metadata from namenode?
You need a Zookeeper cluster, but yes, you can add a namenode to enable High Availability

Namenode with high availability vs zookeeper based leader selection

I am reading 2 different things in Apache Hadoop documentation and cloudera's documentation.
Based on cloudera, we should set up namenode in high availability mode, i.e.: by defining primary and secondary namenode, but based on Hadoop documentation, this should automatically taken care by zookeeper and it should decide namenode among the available datanodes.
Can anyone explain the difference and which one to use?
by defining primary and secondary namenode
There is such a thing as a "secondary namenode", but it's actually a very different thing as it's not a standby and able to become active.
There's no "vs". Namenode HA needs Zookeeper
If you read more of the Cloudera documentation it doesn't fail to mention Zookeeper.
Automatic failover adds two new components to an HDFS deployment: a ZooKeeper quorum, and the ZKFailoverController process (abbreviated as ZKFC).
Cloudera doesn't package much extras, if any, on top of the core Hadoop functions.
Regarding your question...
this should automatically taken care by zookeeper
The failover is automatic if HDFS Zookeeper properties are (manually) configured, Zookeeper is running, and the Active Namenode goes down.
among the available datanodes
The operation has nothing to do with datanodes

secondary name node functionality

Can someone explain what exactly the words in bold mean which are taken from text book? What does "state of the secondary namenode lags that of the primary " mean?
Secondary name node keeps a copy of the merged namespace image, which can be used in the event of the namenode failing. **However, the state
of the secondary namenode lags that of the primary, so in the event of total failure of the primary, data loss is almost certain.**The usual course of action in this case is to copy the namenode’s metadata files that are on NFS to the secondary and run it as the new primary.
Thanks in advance
Hadoop 1.x:
When we start ha hadoop cluster its creates a file system image which keeps the metadata information of your entire hadopp cluster. When a new entry comes into the hadoop cluster it goes to edits log. Secondary NameNode periodically reads and query the edits and retrieve the information and merge the information with fsimage. In case NameNode fails, hadoop administrator can start the hadoop cluster with the help of fsimage and edits.(during start NameNode reads the edits and fsimage so there wont be data loss)
Fsimage and edits log already keeps the updated information about file system in the form of metadata so in case of total failure of primary hadoop administrator can recover the cluster information with help of edits log and fsimage.
Hadoop 2.x:
In hadoop 1.x NameNode was a single point of failure. Failure of NameNode was downtime for your entire hadoop cluster. Planned maintenance events such as software or hardware upgrades on the NameNode machine would result in periods of cluster downtime.To overcome this issue hadoop community added High Availability feature. During the setting up of hadoop cluster you can choose which type of cluster you want.
The HDFS NameNode High Availability feature enables you to run redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby.Both NameNode require the same type of hardware configuration.
In HA configuration one NameNode will be active and other will be in standby state.The ZKFailoverController (ZKFC) is a ZooKeeper client that monitors and manages the state of the NameNode. When active NameNode goes down, It makes standby as active NameNode, and primary NameNode will become standby when you start them. Please can get more on it on this website: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.8.0/bk_system-admin-guide/content/ch_hadoop-ha-5.html
In HA hadoop cluster Active NameNode reads and write metadata information in JournalNode(Quorum-based Storage only). JournalNode is a separate node in HA hadoop cluster used for reads and write edits log and fsimage.
Standby NameNodealways synchronized with active NameNode, both communicate with each other through Journal Node. When any namespace modification is performed by the Active node, it durably logs a record of the modification to a majority of these JNs. Standby NameNode constantly monitors edit logs at journal nodes and updates its namespace accordingly.In the event of failover, standby NameNode will ensure that its namespace is completely updated according to edit logs before it is changes to active state. When standby will be in active state it will start writing edits log into JournalNode.
Hadoop don't keep any data into NameNode, All data resides in datanode, In case of NameNode failure there wont be any loss of data.

How Namenode High availability achieved in Hadoop 1.x?

Is there any possible solution to achieve Namenode HA in Hadoop 1.x ?
Hadoop 1.x is known for its single point of failure; there is a single Master Node that contains Hadoop Namenode and Hadoop JobTracker. The Namenode keeps look up table for every file (blocks of the file) location on the cluster. The Name node manages Hadoop Distributed File system and act as a HDFS master.
The Secondary NameNode is used for fault tolerance and it is a copy of the NameNode records. It is only used to backup the Namenode in case of crash.

What is the impact on hadoop cluster when Secondary Namenode fails

What happens to hadoop cluster when Secondary NameNode fails.
Hadoop cluster is said to be a single point of failure as all medata is stored by NameNode. What about Secondary NameNode, if secondary namenode fails, will Cluster fail or keep running.
Secondary name node is little bit confusing name. Hadoop Cluster will run when it crashes. You can run Hadoop claster even without it and it is not used for high availability. I am talking about Hadoop versions <2.
More info: http://wiki.apache.org/hadoop/FAQ#What_is_the_purpose_of_the_secondary_name-node.3F

Resources