How to add a Secondary NameNode in an HBase cluster setup? - hadoop

I have an HBase cluster with 3 nodes: a NameNode and 2 DataNodes.
The NameNode is a server with 4GB memory and a 20GB hard disk, while each DataNode has 8GB memory and a 100GB hard disk.
I'm using
Apache Hadoop version: 2.7.2 and
Apache HBase version: 1.2.4
I've seen some people mention a Secondary NameNode.
My questions are:
What is the impact of not having a Secondary NameNode in my setup?
Is it possible to use one of the DataNodes as the Secondary NameNode?
If possible how can I do it? (I inserted only the NameNode in /etc/hadoop/masters file.)

What is the impact of not having a Secondary NameNode in my setup?
The Secondary NameNode periodically merges the namespace image with the edit log (this is called checkpointing). Your setup is not a High-Availability setup, so without a Secondary NameNode the edit log keeps growing, and the NameNode eventually takes longer to start because it has to replay the whole log.
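How often checkpointing happens is controlled from hdfs-site.xml. A minimal sketch, showing the two standard trigger properties with their default values in Hadoop 2.x (the values are only a starting point, not a recommendation for this cluster):

```xml
<!-- hdfs-site.xml: when the Secondary NameNode creates a checkpoint -->
<!-- Checkpoint at least once every 3600 seconds (the default) -->
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value>
</property>
<!-- Or earlier, once this many uncheckpointed transactions have accumulated (the default) -->
<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value>
</property>
```

Whichever limit is reached first triggers the merge, so lowering either value keeps the edit log shorter at the cost of more frequent checkpoint work.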
Is it possible to use one of the DataNodes as the Secondary NameNode?
Running the SNN on a DataNode host is not recommended. A separate host is preferred for the Secondary NameNode process. The host chosen for the SNN must have the same amount of memory as the NN.
If possible how can I do it? (I inserted only the NameNode in /etc/hadoop/masters file.)
The masters file is no longer used. Add this property to hdfs-site.xml:
<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>SNN_host:50090</value>
</property>
Also note that if this property is left unset, the Secondary NameNode process is started by default on the node where start-dfs.sh is executed.
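If the SNN runs on its own host, it may also be worth choosing where it stores its checkpoint images. A sketch, assuming a hypothetical local path /data/hdfs/namesecondary on the SNN host (the property name is standard; the path is only an illustration):

```xml
<!-- hdfs-site.xml on the Secondary NameNode host -->
<!-- Local directory for the merged checkpoint images; the path is a placeholder -->
<property>
  <name>dfs.namenode.checkpoint.dir</name>
  <value>file:///data/hdfs/namesecondary</value>
</property>
```

If this is not set, the directory defaults to a location under hadoop.tmp.dir, which is often on /tmp and may not survive a reboot.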

Related

Can I add standby namenode into existing Hadoop cluster (with Namenode and Secondary namenode)

I have a Hadoop 2.7.2 setup where the NameNode and Secondary NameNode run together with a few DataNodes. After a NameNode failure (it was just a restart) I realized that the Secondary NameNode is not a redundant NameNode as I had thought.
So the question is: can I make my cluster highly available and add a Standby NameNode without deleting the existing metadata from the NameNode?
You need a ZooKeeper cluster, but yes, you can add a NameNode to enable High Availability.
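As a rough sketch of what enabling HA involves, the hdfs-site.xml fragment below uses the standard HA properties with a Quorum Journal Manager. The nameservice ID mycluster, the NameNode IDs nn1 and nn2, and every host name are placeholders, not values from the question:

```xml
<!-- hdfs-site.xml (sketch): "mycluster", nn1, nn2 and all hosts are placeholders -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>nn1-host.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>nn2-host.example.com:8020</value>
</property>
<!-- Shared edit log kept on a quorum of JournalNodes -->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence</value>
</property>
<!-- Automatic failover is the part that needs ZooKeeper -->
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
```

In core-site.xml, fs.defaultFS would then point at hdfs://mycluster and ha.zookeeper.quorum would list the ZooKeeper hosts. To keep the existing metadata, the usual sequence is to run hdfs namenode -initializeSharedEdits on the existing NameNode, hdfs namenode -bootstrapStandby on the new one, and hdfs zkfc -formatZK once before starting the failover controllers. The Secondary NameNode is retired in this layout.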

Use of secondary namenode in Hadoop in 2.x

As far as I know, Hadoop 1.x had a Secondary NameNode, which was used to create an image of the primary NameNode's metadata so that the primary could be restored from it when it failed and started up again. But what is the use of the Secondary NameNode in Hadoop 2.x, given that we already have a hot standby?
As far as I know, Hadoop 2.x can be set up in 2 ways:
1. With HA (High Availability cluster): if you are setting up an HA cluster, you do not need a Secondary NameNode, because the Standby NameNode keeps its state synchronized with the Active NameNode.
The HDFS NameNode High Availability feature enables you to run redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby. Both NameNodes require the same type of hardware configuration. In an HA Hadoop cluster, the Active NameNode writes its edit log entries to a separate group of JournalNodes, and the Standby NameNode reads them from there.
In the event of a failover, the Standby NameNode makes sure its namespace is fully up to date with the edit logs before it changes to the active state. So there is no need for a Secondary NameNode in this cluster setup.
2. Without HA: you can have a Hadoop setup without a standby node. In that case the Secondary NameNode acts as you already described for Hadoop 1.x.
When you configure HA for NameNodes, the Secondary NameNode is not used. However, you can still configure HDFS without HA (with a NameNode and a Secondary NameNode). This part hasn't changed much since Hadoop 1.x.

How to allocate memory to datanode in hadoop configuration

We have the following requirement.
We have a total of 5 servers which will be used to build a big data Hadoop data warehouse system (we are not going to use any distribution such as Cloudera or Hortonworks).
Every server has 512GB RAM, 30TB storage and 16 cores, and runs Ubuntu Linux 14.04 LTS Server.
We would install Hadoop on all the servers. Servers 3, 4 and 5 will be used purely as DataNodes (slave machines), whereas server 1 will run the Active NameNode and a DataNode. Server 2 will run the Standby NameNode and a DataNode.
We want to configure 300GB RAM for the NameNode and 212GB RAM for the DataNode while configuring Hadoop.
Could anyone help me with how to do that? Which Hadoop configuration files need to be changed, and which parameters do we need to set in them?
Thanks and Regards,
Suresh Pitchaipillai
You can set these properties from Cloudera Manager (in case you are using CDH) or from Ambari (if you use Hortonworks).
Also, you do not need 300GB for the NameNode, as the NameNode only stores metadata. Roughly speaking, 1GB of NameNode heap can hold the metadata of 1 million blocks (block size = 128MB).
More details here : https://issues.apache.org/jira/browse/HADOOP-1687
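To see what that rule of thumb means for this cluster, here is a back-of-the-envelope estimate. It assumes a replication factor of 3, files that fill their 128MB blocks completely, and all 5 × 30TB of disk given to HDFS; none of these figures come from the question, so treat the result as an order of magnitude only:

```latex
\begin{align*}
\text{raw capacity} &= 5 \times 30\,\text{TB} = 150\,\text{TB} \\
\text{unique data (replication 3)} &\approx 150\,\text{TB} / 3 = 50\,\text{TB} \\
\text{blocks} &\approx \frac{50 \times 1024 \times 1024\,\text{MB}}{128\,\text{MB}} = 409{,}600 \\
\text{heap} &\approx 409{,}600\ \text{blocks} \times \frac{1\,\text{GB}}{10^{6}\ \text{blocks}} \approx 0.4\,\text{GB}
\end{align*}
```

Many small files would push the block count (and the heap) well above this, but even a large safety margin stays far below 300GB.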
This assumes that you are going to use the latest Hadoop distribution with YARN.
Read this article - Reference. It explains every parameter in detail and does so very well.
There is one more article from Hortonworks, though it is applicable to all Apache-based Hadoop distributions.
Finally, keep this handy - Yarn-configuration. It is self-explanatory.
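For a plain Apache install, the memory that worker nodes offer to YARN containers is set in yarn-site.xml. A sketch, assuming the "212GB for the datanode" in the question really means memory available to YARN on each worker (212 × 1024 = 217088 MB); the 64GB container cap is an arbitrary illustration:

```xml
<!-- yarn-site.xml (sketch): values are assumptions, not recommendations -->
<!-- Total memory the NodeManager may hand out to containers: 212 GB in MB -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>217088</value>
</property>
<!-- Largest single container the scheduler will grant: 64 GB, illustrative only -->
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>65536</value>
</property>
```

The NameNode and DataNode daemons themselves are JVM processes; in Hadoop 2.x their heap is set in hadoop-env.sh by adding an -Xmx option to HADOOP_NAMENODE_OPTS and HADOOP_DATANODE_OPTS respectively. The DataNode daemon normally needs only a few GB.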

How Namenode High availability achieved in Hadoop 1.x?

Is there any possible solution to achieve NameNode HA in Hadoop 1.x?
Hadoop 1.x is known for its single point of failure: there is a single master node that runs the NameNode and the JobTracker. The NameNode keeps a lookup table with the location of every file (the blocks of the file) on the cluster. The NameNode manages the Hadoop Distributed File System and acts as the HDFS master.
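The Secondary NameNode helps with fault tolerance only in a limited way: it keeps a periodically checkpointed copy of the NameNode's records, not a live replica. That copy can be used to restore the NameNode manually after a crash, but there is no automatic failover in Hadoop 1.x.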

Hadoop Namenode without HDFS storage

I have installed a Hadoop cluster with a total of 3 machines: 2 nodes act as DataNodes and 1 node acts as both the NameNode and a DataNode.
I want to clear up some doubts about Hadoop cluster installation and architecture.
Here is the list of questions I am looking for answers to:
I uploaded a data file of around 500MB to the cluster and then checked the HDFS report.
I noticed that the NameNode machine is also using 500MB in HDFS, along with the DataNodes, with a replication factor of 2.
The problem is that I do not want the NameNode machine to store any data. In short, I don't want it to work as a DataNode, because it is also storing the file I upload. So how can I make it act only as a master node and not as a DataNode?
I tried running the command hadoop-daemon.sh stop on the NameNode to stop the DataNode service on it, but it did not help.
How much metadata does a NameNode typically generate for a file of 1GB? Any approximations?
Go to conf directory inside your $HADOOP_HOME directory on your master. Edit the file named slaves and remove the entry corresponding to your name node from it. This way you are only asking the other two nodes to act as slaves and name node as only the master.

Resources