Hadoop namenode : Single point of failure - hadoop

The Namenode in the Hadoop architecture is a single point of failure.
How do people who run large Hadoop clusters cope with this problem?
Is there an industry-accepted solution that has worked well, in which a secondary NameNode takes over if the primary one fails?

Yahoo has certain recommendations for configuration settings at different cluster sizes to take NameNode failure into account. For example:
The single point of failure in a Hadoop cluster is the NameNode. While the loss of any other machine (intermittently or permanently) does not result in data loss, NameNode loss results in cluster unavailability. The permanent loss of NameNode data would render the cluster's HDFS inoperable.
Therefore, another step should be taken in this configuration to back up the NameNode metadata.
Facebook uses a tweaked version of Hadoop for its data warehouses; it has some optimizations that focus on NameNode reliability. In addition to the patches available on GitHub, Facebook appears to use AvatarNode specifically for quickly switching between primary and secondary NameNodes. Dhruba Borthakur's blog contains several other entries offering further insights into the NameNode as a single point of failure.
Edit: Further info about Facebook's improvements to the NameNode.

NameNode High Availability was introduced with the Hadoop 2.x release.
It can be achieved in two modes: with NFS or with the Quorum Journal Manager (QJM).
High availability with QJM is the preferred option.
In a typical HA cluster, two separate machines are configured as NameNodes. At any point in time, exactly one of the NameNodes is in an Active state, and the other is in a Standby state. The Active NameNode is responsible for all client operations in the cluster, while the Standby is simply acting as a slave, maintaining enough state to provide a fast failover if necessary.
Have a look at the SE questions below, which explain the complete failover process.
Secondary NameNode usage and High availability in Hadoop 2.x
How does Hadoop Namenode failover process works?

Large Hadoop clusters have thousands of data nodes and one name node. The probability that some machine fails goes up roughly linearly with machine count (all else being equal). So if Hadoop didn't cope with data node failures it wouldn't scale. Since there is still only one name node, the Single Point of Failure (SPOF) remains, but the probability of that one machine failing is still low.
That said, Bkkbrad's answer about Facebook adding failover capability to the name node is right on.
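To put rough numbers on that argument, here is a minimal sketch. It assumes every machine fails independently with the same probability p over some fixed period; both the independence assumption and the value of p are illustrative and not taken from any real cluster.

```python
def prob_any_failure(p: float, n: int) -> float:
    """Probability that at least one of n independent machines fails,
    given a per-machine failure probability p."""
    return 1.0 - (1.0 - p) ** n


# Assumed per-machine failure probability for the period (illustrative only).
p = 0.001

# n = 1 corresponds to the single name node; larger n to the pool of data nodes.
for n in (1, 10, 100, 1000):
    print(n, round(prob_any_failure(p, n), 4))
```

While n * p is small the result stays close to n * p, which is the "linear" growth mentioned above; for large n it flattens out and approaches 1.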

Related

requirement of 3 journal nodes in HA hadoop setup

I am quite new to Hadoop. As I am setting up Hadoop NameNode HA using the Quorum Journal Manager, I am a bit confused about the requirements. The official documentation on the Apache site says:
Note: There must be at least 3 JournalNode daemons, since edit log modifications must be written to a majority of JNs.
What does this mean? Why do we need three JournalNodes instead of two?
In Hadoop 1 we can have only one NameNode per cluster. If that NameNode becomes unavailable, the whole cluster becomes unavailable, which makes it a single point of failure.
To resolve this issue, the obvious solution was to allow more than one NameNode per cluster.
In Hadoop 2 we can have two NameNodes per cluster. At any time only one NameNode is active and the other is in standby mode. To make the system highly available, both NameNodes must be kept synchronised. To do so, the concept of JournalNodes was introduced.
The purpose of this lightweight daemon is to sync every change made on the active NameNode to the standby NameNode.
Now what if this JournalNode fails? We would be back to the same issue: the JournalNode would become the single point of failure. To avoid that, a quorum concept was introduced, similar to the one used in ZooKeeper.
What does quorum mean?
Quorum: the literal meaning of quorum is 'the minimum number of members of an assembly or society that must be present to make a meeting valid'.
Similarly, more than half of all JournalNodes must be healthy at all times to keep everything running. For example, if you have 2 JournalNodes in the system, 'more than half' means more than 1, i.e. both JournalNodes must stay healthy, so you cannot tolerate any JournalNode failure in this case. With 3 JournalNodes the quorum is 2, so one failure can be tolerated. To get this benefit, use an odd number of JournalNodes (3, 5, 7, and so on), with a minimum of 3 so that at least one JournalNode failure can be survived.
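The "more than half" rule can be checked with a few lines of arithmetic. This is only an illustrative sketch; the function names are made up for this example and are not part of any Hadoop API.

```python
def quorum_size(total_journal_nodes: int) -> int:
    """Smallest number of nodes that is strictly more than half of the total."""
    return total_journal_nodes // 2 + 1


def tolerated_failures(total_journal_nodes: int) -> int:
    """How many nodes may fail while a quorum is still reachable."""
    return total_journal_nodes - quorum_size(total_journal_nodes)


for n in (1, 2, 3, 4, 5):
    print(f"{n} JournalNodes: quorum={quorum_size(n)}, "
          f"tolerated failures={tolerated_failures(n)}")
```

Going from 3 to 4 nodes raises the quorum from 2 to 3 without raising the number of tolerated failures, which is why odd node counts are the usual choice.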
I hope this helped

what Hadoop will do after one of datanodes down

I have a Hadoop cluster with 10 data nodes and 2 name nodes, with the replication factor configured as 3. I was wondering: if one of the data nodes goes down, will Hadoop try to re-create the lost replicas on the other live nodes, or just do nothing (since there are still 2 replicas left)?
Also, what if the failed data node comes back after a while? Can Hadoop recognize the data on that node? Thanks!
Will Hadoop try to re-create the lost replicas on the other live nodes, or just do nothing (since there are still 2 replicas left)?
Yes, Hadoop will recognize it and make copies of that data on some other nodes. When the NameNode stops receiving heartbeats from a data node, it assumes that the data node is lost. To bring all data back to the configured replication factor, it makes copies on other data nodes.
Also, what if the failed data node comes back after a while? Can Hadoop recognize the data on that node?
Yes. When a data node comes back with all its data, the NameNode will have the extra copies of the data deleted. With the next heartbeat exchange, the NameNode sends the data node an instruction to remove the extra data and free up disk space.
Snippet from Apache HDFS documentation:
Each DataNode sends a Heartbeat message to the NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of a Heartbeat message. The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them. Any data that was registered to a dead DataNode is not available to HDFS any more. DataNode death may cause the replication factor of some blocks to fall below their specified value. The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The necessity for re-replication may arise due to many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased.
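The bookkeeping described in that passage can be sketched as a toy model. This is not HDFS code: the data structures, node names, block ids, and the timeout value are all hypothetical and chosen only to illustrate the idea.

```python
from typing import Dict, Set

REPLICATION_FACTOR = 3
HEARTBEAT_TIMEOUT = 30.0  # seconds; hypothetical value for this sketch


def find_dead_nodes(last_heartbeat: Dict[str, float], now: float,
                    timeout: float) -> Set[str]:
    """Nodes whose most recent heartbeat is older than the timeout."""
    return {node for node, ts in last_heartbeat.items() if now - ts > timeout}


def under_replicated(block_locations: Dict[str, Set[str]], dead: Set[str],
                     factor: int) -> Dict[str, int]:
    """Map each block id to the number of replicas it is missing on live nodes."""
    missing = {}
    for block, nodes in block_locations.items():
        live_replicas = len(nodes - dead)
        if live_replicas < factor:
            missing[block] = factor - live_replicas
    return missing


# Toy cluster state: dn3 has been silent for longer than the timeout.
last_heartbeat = {"dn1": 100.0, "dn2": 100.0, "dn3": 60.0, "dn4": 99.0}
block_locations = {
    "blk_1": {"dn1", "dn2", "dn3"},
    "blk_2": {"dn1", "dn2", "dn4"},
}

dead = find_dead_nodes(last_heartbeat, now=101.0, timeout=HEARTBEAT_TIMEOUT)
print(dead)
print(under_replicated(block_locations, dead, REPLICATION_FACTOR))
```

In this model only blk_1 needs a new replica, because it is the only block that had a copy on the silent node.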

Difference between Rack Awareness and Name node

I was going through Hadoop and I am unsure whether there is a difference between rack awareness and the NameNode. Will rack awareness and the NameNode reside on the same box?
As Aviral rightly said, the question has been quite vague. But just quoting for your understanding,
Namenode : The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.
Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds the successful requests by returning a list of relevant DataNode servers where the data lives.
You can read in detail about this concept here.
Rack Awareness : In simple words, rack awareness is the strategy the NameNode employs to choose the nearest DataNode based on rack information. You can read details here.
Furthermore, I would like to suggest this blog.
Image credits Brad Hedlund
From Apache HDFS Users Guide
HDFS is the primary distributed storage used by Hadoop applications.
A HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data
Typically large Hadoop clusters are arranged in racks and network traffic between different nodes with in the same rack is much more desirable than network traffic across the racks. In addition NameNode tries to place replicas of block on multiple racks for improved fault tolerance.
From RackAwareness tutorial:
Hadoop components are rack-aware. For example, HDFS block placement will use rack awareness for fault tolerance by placing one block replica on a different rack. This provides data availability in the event of a network switch failure or partition within the cluster.
Let's see how Hadoop writes are implemented.
If the writer is on a datanode, the 1st replica is placed on the local machine, otherwise a random datanode.
The 2nd replica is placed on a datanode that is on a different rack.
The 3rd replica is placed on a different node in the same rack as the second replica.
Because data blocks are replicated on three different nodes across two different racks, Hadoop read operations have highly available data blocks.
At least one replica is stored on a different rack. If one rack is not accessible, Hadoop can still fetch the data block from the other rack.
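The three placement rules above can be sketched as follows. This is a simplified illustration, not the actual HDFS placement implementation: the topology dictionary and node names are hypothetical, and the sketch assumes at least two racks with at least two nodes on the remote rack.

```python
import random
from typing import Dict, List, Optional


def place_replicas(topology: Dict[str, List[str]], writer: Optional[str],
                   rng: random.Random) -> List[str]:
    """Pick three datanodes for a block, following the rules described above.

    topology maps a rack name to the datanodes on that rack.
    """
    node_rack = {node: rack for rack, nodes in topology.items() for node in nodes}
    all_nodes = sorted(node_rack)

    # 1st replica: the writer's own node if it is a datanode, else a random one.
    first = writer if writer in node_rack else rng.choice(all_nodes)

    # 2nd replica: any node on a different rack than the first.
    remote_nodes = [n for n in all_nodes if node_rack[n] != node_rack[first]]
    second = rng.choice(remote_nodes)

    # 3rd replica: a different node on the same rack as the second.
    same_rack_nodes = [n for n in topology[node_rack[second]] if n != second]
    third = rng.choice(same_rack_nodes)

    return [first, second, third]


topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas(topology, writer="dn1", rng=random.Random(0)))
```

With the writer on dn1, the first replica stays local and the other two land on rack2, so losing either rack still leaves at least one readable copy.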

Remove a node of Hadoop which is NameNode too

I recently created a cluster with five servers :
master
node01
node02
node03
node04
To have more "workers" I added the NameNode to the list of slaves in /etc/hadoop/slaves.
This works; the master performs some MapReduce jobs.
Today I want to remove this node from the workers list (the work is too CPU-intensive for it). I want to set dfs.exclude in my hdfs-site.xml, but I am worried because this is also the master server.
Could someone confirm that there is no risk in performing this operation?
Thanks,
Romain.
If there is data stored in the master node (as there probably is because it's a DataNode), you will essentially lose that data. But if your replication factor is more than 1 (3 is the default), then it doesn't matter as Hadoop will notice that some data is missing (under-replicated) and will start replicating it again on other DataNodes to reach the replication factor.
So, if your replication factor is more than 1 (and the cluster is otherwise healthy), you can just remove the master's data (and make it again just a NameNode) and Hadoop will take care of the rest.

Hadoop doesn't use one node for job

I've got a four-node YARN cluster set up and running. I recently had to format the NameNode due to a minor problem.
Later I ran Hadoop's PI example to verify every node was still taking part in the calculation, which they all did. However when I start my own job now one of the nodes is not being used at all.
I figured this might be because this node doesn't have any data to work on. So I tried to balance the cluster using the balancer. This doesn't work and the balancer tells me the cluster is balanced.
What am I missing?
While processing, your ApplicationMaster negotiates with the ResourceManager for containers, which are then launched by NodeManagers, preferably on nodes close to the data. Since your replication factor is 3, HDFS would try to place 1 whole copy on a single datanode and distribute the rest across all the datanodes.
1) Change the replication factor to 1 (Since you are only trying to benchmark, reducing replication should not be a big issue).
2) Make sure your client (the machine from which you run your -copyFromLocal command) does not have a datanode running on it. Otherwise, HDFS will tend to place most of the data on this node, since writing locally has lower latency.
3) Control the file distribution using dfs.blocksize property.
4) Check the status of your datanodes using hdfs dfsadmin -report.
Make sure your node is joining the ResourceManager. Look into the NodeManager log on the problem node and see if there are errors. Look into the ResourceManager Web UI (:8088 by default) and make sure the node is listed there.
Make sure the node is bringing enough resources to the pool to be able to run a job. Check yarn.nodemanager.resource.cpu-vcores and yarn.nodemanager.resource.memory-mb in yarn-site.xml on the node. The memory should be more than the minimum memory requested by a container (see yarn.scheduler.minimum-allocation-mb).
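As a rough sanity check of those settings, the following sketch estimates how many containers of a given size a node could host. It is a simplified model with made-up numbers, not how the YARN scheduler actually computes allocations; the function and parameter names are hypothetical.

```python
def containers_that_fit(node_memory_mb: int, node_vcores: int,
                        container_mb: int, container_vcores: int = 1) -> int:
    """Estimate how many containers fit on a node, limited by memory and vcores."""
    if container_mb <= 0 or container_vcores <= 0:
        raise ValueError("container size must be positive")
    return min(node_memory_mb // container_mb, node_vcores // container_vcores)


# Illustrative values standing in for yarn.nodemanager.resource.memory-mb,
# yarn.nodemanager.resource.cpu-vcores and yarn.scheduler.minimum-allocation-mb.
print(containers_that_fit(node_memory_mb=8192, node_vcores=8, container_mb=1024))
print(containers_that_fit(node_memory_mb=512, node_vcores=8, container_mb=1024))
```

A result of 0 for a node corresponds to the situation described above: the node offers less memory than the smallest container that can be requested, so no work is scheduled on it.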
