Hadoop Datanode, namenode, secondary-namenode, job-tracker and task-tracker - hadoop

I am new to Hadoop, so I have some doubts. If the master node fails, what happens to the Hadoop cluster? Can we recover that node without any loss? Is it possible to keep a secondary master node that automatically takes over as master when the current one fails?
We have a backup of the NameNode (the Secondary NameNode), so we can restore the NameNode from the Secondary NameNode when it fails. In the same way, how can we restore the data on a DataNode when the DataNode fails? The Secondary NameNode is a backup of the NameNode only, not of the DataNodes, right? If a node fails before a job completes, so that there is a job pending in the JobTracker, does that job continue, or does it restart from the beginning on a free node?
How can we restore the entire cluster data if anything happens?
And my final question: can we use a C program in MapReduce (for example, bubble sort in MapReduce)?
Thanks in advance

Although it is too late to answer your question, this may help others.
First of all, let me introduce the Secondary NameNode:
It contains a backup of the namespace image and edit log files for the
past hour (configurable). Its job is to merge the latest NameNode
namespace image and edit log files and upload the result back to the
NameNode as a replacement for the old one. Having a Secondary NameNode
in a cluster is not mandatory.
Now, coming to your concerns.
If the master node fails, what happens to the Hadoop cluster?
Supporting Frail's answer: yes, Hadoop has a single point of failure, so
every currently running task, such as MapReduce or anything else that
uses the failed master node, will stop. The whole cluster, including
clients, will stop working.
Can we recover that node without any loss?
That is hypothetical. Recovery without any loss is unlikely, because
all the changes recorded by the NameNode after the last backup taken by
the Secondary NameNode will be lost. I say unlikely rather than
impossible because if the NameNode fails just after a successful backup
run by the Secondary NameNode, it is in a safe state.
Is it possible to keep a secondary master node that automatically takes over as master when the current one fails?
An administrator (user) can do this switch manually without much
trouble. To switch automatically, you have to write custom code outside
the cluster: code that monitors the cluster, configures the Secondary
NameNode appropriately and restarts the cluster with the new NameNode address.
We have a backup of the NameNode (the Secondary NameNode), so we can restore the NameNode from the Secondary NameNode when it fails. In the same way, how can we restore the data on a DataNode when the DataNode fails?
This is about the replication factor. We have 3 replicas (the default
and best practice, configurable) of each file block, all on different
DataNodes. So if one DataNode fails, for the time being we still have 2
replicas on other DataNodes. Later the NameNode will create one more
replica of the data that the failed DataNode contained.
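The re-replication step described above can be sketched as a toy simulation. This is only an illustration under simple assumptions: the function name re_replicate, the node names and the in-memory block map are hypothetical, and the real NameNode also weighs racks, load and free space when it picks a target.

```python
REPLICATION_FACTOR = 3  # the default mentioned above; configurable in HDFS


def re_replicate(block_map, live_nodes, failed_node, factor=REPLICATION_FACTOR):
    """Drop failed_node from every block's replica list, then top each block
    back up to `factor` replicas using live nodes that do not hold it yet."""
    healthy = sorted(n for n in live_nodes if n != failed_node)
    new_map = {}
    for block, holders in block_map.items():
        survivors = [n for n in holders if n != failed_node]
        candidates = [n for n in healthy if n not in survivors]
        missing = max(0, factor - len(survivors))
        new_map[block] = survivors + candidates[:missing]
    return new_map


# Hypothetical cluster: two blocks, five DataNodes, dn2 fails.
blocks = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn2", "dn3", "dn4"],
}
nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
after = re_replicate(blocks, nodes, failed_node="dn2")
print(after)
```

Between the failure and the copy, each affected block is served from its two surviving replicas, which is why a single DataNode failure does not lose data.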
The Secondary NameNode is a backup of the NameNode only, not of the DataNodes, right?
Right. It just contains the metadata about the DataNodes, such as
DataNode addresses and properties, including the block report of each DataNode.
If a node fails before a job completes, so that there is a job pending in the JobTracker, does that job continue, or does it restart from the beginning on a free node?
Hadoop will try hard to continue the job. But again, it depends on the
replication factor, rack awareness and other configuration made by the
admin. If you follow Hadoop's best practices for HDFS, the job will not
fail. The JobTracker will get the address of a node that holds a
replica and continue there.
How can we restore the entire cluster data if anything happens?
By Restarting it.
And my final question: can we use a C program in MapReduce (for example, bubble sort in MapReduce)?
Yes, you can use any programming language that supports standard file
read/write operations.
I just gave it a try. Hope it helps you as well as others.
*Suggestions/Improvements are welcome.*

Currently a Hadoop cluster has a single point of failure, which is the NameNode.
And about the secondary node issue (from the Apache wiki):
The term "secondary name-node" is somewhat misleading. It is not a
name-node in the sense that data-nodes cannot connect to the secondary
name-node, and in no event it can replace the primary name-node in
case of its failure.
The only purpose of the secondary name-node is to perform periodic
checkpoints. The secondary name-node periodically downloads current
name-node image and edits log files, joins them into new image and
uploads the new image back to the (primary and the only) name-node.
See User Guide.
So if the name-node fails and you can restart it on the same physical
node then there is no need to shutdown data-nodes, just the name-node
need to be restarted. If you cannot use the old node anymore you will
need to copy the latest image somewhere else. The latest image can be
found either on the node that used to be the primary before failure if
available; or on the secondary name-node. The latter will be the
latest checkpoint without subsequent edits logs, that is the most
recent name space modifications may be missing there. You will also
need to restart the whole cluster in this case.
There are tricky ways to overcome this single point of failure. If you are using the Cloudera distribution, one of the ways is explained here. The MapR distribution has a different way to handle this SPOF.
Finally, you can use any programming language to write MapReduce jobs via Hadoop Streaming.
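To illustrate the Hadoop Streaming point, here is a minimal word-count sketch in Python. Streaming passes records to the mapper and reducer as lines of text on standard input and reads results from standard output, which is why any language that can read and write lines will do, C included. The function names mapper, reducer and run_local are hypothetical, and run_local only imitates the sort that the framework performs between the two phases.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Sum the counts per key. The input must already be sorted by key,
    which the framework takes care of between the map and reduce phases."""
    pairs = (rec.rstrip("\n").split("\t", 1) for rec in sorted_records)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_local(lines):
    """Local stand-in for the cluster: map, sort by key, reduce."""
    return list(reducer(sorted(mapper(lines))))


result = run_local(["b a", "a c a"])
print(result)
```

On a real cluster each function would live in its own script that loops over sys.stdin and prints its records, and the two scripts would be passed to the streaming jar through its -mapper and -reducer options.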

Although it is too late to answer your question, this may help others. First we will discuss the roles of the Hadoop 1.x daemons, and then your issues.
1. What is the role of the Secondary NameNode?
It is not exactly a backup node. It periodically reads the edit logs and creates an updated fsimage file for the NameNode. It periodically fetches the metadata from the NameNode and keeps a copy, which can be used for recovery when the NameNode fails.
2. What is the role of the NameNode?
It is the manager of all the daemons. It is the master JVM process, which runs on the master node, and it interacts with the DataNodes.
3. What is the role of the JobTracker?
It accepts jobs and distributes their tasks to the TaskTrackers for processing on the DataNodes.
4. What is the role of the TaskTrackers?
A TaskTracker executes the program provided for processing the data that exists on its DataNode. This is where the map (and reduce) tasks run.
Limitations of Hadoop 1.x
Single point of failure
The single point of failure is the NameNode, so we should run the NameNode on high-quality hardware. If the NameNode fails, everything becomes inaccessible.
Solutions
The solution to the single point of failure is Hadoop 2.x, which provides high availability.
High availability with Hadoop 2.x
Now to your topics:
How can we restore the entire cluster data if anything happens?
If the cluster fails, we can restart it.
If a node fails before a job completes, so that there is a job pending in the JobTracker, does that job continue, or does it restart from the beginning on a free node?
By default we have 3 replicas of the data (I mean blocks) for high availability; how many replicas there are depends on what the admin has set. So the JobTracker will continue with another copy of the data on another DataNode.
Can we use a C program in MapReduce (for example, bubble sort in MapReduce)?
Basically, MapReduce is an execution engine that solves or processes big data problems in a distributed manner (storage plus processing). MapReduce programs do file handling and other basic operations, so we can use any language in which we can handle files as required.
Hadoop 1.x architecture
Hadoop 1.x has 4 basic daemons
I just gave it a try. Hope it helps you as well as others.
Suggestions/Improvements are welcome.

Related

What is checkpoint node HDFS? Why use it?

I am new to Hadoop, so please explain. I have basic knowledge of the NameNode and DataNodes.
The Checkpoint Node periodically fetches the fsimage and edits from the NameNode and merges them. The resulting state is called a checkpoint. After this, it uploads the result to the NameNode.
The Checkpoint Node was introduced to solve a drawback of the NameNode. During runtime, changes are only written to edits and not merged into the fsimage. If the NameNode runs for a while, edits becomes huge and the next startup takes even longer, because more changes have to be applied to determine the latest state of the metadata.
There was also a similar type of node called the "Secondary NameNode", but it doesn't have the "upload to NameNode" feature, so the NameNode needs to fetch the state from the Secondary NameNode. It was also confusing because the name suggests that the Secondary NameNode takes over requests if the NameNode fails, which isn't the case.
Hope this helps!

Hadoop Nodes and Roles

I have a Hadoop cluster at work with over 50 nodes. We occasionally face disk failures and need to decommission the DataNode roles.
My question is: if I were to decommission only the DataNode and leave the TaskTracker running, would this result in failed tasks/jobs on this node due to the unavailability of the HDFS service on that node?
1. Does the TaskTracker on Node1 sit idle since there is no DataNode service on that node?
Correct. If the DataNode is disabled, the TaskTracker will not be able to process the data, as the data will not be available; it will be idle.
2. Or does the TaskTracker work on data from DataNodes on other nodes?
No. Due to the data locality principle, the TaskTracker will not process data from other nodes.
3. Do we get errors from the TaskTracker service on Node1 due to the DataNode on its node being down?
The TaskTracker will not be able to process any data, so there will be no errors.
4. If I have services like Hive, Impala, etc. running on HDFS, would those services throw errors upon contact with the TaskTracker on Node1?
They will not be able to contact the TaskTracker on Node1. When a client requests processing of data, the NameNode tells the client where the data is located, so all other applications communicate with the DataNodes based on those locations.
I would expect any task that tries to read from HDFS on the "dead" node to fail. This should result in the node being blacklisted by M/R after N failures (default is 3 I think). Also, I believe this happens each time a job runs.
However, jobs should still finish since the tasks that got routed to the bad node will simply be retried on other nodes.
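The retry-and-blacklist behaviour described in this answer can be sketched as a toy scheduler. This is an illustration only: run_job, the round-robin assignment and the parameter max_failures_per_node are hypothetical stand-ins, and the threshold of 3 is just the value guessed above, not a confirmed default.

```python
from collections import Counter


def run_job(tasks, nodes, broken, max_failures_per_node=3):
    """Assign tasks round-robin. A task that lands on a broken node fails
    and is retried elsewhere. A node is blacklisted for this job once it
    has caused `max_failures_per_node` failures."""
    failures = Counter()
    blacklist = set()
    placement = {}
    cursor = 0
    for task in tasks:
        while True:
            usable = [n for n in nodes if n not in blacklist]
            if not usable:
                raise RuntimeError("no usable nodes left")
            node = usable[cursor % len(usable)]
            cursor += 1
            if node in broken:
                failures[node] += 1
                if failures[node] >= max_failures_per_node:
                    blacklist.add(node)
                continue  # retry the same task on the next node
            placement[task] = node
            break
    return placement, blacklist


# Hypothetical job: eight tasks, three nodes, n2 cannot read from HDFS.
tasks = [f"t{i}" for i in range(8)]
placement, blacklist = run_job(tasks, ["n1", "n2", "n3"], broken={"n2"})
print(placement, blacklist)
```

Every task still ends up on a healthy node, which matches the point that the job finishes even though some attempts fail along the way.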
Firstly, in order to run a job you need to have the input file. When you load the input file into HDFS, it is split into 64 MB blocks by default, and with default settings each block is replicated 3 times. Since one of the DataNodes in your cluster has failed, the NameNode will not store data on that node. The NameNode receives frequent status updates from the DataNodes, so it will not choose that specific DataNode to store the data.
It should throw an exception only when you have run out of disk space and the dead DataNode is the only one left in the cluster. Then it is time for you to replace the DataNode and scale up the cluster.
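As a small worked example of the defaults mentioned above (64 MB blocks, 3 replicas), the following sketch computes how many blocks and how much raw storage a file needs. The helper name hdfs_footprint and the 200 MB file are made up for illustration.

```python
import math

BLOCK_SIZE_MB = 64  # default block size quoted above
REPLICATION = 3     # default replication quoted above


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (number of blocks, number of block replicas, raw MB stored)."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    raw_mb = file_size_mb * replication  # the last block is not padded to full size
    return blocks, replicas, raw_mb


blocks, replicas, raw_mb = hdfs_footprint(200)
print(blocks, replicas, raw_mb)
```

A 200 MB file therefore becomes 4 blocks (three full ones and one of 8 MB), stored as 12 block replicas spread over the live DataNodes.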
Hope this helps.

Reading operations on hadoop and consistency level

I am setting up distributed HBase on HDFS and I am trying to understand the behavior of the system during read operations.
This is how I understand the high-level steps of a read operation.
The client connects to the NameNode to get the list of DataNodes that contain replicas of the rows it is interested in.
From here the client caches the list of DataNodes and talks to the chosen DataNode directly until it needs other rows from another DataNode, in which case it asks the NameNode again.
My questions are as follows:
Who chooses the best replica DataNode to contact? How does the client choose the "closest" replica? Does the NameNode return the list of relevant DataNodes in sorted order?
What are the scenarios (if any) in which the client switches to another DataNode that has the requested rows? For example, if one of the DataNodes becomes overloaded or slow, can the client library figure out that it should contact another DataNode from the list returned by the NameNode?
Is there a possibility of getting stale data from one of the replicas? For example, a client acquires the list of DataNodes and starts reading from one of them. In the meantime a write request comes from another client to the NameNode. We have dfs.replication == 3 and dfs.replication.min = 2. The NameNode considers the write successful after flushing to disk on 2 out of 3 nodes, while the first client is reading from the 3rd node and doesn't know (yet) that another write has been committed?
Does Hadoop maintain the same reading policy when supporting HBase?
Thank you
Who chooses the best replica DataNode to contact? How does the client choose the "closest" replica? Does the NameNode return the list of relevant DataNodes in sorted order?
The client is the one that decides who best to contact. It picks them in this order:
The file is on the same machine. In this case (if properly configured) it will short circuit the DataNode and go directly to the file as an optimization.
The file is in the same rack (if rack awareness is configured).
The file is somewhere else.
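The preference order above (same machine, same rack, anywhere else) can be sketched as a sort by a simple distance function. The dictionaries, host names and the helper names distance and order_replicas are hypothetical; this only shows the ordering idea, not the actual client code.

```python
def distance(client, replica):
    """0 = same host, 1 = same rack, 2 = anywhere else."""
    if client["host"] == replica["host"]:
        return 0
    if client["rack"] == replica["rack"]:
        return 1
    return 2


def order_replicas(client, replicas):
    """Closest replica first; sorted() is stable, so ties keep their order."""
    return sorted(replicas, key=lambda r: distance(client, r))


# Hypothetical client on dn2 in rack r1, with three replicas of one block.
client = {"host": "dn2", "rack": "r1"}
replicas = [
    {"host": "dn7", "rack": "r2"},
    {"host": "dn3", "rack": "r1"},
    {"host": "dn2", "rack": "r1"},
]
ordered = [r["host"] for r in order_replicas(client, replicas)]
print(ordered)
```

If the first choice times out, the reader can simply move on to the next entry of the ordered list.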
What are the scenarios (if any) in which the client switches to another DataNode that has the requested rows? For example, if one of the DataNodes becomes overloaded or slow, can the client library figure out that it should contact another DataNode from the list returned by the NameNode?
It's not that smart. It'll switch if it thinks the DataNode is down (meaning it times out), but not in any other situation that I know of. I believe that it will just go to the next one in the list, but it might contact the NameNode again; I'm not 100% sure.
Is there a possibility of getting stale data from one of the replicas? For example, a client acquires the list of DataNodes and starts reading from one of them. In the meantime a write request comes from another client to the NameNode. We have dfs.replication == 3 and dfs.replication.min = 2. The NameNode considers the write successful after flushing to disk on 2 out of 3 nodes, while the first client is reading from the 3rd node and doesn't know (yet) that another write has been committed?
Stale data is possible, but not in the situation you describe. Files are write-once and immutable (other than append, but don't append if you don't have to). The NameNode won't tell you the file is there until it is completely written. In the case of append, shame on you then. The behavior of reading from an actively-being-appended-to file on a local filesystem is unpredictable as well. You should expect the same in HDFS.
One way stale data could happen is if you retrieve your list of block locations and the NameNode decides to migrate all three of them at once before you access it. I don't know what would happen there. In the 5 years of using Hadoop, I've never had this be a problem. Even when running the balancer at the same time as doing stuff.
Does Hadoop maintain the same reading policy when supporting HBase?
HBase is not treated specially by HDFS. There is some talk about using a custom block placement strategy with HBase to get better data locality, but that's in the weeds.

name node Vs secondary name node

Hadoop is consistent and partition tolerant, i.e. it falls under the CP category of the CAP theorem.
Hadoop is not available because all the nodes depend on the NameNode. If the NameNode fails, the cluster goes down.
But considering that an HDFS cluster has a Secondary NameNode, why can't we call Hadoop available? If the NameNode is down, the Secondary NameNode can be used for the writes.
What is the major difference between the NameNode and the Secondary NameNode that makes Hadoop unavailable?
Thanks in advance.
The NameNode stores the HDFS filesystem information in a file named fsimage. Updates to the file system (adding or removing blocks) do not update the fsimage file; instead they are logged to a file, so the I/O is fast append-only streaming as opposed to random file writes. When restarting, the NameNode reads the fsimage and then applies all the changes from the log file to bring the filesystem state up to date in memory. This process takes time.
The SecondaryNameNode's job is not to be a secondary to the NameNode, but only to periodically read the filesystem change log and apply the changes to the fsimage file, thus bringing it up to date. This allows the NameNode to start up faster next time.
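The fsimage-plus-log mechanism described above can be sketched with a dictionary and a list. This is a deliberately simplified model: the operation names, the tuple layout and the functions apply_edits and checkpoint are hypothetical, and the real fsimage and edit log are binary files with far more state.

```python
def apply_edits(fsimage, edits):
    """Return a new namespace image with every logged operation applied in order."""
    image = dict(fsimage)
    for op, path, *args in edits:
        if op == "create":
            image[path] = args[0]  # list of block ids for the new file
        elif op == "delete":
            image.pop(path, None)
        elif op == "rename":
            image[args[0]] = image.pop(path)
        else:
            raise ValueError(f"unknown op: {op}")
    return image


def checkpoint(fsimage, edits):
    """What a checkpoint does: merge the log into the image and start an empty log."""
    return apply_edits(fsimage, edits), []


# Hypothetical namespace with one file, followed by three logged changes.
fsimage = {"/a": ["blk_1"]}
edits = [
    ("create", "/b", ["blk_2"]),
    ("rename", "/a", "/c"),
    ("delete", "/b"),
]
new_image, new_edits = checkpoint(fsimage, edits)
print(new_image, new_edits)
```

A restart without checkpoints has to run apply_edits over the whole accumulated log, which is the slow startup described above; after a checkpoint the log is short again.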
Unfortunately, the SecondaryNameNode service is not a standby NameNode, despite its name. Specifically, it does not offer HA for the NameNode. This is well illustrated here.
See Understanding NameNode Startup Operations in HDFS.
Note that more recent distributions (currently Hadoop 2.6) introduce NameNode high availability using NFS (shared storage) and/or NameNode high availability using the Quorum Journal Manager.
Things have changed over the years, especially with Hadoop 2.x. The NameNode is now highly available, with a failover feature.
The Secondary NameNode is optional now, and a Standby NameNode is used for the failover process.
The Standby NameNode stays up to date with all the file system changes the Active NameNode makes.
HDFS high availability is possible with two options, NFS and the Quorum Journal Manager, but the Quorum Journal Manager is the preferred option.
Have a look at the Apache documentation.
From slide 8 of: http://www.slideshare.net/cloudera/hdfs-futures-world2012-widescreen
When any namespace modification is performed by the Active node, it durably logs a record of the modification to a majority of these JNs. The Standby node reads these edits from the JNs and applies them to its own namespace.
In the event of a failover, the Standby will ensure that it has read all of the edits from the JournalNodes before promoting itself to the Active state. This ensures that the namespace state is fully synchronized before a failover occurs.
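The majority-write idea in the quoted passage can be sketched as follows. This is a heavily simplified model under stated assumptions: the JournalNode class, log_edit and standby_catch_up are hypothetical names, and the real Quorum Journal Manager also handles epochs, fencing and log recovery, none of which appear here.

```python
class JournalNode:
    """Toy journal node that stores (txid, record) pairs while it is up."""

    def __init__(self, name, up=True):
        self.name, self.up, self.edits = name, up, []

    def append(self, txid, record):
        if not self.up:
            return False
        self.edits.append((txid, record))
        return True


def log_edit(journal_nodes, txid, record):
    """The write succeeds only if a strict majority of journal nodes stored it."""
    acks = sum(jn.append(txid, record) for jn in journal_nodes)
    return acks > len(journal_nodes) // 2


def standby_catch_up(journal_nodes, applied_txid):
    """Collect, in txid order, the edits newer than applied_txid from reachable nodes."""
    seen = {}
    for jn in journal_nodes:
        if jn.up:
            for txid, record in jn.edits:
                if txid > applied_txid:
                    seen[txid] = record
    return [seen[t] for t in sorted(seen)]


# Hypothetical setup: three journal nodes, one of them down.
jns = [JournalNode("jn1"), JournalNode("jn2"), JournalNode("jn3", up=False)]
ok1 = log_edit(jns, 1, "mkdir /x")
ok2 = log_edit(jns, 2, "create /x/f")
pending = standby_catch_up(jns, applied_txid=0)
print(ok1, ok2, pending)
```

With three journal nodes the Active node can keep logging while one of them is down, and the Standby applies everything in pending before it promotes itself.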
Have a look at the failover process in this related SE question:
How does Hadoop Namenode failover process works?
Regarding your queries on the CAP theorem for Hadoop:
It can be strongly consistent.
HDFS is almost highly available unless you meet with some bad luck
(if all three replicas of a block are down, you won't get the data).
It supports data partitioning.
The NameNode is the primary node, in which all the metadata is stored periodically in the fsimage and editlog files. When the NameNode is down, the secondary node will be online, but it has only read access to the fsimage and editlog files and no write access to them. All secondary node operations are stored in a temp folder. When the NameNode comes back online, this temp folder is copied to the NameNode, and the NameNode updates the fsimage and editlog files.
Even in HDFS High Availability, where there are two NameNodes instead of one NameNode and one SecondaryNameNode, there is not availability in the strict CAP sense. It only applies to the NameNode component, and even there if a network partition separates the client from both of the NameNodes then the cluster is effectively unavailable.
To explain it in a simple way, think of the NameNode as a man (working/live) and the Secondary NameNode as an ATM machine (storage/data storage).
All the functions are carried out by the NameNode, the man. If it goes down or fails, the Secondary NameNode is useless on its own and does no work, but later it can be used to recover your data or logs.
When the NameNode starts, it loads the FSImage and replays the edit logs to create the latest updated namespace. This process may take a long time if the edit log file is big, which increases startup time.
The job of the Secondary NameNode is to periodically check the edit log and replay it to create an updated FSImage, which it stores in persistent storage. When the NameNode starts, it doesn't need to replay the edit log to create an updated FSImage; it uses the FSImage created by the Secondary NameNode.
The namenode is a master node that contains metadata in terms of fsimage and also contains the edit log. The edit log contains recently added/removed block information in the namespace of the namenode. The fsimage file contains metadata of the entire hadoop system in a permanent storage. Every time we need to make changes permanently in fsimage, we need to restart namenode so that edit log information can be written at namenode, but it takes a lot of time to do that.
A secondary namenode is used to bring fsimage up to date. The secondary name node will access the edit log and make changes in fsimage permanently so that next time namenode can start up faster.
Basically the secondary namenode is a helper for namenode and performs housekeeping functionality for the namenode.

how does hdfs choose a datanode to store

As the title indicates, when a client requests to write a file to HDFS, how does HDFS or the NameNode choose which DataNode to store the file on?
Does HDFS try to store all the blocks of the file on the same node, or on some node in the same rack if the file is too big?
Does HDFS provide any API for applications to store a file on a DataNode of their choosing?
How does HDFS or the NameNode choose which DataNode to store the file on?
HDFS has a BlockPlacementPolicyDefault, check the API documentation for more details. It should be possible to extend BlockPlacementPolicy for a custom behavior.
Does HDFS provide any API for applications to store a file on a DataNode of their choosing?
The placement behavior should not be specific to a particular datanode. That's what makes HDFS resilient to failure and also scalable.
The code for choosing the DataNode is in the function ReplicationTargetChooser.chooseTarget().
The comment says:
The replica placement strategy is that if the writer is on a
datanode, the 1st replica is placed on the local machine, otherwise
a random datanode. The 2nd replica is placed on a datanode that is on
a different rack. The 3rd replica is placed on a datanode which is on
the same rack as the first replica.
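The strategy in the quoted comment can be sketched like this. It is not the ReplicationTargetChooser code, only an illustration of the three rules as quoted; choose_targets, the host names and the rack map are hypothetical, and the sketch assumes at least two racks and at least two nodes on the first replica's rack.

```python
import random


def choose_targets(writer, datanodes, rng=random):
    """datanodes maps host -> rack. Returns [first, second, third] hosts."""
    # 1st replica: the writer's own node if it is a datanode, else a random one.
    first = writer if writer in datanodes else rng.choice(sorted(datanodes))
    # 2nd replica: a node on a different rack from the first.
    off_rack = sorted(h for h, r in datanodes.items() if r != datanodes[first])
    second = rng.choice(off_rack)
    # 3rd replica: per the quoted comment, a node on the same rack as the first.
    same_rack = sorted(
        h for h, r in datanodes.items() if r == datanodes[first] and h != first
    )
    third = rng.choice(same_rack)
    return [first, second, third]


# Hypothetical cluster with two racks.
cluster = {"dn1": "r1", "dn2": "r1", "dn3": "r2", "dn4": "r2"}
targets = choose_targets("dn1", cluster, rng=random.Random(0))
print(targets)
```

Spreading the replicas over two racks means that losing one whole rack still leaves at least one copy of every block.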
It doesn't provide any API for applications to store a file on the DataNode they want.
If someone prefers charts, here is a picture (source):
Now with Hadoop-385 patch, we can choose the block placement policy, so as to place all blocks of a file in the same node (and similarly for replicated nodes). Read this blog about this topic - look at the comments section.
You can see this when the NameNode instructs a DataNode to store data: the first replica is stored on the local machine, and the other two replicas are made on another rack, and so on.
If any replica fails, the data is served from another replica. The chance of every replica failing is like a ceiling fan falling on your head while you sleep :p, i.e. it is very unlikely.
