Who runs the map task on a data node? - hadoop

I faced this tricky question in one of my interviews.
The question was:
Who runs the map task on a data node?
The answer was neither the JobTracker nor the TaskTracker.
Could anybody help me, please?

Datanodes do not run any tasks; they are part of HDFS and take care of storing data.
So "map on a data node" makes no sense at all.

If Hadoop 1.x is installed and a TaskTracker is running on the same machine as the DataNode, then the TaskTracker daemon is the one that runs the map task, after getting instructions from the JobTracker.
If no TaskTracker is running on that machine, no map task can run there; the DataNode takes care of the storage side only and has nothing to do with map processing.
In Hadoop 2.x, the ApplicationMaster is the entity that does this, by coordinating with the NodeManager and the ResourceManager.

Related

Does the Namenode know about a job submitted by a client in an HDFS multi-node system?

When a client submits a job, the task tracker receives it. Can the Namenode see the code of this submitted job?
The Task Tracker doesn't exist in Hadoop 2 / YARN; but no, the code does not run within the Namenode process.
The short answer is No.
The long answer is that the NameNode does not execute the MapReduce program, so it has nothing to do with the MapReduce code. The MapReduce jar is physically uploaded to each node responsible for executing the map/reduce tasks, so only those nodes refer to the jar. The NameNode's only role here is to make sure the jar is written to the nodes responsible for executing the map/reduce tasks.
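As a rough illustration, here is a minimal job-submission sketch (the paths come from the command line and the pass-through mapper is a placeholder, not anything from the question). `setJarByClass` is what tells the client which jar to ship; the NameNode never executes any of it:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitExample {

    // Trivial pass-through mapper, just so the job has something to run.
    public static class PassThroughMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(key, value);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "submit-example");
        // The client ships the jar containing this class to the cluster;
        // only the nodes actually assigned map/reduce tasks pull it down.
        job.setJarByClass(SubmitExample.class);
        job.setMapperClass(PassThroughMapper.class);
        job.setNumReduceTasks(0); // map-only, keeps the sketch small
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```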

Hadoop on Mesos uses only one node?

I have successfully set up a Mesos 0.22.1 cluster on 5 nodes. I can run Marathon and Chronos tasks on all slave nodes. Now I'm trying to run Hadoop jobs using the Mesos scheduler. I followed a very good tutorial and could run the wordcount test job. But when I try to run a larger job (loading data from Kafka to HDFS using Camus), the job runs without errors but uses only one node with one TaskTracker, even though it has 30 map tasks in total and my nodes are configured to run 2 map tasks in parallel (see the configuration sketch below).
What am I missing? Shouldn't the JobTracker split the job to run in parallel on all available nodes, using 2 map slots on each node?
And what is strange: on the JobTracker web page, the cluster summary reports only 1 available node. Is that correct behavior?
Any ideas are greatly appreciated!
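For reference, the 2-slots-per-node limit I mentioned is what I assume is set via the standard MRv1 slot property in mapred-site.xml (a sketch, not my exact file):

```xml
<configuration>
  <!-- mapred-site.xml: per-node map slot limit in classic MRv1 -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
</configuration>
```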

How does Apache Spark handle system failures when deployed on YARN?

Preconditions
Let's assume Apache Spark is deployed on a Hadoop cluster using YARN, and a Spark job is running. How does Spark handle the situations listed below?
Cases & Questions
One node of the Hadoop cluster fails due to a disk error. However, replication is high enough and no data was lost.
What will happen to tasks that were running on that node?
One node of the Hadoop cluster fails due to a disk error. Replication was not high enough and data was lost; Spark simply couldn't find a file anymore that was pre-configured as a resource for the workflow.
How will it handle this situation?
During execution the primary NameNode fails over.
Does Spark automatically use the failover NameNode?
What happens when the secondary NameNode fails as well?
For some reason, during a workflow the cluster is totally shut down.
Will Spark restart with the cluster automatically?
Will it resume from the last "save" point of the workflow?
I know some questions might sound odd. Anyway, I hope you can answer some or all of them.
Thanks in advance. :)
Here are the answers given on the mailing list to the questions (answers were provided by Sandy Ryza of Cloudera):
"Spark will rerun those tasks on a different node."
"After a number of failed task attempts trying to read the block, Spark would pass up whatever error HDFS is returning and fail the job."
"Spark accesses HDFS through the normal HDFS client APIs. Under an HA configuration, these will automatically fail over to the new namenode. If no namenodes are left, the Spark job will fail."
Restart is part of administration, and "Spark has support for checkpointing to HDFS, so you would be able to go back to the last time checkpoint was called that HDFS was available."
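To make that last point concrete, here is a minimal Java sketch of RDD checkpointing to HDFS (both paths are placeholders, not anything from the thread); after a restart the checkpointed data can be reused rather than recomputing from scratch:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CheckpointExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("checkpoint-example");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Checkpoint data is written to HDFS, so recovery is only possible
        // if HDFS itself is available again after the restart.
        sc.setCheckpointDir("hdfs:///user/spark/checkpoints"); // placeholder path

        JavaRDD<String> lines = sc.textFile("hdfs:///data/input"); // placeholder path
        lines.checkpoint();                 // materialized on the next action
        System.out.println(lines.count());  // action triggers the checkpoint

        sc.stop();
    }
}
```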

Is JobTracker a single point of failure too (besides NameNode) in Hadoop?

I am new to Hadoop. I know that when the NameNode fails, the entire Hadoop framework goes down, so it's a single point of failure in Hadoop. Is it the same for the JobTracker? Because if the JobTracker goes down, there would be no daemon to contact the Namenode after a job submission and also no point in running the TaskTrackers. How is this handled exactly?
Yes, the JobTracker is a single point of failure in MRv1. In case of a JobTracker failure, all running jobs are halted (http://wiki.apache.org/hadoop/JobTracker).
In YARN, the ResourceManager is not a single point of failure.
If you need MRv1, you can use the MapR distribution, which provides JobTracker high availability (http://www.mapr.com/resources/videos/demo-hadoop-jobtracker-failing-and-recovering-mapr-cluster).
JobTracker HA (high availability using active and standby nodes) can be configured in the Cloudera Hadoop distribution. See the following link; this feature is available from CDH 4.2.1 onwards:
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-High-Availability-Guide/cdh4hag_topic_3_1.html
The same can be configured in the Hortonworks distribution as well:
http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.2/bk_hdp1-system-admin-guide/content/sysadminguides_ha_chap2_5_5.html
In MR2 the master service is the ResourceManager, which is not a single point of failure.
Yes, the JobTracker is a single point of failure. In case of a NameNode failure, a standby NameNode (in an HA setup) takes charge and acts as the NameNode. In MRv2 the ResourceManager concept was introduced: YARN can run several ResourceManagers, of which one is active and the others are in standby mode, so if the active one fails a standby takes charge (see the configuration sketch below).
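A minimal yarn-site.xml sketch of such an active/standby ResourceManager setup; the hostnames and the ZooKeeper quorum are placeholders for your own cluster:

```xml
<configuration>
  <property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm1</name>
    <value>master1.example.com</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm2</name>
    <value>master2.example.com</value>
  </property>
  <!-- ZooKeeper coordinates the active/standby election -->
  <property>
    <name>yarn.resourcemanager.zk-address</name>
    <value>zk1.example.com:2181,zk2.example.com:2181</value>
  </property>
</configuration>
```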
No, if the NameNode fails, the whole Hadoop framework does not go down; a framework failure and a NameNode failure are different things. The Hadoop framework is a layer running on all nodes. If the NameNode goes down, the framework no longer knows where data is stored or where free space is available, so it is not possible to store new data.
The JobTracker coordinates with the NameNode to get the data to be processed, so when the NameNode fails, the JobTracker cannot work properly either. The NameNode must be working first; this is what is meant by the NameNode being a single point of failure in Hadoop.
The JobTracker is responsible for scheduling jobs and processing the data. If the JobTracker is not working, a client can still submit a job request, but nothing can decide where that job should be submitted and where it should be processed. So on JobTracker failure, it is not possible to process data or schedule jobs.
This was one of the biggest problems in big data analysis.
Hadoop 2.x resolves both of these problems: with NameNode HA and YARN, there is no longer a single point of failure at either level.

How can I find out the IPs of the slave nodes where a map reduce task is currently running or about to run for a given job?

I want to find out the IPs of the slave nodes where a map reduce task is currently running or about to run for a given job.
Is there any method to do this?
Thanks in advance.
For any job, you can view the list of running tasks through the JobTracker web UI; this details the nodes on which each task is running.
As for where tasks are about to run: this is not necessarily decided in advance. As slots become available on a node, the job scheduler (there are several, which behave differently depending on your needs) identifies a task that will run on that node (based on a number of criteria, hopefully honoring data locality where it can) and instructs the TaskTracker on that node to run that specific task.
Programmatically, look at the javadocs for the JobClient class; it should be able to acquire information about the running tasks and their node names (you'll probably need to do a DNS lookup to get the actual IPs, I imagine).
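For example, something along these lines (a sketch using the old mapred API; note that completion events only cover task attempts that have already finished, so the web UI remains the easier way to watch currently running tasks):

```java
import java.net.InetAddress;
import java.net.URL;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapred.TaskCompletionEvent;

public class TaskHosts {
    public static void main(String[] args) throws Exception {
        JobClient client = new JobClient(new JobConf());
        // args[0] is the job id, e.g. "job_201501010000_0001"
        RunningJob job = client.getJob(JobID.forName(args[0]));
        // Each completion event carries the TaskTracker's HTTP address,
        // from which the host name (and via DNS, the IP) can be derived.
        for (TaskCompletionEvent event : job.getTaskCompletionEvents(0)) {
            String host = new URL(event.getTaskTrackerHttp()).getHost();
            System.out.println(event.getTaskAttemptId() + " ran on " + host
                + " (" + InetAddress.getByName(host).getHostAddress() + ")");
        }
    }
}
```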
Hadoop also comes with several web interfaces which are, by default (see conf/hadoop-default.xml), available at these locations:
http://localhost:50030/ – web UI for MapReduce job tracker(s)
http://localhost:50060/ – web UI for task tracker(s)
http://localhost:50070/ – web UI for HDFS name node(s)
