How to submit a MapReduce job from a DataNode to the JobTracker? - hadoop

I am running a 12-node cluster with a separate NameNode and JobTracker. I can execute MapReduce jobs from the JobTracker node, but I want to be able to submit jobs to the JobTracker from any of my 10 DataNodes. Is this possible, and if so, how do I do it?

Yes, as long as hadoop is on the PATH on each node and the cluster configuration has been properly distributed to each DataNode.
In fact, you don't strictly need the configuration to be distributed; you just need to point the client at the right JobTracker and HDFS URLs (see the GenericOptionsParser -jt and -fs options).
See this page for more information on generic options: http://hadoop.apache.org/docs/r1.0.4/commands_manual.html#Generic+Options
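For example, here is a minimal driver sketch (the class name, paths, and host names are placeholders, not from the original question) that goes through ToolRunner, so GenericOptionsParser consumes -fs and -jt before your own code runs:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    // Hypothetical driver: ToolRunner runs GenericOptionsParser first, so
    // "-fs hdfs://namenode:8020 -jt jobtracker:8021" on the command line
    // overrides whatever is (or isn't) in the local *-site.xml files.
    public class MyJobDriver extends Configured implements Tool {

      @Override
      public int run(String[] args) throws Exception {
        Job job = new Job(getConf(), "submitted-from-datanode");
        job.setJarByClass(MyJobDriver.class);
        // A real job would set its Mapper, Reducer and output types here;
        // with nothing set, the identity map/reduce is used.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
      }

      public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MyJobDriver(), args));
      }
    }

From any DataNode you could then run something like: hadoop jar myjob.jar MyJobDriver -fs hdfs://namenode:8020 -jt jobtracker:8021 /input /output (substituting the host names and ports your cluster actually uses).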

Related

What happens to hadoop job when the NameNode is down?

In Hadoop 1.2.1, I would like a basic understanding of the questions below:
Who receives the Hadoop job? Is it the NameNode or the JobTracker?
What will happen if somebody submits a Hadoop job when the NameNode is down? Does the job fail, or is it put on hold?
What will happen if somebody submits a Hadoop job when the JobTracker is down? Does the job fail, or is it put on hold?
By a Hadoop job, you probably mean a MapReduce job. If your NameNode is down and you don't have a spare one (in an HA setup), HDFS will not be working, and every component that depends on that HDFS namespace will either hang or crash.
1) The JobTracker receives the job (the YARN ResourceManager in Hadoop 2.x).
2) I am not completely sure, but the job will probably be accepted for submission and then fail.
3) You cannot submit a job to a stopped JobTracker.
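As a rough illustration of the point about HDFS being unavailable, here is a hedged probe sketch (the class name is made up, and it assumes fs.default.name in the local configuration points at the cluster's NameNode):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // If the NameNode is down, the RPC below works through the client's
    // retry policy and then throws an IOException (typically wrapping a
    // ConnectException), which is essentially the failure a job submission
    // would hit on its first HDFS access.
    public class NameNodeProbe {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        try {
          FileSystem fs = FileSystem.get(conf);
          fs.exists(new Path("/"));   // any NameNode RPC will do
          System.out.println("NameNode is reachable");
        } catch (IOException e) {
          System.err.println("NameNode unreachable: " + e.getMessage());
        }
      }
    }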
The client submits the job to the JobTracker; the NameNode is consulted for the data the job needs and returns the block information.
The JobTracker is responsible for seeing the job through to completion and for allocating resources to it.
In cases 2 and 3, the job fails.

How is NameNode high availability achieved in Hadoop 1.x?

Is there any possible solution to achieve NameNode HA in Hadoop 1.x?
Hadoop 1.x is known for its single point of failure: there is a single master node that runs the Hadoop NameNode and the Hadoop JobTracker. The NameNode keeps a lookup table mapping every file (and its blocks) to locations in the cluster; it manages the Hadoop Distributed File System and acts as the HDFS master.
The Secondary NameNode does not provide failover; it periodically checkpoints the NameNode's metadata (merging the fsimage and edit log), so at best it holds a slightly stale copy of the NameNode's records that can be used to recover manually after a crash. There is no built-in NameNode HA in Hadoop 1.x.

Do JobTracker and TaskTrackers fall under HDFS?

I have been studying Hadoop for a while now but I am confused about JobTracker and TaskTracker systems.
I am not sure how to phrase my question but here it is:
Do the JobTracker and TaskTracker fall under the HDFS or the MapReduce category?
or a more appropriate question could be:
Can physical machines fall under HDFS or MapReduce category?
The JobTracker and TaskTracker services are part of MapReduce, which you can read more about here: http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
Physical machines can host both HDFS (NameNode, DataNode) and MapReduce (JobTracker, TaskTracker) services. In general, it's recommended to place DataNodes and TaskTrackers together on the same physical slave nodes for performance reasons. The TaskTrackers can read/write to the local DataNode.
Note that Hadoop 2 introduced YARN (http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html), which replaces the JobTracker/TaskTracker layer; MapReduce still exists, but it now runs as an application on top of YARN.

Who communicates with the NameNode in YARN?

Since the JobTracker from MapReduce 1 is replaced by the ApplicationMaster and the ResourceManager in YARN, I wonder who in YARN communicates with the NameNode to find out where the data is stored on the different DataNodes.
Is the ApplicationMaster doing this?
In YARN, the per-application ApplicationMaster is responsible for getting the input-split information from the NameNode. Later, when the task attempts are executed on the assigned nodes, the YarnChild fetches the corresponding splits from HDFS.
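To make the data-locality part concrete, here is a sketch (the class name is invented) that asks the NameNode for the same block-location metadata that split computation relies on; the hosts it returns are what become the preferred, data-local nodes for task attempts:

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Prints, for each block of an HDFS file, the DataNodes that hold a
    // replica. getFileBlockLocations() is answered by the NameNode.
    public class ShowBlockLocations {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path(args[0]);   // an input file on HDFS
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
          System.out.println("offset=" + block.getOffset()
              + " length=" + block.getLength()
              + " hosts=" + Arrays.toString(block.getHosts()));
        }
      }
    }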

Can an Oozie instance run jobs on multiple Hadoop clusters at the same time?

I have a development Hadoop cluster available for test jobs as well as a production cluster. My question is: can I use a single Oozie instance to kick off workflow jobs on multiple clusters?
What are the gotchas? I'm assuming I can just reconfigure the job tracker, namenode, and fs location properties for my workflow depending on which cluster I want the job to run on.
Assuming the clusters are all running the same distribution and version of hadoop, you should be able to.
As you note, you'll need to adjust the jobTracker and nameNode values in your Oozie actions.
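For example, with the Oozie Java client API a single Oozie server can target either cluster just by changing the submission properties; the host names, ports, HDFS paths, and the ${jobTracker}/${nameNode} parameter names your workflow.xml uses are assumptions here:

    import java.util.Properties;
    import org.apache.oozie.client.OozieClient;

    // Hypothetical sketch: one Oozie server, two target clusters. The
    // workflow.xml references ${jobTracker} and ${nameNode}, so the same
    // application can be pointed at either cluster purely through the
    // properties supplied at submission time.
    public class SubmitToCluster {
      public static void main(String[] args) throws Exception {
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://dev-namenode:8020/apps/my-workflow");
        conf.setProperty("nameNode", "hdfs://dev-namenode:8020");   // dev cluster HDFS
        conf.setProperty("jobTracker", "dev-jobtracker:8021");      // dev cluster JobTracker

        // To run the same workflow on the production cluster, swap the three
        // values above for the production NameNode/JobTracker endpoints.
        String jobId = oozie.run(conf);
        System.out.println("Submitted workflow job " + jobId);
      }
    }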

Resources