Can an Oozie instance run jobs on multiple Hadoop clusters at the same time?

I have a developer Hadoop cluster available for running test jobs, as well as an available production cluster. My question is: can I use Oozie to kick off workflow jobs on multiple clusters from a single Oozie instance?
What are the gotchas? I'm assuming I can just reconfigure the job tracker, namenode, and fs location properties for my workflow depending on which cluster I want the job to run on.

Assuming the clusters are all running the same distribution and version of Hadoop, you should be able to.
As you note, you'll need to adjust the jobtracker and namenode values in your Oozie actions.
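For example, a minimal sketch of two Oozie job.properties files that point the same workflow definition at different clusters (the hostnames, ports, and paths below are placeholders, not values from the question):
# job-dev.properties (hypothetical developer cluster)
nameNode=hdfs://dev-namenode:8020
jobTracker=dev-jobtracker:8021
oozie.wf.application.path=${nameNode}/user/me/workflows/my-wf
# job-prod.properties (hypothetical production cluster)
nameNode=hdfs://prod-namenode:8020
jobTracker=prod-jobtracker:8021
oozie.wf.application.path=${nameNode}/user/me/workflows/my-wf
Because the workflow actions reference ${jobTracker} and ${nameNode}, the target cluster is chosen at submission time, e.g. oozie job -config job-dev.properties -run versus oozie job -config job-prod.properties -run.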

Related

Understanding mapreduce.framework.name wrt Hadoop

I am learning Hadoop and came to know that there are two versions of the framework, viz. Hadoop 1 and Hadoop 2.
If my understanding is correct, in Hadoop 1 the execution environment is based on two daemons, viz. TaskTracker and JobTracker, whereas in Hadoop 2 (a.k.a. YARN) the execution environment is based on "new daemons", viz. ResourceManager, NodeManager, and ApplicationMaster.
Please correct me if this is not correct.
I came to know of the following configuration parameter:
mapreduce.framework.name
possible values which it can take: local , classic , yarn
I don't understand what these actually mean; for example, if I install Hadoop 2, how can it have the old execution environment (which has TaskTracker and JobTracker)?
Can anyone explain what these values mean?
yarn stands for MR version 2.
classic is for MR version 1.
local is for local runs of MR jobs.
MR v1 and MR v2 differ only in how resources are managed and how a job is executed. The current Hadoop release is capable of both (and even of a lightweight local mode). When you set the value to yarn, you are simply instructing the framework to use the YARN way of executing the job. Similarly, when you set it to local, you are telling the framework that there is no cluster for execution and everything runs within a single JVM. It is not a different infrastructure for the MR v1 and MR v2 frameworks; it is only the way of executing the job that changes.
JobTracker, TaskTracker, etc. are all just daemon processes, which are spawned when needed and killed.
MR v1 uses the JobTracker to create and assign tasks to data nodes. This was found to be too inefficient when dealing with large clusters, which led to YARN.
MR v2 (a.k.a. YARN, "Yet Another Resource Negotiator") has a ResourceManager for each cluster, and each data node runs a NodeManager. For each job, one slave node acts as the ApplicationMaster, monitoring resources, tasks, etc.
Local mode is provided to simulate and debug MR applications within a single machine/JVM.
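As a concrete illustration (a minimal sketch, not taken from the question), the value is set in mapred-site.xml:
<!-- mapred-site.xml: choose how MapReduce jobs are executed -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <!-- one of: local, classic, yarn -->
    <value>yarn</value>
  </property>
</configuration>
Setting it to local is handy for unit tests and debugging, since the job then runs entirely inside the client JVM.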
EDIT: Based on comments
jps (Java Virtual Machine Process Status) is a JVM tool which, according to the official page:
The jps tool lists the instrumented HotSpot Java Virtual Machines
(JVMs) on the target system. The tool is limited to reporting
information on JVMs for which it has the access permissions.
So,
jps is not a big data tool, but rather a Java tool which reports on JVMs; it does not divulge any information about the processes running within a JVM.
It only lists the JVMs it has access to, which means there may still be certain JVMs that remain undetected.
Keeping the above points in mind, you will observe that the jps command emits different results depending on the Hadoop deployment mode:
Local (or Standalone) mode: There are no daemons and everything runs in a single JVM.
Pseudo-distributed mode: Each daemon (NameNode, DataNode, etc.) runs in its own JVM on a single host.
Distributed mode: Each daemon runs in its own JVM across a cluster of hosts.
Hence each of the processes may or may not run in the same JVM, and so the jps output will differ.
In distributed mode, the MR v2 framework runs in its default mode, i.e. YARN; hence, alongside the HDFS daemons, you see the YARN-specific daemons running:
Namenode
Datanode
ResourceManager
NodeManager
Apache Hadoop 1.x (MRv1) consists of the following daemons:
Namenode
Datanode
Jobtracker
Tasktracker
Note that NameNode and DataNode are common to both, because they are HDFS-specific daemons, while the other two in each list are YARN-specific and MR v1-specific respectively.
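For instance, on a pseudo-distributed Hadoop 2 setup you might see jps output along these lines (the process IDs are illustrative):
$ jps
2401 NameNode
2517 DataNode
2690 SecondaryNameNode
2854 ResourceManager
2969 NodeManager
3120 Jps
On an MR v1 deployment you would instead see JobTracker and TaskTracker next to the HDFS daemons.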

Hadoop - Can name node execute task?

Is it possible for the name node to execute tasks? By default, the tasks execute on the data nodes of the cluster.
Assuming you are asking about MapReduce...
With YARN, MapReduce tasks execute in containers coordinated by a per-job ApplicationMaster, not in the NameNode and not within the DataNode process. The ApplicationMasters themselves are monitored by the ResourceManager.
ApplicationMasters are commonly launched on nodes that run a NodeManager alongside the DataNode; you can run a NodeManager on the NameNode host as well, but you really shouldn't in a production environment.
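As a quick check (the hostnames below are illustrative), yarn node -list shows which hosts have a registered NodeManager, i.e. where containers (including ApplicationMasters) can be launched:
$ yarn node -list
Total Nodes:3
Node-Id           Node-State   Node-Http-Address   Number-of-Running-Containers
datanode1:45454   RUNNING      datanode1:8042      2
datanode2:45454   RUNNING      datanode2:8042      1
datanode3:45454   RUNNING      datanode3:8042      0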

What is the difference between multi node hadoop cluster and running hadoop on mesos?

I've built a multi-node Hadoop cluster, then I started studying Mesos and the ability to run Hadoop on a Mesos cluster, so here are my questions:
1) Should I run Hadoop on a Mesos cluster, or does it not matter?
2) What is the difference between the two?
They are different things at different layers of the stack. You could deploy the Hadoop cluster on a set of machines directly, so that those machines can handle Hadoop jobs.
Or you could deploy a Mesos cluster first, and then deploy a Hadoop cluster, a Spark cluster, Kafka, and other things on top of Mesos. You can then submit your Hadoop jobs to the Hadoop cluster and your Spark jobs to the Spark cluster.

MapReduce 2 without YARN

Considering that YARN is the better option for running MapReduce 2, is it nevertheless possible to run MR2 without YARN?
I tried using MR2, but it runs with YARN.
MRv2 is actually YARN! So no, you can't run MapReduce 2 jobs without YARN.
Official documentation:
Apache Hadoop NextGen MapReduce (YARN)
MapReduce has undergone a complete overhaul in hadoop-0.23 and we now
have, what we call, MapReduce 2.0 (MRv2) or YARN.
The fundamental idea of MRv2 is to split up the two major
functionalities of the JobTracker, resource management and job
scheduling/monitoring, into separate daemons. The idea is to have a
global ResourceManager (RM) and per-application ApplicationMaster
(AM). An application is either a single job in the classical sense of
Map-Reduce jobs or a DAG of jobs.

How to submit MapReduce job from DataNode to JobTracker?

I am running a 12-node cluster with a separate NameNode and JobTracker. I can execute MapReduce jobs from the JobTracker node, but I want to submit jobs to the JobTracker from any of my 10 DataNodes. Is that possible, and if yes, how do I do it?
Yes, as long as Hadoop is on the path (on each node) and the configuration for the cluster has been properly distributed to each data node.
In fact, you don't necessarily need the configuration to be distributed; you just need to configure the jobtracker and HDFS URLs accordingly (look at the GenericOptionsParser -jt and -fs options).
See this page for more information on generic options: http://hadoop.apache.org/docs/r1.0.4/commands_manual.html#Generic+Options
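A sketch of what such a submission could look like from a data node, assuming the job's main class implements Tool (the jar, class, hostnames, and paths are placeholders):
hadoop jar my-job.jar com.example.MyJob \
    -fs hdfs://namenode-host:8020 \
    -jt jobtracker-host:8021 \
    /input/path /output/path
The -fs and -jt generic options override the default filesystem and jobtracker from the command line, so the submitting node only needs the Hadoop client on its path.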
