How to schedule Hadoop jobs conditionally? - hadoop

I am pretty new to Hadoop, and particularly to Hadoop Job Scheduling. Here is what I am trying to do.
I have 2 flows, each having a Hadoop job. I am free to put these flows either in the same project or in different ones. I don't want the Hadoop jobs to run simultaneously on the cluster, but I do want to make sure that they run alternately.
E.g. flow_1 (with hadoop_job_1) runs and finishes -> flow_2 (with hadoop_job_2) runs and finishes -> flow_1 (with hadoop_job_1) runs and finishes and so on.
And of course, I would also like to handle special conditions gracefully.
E.g. if flow_1 finishes but flow_2 is not ready, flow_1 gets a chance to run again if it is ready; if flow_1 fails, flow_2 still gets its turn; and so on.
I would like to know which schedulers I can explore that are capable of doing this.
We are using MapR.
Thanks

This looks to be a standard use case for Oozie. Take a look at these tutorials:
Executing an Oozie workflow with Pig, Hive & Sqoop actions, and Oozie workflow scheduler for Hadoop.
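If a coordinator doesn't fit, one way to drive the alternation from a small client program is Oozie's Java API (OozieClient). Below is a minimal sketch under some assumptions: the Oozie URL and the two workflow application paths are hypothetical, and strict turn-taking is used where your "is the flow ready?" checks would go.

    import java.util.Properties;
    import org.apache.oozie.client.OozieClient;
    import org.apache.oozie.client.OozieClientException;
    import org.apache.oozie.client.WorkflowJob;

    public class AlternatingFlows {
        // Hypothetical Oozie URL and workflow application paths.
        private static final String OOZIE_URL = "http://oozie-host:11000/oozie";
        private static final String[] APP_PATHS = {
            "hdfs://namenode/apps/flow_1",   // contains hadoop_job_1
            "hdfs://namenode/apps/flow_2"    // contains hadoop_job_2
        };

        public static void main(String[] args) throws OozieClientException, InterruptedException {
            OozieClient oozie = new OozieClient(OOZIE_URL);
            int turn = 0;                            // which flow gets the next turn
            while (true) {
                // The "is this flow ready?" check from the question would go here;
                // if the flow whose turn it is isn't ready, the other flow could
                // take the turn instead.
                Properties conf = oozie.createConfiguration();
                conf.setProperty(OozieClient.APP_PATH, APP_PATHS[turn]);
                String jobId = oozie.run(conf);      // submit and start the workflow

                // Block until the workflow leaves PREP/RUNNING, so the two
                // flows never overlap on the cluster.
                WorkflowJob.Status status;
                do {
                    Thread.sleep(30_000);
                    status = oozie.getJobInfo(jobId).getStatus();
                } while (status == WorkflowJob.Status.PREP
                        || status == WorkflowJob.Status.RUNNING);

                // Whether it SUCCEEDED, FAILED or was KILLED, the other flow
                // still gets its turn next, as the question requires.
                turn = 1 - turn;
            }
        }
    }

An Oozie coordinator with data/time triggers is the more idiomatic way to express the "ready" conditions, but the client above shows the moving parts.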

Related

Difference between job, application, task, task attempt logs in Hadoop, Oozie

I'm running an Oozie job with multiple actions and there's a part I could not make work. In the process of troubleshooting I'm overwhelmed with lots of logs.
In the YARN UI (yarn.resourcemanager.webapp.address in yarn-site.xml, normally on port 8088), there are the application_<app_id> logs.
In the Job History Server (yarn.log.server.url in yarn-site.xml, ours on port 19888), there are the job_<job_id> logs. (These job logs should also show up in Hue's Job Browser, right?)
In Hue's Oozie workflow editor, there are the task and task_attempt logs (not sure if they're the same; everything's a mixed-up soup to me already), which redirect to the Job Browser if you click here and there.
Can someone explain the difference between these things from a Hadoop/Oozie architectural standpoint?
P.S.
I've seen container_<container_id> in the logs as well. You might as well include this in your explanation, in relation to the things above.
In terms of YARN, the programs that are being run on a cluster are called applications. In terms of MapReduce they are called jobs. So, if you are running MapReduce on YARN, job and application are the same thing (if you take a close look, job ids and application ids are the same).
A MapReduce job consists of several tasks (either map or reduce tasks). If a task fails, it is launched again on another node; those launches are task attempts.
A container is a YARN term for a unit of resource allocation. For example, a MapReduce task is run in a single container.
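To make the naming concrete, here is a small illustrative snippet using Hadoop's ID classes; the timestamp and counters below are made up, the point is only how job, application, task and attempt IDs relate to each other:

    import org.apache.hadoop.mapreduce.JobID;
    import org.apache.hadoop.mapreduce.TaskAttemptID;
    import org.apache.hadoop.mapreduce.TaskID;
    import org.apache.hadoop.mapreduce.TaskType;

    public class IdDemo {
        public static void main(String[] args) {
            // A made-up MapReduce job id...
            JobID jobId = JobID.forName("job_1526000000000_0007");
            // ...which YARN knows as an application with the same numeric part.
            String appId = jobId.toString().replaceFirst("^job", "application");
            // The job is split into tasks; each task can be retried as attempts.
            TaskID mapTask = new TaskID(jobId, TaskType.MAP, 0);
            TaskAttemptID attempt = new TaskAttemptID(mapTask, 0);

            System.out.println("application: " + appId);   // application_1526000000000_0007
            System.out.println("job:         " + jobId);   // job_1526000000000_0007
            System.out.println("task:        " + mapTask); // task_1526000000000_0007_m_000000
            System.out.println("attempt:     " + attempt); // attempt_1526000000000_0007_m_000000_0
        }
    }

The container_<container_id> entries from the question are the YARN containers those task attempts run in; their ids embed the same application id.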

Does oozie provide any performance optimizations in terms of I/O?

Since Oozie is a workflow engine for the Hadoop platform, does it improve the performance of executing a DAG of dependent MapReduce jobs?
I mean, since the output of one MapReduce job is given as input to the next MapReduce job in the DAG, does Oozie provide any mechanism for storing the intermediate results in memory and thus saving I/O?
Or is it just a workflow manager that coordinates a series of dependent MapReduce jobs?
I want to know how Oozie works internally.
It is just a workflow manager. It doesn't change how, say, MapReduce works even though it runs M/R jobs.
What you are describing is much more like what Apache Spark does. I'm not aware that Oozie integrates directly with Spark yet, but it can't possibly be difficult or far off.
It is "just a workflow manager, that coordinates a series of MapReduce" jobs. It uses the same mechanisms to execute jobs as using the command line.

What is significance of the Oozie MR launcher?

I created a simple Oozie workflow with Sqoop, Hive and Pig actions. For each of these actions, Oozie launches an MR launcher job, which in turn launches the action (Sqoop/Hive/Pig). So, there are a total of 6 MR jobs for 3 actions in the workflow.
Why does Oozie start an MR launcher to start the action rather than starting the action directly?
I posted the same question in the Apache Oozie forums and here is the response.
It's also to keep the Oozie server from being bogged down or becoming unstable. For example, if you have a bunch of workflows running Pig jobs, then you'd have the Oozie server running multiple copies of the Pig client (which is a relatively "heavy" program) directly. By moving all of the user code and external clients to map tasks in the launcher job, the Oozie server remains more lightweight and less prone to errors. It can also be much more scalable this way, because the launcher jobs distribute the job launching/monitoring to other machines in the cluster; otherwise, with the Oozie server doing everything, we'd have to limit the number of concurrent workflows based on your Oozie server's machine specs (RAM, CPU, etc). And finally, from an architectural standpoint, the Oozie server itself is stateless; that is, everything is stored in the database and the Oozie server can be taken down at any point without losing anything. If we were to launch jobs directly from the Oozie server, then we'd now have some state (e.g. the Pig client cannot be restarted and resumed).

How to chain a mapred and a mapreduce job

I have two Hadoop jobs that need to be chained together. One is a mapred job (old API), the other a mapreduce job (new API); this is because of the external libraries we use for these two jobs.
I want to know whether there is a good way to chain these two jobs.
I have tried one way: first run the mapred job with JobClient.runJob(), and after it finishes run the second one. But there is a problem when I submit this to the Hadoop cluster: if I close my local terminal, only the first job runs and the second doesn't, because the Java driver code is running locally. Is there a good solution for this, so that I can submit the whole chain to the cluster and the local program doesn't need to keep running?
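For reference, a minimal sketch of the kind of driver described above: the old-API job runs first, and the new-API job reads its output. The class name and paths are placeholders, and the mapper/reducer wiring for the external libraries is left as comments:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapreduce.Job;

    public class ChainDriver {
        public static void main(String[] args) throws Exception {
            // --- Job 1: old (mapred) API ---------------------------------------
            JobConf job1 = new JobConf(ChainDriver.class);
            job1.setJobName("old-api-job");
            // job1.setMapperClass(...); job1.setReducerClass(...);  // from your external library
            org.apache.hadoop.mapred.FileInputFormat.setInputPaths(job1, new Path(args[0]));
            org.apache.hadoop.mapred.FileOutputFormat.setOutputPath(job1, new Path(args[1]));
            JobClient.runJob(job1);                  // blocks until job 1 finishes

            // --- Job 2: new (mapreduce) API, reading job 1's output ------------
            Job job2 = Job.getInstance(new Configuration(), "new-api-job");
            job2.setJarByClass(ChainDriver.class);
            // job2.setMapperClass(...); job2.setReducerClass(...);  // from your external library
            org.apache.hadoop.mapreduce.lib.input.FileInputFormat.addInputPath(job2, new Path(args[1]));
            org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.setOutputPath(job2, new Path(args[2]));
            System.exit(job2.waitForCompletion(true) ? 0 : 1);
        }
    }

Because such a driver still runs on the client machine, it dies with your terminal; starting it with nohup (or inside screen/tmux) on an edge node, or wrapping both jobs as actions in an Oozie workflow, avoids having to keep the local program running.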

Hadoop Job Scheduling query

I am a beginner to Hadoop.
As per my understanding, the Hadoop framework runs jobs in FIFO order (default scheduling).
Is there any way to tell the framework to run a job at a particular time?
i.e. is there any way to configure it to run the job daily at 3 PM, for example?
Any inputs on this greatly appreciated.
Thanks, R
What about calling the job from an external Java scheduling framework, like Quartz? Then you can run the job whenever you want.
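A minimal Quartz 2.x sketch of "run the Hadoop job every day at 3 PM"; what the job body actually launches (your driver class, or shelling out to hadoop jar) is left as a placeholder:

    import org.quartz.CronScheduleBuilder;
    import org.quartz.JobBuilder;
    import org.quartz.JobDetail;
    import org.quartz.JobExecutionContext;
    import org.quartz.Scheduler;
    import org.quartz.Trigger;
    import org.quartz.TriggerBuilder;
    import org.quartz.impl.StdSchedulerFactory;

    public class DailyAtThree {

        // The Quartz job body: kick off the Hadoop job from here, e.g. by calling
        // your driver's waitForCompletion() or invoking `hadoop jar` externally.
        public static class LaunchHadoopJob implements org.quartz.Job {
            @Override
            public void execute(JobExecutionContext context) {
                // ToolRunner.run(new MyDriver(), new String[]{...});  // hypothetical driver
            }
        }

        public static void main(String[] args) throws Exception {
            Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
            JobDetail job = JobBuilder.newJob(LaunchHadoopJob.class)
                    .withIdentity("dailyHadoopJob")
                    .build();
            Trigger trigger = TriggerBuilder.newTrigger()
                    .withSchedule(CronScheduleBuilder.dailyAtHourAndMinute(15, 0)) // 3 PM
                    .build();
            scheduler.scheduleJob(job, trigger);
            scheduler.start();   // keeps running and fires the job every day at 15:00
        }
    }

Note that the JVM running the scheduler has to stay up; cron or an Oozie coordinator (see below) are alternatives if you'd rather not keep a process running.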
You might consider using Oozie (http://yahoo.github.com/oozie/). Among other things, it allows:
Frequency execution: Oozie workflow specification supports both data and time triggers. Users can specify execution frequency and can wait for data arrival to trigger an action in the workflow.
It is independent of any other Hadoop schedulers and should work with any of them, so probably nothing in your Hadoop configuration will change.
How about having a script to execute your Hadoop job and then using the at command to run it at a specified time? If you want the job to run regularly, you could set up a cron job to execute your script.
I'd use a commercial scheduling app and/or a custom workflow solution if cron does not cut it. We use a solution called JAMS, but keep in mind it's .NET-oriented.