Clarification regarding Oozie launcher jobs

I needed some clarifications regarding the oozie launcher job.
1) Is the launcher job launched per workflow application (with several actions) or per action within a workflow application?
2) Use case: I have workflows that contain multiple shell actions (which internally execute Spark, Hive, Pig jobs, etc.). The reason for using shell is that additional parameters like the partition date can be computed using custom logic and passed to Hive via .q files
Example Shell File:
hive -hiveconf DATABASE_NAME="$1" -hiveconf MASTER_TABLE_NAME="$2" -hiveconf SOURCE_TABLE_NAME="$3" -f "$4"
Example .q File:
use ${hiveconf:DATABASE_NAME};
insert overwrite table ${hiveconf:MASTER_TABLE_NAME} select * from ${hiveconf:SOURCE_TABLE_NAME};
I set the oozie.launcher.mapreduce.job.queuename and mapreduce.job.queuename to different queues to avoid starvation of task slots in a single queue. I also omitted the <capture-output></capture-output> in the corresponding shell action. However, I still see the launcher job occupying a lot of memory from the launcher queue.
Is this because the launcher job caches the log output that comes from Hive?
Is it necessary to give the launcher job enough memory when executing a shell action the way I am?
What would happen if I explicitly limited the launcher job memory?
I would highly appreciate it if someone could outline the responsibilities of the oozie launcher job.
Thanks!

Is the launcher job launched per workflow application (with several actions) or per action within a workflow application?
The launcher job is launched per action in the workflow.
I would highly recommend using the respective Oozie actions (Hive, Pig, etc.) instead, because that allows Oozie to handle your workflow and actions in a better manner.
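For example, a value such as a computed partition date can be passed straight into the .q file through a Hive action's param elements rather than through a wrapper shell script. A minimal sketch, assuming hypothetical names, paths, and properties (not taken from your setup):
# Minimal sketch of a workflow.xml using a native Hive action instead of a
# shell wrapper. All names, paths, and property values are illustrative assumptions.
cat > workflow.xml <<'EOF'
<workflow-app name="load-master-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="load-master"/>
  <action name="load-master">
    <hive xmlns="uri:oozie:hive-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>load_master.q</script>
      <!-- Values (e.g. a computed partition date) can come from
           job.properties, EL functions, or a previous action's output. -->
      <param>DATABASE_NAME=${databaseName}</param>
      <param>MASTER_TABLE_NAME=${masterTable}</param>
      <param>SOURCE_TABLE_NAME=${sourceTable}</param>
    </hive>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Hive action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
EOF
Note that parameters passed this way are typically referenced inside the .q file as ${DATABASE_NAME} rather than ${hiveconf:DATABASE_NAME}; worth verifying against your Oozie and Hive versions.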

Related

How to get resources used for FINISHED hadoop jobs from YARN logs using job names?

I have a Unix shell script which runs multiple Hive scripts. I have given job names to every Hive query inside the Hive scripts.
What I need is that, at the end of the shell script, I want to retrieve the resources used (in terms of memory and containers) for those Hive queries, based on the job names, from the YARN logs/applications whose application status is 'FINISHED'.
How do I do this?
Any help would be appreciated.
You can pull this information from the YARN History Server via its REST APIs.
https://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/HistoryServerRest.html
Scroll through this documentation and you will see examples of how to get cluster level information on jobs executed and then how to get information on individual jobs.
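For example, the ResourceManager's cluster applications endpoint (/ws/v1/cluster/apps) reports aggregate usage per application and can be filtered by state, then matched on the job names you set. A rough sketch using curl and jq, assuming a reasonably recent Hadoop 2.x; the host, port, and job name below are placeholders:
# Placeholders: ResourceManager web address and the job name set in the Hive scripts.
RM="http://rm-host:8088"
JOB_NAME="my_hive_job"
# List FINISHED applications and print the id plus aggregate memory/vcore usage
# (the memorySeconds/vcoreSeconds fields are reported by Hadoop 2.5+).
curl -s "${RM}/ws/v1/cluster/apps?states=FINISHED" |
  jq -r --arg name "$JOB_NAME" '
    .apps.app[]?
    | select(.name | contains($name))
    | "\(.id)  memorySeconds=\(.memorySeconds)  vcoreSeconds=\(.vcoreSeconds)"'
The JobHistory Server REST API linked above exposes similar per-job counters if you need MapReduce-level detail.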

How to schedule Hadoop jobs conditionally?

I am pretty new to Hadoop, and particularly to Hadoop Job Scheduling. Here is what I am trying to do.
I have 2 flows, each having a Hadoop job. I have the freedom to put these flows either in the same project or in different ones. I don't want the Hadoop jobs to run simultaneously on the cluster, but I also want to make sure that they run alternately.
E.g. flow_1 (with hadoop_job_1) runs and finishes -> flow_2 (with hadoop_job_2) runs and finishes -> flow_1 (with hadoop_job_1) runs and finishes and so on.
And of course, I would also like to handle special conditions gracefully.
E.g. flow_1 is done but flow_2 is not ready, then flow_1 gets a chance to run again if it is ready; if flow_1 fails, flow_2 still gets its turn, etc.
I would like to know which schedulers I can explore which are capable of doing this.
We are using MapR.
Thanks
This looks to be a standard use case for Oozie. Take a look at these tutorials:
Executing an Oozie workflow with Pig, Hive & Sqoop actions and Oozie workflow scheduler for Hadoop

how to trigger an Oozie workflow when previous workflow completes

I am using Oozie workflows to import many tables from different Oracle servers. Currently, I have developed a workflow for each of the tables that I want to Sqoop into Hadoop. Each one does a basic Sqoop import, then some transformation and creation of Hive tables.
Where I have got stuck is this: I want to schedule one workflow to run, which is fine (I have done that), and then I want the rest of the workflows to execute on completion of the previous one.
I have been looking at bundles but have not managed to find a solution. I hope some of you can help.
Thanks.
You can create a parent or wrapping workflow that calls each workflow in series (as part of the ok state transition). This is documented as a sub-workflow action:
http://oozie.apache.org/docs/3.3.2/WorkflowFunctionalSpec.html#a3.2.6_Sub-workflow_Action
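A rough sketch of such a parent workflow.xml, chaining two existing table-import workflows with sub-workflow actions (the application names and HDFS paths are placeholders):
# Rough sketch of a parent workflow.xml that runs two existing import
# workflows in series via sub-workflow actions. Names and paths are placeholders.
cat > workflow.xml <<'EOF'
<workflow-app name="import-all-tables" xmlns="uri:oozie:workflow:0.4">
  <start to="import-table-a"/>
  <action name="import-table-a">
    <sub-workflow>
      <app-path>${nameNode}/apps/oozie/import_table_a</app-path>
      <propagate-configuration/>
    </sub-workflow>
    <!-- The ok transition is what chains the next workflow. -->
    <ok to="import-table-b"/>
    <error to="fail"/>
  </action>
  <action name="import-table-b">
    <sub-workflow>
      <app-path>${nameNode}/apps/oozie/import_table_b</app-path>
      <propagate-configuration/>
    </sub-workflow>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Import failed at [${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
EOF
# Upload next to the parent application definition on HDFS (path is an assumption):
hdfs dfs -put -f workflow.xml /apps/oozie/import_all_tables/workflow.xml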

What is significance of the Oozie MR launcher?

I created a simple Oozie workflow with Sqoop, Hive and Pig actions. For each of these actions, Oozie launches an MR launcher job, which in turn launches the action (Sqoop/Hive/Pig). So there are a total of 6 MR jobs for 3 actions in the workflow.
Why does Oozie start an MR launcher to start the action and not directly start the action?
I posted the same question in the Apache Flume forums and here is the response:
It's also to keep the Oozie server from being bogged down or becoming
unstable. For example, if you have a bunch of workflows running Pig jobs,
then you'd have the Oozie server running multiple copies of the Pig client
(which is a relatively "heavy" program) directly. By moving all of the
user code and external clients to map tasks in the launcher job, the Oozie
server remains more light-weight and less prone to errors. It can also be
much more scalable this way because the launcher jobs distribute the
job launching/monitoring to other machines in the cluster; otherwise, with
the Oozie server doing everything, we'd have to limit the number of
concurrent workflows based on your Oozie server's machine specs (RAM, CPU,
etc). And finally, from an architectural standpoint, the Oozie server
itself is stateless; that is, everything is stored in the database and the
Oozie server can be taken down at any point without losing anything. If we
were to launch jobs directly from the Oozie server, then we'd now have some
state (e.g. the Pig client cannot be restarted and resumed).

Hadoop Job Scheduling query

I am a beginner to Hadoop.
As per my understanding, the Hadoop framework runs jobs in FIFO order (default scheduling).
Is there any way to tell the framework to run the job at a particular time?
I.e., is there any way to configure it to run the job daily at 3 PM, for example?
Any inputs on this greatly appreciated.
Thanks, R
What about calling the job from an external Java scheduling framework like Quartz? Then you can run the job as you want.
You might consider using Oozie (http://yahoo.github.com/oozie/). It allows (besides other things):
Frequency execution: Oozie workflow specification supports both data
and time triggers. Users can specify execution frequency and can wait
for data arrival to trigger an action in the workflow.
It is independent of any other Hadoop schedulers and should work with any of them, so probably nothing in your Hadoop configuration will change.
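For instance, a coordinator definition can trigger a workflow once a day at a fixed time; a minimal sketch is written out below, where the names, paths, and dates are placeholder assumptions:
# Minimal sketch of an Oozie coordinator.xml triggering a workflow daily at
# 15:00 UTC; the time of day comes from the start instant combined with the
# daily frequency. Names, paths, and dates are placeholders.
cat > coordinator.xml <<'EOF'
<coordinator-app name="daily-3pm-job" frequency="${coord:days(1)}"
                 start="2014-01-01T15:00Z" end="2015-01-01T15:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <app-path>${nameNode}/apps/oozie/my_hadoop_job</app-path>
    </workflow>
  </action>
</coordinator-app>
EOF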
How about having a script to execute your Hadoop job and then using the at command to execute it at a specified time? If you want the job to run regularly, you could set up a cron job to execute your script.
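For example (the script path and log location are assumptions):
# One-off: run an assumed wrapper script (which calls hadoop jar ...) today at 3 PM via at.
echo "/home/hadoop/run_job.sh" | at 15:00
# Recurring: a crontab entry (added with crontab -e) to run it daily at 3 PM.
# 0 15 * * * /home/hadoop/run_job.sh >> /var/log/run_job.log 2>&1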
I'd use a commercial scheduling app if cron does not cut it, and/or a custom workflow solution. We use a solution called JAMS, but keep in mind it's .NET-oriented.
