Hadoop Job Scheduling query

I am a beginner with Hadoop.
As I understand it, the Hadoop framework runs jobs in FIFO order (the default scheduler).
Is there any way to tell the framework to run a job at a particular time?
That is, can I configure it to run a job daily at, say, 3 PM?
Any input on this is greatly appreciated.
Thanks, R

What about calling the job from an external Java scheduling framework, like Quartz? Then you can run the job whenever you want.

You might consider using Oozie (http://yahoo.github.com/oozie/). Among other things, it allows:
Frequency execution: Oozie workflow specification supports both data and time triggers. Users can specify execution frequency and can wait for data arrival to trigger an action in the workflow.
It is independent of any other Hadoop scheduler and should work with any of them, so probably nothing in your Hadoop configuration will change.
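For the 3 PM case in the original question, the usual Oozie approach is a coordinator app with a daily frequency and a 15:00 start time. A minimal sketch of submitting one from the shell follows; the host names, HDFS path, and app name are assumptions, and the coordinator XML (with frequency="${coord:days(1)}") must already be deployed to HDFS:
# Hypothetical job.properties pointing at a coordinator app already in HDFS
cat > job.properties <<'EOF'
nameNode=hdfs://namenode:8020
jobTracker=jobtracker:8021
oozie.coord.application.path=${nameNode}/user/r/apps/daily-3pm-coord
EOF
# Submit to an assumed Oozie server on its default port
oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run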

How about having a script to execute your Hadoop job, and then using the at command to execute it at some specified time? If you want the job to run regularly, you could set up a cron job to execute your script.
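A minimal sketch of both options, assuming your job is wrapped in a hypothetical script /home/r/run_job.sh:
# One-off run today at 3 PM via at(1)
echo "/home/r/run_job.sh" | at 15:00
# Recurring run every day at 3 PM via cron (add with crontab -e)
0 15 * * * /home/r/run_job.sh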

If cron does not cut it, I'd use a commercial scheduling app and/or a custom workflow solution. We use a product called JAMS, but keep in mind it's .NET-oriented.

Related

Clarification regarding Oozie launcher jobs

I need some clarification regarding the Oozie launcher job.
1) Is the launcher job launched per workflow application (with several actions) or per action within a workflow application?
2) Use case: I have workflows that contain multiple shell actions (which internally execute Spark, Hive, Pig actions, etc.). The reason for using shell actions is that additional parameters, like the partition date, can be computed with custom logic and passed to Hive using .q files.
Example Shell File:
hive -hiveconf DATABASE_NAME=$1 -hiveconf MASTER_TABLE_NAME=$2 -hiveconf SOURCE_TABLE_NAME=$3 -f $4
Example .q File:
use ${hiveconf:DATABASE_NAME};
insert overwrite table ${hiveconf:MASTER_TABLE_NAME} select * from ${hiveconf:SOURCE_TABLE_NAME};
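For reference, a hypothetical invocation of the shell file above, matching the four positional parameters (all names are placeholders):
sh load_master.sh my_db master_table source_table /path/to/load_master.q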
I set oozie.launcher.mapreduce.job.queuename and mapreduce.job.queuename to different queues to avoid starvation of task slots in a single queue. I also omitted the <capture-output></capture-output> element in the corresponding shell action. However, I still see the launcher job occupying a lot of memory in the launcher queue.
Is this because the launcher job caches the log output that comes from Hive?
Is it necessary to give the launcher job enough memory when executing a shell action the way I am?
What would happen if I explicitly limited the launcher job memory?
I would highly appreciate it if someone could outline the responsibilities of the Oozie launcher job.
Thanks!
Is the launcher job launched per workflow application (with several actions) or per action within a workflow application?
The launcher job is launched per action in the workflow.
I would highly recommend using the respective Oozie actions (Hive, Pig, etc.) instead, because that allows Oozie to manage your workflow and actions in a better way.

How to schedule Hadoop jobs conditionally?

I am pretty new to Hadoop, and particularly to Hadoop job scheduling. Here is what I am trying to do.
I have two flows, each with one Hadoop job. I am free to put these flows either in the same project or in different ones. I don't want the Hadoop jobs to run simultaneously on the cluster, but I also want to make sure that they run alternately.
E.g. flow_1 (with hadoop_job_1) runs and finishes -> flow_2 (with hadoop_job_2) runs and finishes -> flow_1 (with hadoop_job_1) runs and finishes, and so on.
And of course, I would also like to handle special conditions gracefully.
E.g. if flow_1 is done but flow_2 is not ready, then flow_1 gets a chance to run again if it is ready; if flow_1 fails, flow_2 still gets its turn; etc.
I would like to know which schedulers I can explore that are capable of doing this.
We are using MapR.
Thanks
This looks like a standard use case for Oozie. Take a look at these tutorials:
Executing an Oozie workflow with Pig, Hive & Sqoop actions, and Oozie workflow scheduler for Hadoop.

How to set up a cron job to run MapReduce?

I have three MapReduce jobs that I need to configure as a cron job.
I tried creating a cron job to run them, but it's not working; the MapReduce job run is never initiated.
Please help me set up a cron job to run a MapReduce job. I could use Oozie to set up a workflow job, but I want to use cron only.
I highly recommend using Oozie for MapReduce job scheduling. A lot of the work is done for you: workflows, data dependencies, monitoring. You can start with a simple cron-like scenario and then extend it without big reengineering. See https://github.com/yahoo/oozie/wiki/Oozie-Coord-Use-Cases for examples.
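If you do stick with plain cron, note that cron runs with a minimal environment, which is the usual reason a job that works from your shell is never initiated from cron. A minimal sketch of a wrapper script (all paths and class names below are hypothetical; adjust them to your installation):
#!/bin/sh
# run_jobs.sh - set up the environment that cron does not provide
export JAVA_HOME=/usr/lib/jvm/java
export HADOOP_HOME=/usr/lib/hadoop
export PATH=$HADOOP_HOME/bin:$PATH
# Run the three jobs in sequence; stop if one fails
hadoop jar /home/user/job1.jar com.example.Job1 || exit 1
hadoop jar /home/user/job2.jar com.example.Job2 || exit 1
hadoop jar /home/user/job3.jar com.example.Job3 || exit 1
Then point the crontab entry at the script and capture its output so failures are visible, e.g.:
0 2 * * * /home/user/run_jobs.sh >> /tmp/run_jobs.log 2>&1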

How to schedule jobs in Hadoop

I am new to Hadoop. I have written a few jobs and exported them as a jar file. I am able to run them using the hadoop jar command, and I want to run these jobs every hour. How do I do this? Thanks in advance.
Hadoop itself doesn't have a way to schedule jobs like you are suggesting. So you have two main choices: Java's timer and scheduling facilities, or running the jobs from the operating system; for the latter I would suggest cron. I would personally use cron to do this: it's simple and very flexible, and it is installed by default on most servers. There are also lots of tutorials.
A cron example that runs at the top of every hour:
0 * * * * /bin/hadoop jar myJar.jar
If you want to keep it inside Java itself, I would suggest checking out this question, which has details and code: How to schedule task for start of every hour.
You could probably achieve that by writing a cron job or some script, but the better way, in my view, would be to use a scheduler like Oozie.
In addition to the already mentioned Oozie, you might want to take a look at Falcon.
From my own experience, however, a much easier approach is to use your CI system, for example Jenkins, to avoid adding new systems to your stack.
Adding another option to cron and Oozie: the Quartz Scheduler.

Hadoop Job Automation

I have a 3-node Hadoop cluster that is used to analyse data every day at 9 PM. I want to automate running the job from the Hadoop command line. How can I do that?
Assuming you are using Linux, I recommend you use the cron scheduler.
Look at this tutorial for instructions.
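A minimal crontab sketch for the 9 PM case, assuming the job is wrapped in a hypothetical script /home/user/analyse.sh:
# Run the analysis every day at 9 PM (21:00) and keep a log
0 21 * * * /home/user/analyse.sh >> /home/user/analyse.log 2>&1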
Check out Oozie; it's a workflow manager for Hadoop, and I believe it has the ability to schedule jobs.
