Hadoop Job Automation - hadoop

I have a 3-node Hadoop cluster that is used to analyse data every day at 9 PM. I want to automate running the job from the Hadoop command line. How can I do that?

Assuming you are using Linux, I recommend the cron scheduler.
Look at this tutorial for instructions.
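As a rough sketch of the cron approach, the snippet below builds a crontab entry that fires every day at 21:00. The wrapper script and log file paths are hypothetical placeholders; the script is assumed to contain the actual `hadoop jar ...` call.

```python
def daily_cron_line(hour: int, minute: int, command: str) -> str:
    """Return a crontab entry that runs `command` once a day at hour:minute."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute must be 0-59")
    # crontab field order: minute hour day-of-month month day-of-week command
    return f"{minute} {hour} * * * {command}"


# Hypothetical paths, not taken from the question.
JOB_SCRIPT = "/home/hadoop/run_daily_job.sh"
LOG_FILE = "/home/hadoop/daily_job.log"

# 9 PM every day, appending stdout and stderr to a log file.
print(daily_cron_line(21, 0, f"{JOB_SCRIPT} >> {LOG_FILE} 2>&1"))
```

The printed line can be pasted into the crontab of the user that submits Hadoop jobs (`crontab -e`). Cron runs with a minimal environment, so the wrapper script should probably set `PATH` and any Hadoop-related variables itself.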

Check out Oozie. It's a workflow manager for Hadoop, and I believe it can schedule jobs.

Related

How to find Hadoop applications run by an Oozie (Hadoop) job

We know that Oozie first runs a Hadoop job and, through that job, runs other Hadoop applications. I want to find the list of Hadoop applications (e.g. application_231232133) run by an Oozie (Hadoop) job. Currently there is no such API or command.
If you're using Oozie 5.0 or higher, the application type of those jobs is "Oozie Launcher", not "MapReduce", so they are easy to filter out.
You may use the Oozie REST API (http://oozie.apache.org/docs/4.2.0/WebServicesAPI.html#Job_Information), which returns an externalId attribute for each action, filled in with the Hadoop application ID.
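A minimal sketch of the REST approach: fetch the job information document and collect the externalId of each action. The host name in the usage comment is a placeholder, and the `job_` to `application_` prefix swap is an assumption about how the IDs are reported on your cluster; entries without a Hadoop-style ID (for example control nodes) are skipped.

```python
import json
from urllib.request import urlopen


def job_info_url(oozie_base_url: str, job_id: str) -> str:
    """Build the job-information URL for a workflow job."""
    return f"{oozie_base_url.rstrip('/')}/v1/job/{job_id}?show=info"


def fetch_job_info(oozie_base_url: str, job_id: str) -> dict:
    """Download and parse the job-information JSON document."""
    with urlopen(job_info_url(oozie_base_url, job_id)) as response:
        return json.load(response)


def launched_application_ids(job_info: dict) -> list:
    """Return the Hadoop IDs found in the externalId field of each action."""
    ids = []
    for action in job_info.get("actions", []):
        external_id = action.get("externalId") or ""
        if external_id.startswith("job_"):
            # Assumption: job_<ts>_<seq> and application_<ts>_<seq> name the same app.
            ids.append("application_" + external_id[len("job_"):])
        elif external_id.startswith("application_"):
            ids.append(external_id)
    return ids


# Usage (placeholder host and workflow ID):
# info = fetch_job_info("http://oozie-host:11000/oozie", "<workflow-job-id>")
# print(launched_application_ids(info))
```

Note that externalId usually points at the launcher job for an action; applications started by that launcher may need to be looked up separately.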

What is the Oozie equivalent for Spark?

We have very complex pipelines which we need to compose and schedule. I see that the Hadoop ecosystem has Oozie for this. What are the choices for Spark-based jobs when I am running Spark on Mesos or Standalone and don't have a Hadoop cluster?
Unlike with Hadoop, it is pretty easy to chain things with Spark, so writing a Spark Scala script might be enough. My first recommendation is trying that.
If you'd like to keep it SQL-like, you can try Spark SQL.
If you have a really complex flow, it is worth looking at Google Dataflow: https://github.com/GoogleCloudPlatform/DataflowJavaSDK.
Oozie can be used if you run on YARN.
Spark itself has no built-in scheduler, so you are free to choose any scheduler that works in cluster mode.
For Mesos, I feel Chronos would be the right choice; more info on Chronos
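To give a feel for the Chronos suggestion: Chronos describes schedules as ISO 8601 repeating intervals (repeat marker, start time, period). The sketch below only builds such a string and a job description around it. The job name and command are made up, and the field names are an assumption based on the Chronos job format, so check them against the documentation for your version.

```python
import json
from datetime import datetime, timezone


def repeating_schedule(start: datetime, period: str = "PT24H") -> str:
    """Return an ISO 8601 repeating interval: repeat forever from `start` every `period`."""
    if start.tzinfo is None:
        raise ValueError("start must be timezone-aware")
    start_utc = start.astimezone(timezone.utc)
    return f"R/{start_utc.strftime('%Y-%m-%dT%H:%M:%SZ')}/{period}"


def job_definition(name: str, command: str, start: datetime) -> str:
    """Serialize a job description as JSON (field names are an assumption)."""
    return json.dumps({
        "name": name,
        "command": command,
        "schedule": repeating_schedule(start),
    })


# Hypothetical daily Spark pipeline starting at 21:00 UTC.
first_run = datetime(2024, 1, 1, 21, 0, tzinfo=timezone.utc)
print(job_definition("daily-pipeline", "spark-submit /path/to/pipeline.py", first_run))
```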

Can an oozie instance run jobs on multiple hadoop clusters at the same time?

I have a development Hadoop cluster available for running test jobs, as well as a production cluster. My question is: can I use a single Oozie instance to kick off workflow jobs on multiple clusters?
What are the gotchas? I'm assuming I can just reconfigure the job tracker, namenode, and fs location properties for my workflow depending on which cluster I want the job to run on.
Assuming the clusters are all running the same distribution and version of Hadoop, you should be able to.
As you note, you'll need to adjust the jobtracker and namenode values in your Oozie actions.
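One way to keep the per-cluster values out of the workflow itself is to generate a separate job.properties for each cluster and reference `${nameNode}` and `${jobTracker}` from the actions. A minimal sketch, in which every host name, port, and path is a placeholder:

```python
# Hypothetical endpoints for each cluster; replace with your own.
CLUSTERS = {
    "dev": {
        "nameNode": "hdfs://dev-nn.example.com:8020",
        "jobTracker": "dev-jt.example.com:8021",
    },
    "prod": {
        "nameNode": "hdfs://prod-nn.example.com:8020",
        "jobTracker": "prod-jt.example.com:8021",
    },
}


def job_properties(cluster: str, app_path: str) -> str:
    """Render job.properties content for the chosen cluster."""
    cfg = CLUSTERS[cluster]
    lines = [
        f"nameNode={cfg['nameNode']}",
        f"jobTracker={cfg['jobTracker']}",
        # ${nameNode} is left for Oozie to substitute at submission time.
        f"oozie.wf.application.path=${{nameNode}}{app_path}",
    ]
    return "\n".join(lines) + "\n"


print(job_properties("dev", "/user/someuser/workflows/daily"))
```

The same workflow definition can then be submitted with either file, so switching clusters only changes the properties file passed at submission time.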

HBase: do I need jobtracker/tasktracker

If I don't run any map/reduce jobs, do the JobTracker/TaskTrackers still need to be running for some internal HBase dependency?
No, you don't need either of them if you are running only HBase.
Just a tip: there are always scripts that start only HDFS, bin/start-dfs.sh for example.
As mentioned above, we don't need the JobTracker/TaskTrackers if we are dealing with just HBase. You can use bin/start-dfs.sh to start the NameNode/DataNodes. Moreover, bin/start-all.sh has been deprecated, so you should prefer bin/start-dfs.sh to start the NameNode/DataNodes and bin/start-mapred.sh to start the JobTracker/TaskTrackers. I would suggest using HBase in pseudo-distributed mode for learning and testing purposes, since in standalone mode HBase doesn't use HDFS. You should be a bit careful while configuring it, though.
Basic case: you don't need the JobTracker and TaskTrackers when using only HDFS + HBase (in a smaller testing environment you don't even need HDFS).
When you would like to run MapReduce jobs using data stored in HBase, you'll obviously need both the JobTracker and the TaskTrackers.
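The rule from the answers above, written out as a tiny sketch. It only picks which of the Hadoop start scripts mentioned in this thread to run; the function name is made up for illustration.

```python
def hadoop_start_scripts(run_mapreduce: bool) -> list:
    """Return the Hadoop start scripts needed underneath HBase."""
    # HDFS (NameNode/DataNodes) is needed for a distributed HBase setup.
    scripts = ["bin/start-dfs.sh"]
    if run_mapreduce:
        # JobTracker/TaskTrackers are only needed for MapReduce jobs.
        scripts.append("bin/start-mapred.sh")
    return scripts


print(hadoop_start_scripts(run_mapreduce=False))
```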

Hadoop Job Scheduling query

I am a beginner to Hadoop.
As per my understanding, Hadoop framework runs the Jobs in FIFO order (default scheduling).
Is there any way to tell the framework to run the job at a particular time?
i.e., is there any way to configure the job to run daily at 3 PM?
Any input on this is greatly appreciated.
Thanks, R
What about calling the job from an external Java scheduling framework, like Quartz? Then you can run the job whenever you want.
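The idea behind an external scheduler can be sketched in a few lines. This is not Quartz, just a standard-library illustration of "work out the next 3 PM, wait, then launch the job"; the jar, class name, and paths in the usage comment are hypothetical.

```python
import datetime as dt
import subprocess
import time


def next_daily_run(now: dt.datetime, at: dt.time) -> dt.datetime:
    """Return the first datetime strictly after `now` whose clock time is `at`."""
    candidate = dt.datetime.combine(now.date(), at)
    if candidate <= now:
        candidate += dt.timedelta(days=1)
    return candidate


def run_daily(command: list, at: dt.time) -> None:
    """Block forever, launching `command` once a day at the given local time."""
    while True:
        now = dt.datetime.now()
        time.sleep((next_daily_run(now, at) - now).total_seconds())
        # check=False: a failed run should not stop the scheduler loop.
        subprocess.run(command, check=False)


# Usage (hypothetical jar, class, and paths):
# run_daily(["hadoop", "jar", "/path/to/job.jar", "com.example.DailyJob",
#            "/input", "/output"], dt.time(15, 0))
```

A real scheduling framework adds what this loop lacks, such as persistence across restarts, misfire handling, and concurrent jobs.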
You might consider using Oozie (http://yahoo.github.com/oozie/). Among other things, it allows:
Frequency execution: Oozie workflow specification supports both data
and time triggers. Users can specify execution frequency and can wait
for data arrival to trigger an action in the workflow.
It is independent of any other Hadoop schedulers and should work with any of them, so probably nothing in your Hadoop configuration will change.
How about writing a script that executes your Hadoop job and then using the at command to run it at a specified time? If you want the job to run regularly, you could set up a cron job to execute your script.
I'd use a commercial scheduling app and/or a custom workflow solution if cron does not cut it. We use a solution called JAMS, but keep in mind it's .NET-oriented.
