How to find Hadoop applications run by an Oozie (Hadoop) job

We know that Oozie first runs a Hadoop job (the launcher), and that job in turn launches the other Hadoop applications. I want to find the list of those Hadoop applications (e.g. application_231232133) run by the Oozie (Hadoop) job. As far as I can tell, there is no API or command for this.

If you're using Oozie 5.0 or higher, the application type of those launcher jobs is "Oozie Launcher" rather than "MapReduce", so they are easy to filter out.
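For instance, you can filter on that application type with the YARN client API; a minimal sketch, assuming the cluster configuration (yarn-site.xml) is on the classpath:

```java
import java.util.Collections;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ListOozieLaunchers {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new Configuration()); // picks up yarn-site.xml from the classpath
        yarn.start();

        // Filter applications by type; Oozie 5.x launchers report "Oozie Launcher".
        List<ApplicationReport> launchers =
                yarn.getApplications(Collections.singleton("Oozie Launcher"));
        for (ApplicationReport app : launchers) {
            System.out.println(app.getApplicationId() + " " + app.getName());
        }
        yarn.stop();
    }
}
```

The same filter is available from the command line via `yarn application -list -appTypes "Oozie Launcher"`.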

You can use the Oozie REST API (http://oozie.apache.org/docs/4.2.0/WebServicesAPI.html#Job_Information), which returns an externalId attribute for each action, populated with the ID of the Hadoop job it launched.
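If you'd rather do this from Java, the Oozie client library exposes the same information; a minimal sketch (the Oozie URL and workflow job ID below are placeholders):

```java
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowAction;
import org.apache.oozie.client.WorkflowJob;

public class ListLaunchedApplications {
    public static void main(String[] args) throws Exception {
        // Hypothetical Oozie URL and workflow job ID; substitute your own.
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");
        WorkflowJob job = oozie.getJobInfo("0000001-170101000000000-oozie-oozi-W");

        // Each action's externalId holds the ID of the Hadoop job it launched.
        for (WorkflowAction action : job.getActions()) {
            System.out.println(action.getName() + " -> " + action.getExternalId());
        }
    }
}
```

For MapReduce actions the externalId is typically a job_... ID; the matching YARN application ID is the same numeric part with an application_ prefix.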

Related

Installation of Oozie on a separate machine than Hadoop

I'm very new to Oozie, so please excuse me if I sound like a newbie.
I have a Hadoop cluster which is up and running. I want to install Oozie on a separate machine than Hadoop. Is this possible? The reason I ask is that every installation guide I have seen asks for Hadoop to be installed on the same machine, so I am not sure whether it is technically possible to run Oozie on a machine separate from Hadoop.
Thanks in advance
The Oozie server serves client requests. It's a web application running in embedded Tomcat, and it can be installed on any machine from which Hadoop is reachable; it is not tied to Hadoop itself. You can specify Hadoop's nameNode and jobTracker in the workflow properties so Oozie knows where to send its jobs.
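As an illustration, here is a sketch that submits a workflow through an Oozie server on its own machine, using the Oozie Java client; the host names, path, and user are placeholders:

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class SubmitFromRemoteOozie {
    public static void main(String[] args) throws Exception {
        // The Oozie server can live on its own machine; only its URL matters here.
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        Properties conf = oozie.createConfiguration();
        // Point the workflow at the Hadoop cluster (placeholder addresses).
        conf.setProperty("nameNode", "hdfs://namenode-host:8020");
        conf.setProperty("jobTracker", "jobtracker-host:8021");
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode-host:8020/user/alice/app");
        conf.setProperty(OozieClient.USER_NAME, "alice");

        String jobId = oozie.run(conf); // submit and start the workflow
        System.out.println("Started workflow " + jobId);
    }
}
```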

Hadoop MapReduce: API for getting job logs

I'm developing a Hadoop MapReduce application, and I need to present the task logs to the end user
(the same way Hue does).
Is there a Java API that extracts the logs of a specific job?
I tried the "JobClient" API without any success.
The Job Attempts API of the HistoryServer provides a link to the logs of each task.
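For example, you can call that REST endpoint directly; a minimal sketch, assuming the default HistoryServer web port 19888 (host and job ID are placeholders):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class FetchJobAttempts {
    public static void main(String[] args) throws Exception {
        // Hypothetical HistoryServer host and job ID; substitute your own.
        URL url = new URL("http://historyserver-host:19888"
                + "/ws/v1/history/mapreduce/jobs/job_1400000000000_0001/jobattempts");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/json");

        // The JSON response contains one entry per attempt, including a
        // "logsLink" field pointing at that attempt's logs.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```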

Can an oozie instance run jobs on multiple hadoop clusters at the same time?

I have an available developer Hadoop cluster to run test jobs as well as an available production cluster. My question is, can I utilize oozie to kick off workflow jobs to multiple clusters on a single oozie instance?
What are the gotchas? I'm assuming I can just reconfigure the job tracker, namenode, and fs location properties for my workflow depending on which cluster I want the job to run on.
Assuming the clusters are all running the same distribution and version of hadoop, you should be able to.
As you note, you'll need to adjust the jobTracker and nameNode values in your Oozie actions.
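As a sketch of that approach, the same workflow can be pointed at either cluster just by swapping properties at submission time (all host names and paths below are placeholders):

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class SubmitToChosenCluster {
    public static void main(String[] args) throws Exception {
        boolean production = args.length > 0 && args[0].equals("prod");

        // One Oozie instance; the target cluster is chosen per submission.
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");
        Properties conf = oozie.createConfiguration();

        if (production) {
            conf.setProperty("nameNode", "hdfs://prod-nn:8020");
            conf.setProperty("jobTracker", "prod-jt:8021");
        } else {
            conf.setProperty("nameNode", "hdfs://dev-nn:8020");
            conf.setProperty("jobTracker", "dev-jt:8021");
        }
        conf.setProperty(OozieClient.APP_PATH,
                conf.getProperty("nameNode") + "/user/alice/app");

        System.out.println("Started " + oozie.run(conf));
    }
}
```

The workflow definition itself stays cluster-agnostic if it refers to ${nameNode} and ${jobTracker} rather than hard-coded addresses.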

Hadoop Job Automation

I have a 3-node Hadoop cluster which is used to analyse data every day at 9 PM. I want to automate running the job from the Hadoop command line. How can I do that?
Assuming you are using Linux, I'd recommend the cron scheduler; for example, a crontab entry like `0 21 * * * /path/to/run_job.sh` (script path hypothetical) fires every day at 9 PM.
Look at a cron tutorial for instructions.
Check out Oozie; it's a workflow manager for Hadoop, and I believe it has the ability to schedule jobs.
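If you'd rather not depend on cron or Oozie, here's a rough plain-Java sketch of the same schedule that shells out to the hadoop CLI every day at 9 PM; the jar path and main class are hypothetical:

```java
import java.io.IOException;
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class NightlyJobRunner {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        // Delay until the next 9 PM, then repeat every 24 hours.
        LocalDateTime now = LocalDateTime.now();
        LocalDateTime nextRun = now.toLocalDate().atTime(LocalTime.of(21, 0));
        if (!nextRun.isAfter(now)) {
            nextRun = nextRun.plusDays(1);
        }
        long initialDelay = Duration.between(now, nextRun).getSeconds();

        scheduler.scheduleAtFixedRate(NightlyJobRunner::runHadoopJob,
                initialDelay, TimeUnit.DAYS.toSeconds(1), TimeUnit.SECONDS);
    }

    private static void runHadoopJob() {
        try {
            // Hypothetical jar and main class; replace with your own job.
            Process p = new ProcessBuilder("hadoop", "jar",
                    "/opt/jobs/analysis.jar", "com.example.DailyAnalysis")
                    .inheritIO()
                    .start();
            int exit = p.waitFor();
            System.out.println("hadoop job finished with exit code " + exit);
        } catch (IOException | InterruptedException e) {
            e.printStackTrace();
        }
    }
}
```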

Should oozie be installed on all the hadoop nodes inside a single hadoop cluster?

I am running Oozie over Hadoop 1.0.3. I wanted to find out whether Oozie has to be installed on all the Hadoop nodes inside a single cluster. Is it sufficient to install it on the (Hadoop) master node only? I searched through the Oozie documentation but could not find the answer to my question.
Thank you,
Mohsin.
Oozie need not be installed on all the nodes in a cluster. It can be installed on a dedicated machine or along with any other framework. Check this guide for a quick installation of Oozie.
Note that Oozie has a client and a server component. The server component has a scheduler and a workflow engine, and the workflow engine uses hPDL (Hadoop Process Definition Language) to define workflows.
