Oozie variables across actions - hadoop

I am new to Oozie and have a use case where we need to set a variable in one Oozie action and read the same variable in a different Oozie action. This job runs every week; in the first action we calculate a few values and we would like to use the same values in the other actions as well. Just wanted to know whether Oozie has provisions to do this?
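A common way to do this (the same mechanism appears in the reducer-count answer further down) is a shell action with capture-output enabled; later actions read the captured values with the wf:actionData EL function. A minimal sketch, assuming a hypothetical compute.sh script that prints key=value lines to stdout:

    <action name="compute-values">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>compute.sh</exec>
            <file>compute.sh#compute.sh</file>
            <!-- makes the script's key=value stdout lines available
                 to later actions -->
            <capture-output/>
        </shell>
        <ok to="use-values"/>
        <error to="fail"/>
    </action>

Any later action can then reference a captured value, e.g. ${wf:actionData('compute-values')['myVar']} if compute.sh echoed myVar=42.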

Related

How to schedule an Oozie job if any changes happen in a given folder?

I want to schedule an Oozie job based on a folder, i.e.
I have a folder in an HDFS location, and every day one file is added to that folder with the name format date.txt (e.g. 20160802.txt).
I want to trigger an Oozie batch whenever a new file is added to that folder.
Please help me with how I can schedule this for my use case scenario.
Thanks in advance.
Oozie workflow jobs are run based on regular time intervals and/or data availability, and in some cases they can be triggered by an external event. The Coordinator comes into play here.
You can use an Oozie coordinator to check the data dependency and trigger an Oozie workflow with Coordinator EL functions.
In your case a file is added to HDFS every day with a timestamp in its name, so you can achieve this with a dataset.
From the documentation:
Example: a dataset produced once every day at 00:15 PST8PDT, with the done-flag set to empty:
    <dataset name="logs" frequency="${coord:days(1)}"
             initial-instance="2009-02-15T08:15Z" timezone="America/Los_Angeles">
        <uri-template>
            hdfs://foo:9000/app/logs/${market}/${YEAR}${MONTH}/${DAY}/data
        </uri-template>
        <done-flag></done-flag>
    </dataset>
The dataset would resolve to the following URIs, and the Coordinator looks for the existence of the directory itself ([market] is replaced with a user-given property):
hdfs://foo:9000/usr/app/[market]/2009/02/15/data
hdfs://foo:9000/usr/app/[market]/2009/02/16/data
hdfs://foo:9000/usr/app/[market]/2009/02/17/data
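To actually trigger the workflow from such a dataset, the dataset is referenced from a coordinator app through an input-events/data-in element. A minimal sketch, with hypothetical names, paths and dates:

    <coordinator-app name="daily-coord" frequency="${coord:days(1)}"
                     start="2016-08-02T00:00Z" end="2017-08-02T00:00Z"
                     timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
        <datasets>
            <dataset name="input" frequency="${coord:days(1)}"
                     initial-instance="2016-08-02T00:00Z" timezone="UTC">
                <uri-template>hdfs://foo:9000/user/app/input/${YEAR}${MONTH}${DAY}</uri-template>
                <done-flag></done-flag>
            </dataset>
        </datasets>
        <input-events>
            <!-- the coordinator action waits until this instance exists -->
            <data-in name="event" dataset="input">
                <instance>${coord:current(0)}</instance>
            </data-in>
        </input-events>
        <action>
            <workflow>
                <app-path>hdfs://foo:9000/user/app/workflow.xml</app-path>
            </workflow>
        </action>
    </coordinator-app>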
Please read the documentation; many examples are given there. It's good.
1. About Coordinators
2. DataSet

Dynamically calculating oozie parameter (number of reducers for MR action)

In my Oozie workflow I dynamically create a Hive table, say T1. This Hive action is then followed by a map-reduce action. I want to set the number-of-reducers property (mapred.reduce.tasks) equal to the number of distinct values of a field, say T1.group. Any ideas how to set the value of an Oozie parameter dynamically, and how to get the value from the Hive distinct-count action into that Oozie parameter?
I hope this can help:
1. Create the Hive table as you are doing already.
2. Execute another Hive query which calculates the distinct values for the column and writes the result to a file in HDFS.
3. Create a Shell action which reads the file and echoes the value in the form key=value. Enable capture-output for the shell action.
4. In your MR action, access the action data using the Oozie EL functions, e.g. ${wf:actionData('ShellAction')['key']}, and pass this value to mapred.reduce.tasks in the configuration tag of the MR action, as sketched below.
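A sketch of step 4, assuming the shell action from step 3 is named ShellAction and its script echoed the distinct count under the key key:

    <action name="MRAction">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.reduce.tasks</name>
                    <!-- value captured from ShellAction's key=value output -->
                    <value>${wf:actionData('ShellAction')['key']}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>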

Hadoop schedule jobs to run sequentially (one job after other)?

Let's say I am resource-constrained in my Hadoop environment and I don't want to schedule really long-running jobs (i.e. ones that take days to complete). I am analyzing a vast amount of past time-series data. I want to schedule MapReduce jobs that take a day's worth of data at a time (which takes an hour to crunch).
So how do I schedule it such that a new job is submitted as soon as the previous job completes?
If you want a quick and simple approach, you could just write a shell script that calls hadoop jar in sequence for each job you want to run.
If you want a more robust approach, you could use Apache Oozie to define a workflow of jobs that will run your jobs in sequence. If you are new to Hadoop you may find it easiest to define and run your Oozie workflow using the Hue GUI.
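As a minimal sketch (action names and job details are hypothetical), an Oozie workflow chains jobs through its ok transitions, so each job is submitted only when the previous one completes successfully:

    <workflow-app name="sequential-jobs" xmlns="uri:oozie:workflow:0.4">
        <start to="day-1"/>
        <action name="day-1">
            <map-reduce>
                <!-- job-tracker, name-node and configuration for day 1 -->
            </map-reduce>
            <!-- day-2 starts only after day-1 succeeds -->
            <ok to="day-2"/>
            <error to="fail"/>
        </action>
        <action name="day-2">
            <map-reduce>
                <!-- configuration for day 2 -->
            </map-reduce>
            <ok to="end"/>
            <error to="fail"/>
        </action>
        <kill name="fail">
            <message>Job failed</message>
        </kill>
        <end name="end"/>
    </workflow-app>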

Changing the name of Hadoop job?

Hi, I would like to change the name of a running Hadoop job to a meaningful name.
Is there any command to change the name of a running job, just like this:
hadoop job -set-priority <JOB_ID> 'HIGH', which changes the priority of the job?
The job id is assigned by the job tracker on submit, by calling JobTracker.getNewJobId(). It cannot be pre-set. To change the job priority, you must retrieve the id from the submission. Read the comments on PIG-948 for why it is not always possible to know the MR job id from Pig:
The reason for that is that JobControlCompiler compiles a set of
inter-dependent MR jobs and generates a job-control object which is
then submitted asynchronously to Hadoop for execution. Since we don't
block on those threads, it's possible that job-ids are not yet assigned
when we ask for them.

How to find the node where the pig job is running

I ran a Pig workflow using Oozie. The job completed successfully, but now I want to know on which slave or master the job ran. My input file is a 1.4 GB file which is distributed across the nodes (1 master and 2 slaves).
I also want to figure out how much time Pig spent executing on each node.
Thank you in advance.
Point your web browser to JobTracker_Machine:50030 and it will present you with the MapReduce web UI. Here you'll find all the jobs you have run (Running, Completed and Retired). Click on the job which you want to analyze and it will give you all the information you need, including the node where a particular task ran and the time taken to finish the task.
HTH
Go to the Oozie web console and click on the workflow (the one which contains the Pig node). Clicking on the workflow job will open a dialog box (for your workflow) containing details of all the action nodes in the workflow. Select the Pig node (the one you want to analyse) and a detailed dialog box will appear containing the Job Tracker URL of that Pig job.
There you will find all the details you are looking for.
