Scheduling jobs with a file trigger in Oozie - hadoop

Using Oozie we can submit jobs in Hadoop; is it possible to have job submission triggered by the availability of a file? For example, after a file has been successfully copied to HDFS, Oozie should submit the job. Is that possible?

Use the <done-flag> tag in the dataset, like:
<datasets>
    <dataset name="dataset1" frequency="${coord:hours(1)}"
             initial-instance="${startTime}" timezone="UTC">
        <uri-template>
            ${dataRoot}/${YEAR}/${MONTH}/${DAY}/${HOUR}/
        </uri-template>
        <done-flag>_SUCCESS</done-flag>
    </dataset>
</datasets>
If the done flag is set to empty, the coordinator checks for the existence of the directory itself.
If the _SUCCESS file (or whichever file name is specified in the tag) exists in your directory, the coordinator will proceed further.
For more information see https://oozie.apache.org/docs/3.1.3-incubating/CoordinatorFunctionalSpec.html
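To actually trigger a workflow from the dataset, reference it from a coordinator's input events; the coordinator materializes an action every hour but holds it until the done flag appears. A minimal sketch, assuming the dataset above and a hypothetical ${wfAppPath} pointing at the workflow directory in HDFS:
<coordinator-app name="file-trigger-coord" frequency="${coord:hours(1)}"
                 start="${startTime}" end="${endTime}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.2">
    <datasets>
        <!-- dataset1 as defined above, including <done-flag>_SUCCESS</done-flag> -->
    </datasets>
    <input-events>
        <data-in name="input" dataset="dataset1">
            <instance>${coord:current(0)}</instance>
        </data-in>
    </input-events>
    <action>
        <workflow>
            <app-path>${wfAppPath}</app-path> <!-- hypothetical: HDFS path to workflow.xml -->
        </workflow>
    </action>
</coordinator-app>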

Related

Unable to deploy Spark jobs using Oozie

I need to keep a Spark job running 24/7 and for this I am using Oozie. To do this I have written a workflow.xml and a job.properties file, containing the information needed to invoke it.
However, when I try to submit the Oozie job using this:
oozie job –config /home/oozie/tst/job.properties -run
I get the following error message, which is very clear:
java.io.IOException: configuration is not specified
at org.apache.oozie.cli.OozieCLI.getConfiguration(OozieCLI.java:816)
at org.apache.oozie.cli.OozieCLI.jobCommand(OozieCLI.java:1055)
at org.apache.oozie.cli.OozieCLI.processCommand(OozieCLI.java:686)
at org.apache.oozie.cli.OozieCLI.run(OozieCLI.java:639)
at org.apache.oozie.cli.OozieCLI.main(OozieCLI.java:225)
configuration is not specified
The problem here is that the configuration file (job.properties) exists locally at the path specified. I also put the directory containing both files and the .jar into HDFS.
Any idea why is this failing?
Is Oozie the best tool for this task I have?
The -config parameter takes a local path, not an HDFS one. Check that job.properties is present at /home/oozie/tst/job.properties.
Check that job.properties contains oozie.wf.application.path=PATH_TO_HDFS_PATH_WHERE_WORKFLOW.XML_IS_PRESENT
Also, the dash (–) given in your config parameter is a different character than the dash (-) in the run parameter, so the CLI does not recognize the option.
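Re-typed with a plain ASCII hyphen, the command would be:
oozie job -config /home/oozie/tst/job.properties -run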
Specify the Oozie host in your command:
oozie job -oozie http://your_host:11000/oozie -config /home/oozie/tst/job.properties -run
11000 is the default port.
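You can also export the OOZIE_URL environment variable once, so you don't have to repeat the -oozie option on every invocation:
export OOZIE_URL=http://your_host:11000/oozie
oozie job -config /home/oozie/tst/job.properties -run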

How to check whether a file exists in an HDFS location, using Oozie?

How can I check whether a file exists in an HDFS location, using Oozie?
In my HDFS location I get a file like test_08_01_2016.csv at 11 PM on a daily basis.
I want to check whether this file exists at 11.15 PM. I can schedule the batch using an Oozie coordinator job.
But how can I validate that the file exists in HDFS?
You can use an EL expression in Oozie, like:
<decision name="CheckFile">
    <switch>
        <case to="nextOozieTask">
            ${fs:exists('/path/test_08_01_2016.csv')} <!-- note: the path must be in single quotes -->
        </case>
        <default to="MailActionFileMissing" />
    </switch>
</decision>
You can also build the name of the file with a simple shell action, using capture-output.
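A minimal sketch of that approach, assuming a hypothetical build_name.sh that echoes a line like filename=test_08_01_2016.csv (shell actions with <capture-output/> expose such key=value lines through wf:actionData):
<action name="get-filename">
    <shell xmlns="uri:oozie:shell-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <exec>build_name.sh</exec> <!-- hypothetical: echo "filename=test_$(date +%m_%d_%Y).csv" -->
        <file>build_name.sh</file>
        <capture-output/>
    </shell>
    <ok to="CheckFile"/>
    <error to="fail"/>
</action>
<decision name="CheckFile">
    <switch>
        <case to="nextOozieTask">
            ${fs:exists(concat('/path/', wf:actionData('get-filename')['filename']))}
        </case>
        <default to="MailActionFileMissing"/>
    </switch>
</decision>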

How to output Hadoop EL counters from streaming Map Reduce job triggered by Oozie?

I am triggering a streaming MapReduce job using Oozie, for which I would like to collect the following Hadoop EL constants:
MAP_IN: Hadoop mapper input records counter name.
MAP_OUT: Hadoop mapper output records counter name.
REDUCE_IN: Hadoop reducer input records counter name.
REDUCE_OUT: Hadoop reducer output records counter name.
I see that these can be accessed using
${hadoop:counters('mr-action')[RECORDS][REDUCE_OUT]}
However, I have no idea how to get these values to be output back to either the screen via STDOUT or to a file in HDFS on the server from where I'm launching the Oozie workflow.
I've tried passing these values to a shell action and then echoing / appending them to a file, but I believe this runs on the data nodes, so I'm not able to see that output. I've also tried setting oozie.action.external.stats.write to true, as one thread suggested, and then calling
oozie job -info -verbose
but I still don't see these counters showing up under an External Stats field. Any suggestions of how to get these counters output will be very helpful.
Before, I was doing oozie job -info job-id -verbose, which wasn't displaying external stats. The key was to make the following changes.
In workflow.xml file, under the action I want to collect counters for, add the following to the configuration:
<action name="mr-action">
<configuration>
<property>
<name>oozie.action.external.stats.write</name>
<value>true</value>
</property>
</configuration>
</action>
Then, after the job has run, do the following in the command line:
oozie job -info job-id#mr-action -verbose
which gives me the counters I was looking for.
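If you want the counters surfaced from inside the workflow itself rather than via the CLI, one option is to reference the same EL expressions from a downstream action. A sketch using the email action (this assumes an SMTP server is configured for Oozie; the recipient is illustrative):
<action name="mail-counters">
    <email xmlns="uri:oozie:email-action:0.1">
        <to>admin@example.com</to> <!-- illustrative recipient -->
        <subject>Counters for ${wf:id()}</subject>
        <body>
            MAP_IN: ${hadoop:counters('mr-action')[RECORDS][MAP_IN]}
            MAP_OUT: ${hadoop:counters('mr-action')[RECORDS][MAP_OUT]}
            REDUCE_IN: ${hadoop:counters('mr-action')[RECORDS][REDUCE_IN]}
            REDUCE_OUT: ${hadoop:counters('mr-action')[RECORDS][REDUCE_OUT]}
        </body>
    </email>
    <ok to="end"/>
    <error to="fail"/>
</action>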

Oozie shell script action

I am exploring the capabilities of Oozie for managing Hadoop workflows. I am trying to set up a shell action which invokes some hive commands. My shell script hive.sh looks like:
#!/bin/bash
hive -f hivescript
Where the hive script (which has been tested independently) creates some tables and so on. My question is where to keep the hivescript and then how to reference it from the shell script.
I've tried two ways: first using a local path, like hive -f /local/path/to/file, and second using a relative path as above, hive -f hivescript, in which case I keep my hivescript in the Oozie app path directory (same as hive.sh and workflow.xml) and set it to be sent to the distributed cache via the workflow.xml.
With both methods I get the error message
"Main class [org.apache.oozie.action.hadoop.ShellMain], exit code [1]" on the Oozie web console. Additionally, I've tried using HDFS paths in the shell script and that does not work either, as far as I know.
My job.properties file:
nameNode=hdfs://sandbox:8020
jobTracker=hdfs://sandbox:50300
queueName=default
oozie.libpath=${nameNode}/user/oozie/share/lib
oozie.use.system.libpath=true
oozieProjectRoot=${nameNode}/user/sandbox/poc1
appPath=${oozieProjectRoot}/testwf
oozie.wf.application.path=${appPath}
And workflow.xml:
<shell xmlns="uri:oozie:shell-action:0.1">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
        <property>
            <name>mapred.job.queue.name</name>
            <value>${queueName}</value>
        </property>
    </configuration>
    <exec>${appPath}/hive.sh</exec>
    <file>${appPath}/hive.sh</file>
    <file>${appPath}/hive_pill</file>
</shell>
<ok to="end"/>
<error to="end"/>
</action>
<end name="end"/>
My objective is to use Oozie to call a Hive script through a shell script; please give your suggestions.
One thing that has always been tricky about Oozie workflows is the execution of bash scripts.
Hadoop is designed to be massively parallel, so the architecture behaves very differently than you might expect.
When an Oozie workflow executes a shell action, it receives resources from your JobTracker or YARN on any of the nodes in your cluster. This means that using a local path for your file will not work, since local storage exists only on your edge node. If the job happens to spawn on your edge node then it will work, but any other time it will fail, and this placement is random.
To get around this, I found it best to have the files I needed (including the sh scripts) in hdfs in either a lib space or the same location as my workflow.
Here is a good way to approach what you are trying to achieve.
<shell xmlns="uri:oozie:shell-action:0.1">
    <exec>hive.sh</exec>
    <file>/user/lib/hive.sh#hive.sh</file>
    <file>ETL_file1.hql#hivescript</file>
</shell>
One thing you will notice is that the exec is just hive.sh, since we are assuming that the file will be moved to the working directory in which the shell action runs.
To make sure that last note is true, you must include the file's HDFS path; this forces Oozie to distribute that file with the action. In your case, the hive script launcher should only be coded once and simply be fed different files. Since we have a one-to-many relationship, hive.sh should be kept in a lib and not distributed with every workflow.
Lastly you see the line:
<file>ETL_file1.hql#hivescript</file>
This line does two things. Before the # we have the location of the file. It is just the file name, since we should distribute our distinct hive files with our workflows:
user/directory/workflow.xml
user/directory/ETL_file1.hql
and the node running the sh will have this distributed to it automagically. Lastly, the part after the # is the variable name we assign it to inside of the sh script. This gives you the ability to reuse the same script over and over, simply feeding it different files.
HDFS directory notes:
If the file is nested inside the same directory as the workflow, then you only need to specify child paths:
user/directory/workflow.xml
user/directory/hive/ETL_file1.hql
Would yield:
<file>hive/ETL_file1.hql#hivescript</file>
But if the path is outside of the workflow directory, you will need the full path:
user/directory/workflow.xml
user/lib/hive.sh
would yield:
<file>/user/lib/hive.sh#hive.sh</file>
I hope this helps everyone.
From
http://oozie.apache.org/docs/3.3.0/DG_ShellActionExtension.html#Shell_Action_Schema_Version_0.2
If you keep your shell script and your hive script together in a folder inside the workflow directory, then you can execute them.
See the command in the sample:
<exec>${EXEC}</exec>
<argument>A</argument>
<argument>B</argument>
<file>${EXEC}#${EXEC}</file> <!--Copy the executable to compute node's current working directory -->
You can write whatever commands you want in the file.
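For instance, a sketch assuming hive.sh and hivescript sit next to workflow.xml in the application directory, with EXEC=hive.sh set in job.properties:
<action name="shell-node">
    <shell xmlns="uri:oozie:shell-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <exec>${EXEC}</exec>
        <file>${EXEC}#${EXEC}</file>
        <file>hivescript#hivescript</file> <!-- the script hive.sh invokes via: hive -f hivescript -->
    </shell>
    <ok to="end"/>
    <error to="fail"/>
</action>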
You can also use the hive action directly:
http://oozie.apache.org/docs/3.3.0/DG_HiveActionExtension.html
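A minimal hive-action sketch following those docs (the script name is illustrative; it would hold the CREATE TABLE statements directly, with no shell wrapper needed):
<action name="hive-node">
    <hive xmlns="uri:oozie:hive-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <script>hivescript.hql</script> <!-- illustrative file name, shipped next to workflow.xml -->
    </hive>
    <ok to="end"/>
    <error to="fail"/>
</action>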

Can I set the Oozie job name dynamically

We have a Hadoop service in which we have multiple applications. We need to process the data for each of the applications by re-executing the same workflow. These are scheduled to execute at the same time of day. The issue is that when these jobs are running, it is hard to tell which application a given job is running/failed/succeeded for. Of course, I can open the job configuration and find out, but that takes time since there are tens of applications running under that service.
Is there any option in oozie to dynamically pass the name of the workflow (or part of it) when executing the job such as
oozie job -run -config <filename> -name "<NameIWishToGive>"
OR
oozie job -run -config <filename> -nameSuffix "<MyApplicationNameUnderTheService>"
Also, we don't wish to create multiple job folders to execute separately, as that would be too much copy-paste.
Please suggest.
It looks to me like you should be able to just use properties set in the job config.
I was able to get a dynamic name by doing the following.
Here's an example of my workflow.xml:
<workflow-app xmlns="uri:oozie:workflow:0.2" name="map-reduce-wf-${environment}">
...
</workflow-app>
And in my job.properties I had:
...
environment=test
...
The name ended up being: "map-reduce-wf-test"
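If you'd rather not edit job.properties per run, the Oozie CLI can also override individual properties on the command line with -D, so the same workflow directory serves every application (environment is the property from the example above):
oozie job -config job.properties -D environment=prod -run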
You will find a whole bunch of Oozie command lines in the Apache docs. I'm not sure which one exactly you are looking for, so I thought I'd just paste the link. Hope this helps!
I couldn't find anything in Oozie to do that. Here is a script that does a find/replace of #{appName} and #{frequency} in the *.xml files and then uploads all files to HDFS. Values are taken from the properties file passed to the script as the third argument.
Gist - https://gist.github.com/epishkin/5952522
Example:
./upload.sh simple_reports namenode01 simple_reports/coordinator_script-1.properties
where 'simple_reports' is a folder with workflow.xml and coordinator.xml files.
workflow.xml:
<workflow-app name="#{appName}" xmlns="uri:oozie:workflow:0.3">
...
</workflow-app>
coordinator.xml:
<coordinator-app name="#{appName}-coord" xmlns="uri:oozie:coordinator:0.2"
                 frequency="#{frequency}"
                 start="${start}"
                 end="${end}"
                 timezone="America/New_York">
    ...
</coordinator-app>
coordinator_script-1.properties:
appName=multi_network
frequency=${coord:days(7)}
...
Hope this helps.
I had recently faced this issue as well: all the tables use the same workflow, but the name of the Oozie application should reflect the name of the table it is processing.
Parameterize the workflow name in workflow.xml and pass the table name from job.properties; the name of the Oozie application will then follow dataload_tablename.
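A sketch of that setup (the property name tablename and the value customers are illustrative):
workflow.xml:
<workflow-app name="dataload_${tablename}" xmlns="uri:oozie:workflow:0.3">
    ...
</workflow-app>
job.properties:
tablename=customers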
