How to get oozie jobId in oozie workflow? - hadoop

I have an Oozie workflow that invokes a shell file; the shell file in turn invokes the driver class of a MapReduce job. Now I want to map my Oozie job ID to the MapReduce job ID for later processing. Is there any way to get the Oozie job ID in the workflow file, so that I can pass it as an argument to my driver class for the mapping?
Following is my sample workflow.xml file:
<workflow-app xmlns="uri:oozie:workflow:0.4" name="test">
<start to="start-test" />
<action name='start-test'>
<shell xmlns="uri:oozie:shell-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>${jobScript}</exec>
<argument>${fileLocation}</argument>
<argument>${nameNode}</argument>
<argument>${jobId}</argument> <!-- this is how I wanted to pass the Oozie job ID -->
<file>${jobScriptWithPath}#${jobScript}</file>
</shell>
<ok to="end" />
<error to="kill" />
</action>
<kill name="kill">
<message>test job failed
failed:[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end" />
</workflow-app>
Following is my shell script.
hadoop jar testProject.jar testProject.MrDriver $1 $2 $3

Try to use ${wf:id()}:
String wf:id()
It returns the workflow job ID for the current workflow job.
More info in the Oozie workflow EL functions documentation.
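Applied to the asker's workflow, that would mean replacing the `${jobId}` argument with `${wf:id()}` so Oozie substitutes the ID at runtime. A minimal sketch of the receiving shell script (the default values are illustrative, not from the original post):

```shell
#!/bin/sh
# Hedged sketch: if the workflow passes <argument>${wf:id()}</argument> as the
# third argument, the shell script can hand the Oozie job ID to the MR driver.
FILE_LOCATION="${1:-/tmp/input}"       # ${fileLocation} from the workflow
NAME_NODE="${2:-hdfs://nn:8020}"       # ${nameNode}
OOZIE_JOB_ID="${3:-0000000-oozie-W}"   # ${wf:id()} -- illustrative default
echo "hadoop jar testProject.jar testProject.MrDriver $FILE_LOCATION $NAME_NODE $OOZIE_JOB_ID"
```

The driver class would then receive the Oozie job ID as its third argument and can record the Oozie-to-MapReduce mapping however it likes.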

Oozie drops an XML file in the CWD of the YARN container running the shell (the "launcher" container), and also sets an environment variable pointing to that XML (I cannot remember its name though).
That XML contains a lot of information: the workflow name, the action name, the IDs of both, the run attempt number, etc.
So you can sed that information back out in the shell script itself.
Of course, passing the ID explicitly (as suggested by Alexei) would be cleaner, but sometimes "clean" is not the best way. Especially if you care about whether this is the first run attempt or a retry...
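A hedged sketch of the sed approach described above. The file layout and the property name `oozie.job.id` are assumptions (they may differ by Oozie version), so the action-configuration XML is simulated here:

```shell
#!/bin/sh
# Hedged sketch: simulate the action-configuration XML that Oozie drops in the
# launcher container's CWD (exact filename and property name may vary by
# version; "oozie.job.id" is an assumption), then sed the workflow ID out.
CONF=$(mktemp)
cat > "$CONF" <<'EOF'
<configuration>
  <property><name>oozie.job.id</name><value>0000001-oozie-W</value></property>
</configuration>
EOF
# Extract the <value> of the oozie.job.id property.
WF_ID=$(sed -n 's:.*<name>oozie\.job\.id</name><value>\([^<]*\)</value>.*:\1:p' "$CONF")
echo "workflow id: $WF_ID"
rm -f "$CONF"
```

In a real launcher container you would point sed at the actual XML file (or at the path held in the environment variable) instead of the simulated one.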

Related

How to mark an Oozie workflow action's status as OK

I am using Apache Oozie. I want to mark the status of one of the shell actions in my Oozie workflow as OK; it is currently in RUNNING state.
Could you please share the command to do this in Apache Oozie?
You don't need to explicitly set the status of an action. Oozie does that for you automatically, based on the action/task execution. For instance, let's say you have a shell action that looks something like this:
<workflow-app
xmlns="uri:oozie:workflow:0.3" name="shell-wf">
<start to="shell-node"/>
<action name="shell-node">
<shell
xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>some-script.sh</exec>
<file>/user/src/some-script.sh</file>
</shell>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
If the /user/src/some-script.sh execution is successful, Oozie will mark the action status as ok and end the job successfully. On the other hand, if the script execution encounters any error, the action will be marked as error and Oozie will immediately kill the job. If you don't want the job to be killed on an abnormal exit of your script, you can create another action and direct the error transition to that execution path instead of the kill node. Check out the Oozie Shell Action documentation for more.
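The ok/error decision comes down to the script's exit code; a hedged sketch of some-script.sh illustrating this (the function body is a placeholder):

```shell
#!/bin/sh
# Hedged sketch: the shell action's transition is driven by the exit code.
# Exit 0 -> Oozie follows <ok to="end"/>; non-zero -> <error to="fail"/>.
process_data() {
  echo "doing the real work here"   # placeholder for the script's actual job
  return 0
}
if process_data; then
  RESULT="ok"       # Oozie would mark the action OK
else
  RESULT="error"    # Oozie would follow <error> and hit the kill node
fi
echo "transition: $RESULT"
exit 0
```

So "marking an action OK" is simply a matter of making sure the script exits 0.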

Can I run py spark as a shell job in Oozie?

I have a Python script which I'm able to run through spark-submit. I need to use it in Oozie.
<!-- move files from local disk to hdfs -->
<action name="forceLoadFromLocal2hdfs">
<shell xmlns="uri:oozie:shell-action:0.3">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>driver-script.sh</exec>
<!-- single -->
<argument>s</argument>
<!-- py script -->
<argument>load_local_2_hdfs.py</argument>
<!-- local file to be moved-->
<argument>localPathFile</argument>
<!-- hdfs destination folder, be aware of, script is deleting existing folder! -->
<argument>hdfFolder</argument>
<file>${workflowRoot}driver-script.sh#driver-script.sh</file>
<file>${workflowRoot}load_local_2_hdfs.py#load_local_2_hdfs.py</file>
</shell>
<ok to="end"/>
<error to="killAction"/>
</action>
The script runs fine by itself through driver-script.sh. Through Oozie, even though the workflow status is SUCCEEDED, the file is not copied to HDFS. I was not able to find any error logs, or any logs related to the pyspark job.
I have another question open about logs from Spark being suppressed by Oozie.
Add set -x at the beginning of your script; that will show you which line the script is at. You can see the trace in stderr.
Can you elaborate on what you mean by "the file is not copied"? That would help us help you better.
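The set -x tip above can be sketched as follows (argument names are taken from the question's workflow; the spark-submit line is only echoed here):

```shell
#!/bin/sh
# Hedged sketch of the top of driver-script.sh: with xtrace on, every command
# is echoed to stderr before it runs, so the launcher's stderr log shows
# exactly how far the script got before anything went wrong.
set -x
MODE="${1:-s}"                          # 's' for single, per the <argument> tags
PY_SCRIPT="${2:-load_local_2_hdfs.py}"  # the py script passed by the workflow
echo "would run: spark-submit $PY_SCRIPT"
set +x
```

When the action runs, open the launcher attempt's stderr in the YARN logs to read the trace.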

How do I set path in oozie workflows?

I am trying to run a shell script on Oozie. First, I selected the path of the shell script file, after which I added the arguments to run it. When I try running the Oozie workflow, it goes into a running loop which gets killed after 10 seconds.
I also added the environment variable by setting the path of the output folder in HDFS. When I run it, it again runs into a loop which gets killed after 10 seconds. I am unable to figure out how to set the path. Please help.
Your question is not clear, but I guess you are trying to run a shell script using an Oozie workflow, where the shell script's arguments are passed from Oozie itself.
If my understanding is right, you can pass the argument variables from Oozie via coordinator.properties, coordinator.xml, or workflow.xml itself.
Example:
Let's say you have a shell script which performs a distcp to another DFS location every time it executes.
Shell Script:
> hadoop dfs -rmr destination_location
> hadoop distcp hdfs://<source_dfs><source_dfs_location> hdfs://<destination_dfs><destination_dfs_location>
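A hedged sketch of the same script with the locations taken from arguments (which the workflow could supply via `<argument>` tags under `<exec>`) instead of being hardcoded; the commands are only echoed here and the default paths are illustrative:

```shell
#!/bin/sh
# Hedged sketch of shell_script.sh: $1 and $2 would come from <argument> tags
# in the workflow, so the same script works for any source/destination pair.
SRC="${1:-hdfs://source-nn/source_dfs_location}"        # illustrative default
DST="${2:-hdfs://destination-nn/destination_dfs_location}"
echo "hadoop dfs -rmr $DST"
echo "hadoop distcp $SRC $DST"
```

The corresponding workflow.xml would then carry `<argument>` elements after `<exec>` with the two paths (or EL variables resolving to them).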
workflow.xml:
<action name="shellAction">
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>oozie.launcher.mapred.job.queue.name</name>
<value>default</value>
</property>
</configuration>
<exec>shell_script.sh</exec>
<file>hdfs://<namenode:port>/<dfs_location>/shell_script.sh</file>
<capture-output/>
</shell>
<ok to="end"/>
<error to="killAction"/>
</action>
<kill name="killAction">
<message>Shell Action Failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
Note: the shell action should be enabled/defined in oozie-site.xml.
I believe this will help you at some point.

oozie running Sqoop command in a shell script

Can I write a sqoop import command in a script and execute it in Oozie as a coordinator workflow?
I have tried to do so and got an error saying "sqoop command not found", even when I give the absolute path for sqoop.
script.sh is as follows:
sqoop import --connect 'jdbc:sqlserver://xx.xx.xx.xx' -username=sa -password -table materials --fields-terminated-by '^' -- --schema dbo -target-dir /user/hadoop/CFFC/oozie_materials
I have placed the file in HDFS and gave Oozie its path. The workflow is as follows:
<workflow-app xmlns='uri:oozie:workflow:0.3' name='shell-wf'>
<start to='shell1' />
<action name='shell1'>
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>script.sh</exec>
<file>script.sh#script.sh</file>
</shell>
<ok to="end" />
<error to="fail" />
</action>
<kill name="fail">
<message>Script failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name='end' />
</workflow-app>
Oozie returns a "sqoop command not found" error in the MapReduce log.
So is this a good practice?
Thanks
The shell action runs as a mapper task, as you have observed. The sqoop command needs to be present on each datanode where the mapper may run. If you make sure the sqoop command line is there, with proper permissions for the user who submitted the job, it should work.
The way to verify could be:
ssh to a datanode as the specific user
run the sqoop command line to see if it works
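The verification step above can be sketched as a quick check on the worker node (assumption: you are already ssh'd in as the submitting user):

```shell
#!/bin/sh
# Hedged sketch: check whether the sqoop binary is actually on the PATH for
# the current user on this node; print where it lives if it is.
if command -v sqoop >/dev/null 2>&1; then
  SQOOP_STATUS="found at $(command -v sqoop)"
else
  SQOOP_STATUS="not on PATH for user $(whoami)"
fi
echo "sqoop: $SQOOP_STATUS"
```

If sqoop is missing on some nodes, either install it cluster-wide or switch to Oozie's dedicated sqoop action, which does not depend on a local sqoop binary on every datanode.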
Try adding the sqljdbc41.jar SQL Server driver to your HDFS, add an archive tag to your workflow.xml as below, and then run the Oozie workflow command again:
<archive>${HDFSAPATH}/sqljdbc41.jar#sqljdbc41.jar</archive>
If the problem persists, add a hive-site.xml with the properties below:
javax.jdo.option.ConnectionURL
hive.metastore.uris
Keep hive-site.xml in HDFS, add a file tag to your workflow.xml, and rerun the Oozie workflow.
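A hedged sketch of such a hive-site.xml fragment; the host names, port numbers, and database name are illustrative placeholders, not values from the original post:

```xml
<!-- Hedged sketch of a minimal hive-site.xml; hosts and ports are
     illustrative placeholders. -->
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://metastore-host:3306/hive</value>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-host:9083</value>
  </property>
</configuration>
```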

How do I pass arguments to an Oozie action using oozie.launcher.action.main.class?

Oozie has a config property called oozie.launcher.action.main.class where you can pass in the name of a "main class" for a map-reduce action (or a shell action), like so:
<configuration>
<property>
<name>oozie.launcher.action.main.class</name>
<value>com.company.MyCascadingClass</value>
</property>
</configuration>
But I need to pass arguments to my main class and can't see a way to do it. Any ideas?
I'm asking because I'm trying to launch a Cascading class/flow from within Oozie and all options I've tried so far have failed. If anyone has gotten Cascading to work from Oozie, let me know and I'll post another question asking that in particular.
As of Oozie 3 (haven't tried Oozie 4 yet), the answer to my main question is: you can't. There is no facility (strangely) for specifying any arguments to your main class defined with the oozie.launcher.action.main.class property.
@Dmitry's suggestion in the comments to just use the Oozie java action works for a Cascading job (or any Hadoop-dependent job), because Oozie puts all the Hadoop jars on the classpath when it launches the job.
I've documented a working example of launching a Cascading job from Oozie at my blog here: http://thornydev.blogspot.com/2013/10/launching-cascading-job-from-apache.html
Here is the workflow.xml file that worked for me:
<workflow-app xmlns='uri:oozie:workflow:0.2' name='cascading-wf'>
<start to='stage1' />
<action name='stage1'>
<java>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<main-class>com.mycompany.MyCascade</main-class>
<java-opts></java-opts>
<arg>/user/myuser/dir1/dir2</arg>
<arg>my-arg-2</arg>
<arg>my-arg-3</arg>
<file>lib/${EXEC}#${EXEC}</file>
<capture-output />
</java>
<ok to="end" />
<error to="fail" />
</action>
<kill name="fail">
<message>FAIL: Oh, the huge manatee!</message>
</kill>
<end name="end"/>
</workflow-app>
In the job.properties file that accompanies the workflow.xml, the EXEC property is defined as:
EXEC=mybig-shaded-0.0.1-SNAPSHOT.jar
and the job is put into the lib directory below where these two definition files are.
