Can I run pyspark as a shell job in Oozie? - hadoop

I have a Python script which I'm able to run through spark-submit. I need to use it in Oozie.
<!-- move files from local disk to hdfs -->
<action name="forceLoadFromLocal2hdfs">
<shell xmlns="uri:oozie:shell-action:0.3">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>driver-script.sh</exec>
<!-- single -->
<argument>s</argument>
<!-- py script -->
<argument>load_local_2_hdfs.py</argument>
<!-- local file to be moved-->
<argument>localPathFile</argument>
<!-- hdfs destination folder, be aware of, script is deleting existing folder! -->
<argument>hdfFolder</argument>
<file>${workflowRoot}driver-script.sh#driver-script.sh</file>
<file>${workflowRoot}load_local_2_hdfs.py#load_local_2_hdfs.py</file>
</shell>
<ok to="end"/>
<error to="killAction"/>
</action>
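The driver script itself is not shown in the question; a hypothetical driver-script.sh matching those four arguments might look like this (the spark-submit invocation and the meaning of the "s" mode are assumptions):

#!/bin/bash
# Hypothetical driver script: $1 = mode ("s" = single file), $2 = py script,
# $3 = local file to move, $4 = hdfs destination folder
MODE="$1"
PY_SCRIPT="$2"
shift 2

if [ "$MODE" = "s" ]; then
    # run the pyspark job on the launcher node, capturing its output
    spark-submit "$PY_SCRIPT" "$@" > spark.txt 2>&1
fi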
The script runs fine by itself through driver-script.sh. Through Oozie, even though the status of the workflow is SUCCEEDED, the file is not copied to HDFS. I was not able to find any error logs, or any logs related to the pyspark job.
I have another question about suppressed logs from Spark in Oozie here.

Add set -x at the beginning of your script; that will show you which line the script is at. You can see those lines in stderr.
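For example, a minimal sketch of the top of the driver script with tracing enabled:

#!/bin/bash
set -x    # print each command to stderr before it runs
set -e    # optional: fail the action instead of reporting SUCCEEDED on error
# ... rest of driver-script.sh; the trace appears in the action's stderr log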
Can you elaborate on what you mean by "the file is not copied"? It would help in giving you a better answer.

Related

Oozie suppress logging from shell job action?

I have a simple workflow (see below) which runs a shell script. The shell script runs a pyspark script, which moves a file from local disk to an HDFS folder.
When I run the shell script by itself, it works perfectly; logs are redirected to a file by > spark.txt 2>&1 right in the shell script.
But when I submit the oozie job with the following workflow, the output from the shell seems to be suppressed. I tried to redirect all possible oozie logs (-verbose -log) > oozie.txt 2>&1, but it didn't help.
The workflow finishes successfully (status SUCCEEDED, no error log), but I can see the folder is not copied to HDFS; however, when I run the script alone (not through oozie), everything is fine.
<action name="forceLoadFromLocal2hdfs">
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>driver-script.sh</exec>
<argument>s</argument>
<argument>script.py</argument>
<!-- arguments for py script -->
<argument>hdfsPath</argument>
<argument>localPath</argument>
<file>driver-script.sh#driver-script.sh</file>
</shell>
<ok to="end"/>
<error to="killAction"/>
Thx a lot!
EDIT: Thanks to the advice, I found the full log under
yarn logs -applicationId [application_xxxxxx_xxxx]

How to get oozie jobId in oozie workflow?

I have an oozie workflow that invokes a shell file. The shell file further invokes a driver class of a mapreduce job. Now I want to map my oozie jobId to the mapreduce jobId for later processing. Is there any way to get the oozie jobId in the workflow file so that I can pass it as an argument to my driver class for the mapping?
Following is my sample workflow.xml file
<workflow-app xmlns="uri:oozie:workflow:0.4" name="test">
<start to="start-test" />
<action name='start-test'>
<shell xmlns="uri:oozie:shell-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>${jobScript}</exec>
<argument>${fileLocation}</argument>
<argument>${nameNode}</argument>
<argument>${jobId}</argument> <!-- this is how i wanted to pass oozie jobId -->
<file>${jobScriptWithPath}#${jobScript}</file>
</shell>
<ok to="end" />
<error to="kill" />
</action>
<kill name="kill">
<message>test job failed
failed:[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end" />
Following is my shell script.
hadoop jar testProject.jar testProject.MrDriver $1 $2 $3
Try to use ${wf:id()}:
String wf:id()
It returns the workflow job ID for the current workflow job.
More info here.
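So, with ${wf:id()} passed as the third <argument> in place of ${jobId}, the shell script from the question receives the workflow id as its third positional parameter; a minimal sketch:

#!/bin/bash
# $1 = fileLocation, $2 = nameNode, $3 = oozie workflow id (from ${wf:id()})
hadoop jar testProject.jar testProject.MrDriver "$1" "$2" "$3"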
Oozie drops an XML file in the CWD of the YARN container running the shell (the "launcher" container), and also sets an env variable pointing to that XML (cannot remember the name though).
That XML contains a lot of stuff, like the name of the Workflow, the name of the Action, the ID of both, the run attempt number, etc.
So you can sed that information back out in the shell script itself.
Of course, passing the ID explicitly (as suggested by Alexei) would be cleaner, but sometimes "clean" is not the best way. Especially if you are concerned about whether it's the first run or not...
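A sketch of that approach, assuming the env variable is OOZIE_ACTION_CONF_XML and that the config carries an oozie.job.id property (both assumptions worth verifying against your Oozie version):

#!/bin/bash
# The launcher is assumed to export OOZIE_ACTION_CONF_XML pointing at the
# action configuration XML in the container's CWD; run `env` to confirm.
CONF_XML="${OOZIE_ACTION_CONF_XML:?not set on this cluster}"

# Pull the workflow job id back out of the Hadoop-style property XML
WF_ID=$(grep -A1 '<name>oozie.job.id</name>' "$CONF_XML" \
        | sed -n 's:.*<value>\(.*\)</value>.*:\1:p')
echo "workflow id: $WF_ID"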

Hive-oozie action error

Here is my workflow.xml
<action name="hive-node">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${nameNode}/user/${wf:user()}/${wfeRoot}/output-data/hive"/>
<mkdir path="${nameNode}/user/${wf:user()}/${wfeRoot}/output-data"/>
</prepare>
<job-xml>hive-site.xml</job-xml>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
<property>
<name>oozie.log.hive.level</name>
<value>DEBUG</value>
</property>
<property>
<name>oozie.hive.defaults</name>
<value>hive-default.xml</value>
</property>
</configuration>
<script>script.q</script>
</hive>
<ok to="end"/>
<error to="fail"/>
Hive failed, error message[${wf:errorMessage(wf:lastErrorNode())}]
My job.properties file:
nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
queueName=default
wfeRoot=wfe
oozie.use.system.libpath=true
oozie.libpath=/user/oozie/share/lib/hive
oozie.wf.application.path=${nameNode}/user/${user.name}/${wfeRoot}/hiveoozie
Script:
create table brundesh(name string,lname string) row format delimited fields terminated by ',';
I copied hive-site.xml, script.hql, and hive-default.xml into the oozie app directory. I am using CDH3.
Error details:
Error code: JA018
Error Message: Main Class[org.apache.oozie.action.hadoop.HiveMain],exit code [9]
I copied the required jar files to the sharelib directory in HDFS. I copied all the jar files present in oozie.sharelib.tar.gz from $OOZIE_HOME.
I googled for the error but no luck. Please help me figure out where I am going wrong.
As mentioned by Ben, please check the Hive log, which is present on the respective node, or check the console URL for the details of the logs.
I would also suggest the following steps (a sketch of steps 1 and 2 follows below the list):
1. Take a backup of the shared lib jars from the DFS location.
2. Upload the same jars from the local Hive lib location to the DFS shared location as the Oozie user.
3. Make sure there are no duplicate Hive jars present in any local location other than the Hive lib path.
4. All nodes should have the same jars.
5. If you are using Pig as well, then perform steps 1, 2, and 3 for Pig too.
6. Check the Hadoop classpath to verify that the classpath has been set properly.
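A minimal sketch of steps 1 and 2, assuming the default sharelib location /user/oozie/share/lib and a CDH-style local Hive lib path (adjust both for your cluster):

#!/bin/bash
# Step 1: back up the current sharelib jars from HDFS (path is an assumption)
hadoop fs -copyToLocal /user/oozie/share/lib/hive ./sharelib-hive-backup

# Step 2: re-upload the jars from the local Hive lib as the oozie user
# (local path /usr/lib/hive/lib is a CDH-style assumption)
sudo -u oozie hadoop fs -put /usr/lib/hive/lib/*.jar /user/oozie/share/lib/hive/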

Bigquery command is failing running from oozie workflow

I am a new oozie user. Currently I am trying to run a sample bigquery command (e.g. bq ls -p) from a shell script in oozie, but it is failing every time. Below I have provided the workflow and the shell script. I am trying it out in the Hortonworks Sandbox, and gcloud is authenticated in the Hortonworks Sandbox box.
I want to know: is it not possible to run a bigquery command from oozie? AFAIK the Hortonworks Sandbox uses the same virtualbox as its datanode and jobnode.
If it can run, then can anyone help me find the answer: if I am going to run it on a larger hadoop cluster, do I need to authenticate gcloud on each node?
Thanks in advance.
My workflow xml sample:
<workflow..
<start to="run_shell" />
<action name="run_shell" retry-max="2" retry-interval="1">
<shell xmlns="uri:oozie:shell-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>default</value>
</property>
</configuration>
<exec>pl2.sh</exec>
<argument>/user/bandyoa/AP/</argument>
<file>${nameNode}/user/bandyoa/AP/pl2.sh#pl2.sh</file>
<capture-output/>
</shell>
<ok to="end"/>
<error to="failure_mail"/>
</action>
</workflo..>
and shell script:
#!/bin/bash
bq ls -p
Copying all the project settings and auth settings from /home/hdfs/.config and /home/hdfs/.bigqueryrc to /home/, and setting them to be readable/writable by all users, made Oozie happy for me. Now bq ls returns the list of tables in the default dataset.
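A sketch of that copy step, using the paths from the answer (on a multi-node cluster the same would be needed on every node that can host the launcher, or gcloud would need to be authenticated on each node):

#!/bin/bash
# Make the gcloud/bq project and auth settings visible to the user the
# shell action runs as (paths taken from the answer above)
cp -r /home/hdfs/.config /home/hdfs/.bigqueryrc /home/
chmod -R a+rw /home/.config /home/.bigqueryrc

# Alternative for a real cluster (assumption): authenticate non-interactively
# on each node with a service-account key
# gcloud auth activate-service-account --key-file=/path/to/key.json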

How do I pass arguments to an Oozie action using oozie.launcher.action.main.class?

Oozie has a config property called oozie.launcher.action.main.class where you can pass in the name of a "main class" for a map-reduce action (or a shell action), like so:
<configuration>
    <property>
        <name>oozie.launcher.action.main.class</name>
        <value>com.company.MyCascadingClass</value>
    </property>
</configuration>
But I need to pass arguments to my main class and can't see a way to do it. Any ideas?
I'm asking because I'm trying to launch a Cascading class/flow from within Oozie and all options I've tried so far have failed. If anyone has gotten Cascading to work from Oozie, let me know and I'll post another question asking that in particular.
As of Oozie 3 (haven't tried Oozie 4 yet), the answer to my main question is: you can't. There is no facility (strangely) for specifying any arguments to your main class defined with the oozie.launcher.action.main.class property.
@Dmitry's suggestion in the comments to just use the Oozie java action works for a Cascading job (or any Hadoop-dependent job), because Oozie puts all the Hadoop jars on the classpath when it launches the job.
I've documented a working example of launching a Cascading job from Oozie at my blog here: http://thornydev.blogspot.com/2013/10/launching-cascading-job-from-apache.html
Here is the workflow.xml file that worked for me:
<workflow-app xmlns='uri:oozie:workflow:0.2' name='cascading-wf'>
    <start to='stage1' />
    <action name='stage1'>
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <main-class>com.mycompany.MyCascade</main-class>
            <java-opts></java-opts>
            <arg>/user/myuser/dir1/dir2</arg>
            <arg>my-arg-2</arg>
            <arg>my-arg-3</arg>
            <file>lib/${EXEC}#${EXEC}</file>
            <capture-output />
        </java>
        <ok to="end" />
        <error to="fail" />
    </action>
    <kill name="fail">
        <message>FAIL: Oh, the huge manatee!</message>
    </kill>
    <end name="end"/>
</workflow-app>
In the job.properties file that accompanies the workflow.xml, the EXEC property is defined as:
EXEC=mybig-shaded-0.0.1-SNAPSHOT.jar
and the jar is put into the lib directory below where these two definition files are.
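For reference, a sketch of the resulting application layout and the submit command (the HDFS path and Oozie server URL are assumptions for illustration):

# Assumed HDFS layout for the workflow above:
#   /user/myuser/apps/cascading-wf/workflow.xml
#   /user/myuser/apps/cascading-wf/lib/mybig-shaded-0.0.1-SNAPSHOT.jar
# job.properties stays local and is passed on the command line:
oozie job -oozie http://localhost:11000/oozie -config job.properties -run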
