How to execute a Sqoop job from Oozie?

I'm not able to execute a sample Oozie job that uses a Sqoop command to import data into Hive. I've placed hive-site.xml in an HDFS path, but I don't think it's being picked up: I'm getting a ClassNotFoundException. How do I fix this?
workflow.xml
<!-- This is a comment -->
<workflow-app xmlns="uri:oozie:workflow:0.4" name="oozie-wf">
    <start to="sqoop-node1"/>
    <!-- Step 1 -->
    <action name="sqoop-node1">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker></job-tracker>
            <name-node></name-node>
            <command> import command </command>
        </sqoop>
        <ok to="end"/>
        <error to="kill_job"/>
    </action>
    <kill name="kill_job">
        <message>Job failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
job.properties
nameNode=ip
jobTracker=ip
queueName=default
user.name=oozie
oozie.use.system.libpath=true
oozie.libpath=/user/hdfs/share/share/lib/sqoop
oozie.wf.application.path=workflow path
outputDir=/tmp/oozie.txt
java.io.IOException: java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf

I guess your Sqoop action requires the HCatalog library to interact with the Hive Metastore, and Oozie does not add that library by default; you have to request it explicitly.
Note that there is some literature about using HCatalog from Pig, but very little from Sqoop. Anyway, the trick is the same...
From the Oozie documentation:
Oozie share libraries are organized per action type... Oozie provides a mechanism to override the action share library JARs... More than one share library directory name can be specified for an action... For example: when using HCatLoader and HCatStorer in Pig, oozie.action.sharelib.for.pig can be set to pig,hcatalog to include both the Pig and HCatalog JARs.
In your case, you need to override a specific <property> in your Sqoop action, named oozie.action.sharelib.for.sqoop, with value sqoop,hcatalog -- then Oozie will provide the required JARs at run-time.
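For illustration, here is a minimal sketch of what that looks like inside a Sqoop action (the ${jobTracker}/${nameNode} values and the import command are placeholders, not the asker's real ones):

<action name="sqoop-node1">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <!-- ship both the Sqoop and HCatalog share libraries at run time -->
                <name>oozie.action.sharelib.for.sqoop</name>
                <value>sqoop,hcatalog</value>
            </property>
        </configuration>
        <command>import command</command>
    </sqoop>
    <ok to="end"/>
    <error to="kill_job"/>
</action>

The same override can also go in job.properties if you want it to apply workflow-wide rather than per action.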

Related

How to get oozie jobId in oozie workflow?

I have an Oozie workflow that invokes a shell file, and the shell file in turn invokes a driver class of a MapReduce job. Now I want to map my Oozie job ID to the MapReduce job ID for later processing. Is there any way to get the Oozie job ID in the workflow file so that I can pass it as an argument to my driver class for the mapping?
Following is my sample workflow.xml file
<workflow-app xmlns="uri:oozie:workflow:0.4" name="test">
<start to="start-test" />
<action name='start-test'>
<shell xmlns="uri:oozie:shell-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>${jobScript}</exec>
<argument>${fileLocation}</argument>
<argument>${nameNode}</argument>
<argument>${jobId}</argument> <!-- this is how i wanted to pass oozie jobId -->
<file>${jobScriptWithPath}#${jobScript}</file>
</shell>
<ok to="end" />
<error to="kill" />
</action>
<kill name="kill">
<message>test job failed
failed:[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end" />
Following is my shell script.
hadoop jar testProject.jar testProject.MrDriver $1 $2 $3
Try to use ${wf:id()}:
String wf:id()
It returns the workflow job ID for the current workflow job.
More info in the Oozie EL functions documentation.
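For example, in the workflow above the ${jobId} placeholder can be replaced with the EL function directly (a sketch based on the asker's shell action):

<exec>${jobScript}</exec>
<argument>${fileLocation}</argument>
<argument>${nameNode}</argument>
<!-- wf:id() is resolved by Oozie to the workflow job ID at run time -->
<argument>${wf:id()}</argument>

The shell script then receives it as $3 and can pass it to the driver unchanged.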
Oozie drops an XML file in the CWD of the YARN container running the shell (the "launcher" container), and also sets an env variable pointing to that XML (I cannot remember the name, though).
That XML contains a lot of stuff, like the name of the workflow, the name of the action, the ID of both, the run attempt number, etc.
So you can sed that information back out in the shell script itself.
Of course, passing the ID explicitly (as suggested by Alexei) would be cleaner, but sometimes "clean" is not the best way. Especially if you are concerned about whether it's the first run or not...
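A sketch of that approach inside the shell script, assuming the variable is OOZIE_ACTION_CONF_XML and that the action configuration carries an oozie.job.id property on your cluster (both are assumptions to verify against your Oozie version):

#!/bin/sh
# OOZIE_ACTION_CONF_XML points at the action configuration XML that Oozie
# drops into the container's CWD (assumed name; check your Oozie version).
if [ -n "$OOZIE_ACTION_CONF_XML" ]; then
    # pull the value that follows the oozie.job.id property name
    WF_ID=$(grep -A1 '<name>oozie.job.id</name>' "$OOZIE_ACTION_CONF_XML" \
            | sed -n 's:.*<value>\(.*\)</value>.*:\1:p')
    echo "running inside workflow: $WF_ID"
fi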

oozie running Sqoop command in a shell script

Can I write a sqoop import command in a script and execute it in Oozie as a coordinator workflow?
I have tried to do so and got an error saying "sqoop command not found", even when I give the absolute path for sqoop.
script.sh is as follows
sqoop import --connect 'jdbc:sqlserver://xx.xx.xx.xx' -username=sa -password -table materials --fields-terminated-by '^' -- --schema dbo -target-dir /user/hadoop/CFFC/oozie_materials
and I have placed the file in HDFS and gave Oozie its path. The workflow is as follows:
<workflow-app xmlns='uri:oozie:workflow:0.3' name='shell-wf'>
    <start to='shell1'/>
    <action name='shell1'>
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <exec>script.sh</exec>
            <file>script.sh#script.sh</file>
        </shell>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Script failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name='end'/>
</workflow-app>
Oozie returns a "sqoop command not found" error in the MapReduce log.
So is this a good practice?
Thanks
The shell action will be running as a mapper task, as you have observed. The sqoop command needs to be present on each data node where the mapper may run. If you make sure the sqoop command line is there and has the proper permissions for the user who submitted the job, it should work.
One way to verify (see the sketch after this list for an alternative):
1. ssh to a datanode as the specific user
2. run the sqoop command line to see if it works
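Alternatively, you can skip the local Sqoop install entirely and let Oozie ship the Sqoop classes via the share library, by using a native Sqoop action instead of a shell action. A rough sketch based on the asker's command (the <arg> form avoids shell-quoting problems; oozie.use.system.libpath=true is assumed in job.properties, and the password is best moved to a password file):

<action name='sqoop-import'>
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <arg>import</arg>
        <arg>--connect</arg>
        <arg>jdbc:sqlserver://xx.xx.xx.xx</arg>
        <arg>--username</arg>
        <arg>sa</arg>
        <arg>--table</arg>
        <arg>materials</arg>
        <arg>--fields-terminated-by</arg>
        <arg>^</arg>
        <arg>--target-dir</arg>
        <arg>/user/hadoop/CFFC/oozie_materials</arg>
    </sqoop>
    <ok to="end"/>
    <error to="fail"/>
</action>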
Try adding the sqljdbc41.jar SQL Server driver to HDFS, add an archive tag in your workflow.xml as below, and then try to run the Oozie workflow run command:
<archive>${HDFSAPATH}/sqljdbc41.jar#sqljdbc41.jar</archive>
If the problem persists, then add a hive-site.xml with the properties below:
javax.jdo.option.ConnectionURL
hive.metastore.uris
Keep hive-site.xml in HDFS, add a file tag in workflow.xml, and rerun the Oozie workflow.
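A sketch of those two additions inside the shell action (the paths are placeholders for wherever you keep the files in HDFS):

<file>${nameNode}/user/hadoop/conf/hive-site.xml#hive-site.xml</file>
<archive>${HDFSAPATH}/sqljdbc41.jar#sqljdbc41.jar</archive>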

E0701 XML schema error in OOZIE workflow

The following is my workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.3" name="import-job">
<start to="createtimelinetable" />
<action name="createtimelinetable">
<sqoop xmlns="uri:oozie:sqoop-action:0.3">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
</configuration>
<command>import --connect jdbc:mysql://10.65.220.75:3306/automation --table ABC --username root</command>
</sqoop>
<ok to="end"/>
<error to="end"/>
</action>
<end name="end"/>
</workflow-app>
Getting the following error on trying to submit the job:
Error: E0701 : E0701: XML schema error, cvc-elt.1.a: Cannot find the declaration of element 'action'.
However, oozie validate workflow.xml returns:
Valid workflow-app
Anyone who faced and resolved a similar issue in the past?
Confirm that you have copied your workflow.xml to HDFS. You need not copy job.properties to HDFS, but you do have to copy all the other files and libraries there.
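A minimal sketch of that deployment, assuming an HDFS application directory of /user/$USER/import-job and a standard Oozie client (the Oozie server URL is a placeholder):

# copy the workflow (and any libs) to the HDFS application directory
hdfs dfs -mkdir -p /user/$USER/import-job/lib
hdfs dfs -put -f workflow.xml /user/$USER/import-job/
# job.properties stays local and is passed at submission time
oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run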
For those who reached here by Googling the error message, below is the general way to resolve Oozie schema issues:
Once your workflow.xml is complete, it is best practice to validate it against the Oozie XSD schema file rather than submitting the Oozie job and facing the issue later.
A note on what an XSD schema is: an XSD schema is a kind of validation file that describes
a. the sequence of tags,
b. whether a tag must be present or not,
c. what the valid sub-tags of a tag are, etc.
How to validate a workflow XML against the XSD?
a. Find the specific XSD; it is named in the xmlns (XML namespace) attribute:
<workflow-app name='FooBarWorkFlow' xmlns="uri:oozie:workflow:0.4">
In this case it is uri:oozie:workflow:0.4. Find the XSD file for uri:oozie:workflow:0.4 (get it from the appendix of the official Oozie site, or find it easily by Googling).
b. There are numerous XML validation sites (for example https://www.liquid-technologies.com/online-xsd-validator); provide your workflow XML file and the XSD file, and validate.
Errors in the workflow XML file will be listed with line and column info. Rectify these, then use the valid workflow XML file to avoid schema validation errors in Oozie.
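If you prefer the command line to a website, xmllint can run the same check, assuming you have downloaded the matching schema file (the XSD filename here is an assumption; take the real one from the Oozie distribution):

xmllint --noout --schema oozie-workflow-0.4.xsd workflow.xml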
oozie validate some_workflow.xml
Tells you line numbers and is much easier to understand than logging output.

How do I pass arguments to an Oozie action using oozie.launcher.action.main.class?

Oozie has a config property called oozie.launcher.action.main.class where you can pass in the name of a "main class" for a map-reduce action (or a shell action), like so:
<configuration>
<property>
<name>oozie.launcher.action.main.class</name>
<value>com.company.MyCascadingClass</value>
</property>
</configuration>
But I need to pass arguments to my main class and can't see a way to do it. Any ideas?
I'm asking because I'm trying to launch a Cascading class/flow from within Oozie and all options I've tried so far have failed. If anyone has gotten Cascading to work from Oozie, let me know and I'll post another question asking that in particular.
As of Oozie 3 (I haven't tried Oozie 4 yet), the answer to my main question is: you can't. There is (strangely) no facility for specifying arguments to a main class defined with the oozie.launcher.action.main.class property.
@Dmitry's suggestion in the comments to just use the Oozie java action works for a Cascading job (or any Hadoop-dependent job), because Oozie puts all the Hadoop JARs on the classpath when it launches the job.
I've documented a working example of launching a Cascading job from Oozie at my blog here: http://thornydev.blogspot.com/2013/10/launching-cascading-job-from-apache.html
Here is the workflow.xml file that worked for me:
<workflow-app xmlns='uri:oozie:workflow:0.2' name='cascading-wf'>
    <start to='stage1'/>
    <action name='stage1'>
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <main-class>com.mycompany.MyCascade</main-class>
            <java-opts></java-opts>
            <arg>/user/myuser/dir1/dir2</arg>
            <arg>my-arg-2</arg>
            <arg>my-arg-3</arg>
            <file>lib/${EXEC}#${EXEC}</file>
            <capture-output/>
        </java>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>FAIL: Oh, the huge manatee!</message>
    </kill>
    <end name="end"/>
</workflow-app>
In the job.properties file that accompanies the workflow.xml, the EXEC property is defined as:
EXEC=mybig-shaded-0.0.1-SNAPSHOT.jar
and the JAR is put into the lib directory below where these two definition files are.
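For reference, the layout that implies looks roughly like this (a sketch; job.properties is read from the machine you submit from, the directory name is arbitrary):

cascading-wf/            (the oozie.wf.application.path directory)
├── workflow.xml
├── job.properties
└── lib/
    └── mybig-shaded-0.0.1-SNAPSHOT.jar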

Error while running Hive Action in Oozie

I'm trying to run a hive action through Oozie. My workflow.xml is as follows:
<workflow-app name='edu-apollogrp-dfe' xmlns="uri:oozie:workflow:0.1">
    <start to="HiveEvent"/>
    <action name="HiveEvent">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>oozie.hive.defaults</name>
                    <value>${hiveConfigDefaultXml}</value>
                </property>
            </configuration>
            <script>${hiveQuery}</script>
            <param>OUTPUT=${StagingDir}</param>
        </hive>
        <ok to="end"/>
        <error to="end"/>
    </action>
    <kill name='kill'>
        <message>Hive failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name='end'/>
</workflow-app>
And here is my job.properties file:
oozie.wf.application.path=${nameNode}/user/${user.name}/hiveQuery
oozie.libpath=${nameNode}/user/${user.name}/hiveQuery/lib
queueName=interactive
#QA
nameNode=hdfs://hdfs.bravo.hadoop.apollogrp.edu
jobTracker=mapred.bravo.hadoop.apollogrp.edu:8021
# Hive
hiveConfigDefaultXml=/etc/hive/conf/hive-default.xml
hiveQuery=hiveQuery.hql
StagingDir=${nameNode}/user/${user.name}/hiveQuery/Output
When I run this workflow, I end up with this error:
ACTION[0126944-130726213131121-oozie-oozi-W#HiveEvent] Launcher exception: org/apache/hadoop/hive/cli/CliDriver
java.lang.NoClassDefFoundError: org/apache/hadoop/hive/cli/CliDriver
Error Code: JA018
Error Message: org/apache/hadoop/hive/cli/CliDriver
I'm not sure what this error means. Where am I going wrong?
EDIT
This link says error code JA018 means "output directory exists" for a workflow map-reduce action. But in my case the output directory does not exist, which makes it all the more confusing.
I figured out what was going wrong!
The class org/apache/hadoop/hive/cli/CliDriver is required for execution of a Hive action. This much is obvious from the error message. This class is in the jar file hive-cli-0.7.1-cdh3u5.jar (cdh3u5 is my Cloudera version).
Oozie looks for this jar in the ShareLib directory. The location of that directory is configured on the Oozie server in oozie-site.xml (not hive-site.xml), with the property name oozie.service.WorkflowAppService.system.libpath, so Oozie should normally find the jar easily.
But in my case that property was not set, so Oozie didn't know where to look for the jar, hence the java.lang.NoClassDefFoundError.
To resolve this, I had to include a parameter in my job.properties file to point Oozie to the location of the ShareLib directory, as follows:
oozie.libpath=${nameNode}/user/oozie/share/lib
(The exact path depends on where the ShareLib directory is installed on your cluster.)
This got rid of the error!
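On recent Oozie versions there is also a simpler variant that avoids hard-coding the path, assuming the system ShareLib is installed and configured on the server:

# job.properties: let Oozie resolve the configured system libpath itself
oozie.use.system.libpath=true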
