Oozie: running a Sqoop command in a shell script - hadoop

Can I write a Sqoop import command in a script and execute it in Oozie as a coordinator workflow?
I have tried to do so and got an error saying sqoop command not found, even when I give the absolute path for sqoop to execute.
script.sh is as follows:
sqoop import --connect 'jdbc:sqlserver://xx.xx.xx.xx' -username=sa -password -table materials --fields-terminated-by '^' -- --schema dbo -target-dir /user/hadoop/CFFC/oozie_materials
I have placed the file in HDFS and gave Oozie its path. The workflow is as follows:
<workflow-app xmlns='uri:oozie:workflow:0.3' name='shell-wf'>
<start to='shell1' />
<action name='shell1'>
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>script.sh</exec>
<file>script.sh#script.sh</file>
</shell>
<ok to="end" />
<error to="fail" />
</action>
<kill name="fail">
<message>Script failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name='end' />
</workflow-app>
Oozie returns an error in the MapReduce log saying sqoop command not found.
So is that a good practice?
Thanks

The shell action will be running as a mapper task, as you have observed. The sqoop command needs to be present on each data node where the mapper may run. If you make sure the sqoop command line is there and has proper permissions for the user who submitted the job, it should work.
The way to verify could be:
ssh to the data node as the specific user
run the sqoop command line to see if it works
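A minimal check along those lines, as a sketch (the user and host names are placeholders, not taken from the question):
ssh hadoop@datanode01   # hypothetical user and host name
which sqoop             # should print the path of the sqoop launcher script
sqoop version           # should run without "command not found"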

Try adding the SQL Server JDBC driver sqljdbc41.jar to HDFS and an archive tag in your workflow.xml as below, then run the Oozie workflow again:
<archive>${HDFSAPATH}/sqljdbc41.jar#sqljdbc41.jar</archive>
If the problem persists, add a hive-site.xml with the properties below:
javax.jdo.option.ConnectionURL
hive.metastore.uris
Keep hive-site.xml in HDFS, add a file tag for it in workflow.xml, and rerun the Oozie workflow.
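As a sketch, the file tag could look like this (reusing the same HDFS path variable as the archive example above):
<file>${HDFSAPATH}/hive-site.xml#hive-site.xml</file>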

Related

How to execute sqoop job from oozie?

I'm not able to execute a sample job from Oozie using a sqoop command to import data into Hive. I've placed the hive-site.xml in an HDFS path but I think it's not picking up the hive-site.xml file. I'm getting a class not found exception. How do I fix this?
workflow.xml
<!-- This is a comment -->
<workflow-app xmlns="uri:oozie:workflow:0.4" name="oozie-wf">
<start to = "sqoop-node1" />
<!--Step 1 -->
<action name = "sqoop-node1" >
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
<job-tracker></job-tracker>
<name-node></name-node>
<command> import command </command>
</sqoop>
<ok to="end"/>
<error to="kill_job"/>
</action>
<kill name = "kill_job">
<message>Job failed</message>
</kill>
<end name = "end" />
</workflow-app>
job.properties:
nameNode=ip
jobTracker=ip
queueName=default
user.name=oozie
oozie.use.system.libpath=true
oozie.libpath=/user/hdfs/share/share/lib/sqoop
oozie.wf.application.path=workflow path
outputDir=/tmp/oozie.txt
java.io.IOException: java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf
I guess your Sqoop action requires the HCatalog library to interact with the Hive Metastore. And Oozie does not add that library by default, you have to require it explicitly.
Note that there is some literature about using HCatalog from Pig, but very little from Sqoop. Anyway the trick is the same...
From Oozie documentation:
Oozie share libraries are organized per action type... Oozie
provides a mechanism to override the action share library JARs
... More than one share library directory name can be specified
for an action ... For example: When using HCatLoader and HCatStorer in
pig, oozie.action.sharelib.for.pig can be set to pig,hcatalog to
include both pig and hcatalog jars.
In your case, you need to override a specific <property> in your Sqoop action, named oozie.action.sharelib.for.sqoop, with value sqoop,hcatalog -- then Oozie will provide the required JARs at run-time.
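As a sketch, the override inside the Sqoop action's <configuration> block would look like this:
<property>
<name>oozie.action.sharelib.for.sqoop</name>
<value>sqoop,hcatalog</value>
</property>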

Move file from HDFS one directory to other directory in HDFS using OOZIE?

I am trying to copy a file from one HDFS directory to another HDFS directory, with the help of a shell script as part of an Oozie job, but I am not able to copy it through Oozie.
Can we copy a file from one HDFS directory to another HDFS directory using Oozie?
When I run the Oozie job, I am not getting any error.
It shows status SUCCEEDED, but the file is not copied to the destination directory.
The Oozie files are below.
test.sh
#!/bin/bash
echo "listing files in the current directory, $PWD"
sudo hadoop fs -cp /user/cloudera/RAVIOOZIE/input/* /user/cloudera/RAVIOOZIE/output/
ls # list files
my workflow.xml is
<workflow-app name="RAMA" xmlns="uri:oozie:workflow:0.5">
<start to="shell-381c"/>
<kill name="Kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<action name="shell-381c">
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<exec>/user/cloudera/test.sh</exec>
<file>/user/cloudera/test.sh#test.sh</file>
<capture-output/>
</shell>
<ok to="End"/>
<error to="Kill"/>
</action>
<end name="End"/>
and my job.properties
oozie.use.system.libpath=True
security_enabled=False
dryrun=True
jobTracker=localhost:8032
nameNode=hdfs://quickstart.cloudera:8020
oozie.wf.application.path=${nameNode}/user/cloudera/test/
Please help with this. Why is the file not copied to my destination directory?
Please let me know if there is anything I missed.
As mentioned in the comments by @Samson:
If you want to do Hadoop actions with Oozie, you should use an HDFS action rather than a shell action for that.
I am not sure why you don't get an error, but here is some speculation on what might happen:
you give Oozie the task of starting a shell action; it successfully starts the shell action and reports a success. Then the shell action fails, but that's not Oozie's problem.
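If you stay with the shell action, here is a sketch of test.sh that at least surfaces the failure: it drops sudo (the YARN container user normally has no sudo rights) and propagates the copy's exit status so a failed copy fails the action instead of reporting SUCCEEDED:
#!/bin/bash
echo "listing files in the current directory, $PWD"
# run the copy as the container user, without sudo
hadoop fs -cp /user/cloudera/RAVIOOZIE/input/* /user/cloudera/RAVIOOZIE/output/
# exit with the copy's status so Oozie sees the failure
exit $?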

Scheduling a sqoop job in oozie through Shell script using Hue

I am able to run a sqoop command in Oozie using Hue. But when I try to run the same sqoop command by placing it in a shell script, I get an error like the one below
Stdoutput 2016-05-20 10:52:13,241 ERROR [main] sqoop.Sqoop (Sqoop.java:runSqoop(181)) - Got exception running Sqoop:
java.lang.RuntimeException: Could not load db driver class: oracle.jdbc.OracleDriver
I have included the JDBC jar file like I did while running the sqoop command directly. I don't understand why it is not working from the shell script.
Here is the workflow generated by Hue
<workflow-app name="My_Workflow" xmlns="uri:oozie:workflow:0.5">
<start to="shell-ca31"/>
<kill name="Kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<action name="shell-ca31">
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>default</value>
</property>
<property>
<name>oozie.use.system.libpath</name>
<value>true</value>
</property>
<property>
<name>oozie.libpath</name>
<value>/user/oozie/libext</value>
</property>
</configuration>
<exec>sqoopoozie.sh</exec>
<file>/user/yxr6907/sqoopoozie.sh#sqoopoozie.sh</file>
<archive>/user/oozie/libext/ojdbc7.jar#ojdbc7.jar</archive>
<capture-output/>
</shell>
<ok to="End"/>
<error to="Kill"/>
</action>
<end name="End"/>
</workflow-app>
When you use a shell action, the jars for Sqoop are not imported into the classpath.
I was able to solve it by adding the jar to the classpath and exporting HADOOP_CLASSPATH; then sqoop works.
Use the following:
Put the jar ojdbc7.jar in files
Use the following command inside shell script: export HADOOP_CLASSPATH=${PWD}/ojdbc7.jar
Instead of step 1, you can use the following properties to load the jar onto the classpath:
oozie.use.system.libpath=true
oozie.libpath=/path/to/jars
Exporting HADOOP_CLASSPATH is required in both cases.
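A rough sketch of sqoopoozie.sh along those lines (the connection string, credentials, and table are placeholders, not taken from the question; it assumes ojdbc7.jar is shipped via a <file> tag so it lands in the container's working directory):
#!/bin/bash
# the driver jar shipped with the action sits in the current working directory
export HADOOP_CLASSPATH=${PWD}/ojdbc7.jar
# hypothetical import; substitute your own connection details
sqoop import --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username some_user --password-file /user/some_user/.db.password \
  --table SOME_TABLE --target-dir /tmp/sqoop_import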

How to get oozie jobId in oozie workflow?

I have an Oozie workflow that invokes a shell script, and the shell script in turn invokes the driver class of a MapReduce job. Now I want to map my Oozie job ID to the MapReduce job ID for later processing. Is there any way to get the Oozie job ID in the workflow file so that I can pass it as an argument to my driver class for the mapping?
Following is my sample workflow.xml file
<workflow-app xmlns="uri:oozie:workflow:0.4" name="test">
<start to="start-test" />
<action name='start-test'>
<shell xmlns="uri:oozie:shell-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>${jobScript}</exec>
<argument>${fileLocation}</argument>
<argument>${nameNode}</argument>
<argument>${jobId}</argument> <!-- this is how i wanted to pass oozie jobId -->
<file>${jobScriptWithPath}#${jobScript}</file>
</shell>
<ok to="end" />
<error to="kill" />
</action>
<kill name="kill">
<message>test job failed
failed:[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end" />
Following is my shell script.
hadoop jar testProject.jar testProject.MrDriver $1 $2 $3
Try to use ${wf:id()}:
String wf:id()
It returns the workflow job ID for the current workflow job.
More info here.
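Applied to the workflow above, the argument line becomes (a sketch):
<argument>${wf:id()}</argument>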
Oozie drops an XML file in the CWD of the YARN container running the shell (the "launcher" container), and also sets an env variable pointing to that XML (cannot remember the name though).
That XML contains a lot of stuff like name of Workflow, name of Action, ID of both, run attempt number, etc.
So you can sed back that information in the shell script itself.
Of course passing explicitly the ID (as suggested by Alexei) would be cleaner, but sometimes "clean" is not the best way. Especially if you are concerned about whether it's the first run or not...
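A very rough sketch of that approach, assuming the action configuration XML (often named action.xml) sits in the container's working directory and carries an oozie.job.id property; the file name and layout can differ between Oozie versions:
# hypothetical: pull the workflow job ID out of the launcher's action configuration
WF_ID=$(tr -d '\n' < action.xml | sed -n 's:.*<name>oozie\.job\.id</name>[[:space:]]*<value>\([^<]*\)</value>.*:\1:p')
echo "running under Oozie workflow ${WF_ID}"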

Issue running Sqoop Action using Oozie on a Hadoop Cluster

I am trying to successfully run a sqoop-action in Oozie using a Hadoop Cluster.
Whenever I check on the jobs status, Oozie returns with the following status update:
Actions
ID Status Ext ID Ext Status Err Code
0000037-140930230740727-oozie-oozi-W#:start: OK - OK -
0000037-140930230740727-oozie-oozi-W#sqoop-load ERROR job_1412278758569_0002 FAILED/KILLED JA018
0000037-140930230740727-oozie-oozi-W#sqoop-load-fail OK - OK E0729
This leads me to believe that there is nothing wrong with my workflow itself, and that the problem is rather some permission I am missing.
My jobs.properties config:
nameNode=hdfs://mynamenode.demo.com:8020
jobTracker=mysnamenode.demo.com:8050
queueName=default
workingRoot=working_dir
jobOutput=/user/test/out
oozie.use.system.libpath=true
oozie.libpath=/user/oozie/share/lib
oozie.wf.application.path=${nameNode}/user/test/${workingRoot}
MyWorkFlow.xml :
<?xml version="1.0" encoding="UTF-8"?>
<workflow-app xmlns='uri:oozie:workflow:0.4' name='sqoop-workflow'>
<start to='sqoop-load' />
<action name="sqoop-load">
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${nameNode}/user/test/${workingRoot}/out-data/sqoop" />
<mkdir path="${nameNode}/user/test/${workingRoot}/out-data"/>
</prepare>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<command>import --connect jdbc:oracle:thin:#10.100.50.102:1521/db --username myID --password myPass --table SomeTable -target-dir /user/test/${workingRoot}/out-data/sqoop </command>
</sqoop>
<ok to="end"/>
<error to="sqoop-load-fail"/>
</action>
<kill name="sqoop-load-fail">
<message>Sqoop export failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name='end' />
</workflow-app>
Steps I have taken:
Looking up the error... didn't find much beyond what I mentioned previously
checking that the required ojdbc.jar file was executable, is in the /user/oozie/share/lib/sqoop directory, and is accessible on HDFS
checking to see if I have any preexisting directories that might be causing a problem
I have been searching the internet and my log files for an answer... any help provided would be much appreciated.
Update:
OK... so I added ALL of the jars within /usr/lib/sqoop/lib to /user/oozie/share/lib/sqoop. I am still getting the same errors. Checking the job log, there is something I did not post previously:
2014-10-03 11:16:35,586 WARN CoordActionUpdateXCommand:542 - USER[ambari-qa] GROUP[-] TOKEN[] APP[sqoop-workflow] JOB[0000015-141002171510902-oozie-oozi-W] ACTION[-] E1100: Command precondition does not hold before execution, [, coord action is null], Error Code: E1100
As you can see, I am running the job as "Super User"... and the error is exactly the same, so it cannot be a permission issue. I am thinking there is a jar required other than those in the /user/oozie/share/lib/sqoop directory... perhaps I need to copy the MapReduce jars into /user/oozie/share/lib/mapreduce?
OK... problem solved.
Apparently EVERY component of the Oozie workflow/job must have its corresponding *.jar dependencies uploaded to the Oozie SharedLib (/user/oozie/share/lib/) directories corresponding to those components.
I copied ALL the *.jars in /usr/lib/sqoop/lib into -> /user/oozie/share/lib
I copied ALL the *.jars in the /usr/lib/oozie/lib into -> /user/oozie/share/lib/oozie
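As a sketch, those two copies as shell commands (using the paths given above):
hadoop fs -put /usr/lib/sqoop/lib/*.jar /user/oozie/share/lib/
hadoop fs -put /usr/lib/oozie/lib/*.jar /user/oozie/share/lib/oozie/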
After running the job again, the workflow stalled, and the error given was different from the last one: this time around, the workflow was trying to create a directory on HDFS that already existed, so I removed that directory and then ran the job again...
SUCCESS!
Side note: people really need to write better exception messages. If this was just an issue a few people were having, then fine, but this is simply not the case. This particular error is giving more than a few people fits, if the requests for help online are any indication.
I faced the same problem. Just adding a
<archive>path/in/hdfs/ojdbc6.jar#ojdbc6.jar</archive>
to my workflow.xml within the <sqoop> </sqoop> tags worked for me. Got the reference here.
