Executing Sqoops using Oozie - hadoop

I have two Sqoop export commands that load data from HDFS into MySQL, and I want to execute them using Oozie. I have seen that an Oozie workflow is an XML file. How can I configure it to execute those Sqoop commands? A demonstration with steps would be appreciated.
The two Sqoop commands are:
1.
sqoop export --connect jdbc:mysql://localhost/hduser --table foo1 -m 1 --export-dir /user/cloudera/bar1
2.
sqoop export --connect jdbc:mysql://localhost/hduser --table foo2 -m 1 --export-dir /user/cloudera/bar2
Thanks.

You don't have to execute these via a shell action; Oozie has a dedicated sqoop action. Here is what to put in your workflow.xml:
<workflow-app xmlns="uri:oozie:workflow:0.4" name="oozie-wf">
<start to="sqoop-wf1"/>
<action name="sqoop-wf1">
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<command>export --connect jdbc:mysql://localhost/hduser --table foo1 -m 1 --export-dir /user/cloudera/bar1</command>
</sqoop>
<ok to="sqoop-wf2"/>
<error to="fail"/>
</action>
<action name="sqoop-wf2">
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<command>export --connect jdbc:mysql://localhost/hduser --table foo2 -m 1 --export-dir /user/cloudera/bar2</command>
</sqoop>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Failed, Error Message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
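Since the workflow references ${jobTracker} and ${nameNode}, you also need a job.properties file when submitting. A minimal sketch, assuming a single-node setup (the host names and application path are placeholders to adjust):
# job.properties -- hosts and paths are assumptions, adjust to your cluster
nameNode=hdfs://localhost:8020
jobTracker=localhost:8032
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/cloudera/sqoop-oozie-wf
Upload workflow.xml to that application path, make sure the MySQL JDBC driver jar is available to the action (for example in a lib/ directory next to workflow.xml), and submit with: oozie job -oozie http://localhost:11000/oozie -config job.properties -run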
Hope this helps.

You can use an Oozie shell action for this: create a shell action and provide the commands you posted in your question as the commands to be executed within it.
Sample Oozie action:
<action name="SqoopAction">
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>[JOB-TRACKER]</job-tracker>
<name-node>[NAME-NODE]</name-node>
<prepare>
<delete path="[PATH]"/>
...
<mkdir path="[PATH]"/>
...
</prepare>
<job-xml>[SHELL SETTINGS FILE]</job-xml>
<configuration>
<property>
<name>[PROPERTY-NAME]</name>
<value>[PROPERTY-VALUE]</value>
</property>
...
</configuration>
<exec>[SHELL-COMMAND]</exec>
<argument>[ARG-VALUE]</argument>
...
<argument>[ARG-VALUE]</argument>
<env-var>[VAR1=VALUE1]</env-var>
...
<env-var>[VARN=VALUEN]</env-var>
<file>[FILE-PATH]</file>
...
<archive>[FILE-PATH]</archive>
...
<capture-output/>
</shell>
<ok to="[NODE-NAME]"/>
<error to="[NODE-NAME]"/>
</action>
In your case, you would replace [SHELL-COMMAND] with whatever Sqoop command you want to run, such as:
<exec>sqoop export --connect jdbc:mysql://localhost/hduser --table foo1 -m 1 --export-dir /user/cloudera/bar1</exec>
Also, you could put all your Sqoop commands in a shell script and execute that script instead, as shown below. This is better if you have many commands to execute.
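For example, a wrapper script for the two exports from the question could look like this (a sketch; it assumes the sqoop binary is on the PATH of the node that runs the action):
#!/bin/bash
# sqoop-exports.sh: run both exports; set -e stops at the first failure
set -e
sqoop export --connect jdbc:mysql://localhost/hduser --table foo1 -m 1 --export-dir /user/cloudera/bar1
sqoop export --connect jdbc:mysql://localhost/hduser --table foo2 -m 1 --export-dir /user/cloudera/bar2
You would then point <exec> at the script and ship it with a <file> tag, e.g. <exec>sqoop-exports.sh</exec> and <file>sqoop-exports.sh#sqoop-exports.sh</file>.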

Related

Sqoop Export works in command line but fails in Oozie Workflow

I tried a lot of things to make my Sqoop export work. Here is the command that works in bash:
sqoop export --connect jdbc:mysql://localhost/monapp --username root --password cloudera --table results --direct --export-dir hdfs://quickstart.cloudera:8020/data/aggregated_data/ --driver com.mysql.jdbc.Driver --m 1
But when I use the following Oozie workflow it doesn't work, and I can't see any errors in the log file (/var/log/sqoop2/):
<action name="export">
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
<job-tracker>quickstart.cloudera:8032</job-tracker>
<name-node>hdfs://quickstart.cloudera:8020</name-node>
<arg>export</arg>
<arg>--connect</arg>
<arg>jdbc:mysql://localhost/monapp</arg>
<arg>--username</arg>
<arg>root</arg>
<arg>--password</arg>
<arg>cloudera</arg>
<arg>--table</arg>
<arg>results</arg>
<arg>--export-dir</arg>
<arg>hdfs://quickstart.cloudera:8020/data/aggregated_data/</arg>
<arg>--driver</arg>
<arg>com.mysql.jdbc.Driver</arg>
<arg>-m</arg>
<arg>1</arg>
</sqoop>
<ok to="end" />
<error to="error" />
</action>
Please tell me if I need to check another log file; I will edit my question.
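As a general pointer, the Sqoop action runs inside a YARN launcher job, so its output usually ends up in the Oozie and YARN logs rather than in /var/log/sqoop2/. Typical commands to dig it out (the IDs are placeholders):
oozie job -oozie http://quickstart.cloudera:11000/oozie -info <workflow-id>
yarn logs -applicationId <application-id>
The first command shows the failed action and its external (YARN) application ID; the second dumps the launcher logs, which include Sqoop's console output.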

Unable to run "sqoop job --exec" in oozie

I need some advice. I'm trying to run a Sqoop job in Oozie, but it suddenly gets killed, and there is this warning in oozie-error.log:
2018-01-21 17:30:12,473 WARN SqoopActionExecutor:523 - SERVER[edge01.domain.com] USER[linknet] GROUP[-] TOKEN[] APP[sqoop-wf] JOB[0000006-180121122345026-oozie-link-W] ACTION[0000006-180121122345026-oozie-link-W#sqoop-node] Launcher ERROR, reason: Main class [org.apache.oozie.action.hadoop.SqoopMain], exit code [1]
job.properties
nameNode=hdfs://hadoop01.domain.com:8020
jobTracker=hadoop01.domain.com:18032
queueName=default
oozie.use.system.libpath=true
examplesRoot=examples
oozie.libpath=${nameNode}/share/lib/oozie
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/sqoop
workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.2" name="sqoop-wf">
<start to="sqoop-node"/>
<action name="sqoop-node">
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${nameNode}/user/${wf:user()}/${examplesRoot}/output-data/sqoop"/>
<mkdir path="${nameNode}/user/${wf:user()}/${examplesRoot}/output-data"/>
</prepare>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<command>job --exec ingest_cpm_alarm</command>
</sqoop>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Sqoop failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
And this is how I created the Sqoop job ingest_cpm_alarm:
$ sqoop job --create ingest_cpm_alarm -- import --connect jdbc:postgresql://xxx.xxx.xxx.xxx:5432/snapshot --username "extractor" -P \
--incremental append \
--check-column snapshot_date \
--table cpm_snr_history \
--as-avrodatafile \
--target-dir /tmp/trash/cpm_alarm
I can run this Sqoop job successfully from the command line, but not through the Oozie scheduler.
Also, the jar file postgresql-42.1.4.jar and everything under $SQOOP_HOME/lib have been copied into the libpath directory (/share/lib/oozie). Oozie and Sqoop reside on the same server. In my sqoop-site.xml, I only set these parameters:
sqoop.metastore.client.enable.autoconnect=true
sqoop.metastore.client.record.password=true
Did I miss something here?
It was resolved: I had missed that sqoop-site.xml must also be available in the workflow directory in HDFS.
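In other words, copy the client-side sqoop-site.xml next to workflow.xml; with the paths from the job.properties above, roughly (the local source path is an assumption):
hdfs dfs -put /etc/sqoop/conf/sqoop-site.xml /user/linknet/examples/apps/sqoop/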
This post has a similar issue:
sqoop exec job in oozie is not working
Thanks.

Oozie Xml Workflow Schema Validation error

When I run Oozie to schedule an HBase import through a Sqoop incremental-append job, I get a schema validation error with the following action:
<action name="sqoop-import">
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${nameNode}/user/${wf:user()}/${examplesRoot}/output-data/sqoop"/>
<mkdir path="${nameNode}/user/${wf:user()}/${examplesRoot}/output-data"/>
</prepare>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<job-xml>/user/root/hbase-site.xml</job-xml>
<command>import --connect "jdbc:sqlserver://localhost:1433;database=test" --table test_plan_package --username sa --password pass
--incremental append --check-column testid --hbase-table test_plan --column-family testid</command>
<file>/user/root/sqljdbc4.jar#sqljdbc4.jar</file>
<file>/user/root/hbase/hbase-client.jar#hbase-client.jar</file>
<file>/user/root/hbase/hbase-common.jar#hbase-common.jar</file>
<file>/user/root/hbase/hbase-protocol.jar#hbase/hbase-protocol.jar</file>
<file>/user/root/hbase/htrace-core3.1.0-incubating.jar#htrace-core3.1.0-incubating.jar</file>
<file>/user/root/hbase/hbase-server.jar#hbase-server.jar</file>
<file>/user/root/hbase/hbase-hadoop-compat.jar#hbase-hadoop-compat.jar</file>
<file>/user/root/hbase/high-scale-lib-1.1.1.jar#high-scale-lib-1.1.1.jar</file>
</sqoop>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Sqoop failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
I tried various portals and came to know that the problem is with XML schema version 0.2 and that it needs to be upgraded to 0.4 in workflow.xml.
Could anyone provide the steps to upgrade the XML version to 0.4 in Oozie?
Move your job-xml element above the configuration element; there is no need to upgrade from schema 0.2 to 0.4. The XSD that Oozie validates against requires job-xml to appear before configuration, which is why you are getting the error.
Also check that the jars match your cluster versions and adjust workflow.xml accordingly.
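For reference, the sqoop-action:0.2 schema expects the child elements in this order, with job-xml before configuration (a skeleton only; elided parts are marked with ...):
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>...</prepare>
<job-xml>/user/root/hbase-site.xml</job-xml>
<configuration>...</configuration>
<command>...</command>
<file>...</file>
</sqoop>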

Propagating oozie job last run date to last-value

I have an Oozie workflow that runs a Sqoop command to incrementally load data from a table based on the lastupdatedate column.
How do I set --last-value so that we get the records added between the last run of the job and now?
In case you are importing the data into a Hive table, you could query the last updated value from the Hive table and pass it to the Sqoop import:
1. A Hive action runs the select query that retrieves the last updated value and emits it as a key=value pair (e.g. a hypothetical key lastValue) so it can be captured via capture-output.
2. A Sqoop action performs the incremental load, reading the captured output of the previous Hive action through wf:actionData.
Below is a pseudo workflow:
<workflow-app name="sqoop-to-hive" xmlns="uri:oozie:workflow:0.4">
<start to="hiveact"/>
<action name="hiveact">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<script>script.sql</script>
<capture-output/>
</hive>
<ok to="sqoopact"/>
<error to="kill"/>
<action name="sqoopact">
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<command>import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P --incremental append --last-value ${wf:actionData('hiveact')['lastValue']}</command>
</sqoop>
<ok to="end"/>
<error to="kill"/>
</action>
<kill name="kill">
<message>Action failed</message>
</kill>
<end name="end"/>
Hope this helps.

oozie running Sqoop command in a shell script

Can I write a Sqoop import command in a script and execute it in Oozie as a coordinator workflow?
I have tried to do so and got an error saying sqoop: command not found, even when I give the absolute path to the sqoop executable.
script.sh is as follows:
sqoop import --connect 'jdbc:sqlserver://xx.xx.xx.xx' -username=sa -password -table materials --fields-terminated-by '^' -- --schema dbo -target-dir /user/hadoop/CFFC/oozie_materials
I have placed the file in HDFS and gave Oozie its path. The workflow is as follows:
<workflow-app xmlns='uri:oozie:workflow:0.3' name='shell-wf'>
<start to='shell1' />
<action name='shell1'>
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>script.sh</exec>
<file>script.sh#script.sh</file>
</shell>
<ok to="end" />
<error to="fail" />
</action>
<kill name="fail">
<message>Script failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name='end' />
</workflow-app>
Oozie returns a sqoop: command not found error in the MapReduce log.
So, is this a good practice?
Thanks
The shell action runs as a mapper task, as you have observed. The sqoop command needs to be present on each data node where the mapper may run. If you make sure the sqoop command line is there and has the proper permissions for the user who submitted the job, it should work.
The way to verify could be (see the sketch below):
1. ssh to a data node as the specific user
2. run the sqoop command line to see if it works
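Something like the following, where the host name is a placeholder:
ssh datanode01.example.com
which sqoop     # should print the path to the sqoop launcher script
sqoop version   # should run without a "command not found" error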
Try adding the sqljdbc41.jar SQL Server driver to HDFS and an archive tag in your workflow.xml as below, then run the Oozie workflow again:
<archive>${HDFSAPATH}/sqljdbc41.jar#sqljdbc41.jar</archive>
If the problem persists, add a hive-site.xml with the properties below, keep it in HDFS, add a file tag for it in workflow.xml, and rerun the workflow:
javax.jdo.option.ConnectionURL
hive.metastore.uris
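The file tag for hive-site.xml would look something like this, reusing the placeholder path from above:
<file>${HDFSAPATH}/hive-site.xml#hive-site.xml</file>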
