Sqoop - Hive import using Oozie failed - hadoop

I am trying to execute a Sqoop import from Oracle to Hive, but the job fails with the following error:
WARN [main] conf.HiveConf (HiveConf.java:initialize(2472)) - HiveConf of name hive.auto.convert.sortmerge.join.noconditionaltask does not exist
Intercepting System.exit(1)
<<< Invocation of Main class completed <<<
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SqoopMain], exit code [1]
Oozie Launcher failed, finishing Hadoop job gracefully
I have all the jar files in place.
hive-site.xml is also in place with the Hive metastore configuration:
<property>
<name>hive.metastore.uris</name>
<value>thrift://sv2lxgsed01.xxxx.com:9083</value>
</property>
I am able to run a Sqoop import (using Oozie) to HDFS successfully.
I am also able to execute a Hive script (using Oozie) successfully.
I can also execute the Sqoop-Hive import from the command line, but the same command fails when I execute it using Oozie.
My workflow.xml is as below:
<workflow-app name="WorkflowWithSqoopAction" xmlns="uri:oozie:workflow:0.1">
<start to="sqoopAction"/>
<action name="sqoopAction">
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<command>import --connect
jdbc:oracle:thin:#//sv2axcrmdbdi301.xxx.com:1521/DI3CRM --username xxxxxxx --password xxxxxx --table SIEBEL.S_ORG_EXT --hive-table eg.EQX_EG_CRM_S_ORG_EXT --hive-import -m 1</command>
<file>/user/oozie/oozieProject/workflowSqoopAction/hive-site.xml</file>
</sqoop>
<ok to="end"/>
<error to="killJob"/>
</action>
<kill name="killJob">
<message>"Killed job due to error: ${wf:errorMessage(wf:lastErrorNode())}"</message>
</kill>
<end name="end" />
</workflow-app>
I can also find the data being loaded in HDFS.

You need to do two things:
1) Copy hive-site.xml into the Oozie workflow directory.
2) In your Sqoop action, tell Oozie to use that hive-site.xml.
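For example, the Sqoop action from the question could reference the uploaded hive-site.xml like this (a minimal sketch: the paths are the ones already used in the question, the command is abbreviated, and <job-xml> plus <file> are the same sqoop-action elements used elsewhere on this page):
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<!-- make the metastore settings visible to the launcher -->
<job-xml>/user/oozie/oozieProject/workflowSqoopAction/hive-site.xml</job-xml>
<command>import ... --hive-import -m 1</command>
<!-- also ship hive-site.xml into the action's working directory -->
<file>/user/oozie/oozieProject/workflowSqoopAction/hive-site.xml#hive-site.xml</file>
</sqoop>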

Related

oozie sqoop action fails to import

I am facing an issue while executing an Oozie Sqoop action. In the logs I can see that Sqoop is able to import data to a temp directory, and then Sqoop creates a Hive script to import the data.
It fails while importing the data into Hive.
Below is the Sqoop action I am using.
<action name="import" retry-max="2" retry-interval="5">
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${jobQueue}</value>
</property>
</configuration>
<arg>import</arg>
<arg>-D</arg>
<arg>sqoop.mapred.auto.progress.max=300000</arg>
<arg>-D</arg>
<arg>map.retry.exponentialBackOff=TRUE</arg>
<arg>-D</arg>
<arg>map.retry.numRetries=3</arg>
<arg>--options-file</arg>
<arg>${odsparamFileName}</arg>
<arg>--table</arg>
<arg>${odsTableName}</arg>
<arg>--where</arg>
<arg>${ods_data_pull_column} BETWEEN TO_DATE(${wf:actionData('getDates')['prevMonthBegin']},'YYYY-MM-DD hh24:mi:ss') AND TO_DATE(${wf:actionData('prevMonthEnd')['endDate']},'YYYY-MM-DD hh24:mi:ss')</arg>
<arg>--hive-import</arg>
<arg>--hive-overwrite</arg>
<arg>--hive-table</arg>
<arg>${stgTable}</arg>
<arg>--hive-drop-import-delims</arg>
<arg>--warehouse-dir</arg>
<arg>${sqoopStgDir}</arg>
<arg>--delete-target-dir</arg>
<arg>--null-string</arg>
<arg>\\N</arg>
<arg>--null-non-string</arg>
<arg>\\N</arg>
<arg>--compress</arg>
<arg>--compression-codec</arg>
<arg>gzip</arg>
<arg>--num-mappers</arg>
<arg>1</arg>
<arg>--verbose</arg>
<file>${odsSqoopConnectionParamsFileLocation}</file>
</sqoop>
<ok to="rev"/>
<error to="fail"/>
</action>
Below is the error I am getting in the MapReduce logs:
20078 [main] DEBUG org.apache.sqoop.mapreduce.db.DataDrivenDBInputFormat - Creating input split with lower bound '1=1' and upper bound '1=1'
Heart beat
Heart beat
Heart beat
Heart beat
151160 [main] INFO org.apache.sqoop.mapreduce.ImportJobBase - Transferred 0 bytes in 135.345 seconds (0 bytes/sec)
151164 [main] INFO org.apache.sqoop.mapreduce.ImportJobBase - Retrieved 0 records.
151164 [main] ERROR org.apache.sqoop.tool.ImportTool - Error during import: Import job failed!
Intercepting System.exit(1)
<<< Invocation of Main class completed <<<
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SqoopMain], exit code [1]
Oozie Launcher failed, finishing Hadoop job gracefully
Please suggest
You can import the table to an HDFS path using --target-dir and set the location of your Hive table to point to that path. I fixed it using this approach; hope it helps you as well.
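A rough sketch of that approach, shown outside Oozie for clarity (the connection string, table names, columns, and target directory below are illustrative placeholders; the Hive table is created as EXTERNAL so its LOCATION can point at the Sqoop output, and Sqoop's default text output is comma-delimited):
# 1. Import straight to an HDFS directory instead of using --hive-import
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/MYDB \
  --username myuser --password mypass \
  --table MYSCHEMA.MYTABLE \
  --target-dir /user/etl/staging/mytable \
  -m 1
# 2. Point an external Hive table at that directory
hive -e "CREATE EXTERNAL TABLE IF NOT EXISTS stg.mytable (id STRING, name STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/user/etl/staging/mytable';"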

not able to run the shell script with oozie

Hi, I am trying to run a shell script through Oozie. While running the shell script I am getting the following error:
org.apache.oozie.action.hadoop.ShellMain], exit code [1]
My job.properties file:
nameNode=hdfs://ip-172-31-41-199.us-west-2.compute.internal:8020
jobTracker=ip-172-31-41-199.us-west-2.compute.internal:8032
queueName=default
oozie.libpath=${nameNode}/user/oozie/share/lib/
oozie.use.system.libpath=true
oozie.wf.rerun.failnodes=true
oozieProjectRoot=shell_example
oozie.wf.application.path=${nameNode}/user/karun/${oozieProjectRoot}/apps/shell
My workflow.xml:
<workflow-app xmlns="uri:oozie:workflow:0.1" name="pi.R example">
<start to="shell-node"/>
<action name="shell-node">
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>script.sh</exec>
<file>/user/karun/oozie-oozi/script.sh#script.sh</file>
<capture-output/>
</shell>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Incorrect output</message>
</kill>
<end name="end"/>
</workflow-app>
My shell script, script.sh:
export SPARK_HOME=/opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p0.2/lib/spark
export YARN_CONF_DIR=/etc/hadoop/conf
export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera
export HADOOP_CMD=/usr/bin/hadoop
/SparkR-pkg/lib/SparkR/sparkR-submit --master yarn-client examples/pi.R yarn-client 4
Error log file:
WEBHCAT_DEFAULT_XML=/opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p0.2/etc/hive-webhcat/conf.dist/webhcat-default.xml:
CDH_KMS_HOME=/opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p0.2/lib/hadoop-kms:
LANG=en_US.UTF-8:
HADOOP_MAPRED_HOME=/opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p0.2/lib/hadoop-mapreduce:
=================================================================
Invoking Shell command line now >>
Stdoutput Running /opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p0.2/lib/spark/bin/spark-submit --class edu.berkeley.cs.amplab.sparkr.SparkRRunner --files hdfs://ip-172-31-41-199.us-west-2.compute.internal:8020/user/karun/examples/pi.R --master yarn-client /SparkR-pkg/lib/SparkR/sparkr-assembly-0.1.jar hdfs://ip-172-31-41-199.us-west-2.compute.internal:8020/user/karun/examples/pi.R yarn-client 4
Stdoutput Fatal error: cannot open file 'pi.R': No such file or directory
Exit code of the Shell command 2
<<< Invocation of Shell command completed <<<
<<< Invocation of Main class completed <<<
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.ShellMain], exit code [1]
Oozie Launcher failed, finishing Hadoop job gracefully
Oozie Launcher, uploading action data to HDFS sequence file: hdfs://ip-172-31-41-199.us-west-2.compute.internal:8020/user/karun/oozie-oozi/0000035-150722003725443-oozie-oozi-W/shell-node--shell/action-data.seq
Oozie Launcher ends
I don't know how to solve the issue. Any help will be appreciated.
sparkR-submit ... examples/pi.R ...
Fatal error: cannot open file 'pi.R': No such file or directory
The message is really explicit: your shell tries to read an R script from the local filesystem. But local to what, exactly?
Oozie uses YARN to run your shell, so YARN allocates a container on a random machine. It's something you must put into your head so that it becomes a reflex: all resources required by an Oozie action (scripts, libraries, config files, whatever) must be:
- available in HDFS beforehand
- downloaded at execution time thanks to <file> instructions in the Oozie workflow
- accessed as local files in the current working directory
In your case:
<exec>script.sh</exec>
<file>/user/karun/oozie-oozi/script.sh</file>
<file>/user/karun/some/place/pi.R</file>
Then
sparkR-submit ... pi.R ...
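Putting that together with the workflow from the question, the shell action would look roughly like this (a sketch; the pi.R path is the one visible in the question's log output, and script.sh then refers to pi.R with no directory prefix):
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<exec>script.sh</exec>
<!-- both files are downloaded into the container's current working directory -->
<file>/user/karun/oozie-oozi/script.sh#script.sh</file>
<file>/user/karun/examples/pi.R#pi.R</file>
<capture-output/>
</shell>
and the last line of script.sh becomes:
/SparkR-pkg/lib/SparkR/sparkR-submit --master yarn-client pi.R yarn-client 4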

Oozie variable [user] cannot be resolved

I'm trying to use Oozie's Hive action in Hue. My Hive script is very simple:
create table test.test_2 as
select * from test.test
This Oozie workflow has only 3 steps:
start
hive_query
end
My job.properties:
jobTracker=worker-1:8032
mapreduce.job.user.name=hue
nameNode=hdfs://batchlayer
oozie.use.system.libpath=true
oozie.wf.application.path=hdfs://batchlayer/user/hue/oozie/workspaces/_hue_-oozie-4-1425575226.04
user.name=hue
I add hive-site.xml two times: as a <file> and as a <job-xml>. The Oozie action starts and stops on the second step; the job is 'accepted'. But in the Hue console I get an error:
variable [user] cannot be resolved
I'm using Apache Oozie 4.2, Apache Hive 0.14 and Hue 3.7 (from Github).
UPDATE:
This is my workflow.xml:
bash-4.1$ bin/hdfs dfs -cat /user/hue/oozie/workspaces/*.04/work*
<workflow-app name="ccc" xmlns="uri:oozie:workflow:0.4">
<start to="ccc"/>
<action name="ccc">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<job-xml>/user/hue/hive-site.xml</job-xml>
<script>/user/hue/hive_test.hql</script>
<file>/user/hue/hive-site.xml#hive-site.xml</file>
</hive>
<ok to="end"/>
<error to="kill"/>
</action>
<kill name="kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
I tried running a sample Hive action in Oozie following similar steps as you, and was able to resolve the error you are facing using the following steps:
1) Remove the <file> entry you added for hive-site.xml.
2) Add the following line to your job.properties: oozie.libpath=${nameNode}/user/oozie/share/lib
3) Increase the visibility of the hive-site.xml file kept in HDFS; you may have very restrictive privileges on it (in my case 500). See the sketch below.
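For step 3, a quick way to check and relax the permissions on the HDFS copy of hive-site.xml (a sketch; the path is the one referenced in the workflow above):
hdfs dfs -ls /user/hue/hive-site.xml          # inspect the current mode (500 shows up as r-x------)
hdfs dfs -chmod 644 /user/hue/hive-site.xml   # make the file readable to the user running the launcher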
With these changes, both the 'variable [user] cannot be resolved' error and the subsequent errors were resolved.
Hope it helps.
This message can be really misleading. You should check the YARN logs and diagnostics.
In my case it was the configuration of reduce-task and container memory: by mistake, the container memory limit was lower than the memory limit of a single reduce task. After looking into the YARN application logs I saw the true cause in the 'diagnostics' section, which was:
REDUCE capability required is more than the supported max container capability in the cluster. Killing the Job. reduceResourceRequest: <memory:8192, vCores:1> maxContainerCapability:<memory:5413, vCores:4>
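If you hit the same thing, the fix is to make the per-reduce memory request fit inside the cluster's maximum container size. The relevant properties look roughly like this (a sketch; the values are illustrative, with the cluster-side limit taken from the diagnostics above):
<!-- per-job setting: memory requested for each reduce task -->
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>4096</value>
</property>
<!-- cluster-side limit, reported as maxContainerCapability in the diagnostics -->
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>5413</value>
</property>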
Regards

Accessing Vertica Database through Oozie sqoop

I have written an Oozie workflow to access an HP Vertica database through Sqoop. This is on a Cloudera VM. I am getting the following error in the YARN logs after running:
ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: Could not load db driver class: dbDriver
java.lang.RuntimeException: Could not load db driver class: dbDriver
at org.apache.sqoop.manager.SqlManager.makeConnection(SqlManager.java:848)
at org.apache.sqoop.manager.GenericJdbcManager.getConnection(GenericJdbcManager.java:52)
at org.apache.sqoop.tool.EvalSqlTool.run(EvalSqlTool.java:64)
at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
This is a snippet from the job properties file:
dbDriver=com.vertica.jdbc.Driver
dbHost=host***.assist.***
dbName=vertica247
dbPassword=*****
dbPort=5433
dbSchema=simod_chat
dbStagingSchema=simodstg_chat
dbUser=vertica
What should I specify for --connection-manager? When I run the same workflow outside the VM, it runs without the --connection-manager argument.
As the error states:
Could not load db driver class: dbDriver
There are likely two problems:
The JDBC URL is probably incorrect
The JDBC Jar needs to be included in the workflow
For the JDBC URL, make sure it looks like this:
jdbc:vertica://VerticaHost:portNumber/databaseName
For the JDBC jar, it needs to be included with the workflow. Check out this article for a brief example of how to do this with HBase. TL;DR: when you run Sqoop through Oozie, you have to include the driver in the workflow:
<workflow-app name="sqoop-import" xmlns="uri:oozie:workflow:0.4">
<start to="sqoop-import"/>
<action name="sqoop-import">
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<command>import --connect jdbc:vertica://VerticaHost:portNumber/databaseName --username test --password test --table test</command>
<file>/user/admin/vertica-jdbc.jar#vertica-jdbc.jar</file>
</sqoop>
<ok to="end"/>
<error to="kill"/>
</action>
<kill name="kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
Note the line:
<file>/user/admin/vertica-jdbc.jar#vertica-jdbc.jar</file>
The jar will automatically be included in your Sqoop job.
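One assumption behind that <file> line is that the jar already exists at that HDFS path; if it does not, upload it first (a sketch using the same location):
# upload the Vertica JDBC driver so the <file> element can resolve it
hdfs dfs -put vertica-jdbc.jar /user/admin/vertica-jdbc.jar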

oozie Sqoop action fails to import data to hive

I am facing an issue while executing an Oozie Sqoop action.
In the logs I can see that Sqoop is able to import data to a temp directory, and then Sqoop creates a Hive script to import that data.
It fails while importing the temp data into Hive.
In the logs I am not getting any exception.
Below is the Sqoop action I am using.
<workflow-app name="testSqoopLoadWorkflow" xmlns="uri:oozie:workflow:0.4">
<credentials>
<credential name='hive_credentials' type='hcat'>
<property>
<name>hcat.metastore.uri</name>
<value>${HIVE_THRIFT_URL}</value>
</property>
<property>
<name>hcat.metastore.principal</name>
<value>${KERBEROS_PRINCIPAL}</value>
</property>
</credential>
</credentials>
<start to="loadSqoopDataAction"/>
<action name="loadSqoopDataAction" cred="hive_credentials">
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<job-xml>/tmp/hive-oozie-site.xml</job-xml>
<configuration>
<property>
<name>oozie.hive.defaults</name>
<value>/tmp/hive-oozie-site.xml</value>
</property>
</configuration>
<command>job --meta-connect ${SQOOP_METASTORE_URL} --exec TEST_SQOOP_LOAD_JOB</command>
</sqoop>
<ok to="end"/>
<error to="kill"/>
</action>
Below is the Sqoop job I am using to import the data.
sqoop job --meta-connect ${SQOOP_METASTORE_URL} --create TEST_SQOOP_LOAD_JOB -- import --connect '${JDBC_URL}' --table testTable -m 1 --append --check-column pkId --incremental append --hive-import --hive-table testHiveTable;
In the MapReduce logs I am getting the following:
72285 [main] INFO org.apache.sqoop.hive.HiveImport - Loading uploaded data into Hive
Intercepting System.exit(1)
<<< Invocation of Main class completed <<<
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SqoopMain], exit code [1]
Oozie Launcher failed, finishing Hadoop job gracefully
Oozie Launcher ends
Please suggest.
This seems like a typical Sqoop import-to-Hive job. So it seems like Sqoop has successfully imported the data into HDFS and is failing to load that data into Hive.
Here's some background on what's happening: Oozie launches a separate job (which will execute on any node in your Hadoop cluster) to run the Sqoop command. The Sqoop command starts a separate job to load the data into HDFS. Then, at the end of the Sqoop job, Sqoop runs a Hive script to load that data into Hive.
Since this is theoretically running from any node in your Hadoop cluster, the Hive CLI needs to be available on each node and talk to the same metastore. The Hive metastore will need to run in remote mode.
The most common problem is that Sqoop cannot talk to the correct metastore. The main reasons for this are normally:
The Hive metastore service is not running. It should be running in remote mode, as a separate service. Here's a quick way to check if it's running:
service hive-metastore status
hive-site.xml does not contain hive.metastore.uris. Here's an example hive-site.xml with hive.metastore.uris set:
<configuration>
...
<property>
<name>hive.metastore.uris</name>
<value>thrift://sqoop2.example.com:9083</value>
</property>
...
</configuration>
hive-site.xml is not included in your Sqoop action (or its properties). Try adding your hive-site.xml to a <file> element in your Sqoop action. Here's an example workflow.xml with <file> in it:
<workflow-app name="sqoop-to-hive" xmlns="uri:oozie:workflow:0.4">
...
<action name="sqoop2hive">
...
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
...
<file>/tmp/hive-site.xml#hive-site.xml</file>
</sqoop>
...
</action>
...
</workflow-app>
This seems to be a bug in Sqoop. I am not sure about the JIRA number. Hortonworks mentioned that the issue is still not resolved even in HDP 2.2.
#abeaamase - I want to try your solution. I just want to check: does the solution below work for a Sqoop + Hive import in one single Oozie job?
<file>/tmp/hive-site.xml#hive-site.xml</file>
If you are using CDH, then the problem may be due to Hive metastore jar dependency conflicts.
