Propagating an Oozie job's last run date to --last-value - hadoop

I have an Oozie workflow that runs a Sqoop command to incrementally load data from a table based on its lastupdatedate column.
How do I set --last-value so that we get the records added between the last run of the job and now?

In case you are importing the data into a Hive table, you could query the last updated value from the Hive table and pass that value to the Sqoop import query:
1. a Hive action for the select query with the logic to retrieve the last updated value;
2. a Sqoop action for the incremental load, driven by the captured output of the previous Hive action.
PFB a pseudo workflow:
<workflow-app name="sqoop-to-hive" xmlns="uri:oozie:workflow:0.4">
    <start to="hiveact"/>
    <action name="hiveact">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <script>script.sql</script>
            <capture-output/>
        </hive>
        <ok to="sqoopact"/>
        <error to="kill"/>
    </action>
    <action name="sqoopact">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- the 'lastupdate' key must match a property emitted in the captured output of 'hiveact' -->
            <command>import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P --incremental append --check-column lastupdatedate --last-value ${wf:actionData('hiveact')['lastupdate']}</command>
        </sqoop>
        <ok to="end"/>
        <error to="kill"/>
    </action>
    <kill name="kill">
        <message>Action failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
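One caveat with this pseudo workflow: parsed stdout via <capture-output/> is what the shell action documents, so the 'hiveact' step is commonly implemented as a shell action that runs the query and prints the result as a Java property. A minimal sketch of such a script, where the table name, column name, and the lastupdate key are all assumptions that must line up with the wf:actionData lookup above:
#!/bin/bash
# Run the query in silent mode and emit the result as a Java property,
# so <capture-output/> exposes it via wf:actionData('hiveact')['lastupdate']
last=$(hive -S -e "SELECT MAX(lastupdatedate) FROM mytable")
echo "lastupdate=$last"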
Hope this helps.

Related

oozie over hive to fetch the data from table

I am trying to automate Hive through Oozie. I wrote a simple Hive script that creates a table and runs select queries on that table. When I submitted the script, it went into running mode and did not finish executing. I checked yarn application -list; the job was hung at 95%. The Hive table had been created successfully, but I was not able to fetch data from the table. Please let me know how to resolve this problem.
Thanks in advance.
Workflow.xml
<action name="hive2-node">
    <hive2 xmlns="uri:oozie:hive2-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <prepare>
            <delete path="${nameNode}/user/${wf:user()}/${examplesRoot}/output-data/hive2"/>
            <mkdir path="${nameNode}/user/${wf:user()}/${examplesRoot}/output-data"/>
        </prepare>
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>${queueName}</value>
            </property>
        </configuration>
        <jdbc-url>${jdbcURL}</jdbc-url>
        <script>script.q</script>
        <param>INPUT=/user/${wf:user()}/${examplesRoot}/input-data/table</param>
        <param>OUTPUT=/user/${wf:user()}/${examplesRoot}/output-data/hive2</param>
    </hive2>
    <ok to="end"/>
    <error to="fail"/>
</action>
<kill name="fail">
    <message>Hive2 (Beeline) action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
script.q
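(The script contents did not survive in the post. Since the parameters and paths match the stock Oozie examples, script.q was most likely close to the example script that ships with Oozie:)
-- create an external table over the INPUT path, then write it out to OUTPUT
DROP TABLE IF EXISTS test;
CREATE EXTERNAL TABLE test (a INT) STORED AS TEXTFILE LOCATION '${INPUT}';
INSERT OVERWRITE DIRECTORY '${OUTPUT}' SELECT * FROM test;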
job.properties
nameNode=hdfs://...:8020
jobTracker=...:8050
queueName=default
jdbcURL=jdbc:hive2://..*.:10000/default
examplesRoot=examples
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/hive2

Oozie Xml Workflow Schema Validation error

When I run Oozie to schedule an HBase load through a Sqoop job with incremental append, I get a schema validation error. My workflow action is:
<action name="sqoop-import">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <prepare>
            <delete path="${nameNode}/user/${wf:user()}/${examplesRoot}/output-data/sqoop"/>
            <mkdir path="${nameNode}/user/${wf:user()}/${examplesRoot}/output-data"/>
        </prepare>
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>${queueName}</value>
            </property>
        </configuration>
        <job-xml>/user/root/hbase-site.xml</job-xml>
        <command>import --connect "jdbc:sqlserver://localhost:1433;database=test" --table test_plan_package --username sa --password pass --incremental append --check-column testid --hbase-table test_plan --column-family testid</command>
        <file>/user/root/sqljdbc4.jar#sqljdbc4.jar</file>
        <file>/user/root/hbase/hbase-client.jar#hbase-client.jar</file>
        <file>/user/root/hbase/hbase-common.jar#hbase-common.jar</file>
        <file>/user/root/hbase/hbase-protocol.jar#hbase/hbase-protocol.jar</file>
        <file>/user/root/hbase/htrace-core3.1.0-incubating.jar#htrace-core3.1.0-incubating.jar</file>
        <file>/user/root/hbase/hbase-server.jar#hbase-server.jar</file>
        <file>/user/root/hbase/hbase-hadoop-compat.jar#hbase-hadoop-compat.jar</file>
        <file>/user/root/hbase/high-scale-lib-1.1.1.jar#high-scale-lib-1.1.1.jar</file>
    </sqoop>
    <ok to="end"/>
    <error to="fail"/>
</action>
<kill name="fail">
    <message>Sqoop failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
I tried various portals and came to understand that the problem is with XML schema version 0.2, and that it needs to be upgraded to 0.4 in workflow.xml.
Could anyone provide me the steps to upgrade the XML schema version to 0.4 in Oozie?
Move your job-xml above the configuration element; there is no need to upgrade the schema from 0.2 to 0.4. The XSDs registered in oozie-site.xml define a fixed element order, and the error you are getting is because job-xml must be placed above configuration.
Also check the jars against your HBase version and modify workflow.xml accordingly.
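In other words, the element order inside the question's sqoop action should become:
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <prepare>...</prepare>
    <job-xml>/user/root/hbase-site.xml</job-xml>   <!-- moved above configuration -->
    <configuration>...</configuration>
    <command>...</command>
    <file>...</file>
</sqoop>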

How to pick Dynamic File Name from HDFS while inserting into Hive Table

I have a Hive table.
Now I need to write a workflow where every day the job will search for a file in a location:
/data/data_YYYY-mm-dd.csv
like
/data/data_2015-07-07.csv
/data/data_2015-07-08.csv
...
So each day the workflow should automatically pick up the file name and load the data into the Hive table (MyTable).
I am writing the load script as below:
LOAD DATA INPATH "/data/${filepath}" OVERWRITE INTO TABLE MyTable;
Now, while running this as a plain Hive job, I can set filepath to data_2015-07-07.csv, but how do I do that in an Oozie coordinator so that it automatically picks the path with the date in the name?
I tried to set the workflow parameter from the Oozie coordinator:
clicklog_${YYYY}-{MONTH}-{DAY}.csv
Well, after checking through the Oozie coordinator documentation, I found the solution.
It's simple and straightforward: whatever parameter value you already added in the Hive workflow will be ignored, and the Oozie coordinator will fill it in.
So my Hive workflow was:
<workflow-app name="Workflow__" xmlns="uri:oozie:workflow:0.5">
    <start to="hive-cfc5"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <action name="hive-cfc5">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <job-xml>/user/hive-site.xml</job-xml>
            <script>/user/sub/create.hql</script>
        </hive>
        <ok to="hive-2ade"/>
        <error to="Kill"/>
    </action>
    <action name="hive-2ade">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <job-xml>/user/hive-site.xml</job-xml>
            <script>/user/sub/load_query.hql</script>
            <param>filepath=test_2015-06-26.csv</param>
        </hive>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>
Now I scheduled the same workflow in my Oozie coordinator, simply by setting the filepath parameter:
test_${YYYY}-{MONTH}-{DAY}.csv
<coordinator-app name="My_Coordinator"
                 frequency="*/60 * * * *"
                 start="${start_date}" end="${end_date}" timezone="America/Los_Angeles"
                 xmlns="uri:oozie:coordinator:0.2">
    <controls>
        <execution>FIFO</execution>
    </controls>
    <action>
        <workflow>
            <app-path>${wf_application_path}</app-path>
            <configuration>
                <property>
                    <name>filepath</name>
                    <value>test_${YYYY}-{MONTH}-{DAY}.csv</value>
                </property>
                <property>
                    <name>oozie.use.system.libpath</name>
                    <value>true</value>
                </property>
                <property>
                    <name>start_date</name>
                    <value>2015-07-07T14:50Z</value>
                </property>
                <property>
                    <name>end_date</name>
                    <value>2015-07-14T07:23Z</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>
and then I used the cron-style frequency (*/60 * * * *) to run it every 60 minutes and check whether a file matching the above pattern is available.
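If the ${YYYY}/{MONTH}/{DAY} placeholders do not resolve on your Oozie version, a hedged alternative (not from the original answer) is to derive the date from the coordinator's nominal time using the documented coord:formatTime EL function:
<property>
    <name>filepath</name>
    <value>test_${coord:formatTime(coord:nominalTime(), 'yyyy-MM-dd')}.csv</value>
</property>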

Executing Sqoops using Oozie

I have two Sqoop exports that load data from HDFS to MySQL. I want to execute them using Oozie. I have seen that an Oozie workflow is an XML file. How can I configure it so I can execute those Sqoop commands? A demonstration with steps would be appreciated.
The two Sqoop commands are:
1.
sqoop export --connect jdbc:mysql://localhost/hduser --table foo1 -m 1 --export-dir /user/cloudera/bar1
2.
sqoop export --connect jdbc:mysql://localhost/hduser --table foo2 -m 1 --export-dir /user/cloudera/bar2
Thanks.
You don't have to execute them via a shell action; there is a separate Sqoop action in Oozie. Here is what you have to put in your workflow.xml:
<workflow-app xmlns="uri:oozie:workflow:0.4" name="oozie-wf">
    <start to="sqoop-wf1"/>
    <action name="sqoop-wf1">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <command>export --connect jdbc:mysql://localhost/hduser --table foo1 -m 1 --export-dir /user/cloudera/bar1</command>
        </sqoop>
        <ok to="sqoop-wf2"/>
        <error to="fail"/>
    </action>
    <action name="sqoop-wf2">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <command>export --connect jdbc:mysql://localhost/hduser --table foo2 -m 1 --export-dir /user/cloudera/bar2</command>
        </sqoop>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Failed, Error Message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
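As for the steps: upload workflow.xml to an HDFS directory, create a job.properties along these lines (hosts, ports, and paths below are placeholders, not from the question), and submit with the Oozie CLI:
nameNode=hdfs://localhost:8020
jobTracker=localhost:8032
queueName=default
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/cloudera/apps/sqoop-export
and then:
oozie job -oozie http://localhost:11000/oozie -config job.properties -run
The MySQL JDBC driver jar also needs to be in the workflow's lib/ directory in HDFS (or the Oozie sharelib) for the Sqoop action to connect.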
Hope this helps.
You can use an Oozie shell action for this. Basically you need to create a shell action and provide the commands that you posted in your question as the command to be executed within the action.
Sample Oozie action:
<action name="SqoopAction">
    <shell xmlns="uri:oozie:shell-action:0.1">
        <job-tracker>[JOB-TRACKER]</job-tracker>
        <name-node>[NAME-NODE]</name-node>
        <prepare>
            <delete path="[PATH]"/>
            ...
            <mkdir path="[PATH]"/>
            ...
        </prepare>
        <job-xml>[SHELL SETTINGS FILE]</job-xml>
        <configuration>
            <property>
                <name>[PROPERTY-NAME]</name>
                <value>[PROPERTY-VALUE]</value>
            </property>
            ...
        </configuration>
        <exec>[SHELL-COMMAND]</exec>
        <argument>[ARG-VALUE]</argument>
        ...
        <argument>[ARG-VALUE]</argument>
        <env-var>[VAR1=VALUE1]</env-var>
        ...
        <env-var>[VARN=VALUEN]</env-var>
        <file>[FILE-PATH]</file>
        ...
        <archive>[FILE-PATH]</archive>
        ...
        <capture-output/>
    </shell>
</action>
In your case, [SHELL-COMMAND] is just the executable (sqoop); each token of the rest of the command then goes into its own <argument> element, for example:
<exec>sqoop</exec>
<argument>export</argument>
<argument>--connect</argument>
<argument>jdbc:mysql://localhost/hduser</argument>
<argument>--table</argument>
<argument>foo1</argument>
<argument>-m</argument>
<argument>1</argument>
<argument>--export-dir</argument>
<argument>/user/cloudera/bar1</argument>
Also, you could put all your Sqoop commands in a shell script and execute that script instead. This is better if you have a lot of commands to execute; see the sketch below.
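For instance, a script along these lines (the file name sqoop_exports.sh is an illustration, and it assumes the sqoop client is installed on the worker nodes), shipped to the action with a <file> element and named in <exec>:
#!/bin/bash
# sqoop_exports.sh - run both exports; abort on the first failure
set -e
sqoop export --connect jdbc:mysql://localhost/hduser --table foo1 -m 1 --export-dir /user/cloudera/bar1
sqoop export --connect jdbc:mysql://localhost/hduser --table foo2 -m 1 --export-dir /user/cloudera/bar2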

Simple oozie example of hive query?

I'm trying to convert a simple workflow to Oozie. I have tried looking through the Oozie examples, but they are a bit overwhelming. Effectively, I want to run a query and output the result to a text file:
hive -e 'select * from tables' > output.txt
How do I go about translating that into Oozie so it runs every hour?
Your workflow might look something like this...
workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.2" name="hive-wf">
    <start to="hive-node"/>
    <action name="hive-node">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>localhost:50001</job-tracker>
            <name-node>hdfs://localhost:50000</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>default</value>
                </property>
                <property>
                    <name>oozie.hive.defaults</name>
                    <value>/user/user1/oozie/hive-site.xml</value>
                </property>
            </configuration>
            <script>script.q</script>
            <param>INPUT_TABLE=SampleTable</param>
            <param>OUTPUT=/user/user1/output-data/hive</param>
        </hive>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Hive failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
So here hive-site.xml is the site XML present in the $HIVE_HOME/conf folder.
The script.q file contains the actual Hive query: select * from ${INPUT_TABLE}.
How and where can we use the OUTPUT param?
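(To answer the comment: the OUTPUT param is consumed inside script.q itself, typically to write the query result into HDFS, which mirrors the > output.txt part of the original command. A sketch:)
-- write the result of the parameterized query into the OUTPUT directory
INSERT OVERWRITE DIRECTORY '${OUTPUT}' SELECT * FROM ${INPUT_TABLE};
Hourly scheduling would then be handled by wrapping this workflow in a coordinator with an hourly frequency, similar to the coordinator example earlier on this page.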
