I am trying to automate Hive through Oozie. I wrote a simple Hive script that creates a table and runs select queries on it. When I submit the workflow with this script, it goes into running mode but never finishes. I checked yarn application -list and the job was hanging at 95%. The Hive table is created successfully, but I am not able to fetch data from it. Please let me know how to resolve this problem.
Thanks in Advance.
Workflow.xml
<action name="hive2-node">
<hive2 xmlns="uri:oozie:hive2-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${nameNode}/user/${wf:user()}/${examplesRoot}/output-data/hive2"/>
<mkdir path="${nameNode}/user/${wf:user()}/${examplesRoot}/output-data"/>
</prepare>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<jdbc-url>${jdbcURL}</jdbc-url>
<script>script.q</script>
<param>INPUT=/user/${wf:user()}/${examplesRoot}/input-data/table</param>
<param>OUTPUT=/user/${wf:user()}/${examplesRoot}/output-data/hive2</param>
</hive2>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Hive2 (Beeline) action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
script.q
job.properties
nameNode=hdfs://...:8020
jobTracker=...:8050
queueName=default
jdbcURL=jdbc:hive2://..*.:10000/default
examplesRoot=examples
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/hive2
I have an Oozie workflow that runs a Sqoop command to incrementally load data from a table based on the lastupdatedate column.
How do I set --last-value so that we get the records from the last time we ran the job up to now?
If you are importing the data into a Hive table, you can query the last updated value from that table and pass it to the Sqoop import command. The workflow needs two actions:
1. A Hive action that runs the select query to retrieve the last updated value.
2. A Sqoop action that performs the incremental load using the captured output of the previous Hive action.
Please find below a pseudo workflow:
<workflow-app name="sqoop-to-hive" xmlns="uri:oozie:workflow:0.4">
<start to="hiveact"/>
<action name="hiveact">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<script>script.sql</script>
<capture-output/>
</hive>
<ok to="sqoopact"/>
<error to="kill"/>
</action>
<action name="sqoopact">
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<command>import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P --incremental append --last-value ${wf:actionData('hiveact')}</command>
</sqoop>
<ok to="end"/>
<error to="kill"/>
</action>
<kill name="kill">
<message>Action failed</message>
</kill>
<end name="end"/>
</workflow-app>
Hope this helps.
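A caveat on the pseudo workflow above: as far as I know, capture-output is documented for the shell, java and ssh actions rather than for the hive action, and wf:actionData() returns a map that has to be indexed by key. A minimal sketch of the same idea using a shell action is shown below. The action name get-last-value, the script get_last_value.sh and the key last_value are hypothetical names chosen for illustration, and the check column is assumed to be the lastupdatedate column mentioned in the question.

```xml
<!-- Sketch only: get-last-value, get_last_value.sh and last_value are hypothetical names. -->
<action name="get-last-value">
    <shell xmlns="uri:oozie:shell-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <!-- The script must print one key=value line to stdout, e.g. last_value=2015-07-07 -->
        <exec>get_last_value.sh</exec>
        <file>get_last_value.sh</file>
        <capture-output/>
    </shell>
    <ok to="sqoopact"/>
    <error to="kill"/>
</action>
<action name="sqoopact">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <!-- Same command as above, plus a check column and a keyed lookup of the captured value -->
        <command>import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P --incremental append --check-column lastupdatedate --last-value ${wf:actionData('get-last-value')['last_value']}</command>
    </sqoop>
    <ok to="end"/>
    <error to="kill"/>
</action>
```

The script itself could run the Hive select and echo the result in key=value form; how the value is obtained is left open here.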
I've written an Oozie workflow that runs a BASH shell script to do some hive queries and perform some actions on the results. The script runs but throws a permission error when accessing some of the HDFS data. The user that submitted the Oozie workflow has permission but the script is running as the yarn user.
Is it possible to make Oozie execute the script as the user who submitted the workflow? Hive and Java actions both execute as the submitted user, just shell is behaving differently.
Here's the rough outline of my Oozie action
<action name="start_action"
retry-max="12"
retry-interval="600">
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<job-xml>${WorkflowRoot}/hive-site.xml</job-xml>
<exec>script.sh</exec>
<file>${WorkflowRoot}/script.sh</file>
<capture-output />
</shell>
<ok to="next_action"/>
<error to="send_email"/>
</action>
I'm running Oozie 4.1.0 and HDP 2.1.
This issue occurs on all clusters that are configured with Simple Security. You have the option to override the default user. Including the statement below at the start of the shell script fixes the issue.
export HADOOP_USER_NAME=<Name of submitted user>;
You can also make the script run as the submitting user with the help of env-var:
<env-var>HADOOP_USER_NAME=${wf:user()}</env-var>
<workflow-app xmlns="uri:oozie:workflow:0.3" name="shell-wf">
<start to="shell-node"/>
<action name="shell-node">
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>test.sh</exec>
<env-var>HADOOP_USER_NAME=${wf:user()}</env-var>
<file>/user/root/test.sh</file>
</shell>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
I have a Hive Table.
Now I need to write a workflow where every day the job will look for a file in a location like this:
/data/data_YYYY-mm-dd.csv
like
/data/data_2015-07-07.csv
/data/data_2015-07-08.csv
...
Each day the workflow should automatically pick up that day's file name and load the data into the Hive table (MyTable).
I am writing the load script as below:
LOAD DATA INPATH "/data/${filepath}" OVERWRITE INTO TABLE MyTable;
When running this as a plain Hive job I can set filepath to data_2015-07-07.csv, but how do I do that in an Oozie coordinator so that it automatically picks the path named after the date?
I tried to set the workflow parameter from the Oozie coordinator:
clicklog_${YYYY}-{MONTH}-{DAY}.csv
After checking the Oozie coordinator documentation, I found the solution.
It's simple and straightforward: whatever configuration you already added in the Hive workflow is ignored, and the Oozie coordinator fills it in.
My Hive workflow was:
<workflow-app name="Workflow__" xmlns="uri:oozie:workflow:0.5">
<start to="hive-cfc5"/>
<kill name="Kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<action name="hive-cfc5">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<job-xml>/user/hive-site.xml</job-xml>
<script>/user/sub/create.hql</script>
</hive>
<ok to="hive-2ade"/>
<error to="Kill"/>
</action>
<action name="hive-2ade">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<job-xml>/user/hive-site.xml</job-xml>
<script>/user/sub/load_query.hql</script>
<param>filepath=test_2015-06-26.csv</param>
</hive>
<ok to="End"/>
<error to="Kill"/>
</action>
<end name="End"/>
</workflow-app>
I then scheduled the same workflow in my Oozie coordinator, simply by setting the filepath parameter:
test_${YYYY}-{MONTH}-{DAY}.csv
<coordinator-app name="My_Coordinator"
frequency="*/60 * * * *"
start="${start_date}" end="${end_date}" timezone="America/Los_Angeles"
xmlns="uri:oozie:coordinator:0.2"
>
<controls>
<execution>FIFO</execution>
</controls>
<action>
<workflow>
<app-path>${wf_application_path}</app-path>
<configuration>
<property>
<name>filepath</name>
<value>test_${YYYY}-{MONTH}-{DAY}.csv</value>
</property>
<property>
<name>oozie.use.system.libpath</name>
<value>True</value>
</property>
<property>
<name>start_date</name>
<value>2015-07-07T14:50Z</value>
</property>
<property>
<name>end_date</name>
<value>2015-07-14T07:23Z</value>
</property>
</configuration>
</workflow>
</action>
</coordinator-app>
I then used a cron-style frequency (*/60 * * * *) to run the coordinator every 60 minutes and check whether a file matching the pattern above is available.
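If the date placeholders in the value above are not substituted on your cluster, one alternative is to build the file name from the coordinator's nominal time. A minimal sketch follows, assuming the date in the file name matches the nominal date of the coordinator action. It uses the coordinator EL functions coord:formatTime and coord:nominalTime.

```xml
<!-- Sketch only: assumes the file date equals the coordinator action's nominal date. -->
<action>
    <workflow>
        <app-path>${wf_application_path}</app-path>
        <configuration>
            <property>
                <name>filepath</name>
                <!-- For a nominal date of 2015-07-07 this gives test_2015-07-07.csv -->
                <value>test_${coord:formatTime(coord:nominalTime(), 'yyyy-MM-dd')}.csv</value>
            </property>
        </configuration>
    </workflow>
</action>
```

For the value to reach the Hive script, the workflow would pass it on with <param>filepath=${filepath}</param> rather than a hard-coded file name.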
I'm unable to import an Oozie workflow into the Hue editor (Hue version 2.5.0).
Error: Could not import workflow, Node kill has not been defined
<workflow-app name="mapDeply" xmlns="uri:oozie:workflow:0.4">
<start to="TestPOC"/>
<action name="TestPOC">
<java>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${nameNode}/data/temp"/>
</prepare>
<main-class>WordCount</main-class>
<arg>/data/input</arg>
<arg>/data/temp</arg>
</java>
<ok to="end"/>
<error to="killemail"/>
</action>
<action name="killemail">
<email xmlns="uri:oozie:email-action:0.1">
<to>test#test.com</to>
<subject>Test</subject>
<body>TEST</body>
</email>
<ok to="kill"/>
<error to="kill"/>
</action>
<kill name="kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
If I change the java action's error transition to kill, it works. Is this expected behavior, or is there a workaround?
This is currently not supported. You indeed need to have each action error node point to the kill node, then import the workflow, then modify it in the editor.
This will be improved in the future, and this use case can in part be replaced by the Oozie SLA, which is supported as of Hue 3.6.
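As a sketch of that workaround, the java action from the question would be imported with its error transition pointing straight at the kill node. The email action is left out here, on the assumption that it is added back in the editor after the import.

```xml
<!-- Sketch only: same action as in the question, with the error transition changed for the import. -->
<action name="TestPOC">
    <java>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <prepare>
            <delete path="${nameNode}/data/temp"/>
        </prepare>
        <main-class>WordCount</main-class>
        <arg>/data/input</arg>
        <arg>/data/temp</arg>
    </java>
    <ok to="end"/>
    <!-- Points at the kill node directly instead of killemail -->
    <error to="kill"/>
</action>
```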
I'm trying to convert a simple workflow to Oozie. I have tried looking through the Oozie examples but they are a bit overwhelming. Effectively I want to run a query and output the result to a text file.
hive -e 'select * from tables' > output.txt
How do I go about translating that into Oozie so that it runs every hour?
Your workflow might look something like this:
workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.2" name="hive-wf">
<start to="hive-node"/>
<action name="hive-node">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>localhost:50001</job-tracker>
<name-node>hdfs://localhost:50000</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>default</value>
</property>
<property>
<name>oozie.hive.defaults</name>
<value>/user/user1/oozie/hive-site.xml</value>
</property>
</configuration>
<script>script.q</script>
<param>INPUT_TABLE=SampleTable</param>
<param>OUTPUT=/user/user1/output-data/hive</param>
</hive>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Hive failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
Here hive-site.xml is the site XML found in the $HIVE_HOME/conf folder.
The script.q file contains the actual Hive query: select * from ${INPUT_TABLE};
How and where can we use the OUTPUT param?
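The workflow above only runs once per submission. To run it every hour, as the question asks, it can be wrapped in a coordinator. A minimal sketch follows. The coordinator name, the start and end dates, the timezone and the application path are placeholder assumptions, not values from the original posts.

```xml
<!-- Sketch only: name, dates, timezone and app-path are placeholder assumptions. -->
<coordinator-app name="hive-hourly-coord" frequency="${coord:hours(1)}"
                 start="2015-07-07T00:00Z" end="2015-07-14T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.2">
    <action>
        <workflow>
            <!-- HDFS directory that holds the workflow.xml shown above -->
            <app-path>hdfs://localhost:50000/user/user1/oozie/hive-wf</app-path>
        </workflow>
    </action>
</coordinator-app>
```

The coordinator is submitted with a job.properties that sets oozie.coord.application.path to the directory containing this file. As for OUTPUT, script.q can reference it the same way as INPUT_TABLE, for example with INSERT OVERWRITE DIRECTORY '${OUTPUT}' select * from ${INPUT_TABLE}; which writes the result as files under that HDFS directory instead of redirecting to output.txt.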