Ozzie flow not able to hive-ql having UDF add command

Ozzie flow not able to hive-ql having UDF add command - hadoop

I am creating oozie workflow where I am calling hive sqls sequentially.
First sql has simple transformation logic. While second has temporary function creation command and add lookup files commands. I am using this UDF further in sql.
ADD JAR **;
CREATE TEMPORARY FUNCTION XXXXX AS ...;
ADD FILE *;
<workflow-app xmlns="uri:oozie:workflow:0.4" name="hive-wf">
<credentials>
<credential name="hive_credentials" type="hcat">
<property>
<name>hcat.metastore.uri</name>
<value>XXXXXXXX</value>
</property>
<property>
<name>hcat.metastore.principal</name>
<value>XXXXXXXX</value>
</property>
</credential>
</credentials>
<start to="hive-1" />
<action name="hive-1" cred="hive_credentials">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>XXXXXXX</job-tracker>
<name-node>XXXXXXX</name-node>
<job-xml>/XXXXXX/oozie/oozie-hive-site.xml</job-xml>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>default</value>
</property>
</configuration>
<script>/XXXXXXX/hive_1.sql</script>
</hive>
<ok to="hive-2" />
<error to="fail" />
</action>
<action name="hive-2" cred="hive_credentials">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>XXXXXXXX</job-tracker>
<name-node>XXXXXXXX</name-node>
<job-xml>/XXXXXX/oozie/oozie-hive-site.xml</job-xml>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>default</value>
</property>
</configuration>
<script>/XXXXXXX/hive_2.sql</script>
</hive>
<ok to="end" />
<error to="fail" />
</action>
<kill name="fail">
<message>Hive failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end" />
First hql script is executed successfully. Workflow is killed while executing second hql script giving below error.
JOB[0000044-140317190624992-oozie-oozi-W] ACTION[-] E1100: Command precondition does not hold before execution, [, coord action is null], Error Code: E1100
It throws error while executing commands to ADD UDF.(ADD JAR,CREATE TEMPORARY,ADD FILE).
I searched on this error and I got some links to ignore the error !!!
But, my actual sql using hive UDF given in second Hql script is not executed.
Can you please help?

The add jar path is a local machine path.
Oozie actions are run on datanodes. There is a possibility it cannot find the jar on the datanode and hence gives an error (Check the mapreduce logs , you would find the reason)
If this is a issue, one workaround can be to put the file in HDFS and copy it to local filesytem on the datanode during execution

Related

Scheduling a sqoop job in oozie through Shell script using Hue

I am able to run a sqoop command in Oozie using Hue. But, when I try to run the same sqoop command by placing it in a shell script I am getting an error like below
Stdoutput 2016-05-20 10:52:13,241 ERROR [main] sqoop.Sqoop (Sqoop.java:runSqoop(181)) - Got exception running Sqoop:
java.lang.RuntimeException: Could not load db driver class: oracle.jdbc.OracleDriver
I have included the jdbc jar file like I did while running the sqoop command directly. I don't understand why it is not working for shell script.
Here is the workflow generated by Hue
<workflow-app name="My_Workflow" xmlns="uri:oozie:workflow:0.5">
<start to="shell-ca31"/>
<kill name="Kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<action name="shell-ca31">
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>default</value>
</property>
<property>
<name>oozie.use.system.libpath</name>
<value>true</value>
</property>
<property>
<name>oozie.libpath</name>
<value>/user/oozie/libext</value>
</property>
</configuration>
<exec>sqoopoozie.sh</exec>
<file>/user/yxr6907/sqoopoozie.sh#sqoopoozie.sh</file>
<archive>/user/oozie/libext/ojdbc7.jar#ojdbc7.jar</archive>
<capture-output/>
</shell>
<ok to="End"/>
<error to="Kill"/>
</action>
<end name="End"/>
</workflow-app>

When you use shell action, jars for sqoop are not imported into classpath.
I was able to solve it by adding the jar into the classpath. Then, i export HADOOP_CLASSPATH and sqoop works.
Use the following:
Put the jar ojdbc7.jar in files
Use the following command inside shell script: export HADOOP_CLASSPATH=${PWD}/ojdbc7.jar
Instead of step 1. you can use the following properties to load jar into classpath:
oozie.use.system.libpath=true
oozie.libpath=/path/to/jars
Exporting HADOOP_CLASSPATH is required in both ways.

oozie Pig action lauching Error

I am trying to do very basic oozie workflow
I am getting the below error wheni give the command..
user#ubuntu:~/surender$ oozie job -oozie http://localhost:11000/oozie /home/user/surender/oozie_demo/job.properties -run
Error:
Error: E0501 : E0501: Could not perform authorization operation, Failed on local exception: java.io.EOFException; Host Details : local host is: "ubuntu/127.0.0.1"; destination host is: "localhost":8020;
My oozie version is 4.0.0 , I checked that oozie web console is enabled..
This is how created a oozie workflow
I created a directory called oozie_demo and inside that i created two files
1.workflow.xml
2.job.properties
I also created a lib directory and placed the pig script inside that lib directory
workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.2" name="pig-wf">
<start to="pig-node"/>
<action name="pig-node">
<pig>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${nameNode}/user/${wf:user()}/output/pig/simple_load"/>
</prepare>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
</configuration>
<script>simple_load.pig</script>
<param>INPUT=/user/${wf:user()}/inputfiles/records.txt</param>
<param>OUTPUT=/user/${wf:user()}//output/pig/simple_load</param>
</pig>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Pig failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
job.properties
nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
queueName=default
oozie_demo=oozie_demo
oozie.use.system.libpath=true
ozie.wf.application.path=${nameNode}/user/user/oozie_demo
my pig script :
records = load '/user/user/inputfiles/records.txt' USING PigStorage(',');
store records into '/user/user/output/pig/simple_load' using PigStorage(',');
Could somebody help me on this? I would like to know what went wrong? and how do i resolve this issue ?

Could you check if the Namenode is up and running at port 8020.

How to pick Dynamic File Name from HDFS while inserting into Hive Table

I have a Hive Table.
Now I need to write a workflow where everyday the job will search for a file in a location -
/data/data_YYYY-mm-dd.csv
like
/data/data_2015-07-07.csv
/data/data_2015-07-08.csv
...
So each day workflow will automatically pick the file name and load the data into the Hive Table(MyTable).
I am writing the script of loading as below-
LOAD DATA INPATH "/data/${filepath}" OVERWRITE INTO TABLE MyTable.
Now while running the same as a plain hive job I can set the filepath as data_2015-07-07.csv , but how to do that in Oozie coordinator so that it automatically picks the path with name as date.
I tried to set the workflow parameter from Oozie coordinator-
clicklog_${YYYY}-{MONTH}-{DAY}.csv

Well after checking through Oozie coordinator documentation, I found the solution.
Its simple and straightforward, whatever the configuration you already added in Hive Workflow, will be ignored and OOzie coordinator will fill them-
So My Hive Workflow was -
<workflow-app name="Workflow__" xmlns="uri:oozie:workflow:0.5">
<start to="hive-cfc5"/>
<kill name="Kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<action name="hive-cfc5">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<job-xml>/user/hive-site.xml</job-xml>
<script>/user/sub/create.hql</script>
</hive>
<ok to="hive-2ade"/>
<error to="Kill"/>
</action>
<action name="hive-2ade">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<job-xml>/user/hive-site.xml</job-xml>
<script>/user/sub/load_query.hql</script>
<param>filepath=test_2015-06-26.csv</param>
</hive>
<ok to="End"/>
<error to="Kill"/>
</action>
<end name="End"/>
</workflow-app>
Now I scheduled the same workflow in my oozie coordinator-
Simply by setting the filepath parameter-
test_${YYYY}-{MONTH}-{DAY}.csv
<coordinator-app name="My_Coordinator"
frequency="*/60 * * * *"
start="${start_date}" end="${end_date}" timezone="America/Los_Angeles"
xmlns="uri:oozie:coordinator:0.2"
>
<controls>
<execution>FIFO</execution>
</controls>
<action>
<workflow>
<app-path>${wf_application_path}</app-path>
<configuration>
<property>
<name>filepath</name>
<value>test_${YYYY}-{MONTH}-{DAY}.csv</value>
</property>
<property>
<name>oozie.use.system.libpath</name>
<value>True</value>
</property>
<property>
<name>start_date</name>
<value>2015-07-07T14:50Z</value>
</property>
<property>
<name>end_date</name>
<value>2015-07-14T07:23Z</value>
</property>
</configuration>
</workflow>
</action>
</coordinator-app>
and then I used a crone job to run the same every 60 minute (*/60 * * * *) to check for any above pattern file is available or not

LeaseExpiredException while running oozie fork

We are trying to run a Oozie workflow with 3 sub workflows running in parallel using fork. The sub-workflows contains a node running a native map reduce job, and subsequent two nodes running some complex PIG jobs. Finally the three sub-workflows are joined to a single end node.
When we run this workflow, we get LeaseExpiredException. The exception occurs randomly while running the PIG jobs. There is no definite place when it occurs, but it occurs every time we run the WF.
Also, if we remove the fork and run the sub-workflows sequentially, it works fine. However, our expectation is to have them run in parallel and same on some execution time.
Can you please help me understand this issue and some pointers on where we could be going wrong. We are starting with hadoop development and haven't faced such an issue earlier.
It looks like due to several tasks running in parallel, one of the threads closed a part file and when another thread tried to close the same, it throws the error.
Following is the stack trace of the exception from the hadoop logs.
2013-02-19 10:23:54,815 INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher: 57% complete
2013-02-19 10:26:55,361 INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher: 59% complete
2013-02-19 10:27:59,666 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file <hdfspath>/oozie-oozi/0000105-130218000850190-oozie-oozi-W/aggregateData--pig/output/_temporary/_attempt_201302180007_0380_m_000000_0/part-00000 : org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on <hdfspath>/oozie-oozi/0000105-130218000850190-oozie-oozi-W/aggregateData--pig/output/_temporary/_attempt_201302180007_0380_m_000000_0/part-00000 File does not exist. Holder DFSClient_attempt_201302180007_0380_m_000000_0 does not have any open files.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1664)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1655)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:1710)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:1698)
at org.apache.hadoop.hdfs.server.namenode.NameNode.complete(NameNode.java:793)
at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1439)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1435)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1433)
Following is the sample for main workflow and one sub-workflow.
Main Work-Flow:
<workflow-app xmlns="uri:oozie:workflow:0.2" name="MainProcess">
<start to="forkProcessMain"/>
<fork name="forkProcessMain">
<path start="Proc1"/>
<path start="Proc2"/>
<path start="Proc3"/>
</fork>
<join name="joinProcessMain" to="end"/>
<action name="Proc1">
<sub-workflow>
<app-path>${nameNode}${wfPath}/proc1_workflow.xml</app-path>
<propagate-configuration/>
</sub-workflow>
<ok to="joinProcessMain"/>
<error to="fail"/>
</action>
<action name="Proc2">
<sub-workflow>
<app-path>${nameNode}${wfPath}/proc2_workflow.xml</app-path>
<propagate-configuration/>
</sub-workflow>
<ok to="joinProcessMain"/>
<error to="fail"/>
</action>
<action name="Proc3">
<sub-workflow>
<app-path>${nameNode}${wfPath}/proc3_workflow.xml</app-path>
<propagate-configuration/>
</sub-workflow>
<ok to="joinProcessMain"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>WF Failure, 'wf:lastErrorNode()' failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
Sub-WorkFlow:
<workflow-app xmlns="uri:oozie:workflow:0.2" name="Sub Process">
<start to="Step1"/>
<action name="Step1">
<java>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${step1JoinOutputPath}"/>
</prepare>
<configuration>
<property>
<name>mapred.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<main-class>com.absd.mr.step1</main-class>
<arg>${wf:name()}</arg>
<arg>${wf:id()}</arg>
<arg>${tbMasterDataOutputPath}</arg>
<arg>${step1JoinOutputPath}</arg>
<arg>${tbQueryKeyPath}</arg>
<capture-output/>
</java>
<ok to="generateValidQueryKeys"/>
<error to="fail"/>
</action>
<action name="generateValidQueryKeys">
<pig>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${tbValidQuerysOutputPath}"/>
</prepare>
<configuration>
<property>
<name>pig.tmpfilecompression</name>
<value>true</value>
</property>
<property>
<name>pig.tmpfilecompression.codec</name>
<value>lzo</value>
</property>
<property>
<name>pig.output.map.compression</name>
<value>true</value>
</property>
<property>
<name>pig.output.map.compression.codec</name>
<value>lzo</value>
</property>
<property>
<name>pig.output.compression</name>
<value>true</value>
</property>
<property>
<name>pig.output.compression.codec</name>
<value>lzo</value>
</property>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
</configuration>
<script>${pigDir}/tb_calc_valid_accounts.pig</script>
<param>csvFilesDir=${csvFilesDir}</param>
<param>step1JoinOutputPath=${step1JoinOutputPath}</param>
<param>tbValidQuerysOutputPath=${tbValidQuerysOutputPath}</param>
<param>piMinFAs=${piMinFAs}</param>
<param>piMinAccounts=${piMinAccounts}</param>
<param>parallel=80</param>
</pig>
<ok to="aggregateAumData"/>
<error to="fail"/>
</action>
<action name="aggregateAumData">
<pig>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${tbCacheDataPath}"/>
</prepare>
<configuration>
<property>
<name>pig.tmpfilecompression</name>
<value>true</value>
</property>
<property>
<name>pig.tmpfilecompression.codec</name>
<value>lzo</value>
</property>
<property>
<name>pig.output.map.compression</name>
<value>true</value>
</property>
<property>
<name>pig.output.map.compression.codec</name>
<value>lzo</value>
</property>
<property>
<name>pig.output.compression</name>
<value>true</value>
</property>
<property>
<name>pig.output.compression.codec</name>
<value>lzo</value>
</property>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
</configuration>
<script>${pigDir}/aggregationLogic.pig</script>
<param>csvFilesDir=${csvFilesDir}</param>
<param>tbValidQuerysOutputPath=${tbValidQuerysOutputPath}</param>
<param>tbCacheDataPath=${tbCacheDataPath}</param>
<param>currDate=${date}</param>
<param>udfJarPath=${nameNode}${wfPath}/lib</param>
<param>parallel=150</param>
</pig>
<ok to="loadDataToDB"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>WF Failure, 'wf:lastErrorNode()' failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>

We've got the same error when we were running three pig actions in parallel and one of them failed. That message error is consequence of an unexpected workflow stop because one action failed, the workflow is stopped and the others actions are trying to continue. You must look at the failed action with status ERROR to know what happened, doesn't look at actions with status KILLED

Simple oozie example of hive query?

I'm trying to convert a simple work flow to oozie. I have tried looking through the oozie examples but they are a bit over-whelming. Effectively I want to run a query and output the result to a text file.
hive -e 'select * from tables' > output.txt
How to I got about translating that into oozie to have it run every hour?

Your workflow might look something like this...
workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.2" name="hive-wf">
<start to="hive-node"/>
<action name="hive-node">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>localhost:50001</job-tracker>
<name-node>hdfs://localhost:50000</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>default</value>
</property>
<property>
<name>oozie.hive.defaults</name>
<value>/user/user1/oozie/hive-site.xml</value>
</property>
</configuration>
<script>script.q</script>
<param>INPUT_TABLE=SampleTable</param>
<param>OUTPUT=/user/user1/output-data/hive</param>
</hive>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Hive failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
So here hive-site.xml is the site xml present in $HIVE_HOME/conf folder.
script.q file contains the actual hive query. select * from ${INPUT_TABLE} .
how and where can we use the OUTPUT param?

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Ozzie flow not able to hive-ql having UDF add command - hadoop

Related

Scheduling a sqoop job in oozie through Shell script using Hue

oozie Pig action lauching Error

How to pick Dynamic File Name from HDFS while inserting into Hive Table

LeaseExpiredException while running oozie fork

Simple oozie example of hive query?

Categories

Resources