Shell script that validates files in HDFS - bash

I'm trying to write a script that checks whether any file is missing from an HDFS path. The idea is to include it in an Oozie workflow so that, when a file is not found, the workflow fails and does not continue with the flow.
ALTRAFI="/input/ALTrafi*.txt"
ALTDATOS="/ALTDatos*.txt"
ALTARVA="/ALTarva*.txt"
TRAFCIER="/TrafCier*.txt"

if hdfs dfs -test -e $ALTRAFI; then
    echo "[$ALTRAFI] Archive not found"
    exit 1
fi

if hdfs dfs -test -e $ALTDATOS; then
    echo "[$ALTDATOS] Archive not found"
    exit 2
fi

if hdfs dfs -test -e $ALTARVA; then
    echo "[$ALTARVA] Archive not found"
    exit 3
fi

if hdfs dfs -test -e $TRAFCIER; then
    echo "[$TRAFCIER] Archive not found"
    exit 4
fi
But the script does not fail when it cannot find a file, and the Oozie workflow continues.
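For reference, hdfs dfs -test -e returns exit code 0 when the path exists, so the branches above fire when the files are present and the script exits 0 when they are missing, which matches the behaviour described. A minimal sketch of the negated test for one of the files (assuming the glob is meant to resolve on HDFS); the same pattern would apply to the other three variables:

#!/bin/bash
ALTRAFI="/input/ALTrafi*.txt"

# -test -e exits 0 if the path exists; negate it so a missing file aborts the script.
# A non-zero exit code makes the Oozie shell action fail and follow its error transition.
if ! hdfs dfs -test -e $ALTRAFI; then
    echo "[$ALTRAFI] File not found"
    exit 1
fi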
Oozie flow:
<start to="ValidateFiles"/>
<kill name="Kill">
    <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<action name="ValidateFiles">
    <shell xmlns="uri:oozie:shell-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>${queueName}</value>
            </property>
            <property>
                <name>tez.lib.uris</name>
                <value>/hdp/apps/2.5.0.0-1245/tez/tez.tar.gz</value>
            </property>
        </configuration>
        <exec>/produccion/apps/traficotasado/ValidateFiles.sh</exec>
        <file>/produccion/apps/traficotasado/ValidateFiles.sh#ValidateFiles.sh</file> <!-- Copy the executable to the compute node's current working directory -->
    </shell>
    <ok to="CopyFiles"/>
    <error to="Kill"/>
</action>
<action name="CopyFiles">
    <shell xmlns="uri:oozie:shell-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>${queueName}</value>
            </property>
            <property>
                <name>tez.lib.uris</name>
                <value>/hdp/apps/2.5.0.0-1245/tez/tez.tar.gz</value>
            </property>
        </configuration>
        <exec>/produccion/apps/traficotasado/CopyFiles.sh</exec>
        <file>/produccion/apps/traficotasado/CopyFiles.sh#CopyFiles.sh</file> <!-- Copy the executable to the compute node's current working directory -->
    </shell>
    <ok to="DepuraFilesStage"/>
    <error to="Kill"/>
</action>
Thanks for the help

Related

File creation failing when using Oozie

I want to create some output files using a shell script as part of my project. The process works fine when the script is run independently, but it fails with the error below when I try to integrate it with an Oozie workflow:
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.ShellMain], exit code [1]
Sample workflow:
<action name="scr-filecreation">
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>${SimpleFileCreation}</exec>
<argument>${filedir}</argument>
<file>${SimpleFileCreation}#${SimpleFileCreation}</file>
<capture-output/>
</shell>
<ok to="income-success" />
<error to="income-failure" />
</action>
Below is the simple script, fileCreate.sh, which creates a file with a few lines:
#!/bin/bash
# Write the output directory name, then the line counts of dir1 and dir2, to file.txt
echo $1 > $1/file.txt
count_val1=`cat dir1/* | wc -l`
count_val2=`cat dir2/* | wc -l`
echo $count_val1 >> $1/file.txt
echo $count_val2 >> $1/file.txt
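One common reason for this (an assumption, not confirmed by the error shown): the Oozie shell action runs in a YARN container's working directory on a compute node, so relative paths such as dir1/* may not resolve there even though they do when the script is run by hand. A defensive sketch that keeps the same logic but fails fast with a visible reason:

#!/bin/bash
# Sketch: same output as fileCreate.sh, but abort with a clear message if the
# relative input directories are not present in the container's working directory.
set -euo pipefail

OUT_DIR="$1"
echo "$OUT_DIR" > "$OUT_DIR/file.txt"

for d in dir1 dir2; do
    if [ ! -d "$d" ]; then
        echo "Directory '$d' not found in $(pwd)" >&2
        exit 1
    fi
done

count_val1=$(cat dir1/* | wc -l)
count_val2=$(cat dir2/* | wc -l)
echo "$count_val1" >> "$OUT_DIR/file.txt"
echo "$count_val2" >> "$OUT_DIR/file.txt"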

Oozie Pig action launching error

I am trying to run a very basic Oozie workflow, and I am getting the error below when I run this command:
user#ubuntu:~/surender$ oozie job -oozie http://localhost:11000/oozie /home/user/surender/oozie_demo/job.properties -run
Error:
Error: E0501 : E0501: Could not perform authorization operation, Failed on local exception: java.io.EOFException; Host Details : local host is: "ubuntu/127.0.0.1"; destination host is: "localhost":8020;
My Oozie version is 4.0.0, and I have checked that the Oozie web console is enabled.
This is how I created the Oozie workflow: I created a directory called oozie_demo and inside it created two files:
1. workflow.xml
2. job.properties
I also created a lib directory and placed the pig script inside it.
workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.2" name="pig-wf">
<start to="pig-node"/>
<action name="pig-node">
<pig>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${nameNode}/user/${wf:user()}/output/pig/simple_load"/>
</prepare>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
</configuration>
<script>simple_load.pig</script>
<param>INPUT=/user/${wf:user()}/inputfiles/records.txt</param>
<param>OUTPUT=/user/${wf:user()}//output/pig/simple_load</param>
</pig>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Pig failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
job.properties
nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
queueName=default
oozie_demo=oozie_demo
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/user/oozie_demo
My Pig script:
records = load '/user/user/inputfiles/records.txt' USING PigStorage(',');
store records into '/user/user/output/pig/simple_load' using PigStorage(',');
Could somebody help me with this? I would like to know what went wrong and how to resolve the issue.
Could you check whether the NameNode is up and running on port 8020?
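A quick way to check (a sketch, assuming a Hadoop client is installed on the same machine): compare the configured default filesystem with the nameNode value in job.properties and confirm that the RPC port is listening.

# Should print the same URI as nameNode in job.properties (hdfs://localhost:8020)
hdfs getconf -confKey fs.defaultFS

# Confirm the NameNode RPC port is actually listening (netstat flags vary by distro)
netstat -tln | grep 8020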

Hadoop streaming workflow with multiple files

I am trying to write a workflow with a Hadoop streaming action that executes an awk program. Below is my scenario.
The Hadoop streaming command works fine from the client. However, when executed as an Oozie workflow it does not work, because it is not able to find the second file. Please note that the awk script is in the local home directory, which is mounted on Hadoop as well, and the input paths are on HDFS.
In sample.awk (code attached below) I am passing two variables, $1 and $2, which should get data from file1 and file2.
The command from the CLI is below; I have also attached the streaming workflow, which I configured from Hue and which is not working as expected.
/usr/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.3.0-mr1-cdh5.1.0.jar -D mapreduce.job.reduces=0 -D mapred.reduce.tasks=0 -input /user/cloudera/input/file1 /user/cloudera/input/file2 -output /user/cloudera/awk/ouput -mapper /home/cloudera/diff_files/op_code/sample.awk -file /home/cloudera/diff_files/op_code/sample.awk
Workflow.xml
------------------
<workflow-app name="awk" xmlns="uri:oozie:workflow:0.4">
<global>
<configuration>
<property>
<name></name>
<value></value>
</property>
</configuration>
</global>
<start to="awk-streaming"/>
<action name="awk-streaming" cred="">
<map-reduce>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<streaming>
<mapper>/home/clouderasample.awk</mapper>
<reducer>/home/clouderasample.awk</reducer>
</streaming>
<configuration>
<property>
<name>mapred.output.dir</name>
<value>/user/cloudera/awk/output</value>
</property>
<property>
<name>oozie.use.system.libpath</name>
<value>true</value>
</property>
<property>
<name>mapred.input.dir</name>
<value>/user/cloudera/awk/input</value>
</property>
</configuration>
<file>/user/cloudera/awk/input/file1#file1</file>
<file>/user/cloudera/awk/input/file2#file2</file>
</map-reduce>
<ok to="end"/>
<error to="kill"/>
</action>
<kill name="kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
Kindly see this link for more details: http://wiki.apache.org/hadoop/JobConfFile
The mapred.input.dir property takes a comma-separated list of input directories:
<property>
<name>mapred.input.dir</name>
<value>/user/cloudera/awk/input/file1,/user/cloudera/awk/input/file2</value>
<description>A comma separated list of input directories.</description>
</property>
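For comparison, a command-line form of the same multi-input job typically passes one -input flag per path (a sketch built from the command in the question):

/usr/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.3.0-mr1-cdh5.1.0.jar \
    -D mapreduce.job.reduces=0 \
    -D mapred.reduce.tasks=0 \
    -input /user/cloudera/input/file1 \
    -input /user/cloudera/input/file2 \
    -output /user/cloudera/awk/output \
    -mapper /home/cloudera/diff_files/op_code/sample.awk \
    -file /home/cloudera/diff_files/op_code/sample.awk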

Oozie flow not able to run hive-ql having UDF ADD commands

I am creating an Oozie workflow in which I call Hive SQL scripts sequentially. The first script has simple transformation logic, while the second creates a temporary function and adds lookup files. I use this UDF later in the SQL:
ADD JAR **;
CREATE TEMPORARY FUNCTION XXXXX AS ...;
ADD FILE *;
<workflow-app xmlns="uri:oozie:workflow:0.4" name="hive-wf">
<credentials>
<credential name="hive_credentials" type="hcat">
<property>
<name>hcat.metastore.uri</name>
<value>XXXXXXXX</value>
</property>
<property>
<name>hcat.metastore.principal</name>
<value>XXXXXXXX</value>
</property>
</credential>
</credentials>
<start to="hive-1" />
<action name="hive-1" cred="hive_credentials">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>XXXXXXX</job-tracker>
<name-node>XXXXXXX</name-node>
<job-xml>/XXXXXX/oozie/oozie-hive-site.xml</job-xml>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>default</value>
</property>
</configuration>
<script>/XXXXXXX/hive_1.sql</script>
</hive>
<ok to="hive-2" />
<error to="fail" />
</action>
<action name="hive-2" cred="hive_credentials">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>XXXXXXXX</job-tracker>
<name-node>XXXXXXXX</name-node>
<job-xml>/XXXXXX/oozie/oozie-hive-site.xml</job-xml>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>default</value>
</property>
</configuration>
<script>/XXXXXXX/hive_2.sql</script>
</hive>
<ok to="end" />
<error to="fail" />
</action>
<kill name="fail">
<message>Hive failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end" />
The first HQL script executes successfully. The workflow is killed while executing the second HQL script, with the error below:
JOB[0000044-140317190624992-oozie-oozi-W] ACTION[-] E1100: Command precondition does not hold before execution, [, coord action is null], Error Code: E1100
It throws the error while executing the commands that add the UDF (ADD JAR, CREATE TEMPORARY FUNCTION, ADD FILE).
I searched for this error and found some links suggesting it be ignored, but my actual SQL that uses the Hive UDF in the second HQL script is never executed.
Can you please help?
The ADD JAR path is a local machine path. Oozie actions run on datanodes, so there is a possibility that the jar cannot be found on the datanode, which produces the error (check the MapReduce logs; you will find the reason there).
If this is the issue, one workaround is to put the file in HDFS and copy it to the local filesystem on the datanode during execution.
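A sketch of that workaround (the jar name and HDFS directory here are illustrative, not from the question): upload the jar to HDFS, list it as a <file> in the hive-2 action so Oozie localizes it into the container's working directory, and reference the localized copy from hive_2.sql.

# Upload the UDF jar to an HDFS location the workflow can read (illustrative path)
hdfs dfs -mkdir -p /apps/hive/udf
hdfs dfs -put -f my_udf.jar /apps/hive/udf/

# Then, in the hive-2 action, a file element such as
#   <file>/apps/hive/udf/my_udf.jar#my_udf.jar</file>
# makes the jar available locally, so hive_2.sql can use:
#   ADD JAR my_udf.jar;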

Simple oozie example of hive query?

I'm trying to convert a simple workflow to Oozie. I have tried looking through the Oozie examples, but they are a bit overwhelming. Effectively, I want to run a query and output the result to a text file:
hive -e 'select * from tables' > output.txt
How do I go about translating that into Oozie so that it runs every hour?
Your workflow might look something like this...
workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.2" name="hive-wf">
<start to="hive-node"/>
<action name="hive-node">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>localhost:50001</job-tracker>
<name-node>hdfs://localhost:50000</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>default</value>
</property>
<property>
<name>oozie.hive.defaults</name>
<value>/user/user1/oozie/hive-site.xml</value>
</property>
</configuration>
<script>script.q</script>
<param>INPUT_TABLE=SampleTable</param>
<param>OUTPUT=/user/user1/output-data/hive</param>
</hive>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Hive failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
Here, hive-site.xml is the site XML found in the $HIVE_HOME/conf folder, and the script.q file contains the actual Hive query, e.g. select * from ${INPUT_TABLE}.
How and where can we use the OUTPUT param?
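The OUTPUT param only has an effect if script.q actually references it. A sketch of how the two params line up (the INSERT statement is illustrative, not taken from the answer):

# Rough local equivalent of what the action's <param> entries do:
hive -d INPUT_TABLE=SampleTable -d OUTPUT=/user/user1/output-data/hive -f script.q

# script.q could then write its result into the OUTPUT directory, e.g.:
#   INSERT OVERWRITE DIRECTORY '${OUTPUT}'
#   SELECT * FROM ${INPUT_TABLE};

Running it every hour would then be a matter of wrapping this workflow in an Oozie coordinator with an hourly frequency.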
