Hadoop streaming workflow with multiple files - hadoop

I am trying to write a workflow with a Hadoop streaming action that executes an awk program. Below is my scenario.
The Hadoop streaming command works fine from the client. However, when executed as an Oozie workflow it does not work, as it is not able to find the second file. Please note that the awk script is in my local home directory, which is mounted on the Hadoop nodes as well, and the input paths are on HDFS.
In sample.awk (code attached below) I am passing two variables, $1 and $2, which should get data from file1 and file2.
Below is the command I run from the CLI. I have also attached the streaming workflow I configured from Hue, which is not working as expected.
/usr/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.3.0-mr1-cdh5.1.0.jar -D mapreduce.job.reduces=0 -D mapred.reduce.tasks=0 -input /user/cloudera/input/file1 /user/cloudera/input/file2 -output /user/cloudera/awk/ouput -mapper /home/cloudera/diff_files/op_code/sample.awk -file /home/cloudera/diff_files/op_code/sample.awk
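For reference, the streaming jar normally expects one path per -input option; a sketch of the same command with each file given its own -input, everything else kept as above, would be:
# Sketch only: same job, but each input path passed with its own -input option
/usr/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.3.0-mr1-cdh5.1.0.jar \
  -D mapreduce.job.reduces=0 -D mapred.reduce.tasks=0 \
  -input /user/cloudera/input/file1 \
  -input /user/cloudera/input/file2 \
  -output /user/cloudera/awk/ouput \
  -mapper /home/cloudera/diff_files/op_code/sample.awk \
  -file /home/cloudera/diff_files/op_code/sample.awk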
Workflow.xml
------------------
<workflow-app name="awk" xmlns="uri:oozie:workflow:0.4">
<global>
<configuration>
<property>
<name></name>
<value></value>
</property>
</configuration>
</global>
<start to="awk-streaming"/>
<action name="awk-streaming" cred="">
<map-reduce>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<streaming>
<mapper>/home/clouderasample.awk</mapper>
<reducer>/home/clouderasample.awk</reducer>
</streaming>
<configuration>
<property>
<name>mapred.output.dir</name>
<value>/user/cloudera/awk/output</value>
</property>
<property>
<name>oozie.use.system.libpath</name>
<value>true</value>
</property>
<property>
<name>mapred.input.dir</name>
<value>/user/cloudera/awk/input</value>
</property>
</configuration>
<file>/user/cloudera/awk/input/file1#file1</file>
<file>/user/cloudera/awk/input/file2#file2</file>
</map-reduce>
<ok to="end"/>
<error to="kill"/>
</action>
<kill name="kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>

Kindly see this link for more details: http://wiki.apache.org/hadoop/JobConfFile
The mapred.input.dir property takes a comma-separated list of input paths, so both files can be set in a single property:
<property>
<name>mapred.input.dir</name>
<value>/user/cloudera/awk/input/file1,/user/cloudera/awk/input/file2</value>
<description>A comma separated list of input directories.</description>
</property>

Related

Shell script that validates files in HDFS

I'm trying to make a script that checks whether any file is missing in an HDFS path. The idea is to include it in an Oozie workflow so that, when a file is not found, the workflow fails and does not continue with the flow.
ALTRAFI="/input/ALTrafi*.txt"
ALTDATOS="/ALTDatos*.txt"
ALTARVA="/ALTarva*.txt"
TRAFCIER="/TrafCier*.txt"
if hdfs dfs -test -e $ALTRAFI; then
echo "[$ALTRAFI] Archive not found"
exit 1
fi
if hdfs dfs -test -e $ALTDATOS; then
echo "[$ALTDATOS] Archive not found"
exit 2
fi
if hdfs dfs -test -e $ALTARVA; then
echo "[$ALTARVA] Archive not found"
exit 3
fi
if hdfs dfs -test -e $TRAFCIER; then
echo "[$TRAFCIER] Archive not found"
exit 4
fi
But the script does not fail when it does not find a file, and the Oozie workflow continues.
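For reference, hdfs dfs -test -e exits with 0 when the path exists, so the conditions above fire when the files are present rather than when they are missing; a minimal sketch of the inverted check (same patterns assumed, only the first one shown) would be:
#!/bin/bash
ALTRAFI="/input/ALTrafi*.txt"
# Negate the test: exit non-zero only when no matching file exists in HDFS
if ! hdfs dfs -test -e $ALTRAFI; then
  echo "[$ALTRAFI] Archive not found"
  exit 1
fi
# ...repeat the negated test for ALTDATOS, ALTARVA and TRAFCIER...
With the shell action below, a non-zero exit status from the script should then drive the error transition to the Kill node.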
Oozie flow:
<start to="ValidateFiles"/>
<kill name="Kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<action name="ValidateFiles">
<shell xmlns="uri:oozie:shell-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
<property>
<name>tez.lib.uris</name>
<value>/hdp/apps/2.5.0.0-1245/tez/tez.tar.gz</value>
</property>
</configuration>
<exec>/produccion/apps/traficotasado/ValidateFiles.sh</exec>
<file>/produccion/apps/traficotasado/ValidateFiles.sh#ValidateFiles.sh</file> <!--Copy the executable to compute node's current working directory -->
</shell>
<ok to="CopyFiles"/>
<error to="Kill"/>
</action>
<action name="CopyFiles">
<shell xmlns="uri:oozie:shell-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
<property>
<name>tez.lib.uris</name>
<value>/hdp/apps/2.5.0.0-1245/tez/tez.tar.gz</value>
</property>
</configuration>
<exec>/produccion/apps/traficotasado/CopyFiles.sh</exec>
<file>/produccion/apps/traficotasado/CopyFiles.sh#CopyFiles.sh</file> <!--Copy the executable to compute node's current working directory -->
</shell>
<ok to="DepuraFilesStage"/>
<error to="Kill"/>
</action>
Thanks for the help

Oozie flow not able to run hive-ql having UDF add command

I am creating an Oozie workflow in which I call Hive SQL scripts sequentially.
The first SQL script has simple transformation logic, while the second has a temporary function creation command and commands to add lookup files. I am using this UDF further on in the SQL.
ADD JAR **;
CREATE TEMPORARY FUNCTION XXXXX AS ...;
ADD FILE *;
<workflow-app xmlns="uri:oozie:workflow:0.4" name="hive-wf">
<credentials>
<credential name="hive_credentials" type="hcat">
<property>
<name>hcat.metastore.uri</name>
<value>XXXXXXXX</value>
</property>
<property>
<name>hcat.metastore.principal</name>
<value>XXXXXXXX</value>
</property>
</credential>
</credentials>
<start to="hive-1" />
<action name="hive-1" cred="hive_credentials">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>XXXXXXX</job-tracker>
<name-node>XXXXXXX</name-node>
<job-xml>/XXXXXX/oozie/oozie-hive-site.xml</job-xml>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>default</value>
</property>
</configuration>
<script>/XXXXXXX/hive_1.sql</script>
</hive>
<ok to="hive-2" />
<error to="fail" />
</action>
<action name="hive-2" cred="hive_credentials">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>XXXXXXXX</job-tracker>
<name-node>XXXXXXXX</name-node>
<job-xml>/XXXXXX/oozie/oozie-hive-site.xml</job-xml>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>default</value>
</property>
</configuration>
<script>/XXXXXXX/hive_2.sql</script>
</hive>
<ok to="end" />
<error to="fail" />
</action>
<kill name="fail">
<message>Hive failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end" />
The first HQL script executes successfully. The workflow is killed while executing the second HQL script, giving the error below.
JOB[0000044-140317190624992-oozie-oozi-W] ACTION[-] E1100: Command precondition does not hold before execution, [, coord action is null], Error Code: E1100
It throws the error while executing the commands that add the UDF (ADD JAR, CREATE TEMPORARY FUNCTION, ADD FILE).
I searched on this error and found some links suggesting it simply be ignored, but my actual SQL, which uses the Hive UDF in the second HQL script, is not executed.
Can you please help?
The ADD JAR path is a local machine path.
Oozie actions run on data nodes, so it is possible the jar cannot be found on the data node, which causes the error (check the MapReduce logs; you will find the reason there).
If this is the issue, one workaround is to put the file in HDFS and copy it to the local filesystem on the data node during execution.
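A hedged sketch of that workaround against the workflow above: upload the jar (and lookup file) to HDFS, ship them into the action's working directory with <file> elements, and reference them by their symlink names from the script. The HDFS paths and class name below are only illustrative.
<action name="hive-2" cred="hive_credentials">
<hive xmlns="uri:oozie:hive-action:0.2">
...
<script>/XXXXXXX/hive_2.sql</script>
<!-- illustrative HDFS locations of the UDF jar and lookup file -->
<file>/XXXXXX/oozie/lib/my_udf.jar#my_udf.jar</file>
<file>/XXXXXX/oozie/lib/lookup.txt#lookup.txt</file>
</hive>
...
</action>
hive_2.sql can then refer to the shipped copies relative to the working directory:
ADD JAR my_udf.jar;
CREATE TEMPORARY FUNCTION XXXXX AS 'com.example.MyUdf';
ADD FILE lookup.txt;
Placing the jar in the workflow application's lib/ directory on HDFS is another commonly suggested option, since Oozie adds those jars to the action's classpath.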

LeaseExpiredException while running oozie fork

We are trying to run an Oozie workflow with three sub-workflows running in parallel using a fork. Each sub-workflow contains a node running a native MapReduce job and two subsequent nodes running some complex Pig jobs. Finally, the three sub-workflows are joined to a single end node.
When we run this workflow, we get a LeaseExpiredException. The exception occurs at random points while running the Pig jobs; there is no definite place where it occurs, but it occurs every time we run the workflow.
Also, if we remove the fork and run the sub-workflows sequentially, it works fine. However, our expectation is to have them run in parallel and save some execution time.
Can you please help me understand this issue and give some pointers on where we could be going wrong? We are just starting with Hadoop development and haven't faced such an issue before.
It looks like, with several tasks running in parallel, one of the threads closed a part file, and when another thread tried to close the same file, it threw the error.
Following is the stack trace of the exception from the Hadoop logs.
2013-02-19 10:23:54,815 INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher: 57% complete
2013-02-19 10:26:55,361 INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher: 59% complete
2013-02-19 10:27:59,666 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file <hdfspath>/oozie-oozi/0000105-130218000850190-oozie-oozi-W/aggregateData--pig/output/_temporary/_attempt_201302180007_0380_m_000000_0/part-00000 : org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on <hdfspath>/oozie-oozi/0000105-130218000850190-oozie-oozi-W/aggregateData--pig/output/_temporary/_attempt_201302180007_0380_m_000000_0/part-00000 File does not exist. Holder DFSClient_attempt_201302180007_0380_m_000000_0 does not have any open files.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1664)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1655)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:1710)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:1698)
at org.apache.hadoop.hdfs.server.namenode.NameNode.complete(NameNode.java:793)
at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1439)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1435)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1433)
Below are samples of the main workflow and one sub-workflow.
Main workflow:
<workflow-app xmlns="uri:oozie:workflow:0.2" name="MainProcess">
<start to="forkProcessMain"/>
<fork name="forkProcessMain">
<path start="Proc1"/>
<path start="Proc2"/>
<path start="Proc3"/>
</fork>
<join name="joinProcessMain" to="end"/>
<action name="Proc1">
<sub-workflow>
<app-path>${nameNode}${wfPath}/proc1_workflow.xml</app-path>
<propagate-configuration/>
</sub-workflow>
<ok to="joinProcessMain"/>
<error to="fail"/>
</action>
<action name="Proc2">
<sub-workflow>
<app-path>${nameNode}${wfPath}/proc2_workflow.xml</app-path>
<propagate-configuration/>
</sub-workflow>
<ok to="joinProcessMain"/>
<error to="fail"/>
</action>
<action name="Proc3">
<sub-workflow>
<app-path>${nameNode}${wfPath}/proc3_workflow.xml</app-path>
<propagate-configuration/>
</sub-workflow>
<ok to="joinProcessMain"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>WF Failure, 'wf:lastErrorNode()' failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
Sub-WorkFlow:
<workflow-app xmlns="uri:oozie:workflow:0.2" name="Sub Process">
<start to="Step1"/>
<action name="Step1">
<java>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${step1JoinOutputPath}"/>
</prepare>
<configuration>
<property>
<name>mapred.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<main-class>com.absd.mr.step1</main-class>
<arg>${wf:name()}</arg>
<arg>${wf:id()}</arg>
<arg>${tbMasterDataOutputPath}</arg>
<arg>${step1JoinOutputPath}</arg>
<arg>${tbQueryKeyPath}</arg>
<capture-output/>
</java>
<ok to="generateValidQueryKeys"/>
<error to="fail"/>
</action>
<action name="generateValidQueryKeys">
<pig>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${tbValidQuerysOutputPath}"/>
</prepare>
<configuration>
<property>
<name>pig.tmpfilecompression</name>
<value>true</value>
</property>
<property>
<name>pig.tmpfilecompression.codec</name>
<value>lzo</value>
</property>
<property>
<name>pig.output.map.compression</name>
<value>true</value>
</property>
<property>
<name>pig.output.map.compression.codec</name>
<value>lzo</value>
</property>
<property>
<name>pig.output.compression</name>
<value>true</value>
</property>
<property>
<name>pig.output.compression.codec</name>
<value>lzo</value>
</property>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
</configuration>
<script>${pigDir}/tb_calc_valid_accounts.pig</script>
<param>csvFilesDir=${csvFilesDir}</param>
<param>step1JoinOutputPath=${step1JoinOutputPath}</param>
<param>tbValidQuerysOutputPath=${tbValidQuerysOutputPath}</param>
<param>piMinFAs=${piMinFAs}</param>
<param>piMinAccounts=${piMinAccounts}</param>
<param>parallel=80</param>
</pig>
<ok to="aggregateAumData"/>
<error to="fail"/>
</action>
<action name="aggregateAumData">
<pig>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${tbCacheDataPath}"/>
</prepare>
<configuration>
<property>
<name>pig.tmpfilecompression</name>
<value>true</value>
</property>
<property>
<name>pig.tmpfilecompression.codec</name>
<value>lzo</value>
</property>
<property>
<name>pig.output.map.compression</name>
<value>true</value>
</property>
<property>
<name>pig.output.map.compression.codec</name>
<value>lzo</value>
</property>
<property>
<name>pig.output.compression</name>
<value>true</value>
</property>
<property>
<name>pig.output.compression.codec</name>
<value>lzo</value>
</property>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
</configuration>
<script>${pigDir}/aggregationLogic.pig</script>
<param>csvFilesDir=${csvFilesDir}</param>
<param>tbValidQuerysOutputPath=${tbValidQuerysOutputPath}</param>
<param>tbCacheDataPath=${tbCacheDataPath}</param>
<param>currDate=${date}</param>
<param>udfJarPath=${nameNode}${wfPath}/lib</param>
<param>parallel=150</param>
</pig>
<ok to="loadDataToDB"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>WF Failure, 'wf:lastErrorNode()' failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
We got the same error when we were running three Pig actions in parallel and one of them failed. That error message is a consequence of an unexpected workflow stop: because one action failed, the workflow is stopped while the other actions are still trying to continue. You must look at the failed action, the one with status ERROR, to know what happened; don't look at the actions with status KILLED.
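To find that failed action, the Oozie CLI can list per-action status and error codes; roughly (the Oozie URL is a placeholder, the job id is the one from the stack trace):
# Lists every action of the workflow with its status (OK, ERROR, KILLED, ...) and error code
oozie job -oozie http://oozie-host:11000/oozie -info 0000105-130218000850190-oozie-oozi-W
# Fetches the Oozie log for the same workflow
oozie job -oozie http://oozie-host:11000/oozie -log 0000105-130218000850190-oozie-oozi-W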

Submitting applications externally via REST APIs

Is there currently a way to submit applications externally via the supplied REST APIs for MapReduceV1 and/or YARN? I'm hoping to find a way to do this without adding a custom service.
So far I've only figured out how to GET the application status from the ResourceManager using YARN.
Maybe I'm looking at this the wrong way and there's a better way to do this externally?
So after doing some research, I've decided that the Oozie Workflow Scheduler is the way to go.
This is a sample workflow that can be submitted to a REST endpoint running inside your Hadoop system to start a MapReduce job. <action>s are not limited to MapReduce.
<workflow-app xmlns='uri:oozie:workflow:0.1' name='map-reduce-wf'>
<start to='hadoop1' />
<action name='hadoop1'>
<map-reduce>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.mapper.class</name>
<value>org.apache.oozie.example.SampleMapper</value>
</property>
<property>
<name>mapred.reducer.class</name>
<value>org.apache.oozie.example.SampleReducer</value>
</property>
<property>
<name>mapred.map.tasks</name>
<value>1</value>
</property>
<property>
<name>mapred.input.dir</name>
<value>input-data</value>
</property>
<property>
<name>mapred.output.dir</name>
<value>output-map-reduce</value>
</property>
<property>
<name>mapred.job.queue.name</name>
<value>unfunded</value>
</property>
</configuration>
</map-reduce>
<ok to="end" />
<error to="fail" />
</action>
<kill name="fail">
<message>Map/Reduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name='end' />
</workflow-app>
Sample taken from https://github.com/yahoo/oozie/wiki/Oozie-WF-use-cases
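For the submission itself, Oozie exposes a REST API (default port 11000); a hedged sketch, with host, user and HDFS paths as placeholders. The workflow parameters go into a small configuration file, say job-config.xml:
<configuration>
<property><name>user.name</name><value>someuser</value></property>
<property><name>oozie.wf.application.path</name><value>hdfs://namenode:8020/user/someuser/apps/map-reduce-wf</value></property>
<property><name>jobTracker</name><value>jobtracker-host:8021</value></property>
<property><name>nameNode</name><value>hdfs://namenode:8020</value></property>
</configuration>
which is then POSTed to the jobs endpoint; the JSON response contains the id of the new workflow job:
curl -X POST -H "Content-Type: application/xml;charset=UTF-8" \
  -d @job-config.xml \
  "http://oozie-host:11000/oozie/v1/jobs?action=start"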

Simple oozie example of hive query?

I'm trying to convert a simple workflow to Oozie. I have tried looking through the Oozie examples, but they are a bit overwhelming. Effectively, I want to run a query and output the result to a text file.
hive -e 'select * from tables' > output.txt
How do I go about translating that into Oozie so that it runs every hour?
Your workflow might look something like this...
workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.2" name="hive-wf">
<start to="hive-node"/>
<action name="hive-node">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>localhost:50001</job-tracker>
<name-node>hdfs://localhost:50000</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>default</value>
</property>
<property>
<name>oozie.hive.defaults</name>
<value>/user/user1/oozie/hive-site.xml</value>
</property>
</configuration>
<script>script.q</script>
<param>INPUT_TABLE=SampleTable</param>
<param>OUTPUT=/user/user1/output-data/hive</param>
</hive>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Hive failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
Here, hive-site.xml is the site XML present in the $HIVE_HOME/conf folder.
The script.q file contains the actual Hive query, e.g. select * from ${INPUT_TABLE}.
How and where can we use the OUTPUT param?
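Regarding the OUTPUT param, one common pattern (a sketch, not part of the original answer) is for script.q to write its result set to that directory with Hive's INSERT OVERWRITE DIRECTORY; ${INPUT_TABLE} and ${OUTPUT} are filled in from the <param> elements above:
INSERT OVERWRITE DIRECTORY '${OUTPUT}'
SELECT * FROM ${INPUT_TABLE};
For the hourly schedule, the workflow is typically wrapped in an Oozie coordinator; a minimal sketch, with dates and the application path as placeholders:
<coordinator-app name="hive-hourly" frequency="${coord:hours(1)}"
    start="2013-01-01T00:00Z" end="2014-01-01T00:00Z" timezone="UTC"
    xmlns="uri:oozie:coordinator:0.2">
<action>
<workflow>
<app-path>hdfs://localhost:50000/user/user1/oozie/hive-wf</app-path>
</workflow>
</action>
</coordinator-app>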
