Executing MapReduce job using Oozie workflow in Hue giving wrong output

I'm trying to execute a MapReduce job using an Oozie workflow in Hue. When I submit the job, Oozie executes successfully but I don't get the expected output. It seems that the mapper or the reducer is never invoked. Here is my workflow.xml:
<workflow-app name="wordCount" xmlns="uri:oozie:workflow:0.4">
<start to="wordcount"/>
<action name="wordcount">
<map-reduce>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.input.dir</name>
<value>/user/root/jane/inputPath</value>
</property>
<property>
<name>mapred.output.dir</name>
<value>/user/root/jane/outputPath17</value>
</property>
<property>
<name>mapred.mapper.class</name>
<value>MapReduceGenerateReports.Map</value>
</property>
<property>
<name>mapred.reducer.class</name>
<value>MapReduceGenerateReports.Reduce</value>
</property>
<property>
<name>mapred.mapper.new-api</name>
<value>true</value>
</property>
<property>
<name>mapred.reducer.new-api</name>
<value>true</value>
</property>
</configuration>
</map-reduce>
<ok to="end"/>
<error to="kill"/>
</action>
<kill name="kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
Can anyone please tell me what the problem is?
My new workflow.xml:
<workflow-app name="wordCount" xmlns="uri:oozie:workflow:0.4">
<start to="wordcount"/>
<action name="wordcount">
<map-reduce>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.input.dir</name>
<value>/user/root/jane/inputPath</value>
</property>
<property>
<name>mapred.output.dir</name>
<value>/user/root/jane/outputPath3</value>
</property>
<property>
<name>mapred.mapper.new-api</name>
<value>true</value>
</property>
<property>
<name>mapred.reducer.new-api</name>
<value>true</value>
</property>
<property>
<name>mapreduce.map.class</name>
<value>MapReduceGenerateReports$Map</value>
</property>
<property>
<name>mapreduce.reduce.class</name>
<value>MapReduceGenerateReports$Reduce</value>
</property>
<property>
<name> mapred.output.key.class</name>
<value>org.apache.hadoop.io.LongWritable</value>
</property>
<property>
<name>mapred.output.value.class</name>
<value>org.apache.hadoop.io.Text</value>
</property>
</configuration>
</map-reduce>
<ok to="end"/>
<error to="kill"/>
</action>
<kill name="kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
jobtracker log:
1)
Kind     % Complete   Num Tasks   Pending   Running   Complete   Killed   Failed/Killed Task Attempts
map      100.00%      1           0         0         1          0        0 / 0
reduce   100.00%      0           0         0         0          0        0 / 0
2)
Kind      Total Tasks (successful+failed+killed)   Successful tasks   Failed tasks   Killed tasks   Start Time            Finish Time
Setup     1                                         1                  0              0              5-Apr-2014 18:36:22   5-Apr-2014 18:36:23 (1sec)
Map       1                                         1                  0              0              5-Apr-2014 18:33:27   5-Apr-2014 18:33:33 (5sec)
Reduce    0                                         0                  0              0
Cleanup   1                                         1                  0              0              5-Apr-2014 18:33:33   5-Apr-2014 18:33:37 (4sec)

Check out the instructions for using the new API here.
However, if you really need to run MapReduce jobs written with the new (0.20) API in Oozie, below are the changes you need to make in workflow.xml:
change mapred.mapper.class to mapreduce.map.class
change mapred.reducer.class to mapreduce.reduce.class
add mapred.output.key.class
add mapred.output.value.class
and include the new-API switch properties in the MR action configuration (presumably the pair below, which the updated workflow above already sets):
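<!-- assumed: the new-API switch, as already present in the asker's updated workflow above -->
<property>
    <name>mapred.mapper.new-api</name>
    <value>true</value>
</property>
<property>
    <name>mapred.reducer.new-api</name>
    <value>true</value>
</property>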

Related

Pass an optional property from main oozie workflow to subworkflow

I have an HDFS_file_path or property that needs to be passed from workflow-1 to common_subworkflow.
I also have workflow-2 which doesn't have that property or HDFS_file_path. But workflow-2 calls common_subworkflow.
In common_subworkflow I am fetching the property value with ${HDFS_file_path}.
It works fine when workflow-1 calls common_subworkflow but fails when workflow-2 calls common_subworkflow since HDFS_file_path doesn't exist in workflow-2.
Is there any way to
read the dynamic property if present, or
set some default value(null or empty) if variable not present
<workflow-app name='hello-wf' xmlns="uri:oozie:workflow:0.4">
<parameters>
<property>
<name>inputDir</name>
</property>
<property>
<name>outputDir</name>
<value>out-dir</value>
</property>
</parameters>
...
<action name='firstjob'>
<map-reduce>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.mapper.class</name>
<value>com.foo.FirstMapper</value>
</property>
<property>
<name>mapred.reducer.class</name>
<value>com.foo.FirstReducer</value>
</property>
<property>
<name>mapred.input.dir</name>
<value>${inputDir}</value>
</property>
<property>
<name>mapred.output.dir</name>
<value>${outputDir}</value>
</property>
</configuration>
</map-reduce>
<ok to='secondjob'/>
<error to='killcleanup'/>
</action>
...
</workflow-app>
In the above example, if inputDir is not specified, Oozie will print an error message instead of submitting the job. If outputDir is not specified, Oozie will use the default value, out-dir.
Taken from https://oozie.apache.org/docs/3.3.1/WorkflowFunctionalSpec.html#a4.1_Workflow_Job_Properties_or_Parameters
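Applied to the question above, a minimal sketch (assuming an empty string is an acceptable default) would be to declare HDFS_file_path in common_subworkflow's <parameters> block so that workflow-2 can call it without setting the property:
<parameters>
    <property>
        <name>HDFS_file_path</name>
        <!-- assumed default: empty string when the calling workflow does not pass the property -->
        <value></value>
    </property>
</parameters>
With a <value> element present, Oozie uses that default instead of rejecting the submission, so both workflow-1 and workflow-2 can call common_subworkflow.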

OOZIE workflow: HIVE table does not exist but directory created in HDFS

I am trying to run a HIVE action using an OOZIE workflow. Below is the hive action:
create table abc (a INT);
I can locate the internal table in HDFS (a directory abc gets created under /user/hive/warehouse), but when I run SHOW TABLES from the hive> prompt, I am not able to see the table.
This is the workflow.xml file:
<workflow-app xmlns="uri:oozie:workflow:0.2" name="hive-wf">
<start to="hiveac"/>
<action name="hiveac">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<!-- <prepare> <delete path="${nameNode}/user/${wf:user()}/case1/out"/> </prepare> -->
<!-- <job-xml>hive-default.xml</job-xml>-->
<configuration>
<property>
<name>oozie.hive.defaults</name>
<value>hive-default.xml</value>
</property>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<script>script.q</script>
<!-- <param>INPUT=/user/${wf:user()}/case1/sales_history_temp4</param>
<param>OUTPUT=/user/${wf:user()}/case1/out</param> -->
</hive>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Pig Script failed!!!</message>
</kill>
<end name="end"/>
</workflow-app>
This is the hive-default.xml file:
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost/metastore</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.apache.derby.jdbc.EmbeddedDriver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hiveuser</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>password</value>
</property>
<property>
<name>datanucleus.autoCreateSchema</name>
<value>false</value>
</property>
<property>
<name>datanucleus.fixedDatastore</name>
<value>true</value>
</property>
<property>
<name>hive.stats.autogather</name>
<value>false</value>
</property>
</configuration>
This is the job.properties file:
nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
queueName=default
oozie.libpath=/user/oozie/shared/lib
#oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/my/jobhive
The logs did not give any errors as such:
stderr logs
Logging initialized using configuration in jar:file:/var/lib/hadoop-hdfs/cache/mapred/mapred/local/taskTracker/distcache/3179985539753819871_-620577179_884768063/localhost/user/oozie/shared/lib/hive-common-0.9.0-cdh4.1.1.jar!/hive-log4j.properties
Hive history file=/tmp/mapred/hive_job_log_mapred_201603060735_17840386.txt
OK
Time taken: 9.322 seconds
Log file: /var/lib/hadoop-hdfs/cache/mapred/mapred/local/taskTracker/training/jobcache/job_201603060455_0012/attempt_201603060455_0012_m_000000_0/work/hive-oozie-job_201603060455_0012.log not present. Therefore no Hadoop jobids found
I came across a similar thread: Tables created by oozie hive action cannot be found from hive client but can find them in HDFS
But this did not resolve my issue. Please let me know how to resolve this issue.
I haven't used Oozie for a couple of months (and did not keep archives for legal reasons), and anyway it was V4.x, so it's a bit of guesswork...
upload your valid hive-site.xml to HDFS somewhere
tell Oozie to inject all these properties in the Launcher Configuration before running the Hive class, so that it inherits them all, with
<job-xml>/some/hdfs/path/hive-site.xml</job-xml>
remove any reference to oozie.hive.defaults
Warning: all that assumes that your sandbox cluster has a persistent Metastore -- i.e. your hive-site.xml does not point to a Derby embedded DB that gets erased every time!
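Put together, a sketch of the revised action (assuming hive-site.xml was uploaded to /user/my/jobhive/hive-site.xml; adjust the HDFS path to wherever you actually put it) might look like:
<action name="hiveac">
    <hive xmlns="uri:oozie:hive-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <!-- assumed HDFS location of the real hive-site.xml; injected into the launcher config -->
        <job-xml>/user/my/jobhive/hive-site.xml</job-xml>
        <configuration>
            <!-- no oozie.hive.defaults property here -->
            <property>
                <name>mapred.job.queue.name</name>
                <value>${queueName}</value>
            </property>
        </configuration>
        <script>script.q</script>
    </hive>
    <ok to="end"/>
    <error to="fail"/>
</action>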

hadoop streaming workflow multiple files

I am trying to write a workflow with a hadoop streaming action that executes an awk program. Below is my scenario.
The Hadoop streaming command works fine from the client. However, when executed as an Oozie workflow it does not work, as it is not able to find the second file. Please note the awk script is in the local home directory, which is mounted on hadoop as well, and the input paths are on HDFS.
In sample.awk (code attached below) I am passing two variables, $1 and $2, which should get data from file1 and file2.
Below is the command I run from the CLI. I have also attached the streaming workflow that I configured from Hue, which is not working as expected.
/usr/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.3.0-mr1-cdh5.1.0.jar -D mapreduce.job.reduces=0 -D mapred.reduce.tasks=0 -input /user/cloudera/input/file1 /user/cloudera/input/file2 -output /user/cloudera/awk/ouput -mapper /home/cloudera/diff_files/op_code/sample.awk -file /home/cloudera/diff_files/op_code/sample.awk
Workflow.xml
------------------
<workflow-app name="awk" xmlns="uri:oozie:workflow:0.4">
<global>
<configuration>
<property>
<name></name>
<value></value>
</property>
</configuration>
</global>
<start to="awk-streaming"/>
<action name="awk-streaming" cred="">
<map-reduce>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<streaming>
<mapper>/home/clouderasample.awk</mapper>
<reducer>/home/clouderasample.awk</reducer>
</streaming>
<configuration>
<property>
<name>mapred.output.dir</name>
<value>/user/cloudera/awk/output</value>
</property>
<property>
<name>oozie.use.system.libpath</name>
<value>true</value>
</property>
<property>
<name>mapred.input.dir</name>
<value>/user/cloudera/awk/input</value>
</property>
</configuration>
<file>/user/cloudera/awk/input/file1#file1</file>
<file>/user/cloudera/awk/input/file2#file2</file>
</map-reduce>
<ok to="end"/>
<error to="kill"/>
</action>
<kill name="kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
Kindly see this link for more details
http://wiki.apache.org/hadoop/JobConfFile
<property>
<name>mapred.input.dir</name>
<value>/user/cloudera/awk/input/file1,/user/cloudera/awk/input/file2</value>
<description>A comma separated list of input directories.</description>
</property>

LeaseExpiredException while running oozie fork

We are trying to run an Oozie workflow with 3 sub-workflows running in parallel using fork. Each sub-workflow contains a node running a native map reduce job, followed by two nodes running some complex PIG jobs. Finally the three sub-workflows are joined to a single end node.
When we run this workflow, we get a LeaseExpiredException. The exception occurs randomly while running the PIG jobs. There is no definite place where it occurs, but it occurs every time we run the WF.
Also, if we remove the fork and run the sub-workflows sequentially, it works fine. However, our expectation is to have them run in parallel and save some execution time.
Can you please help me understand this issue and give some pointers on where we could be going wrong? We are starting with hadoop development and haven't faced such an issue earlier.
It looks like, with several tasks running in parallel, one thread closed a part file, and when another thread tried to close the same file, it threw the error.
Following is the stack trace of the exception from the hadoop logs.
2013-02-19 10:23:54,815 INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher: 57% complete
2013-02-19 10:26:55,361 INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher: 59% complete
2013-02-19 10:27:59,666 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file <hdfspath>/oozie-oozi/0000105-130218000850190-oozie-oozi-W/aggregateData--pig/output/_temporary/_attempt_201302180007_0380_m_000000_0/part-00000 : org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on <hdfspath>/oozie-oozi/0000105-130218000850190-oozie-oozi-W/aggregateData--pig/output/_temporary/_attempt_201302180007_0380_m_000000_0/part-00000 File does not exist. Holder DFSClient_attempt_201302180007_0380_m_000000_0 does not have any open files.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1664)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1655)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:1710)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:1698)
at org.apache.hadoop.hdfs.server.namenode.NameNode.complete(NameNode.java:793)
at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1439)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1435)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1433)
Following is the sample for main workflow and one sub-workflow.
Main Work-Flow:
<workflow-app xmlns="uri:oozie:workflow:0.2" name="MainProcess">
<start to="forkProcessMain"/>
<fork name="forkProcessMain">
<path start="Proc1"/>
<path start="Proc2"/>
<path start="Proc3"/>
</fork>
<join name="joinProcessMain" to="end"/>
<action name="Proc1">
<sub-workflow>
<app-path>${nameNode}${wfPath}/proc1_workflow.xml</app-path>
<propagate-configuration/>
</sub-workflow>
<ok to="joinProcessMain"/>
<error to="fail"/>
</action>
<action name="Proc2">
<sub-workflow>
<app-path>${nameNode}${wfPath}/proc2_workflow.xml</app-path>
<propagate-configuration/>
</sub-workflow>
<ok to="joinProcessMain"/>
<error to="fail"/>
</action>
<action name="Proc3">
<sub-workflow>
<app-path>${nameNode}${wfPath}/proc3_workflow.xml</app-path>
<propagate-configuration/>
</sub-workflow>
<ok to="joinProcessMain"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>WF Failure, 'wf:lastErrorNode()' failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
Sub-WorkFlow:
<workflow-app xmlns="uri:oozie:workflow:0.2" name="Sub Process">
<start to="Step1"/>
<action name="Step1">
<java>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${step1JoinOutputPath}"/>
</prepare>
<configuration>
<property>
<name>mapred.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<main-class>com.absd.mr.step1</main-class>
<arg>${wf:name()}</arg>
<arg>${wf:id()}</arg>
<arg>${tbMasterDataOutputPath}</arg>
<arg>${step1JoinOutputPath}</arg>
<arg>${tbQueryKeyPath}</arg>
<capture-output/>
</java>
<ok to="generateValidQueryKeys"/>
<error to="fail"/>
</action>
<action name="generateValidQueryKeys">
<pig>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${tbValidQuerysOutputPath}"/>
</prepare>
<configuration>
<property>
<name>pig.tmpfilecompression</name>
<value>true</value>
</property>
<property>
<name>pig.tmpfilecompression.codec</name>
<value>lzo</value>
</property>
<property>
<name>pig.output.map.compression</name>
<value>true</value>
</property>
<property>
<name>pig.output.map.compression.codec</name>
<value>lzo</value>
</property>
<property>
<name>pig.output.compression</name>
<value>true</value>
</property>
<property>
<name>pig.output.compression.codec</name>
<value>lzo</value>
</property>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
</configuration>
<script>${pigDir}/tb_calc_valid_accounts.pig</script>
<param>csvFilesDir=${csvFilesDir}</param>
<param>step1JoinOutputPath=${step1JoinOutputPath}</param>
<param>tbValidQuerysOutputPath=${tbValidQuerysOutputPath}</param>
<param>piMinFAs=${piMinFAs}</param>
<param>piMinAccounts=${piMinAccounts}</param>
<param>parallel=80</param>
</pig>
<ok to="aggregateAumData"/>
<error to="fail"/>
</action>
<action name="aggregateAumData">
<pig>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${tbCacheDataPath}"/>
</prepare>
<configuration>
<property>
<name>pig.tmpfilecompression</name>
<value>true</value>
</property>
<property>
<name>pig.tmpfilecompression.codec</name>
<value>lzo</value>
</property>
<property>
<name>pig.output.map.compression</name>
<value>true</value>
</property>
<property>
<name>pig.output.map.compression.codec</name>
<value>lzo</value>
</property>
<property>
<name>pig.output.compression</name>
<value>true</value>
</property>
<property>
<name>pig.output.compression.codec</name>
<value>lzo</value>
</property>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
</configuration>
<script>${pigDir}/aggregationLogic.pig</script>
<param>csvFilesDir=${csvFilesDir}</param>
<param>tbValidQuerysOutputPath=${tbValidQuerysOutputPath}</param>
<param>tbCacheDataPath=${tbCacheDataPath}</param>
<param>currDate=${date}</param>
<param>udfJarPath=${nameNode}${wfPath}/lib</param>
<param>parallel=150</param>
</pig>
<ok to="loadDataToDB"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>WF Failure, 'wf:lastErrorNode()' failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
We got the same error when we were running three Pig actions in parallel and one of them failed. That error message is a consequence of an unexpected workflow stop: because one action failed, the workflow was stopped and the other actions were trying to continue. You must look at the failed action with status ERROR to know what happened; don't look at the actions with status KILLED.

Submitting applications externally via REST APIs

Is there currently a way to submit applications externally via the supplied REST APIs for MapReduceV1 and/or YARN? I'm hoping to find a way to do this without adding a custom service.
So far I've only figured out how to GET the application status from the ResourceManager using YARN.
Maybe I'm looking at this the wrong way and there's a better way to do this externally?
So after doing some research, I've decided that the Oozie Workflow Scheduler is the way to go.
This is a sample workflow that can be submitted to a REST endpoint running inside your Hadoop system to start a MapReduce job. <action>s are not limited to MapReduce.
<workflow-app xmlns='uri:oozie:workflow:0.1' name='map-reduce-wf'>
<start to='hadoop1' />
<action name='hadoop1'>
<map-reduce>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.mapper.class</name>
<value>org.apache.oozie.example.SampleMapper</value>
</property>
<property>
<name>mapred.reducer.class</name>
<value>org.apache.oozie.example.SampleReducer</value>
</property>
<property>
<name>mapred.map.tasks</name>
<value>1</value>
</property>
<property>
<name>mapred.input.dir</name>
<value>input-data</value>
</property>
<property>
<name>mapred.output.dir</name>
<value>output-map-reduce</value>
</property>
<property>
<name>mapred.job.queue.name</name>
<value>unfunded</value>
</property>
</configuration>
</map-reduce>
<ok to="end" />
<error to="fail" />
</action>
<kill name="fail">
<message>Map/Reduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name='end' />
</workflow-app>
Sample taken from https://github.com/yahoo/oozie/wiki/Oozie-WF-use-cases
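To actually submit it, the workflow application (workflow.xml plus its lib/ jars) is deployed to HDFS and an XML configuration is POSTed to the Oozie jobs endpoint (for example http://oozie-host:11000/oozie/v1/jobs?action=start with Content-Type application/xml). A sketch of that payload, with placeholder host names and paths, might be:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <!-- placeholder values; point these at your own cluster and HDFS application path -->
    <property>
        <name>user.name</name>
        <value>someuser</value>
    </property>
    <property>
        <name>oozie.wf.application.path</name>
        <value>hdfs://namenode:8020/user/someuser/map-reduce-wf</value>
    </property>
    <property>
        <name>jobTracker</name>
        <value>jobtracker-host:8021</value>
    </property>
    <property>
        <name>nameNode</name>
        <value>hdfs://namenode:8020</value>
    </property>
</configuration>
Oozie responds with a job id that can then be polled over the same REST API, which covers the external-submission requirement in the question.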