Pass an optional property from main oozie workflow to subworkflow - hadoop

I have a property, HDFS_file_path, that needs to be passed from workflow-1 to common_subworkflow.
I also have workflow-2, which doesn't have that property, but workflow-2 also calls common_subworkflow.
In common_subworkflow I fetch the property value with ${HDFS_file_path}.
This works fine when workflow-1 calls common_subworkflow, but fails when workflow-2 calls it, since HDFS_file_path doesn't exist in workflow-2.
Is there any way to either:
- read the dynamic property if present, or
- set some default value (null or empty) if the variable is not present?

<workflow-app name='hello-wf' xmlns="uri:oozie:workflow:0.4">
    <parameters>
        <property>
            <name>inputDir</name>
        </property>
        <property>
            <name>outputDir</name>
            <value>out-dir</value>
        </property>
    </parameters>
    ...
    <action name='firstjob'>
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>com.foo.FirstMapper</value>
                </property>
                <property>
                    <name>mapred.reducer.class</name>
                    <value>com.foo.FirstReducer</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to='secondjob'/>
        <error to='killcleanup'/>
    </action>
    ...
</workflow-app>
In the above example, if inputDir is not specified, Oozie will print an error message instead of submitting the job. If outputDir is not specified, Oozie will use the default value, out-dir.
Taken from https://oozie.apache.org/docs/3.3.1/WorkflowFunctionalSpec.html#a4.1_Workflow_Job_Properties_or_Parameters
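Applied to the original question, a minimal sketch: declare HDFS_file_path in the subworkflow's <parameters> with an empty default, so workflow-2 can call it unchanged while workflow-1 passes the real value. The file name, app-path, and the empty default are assumptions, not something verified against every Oozie version:

<!-- common_subworkflow.xml: an empty default lets callers omit the property -->
<workflow-app name="common-subwf" xmlns="uri:oozie:workflow:0.4">
    <parameters>
        <property>
            <name>HDFS_file_path</name>
            <value></value>
        </property>
    </parameters>
    ...
</workflow-app>

<!-- workflow-1: pass the real value in its sub-workflow action -->
<sub-workflow>
    <app-path>${nameNode}/apps/common_subworkflow</app-path>
    <configuration>
        <property>
            <name>HDFS_file_path</name>
            <value>${HDFS_file_path}</value>
        </property>
    </configuration>
</sub-workflow>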

Related

E0405: Submission request doesn't have any application or lib path

It's the first time I'm running a MapReduce program from Oozie.
Here is my job.properties file:
nameNode=file:/usr/local/hadoop_store/hdfs/namenode
jobTracker=localhost:8088
queueName=default
oozie.wf.applications.path=${nameNode}/Config
Here is my hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
        <description>Default block replication.
        The actual number of replications can be specified when the file is created.
        The default is used if replication is not specified in create time.
        </description>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
    </property>
</configuration>
Here is my core-site.xml
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/app/hadoop/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <property>
        <name>hadoop.proxyuser.hduser.hosts</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.proxyuser.hduser.groups</name>
        <value>*</value>
    </property>
</configuration>
But when I run the Oozie command to run my MapReduce program, it gives an error that the lib folder is not found. Error: E0405 : E0405: Submission request doesn't have any application or lib path
oozie job -oozie http://localhost:11000/oozie -config job.properties -run
I've created a Config folder in HDFS and inside it a lib folder. I placed my MapReduce jar file in the lib folder and my workflow.xml file in the Config folder. (It's all in HDFS.)
I think I've given the wrong HDFS path (nameNode) in the job.properties file. That's why it's not able to find ${nameNode}/Config. What should the HDFS path be?
Thanks
Update 1: job.properties
nameNode=hdfs://localhost:8020
jobTracker=localhost:8088
queueName=default
oozie.wf.applications.path=${nameNode}/Config
Still getting the same error:
Error: E0405 : E0405: Submission request doesn't have any application or lib path
Update 2: workflow.xml in the Config folder in HDFS.
<workflow-app xmlns="uri:oozie:workflow:0.4" name="simple-Workflow">
<start to="RunMapreduceJob" />
<action name="RunMapreduceJob">
<map-reduce>
<job-tracker>localhost:8088</job-tracker>
<name-node>file:/usr/local/hadoop_store/hdfs/namenode</name-node>
<prepare>
<delete path="file:/usr/local/hadoop_store/hdfs/namenode"/>
</prepare>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>default</value>
</property>
<property>
<name>mapred.mapper.class</name>
<value>DataDividerByUser.DataDividerMapper</value>
</property>
<property>
<name>mapred.reducer.class</name>
<value>DataDividerByUser.DataDividerReducer</value>
</property>
<property>
<name>mapred.output.key.class</name>
<value>org.apache.hadoop.io.IntWritable</value>
</property>
<property>
<name>mapred.output.value.class</name>
<value>org.apache.hadoop.io.Text</value>
</property>
<property>
<name>mapred.input.dir</name>
<value>/data</value>
</property>
<property>
<name>mapred.output.dir</name>
<value>/dataoutput</value>
</property>
</configuration>
</map-reduce>
<ok to="end" />
<error to="fail" />
</action>
<kill name="fail">
<message>Mapreduce program Failed</message>
</kill>
<end name="end" />
</workflow-app>
The <name-node> value should not be a file path. It should point to the NameNode of the underlying Hadoop cluster where Oozie has to run the MapReduce job. Your name node should be the value of fs.default.name from your core-site.xml:
nameNode=hdfs://localhost:9000
Also, change the property name oozie.wf.applications.path to oozie.wf.application.path (without the s).
Add the property oozie.use.system.libpath=true to your properties file.
Source: Apache Oozie by Mohammad Kamrul Islam & Aravind Srinivasan
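Putting the three fixes together, the question's job.properties would become something like this (the jobTracker value is left as in the question):

nameNode=hdfs://localhost:9000
jobTracker=localhost:8088
queueName=default
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/Config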

How to set hive properties from Oozie in global configuration

I would like to pass Hive set commands into all the HQL scripts called from Oozie. I have many HQL files and would like to pass the same Hive parameters to each of them. I used to write all the set commands in each HQL file; now I would like to keep them at the workflow level. Can anyone suggest if I am doing something wrong?
I have put part of my workflow below. When the jobs execute, the Hive parameters are not propagated, and hence the jobs are failing.
<workflow-app name="WF_AMLMKTM_L1_LOAD" xmlns="uri:oozie:workflow:0.5">
<global>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>hive.exec.parallel</name>
<value>true</value>
</property>
<property>
<name>hive.execution.engine</name>
<value>spark</value>
</property>
<property>
<name>hive.exec.dynamic.partition</name>
<value>true</value>
</property>
<property>
<name>hive.exec.dynamic.partition.mode</name>
<value>nonstrict</value>
</property>
</configuration>
</global>
<action name="map_prc_stg_l1_load_com" cred="hive2">
<hive2 xmlns="uri:oozie:hive2-action:0.1">
<jdbc-url>${hive2_jdbc_url}</jdbc-url>
<script>${basepath}/applications/stg_l1_load_com.hql</script>
<param>basepath=${basepath}</param>
<param>runsk=${wf:actionData('runsk_gen')['runsk']}</param>
I think you can add them as below:
...
<argument>--hiveconf</argument>
<argument>hive.exec.dynamic.partition.mode=nonstrict</argument>
<argument>--hiveconf</argument>
<argument>hive.exec.dynamic.partition=true</argument>
Alternatively, put all your Hive-related configuration in a hive-site.xml and pass it to the Hive action using
<job-xml>[HIVE SETTINGS FILE]</job-xml>
https://oozie.apache.org/docs/4.2.0/DG_Hive2ActionExtension.html
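For the hive2 action in the question, that would look roughly like the sketch below; the HDFS location of hive-site.xml and the transition names are assumptions:

<action name="map_prc_stg_l1_load_com" cred="hive2">
    <hive2 xmlns="uri:oozie:hive2-action:0.1">
        <!-- assumed HDFS path; upload your hive-site.xml there first -->
        <job-xml>${basepath}/conf/hive-site.xml</job-xml>
        <jdbc-url>${hive2_jdbc_url}</jdbc-url>
        <script>${basepath}/applications/stg_l1_load_com.hql</script>
        <param>basepath=${basepath}</param>
    </hive2>
    <ok to="end"/>
    <error to="fail"/>
</action>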

OOZIE workflow: HIVE table does not exist but directory created in HDFS

I am trying to run a Hive action using an Oozie workflow. Below is the Hive action:
create table abc (a INT);
I can locate the internal table in HDFS (a directory abc is created under /user/hive/warehouse), but when I run SHOW TABLES from the hive> shell, I am not able to see the table.
This is the workflow.xml file:
<workflow-app xmlns="uri:oozie:workflow:0.2" name="hive-wf">
<start to="hiveac"/>
<action name="hiveac">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<!-- <prepare> <delete path="${nameNode}/user/${wf:user()}/case1/out"/> </prepare> -->
<!-- <job-xml>hive-default.xml</job-xml>-->
<configuration>
<property>
<name>oozie.hive.defaults</name>
<value>hive-default.xml</value>
</property>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<script>script.q</script>
<!-- <param>INPUT=/user/${wf:user()}/case1/sales_history_temp4</param>
<param>OUTPUT=/user/${wf:user()}/case1/out</param> -->
</hive>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Pig Script failed!!!</message>
</kill>
<end name="end"/>
</workflow-app>
This is the hive-default.xml file:
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://localhost/metastore</value>
        <description>JDBC connect string for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>org.apache.derby.jdbc.EmbeddedDriver</value>
        <description>Driver class name for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hiveuser</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>password</value>
    </property>
    <property>
        <name>datanucleus.autoCreateSchema</name>
        <value>false</value>
    </property>
    <property>
        <name>datanucleus.fixedDatastore</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.stats.autogather</name>
        <value>false</value>
    </property>
</configuration>
This is the job.properties file:
nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
queueName=default
oozie.libpath=/user/oozie/shared/lib
#oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/my/jobhive
The logs did not give any errors as such:
stderr logs
Logging initialized using configuration in jar:file:/var/lib/hadoop-hdfs/cache/mapred/mapred/local/taskTracker/distcache/3179985539753819871_-620577179_884768063/localhost/user/oozie/shared/lib/hive-common-0.9.0-cdh4.1.1.jar!/hive-log4j.properties
Hive history file=/tmp/mapred/hive_job_log_mapred_201603060735_17840386.txt
OK
Time taken: 9.322 seconds
Log file: /var/lib/hadoop-hdfs/cache/mapred/mapred/local/taskTracker/training/jobcache/job_201603060455_0012/attempt_201603060455_0012_m_000000_0/work/hive-oozie-job_201603060455_0012.log not present. Therefore no Hadoop jobids found
I came across a similar thread: Tables created by oozie hive action cannot be found from hive client but can find them in HDFS
But this did not resolve my issue. Please let me know how to resolve it.
I haven't used Oozie for a couple of months (and did not keep archives, for legal reasons), and anyway it was V4.x, so it's a bit of guesswork...
- upload your valid hive-site.xml to HDFS somewhere
- tell Oozie to inject all these properties into the Launcher configuration before running the Hive class, so that it inherits them all, with
  <job-xml>/some/hdfs/path/hive-site.xml</job-xml>
- remove any reference to oozie.hive.defaults
Warning: all of that assumes your sandbox cluster has a persistent Metastore -- i.e. your hive-site.xml does not point to an embedded Derby DB that gets erased every time!
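A sketch of the corrected action under those assumptions (the hive-site.xml path is the placeholder from the answer, not a real location):

<action name="hiveac">
    <hive xmlns="uri:oozie:hive-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <!-- inject the real Metastore settings; oozie.hive.defaults is gone -->
        <job-xml>/some/hdfs/path/hive-site.xml</job-xml>
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>${queueName}</value>
            </property>
        </configuration>
        <script>script.q</script>
    </hive>
    <ok to="end"/>
    <error to="fail"/>
</action>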

How can I use Oozie workflow configuration property in the workflow itself?

I have an Oozie coordinator that watches for a file to show up in a certain directory. This coordinator runs daily. If the file being watched shows up, a workflow is launched.
The workflow takes the parameter of the file/directory being watched; Oozie passes this to it as a fully qualified path (e.g. hdfs://myhost/dir1/dir2/2015-02-17).
I need to grab the /dir1/dir2/2015-02-17 part and pass it into a Hive script, which doesn't seem to accept a fully qualified HDFS path. That means I need a workflow EL function to strip out the hdfs://myhost part; I think replaceAll() will do this. The problem is passing the result of that into Hive.
Is there a way to use workflow configuration property in the workflow itself?
For example, I want to be able to use 'dateToProcess' which is part of a directory name that is an input to the workflow:
<workflow-app name="mywf" xmlns="uri:oozie:workflow:0.4">
<parameters>
<property>
<name>region</name>
</property>
<property>
<name>hdfsDumpDir</name>
</property>
<property>
<name>hdfsWatchDir</name>
<value>${nameNode}${watchDir}</value>
</property>
</parameters>
<start to="copy_to_entries"/>
<action name="copy_to_entries">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<job-xml>hive-site.xml</job-xml>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
<property>
<name>dateToProcess</name>
<value>${replaceAll(hdfsDumpDir, hdfsWatchDir,"")}</value>
</property>
</configuration>
<script>myhivescript.q</script>
<!--
Parameters referenced within Hive script.
-->
<param>INPUT_TABLE=dumptable</param>
<param>INPUT_LOCATION=${watchDir}/${wf:conf('dateToProcess')}</param>
</hive>
<ok to="cleanup"/>
<error to="sendEmailKill"/>
</action>
...
</workflow>
I get an empty string when I use ${wf:conf('dateToProcess')}.
I get "variable not found" when I use ${dateToProcess}.
Any ideas?
Remove
<property>
    <name>dateToProcess</name>
    <value>${replaceAll(hdfsDumpDir, hdfsWatchDir, "")}</value>
</property>
and instead place its value directly into the <param>, i.e.
<param>INPUT_LOCATION=${watchDir}/${replaceAll(hdfsDumpDir, hdfsWatchDir, "")}</param>
If you're going to be using this in more than one place, add the dateToProcess property to config-default.xml, and then it will be available as you intended.
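If you do take the config-default.xml route, a minimal sketch of that file (placed next to workflow.xml in the application directory, as the answer suggests; whether the EL expression is evaluated there in your Oozie version is an assumption worth testing):

<configuration>
    <property>
        <name>dateToProcess</name>
        <value>${replaceAll(hdfsDumpDir, hdfsWatchDir, "")}</value>
    </property>
</configuration>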

Submitting applications externally via REST APIs

Is there currently a way to submit applications externally via the supplied REST APIs for MapReduceV1 and/or YARN? I'm hoping to find a way to do this without adding a custom service.
So far I've only figured out how to GET the application status from the ResourceManager using YARN.
Maybe I'm looking at this the wrong way, and there's a better way to do this externally?
So after doing some research, I've decided that the Oozie Workflow Scheduler is the way to go.
This is a sample workflow that can be submitted to a REST endpoint running inside your Hadoop system to start a MapReduce job. <action>s are not limited to MapReduce.
<workflow-app xmlns='uri:oozie:workflow:0.1' name='map-reduce-wf'>
    <start to='hadoop1' />
    <action name='hadoop1'>
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>org.apache.oozie.example.SampleMapper</value>
                </property>
                <property>
                    <name>mapred.reducer.class</name>
                    <value>org.apache.oozie.example.SampleReducer</value>
                </property>
                <property>
                    <name>mapred.map.tasks</name>
                    <value>1</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>input-data</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>output-map-reduce</value>
                </property>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>unfunded</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end" />
        <error to="fail" />
    </action>
    <kill name="fail">
        <message>Map/Reduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name='end' />
</workflow-app>
Sample taken from https://github.com/yahoo/oozie/wiki/Oozie-WF-use-cases
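On the REST side, a hedged sketch of submitting that workflow through Oozie's web services API; the host and port follow the oozie CLI command earlier on this page, while the HDFS application path and user name are assumptions:

# start a workflow job; the response body contains the new job id
curl -X POST -H "Content-Type: application/xml;charset=UTF-8" \
     -d @config.xml "http://localhost:11000/oozie/v1/jobs?action=start"

where config.xml is an ordinary Hadoop configuration document pointing at the workflow's directory in HDFS:

<configuration>
    <property>
        <name>user.name</name>
        <value>hadoop</value>
    </property>
    <property>
        <name>oozie.wf.application.path</name>
        <value>hdfs://localhost:9000/apps/map-reduce-wf</value>
    </property>
</configuration>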
