How can I check whether a partition location exists or not in an Oozie workflow, using a decision node?
Example: /user/cloudera/year=2016/month=201609/day=20150912
In my HDFS location I get one data set every day, laid out like the path above, i.e. year=2016/month=201609/day=20150912.
With the help of the coordinator job I get the date value:
<property>
<name>today</name>
<value>${coord:formatTime(coord:dateOffset(coord:dateTzOffset(coord:nominalTime(), "America/Los_Angeles"), -1, 'DAY'), 'yyyyMMdd')}</value>
</property>
In my workflow, how do I use a decision node to check whether the path year=2016/month=201609/day=20150912 exists or not?
You can use the HCatalog EL functions, which are available among the Oozie workflow EL functions.
The format to specify an HCatalog table partition URI is:
hcat://[metastore server]:[port]/[database name]/[table name]/[partkey1]=[value];[partkey2]=[value].
For example:
hcat://foo:8020/mydb/mytable/region=us;dt=20121212
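The workflow EL function for this kind of check is hcat:exists. Below is a minimal sketch of a decision node using it, assuming the HCatalog EL functions are enabled on your Oozie server; the node name, transition targets, and the foo:8020/mydb/mytable URI are illustrative only:
<decision name="CheckPartition">
  <switch>
    <!-- hcat:exists(uri) returns true when the partition behind the HCatalog URI exists -->
    <case to="processPartition">
      ${hcat:exists(concat('hcat://foo:8020/mydb/mytable/dt=', today))}
    </case>
    <default to="partitionMissing"/>
  </switch>
</decision>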
It seems like this is the location that you would want to check:
/user/cloudera/year=${YEAR}/month=${YEAR}${MONTH}/day=${YEAR}${MONTH}${DAY}
Of course, you would adjust these with the right offsets where required.
Thank you for your prompt responses, @YoungHobbit and @Dennis Jaheruddin.
I wanted to use the decision node to check whether the path exists or not, rather than the HCatalog URI.
I found that the following coordinator job and workflow.xml helped me achieve the solution.
coordinate_job.xml
<coordinator-app name="testemailjob" frequency="15" start="${jobStart}" end="${jobEnd}" timezone="America/Los_Angeles" xmlns="uri:oozie:coordinator:0.2" >
<controls>
<execution>FIFO</execution>
</controls>
<action>
<workflow>
<app-path>${test}</app-path>
<configuration>
<property>
<name>year</name>
<value>${coord:formatTime(coord:dateOffset(coord:dateTzOffset(coord:nominalTime(), "America/Los_Angeles"), -1, 'DAY'), 'yyyy')}</value>
</property>
<property>
<name>month</name>
<value>${coord:formatTime(coord:dateOffset(coord:dateTzOffset(coord:nominalTime(), "America/Los_Angeles"), -1, 'DAY'), 'yyyyMM')}</value>
</property>
<property>
<name>yesterday</name>
<value>${coord:formatTime(coord:dateOffset(coord:dateTzOffset(coord:nominalTime(), "America/Los_Angeles"), -1, 'DAY'), 'yyyyMMdd')}</value>
</property>
<property>
<name>today</name>
<value>${coord:formatTime(coord:dateTzOffset(coord:nominalTime(), "America/Los_Angeles"), 'yyyyMMdd')}</value>
</property>
<property>
<name>oozie.use.system.libpath</name>
<value>True</value>
</property>
</configuration>
</workflow>
</action>
</coordinator-app>
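As a usage note, such a coordinator is submitted like any other Oozie job: point a properties file at the coordinator definition and run it with the CLI. A minimal sketch follows, with hypothetical host, paths, and dates; jobStart, jobEnd, and test are the variables referenced in the coordinator above:
nameNode=hdfs://localhost:8020
jobStart=2016-09-12T00:00Z
jobEnd=2016-12-31T00:00Z
test=${nameNode}/user/cloudera/apps/partition-check
oozie.coord.application.path=${nameNode}/user/cloudera/apps/partition-check/coordinate_job.xml
oozie job -oozie http://localhost:11000/oozie -config coordinator.properties -run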
My workflow.xml :
<workflow-app name= ......>
...........................
...............................
<decision name="CheckFile">
<switch>
<case to="nextOozieTask">
${fs:exists(concat(concat(concat(concat(concat(concat(nameNode, path),year),"/month="),month),"/day="),today))}
</case>
<case to="nextOozieTask1">
${fs:exists(concat(concat(concat(concat(concat(concat(nameNode, path),year),'/month='),month),'/day='),yesterday))}
</case>
<default to="MailActionFileMissing" />
</switch>
</decision>
....................
......................
</workflow-app>
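For completeness, the MailActionFileMissing node that the default transition points to could be an Oozie email action. Below is a minimal sketch, assuming the email action's SMTP settings are configured in oozie-site.xml; the address and the ok/error targets are placeholders:
<action name="MailActionFileMissing">
  <email xmlns="uri:oozie:email-action:0.1">
    <to>data-alerts@example.com</to>
    <subject>Missing partition for ${today}</subject>
    <body>No daily partition was found for ${today} or ${yesterday}.</body>
  </email>
  <ok to="end"/>
  <error to="kill"/>
</action>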
It's the first time I am running a MapReduce program from Oozie.
Here is my job.properties file:
nameNode=file:/usr/local/hadoop_store/hdfs/namenode
jobTracker=localhost:8088
queueName=default
oozie.wf.applications.path=${nameNode}/Config
Here is my hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>
</configuration>
Here is my core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.proxyuser.hduser.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hduser.groups</name>
<value>*</value>
</property>
</configuration>
But when I run the Oozie command to run my MapReduce program, it gives an error that the lib folder is not found. Error: E0405 : E0405: Submission request doesn't have any application or lib path
oozie job -oozie http://localhost:11000/oozie -config job.properties -run
I've created a Config folder in HDFS and created a lib folder inside it. In the lib folder I placed my MapReduce jar file, and inside the Config folder I placed my workflow.xml file. (It's all in HDFS.)
I think I have given the wrong HDFS path (nameNode) in the job.properties file; that's why it is not able to find ${nameNode}/Config. Could someone please tell me what the HDFS path should be?
Thanks
Update - 1: job.properties
nameNode=hdfs://localhost:8020
jobTracker=localhost:8088
queueName=default
oozie.wf.applications.path=${nameNode}/Config
Still getting the same error:
Error: E0405 : E0405: Submission request doesn't have any application or lib path
Update - 2: workflow.xml in the Config folder in HDFS.
<workflow-app xmlns="uri:oozie:workflow:0.4" name="simple-Workflow">
<start to="RunMapreduceJob" />
<action name="RunMapreduceJob">
<map-reduce>
<job-tracker>localhost:8088</job-tracker>
<name-node>file:/usr/local/hadoop_store/hdfs/namenode</name-node>
<prepare>
<delete path="file:/usr/local/hadoop_store/hdfs/namenode"/>
</prepare>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>default</value>
</property>
<property>
<name>mapred.mapper.class</name>
<value>DataDividerByUser.DataDividerMapper</value>
</property>
<property>
<name>mapred.reducer.class</name>
<value>DataDividerByUser.DataDividerReducer</value>
</property>
<property>
<name>mapred.output.key.class</name>
<value>org.apache.hadoop.io.IntWritable</value>
</property>
<property>
<name>mapred.output.value.class</name>
<value>org.apache.hadoop.io.Text</value>
</property>
<property>
<name>mapred.input.dir</name>
<value>/data</value>
</property>
<property>
<name>mapred.output.dir</name>
<value>/dataoutput</value>
</property>
</configuration>
</map-reduce>
<ok to="end" />
<error to="fail" />
</action>
<kill name="fail">
<message>Mapreduce program Failed</message>
</kill>
<end name="end" />
</workflow-app>
The <name-node> tag should not be a file path. It should point to the NameNode of the underlying Hadoop cluster where Oozie has to run the MapReduce job. Your nameNode should be the value of fs.default.name from your core-site.xml:
nameNode=hdfs://localhost:9000
Also, change the property name oozie.wf.applications.path to oozie.wf.application.path (without the s).
Add the property oozie.use.system.libpath=true to your properties file.
Source: Apache Oozie by Mohammad Kamrul Islam & Aravind Srinivasan
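Putting those suggestions together, the properties file would look roughly like the sketch below (jobTracker is kept as in your original file; adjust ports to your cluster). The workflow's <name-node> element should likewise be ${nameNode} rather than a local file: path.
nameNode=hdfs://localhost:9000
jobTracker=localhost:8088
queueName=default
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/Config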
I have a property, HDFS_file_path, that needs to be passed from workflow-1 to common_subworkflow.
I also have workflow-2, which doesn't have that HDFS_file_path property. But workflow-2 also calls common_subworkflow.
In common_subworkflow I am fetching the property value with ${HDFS_file_path}.
It works fine when workflow-1 calls common_subworkflow but fails when workflow-2 calls common_subworkflow since HDFS_file_path doesn't exist in workflow-2.
Is there any way to
read the dynamic property if it is present, or
set some default value (null or empty) if the variable is not present?
<workflow-app name='hello-wf' xmlns="uri:oozie:workflow:0.4">
<parameters>
<property>
<name>inputDir</name>
</property>
<property>
<name>outputDir</name>
<value>out-dir</value>
</property>
</parameters>
...
<action name='firstjob'>
<map-reduce>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.mapper.class</name>
<value>com.foo.FirstMapper</value>
</property>
<property>
<name>mapred.reducer.class</name>
<value>com.foo.FirstReducer</value>
</property>
<property>
<name>mapred.input.dir</name>
<value>${inputDir}</value>
</property>
<property>
<name>mapred.output.dir</name>
<value>${outputDir}</value>
</property>
</configuration>
</map-reduce>
<ok to='secondjob'/>
<error to='killcleanup'/>
</action>
...
</workflow-app>
In the above example, if inputDir is not specified, Oozie will print an error message instead of submitting the job. If outputDir is not specified, Oozie will use the default value, out-dir.
Taken from https://oozie.apache.org/docs/3.3.1/WorkflowFunctionalSpec.html#a4.1_Workflow_Job_Properties_or_Parameters
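Applied to the question: declaring HDFS_file_path in the <parameters> section of common_subworkflow with a default value means workflow-2 can call it without setting the property. A minimal sketch, where the default value is just a placeholder:
<workflow-app name="common_subworkflow" xmlns="uri:oozie:workflow:0.4">
  <parameters>
    <property>
      <name>HDFS_file_path</name>
      <!-- placeholder default, used when the calling workflow does not set HDFS_file_path -->
      <value>/tmp/default/path</value>
    </property>
  </parameters>
  ...
</workflow-app>
workflow-1 keeps passing its own value (for example through the sub-workflow action's <propagate-configuration/> or an explicit <property> in its <configuration>), which overrides the default, while workflow-2 simply falls back to it.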
I would like to pass Hive set commands to all the HQL scripts called from my Oozie workflows. I have many HQL scripts and I would like to pass the Hive parameters to each of them. I used to write all the set commands in each HQL file; now I would like to keep them at the workflow level. Can anyone suggest whether I am doing something wrong?
I have put part of my workflow below. When executing the jobs, the Hive parameters are not propagated and hence the jobs are failing.
<workflow-app name="WF_AMLMKTM_L1_LOAD" xmlns="uri:oozie:workflow:0.5">
<global>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>hive.exec.parallel</name>
<value>true</value>
</property>
<property>
<name>hive.execution.engine</name>
<value>spark</value>
</property>
<property>
<name>hive.exec.dynamic.partition</name>
<value>true</value>
</property>
<property>
<name>hive.exec.dynamic.partition.mode</name>
<value>nonstrict</value>
</property>
</configuration>
</global>
<action name="map_prc_stg_l1_load_com" cred="hive2">
<hive2 xmlns="uri:oozie:hive2-action:0.1">
<jdbc-url>${hive2_jdbc_url}</jdbc-url>
<script>${basepath}/applications/stg_l1_load_com.hql</script>
<param>basepath=${basepath}</param>
<param>runsk=${wf:actionData('runsk_gen')['runsk']}</param>
I think you can add them as below:
... <argument>--hiveconf</argument>
<argument>hive.exec.dynamic.partition.mode=nonstrict</argument>
<argument>--hiveconf</argument>
<argument>hive.exec.dynamic.partition=true</argument>
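In a hive2 action the <argument> elements go after the <script> and <param> elements, so in context it would look something like the sketch below, based on the action shown above; the ok/error transition targets are placeholders:
<action name="map_prc_stg_l1_load_com" cred="hive2">
  <hive2 xmlns="uri:oozie:hive2-action:0.1">
    <jdbc-url>${hive2_jdbc_url}</jdbc-url>
    <script>${basepath}/applications/stg_l1_load_com.hql</script>
    <param>basepath=${basepath}</param>
    <param>runsk=${wf:actionData('runsk_gen')['runsk']}</param>
    <!-- one --hiveconf flag per setting, passed through to beeline -->
    <argument>--hiveconf</argument>
    <argument>hive.exec.parallel=true</argument>
    <argument>--hiveconf</argument>
    <argument>hive.execution.engine=spark</argument>
    <argument>--hiveconf</argument>
    <argument>hive.exec.dynamic.partition=true</argument>
    <argument>--hiveconf</argument>
    <argument>hive.exec.dynamic.partition.mode=nonstrict</argument>
  </hive2>
  <ok to="next_action"/>
  <error to="kill_email"/>
</action>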
Put all your Hive-related configuration in hive-site.xml and pass it with the Hive action using
<job-xml>[HIVE SETTINGS FILE]</job-xml>
https://oozie.apache.org/docs/4.2.0/DG_Hive2ActionExtension.html
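For that approach, you would upload the settings file to HDFS and reference it from the action; a sketch of such a file, with the HDFS location being a hypothetical choice:
<!-- hive-site.xml uploaded to a hypothetical HDFS location, e.g. ${basepath}/conf/hive-site.xml -->
<configuration>
  <property><name>hive.exec.parallel</name><value>true</value></property>
  <property><name>hive.execution.engine</name><value>spark</value></property>
  <property><name>hive.exec.dynamic.partition</name><value>true</value></property>
  <property><name>hive.exec.dynamic.partition.mode</name><value>nonstrict</value></property>
</configuration>
In the hive2 action, the reference would then be <job-xml>${basepath}/conf/hive-site.xml</job-xml>, placed before the <jdbc-url> element.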
I am trying to run a Hive action using an Oozie workflow. Below is the Hive script:
create table abc (a INT);
I can locate the internal table in HDFS (the directory abc gets created under /user/hive/warehouse), but when I run SHOW TABLES from the hive> prompt, I am not able to see the table.
This is the workflow.xml file:
<workflow-app xmlns="uri:oozie:workflow:0.2" name="hive-wf">
<start to="hiveac"/>
<action name="hiveac">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<!-- <prepare> <delete path="${nameNode}/user/${wf:user()}/case1/out"/> </prepare> -->
<!-- <job-xml>hive-default.xml</job-xml>-->
<configuration>
<property>
<name>oozie.hive.defaults</name>
<value>hive-default.xml</value>
</property>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<script>script.q</script>
<!-- <param>INPUT=/user/${wf:user()}/case1/sales_history_temp4</param>
<param>OUTPUT=/user/${wf:user()}/case1/out</param> -->
</hive>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Pig Script failed!!!</message>
</kill>
<end name="end"/>
</workflow-app>
This is the hive-default.xml file:
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost/metastore</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.apache.derby.jdbc.EmbeddedDriver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hiveuser</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>password</value>
</property>
<property>
<name>datanucleus.autoCreateSchema</name>
<value>false</value>
</property>
<property>
<name>datanucleus.fixedDatastore</name>
<value>true</value>
</property>
<property>
<name>hive.stats.autogather</name>
<value>false</value>
</property>
</configuration>
This is the job.properties file:
nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
queueName=default
oozie.libpath=/user/oozie/shared/lib
#oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/my/jobhive
The logs did not give any errors as such:
stderr logs
Logging initialized using configuration in jar:file:/var/lib/hadoop-hdfs/cache/mapred/mapred/local/taskTracker/distcache/3179985539753819871_-620577179_884768063/localhost/user/oozie/shared/lib/hive-common-0.9.0-cdh4.1.1.jar!/hive-log4j.properties
Hive history file=/tmp/mapred/hive_job_log_mapred_201603060735_17840386.txt
OK
Time taken: 9.322 seconds
Log file: /var/lib/hadoop-hdfs/cache/mapred/mapred/local/taskTracker/training/jobcache/job_201603060455_0012/attempt_201603060455_0012_m_000000_0/work/hive-oozie-job_201603060455_0012.log not present. Therefore no Hadoop jobids found
I came across a similar thread: Tables created by oozie hive action cannot be found from hive client but can find them in HDFS
But this did not resolve my issue. Please let me know how to resolve it.
I haven't used Oozie for a couple of months (and did not keep archives for legal reasons), and anyway it was V4.x, so this is a bit of guesswork...
upload your valid hive-site.xml to HDFS somewhere
tell Oozie to inject all these properties in the Launcher Configuration before running the Hive class, so that it inherits them all, with
<job-xml>/some/hdfs/path/hive-site.xml</job-xml>
remove any reference to oozie.hive.defaults
Warning: all that assumes that your sandbox cluster has a persistent Metastore -- i.e. your hive-site.xml does not point to a Derby embedded DB that gets erased every time!
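Applied to the workflow above, the action would look roughly like the sketch below; the HDFS location of hive-site.xml is a hypothetical choice, and the oozie.hive.defaults property is gone:
<action name="hiveac">
  <hive xmlns="uri:oozie:hive-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <!-- hypothetical HDFS location of a valid hive-site.xml that points at your persistent metastore -->
    <job-xml>/user/my/jobhive/hive-site.xml</job-xml>
    <configuration>
      <property>
        <name>mapred.job.queue.name</name>
        <value>${queueName}</value>
      </property>
    </configuration>
    <script>script.q</script>
  </hive>
  <ok to="end"/>
  <error to="fail"/>
</action>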
Is there currently a way to submit applications externally via the supplied REST APIs for MapReduceV1 and/or YARN? I'm hoping to find a way to do this without adding a custom service.
So far I've only figured out how to GET the application status from the ResourceManager using YARN.
Maybe I'm looking at this the wrong way and there's a better way to do this externally?
So after doing some research, I've decided that the Oozie Workflow Scheduler is the way to go.
This is a sample workflow that can be submitted to a REST endpoint running inside your Hadoop system to start a MapReduce job. <action>s are not limited to MapReduce.
<workflow-app xmlns='uri:oozie:workflow:0.1' name='map-reduce-wf'>
<start to='hadoop1' />
<action name='hadoop1'>
<map-reduce>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.mapper.class</name>
<value>org.apache.oozie.example.SampleMapper</value>
</property>
<property>
<name>mapred.reducer.class</name>
<value>org.apache.oozie.example.SampleReducer</value>
</property>
<property>
<name>mapred.map.tasks</name>
<value>1</value>
</property>
<property>
<name>mapred.input.dir</name>
<value>input-data</value>
</property>
<property>
<name>mapred.output.dir</name>
<value>output-map-reduce</value>
</property>
<property>
<name>mapred.job.queue.name</name>
<value>unfunded</value>
</property>
</configuration>
</map-reduce>
<ok to="end" />
<error to="fail" />
</action>
<kill name="fail">
<message>Map/Reduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name='end' />
</workflow-app>
Sample taken from https://github.com/yahoo/oozie/wiki/Oozie-WF-use-cases
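As for the REST part: the Oozie Web Services API accepts the job configuration as an XML body POSTed to /oozie/v1/jobs (add ?action=start to run it immediately, with Content-Type: application/xml). A minimal sketch of the body follows, with hypothetical host names, user, and application path:
<!-- POST to http://oozie-host:11000/oozie/v1/jobs?action=start -->
<configuration>
  <property>
    <name>user.name</name>
    <value>hadoop-user</value>
  </property>
  <property>
    <name>oozie.wf.application.path</name>
    <value>hdfs://namenode-host:8020/apps/map-reduce-wf</value>
  </property>
  <property>
    <name>nameNode</name>
    <value>hdfs://namenode-host:8020</value>
  </property>
  <property>
    <name>jobTracker</name>
    <value>jobtracker-host:8021</value>
  </property>
</configuration>
The response contains the workflow job ID, which you can then poll with GET /oozie/v1/job/<job-id>?show=info.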