How to get the file name dynamically in a decision node in Oozie?

I want to check whether a file exists in an HDFS location, using an Oozie batch job.
Every day at 11 PM a new file arrives in my HDFS location, named like "test_08_01_2016.csv", "test_08_02_2016.csv".
I want to check after 11:15 PM whether the file exists. I can check for a fixed file name with a decision node, using the workflow below.
<workflow-app name="HIVECoWorkflow" xmlns="uri:oozie:workflow:0.5">
<start to="CheckFile"/>
<decision name="CheckFile">
<switch>
<case to="nextOozieTask">
${fs:exists("/user/cloudera/file/input/test_08_01_2016.csv")}
</case>
<default to="MailActionFileMissing" />
</switch>
</decision>
<action name="MailActionFileMissing" cred="hive2">
<hive2 xmlns="uri:oozie:hive2-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<jdbc-url>jdbc:hive2://quickstart.cloudera:10000/default</jdbc-url>
<script>/user/cloudera/email/select.hql</script>
<file>/user/cloudera/hive-site.xml</file>
</hive2>
<ok to="End"/>
<error to="Kill"/>
</action>
<action name="nextOozieTask" cred="hive2">
<hive2 xmlns="uri:oozie:hive2-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<jdbc-url>jdbc:hive2://quickstart.cloudera:10000/default</jdbc-url>
<script>/user/cloudera/email/select1.hql</script>
<file>/user/cloudera/hive-site.xml</file>
</hive2>
<ok to="End"/>
<error to="Kill"/>
</action>
<kill name="Kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="End"/>
</workflow-app>
But I want to build the file name dynamically, for example "filename_todaysdate", i.e. "test_08_01_2016.csv".
Please help me: how can I get the file name dynamically?
Thanks in advance.

The solution is to get the date value from the coordinator job, using a property like the one below inside the coordinator definition.
<property>
<name>today</name>
<value>${coord:formatTime(coord:dateTzOffset(coord:nominalTime(), "America/Los_Angeles"), 'yyyyMMdd')}</value>
</property>
We can then check whether the file exists in the given HDFS location with the help of fs:exists, i.e.
${fs:exists(concat(concat(nameNode, path),today))}
And in the workflow we have to use the “today” date parameter passed in from the coordinator job, as in the code below.
<workflow-app name="HIVECoWorkflow" xmlns="uri:oozie:workflow:0.5">
<start to="CheckFile"/>
<decision name="CheckFile">
<switch>
<case to="nextOozieTask">
${fs:exists(concat(concat(nameNode, path),today))}
</case>
<case to="nextOozieTask1">
${fs:exists(concat(concat(nameNode, path),yesterday))}
</case>
<default to="MailActionFileMissing" />
</switch>
</decision>
<action name="MailActionFileMissing" cred="hive2">
<hive2 xmlns="uri:oozie:hive2-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<jdbc-url>jdbc:hive2://quickstart.cloudera:10000/default</jdbc-url>
<script>/user/cloudera/email/select.hql</script>
<file>/user/cloudera/hive-site.xml</file>
</hive2>
<ok to="End"/>
<error to="Kill"/>
</action>
<action name="nextOozieTask" cred="hive2">
<hive2 xmlns="uri:oozie:hive2-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<jdbc-url>jdbc:hive2://quickstart.cloudera:10000/default</jdbc-url>
<script>/user/cloudera/email/select1.hql</script>
<file>/user/cloudera/hive-site.xml</file>
</hive2>
<ok to="End"/>
<error to="Kill"/>
</action>
<kill name="Kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="End"/>
</workflow-app>
In job.properties we can declare all the static values, as below.
jobStart=2016-08-23T09:50Z
jobEnd=2016-08-23T10:26Z
tzOffset=-8
initialDataset=2016-08-23T09:50Z
oozie.use.system.libpath=True
security_enabled=False
dryrun=True
jobTracker=localhost:8032
nameNode=hdfs://quickstart.cloudera:8020
test=${nameNode}/user/cloudera/email1
oozie.coord.application.path=${nameNode}/user/cloudera/email1/add-partition-coord-app.xml
path=/user/cloudera/file/input/ravi_
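To show where the "today" property above would live, here is a minimal coordinator sketch, assuming a daily frequency and the jobStart/jobEnd values from job.properties. The app name `file-check-coord` and the `workflowAppPath` variable are hypothetical. The `yesterday` property is also an assumption: the workflow's second case refers to it, but its definition is not shown above.

```xml
<coordinator-app name="file-check-coord" frequency="${coord:days(1)}"
                 start="${jobStart}" end="${jobEnd}" timezone="America/Los_Angeles"
                 xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <!-- workflowAppPath is a hypothetical variable pointing at the workflow directory in HDFS -->
      <app-path>${workflowAppPath}</app-path>
      <configuration>
        <!-- Nominal (scheduled) run time, shifted to the target time zone, formatted as yyyyMMdd -->
        <property>
          <name>today</name>
          <value>${coord:formatTime(coord:dateTzOffset(coord:nominalTime(), "America/Los_Angeles"), 'yyyyMMdd')}</value>
        </property>
        <!-- Assumed definition: the same expression moved back one day with coord:dateOffset -->
        <property>
          <name>yesterday</name>
          <value>${coord:formatTime(coord:dateTzOffset(coord:dateOffset(coord:nominalTime(), -1, 'DAY'), "America/Los_Angeles"), 'yyyyMMdd')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
```

For file names like test_08_01_2016.csv from the question, the format pattern would presumably be 'MM_dd_yyyy'. Since fs:exists only receives prefix plus date here, the ".csv" suffix would also have to be appended, for example by ending each property value with .csv.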

Maybe you can write a shell script that does the HDFS file-exists check and returns 0 on success, 1 otherwise. Based on that exit code, route the Oozie workflow through the action's ok and error transitions.
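A minimal sketch of how that could look as a shell action, assuming a hypothetical script check_file.sh whose only command is `hdfs dfs -test -e "$1"` (which exits with 0 when the path exists). The action name `CheckFileShell` and the `scriptDir` variable are hypothetical too; the node names and the path/today parameters are taken from the workflow in the other answer.

```xml
<action name="CheckFileShell">
  <shell xmlns="uri:oozie:shell-action:0.1">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <!-- check_file.sh is hypothetical; its only command would be: hdfs dfs -test -e "$1" -->
    <exec>check_file.sh</exec>
    <!-- Full HDFS path to test, built from workflow parameters -->
    <argument>${nameNode}${path}${today}</argument>
    <!-- scriptDir is a hypothetical variable: the HDFS directory that holds the script -->
    <file>${scriptDir}/check_file.sh#check_file.sh</file>
  </shell>
  <!-- Exit code 0: the file exists -->
  <ok to="nextOozieTask"/>
  <!-- Non-zero exit code: the file is missing, or the check itself failed -->
  <error to="MailActionFileMissing"/>
</action>
```

One trade-off of this design is that the error transition cannot tell a missing file from a broken script, which the decision node with fs:exists avoids.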

Related

How to read config properties in sub-workflow (separate xml file)?

I am getting the error message below while reading config properties in a separate sub-workflow file. I am posting the sample code. I would appreciate your help in resolving this issue. Thank you!
2019-01-17 08:44:52,885 WARN ActionStartXCommand:523 - SERVER[localhost] USER[user1] GROUP[-] TOKEN[] APP[subWorkflow] JOB[0338958-190114130857167-oozie-oozi-W] ACTION[0338958-190114130857167-oozie-oozi-W#subWorkflowAction1] ELException in ActionStartXCommand
javax.servlet.jsp.el.ELException: variable [jobtracker] cannot be resolved
Coordinator job trigger command
oozie job --oozie http://localhost:11000/oozie --config /home/user/oozie-scripts/props/job.properties -run
job.properties
namenode=hdfs://localhost
workflowpath=${namenode}/user/user1/oozie-workflow/parentWorkflow.xml
frequency=25
starttime=2018-08-06T13\:29Z
endtime=2108-08-06T13\:29Z
timezone=UTC
oozie.coord.application.path=${namenode}/user/user1/oozie-workflow/coordinator.xml
jobtracker=http://localhost:8088
scriptpath=/user/user1/oozie-workflow
Coordinator
<coordinator-app name="sampleCoord" frequency="${frequency}" start="${starttime}" end="${endtime}" timezone="${timezone}" xmlns="uri:oozie:coordinator:0.4">
<action>
<workflow>
<app-path>${workflowpath}</app-path>
</workflow>
</action>
</coordinator-app>
Parent Workflow
<workflow-app xmlns = "uri:oozie:workflow:0.4" name = "Parent-Workflow">
<start to = "workflowAction1" />
<action name = "workflowAction1">
<sub-workflow>
<app-path>/user/user1/oozie-workflow/subWorkflow1.xml</app-path>
</sub-workflow>
<ok to = "end" />
<error to = "end" />
</action>
<end name = "end" />
</workflow-app>
Sub-Workflow
<workflow-app xmlns = "uri:oozie:workflow:0.4" name = "subWorkflow">
<start to = "subWorkflowAction1" />
<action name = "subWorkflowAction1">
<hive xmlns = "uri:oozie:hive-action:0.4">
<job-tracker>${jobtracker}</job-tracker>
<script>${scriptpath}/dropTempTable.hive</script>
<param>Temp_TableVar=${concat(concat("HBASE_",replaceAll(wf:id(),"- ","_")),"_TEMP")}</param>
</hive>
<ok to = "end" />
<error to = "kill_job" />
</action>
<kill name = "kill_job">
<message>Job failed</message>
</kill>
<end name = "end" />
</workflow-app>
Adding the propagate-configuration tag to the sub-workflow action in the parent workflow XML file resolved the issue.
<workflow-app xmlns = "uri:oozie:workflow:0.4" name = "Parent-Workflow">
<start to = "workflowAction1" />
<action name = "workflowAction1">
<sub-workflow>
<app-path>/user/user1/oozie-workflow/subWorkflow1.xml</app-path>
<propagate-configuration />
</sub-workflow>
<ok to = "end" />
<error to = "end" />
</action>
<end name = "end" />
</workflow-app>
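As an alternative to propagating the whole parent configuration, the sub-workflow element also accepts an explicit configuration block. The sketch below is an assumption about what would be enough for this case: it passes only the two properties the sub-workflow reads (jobtracker and scriptpath), and it has not been run against this setup.

```xml
<workflow-app xmlns="uri:oozie:workflow:0.4" name="Parent-Workflow">
  <start to="workflowAction1"/>
  <action name="workflowAction1">
    <sub-workflow>
      <app-path>/user/user1/oozie-workflow/subWorkflow1.xml</app-path>
      <!-- Pass only the properties the sub-workflow actually references -->
      <configuration>
        <property>
          <name>jobtracker</name>
          <value>${jobtracker}</value>
        </property>
        <property>
          <name>scriptpath</name>
          <value>${scriptpath}</value>
        </property>
      </configuration>
    </sub-workflow>
    <ok to="end"/>
    <error to="end"/>
  </action>
  <end name="end"/>
</workflow-app>
```

This keeps the sub-workflow's inputs visible in the parent, at the cost of listing every property by hand.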

Any other option to run oozie actions in parallel

Currently I have 6 actions in my oozie workflow as shown below.
After MainJob1 completes, the first, second and third jobs should all run in parallel.
After MainJob2 completes, only the second and third jobs should run in parallel.
Is there any way to get this kind of workflow execution?
<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1">
....
<decision name="execution-mode-decision">
<switch>
<case to="MainJob1">${executionMode eq "DEFAULT"}</case>
<case to="MainJob2">${executionMode eq "INVALID"}</case>
<default to="MainJob1" />
</switch>
</decision>
<action name="MainJob1">
<map-reduce>
.......
</map-reduce>
<ok to="fork1"/>
<error to="kill"/>
</action>
<action name="MainJob2">
<map-reduce>
......
</map-reduce>
<ok to="fork2"/>
<error to="kill"/>
</action>
...
<fork name="fork1">
<path start="firstparalleljob"/>
<path start="secondparalleljob"/>
<path start="thirdparalleljob"/>
</fork>
<fork name="fork2">
<path start="secondparalleljob"/>
<path start="thirdparalleljob"/>
</fork>
<action name="firstparalleljob">
<map-reduce>
...........
</map-reduce>
<ok to="joining"/>
<error to="kill"/>
</action>
<action name="secondparalleljob">
<map-reduce>
........
</map-reduce>
<ok to="joining"/>
<error to="kill"/>
</action>
<action name="thirdparalleljob">
<map-reduce>
........
</map-reduce>
<ok to="joining"/>
<error to="kill"/>
</action>
<join name="joining" to="emailFailure"/>
...
</workflow-app>
You can put firstparalleljob, secondparalleljob and thirdparalleljob in 3 separate sub-workflows, then call all 3 sub-workflows in the first fork and 2 of them in the second fork. This way we can even pass a different value to a variable at each fork for the same action.
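A rough sketch of that layout, assuming each parallel job has been moved into its own sub-workflow application. The `subWfRoot` variable, the action names, the join names and the `triggeredBy` property are all hypothetical. Each fork gets its own join here because Oozie validates fork/join pairing when the workflow is submitted.

```xml
<!-- Reached from MainJob1: all three jobs run in parallel -->
<fork name="fork1">
  <path start="first-after-main1"/>
  <path start="second-after-main1"/>
  <path start="third-after-main1"/>
</fork>
<action name="first-after-main1">
  <sub-workflow>
    <app-path>${subWfRoot}/firstparalleljob</app-path>
    <propagate-configuration/>
  </sub-workflow>
  <ok to="join1"/>
  <error to="kill"/>
</action>
<action name="second-after-main1">
  <sub-workflow>
    <app-path>${subWfRoot}/secondparalleljob</app-path>
    <propagate-configuration/>
    <!-- Same sub-workflow, different value depending on the fork that calls it -->
    <configuration>
      <property>
        <name>triggeredBy</name>
        <value>MainJob1</value>
      </property>
    </configuration>
  </sub-workflow>
  <ok to="join1"/>
  <error to="kill"/>
</action>
<action name="third-after-main1">
  <sub-workflow>
    <app-path>${subWfRoot}/thirdparalleljob</app-path>
    <propagate-configuration/>
  </sub-workflow>
  <ok to="join1"/>
  <error to="kill"/>
</action>
<join name="join1" to="emailFailure"/>

<!-- Reached from MainJob2: only the second and third jobs run -->
<fork name="fork2">
  <path start="second-after-main2"/>
  <path start="third-after-main2"/>
</fork>
<action name="second-after-main2">
  <sub-workflow>
    <app-path>${subWfRoot}/secondparalleljob</app-path>
    <propagate-configuration/>
    <configuration>
      <property>
        <name>triggeredBy</name>
        <value>MainJob2</value>
      </property>
    </configuration>
  </sub-workflow>
  <ok to="join2"/>
  <error to="kill"/>
</action>
<action name="third-after-main2">
  <sub-workflow>
    <app-path>${subWfRoot}/thirdparalleljob</app-path>
    <propagate-configuration/>
  </sub-workflow>
  <ok to="join2"/>
  <error to="kill"/>
</action>
<join name="join2" to="emailFailure"/>
```

The map-reduce definitions then live once, inside the sub-workflows, while the parent workflow only decides which of them to start from each fork.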

Copying files from a hdfs directory to another with oozie distcp-action

Of my actions, start_fair_usage ends with status OK, but test_copy returns
Main class [org.apache.oozie.action.hadoop.DistcpMain], main() threw exception, null
In /user/comverse/data/${1}_B I have a lot of different files, some of which I want to copy to ${NAME_NODE}/user/evkuzmin/output. For that, I try to pass the paths from copy_file.sh, which holds an array of paths to the files I need.
<action name="start_fair_usage">
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${JOB_TRACKER}</job-tracker>
<name-node>${NAME_NODE}</name-node>
<exec>${copy_file}</exec>
<argument>${today_without_dash}</argument>
<argument>${mta}</argument>
<!-- <file>${path}#${start_fair_usage}</file> -->
<file>${path}${copy_file}#${copy_file}</file>
<capture-output/>
</shell>
<ok to="test_copy"/>
<error to="KILL"/>
</action>
<action name="test_copy">
<distcp xmlns="uri:oozie:distcp-action:0.2">
<job-tracker>${JOB_TRACKER}</job-tracker>
<name-node>${NAME_NODE}</name-node>
<arg>${wf:actionData('start_fair_usage')['paths']}</arg>
<!-- <arg>${NAME_NODE}/user/evkuzmin/input/*</arg> -->
<arg>${NAME_NODE}/user/evkuzmin/output</arg>
</distcp>
<ok to="END"/>
<error to="KILL"/>
</action>
start_fair_usage starts copy_file.sh
echo ${1}
echo ${2}
dirs=(
/user/comverse/data/${1}_B
)
args=()
for i in $(hadoop fs -ls "${dirs[@]}" | egrep ${2}.gz | awk -F " " '{print $8}')
do
args+=("$i")
echo "copy file - "${i}
done
paths=${args}
echo ${paths}
Here is what I did in the end.
<start to="start_copy"/>
<fork name="start_copy">
<path start="copy_mta"/>
<path start="copy_rcr"/>
<path start="copy_sub"/>
</fork>
<action name="copy_mta">
<distcp xmlns="uri:oozie:distcp-action:0.2">
<prepare>
<delete path="${NAME_NODE}${dstFolder}mta/*"/>
</prepare>
<arg>${NAME_NODE}${srcFolder}/*mta.gz</arg>
<arg>${NAME_NODE}${dstFolder}mta/</arg>
</distcp>
<ok to="end_copy"/>
<error to="KILL"/>
</action>
<action name="copy_rcr">
<distcp xmlns="uri:oozie:distcp-action:0.2">
<prepare>
<delete path="${NAME_NODE}${dstFolder}rcr/*"/>
</prepare>
<arg>${NAME_NODE}${srcFolder}/*rcr.gz</arg>
<arg>${NAME_NODE}${dstFolder}rcr/</arg>
</distcp>
<ok to="end_copy"/>
<error to="KILL"/>
</action>
<action name="copy_sub">
<distcp xmlns="uri:oozie:distcp-action:0.2">
<prepare>
<delete path="${NAME_NODE}${dstFolder}sub/*"/>
</prepare>
<arg>${NAME_NODE}${srcFolder}/*sub.gz</arg>
<arg>${NAME_NODE}${dstFolder}sub/</arg>
</distcp>
<ok to="end_copy"/>
<error to="KILL"/>
</action>
<join name="end_copy" to="END"/>
<kill name="KILL">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="END"/>
It turned out that distcp accepts wildcards, so I didn't need bash at all.
Also, some people advised me to write it in Scala.
import java.io.File
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val conf = new Configuration()
val fs = FileSystem.get(conf)
val listOfFileTypes = List("mta", "rcr", "sub")
val listOfPlatforms = List("B", "C", "H", "M", "Y")
for (fileType <- listOfFileTypes) {
  // Empty the destination directory first. Note: java.io.File addresses the local file system.
  FileUtil.fullyDeleteContents(new File("/apps/hive/warehouse/arstel.db/fair_usage/fct_evkuzmin/file_" + fileType))
  for (platform <- listOfPlatforms) {
    // Match every .gz file of this type for the given platform.
    val srcPaths = fs.globStatus(new Path("/user/comverse/data/" + "20170404" + "_" + platform + "/*" + fileType + ".gz"))
    val dstPath = new Path("/apps/hive/warehouse/arstel.db/fair_usage/fct_evkuzmin/file_" + fileType)
    for (srcPath <- srcPaths) {
      println("copying " + srcPath.getPath.toString)
      // deleteSource = false keeps the source file in place.
      FileUtil.copy(fs, srcPath.getPath, fs, dstPath, false, conf)
    }
  }
}
Both approaches work, though I haven't tried to run the Scala script in Oozie.

Imported Failed: Cannot convert SQL type 2005==> during importing CLOB data from Oracle database

I am trying to import an Oracle table's data with a CLOB column using Sqoop, and it fails with the error "Imported Failed: Cannot convert SQL type 2005". I am running Sqoop version 1.4.5-cdh5.4.7.
Please help me import the CLOB data type.
I am using the Oozie workflow below to import the data.
<workflow-app xmlns="uri:oozie:workflow:0.4" name="EBIH_Dly_tldb_dly_load_wf">
<credentials>
<credential name="hive2_cred" type="hive2">
<property>
<name>hive2.jdbc.url</name>
<value>${hive2_jdbc_uri}</value>
</property>
<property>
<name>hive2.server.principal</name>
<value>${hive2_server_principal}</value>
</property>
</credential>
</credentials>
<start to="sqp_imp_tldb_table1"/>
<action name="sqp_imp_tldb_table1">
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<arg>import</arg>
<arg>-Dmapreduce.output.fileoutputformat.compress=false</arg>
<arg>--connect</arg>
<arg>${connect_string}</arg>
<arg>--username</arg>
<arg>${username}</arg>
<arg>--password</arg>
<arg>${password}</arg>
<arg>--num-mappers</arg>
<arg>8</arg>
<arg>--as-textfile</arg>
<arg>--append</arg>
<arg>--fields-terminated-by</arg>
<arg>|</arg>
<arg>--split-by</arg>
<arg>created_dt</arg>
<arg>--target-dir</arg>
<arg>${sqp_table1_dir}</arg>
<arg>--map-column-hive</arg>
<arg>ID=bigint,XML1=string,XML2=string,APP_PAYLOAD=string,created_dt=date,created_day=bigint</arg>
<arg>--query</arg>
<arg>"select * from schema1.table1 where $CONDITIONS AND trunc(created_dt) BETWEEN to_date('${load_start_date}','yyyy-mm-dd') AND to_date('${load_end_date}','yyyy-mm-dd')"</arg>
</sqoop>
<ok to="dly_load_wf_complete"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="dly_load_wf_complete"/>
</workflow-app>
It finally worked for me after adding the extra option -Doraoop.disabled=true to the Sqoop import arguments.
The action below worked:
<action name="sqp_imp_tldb_table1">
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<arg>import</arg>
<arg>-Dmapreduce.output.fileoutputformat.compress=false</arg>
<arg>-Doraoop.disabled=true</arg>
<arg>--connect</arg>
<arg>${connect_string}</arg>
<arg>--username</arg>
<arg>${username}</arg>
<arg>--password</arg>
<arg>${password}</arg>
<arg>--num-mappers</arg>
<arg>8</arg>
<arg>--as-textfile</arg>
<arg>--append</arg>
<arg>--fields-terminated-by</arg>
<arg>\t</arg>
<arg>--split-by</arg>
<arg>created_dt</arg>
<arg>--target-dir</arg>
<arg>${sqp_table1_dir}</arg>
<arg>--map-column-hive</arg>
<arg>ID=bigint,XML1=string,XML2=string,APP_PAYLOAD=string,created_dt=date,created_day=bigint</arg>
<arg>--query</arg>
<arg>"select * from schema1.table1 where $CONDITIONS AND trunc(created_dt) BETWEEN to_date('${load_start_date}','yyyy-mm-dd') AND to_date('${load_end_date}','yyyy-mm-dd')"</arg>
</sqoop>
<ok to="dly_load_wf_complete"/>
<error to="fail"/>
</action>

MapR oozie sqoop error; Main class [org.apache.oozie.action.hadoop.SqoopMain], exit code [1]

I'm repeatedly getting this error when I submit a sqoop job using oozie on MapR. Details below. I even copied the mysql jar file to the share/lib/sqoop directory, with no result. Could you please help?
Command:
/opt/mapr/oozie/oozie-4.0.1/bin/oozie job -oozie=http://OOZIE_URL:11000/oozie -config job.properties -run
Error
2015-06-18 01:54:05,818 WARN SqoopActionExecutor:542 - SERVER[data-sci1] USER[mapr] GROUP[-] TOKEN[] APP[sqoop-orders-wf] JOB[0000024-150616000730465-oozie-mapr-W] ACTION[0000024-150616000730465-oozie-mapr-W#sqoop-orders-node] Launcher ERROR, reason: Main class [org.apache.oozie.action.hadoop.SqoopMain], exit code [1]
MaprFS:
/oozie/share/lib/sqoop/mysql-connector-java-5.1.25-bin.jar
job.properties:
nameNode=maprfs:///
jobTracker=YARN_RESOURCE_MANAGER:8032
queueName=default
oozie.use.system.libpath=true
oozie.wf.application.path=maprfs:/oozie/data/sqoop/orders
mapreduce.framework.name=yarn
workflow.xml:
<start to="sqoop-orders-node"/>
<action name="sqoop-orders-node">
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<arg>import</arg>
<arg>--hbase-create-table</arg>
<arg>--hbase-table</arg>
<arg>orders</arg>
<arg>--column-family</arg>
<arg>d</arg>
<arg>--username</arg>
<arg>USERNAME</arg>
<arg>--password</arg>
<arg>PASSWORD</arg>
<arg>--connect</arg>
<arg>"jdbc:mysql://HOST?zeroDateTimeBehavior=round"</arg>
<arg>--query</arg>
<arg>--split-by</arg>
<arg>o.OrderId</arg>
<arg>--hbase-row-key</arg>
<arg>rowkey</arg>
<arg>-m</arg>
<arg>8</arg>
<arg>--verbose</arg>
<arg>--query</arg>
<arg>select o.OrderId as rowkey, o.OrderId as orderId from orders WHERE \$CONDITIONS</arg>
</sqoop>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Sqoop free form failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>

Resources