Oozie - Setting strategy on DistCp through action configuration

I have a workflow with a distCp action, and it's running fairly well. However, now I'm trying to change the copy strategy and am unable to do that through the action arguments. The documentation is fairly slim on this topic and looking at the source code for the distCp action executor did not help.
When running distCp from the command line, I can use the argument
-strategy {uniformsize|dynamic} to set the copy strategy.
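For example (source and target paths hypothetical):
hadoop distcp -update -strategy dynamic s3a://source-bucket/path hdfs://namenode:8020/target/path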
Following that logic, I tried the same in the Oozie action:
<action name="distcp-run" retry-max="3" retry-interval="1">
<distcp xmlns="uri:oozie:distcp-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapreduce.job.queuename</name>
<value>${poolName}</value>
</property>
</configuration>
<arg>-Dmapreduce.job.queuename=${poolName}</arg>
<arg>-Dmapreduce.job.name=distcp-s3-${wf:id()}</arg>
<arg>-update</arg>
<arg>-strategy dynamic</arg>
<arg>${region}/d=${day2HoursAgo}/h=${hour2HoursAgo}</arg>
<arg>${region2}/d=${day2HoursAgo}/h=${hour2HoursAgo}</arg>
<arg>${region3}/d=${day2HoursAgo}/h=${hour2HoursAgo}</arg>
<arg>${nameNode}${rawPath}/${partitionDate}</arg>
</distcp>
<ok to="join-distcp-steps"/>
<error to="error-report"/>
</action>
However, the action fails when I execute it.
From stdout:
...>>> Invoking Main class now >>>
Fetching child yarn jobs
tag id : oozie-1d1fa70383587ae625b6495e30a315f7
Child yarn jobs are found -
Main class : org.apache.hadoop.tools.DistCp
Arguments :
-Dmapreduce.job.queuename=merged
-Dmapreduce.job.name=distcp-s3-0000019-160622133128476-oozie-oozi-W
-update
-strategy dynamic
s3a://myfirstregion/d=21/h=17,s3a://mysecondregion/d=21/h=17,s3a://ttv-logs-eu/tsv/clickstream-clean/y=2016/m=06/d=21/h=17,s3a://mythirdregion/d=21/h=17
hdfs://myurl:8020/data/raw/2016062117
found Distcp v2 Constructor
public org.apache.hadoop.tools.DistCp(org.apache.hadoop.conf.Configuration,org.apache.hadoop.tools.DistCpOptions) throws java.lang.Exception
<<< Invocation of Main class completed <<<
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.DistcpMain], main() threw exception, Returned value from distcp is non-zero (-1)
java.lang.RuntimeException: Returned value from distcp is non-zero (-1)
at org.apache.oozie.action.hadoop.DistcpMain.run(DistcpMain.java:66)...
Looking at the syslog, it seems that -strategy dynamic was grabbed and put into the array of source paths:
2016-06-22 14:11:18,617 INFO [uber-SubtaskRunner] org.apache.hadoop.tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=true, deleteMissing=false, ignoreFailures=false, maxMaps=20, sslConfigurationFile='null', copyStrategy='uniformsize', sourceFileListing=null, sourcePaths=[-strategy dynamic, s3a://myfirstregion/d=21/h=17,s3a:/mysecondregion/d=21/h=17,s3a:/ttv-logs-eu/tsv/clickstream-clean/y=2016/m=06/d=21/h=17,s3a:/mythirdregion/d=21/h=17], targetPath=hdfs://myurl:8020/data/raw/2016062117, targetPathExists=true, preserveRawXattrs=false, filtersFile='null'}
2016-06-22 14:11:18,624 INFO [uber-SubtaskRunner] org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at sandbox/10.191.5.128:8032
2016-06-22 14:11:18,655 ERROR [uber-SubtaskRunner] org.apache.hadoop.tools.DistCp: Invalid input:
org.apache.hadoop.tools.CopyListing$InvalidInputException: -strategy dynamic doesn't exist
So DistCpOptions does have a copyStrategy, but it is left at its default value of uniformsize.
I've tried moving the argument to the first position, but then both -Dmapreduce arguments end up in the source paths (though -update does not).
How can I set the copy strategy to dynamic through the Oozie workflow configuration?
Thanks.

Looking at the code, it doesn't seem possible to set the strategy via configuration. Instead of using the distcp action, you could use a map-reduce action; that way you can configure it however you want.
The Oozie MapReduce Cookbook has examples.
Looking at the DistCp code, the relevant part is around line 237, in createJob():
Job job = Job.getInstance(getConf());
job.setJobName(jobName);
job.setInputFormatClass(DistCpUtils.getStrategy(getConf(), inputOptions)); // the strategy selects the input format class
job.setJarByClass(CopyMapper.class);
configureOutputFormat(job);
job.setMapperClass(CopyMapper.class);
job.setNumReduceTasks(0);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputFormatClass(CopyOutputFormat.class);
job.getConfiguration().set(JobContext.MAP_SPECULATIVE, "false");
job.getConfiguration().set(JobContext.NUM_MAPS, String.valueOf(inputOptions.getMaxMaps()));
The code above isn't everything you will need; you'll have to look at the distcp source to work out the rest.
You would need to configure all of these properties yourself in a map-reduce action. That way you can set the input format class, which is where the strategy setting is used.
You can see the available properties for the input format class in the distcp properties file here.
The one you need is org.apache.hadoop.tools.mapred.lib.DynamicInputFormat.
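As a rough sketch only (not a complete replacement for the distcp action: the copy-listing setup, the output format, and the other options wired up in createJob() above are omitted; the mapper and input format classes come from the DistCp source, and the property keys are the standard mapreduce.job.* ones):
<action name="distcp-mr">
<map-reduce>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<!-- use the new MapReduce API so the mapreduce.job.* keys apply -->
<property>
<name>mapred.mapper.new-api</name>
<value>true</value>
</property>
<property>
<name>mapred.reducer.new-api</name>
<value>true</value>
</property>
<!-- the strategy is just the input format: dynamic = DynamicInputFormat -->
<property>
<name>mapreduce.job.inputformat.class</name>
<value>org.apache.hadoop.tools.mapred.lib.DynamicInputFormat</value>
</property>
<property>
<name>mapreduce.job.map.class</name>
<value>org.apache.hadoop.tools.mapred.CopyMapper</value>
</property>
<!-- DistCp is map-only, as in createJob() above -->
<property>
<name>mapreduce.job.reduces</name>
<value>0</value>
</property>
</configuration>
</map-reduce>
<ok to="join-distcp-steps"/>
<error to="error-report"/>
</action>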

Related

Oozie workflow fails - Mkdirs failed to create file

I am using an Oozie workflow to run a pyspark script, and I'm running into an error I can't figure out.
When running the workflow (either locally or on YARN), a MapReduce job is run before the Spark job starts. After a few minutes the task fails (before the Spark action), and digging through the logs shows the following error:
java.io.IOException: Mkdirs failed to create file:/home/oozie/oozie-oozi/0000011-160222043656138-oozie-oozi-W/bulk-load-node--spark/output/_temporary/1/_temporary/attempt_1456129482428_0003_m_000000_2 (exists=false, cwd=file:/hadoop/yarn/local/usercache/root/appcache/application_1456129482428_0003/container_e68_1456129482428_0003_01_000004)
(Apologies for the length)
There are no other evident errors. I do not directly create this folder (I assume given the name that it is used for temporary storage of MapReduce jobs). I can create this folder from the command line using mkdir -p /home/oozie/blah.... It doesn't appear to be a permissions issue, as setting that folder to 777 made no difference. I have also added default ACLs for oozie, yarn and mapred users for that folder, so I've pretty much ruled out permission issues. It's also worth noting that the working directory listed in the error does not exist after the job fails.
After some Googling I saw that a similar problem is common on Mac systems, but I'm running on CentOS. I am running the HDP 2.3 VM Sandbox, which is a single node 'cluster'.
My workflow.xml is as follows:
<workflow-app xmlns='uri:oozie:workflow:0.4' name='SparkBulkLoad'>
<start to = 'bulk-load-node'/>
<action name = 'bulk-load-node'>
<spark xmlns="uri:oozie:spark-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<master>yarn</master>
<mode>client</mode>
<name>BulkLoader</name>
<jar>file:///test/BulkLoader.py</jar>
<spark-opts>
--num-executors 3 --executor-cores 1 --executor-memory 512m --driver-memory 512m\
</spark-opts>
</spark>
<ok to = 'end'/>
<error to = 'fail'/>
</action>
<kill name = 'fail'>
<message>
Error occurred while bulk loading files
</message>
</kill>
<end name = 'end'/>
</workflow-app>
and job.properties is as follows:
nameNode=hdfs://192.168.26.130:8020
jobTracker=http://192.168.26.130:8050
queueName=spark
oozie.use.system.libpath=true
oozie.wf.application.path=file:///test/workflow.xml
If necessary I can post any other parts of the stack trace. I appreciate any help.
Update 1
After checking my Spark History Server, I can confirm that the actual Spark action is not starting: no new Spark apps are being submitted.

Running Hive Query in Spark through Oozie 4.1.0.3

I am getting a "Table not found" exception while running a Hive query in Spark, using Oozie version 4.1.0.3, as a java action.
I copied hive-site.xml and hive-default.xml from the HDFS path.
workflow.xml used:
<start to="scala_java"/>
<action name="scala_java">
<java>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<job-xml>${nameNode}/user/${wf:user()}/${appRoot}/env/devbox/hive-site.xml</job-xml>
<configuration>
<property>
<name>oozie.hive.defaults</name>
<value>${nameNode}/user/${wf:user()}/${appRoot}/env/devbox/hive-default.xml</value>
</property>
<property>
<name>pool.name</name>
<value>${etlPoolName}</value>
</property>
<property>
<name>mapreduce.job.queuename</name>
<value>${QUEUE_NAME}</value>
</property>
</configuration>
<main-class>org.apache.spark.deploy.SparkSubmit</main-class>
<arg>--master</arg>
<arg>yarn-cluster</arg>
<arg>--class</arg>
<arg>HiveFromSparkExample</arg>
<arg>--deploy-mode</arg>
<arg>cluster</arg>
<arg>--queue</arg>
<arg>testq</arg>
<arg>--num-executors</arg>
<arg>64</arg>
<arg>--executor-cores</arg>
<arg>5</arg>
<arg>--jars</arg>
<arg>datanucleus-api-jdo-3.2.6.jar,datanucleus-core-3.2.10.jar,datanucleus-rdbms-3.2.9.jar</arg>
<arg>TEST-0.0.2-SNAPSHOT.jar</arg>
<file>TEST-0.0.2-SNAPSHOT.jar</file>
</java>
INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: Table not found test_hive_spark_t1)
Exception in thread "Driver" org.apache.hadoop.hive.ql.metadata.InvalidTableException: Table not found test_hive_spark_t1
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:980)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:950)
at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:79)
at org.apache.spark.sql.hive.HiveContext$$anon$1.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:255)
at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137)
at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:137)
at org.apache.spark.sql.hive.HiveContext$$anon$1.lookupRelation(HiveContext.scala:255)
A. The X-default config files are just for user information; they are created at install time, from the hard-coded defaults in the JARs.
It's the X-site config files that contain useful information, e.g. how to connect to the Metastore (the default for that is "just start an embedded Derby DB with no data inside"... which might explain the "Table not found" message!).
B. Hadoop components search for X-site config files in the CLASSPATH; if they don't find them there, they silently fall back to the defaults.
So you must tell Oozie to download them to the local CWD via <file> instructions.
(Except for an explicit Hive action, which uses another, explicit convention for its specific hive-site.xml; but that's not the case here.)
hive-default.xml is not needed.
Create a custom hive-site.xml which has the hive.metastore.uris property alone.
Pass the custom hive-site.xml via --files hive-site.xml in the Spark arguments.
Remove the job-xml element and the oozie.hive.defaults property.
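A minimal sketch of that custom hive-site.xml (the metastore host and port are hypothetical):
<configuration>
<property>
<name>hive.metastore.uris</name>
<value>thrift://metastore-host:9083</value>
</property>
</configuration>
and the corresponding additions to the java action, shipping the file into the container's working directory so Spark can pick it up (placed before the application jar):
<arg>--files</arg>
<arg>hive-site.xml</arg>
...
<file>hive-site.xml</file>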

Hive Oozie Error Handling

Does anyone have any suggestion on what is the best practice around Oozie Exception/Error handling?
We have Hive Actions within Oozie workflows and find that the errors are not logging with enough detail. We need more of stack trace and more context around each failure.
Any suggestions?
Thanx in advance...
Himanshu
Once the Oozie job is submitted, YARN is responsible for the action and for running the MapReduce job to completion. Check the logs in the MapReduce history server once the job has been submitted to YARN, or check the job logs in the Oozie web UI along with its list of error codes.
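For example, from the Oozie CLI (server URL and job id are placeholders):
oozie job -oozie http://oozie-host:11000/oozie -log <workflow-job-id>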
The logging level of the Hive action can be set in the Hive action configuration using the property oozie.hive.log.level. The default value is INFO.
You can change it to DEBUG by including the property in the Hive action configuration of your workflow.xml:
<configuration>
<property>
<name>oozie.hive.log.level</name>
<value>DEBUG</value>
</property>
</configuration>
This log level is, in turn, passed on to log4j, I believe.
https://github.com/apache/oozie/blob/master/sharelib/hive/src/main/java/org/apache/oozie/action/hadoop/HiveMain.java

Oozie workflow succeeds when map-reduce task throws exceptions - I want it to fail

I have an Oozie workflow that runs a map-reduce task like this:
<action name="generate">
<map-reduce>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
...
<property>
<name>mapred.mapper.class</name>
<value>com.tendril.pr.report.generate.ReportGenerationMapper</value>
</property>
My mapper class is ReportGenerationMapper, which implements the Mapper interface. This class is instantiated and the map() method is called by the Hadoop system just fine. During the method call, though, a runtime exception occurs and appears in the log like this:
Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, hdfs-default.xml, hdfs-site.xml, /data/1/mapred/local/tasktracker/taskTracker/root/jobcache/job_201402032014_4231/job.xml
com.tendril.pr.core.domain.TendrilException: Could not retrieve property null
...
This is all fine, but the workflow succeeds and I want it to fail. I can't seem to make it fail. I searched without any luck for how to fail the task that is running this mapper. My Oozie job succeeds and gives no indication of a failure without diving into this task and looking at the logs. I want the failure to be obvious by having the task and whole workflow fail. Any advice on how to make this happen?
Please try to capture the exit code within the driver class and throw an exception manually from there.
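A sketch of the same idea applied in the mapper itself (this is not the driver-class approach above, and the class and method names here are hypothetical): let the exception escape map() rather than catching and logging it, so Hadoop marks the task attempt as failed and Oozie takes the error transition.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ReportGenerationMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    public void map(LongWritable key, Text value,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        try {
            generateReport(value, output); // hypothetical business logic
        } catch (RuntimeException e) {
            // Rethrow instead of just logging: an exception that escapes
            // map() fails the task attempt, and once the retry limit is
            // exhausted the action (and thus the workflow) fails too.
            throw new IOException("Report generation failed", e);
        }
    }

    private void generateReport(Text value, OutputCollector<Text, Text> output)
            throws IOException {
        // hypothetical placeholder for the real report logic
    }
}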

oozie : How to use oozie coordinator properties in corresponding workflow?

I have an Oozie job running as a coordinator which calls a workflow. In the coordinator there are some configuration properties which use coordinator EL functions, like this:
${oozieAppDir}/copyLogs.wf.xml
<configuration>
<property>
<name>filename3</name>
<value>${coord:formatTime(coord:dateOffset(coord:nominalTime(), -3, 'HOUR'), 'MM')}-${coord:formatTime(coord:dateOffset(coord:nominalTime(), -3, 'HOUR'), 'dd')}-${coord:formatTime(coord:dateOffset(coord:nominalTime(), -3, 'HOUR'), 'yyyy')}-${coord:formatTime(coord:dateOffset(coord:nominalTime(), -3, 'HOUR'), 'HH')}</value>
</property>
</configuration>
When an instance of this job is killed, I want to rerun the workflow individually from the command line, but it gives an error since it uses properties defined in the coordinator. I can't add these properties to the workflow since they use the coordinator EL functions, and I did not find corresponding wf: EL functions. What is the best way to do this? I am basically interested in automating the handling of failures of this workflow triggered by the coordinator. Please suggest the best way to go about it with minimal changes.
You need to define the coordinator EL function as a property in coordinator.xml and reference that property in workflow.xml:
Coordinator
<action>
<workflow>
<app-path>${workflowPath}</app-path>
<configuration>
<property>
<name>nominalTime</name>
<value>${coord:nominalTime()}</value>
</property>
</configuration>
</workflow>
</action>
Workflow (Hive example):
<param>date=${nominalTime}</param>
You can pass parameters to the workflow during submission using -D:
oozie job -oozie <oozie URL> -config <configFile> -Dnameofproperty=value -submit
You can use the String wf:conf(String name) function in your workflow to retrieve the value.
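For example, inside workflow.xml (the property name is hypothetical):
<property>
<name>filename3</name>
<value>${wf:conf('filename3')}</value>
</property>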
Are you sure you can use Oozie expressions in configuration files?
When I need to run a workflow manually, I define a property in a config file and submit that workflow to Oozie.
configuration.properties:
filename3=/user/john/data.tsv
workflow.xml:
...
<property>
<name>filename3</name>
<value>${filename3}</value>
</property>
...
Then to run this workflow:
oozie job -config configuration.properties -run
