How to use the Apache Oozie Java lib - hadoop

I'm new to Apache Oozie. As far as I understand, we can define a workflow of actions using either a workflow file or a coordinator file, both in XML format, and submit it to the Oozie engine. However, there is an Oozie Java library as well, and I'd like to know how to use it. Is it for programmatically generating the XML files and submitting them to the engine? Can someone point me to an example, please?
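For reference, the library in question is the Oozie client API (org.apache.oozie.client.OozieClient). It does not generate the workflow XML for you: you still author workflow.xml and upload it to HDFS, and the client then submits and monitors jobs through the Oozie server's REST endpoint, playing the same role as the oozie command-line tool. A minimal sketch (the server URL, host names, and paths below are placeholders):

import java.util.Properties
import org.apache.oozie.client.OozieClient

object SubmitWorkflow {
  def main(args: Array[String]): Unit = {
    // the client talks to the Oozie server's REST endpoint
    val oozie = new OozieClient("http://oozie-host:11000/oozie")

    // equivalent of job.properties; APP_PATH points at the HDFS dir holding workflow.xml
    val conf: Properties = oozie.createConfiguration()
    conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/me/my-wf")
    conf.setProperty("jobTracker", "resourcemanager-host:8032")
    conf.setProperty("nameNode", "hdfs://namenode:8020")

    // submit and start the workflow, then check its status
    val jobId = oozie.run(conf)
    println("Submitted workflow " + jobId + ", status: " + oozie.getJobInfo(jobId).getStatus)
  }
}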

Related

Test framework for Spark Application validations

I am looking for your suggestions/help on a testing framework for one of our Spark applications.
We have a Spark application which processes input data from HDFS and pushes the processed output back to HDFS. We are planning to automate the testing of this Spark application.
I would appreciate any suggestions on how to automate the testing, or whether any framework is available for testing Spark applications/jobs.
-Sri
Spark code can be tested without any additional Spark-specific framework. Just set the master to "local" in the configuration:
val config = new SparkConf().setAppName("test").setMaster("local")
The local file system is then used in place of HDFS by default, and this approach works with the usual test frameworks (ScalaTest, etc.).
Note: the SparkContext must be declared as a singleton shared by all tests, since only one context can be active per JVM.
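For illustration, a minimal sketch of such a test with ScalaTest (the suite and the word-count logic are made up for the example); the context is created once in beforeAll and stopped in afterAll:

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class WordCountSpec extends FunSuite with BeforeAndAfterAll {
  private var sc: SparkContext = _

  override def beforeAll(): Unit = {
    // "local[2]" runs Spark inside the test JVM with two worker threads
    sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local[2]"))
  }

  override def afterAll(): Unit = {
    sc.stop()
  }

  test("counts words") {
    val counts = sc.parallelize(Seq("a", "b", "a"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .collectAsMap()
    assert(counts("a") == 2)
  }
}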

Correct way of submitting a job to YARN cluster in which job has dependencies on external jars?

I am trying to understand the correct way of submitting an MR (or, for that matter, a Spark-based Java) job to a YARN cluster.
Consider the following situation:
A developer writes the (MR or Spark) job code on a client machine, and the code uses third-party jars. When the developer has to submit the job to the YARN cluster, what is the correct way of doing so, such that there is no runtime ClassNotFoundException? Since the job is submitted as a jar file, how can the developer "put" in the third-party jars?
I am having difficulty understanding this; can anyone help me?
You simply have to build a "fat jar", with Gradle or Maven, that contains not only your compiled code but also all transitive dependencies.
You can use either the Maven Assembly Plugin or one of the Gradle plugins such as the Shadow Plugin.
The output of these is what you should supply to spark-submit.
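For example, with the Maven Assembly Plugin the usual setup is the built-in jar-with-dependencies descriptor; mvn package then additionally produces a <artifactId>-<version>-jar-with-dependencies.jar containing your classes plus all transitive dependencies:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-assembly-plugin</artifactId>
  <configuration>
    <descriptorRefs>
      <descriptorRef>jar-with-dependencies</descriptorRef>
    </descriptorRefs>
  </configuration>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>single</goal>
      </goals>
    </execution>
  </executions>
</plugin>

One caveat: mark the Spark and Hadoop artifacts themselves as provided, since the cluster already supplies them; shipping them inside the fat jar only bloats it and risks version conflicts.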

Add Spark to Oozie shared lib

By default, Oozie's shared lib directory provides libraries for Hive, Pig, and Map-Reduce. If I want to run a Spark job on Oozie, it might be better to add the Spark lib jars to Oozie's shared lib instead of copying them to the app's lib directory.
How can I add the Spark lib jars (including spark-core and its dependencies) to Oozie's shared lib? Any comment/answer is appreciated.
The Spark action is scheduled to be released with Oozie 4.2.0, even though the docs seem to be a bit behind. See the related JIRA here:
Oozie JIRA - Add spark action executor
Cloudera's CDH 5.4 release already has it, though; see the official doc here:
CDH 5.4 oozie doc - Oozie Spark Action Extension
With older versions of Oozie, the jars can be shared in various ways. The first approach will probably work best; here is the complete list anyway.
Below are the various ways to include a jar with your workflow:
1. Set oozie.libpath=/path/to/jars,another/path/to/jars in job.properties (see the sketch after this list).
This is useful if you have many workflows that all need the same jar; you can put it in one place in HDFS and use it with many workflows. The jars will be available to all actions in that workflow.
There is no need to ever point this at the ShareLib location. (I see that in a lot of workflows.) Oozie knows where the ShareLib is and will include it automatically if you set oozie.use.system.libpath=true in job.properties.
2. Create a directory named “lib” next to your workflow.xml in HDFS and put jars in there.
This is useful if you have some jars that you only need for one workflow. Oozie will automatically make those jars available to all actions in that workflow.
3. Specify the <file> tag in an action with the path to a single jar; you can have multiple <file> tags.
This is useful if you want some jars only for a specific action and not all actions in a workflow.
The downside is that you have to specify them in your workflow.xml, so if you ever need to add/remove some jars, you have to change your workflow.xml.
4. Add jars to the ShareLib (e.g. /user/oozie/share/lib/lib_<timestamp>/pig)
While this will work, it’s not recommended for two reasons:
The additional jars will be included with every workflow using that ShareLib, which may be unexpected to those workflows and users.
When upgrading the ShareLib, you’ll have to recopy the additional jars to the new ShareLib.
quoted from Robert Kanter's blog here: How-to: Use the ShareLib in Apache Oozie (CDH 5)
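Putting the first option together, a minimal job.properties sketch (the jar paths are the placeholders from the quote above):

# include the system ShareLib (Hive, Pig, etc.) automatically
oozie.use.system.libpath=true
# extra HDFS directories whose jars are added to every action's classpath
oozie.libpath=/path/to/jars,another/path/to/jars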

Orchestration of Apache Spark using Apache Oozie

We are thinking about integrating Apache Spark into our calculation process, where at first we wanted to use Apache Oozie with standard MR or MO (Map-Only) jobs.
After some research several questions remain:
Is it possible to orchestrate an Apache Spark process using Apache Oozie? If yes, how?
Is Oozie even necessary anymore, or could Spark handle the orchestration by itself? (Unification seems to be one of the main concerns of Spark.)
Please consider the following scenarios when answering:
executing a workflow every 4 hours
executing a workflow whenever specific data becomes accessible
triggering a workflow and configuring it with parameters
Thanks for your answers in advance.
Spark is supported as an action type in Oozie 4.2; see the docs. The scenarios you mention are standard Oozie features: a coordinator can run a workflow on a fixed frequency or when input data becomes available, and workflows are parameterized through job properties.
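As for the "how": since Oozie 4.2 a workflow can contain a spark action directly. A sketch along the lines of the documented schema (the names, class, and paths are placeholders):

<workflow-app name="spark-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="spark-node"/>
    <action name="spark-node">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <master>yarn-cluster</master>
            <name>my-spark-job</name>
            <class>com.example.MySparkJob</class>
            <jar>${nameNode}/apps/spark-wf/lib/my-spark-job.jar</jar>
            <arg>${inputDir}</arg>
        </spark>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Spark action failed</message>
    </kill>
    <end name="end"/>
</workflow-app>

Your first scenario is then a coordinator wrapped around that workflow; the data-availability trigger is expressed the same way with <datasets> and <input-events>, and parameters are passed as ${...} properties. A sketch for the 4-hour schedule:

<coordinator-app name="spark-wf-every-4h" frequency="${coord:hours(4)}"
                 start="2015-06-01T00:00Z" end="2016-06-01T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
    <action>
        <workflow>
            <app-path>${nameNode}/apps/spark-wf</app-path>
        </workflow>
    </action>
</coordinator-app>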

Running Cascading using Oozie

I'm trying to run a Cascading job using Oozie.
I am getting a java.lang.ClassNotFoundException: cascading.tap.hadoop.MultiInputSplit
I am including the Cascading jar in the workflow lib directory, but it is not being included when Cascading launches the M/R job.
Is there anyone out there using Cascading along with Oozie?
You should combine the Cascading jar with your own jar (i.e., build a single fat jar) and put that in the workflow's lib directory. Jars in lib end up on the launcher's classpath, but, as you observed, they are not automatically shipped with the M/R jobs that Cascading itself submits; bundling everything into one jar sidesteps that.
