Orchestration of Apache Spark using Apache Oozie - hadoop

We are considering integrating Apache Spark into our calculation process, for which we originally planned to use Apache Oozie with standard MR or MO (Map-Only) jobs.
After some research, several questions remain:
Is it possible to orchestrate an Apache Spark process using Apache Oozie? If yes, how?
Is Oozie still necessary, or can Spark handle orchestration by itself? (Unification seems to be one of Spark's main concerns.)
Please consider the following scenarios when answering:
executing a workflow every 4 hours
executing a workflow whenever specific data becomes available
triggering a workflow and configuring it with parameters
Thanks in advance for your answers.

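Oozie supports Spark as an action type as of version 4.2; see the Spark action in the Oozie documentation. The scenarios you mention are all standard Oozie features: a coordinator can run a workflow at a fixed frequency (for example every 4 hours) or once its input datasets become available, and a workflow accepts parameters when it is submitted.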
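
As a rough illustration of the third scenario, a workflow can be submitted with parameters through the Oozie Java client (`org.apache.oozie.client.OozieClient`), which is also usable from Scala. This is only a sketch under assumptions: the Oozie URL, the HDFS application path, and the `inputDir`/`outputDir` parameter names are hypothetical, and the application path is assumed to contain a `workflow.xml` with a Spark action that reads those parameters.

```scala
import java.util.Properties
import org.apache.oozie.client.OozieClient

object SubmitSparkWorkflow {
  // Builds the job configuration. APP_PATH is the standard Oozie key for the
  // workflow application path; all other keys are hypothetical parameters
  // that the workflow definition would reference as ${inputDir}, ${outputDir}.
  def buildConf(client: OozieClient,
                appPath: String,
                params: Map[String, String]): Properties = {
    val conf = client.createConfiguration()
    conf.setProperty(OozieClient.APP_PATH, appPath)
    params.foreach { case (key, value) => conf.setProperty(key, value) }
    conf
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical Oozie server URL and HDFS path.
    val client = new OozieClient("http://localhost:11000/oozie")
    val conf = buildConf(
      client,
      "hdfs://namenode:8020/user/me/apps/spark-wf",
      Map("inputDir" -> "/data/in", "outputDir" -> "/data/out"))

    // Submit and start the workflow, then print its current status.
    val jobId = client.run(conf)
    println(s"Submitted $jobId, status: ${client.getJobInfo(jobId).getStatus}")
  }
}
```

The time-based and data-based scenarios would normally be expressed in a coordinator definition rather than in client code; the client is then only needed to submit that coordinator.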

Related

Apache Airflow/Azkaban workflow Schedulers compatibility with Hadoop MRv1

I'm working on a project that relies on Hadoop, but on the MRv1 architecture (Hadoop-1.1.2). I tried the Oozie scheduler for creating workflows (mapred) but eventually gave up, because it is a nightmare to configure and I couldn't get it to work. I was wondering whether I should try other workflow schedulers such as Azkaban or Apache Airflow. Would they be compatible with my requirements?

Writing MapReduce and YARN application together

I want to run a MapReduce application on Hadoop 2.6.5 (in my own native cluster) and I want to change some things in YARN, so I looked at writing my own YARN application (https://hadoop.apache.org/docs/r2.6.5/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html).
But it seems that if I run YARN this way (with a YARN client), I can't use the MapReduce paradigm (Map and Reduce functions) with a Job class.
Is there any way to write my own ApplicationMaster and still use the MapReduce paradigm with the simple Job class?
These are some useful examples I have found on writing a YARN application:
https://github.com/noitcudni/yarn
https://github.com/hortonworks/simple-yarn-app
https://github.com/blrunner/yarn-beginners-examples/blob/master/src/main/java/com/wikibooks/hadoop/yarn/examples/MyClient.java
Note: using Spring or Twill results in the same problem.

Test framework for Spark Application validations

I am looking for suggestions on a testing framework for one of our Spark applications.
We have a Spark application that reads input data from HDFS, processes it, and writes the output back to HDFS. We are planning to automate the testing of this application.
I would appreciate any suggestions on how to automate the testing, or on whether any framework is available for testing Spark applications/jobs.
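-Sri

Spark code can be tested without any additional Spark-specific framework. Just set the master to "local" in the configuration (an application name is also required before a SparkContext can be created):
val config = new SparkConf().setMaster("local").setAppName("test")
In local mode, the local file system is used in place of HDFS by default, so this approach works in the usual test frameworks (ScalaTest, etc.).
Note: create a single SparkContext and share it across all tests, because only one SparkContext can be active per JVM.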
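
A minimal sketch of this approach, without any test framework, might look like the following. The `WordCount` job, the `TestSpark` holder, and the sample data are hypothetical names introduced only for illustration; the point is the single shared local SparkContext and the in-memory input that stands in for HDFS data.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical job logic under test: counts words across all input lines.
object WordCount {
  def count(lines: RDD[String]): RDD[(String, Int)] =
    lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)
}

// One lazily created local SparkContext shared by every test in the JVM.
object TestSpark {
  lazy val sc: SparkContext = new SparkContext(
    new SparkConf().setMaster("local[2]").setAppName("spark-tests"))
}

object WordCountCheck {
  // Runs the job on a small in-memory dataset and collects the result
  // to the driver so it can be compared with plain Scala collections.
  def run(): Map[String, Int] = {
    val input = TestSpark.sc.parallelize(Seq("a b a", "b c"))
    WordCount.count(input).collect().toMap
  }
}
```

Keeping the transformation in a function that takes and returns RDDs lets the same code run against HDFS paths in production and against `parallelize`d collections or local files in tests.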

Hadoop integration testing

What is the best way to perform integration tests in the Hadoop ecosystem?
Currently I use Hadoop, HBase and Oozie, and I was wondering what the best approach to testing their integration would be. I don't want a mock of Oozie or HBase; I want lightweight instances of them, so that I could, for example, write to HBase from a web service without having to inject a mock. Similarly, I don't want a mock Oozie client, but a lightweight Oozie running on some port.
Would it be a good approach to set up a pseudo-distributed cluster on a single machine and install HBase and Oozie on it, or is there a better way?

How to use apache Oozie java lib

I'm new to Apache Oozie. As far as I understand, we can define a workflow of actions using either a workflow file or a coordinator file, both in XML format, and submit it to the Oozie engine. However, there is also an Oozie Java library, and I'd like to know how to use it. Is it for programmatically generating the XML files and submitting them to the engine? Can someone point me to an example, please?
