Hadoop integration testing

I would like to know the best way to perform integration tests in the Hadoop ecosystem.
Currently I use Hadoop, HBase and Oozie, and I was wondering what the best approach to testing their integration would be. I don't want a mock of Oozie or HBase; I want 'light-weight' instances of them, so that I could, for example, write to HBase from a web service without having to inject a mock. Similarly, I don't want a mock Oozie client, but a light-weight Oozie running on some port.
Would it be a good approach to set up a pseudo-distributed cluster on a single machine and install HBase and Oozie on top of it, or is there a better way?
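For reference, HBase ships a testing utility that starts an in-process mini cluster, and Oozie has a LocalOozie helper used in its own tests, so a full pseudo-distributed install is not strictly required. Below is a minimal Scala sketch of the HBase side, assuming the hbase-testing-util artifact is on the test classpath; the table and column-family names are placeholders:

import org.apache.hadoop.hbase.{HBaseTestingUtility, TableName}
import org.apache.hadoop.hbase.util.Bytes

object MiniHBaseIT {
  def main(args: Array[String]): Unit = {
    val util = new HBaseTestingUtility()
    util.startMiniCluster()                 // in-process ZooKeeper + HMaster + RegionServer
    try {
      // placeholder table "events" with column family "cf"
      val table = util.createTable(TableName.valueOf("events"), Bytes.toBytes("cf"))
      // hand util.getConfiguration() to the code under test instead of injecting a mock
      val zkPort = util.getConfiguration.get("hbase.zookeeper.property.clientPort")
      println(s"${table.getName} ready, ZooKeeper client port $zkPort")
    } finally {
      util.shutdownMiniCluster()
    }
  }
}

The same pattern works inside a test framework: start the mini cluster in a before-all hook, point the web service at util.getConfiguration(), and shut the cluster down afterwards.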

Related

Apache Airflow/Azkaban workflow scheduler compatibility with Hadoop MRv1

I'm working on a project that relies on Hadoop, but on the MRv1 architecture (Hadoop 1.1.2). I tried the Oozie scheduler for creating (MapReduce) workflows but eventually gave up, because it is a nightmare to configure and I couldn't get it to work. I was wondering whether I should try other workflow schedulers such as Azkaban or Apache Airflow. Would they be compatible with my requirements?

Run MapReduce Jar in Spring Cloud Data Flow

I need to run a MapReduce Spring Boot application in Spring Cloud Data Flow. Usually, applications registered in SCDF are executed with a "java -jar jar-name" command, but my program is a MapReduce job and has to be executed with "hadoop jar jar-name". How do I achieve this? What would be a better approach to run a MapReduce application in SCDF? Is it possible to register MapReduce apps directly?
I'm using the local Data Flow server to register the application.
In SCDF, the format of the command used to run a JAR file is managed by a deployer: there is a local deployer, a Cloud Foundry deployer, and so on. There was a Hadoop/YARN deployer, but I believe it has been discontinued.
Given that the deployer itself is an SPI, you can easily implement your own, or even fork/extend the local deployer and modify only what's needed; see the sketch below.
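Whatever route you take, the core change is that the deployer has to build and launch a "hadoop jar ..." command instead of "java -jar ...". Here is a minimal, framework-free Scala sketch of just that launch step; the jar path, main class and arguments are placeholders, and wiring this into the deployer SPI (request handling, status reporting) is deliberately left out:

import scala.sys.process._

object HadoopJarLauncher {
  // Runs "hadoop jar <jar> <mainClass> <args...>" and returns its exit code.
  // A real custom deployer would take these values from the deployment request.
  def launch(jarPath: String, mainClass: String, args: Seq[String]): Int = {
    val cmd = Seq("hadoop", "jar", jarPath, mainClass) ++ args
    cmd.!   // scala.sys.process: blocks until the process exits
  }

  def main(argv: Array[String]): Unit = {
    val exit = launch("my-mapreduce-job.jar", "com.example.MyDriver", Seq("/input", "/output"))
    sys.exit(exit)
  }
}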

Writing MapReduce and YARN application together

I want to run a MapReduce application using Hadoop 2.6.5 (on my own native cluster), and I also want to change some things in YARN, so I have seen that I can write my own YARN application (https://hadoop.apache.org/docs/r2.6.5/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html).
But it seems that if I run on YARN this way (with a YARN client), I can't use the MapReduce paradigm (Map and Reduce functions) with a Job class.
Is there any option to write my own ApplicationMaster and still use the MapReduce paradigm with the simple Job class?
These are some useful examples I have found on writing YARN applications:
https://github.com/noitcudni/yarn
https://github.com/hortonworks/simple-yarn-app
https://github.com/blrunner/yarn-beginners-examples/blob/master/src/main/java/com/wikibooks/hadoop/yarn/examples/MyClient.java
*Using Spring or Apache Twill results in the same problem.
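For contrast, here is a minimal Scala sketch of the plain Job-class route, where MapReduce's own MRAppMaster serves as the ApplicationMaster; once you supply your own ApplicationMaster via the YARN client API, this Job-based submission path no longer applies, which is exactly the limitation described above. The word-count mapper and reducer are only placeholders:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one = new IntWritable(1)
  private val word = new Text()
  override def map(key: LongWritable, value: Text,
                   ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    value.toString.split("\\s+").foreach { w => word.set(w); ctx.write(word, one) }
}

class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    val it = values.iterator()
    while (it.hasNext) sum += it.next().get()
    ctx.write(key, new IntWritable(sum))
  }
}

object WordCountDriver {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    conf.set("mapreduce.framework.name", "yarn") // the Job API launches MRAppMaster on YARN for you
    val job = Job.getInstance(conf, "word-count")
    job.setJarByClass(classOf[TokenMapper])
    job.setMapperClass(classOf[TokenMapper])
    job.setReducerClass(classOf[SumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1)))
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}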

Test framework for Spark Application validations

I am looking for suggestions on a testing framework for one of our Spark applications.
We have a Spark application which processes input data from HDFS and pushes the processed output data to HDFS. We are planning to automate the testing of this Spark application.
I would appreciate any suggestions on how to automate the testing, or on whether any framework is available for testing Spark applications/jobs.
-Sri
Spark code can be tested without any additional Spark-specific frameworks. Just set the master to "local" in the configuration:
import org.apache.spark.SparkConf
val config = new SparkConf().setMaster("local").setAppName("test")
The local file system is then used instead of HDFS, and this approach works with the usual test frameworks (ScalaTest, etc.).
Note: the SparkContext should be shared as a singleton across all tests, since only one active context is allowed per JVM. A fuller sketch follows below.
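A minimal sketch of how this looks as a ScalaTest suite, assuming ScalaTest 3.x package names; the word-count job is just a placeholder for the real pipeline:

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.BeforeAndAfterAll
import org.scalatest.funsuite.AnyFunSuite

class WordCountSpec extends AnyFunSuite with BeforeAndAfterAll {
  private var sc: SparkContext = _

  override def beforeAll(): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("unit-tests")
    sc = new SparkContext(conf)   // one shared context for the whole suite
  }

  override def afterAll(): Unit = sc.stop()

  test("counts words") {
    val counts = sc.parallelize(Seq("a b", "a"))
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .collectAsMap()
    assert(counts("a") == 2)
  }
}

For the HDFS-in/HDFS-out part, a common pattern is to keep the input and output paths configurable in the application and point them at local temporary directories in the tests.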

Orchestration of Apache Spark using Apache Oozie

We are thinking about integrating Apache Spark into our calculation process, where we originally wanted to use Apache Oozie with standard MR or map-only jobs.
After some research, several questions remain:
Is it possible to orchestrate an Apache Spark process using Apache Oozie? If yes, how?
Is Oozie still necessary, or could Spark handle the orchestration by itself? (Unification seems to be one of Spark's main concerns.)
Please consider the following scenarios when answering:
executing a workflow every 4 hours
executing a workflow whenever specific data becomes available
triggering a workflow and configuring it with parameters
Thanks in advance for your answers.
Spark is supported as an action type in Oozie 4.2; see the docs. The scenarios you mention are standard Oozie features: coordinators cover time- and data-availability-based triggering, and workflows are parameterisable (see the sketch below).
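For the third scenario, here is a minimal Scala sketch using the Oozie client API to submit a parameterised workflow; the server URL, HDFS application path and the sparkMaster property are placeholders for whatever your workflow actually defines:

import org.apache.oozie.client.OozieClient

object SubmitParameterisedWorkflow {
  def main(args: Array[String]): Unit = {
    val oozie = new OozieClient("http://oozie-host:11000/oozie")    // placeholder URL
    val conf = oozie.createConfiguration()
    conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode/apps/spark-wf") // placeholder path
    conf.setProperty("sparkMaster", "yarn-cluster")  // example parameter consumed by the workflow
    val jobId = oozie.run(conf)                      // submit and start the workflow
    println(s"Submitted $jobId, status: " + oozie.getJobInfo(jobId).getStatus)
  }
}

The time-based and data-availability scenarios are handled by Oozie coordinators rather than by client code.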
