Run a MapReduce jar in Spring Cloud Data Flow - hadoop

I need to run a MapReduce Spring Boot application in Spring Cloud Data Flow. Usually, applications registered in SCDF are executed with the "java -jar jar-name" command, but my program is a MapReduce job and has to be executed with "hadoop jar jar-name". How do I achieve this? What would be a better approach to running a MapReduce application in SCDF? Is it possible to register MapReduce apps directly?
I'm using the local Data Flow server to register the application.

In SCDF, the format of the command used to run a JAR file is managed by a deployer; for example, there is a local deployer, a Cloud Foundry deployer, and so on. There was a Hadoop/YARN deployer, but I believe it has been discontinued.
Given that the deployer itself is an SPI, you can easily implement your own, or even fork/extend the local deployer and modify only what's needed.
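For illustration, here is a minimal, hypothetical sketch of the one piece a forked local deployer would change: assembling a "hadoop jar" command line instead of "java -jar". The class and method names are made up for this example; in a real deployer the jar and its arguments would come from the AppDeploymentRequest passed through the SPI, and the sketch assumes the hadoop CLI is available on the PATH.

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: the core change a "hadoop jar" deployer would make.
public class HadoopJarLauncher {

    // In a real deployer implementation, appJar and appArgs would be taken
    // from the AppDeploymentRequest handed to the SPI's deploy(...) method.
    public static Process launch(File appJar, List<String> appArgs) throws IOException {
        List<String> command = new ArrayList<>();
        command.add("hadoop");                 // assumes the hadoop CLI is on the PATH
        command.add("jar");
        command.add(appJar.getAbsolutePath());
        command.addAll(appArgs);

        return new ProcessBuilder(command)
                .inheritIO()                   // surface the job's output in the server log
                .start();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // Placeholder jar name and arguments, purely for illustration.
        Process job = launch(new File("wordcount-mapreduce.jar"), Arrays.asList("/input", "/output"));
        System.exit(job.waitFor());
    }
}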

Related

Create spring dataflow server container with local jar included

I'd like to package the Spring Cloud Data Flow server into a container which also contains one local jar application, and publish this into a local repo. The expectation is that the end result is the same as the normal Data Flow server:
https://hub.docker.com/r/springcloud/spring-cloud-dataflow-server
just with the local jar added.
Creating the Dockerfile to include the jar is straightforward, but I'm struggling a bit with how to register the jar with the Data Flow server.
I know one option is to use the REST API, but it feels quite complicated to start the Data Flow server during the Docker image build. I found documentation suggesting that application.yml might be a way to do this as well, but couldn't figure out how exactly.
https://github.com/spring-cloud/spring-cloud-dataflow/blob/main/spring-cloud-dataflow-server/README.adoc
https://docs.spring.io/spring-boot/docs/1.5.13.RELEASE/reference/html/boot-features-external-config.html
So is there a straightforward way to package a jar into the Data Flow server Docker container?
The API is the only practical way to do it. Take a look at how we register apps with the docker-compose installation.
Technically, you could also pre-populate the associated DB table(s), but I don’t recommend this.
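As a sketch of what that API call looks like, the snippet below registers a jar baked into the image as a task app. It assumes the server is reachable on the default port 9393 and uses the documented POST /apps/{type}/{name} registration endpoint; the app name and jar path are placeholders. The same request can just as easily be issued with curl or the Data Flow shell from an entrypoint script once the server is up.

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Registers a local jar as a task app via the Data Flow REST API.
public class RegisterLocalApp {

    public static void main(String[] args) throws Exception {
        // Placeholder values: adjust the app name and the path the jar was copied to.
        String body = "uri=" + URLEncoder.encode("file:///opt/apps/my-local-app.jar", StandardCharsets.UTF_8);

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9393/apps/task/my-local-app"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Registration returned HTTP " + response.statusCode());
    }
}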

Is it suitable to use Spring Cloud Data Flow to orchestrate long-running external batch jobs inside infinitely running apps?

We have Spring Batch applications with triggers defined in each app.
Each Batch application runs tens of similar jobs with different parameters and is able to do that with 1400 MiB per app.
We use Spring Batch Admin, which was deprecated years ago, to launch individual jobs and to get a brief overview of what is going on in the jobs. The migration guide recommends replacing Spring Batch Admin with Spring Cloud Data Flow.
The Spring Cloud Data Flow docs talk about grabbing the jar from a Maven repo and running it with some parameters. I don't like the idea of waiting 20 seconds for the application to download and 2 minutes for it to launch, plus all the security/certificate/firewall issues (how can I download a proprietary jar across intranets?).
I'd like to register the existing applications in Spring Cloud Data Flow via IP/port, pass job definitions to the Spring Batch applications, and monitor executions (including the ability to stop a job). Is Spring Cloud Data Flow usable for that?
A few things to unpack here. Here's an attempt at it.
The Spring Cloud Data Flow docs talk about grabbing the jar from a Maven repo and running it with some parameters. I don't like the idea of waiting 20 seconds for the application to download and 2 minutes for it to launch, plus all the security/certificate/firewall issues
Yes, there's an app resolution process. However, once downloaded, the app is reused from the Maven cache.
As for the 2-minute bootstrap window, that depends on Boot, the number of configuration objects, and of course your business logic; perhaps in your case it all adds up to 2 minutes.
how can I download a proprietary jar across intranets?
There's an option to resolve artifacts from a Maven artifact repository (such as Artifactory) hosted behind the firewall, through proxies - we have users on this model for proprietary JARs.
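For reference, proxy and private-repository settings are passed to the Data Flow server through its maven.* properties, set for example in the server's application.yml or as --maven.* startup arguments. Property names can vary slightly between releases, so treat the snippet below as a sketch with placeholder hosts:

maven:
  remote-repositories:
    corp-repo:
      url: https://repo.example.com/libs-release
  proxy:
    host: proxy.example.com
    port: 8080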
Each Batch application runs tens of similar jobs with different parameters and is able to do that with 1400 MiB per app.
You may want to consider the Composed Task feature. It not only provides the ability to launch child tasks as directed acyclic graphs, but it also allows transitions based on exit codes at each node, to further split and branch into more tasks. All of this is automatically recorded at each execution level for further tracking and monitoring from the SCDF Dashboard.
I'd like to register existing applications in Spring Cloud Data Flow via IP/port, pass job definitions to the Spring Batch applications, and monitor executions (including the ability to stop a job).
As long as the batch jobs are wrapped into Spring Cloud Task apps, yes, you'd be able to register them in SCDF and use them in the DSL or drag and drop them onto the visual canvas to create coherent data pipelines. We have a few "batch-job as task" samples here and here.
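To make the last two points concrete, here is roughly what the registration and composed-task steps look like in the Data Flow shell; the app names and Maven coordinates are placeholders, and the && in the composed-task definition sequences the child tasks (exit-code transitions use a similar arrow syntax):

app register --name load-job --type task --uri maven://com.example:load-job-task:1.0.0
app register --name report-job --type task --uri maven://com.example:report-job-task:1.0.0
task create nightly-run --definition "load-job && report-job"
task launch nightly-run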

How to push and run a spring-batch application on Bluemix?

I have created a Spring Batch application in Spring Boot to perform a daily data-processing activity. I am also able to build and create an uber-jar that has all the dependencies in it. How do I push this jar file as a batch application (not as a web application), and how do I start the application from the command line, if possible?
For Cloud Foundry applications without a web interface, you can run them with the following manifest entry:
---
...
no-route: true
This disables route assignment, so no URL is mapped to the app - https://docs.cloudfoundry.org/devguide/deploy-apps/manifest.html#no-route
You may also want to set the health check to process:
---
...
health-check-type: process
This will monitor the exit status of your application, i.e. your Java process. If it stops (e.g. when the batch job finishes), Cloud Foundry will try to restart it - https://docs.cloudfoundry.org/devguide/deploy-apps/healthchecks.html. This assumes that you want your application to run continuously.
You will probably want to check the Java buildpack for methods of running your jar - https://github.com/cloudfoundry/java-buildpack/blob/master/README.md.
I think you will want to deploy using the buildpack's Java Main support, which runs Java applications with a main() method, provided that they are packaged as self-executable JARs. Cloud Foundry will automatically run the main() method.
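Putting the two manifest attributes from above together, a minimal manifest.yml for such a batch jar could look like the sketch below (the application name, path and memory setting are placeholders); cf push picks it up, and the buildpack's Java Main support then invokes the jar's main() method. As noted above, the process health check assumes the application keeps running.

---
applications:
- name: daily-batch
  path: target/daily-batch-0.0.1-SNAPSHOT.jar
  memory: 1G
  no-route: true
  health-check-type: process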

Test framework for Spark Application validations

I am looking for your suggestions/help with a testing framework for one of our Spark applications.
We have a Spark application which processes input data from HDFS and pushes the processed output data to HDFS. We are planning to automate the testing of this Spark application.
I would appreciate any suggestions on how to automate the testing, or whether any framework is available for testing Spark applications/jobs.
-Sri
Spark code can be checked without any additional Spark-related frameworks. Just set the configuration master to "local":
val config = new SparkConf().setMaster("local")
The local file system is used in place of HDFS by default, and this approach works with the usual test frameworks (ScalaTest, etc.).
Note: the SparkContext must be declared as a singleton shared across all tests.
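As a minimal sketch of that approach using the Java API and JUnit 4 (the ScalaTest equivalent is analogous), with the job logic replaced by a trivial placeholder computation:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.junit.Test;

import java.util.Arrays;

import static org.junit.Assert.assertEquals;

// Runs Spark in local mode so tests need no cluster; the tested logic is a placeholder.
public class LocalSparkJobTest {

    // Shared context, created once for the whole test class (see the note above
    // about keeping a single SparkContext across tests).
    private static JavaSparkContext sc;

    @BeforeClass
    public static void setUp() {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("local-test");
        sc = new JavaSparkContext(conf);
    }

    @AfterClass
    public static void tearDown() {
        sc.stop();
    }

    @Test
    public void countsDistinctRecords() {
        long distinct = sc.parallelize(Arrays.asList("a", "b", "a")).distinct().count();
        assertEquals(2L, distinct);
    }
}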

jdbc to HDFS import using spring batch job

I am able to import data from MS SQL to HDFS using jdbchdfs Spring Batch jobs. But if that container fails, the job does not shift to another container. How do I make the job fault tolerant?
I am using the Spring XD 1.0.1 release.
You don't mention which version of Spring XD you're currently using, so I can't verify the exact behavior. However, in the current version, when a container fails while a batch job is running, the job should be redeployed to a new eligible container. That being said, it will not restart the job automatically. We are currently looking at options for allowing a user to specify whether they want it restarted (there are scenarios that fall into both camps, so we need to let the user configure that).
