Spring XD Yarn: Stream runs only on exactly two containers - spring-xd

Spring XD Yarn: Stream runs only on exactly two containers
Spring XD Yarn ver 1.2.1
1. In servers.yml, I set the number of containers to 15 (I have 16 Node Managers in my YARN cluster).
2. All 15 containers are created. I confirmed this by executing 'runtime containers' in xd-shell.
3. When I run a Spring XD stream from a Kafka source to an HDFS sink, exactly two of the 15 containers are used. The remaining 13 containers are not used. My stream runs for 6 to 7 hours, and in all of that time only two of the 15 live containers are used for this stream.
4. Please let me know how to make my stream run on all 15 live containers.
--> Is there any configuration that I missed? If so, please do the needful.

You can take a look at the deployment manifest: http://docs.spring.io/spring-xd/docs/current/reference/html/#deployment-manifest
You can use deployment properties to scale up your stream and control the module count - i.e. how many instances of each module you are deploying. I suspect that your stream runs with the default of 1, which means that you are getting exactly one source module instance and one sink module instance. The default deployment algorithm would indeed deploy them on separate containers.
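As a minimal sketch (the stream name and topic below are made up; the module.<name>.count deployment-property syntax is the one described in the manifest documentation linked above), you would create the stream without --deploy and then pass counts at deploy time:
xd> stream create --name kafkatohdfs --definition "kafka --topic=mytopic | hdfs"
xd> stream deploy --name kafkatohdfs --properties "module.kafka.count=5,module.hdfs.count=5"
A count of 0 means "as many instances as possible, one per eligible container", which is the setting that would spread the modules across all 15 of your live containers.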

Related

Prometheus Integration with Hadoop (Ozone Cluster)

I am trying to follow the Apache documentation in order to integrate Prometheus with Apache Hadoop. One of the preliminary steps is to set up an Apache Ozone cluster. However, I am running into issues running the Ozone cluster concurrently with Hadoop. It throws a ClassNotFoundException for "org.apache.hadoop.ozone.HddsDatanodeService" whenever I try to start the Ozone Manager or the Storage Container Manager.
I also found that the Ozone 1.0 release is fairly recent and is stated to have been tested with Hadoop 3.1. I have a running Hadoop cluster of version 3.3.0, so I suspect the version mismatch may be the problem.
The tarball for Ozone also ships with Hadoop config files, but I want to configure Ozone against my existing Hadoop cluster.
Please let me know what the right approach is here. If this cannot be done, then please also let me know what a good way is to monitor and extract metrics from Apache Hadoop in production.

Is it suitable to use Spring Cloud DataFlow to orchestrate long-running external batch jobs inside infinitely running apps?

We have Spring Batch applications with triggers defined in each app.
Each Batch application runs tens of similar jobs with different parameters and is able to do that with 1400 MiB per app.
We use Spring Batch Admin, which was deprecated years ago, to launch individual jobs and to get a brief overview of what is going on in the jobs. The migration guide recommends replacing Spring Batch Admin with Spring Cloud DataFlow.
The Spring Cloud DataFlow docs talk about grabbing a jar from a Maven repo and running it with some parameters. I don't like the idea of waiting 20 seconds for the application to download, 2 minutes for it to launch, and all the security/certificates/firewall issues (how can I download a proprietary jar across intranets?).
I'd like to register existing applications in Spring Cloud DataFlow via IP/port, pass job definitions to the Spring Batch applications, and monitor executions (including the ability to stop a job). Is Spring Cloud DataFlow usable for that?
A few things to unpack here. Here's an attempt at it.
The Spring Cloud DataFlow docs talk about grabbing a jar from a Maven repo and running it with some parameters. I don't like the idea of waiting 20 seconds for the application to download, 2 minutes for it to launch, and all the security/certificates/firewall issues
Yes, there is an app resolution process. However, once downloaded, we reuse the app from the Maven cache.
As for the 2-minute bootstrapping window, that comes down to Boot, the number of configuration objects, and of course your business logic. Maybe all of that adds up to 2 minutes in your case.
how can I download a proprietary jar across intranets?
There's an option to resolve artifacts from a Maven artifactory hosted behind the firewall, through proxies - we have users running this model for proprietary JARs.
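A rough sketch of what that looks like when starting the server (the repository URL and proxy host are placeholders, and the property names are the Maven resolver settings as documented for the SCDF server, so double-check them against your version):
java -jar spring-cloud-dataflow-server.jar --maven.remote-repositories.internal.url=https://artifactory.mycompany.local/libs-release --maven.proxy.host=proxy.mycompany.local --maven.proxy.port=8080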
Each Batch application runs tens of similar jobs with different parameters and is able to do that with 1400 MiB per app.
You may want to consider the Composed Task feature. It not only provides the ability to launch child Tasks as a Directed Acyclic Graph, but it also allows transitions based on exit codes at each node, to further split and branch into launching more Tasks. All of this, of course, is automatically recorded at each execution level for further tracking and monitoring from the SCDF Dashboard.
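A small illustrative sketch of the composed-task DSL in the SCDF shell (all task names here are made up); the second definition shows an exit-code transition:
dataflow:> task create nightly-load --definition "extract-job && transform-job && load-job"
dataflow:> task create nightly-load-with-transition --definition "extract-job 'FAILED' -> notify-job '*' -> transform-job"
dataflow:> task launch nightly-load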
I'd like to register existing applications in Spring Cloud DataFlow via IP/port and pass job definitions to Spring Batch applications and monitor executions (including the ability to stop a job).
As long as the batch jobs are wrapped into Spring Cloud Task apps, yes, you'd be able to register them in SCDF and use them in the DSL or drag & drop them onto the visual canvas to create coherent data pipelines. We have a few "batch-job as task" samples here and here.
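Registering and launching such a task app looks roughly like this in the SCDF shell (the Maven coordinates, app name, and arguments are illustrative):
dataflow:> app register --name my-batch-job --type task --uri maven://com.example:my-batch-job:1.0.0
dataflow:> task create my-batch-task --definition "my-batch-job"
dataflow:> task launch my-batch-task --arguments "--input=/data/in --output=/data/out"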

Improve Spring-Boot startup in Docker

For a standalone start as a plain java process:
java -jar myspring_boot.jar
it takes around 20 seconds. But if I run it in a Docker container that contains more microservices, it takes around 3 minutes.
Is there a way to speed up the Spring Boot startup time? For example, if I enable debug logging I notice that there are a lot of unnecessary validations of different Spring configurations.
How can I speed up the Spring Boot startup time, for dev purposes only, using Docker containers?
I installed the haveged daemon, as suggested in many answers out there, for example:
https://stackoverflow.com/a/39461346/2748325
I also added -XX:MaxMetaspaceSize=128m to my java CMD in the Dockerfile, and the times went down to about 2 minutes.
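For reference, a sketch of what such a java command might look like, combining the metaspace cap with the java.security.egd entropy workaround that is often suggested alongside (or instead of) haveged; the jar name is the one from the question:
java -XX:MaxMetaspaceSize=128m -Djava.security.egd=file:/dev/./urandom -jar myspring_boot.jar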

File path issue while creating simple spring XD stream

I am running Spring XD on a distributed YARN setup. I am using the Hortonworks Data Platform with 6 data nodes and 1 name node, and I am using the name node as the client node. I have invoked the XD shell from the name node, and the admin and containers are running on the data nodes. So when I create a Spring XD stream definition as below:
xd> stream create --name filetest --definition "file | log" --deploy
It looks for /tmp/xd/input/filetest on the data nodes, which I don't have access to. Is this the normal behavior of Spring XD? I think it should look for the location on the node from which I invoked the XD shell. Could you please help me with this?
The containers (regardless of whether they are running on Yarn or not) have no knowledge of where the shell is running.
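The file source therefore polls the local filesystem of whichever container the module lands on. If you need it to read from a specific, accessible location, you can set the directory explicitly in the definition and make sure that path exists on the nodes where containers run (a sketch; the path below is illustrative):
xd> stream create --name filetest --definition "file --dir=/shared/xd/input/filetest | log" --deploy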

jdbc to HDFS import using spring batch job

I am able to import data from MS SQL to HDFS using jdbchdfs Spring Batch jobs. But if that container fails, the job does not shift to another container. How do I proceed to make the job fault tolerant?
I am using the Spring XD 1.0.1 release.
You don't mention which version of Spring XD you're currently using so I can't verify the exact behavior. However, on a container failure with a batch job running in the current version, the job should be re-deployed to a new eligible container. That being said, it will not restart the job automatically. We are currently looking at options for how to allow a user to specify if they want it restarted (there are scenarios that fall into both camps so we need to allow a user to configure that).
