Is remote partitioning available with JSR-352 under Spring Batch?

I would like to configure a JSR-352 batch job for remote partitioning, behind the scenes through configuration, without having to explicitly define the controller/worker in the job definition (which is not supported by the JSR-352 specification anyway). IBM WebSphere Liberty provides this capability with their "multi-JVM" feature, where you define remote partitioning within the configuration file (server.xml).
I have seen that legacy Spring Batch has the ability to support remote partitioning, but only through an explicit job definition. I don't want to use legacy Spring Batch. Instead, I want to build a solution that is portable to Java EE, including IBM Liberty.

"I don't want to use legacy Spring Batch. Instead, I want to build a solution that is portable to Java EE, including IBM Liberty."
If you really want a portable solution, you need to define your job in the Job Specification Language (JSL) of JSR-352 and launch it with:
javax.batch.runtime.BatchRuntime.getJobOperator().start(jobName, properties)
Since JSR-352 does not specify how to define a remote partitioning setup, you cannot get a portable solution for that part.
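For the portable part, here is a minimal sketch (the artifact name rangePartitionMapper, the job name myPartitionedJob, and the partition count are assumptions, not something from the question): a standard JSR-352 PartitionMapper referenced from the job's JSL, plus the BatchRuntime launch call quoted above. How the resulting partitions are distributed across JVMs remains up to the runtime vendor.
import java.util.Properties;
import javax.batch.api.partition.PartitionMapper;
import javax.batch.api.partition.PartitionPlan;
import javax.batch.api.partition.PartitionPlanImpl;
import javax.inject.Named;

// Referenced from the JSL, e.g. <partition><mapper ref="rangePartitionMapper"/></partition>.
// The spec only defines the partitions; whether they run in one JVM or many is vendor-specific.
@Named("rangePartitionMapper")
public class RangePartitionMapper implements PartitionMapper {

    @Override
    public PartitionPlan mapPartitions() throws Exception {
        PartitionPlanImpl plan = new PartitionPlanImpl();
        plan.setPartitions(4);   // number of partitions
        plan.setThreads(4);      // threads used when the runtime executes them locally
        Properties[] props = new Properties[4];
        for (int i = 0; i < 4; i++) {
            props[i] = new Properties();
            props[i].setProperty("partitionIndex", String.valueOf(i));
        }
        plan.setPartitionProperties(props);
        return plan;
    }
}

// Portable launch, exactly as quoted above:
// long executionId = javax.batch.runtime.BatchRuntime.getJobOperator()
//         .start("myPartitionedJob", new Properties());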
Please note that there are some differences between Spring Batch and JSR-352 which are detailed in the JSR-352 Support section. The differences regarding partitioning are documented here.

Related

Dynamic Camel route configuration at deployment time: Java DSL or XML DSL?

Let me preface this with the fact that I am still very new to Apache Camel. I'm still trying to understand how it all works, and what needs to be done (and HOW to do it) to achieve a particular effect.
I am trying to develop a Spring Boot application that will use Apache Camel to handle the transmission (and possibly also receipt) of data to/from a number of possible sources and destinations. The purpose of the application is to provide a means to produce/generate network traffic, at the network application level, that will be fed into another Spring Boot application - let's call this the target. We are trying to observe and measure the effects various network loads have on the target.
We would like to be able to transmit data via a number of protocols, including: ftp, http/s, file systems (nfs), various mail protocols (smtp, pop) and data streaming protocols for voice and video. There may be other protocols added at a later time. The data itself is irrelevant, we just need to be able to transmit data via various protocols with various loads.
These applications/services will be running in a containerized environment (Docker) that will be run within our local development and test environment, as well as possibly in a cloud environment, such as AWS. We have used Docker, Ansible, Terraform and are currently working towards using Kubernetes and Istio to manage the configuration, deployment, and operation of these applications.
We need to be able to provide specific configurations of Camel routes for particular deployments.
It would appear that the preferred method to configure Camel routes is via the Java DSL rather than the XML DSL. The Camel documentation and nearly every other source of information I've found have a strong bias towards using the Java DSL; examples of XML DSL route configuration are few and far between.
My initial impression is that going the Java DSL route (excuse the pun) would not work well with our need to deploy a Camel application with a specific route configuration. It seems like Java DSL route configurations have to be hardwired into the code.
We think that it will be easier to provide a specific route configuration via an XML file that can be included in a deployment, hence why I've been trying to investigate and experiment with XML DSL. Perhaps we are mistaken in this regard.
My question to the community is: Considering what I've described above, can the Java DSL approach be used to meet the requirements as I've described them? Can we use Java DSL in a way that allows for dynamic route configuration? Keep in mind we would not be attempting to change configuration during operation, just in the course of performing a deployment.
If Java DSL could be used for this purpose, it would be very much appreciated if pointers to documentation, examples, etc. could be provided.
For your use cases you could use the XML DSL as well. In any case, the book below covers most aspects of Camel development with examples, and the authors show the XML DSL equivalent for most of the Java DSL examples.
https://www.manning.com/books/camel-in-action-second-edition
The GitHub repository below contains the source code for all the examples listed in the book above.
https://github.com/camelinaction/camelinaction2
A simple tutorial and GitHub repository for Apache Camel with Spring Boot:
https://www.baeldung.com/apache-camel-spring-boot
https://github.com/eugenp/tutorials/tree/master/spring-boot-modules/spring-boot-camel
A Maven plugin for building and deploying Spring Boot container applications into a Kubernetes cluster:
https://maven.fabric8.io/
If your company can afford some funding for this effort, the link below lists commercial offerings around Camel:
https://camel.apache.org/manual/latest/commercial-camel-offerings.html
Thanks
Madhu Gupta
Our team has a few projects which use the Java DSL for building routes. To make them dynamic, there are control structures that iterate over configuration and set the endpoints accordingly (sketched below). That works for us because the routes are basically all the same, just with different sources and sinks.
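As an illustration of that approach, here is a minimal sketch of a RouteBuilder driven by external configuration; the RouteSpec holder and the example URIs are hypothetical, not taken from the answer above.
import java.util.List;
import org.apache.camel.builder.RouteBuilder;

// The DSL is plain Java, so loops and conditionals over injected configuration
// decide which routes exist in a given deployment.
public class ConfiguredRouteBuilder extends RouteBuilder {

    // Hypothetical holder for one configured route.
    public static class RouteSpec {
        final String id;
        final String sourceUri;
        final String targetUri;

        public RouteSpec(String id, String sourceUri, String targetUri) {
            this.id = id;
            this.sourceUri = sourceUri;
            this.targetUri = targetUri;
        }
    }

    private final List<RouteSpec> specs;

    public ConfiguredRouteBuilder(List<RouteSpec> specs) {
        this.specs = specs;
    }

    @Override
    public void configure() {
        for (RouteSpec spec : specs) {
            from(spec.sourceUri)        // e.g. "ftp://source-host/outbox?delete=true"
                .routeId(spec.id)
                .to(spec.targetUri);    // e.g. "http://target:8080/ingest"
        }
    }
}
Registering the builder as a Spring bean lets the injected configuration (application properties, environment variables, a mounted file) determine the routes per deployment without touching the code.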
If you could dynamically add/change the XML DSL files in a way that doesn't involve redeploying your application, that might be a viable route to follow. One might, for example, change the camel.springboot.xml-routes property to point to a folder which changes as needed.

ETL in Java Spring Batch vs Apache Spark Benchmarking

I have been working with Apache Spark + Scala for over 5 years now (academic and professional experience). I have always found Spark/Scala to be one of the more robust combos for building any kind of batch or streaming ETL/ELT application.
But lately, my client decided to use Java Spring Batch for 2 of our major pipelines:
Read from MongoDB --> Business Logic --> Write to JSON File (~ 2GB | 600k Rows)
Read from Cassandra --> Business Logic --> Write JSON File (~ 4GB | 2M Rows)
I was pretty baffled by this enterprise-level decision. I agree there are greater minds than mine in the industry but I was unable to comprehend the need of making this move.
My Questions here are:
Has anybody compared the performances between Apache Spark and Java Spring Batch?
What could be the advantages of using Spring Batch over Spark?
Is Spring Batch "truly distributed" when compared to Apache Spark? I came across methods like chunk() and partition() in the official docs, but I was not convinced that it is truly distributed. After all, Spring Batch runs on a single JVM instance, doesn't it?
I'm unable to wrap my head around these. So, I want to use this platform for an open discussion between Spring Batch and Apache Spark.
As the lead of the Spring Batch project, I’m sure you’ll understand I have a specific perspective. However, before beginning, I should call out that the frameworks we are talking about were designed for two very different use cases. Spring Batch was designed to handle traditional, enterprise batch processing on the JVM. It was designed to apply well-understood patterns that are commonplace in enterprise batch processing and make them convenient in a framework for the JVM. Spark, on the other hand, was designed for big data and machine learning use cases. Those use cases have different patterns, challenges, and goals than a traditional enterprise batch system, and that is reflected in the design of the framework. That being said, here are my answers to your specific questions.
Has anybody compared the performances between Apache Spark and Java Spring Batch?
No one can really answer this question for you. Performance benchmarks are a very specific thing. Use cases matter. Hardware matters. I encourage you to do your own benchmarks and performance profiling to determine what works best for your use cases in your deployment topologies.
What could be the advantages of using Spring Batch over Spark?
Programming model similar to other enterprise workloads
Enterprises need to be aware of the resources they have on hand when making architectural decisions. Is using new technology X worth the retraining or hiring overhead of technology Y? In the case of Spark vs Spring Batch, the ramp-up for an existing Spring developer on Spring Batch is very minimal. I can take any developer that is comfortable with Spring and make them fully productive with Spring Batch very quickly. Spark has a steeper learning curve for the average enterprise developer, not only because of the overhead of learning the Spark framework itself but also because of all the related technologies needed to productionize a Spark job in that ecosystem (HDFS, Oozie, etc.).
No dedicated infrastructure required
When running Spark in a distributed environment, you need to configure a cluster using YARN, Mesos, or Spark’s own clustering installation (there is an experimental Kubernetes option available at the time of this writing, but, as noted, it is labeled as experimental). This requires dedicated infrastructure for a specific use case. Spring Batch can be deployed on any infrastructure. You can execute it via Spring Boot with executable JAR files, you can deploy it into servlet containers or application servers, and you can run Spring Batch jobs via YARN or any cloud provider. Moreover, if you use Spring Boot’s executable JAR concept, there is nothing to set up in advance, even if you are running a distributed application on the same cloud-based infrastructure you run your other workloads on.
More out of the box readers/writers simplify job creation
The Spark ecosystem is focused around big data use cases. Because of that, the components it provides out of the box for reading and writing are focused on those use cases. Things like different serialization options for reading files commonly used in big data use cases are handled natively. However, processing things like chunks of records within a transaction is not.
Spring Batch, on the other hand, provides a complete suite of components for declarative input and output: reading and writing flat files, XML files, databases, NoSQL stores, and messaging queues, writing emails...the list goes on. Spring Batch provides all of those out of the box.
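As a rough sketch of what that declarative I/O looks like for the question's first pipeline (MongoDB to a JSON file) using out-of-the-box components; the Trade type, query, chunk size, and file path are placeholders rather than a benchmark-ready configuration:
import java.util.Collections;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.data.MongoItemReader;
import org.springframework.batch.item.data.builder.MongoItemReaderBuilder;
import org.springframework.batch.item.json.JacksonJsonObjectMarshaller;
import org.springframework.batch.item.json.JsonFileItemWriter;
import org.springframework.batch.item.json.builder.JsonFileItemWriterBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.FileSystemResource;
import org.springframework.data.domain.Sort;
import org.springframework.data.mongodb.core.MongoTemplate;

// Assumes @EnableBatchProcessing elsewhere so StepBuilderFactory is available.
@Configuration
public class MongoToJsonStepConfig {

    // Hypothetical domain type standing in for the question's documents.
    public static class Trade {
        public String id;
        public double amount;
    }

    @Bean
    public MongoItemReader<Trade> tradeReader(MongoTemplate mongoTemplate) {
        return new MongoItemReaderBuilder<Trade>()
                .name("tradeReader")
                .template(mongoTemplate)
                .targetType(Trade.class)
                .jsonQuery("{}")                                        // read everything
                .sorts(Collections.singletonMap("_id", Sort.Direction.ASC))
                .pageSize(1000)
                .build();
    }

    @Bean
    public JsonFileItemWriter<Trade> tradeWriter() {
        return new JsonFileItemWriterBuilder<Trade>()
                .name("tradeWriter")
                .jsonObjectMarshaller(new JacksonJsonObjectMarshaller<>())
                .resource(new FileSystemResource("/tmp/trades.json"))   // placeholder path
                .build();
    }

    @Bean
    public Step mongoToJsonStep(StepBuilderFactory steps,
                                MongoItemReader<Trade> reader,
                                JsonFileItemWriter<Trade> writer) {
        return steps.get("mongoToJsonStep")
                .<Trade, Trade>chunk(1000)          // chunk-oriented, transactional processing
                .reader(reader)
                // .processor(businessLogicProcessor) // plug the business logic in here
                .writer(writer)
                .build();
    }
}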
Spark was built for big data...not all use cases are big data use cases
In short, Spark’s features are specific for the domain it was built for: big data and machine learning. Things like transaction management (or transactions at all) do not exist in Spark. The idea of rolling back when an error occurs doesn’t exist (to my knowledge) without custom code. More robust error handling use cases like skip/retry are not provided at the level of the framework. State management for things like restarting is much heavier in Spark than Spring Batch (persisting the entire RDD vs storing trivial state for specific components). All of these features are native features of Spring Batch.
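For instance, skip/retry semantics are declared at the framework level in Spring Batch; a minimal sketch (the reader/writer beans and the exception choices are placeholders):
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.dao.DeadlockLoserDataAccessException;

@Configuration
public class FaultTolerantStepConfig {

    @Bean
    public Step resilientStep(StepBuilderFactory steps,
                              ItemReader<String> reader,
                              ItemWriter<String> writer) {
        return steps.get("resilientStep")
                .<String, String>chunk(100)                    // items written in a transaction per chunk
                .reader(reader)
                .writer(writer)
                .faultTolerant()
                .skip(IllegalArgumentException.class)          // bad records are skipped...
                .skipLimit(10)                                 // ...up to a limit
                .retry(DeadlockLoserDataAccessException.class) // transient failures are retried
                .retryLimit(3)
                .build();
    }
}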
Is Spring Batch “truly distributed”
One of the advantages of Spring Batch is the ability to evolve a batch process from a simple sequentially executed, single JVM process to a fully distributed, clustered solution with minimal changes. Spring Batch supports two main distributed modes:
Remote Partitioning - Here Spring Batch runs in a master/worker configuration. The master delegates work to the workers based on the orchestration mechanism (there are many options here). Full restartability, error handling, etc. is all available for this approach with minimal network overhead (only metadata describing each partition is transmitted) to the remote JVMs. Spring Cloud Task also provides extensions to Spring Batch that allow for cloud-native mechanisms for dynamically deploying the workers. (See the Partitioner sketch below.)
Remote Chunking - Remote chunking delegates only the processing and writing phases of a step to a remote JVM. Still using a master/worker configuration, the master is responsible for providing the data to the workers for processing and writing. In this topology, the data travels over the wire, causing a heavier network load. It is typically used only when the processing advantages can surpass the overhead of the added network traffic.
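To make the "metadata only" point about remote partitioning concrete, here is a sketch of a Partitioner that describes each partition as a small ExecutionContext (the id-range scheme is hypothetical); only this metadata travels to the worker JVMs, which then read their own slice of the data locally:
import java.util.HashMap;
import java.util.Map;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class IdRangePartitioner implements Partitioner {

    private final long minId;
    private final long maxId;

    public IdRangePartitioner(long minId, long maxId) {
        this.minId = minId;
        this.maxId = maxId;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        long rangeSize = (maxId - minId + 1) / gridSize;
        Map<String, ExecutionContext> partitions = new HashMap<>();
        for (int i = 0; i < gridSize; i++) {
            ExecutionContext context = new ExecutionContext();
            context.putLong("minId", minId + i * rangeSize);
            // the last partition absorbs any remainder
            context.putLong("maxId", i == gridSize - 1 ? maxId : minId + (i + 1) * rangeSize - 1);
            partitions.put("partition" + i, context);
        }
        return partitions;
    }
}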
There are other Stack Overflow answers that discuss these features in further detail (as does the documentation):
Advantages of spring batch
Difference between spring batch remote chunking and remote partitioning
Spring Batch Documentation

Spring Boot best practices for hiding or encrypting passwords

I have been using the Spring Framework for about 4 years now, and now Spring Boot for the last couple of months. My Spring MVC applications are usually deployed on a Java EE container such as JBoss/WildFly or WebLogic. Doing so allows me to use JNDI for things like datasources or any other sensitive data that involve secrets/passwords. That makes my app "consume" that JNDI resource based on its name.
Now with Spring Boot and especially for self-contained microservices (embedded tomcat), that information is now stored within the application (application.properties and/or in Spring Java Config classes), so versioned in Git.
That makes that information a lot more exposed to other developers, which I'm not very comfortable with. I also don't like having those details show up in SonarQube and Jenkins (through workspaces).
Question is: Are there any best practices for this specific requirement?
* UPDATE *
I see some articles here and there about the use of Jasypt, but I wonder if it's still a valid library to use, since its last stable release dates from 2014.
Thank you
You could consider using a vault. Spring supports a few of them out of the box. You can find more information here http://projects.spring.io/spring-vault/.
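For illustration only, a minimal sketch of reading a secret with Spring Vault's VaultTemplate; the endpoint, token, path, and key are placeholders:
import org.springframework.vault.authentication.TokenAuthentication;
import org.springframework.vault.client.VaultEndpoint;
import org.springframework.vault.core.VaultTemplate;
import org.springframework.vault.support.VaultResponse;

public class VaultExample {

    public static void main(String[] args) {
        // Connects to a local dev Vault; in production the endpoint and the
        // authentication mechanism (token, AppRole, Kubernetes, ...) would differ.
        VaultTemplate vault = new VaultTemplate(
                VaultEndpoint.create("localhost", 8200),
                new TokenAuthentication("00000000-0000-0000-0000-000000000000"));

        VaultResponse response = vault.read("secret/myapp");
        if (response != null && response.getData() != null) {
            Object dbPassword = response.getData().get("db.password");
            System.out.println("db.password present: " + (dbPassword != null));
        }
    }
}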
If you have Spring Cloud in your stack, then it's very easy: encrypt the value and put it in the application properties. Follow the instructions mentioned here.
Another way is to set the values as environment variables and reference those environment variables in the application properties. Instructions here
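A small sketch of the environment-variable approach; the DB_PASSWORD variable name is an assumption. In application.properties you would reference it as something like spring.datasource.password=${DB_PASSWORD}, or resolve it directly in Java config:
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Configuration;

@Configuration
public class DataSourceSecretConfig {

    // Resolved from the runtime environment at startup, so the secret never
    // lives in Git; startup fails fast if DB_PASSWORD is not set.
    @Value("${DB_PASSWORD}")
    private String dbPassword;

    public String getDbPassword() {
        return dbPassword;
    }
}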

Asynchronous task execution using Spring in container managed environment

I want to run a few tasks asynchronously in a web application. My question is which Spring task executor implementation I should use in a container-managed environment.
I referred to this chapter in the Spring documentation and found a few options.
One option I considered is WorkManagerTaskExecutor. It is very simple and works seamlessly with the IBM WebSphere server I'm currently using, but it is specific to IBM WebSphere and Oracle WebLogic servers. I don't want to tie my code to one particular implementation, because in some test and local regions we use a Jetty container, and this implementation causes problems when running the code in Jetty.
Other options like SimpleThreadPoolTaskExecutor do not seem to be the best fit for leveraging thread pooling in a container-managed environment, and I don't want to create new threads myself.
Could you please suggest how I should go about this? Any pointers to a sample implementation would be a great help.
As usual, it depends. If you rely on the container's thread management and want to be able to set thread pools in its admin interface, or if your application is not the only app inside the container, or you use specific features like setting thread pool priorities for EJB or JMS, you should add support for the WorkManagerTaskExecutor and make it configurable. If not, you can use whatever you want, because in the end threads are just threads, and since Spring is an IoC container you can swap the implementation. To use the same app everywhere, I wouldn't suggest changing the XML config per app version. Rather:
use profiles to set the executor type and, inside your Java config, return the proper bean type (see the sketch after this answer). If you use Jetty, you should have configuration for the thread pool sizes so you can tune it.
use Spring Boot-style auto-configuration, which usually relies on classes being available on the classpath (@ConditionalOnClass). If your WebLogic or WebSphere specific classes are available, or some other container-specific marker such as an environment variable is present, you can create the WorkManagerTaskExecutor.
With both of these you can deploy the same war everywhere.
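A minimal sketch of the profile-based option mentioned above; the profile names, pool sizes, and the wm/default JNDI name are assumptions:
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.Profile;
import org.springframework.core.task.TaskExecutor;
import org.springframework.scheduling.commonj.WorkManagerTaskExecutor;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
public class TaskExecutorConfig {

    // Container-managed threads on WebSphere/WebLogic via the CommonJ work manager.
    @Bean
    @Profile("websphere")
    public TaskExecutor containerManagedExecutor() {
        WorkManagerTaskExecutor executor = new WorkManagerTaskExecutor();
        executor.setWorkManagerName("wm/default"); // JNDI name of the container work manager
        return executor;
    }

    // Plain thread pool for Jetty, local runs, and tests.
    @Bean
    @Profile("!websphere")
    public TaskExecutor standaloneExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(4);
        executor.setMaxPoolSize(16);
        executor.setThreadNamePrefix("async-task-");
        return executor;
    }
}
The application code only ever injects TaskExecutor, so the same WAR runs everywhere and the active profile picks the backing implementation.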

How to distribute websphere server configuration (datasources,jms,...) to multiple instances?

For a team of developers it is essential that everybody sets up and configures the application server. In our case we are using WebSphere 8.5.
I'm looking for an easy way to do this. Normally you do it using the Profile Management Tool located in WAS_HOME/bin/ProfileManagement, and this tool works quite well. But after the installation of the WebSphere server one still needs to configure the server profile - creating datasources, JMS queues, buses, variables and so on. So I thought it would be nice if there were a way to apply these configurations to an existing profile.
My first try was to just configure one profile and then make a configuration backup using
%WAS_HOME%/bin/backupConfig.bat
But the configuration contains e.g. the hostname and other host dependent configurations. So the backupConfig.bat tool is not what I'm looking for.
The next thought that came to mind was that I could create a special profile template, so that others can use the Profile Management Tool with that template. But the template structure does not seem to be made for customization: it consists of a lot of files, and there is nearly no documentation on how to create your own profile template.
So I came across the augment templates. These templates are used (as the name implies) to add specific configuration to an existing profile. I found a lot of documentation on how to apply an augmentation to an existing profile, but none on how to create an augmentation.
Finally, I think there must be some way of exporting WebSphere datasource, bus, JMS, etc. configuration and applying it to other profiles, because in very big installations the operations team must have this ability.
I know that I can add container-specific descriptors to the EAR, e.g. META-INF/ibmconfig/cells/defaultCell/applications/defaultApp/resources.xml. But I don't want to build environment-specific EAR files, because that couples our builds to the infrastructure, and thus we would have to rebuild and redeploy whenever operations changes the infrastructure, e.g. hostnames, IPs, passwords.
Does anyone know how to manage the distribution of datasources, buses, JMS, etc. to multiple WebSphere installations?
In addition to wsadmin scripts, which are very good for these kinds of tasks, I'd suggest properties-based configuration. It might be more useful for you, since it allows you to export many configuration objects at one time and then apply them to different environments. It might also be a bit easier, since you work on plain text files instead of Jython scripts.
Properties file based configuration enables you to:
Extract data out of the configuration repository to create properties files.
Update a properties file to manipulate the configuration, as needed.
Apply the updated data in the properties file to a target configuration repository.
See more details here:
Properties-based configuration
Infocenter documentation
Education assistant
I suggest you use wsadmin scripting and create a script for resource creation. A bonus is that you can run it directly from RAD (right-click, Run As -> Administrative Script).
Here is a complete example written in Jython for JDBC resource creation along with the JAAS login information (note: I'm using Oracle Database; your setup could differ depending on the database you are using):
cell=AdminConfig.showAttribute(AdminConfig.list("Cell"), "name")
node=AdminConfig.showAttribute(AdminConfig.list("Node"), "name")
#Add JAAS credentials
print "Adding JAAS credentials"
security = AdminConfig.getid('/Cell:'+cell+'/Security:/')
alias = ['alias', node+'/dbUser']
userid = ['userId', 'DBUSER']
password = ['password', 'PASSWORD']
jaasAttrs = [alias, userid, password]
AdminConfig.create('JAASAuthData', security, jaasAttrs)
AdminConfig.save()
#Add JDBC jar path
print "Adding JDBC jar path"
AdminTask.setVariable('[-variableName ORACLE_JDBC_DRIVER_PATH -variableValue ${WAS_INSTALL_ROOT}/lib/ext -scope Cell='+cell+',Node='+node+']')
AdminConfig.save()
#JDBC Provider
print "Adding JDBC Provider"
AdminTask.createJDBCProvider('[-scope Node='+node+',Server=server1 -databaseType Oracle -providerType "Oracle JDBC Driver" -implementationType "Connection pool data source" -name "Oracle JDBC Driver" -description "Oracle JDBC Driver-compliant Provider." -classpath ${ORACLE_JDBC_DRIVER_PATH}/ojdbc6.jar]')
AdminConfig.save()
#JDBC Datasources
print "Creating Datasource"
AdminJDBC.createDataSourceAtScope("Node="+node+",Server=server1", "Oracle JDBC Driver", "test", "jdbc/test", "com.ibm.websphere.rsadapter.Oracle11gDataStoreHelper", "jdbc:oracle:thin:@10.0.0.1:1521:TEST", [['componentManagedAuthenticationAlias', node+'/dbUser'], ['containerManagedPersistence', 'true']])
AdminConfig.save()
I just remembered the wsadmin tool, and I guess it is the best way to implement my requirements.
Fortunately, IBM provides sample scripts that show you how to create or modify datasources using Jython or JACL scripts.
An example of how to create datasources can be found in the Administration scripts (1-12) -- Jython version (file ex7.py in the zip).
Hope this helps others who have the same or a similar question.
