Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
I understand that the Spring Batch framework processes data in chunks. However, since the same chunking functionality can be achieved in plain Java, why do we need to go for a batch framework?
Could anyone please let me know if there are more reasons for going with the Spring Batch framework?
Let me rephrase your question a bit and see if this addresses it.
What does Spring Batch provide that I'd have to handle myself when building a batch application?
Spring Batch served as the basis for JSR-352 (the Java batch specification), and since that specification came out, a lot of Spring Batch's functionality is now available within the standard Java space. That being said, there is still a lot that Spring Batch provides beyond what basic Java does:
Within a "basic" batch job
Within the scope of a simple batch job, Spring Batch provides a collection of utilities and implementations that have been battle tested in all enterprise verticals. Some examples are:
Over 17 ItemReader and 15 ItemWriter implementations covering a vast range of input and output options (File, JDBC, NoSQL, JMS, etc.). All of these provide declarative I/O so that you don't have to write and test code for stateful readers and writers.
A collection of Tasklet (Spring Batch's equivalent to JSR-352's Batchlet) implementations including ones for executing shell commands and interfacing with Hadoop.
The ability to stop/start/restart jobs and maintain state between executions.
The ability to skip and retry records as they are being processed.
Transaction management. Spring Batch handles transactions for you.
The ability to notify other systems when errors occur via messaging by integrating with Spring Integration.
Java or XML based configuration.
All the Spring features like DI, AOP, testability, etc.
Vendor independence - By using Spring Batch, you get to use a framework that is open source and not tied to any one vendor.
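The chunk-oriented model behind several of the points above can be sketched in plain Java. The following is a rough, stdlib-only illustration (all names are mine, not Spring Batch API) of the read/process/write loop that a chunk-oriented step runs for you; the framework's value is everything this sketch leaves out: transactions per chunk, skip/retry, restart state, and the declarative readers and writers:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// A minimal sketch of chunk-oriented processing: read items until the
// commit interval is reached, process each one, then write the chunk as a
// single unit. In Spring Batch this loop - plus transaction management and
// restart metadata - is what the framework handles for you.
class ChunkLoopSketch {
    static <I, O> List<List<O>> run(Iterator<I> reader,
                                    Function<I, O> processor,
                                    int commitInterval) {
        List<List<O>> writtenChunks = new ArrayList<>();
        List<O> chunk = new ArrayList<>(commitInterval);
        while (reader.hasNext()) {
            chunk.add(processor.apply(reader.next()));   // process one item
            if (chunk.size() == commitInterval) {        // chunk boundary = commit point
                writtenChunks.add(chunk);                // "write" the whole chunk at once
                chunk = new ArrayList<>(commitInterval);
            }
        }
        if (!chunk.isEmpty()) {
            writtenChunks.add(chunk);                    // flush the final partial chunk
        }
        return writtenChunks;
    }
}
```

Each inner list stands in for one transactional write; in Spring Batch the commit interval is simply configuration on the step.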
Additional advantages
Beyond the above examples of what Spring Batch brings to the table, it goes much further:
Scalability options - Spring Batch provides a number of scalability options that range from within a single JVM via threads (multithreaded step, local partitioning, and splits) to multi-JVM scalability (remote partitioning and remote chunking).
Integration with Spring Integration - Spring Integration provides a number of useful elements that allow you to build robust batch applications to handle things like error messages, polling directories for files, automatically FTPing files, etc.
Big data support - Through the Spring for Apache Hadoop project, there are a number of extensions to Spring Batch that allow it to work well with Hadoop. You can run Spring Batch jobs on YARN, you can execute Pig, Hive, MapReduce, etc jobs.
Integration with Spring XD - Spring XD provides a distributed runtime for the deployment, management, and execution of batch jobs.
I personally view batch processing as the "set it and forget it" model of programming. While it isn't sexy, batch processing is a very useful model of processing and is more useful in places than most people realize. Spring Batch provides an environment that makes developing robust batch jobs as easy as possible.
Related
I'm trying to utilize Spring Batch in one of the projects that I have, as there is another project that is based on Spring Batch.
However, the more I read, the more I realize that Spring Batch is nothing like Apache Beam or MapReduce; it is only used for transferring the SAME data from one place to another, with some type mapping like varchar -> string.
However, the task at hand requires some processing: not only type mapping and conversion but also aggregations and data structuring.
Can Spring Batch be used for data processing, or is it only an ETL tool?
Well, I disagree with the point that Spring Batch is only used for transferring the same data from one place to another with some type mapping like varchar -> string.
I have worked with this technology for 4 years and have witnessed the framework grow a lot.
Spring Batch is well capable of processing data: mapping, required conversions, and data aggregations. Spring Batch can definitely be used for data processing.
Being an open-source technology, you will find a lot of material to read about it, and forums like Stack Overflow have a ton of FAQs around it.
For scaling and parallel processing there are various architectures in Spring Batch, which will help in enhancing your performance.
You can find further details here:
SPRING_BATCH_SCALING_AND_PARALLELING
If you want to monitor your jobs, then you can use Spring Cloud Data Flow.
Monitoring can also be done with AppDynamics.
Refer to this blog:
MONITOR_SPRING_BATCH_JOB_WITH_APP_DYNAMICS
Another advantage of using Spring Batch is that you have a lot of standard predefined reader, processor, and writer types, which support sources like files, databases, streams, etc.
On top of this, as it is a Java-based framework, you can do everything that can be done with Java.
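To make the "Spring Batch can definitely be used for data processing" point concrete, here is a stdlib-only sketch of an aggregation (summing order amounts per customer, with a varchar-to-double conversion along the way). In a real job this logic would typically live in an ItemProcessor or a custom ItemWriter; the class name and the record format are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of an aggregation step: group raw records by key and sum a numeric
// field. This is more than type mapping - it builds a new data structure.
// In Spring Batch this would sit inside an ItemProcessor or a custom
// ItemWriter; it is plain Java here to keep the example self-contained.
class OrderAggregator {
    // input record: "customerId,amount" (e.g. a line read from a flat file)
    static Map<String, Double> totalsByCustomer(List<String> lines) {
        Map<String, Double> totals = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.split(",");
            String customer = parts[0].trim();
            double amount = Double.parseDouble(parts[1].trim()); // varchar -> double conversion
            totals.merge(customer, amount, Double::sum);         // aggregation, not just mapping
        }
        return totals;
    }
}
```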
I hope this helps.
Your write-up below is incorrect because it compares apples to oranges:
However the more I read the more I realize that Spring batch is
nothing like ApacheBeam or MapReduce, it is only used for transferring
the SAME data from one place to another with some type mapping like
varchar -> string.
Unlike Apache Beam or MapReduce, Spring Batch is not an engine but a programming framework. A programming framework usually consists of two major components: code structure guidelines plus APIs.
So the only restriction on a Java developer is to follow the Spring Batch program structure guidelines; usage of the Spring Batch APIs is optional. Though the model is Read -> Process -> Write, a Java developer is free to write any kind of logic he or she wishes in these components - only imagination limits what a Java developer could write in them. Further on, one artifact can be integrated with another artifact.
So I reiterate: Spring Batch is a programming framework, not an engine or preconfigured software like Hadoop, so that comparison is apples to oranges.
See this - Spring Tips: Spring Batch and Apache Kafka
As I have already said, a Java developer can develop any kind of program within the program structure limitations alone; the logic being written has no bounds!
To say it one more time: Spring Batch is not an ETL tool like Informatica or Pentaho but a programming framework using Java and Spring. A developer can be as creative as he or she wants to be.
I once developed a real-time data-matching job that needed free-text search capabilities using Apache Lucene by fitting my programming into the Spring Batch model.
Spring Batch (SB) gives us all three: E, T, and L.
However, we have to decide whether or not to use SB. It is again a quantitative decision: whether an individual/team really needs to learn it if they don't already know it. You need to evaluate the ROI (return on investment). If it is just E, or T, or L only, there might be other, simpler solutions.
If we talk about Java only, and only one of these three, SB is not required. But again, when it comes to simplicity (if you know SB), scalability, monitoring, and transaction-managed parallel processing, all of these come hand in hand with SB out of the box.
I have been working with Apache Spark + Scala for over 5 years now (Academic and Professional experiences). I always found Spark/Scala to be one of the robust combos for building any kind of Batch or Streaming ETL/ ELT applications.
But lately, my client decided to use Java Spring Batch for 2 of our major pipelines:
Read from MongoDB --> Business Logic --> Write to JSON File (~ 2GB | 600k Rows)
Read from Cassandra --> Business Logic --> Write JSON File (~ 4GB | 2M Rows)
I was pretty baffled by this enterprise-level decision. I agree there are greater minds than mine in the industry but I was unable to comprehend the need of making this move.
My Questions here are:
Has anybody compared the performances between Apache Spark and Java Spring Batch?
What could be the advantages of using Spring Batch over Spark?
Is Spring Batch "truly distributed" when compared to Apache Spark? I came across methods like chunk() and partition in the official docs, but I was not convinced of its true distributedness. After all, Spring Batch runs on a single JVM instance, doesn't it?
I'm unable to wrap my head around these. So, I want to use this platform for an open discussion between Spring Batch and Apache Spark.
As the lead of the Spring Batch project, I'm sure you'll understand I have a specific perspective. However, before beginning, I should call out that the frameworks we are talking about were designed for two very different use cases. Spring Batch was designed to handle traditional, enterprise batch processing on the JVM. It was designed to apply well-understood patterns that are commonplace in enterprise batch processing and make them convenient in a framework for the JVM. Spark, on the other hand, was designed for big data and machine learning use cases. Those use cases have different patterns, challenges, and goals than a traditional enterprise batch system, and that is reflected in the design of the framework. That being said, here are my answers to your specific questions.
Has anybody compared the performances between Apache Spark and Java Spring Batch?
No one can really answer this question for you. Performance benchmarks are a very specific thing. Use cases matter. Hardware matters. I encourage you to do your own benchmarks and performance profiling to determine what works best for your use cases in your deployment topologies.
What could be the advantages of using Spring Batch over Spark?
Programming model similar to other enterprise workloads
Enterprises need to be aware of the resources they have on hand when making architectural decisions. Is using new technology X worth the retraining or hiring overhead of technology Y? In the case of Spark vs Spring Batch, the ramp-up for an existing Spring developer on Spring Batch is very minimal. I can take any developer that is comfortable with Spring and make them fully productive with Spring Batch very quickly. Spark has a steeper learning curve for the average enterprise developer, not only because of the overhead of learning the Spark framework itself but also all the related technologies needed to productionize a Spark job in that ecosystem (HDFS, Oozie, etc.).
No dedicated infrastructure required
When running Spark in a distributed environment, you need to configure a cluster using YARN, Mesos, or Spark's own clustering installation (there is an experimental Kubernetes option available at the time of this writing, but, as noted, it is labeled as experimental). This requires dedicated infrastructure for specific use cases. Spring Batch can be deployed on any infrastructure. You can execute it via Spring Boot with executable JAR files, you can deploy it into servlet containers or application servers, and you can run Spring Batch jobs via YARN or any cloud provider. Moreover, if you use Spring Boot's executable JAR concept, there is nothing to set up in advance, even if you are running a distributed application on the same cloud-based infrastructure you run your other workloads on.
More out of the box readers/writers simplify job creation
The Spark ecosystem is focused around big data use cases. Because of that, the components it provides out of the box for reading and writing are focused on those use cases. Things like different serialization options for reading files commonly used in big data use cases are handled natively. However, processing things like chunks of records within a transaction are not.
Spring Batch, on the other hand, provides a complete suite of components for declarative input and output. Reading and writing flat files, XML files, databases, NoSQL stores, messaging queues, writing emails...the list goes on. Spring Batch provides all of those out of the box.
Spark was built for big data...not all use cases are big data use cases
In short, Spark’s features are specific for the domain it was built for: big data and machine learning. Things like transaction management (or transactions at all) do not exist in Spark. The idea of rolling back when an error occurs doesn’t exist (to my knowledge) without custom code. More robust error handling use cases like skip/retry are not provided at the level of the framework. State management for things like restarting is much heavier in Spark than Spring Batch (persisting the entire RDD vs storing trivial state for specific components). All of these features are native features of Spring Batch.
Is Spring Batch “truly distributed”
One of the advantages of Spring Batch is the ability to evolve a batch process from a simple sequentially executed, single JVM process to a fully distributed, clustered solution with minimal changes. Spring Batch supports two main distributed modes:
Remote Partitioning - Here Spring Batch runs in a master/worker configuration. The master delegates work to workers based on the orchestration mechanism (there are many options here). Full restartability, error handling, etc. are all available for this approach, with minimal network overhead (only the metadata describing each partition is transmitted) to the remote JVMs. Spring Cloud Task also provides extensions to Spring Batch that allow cloud-native mechanisms to dynamically deploy the workers.
Remote Chunking - Remote chunking delegates only the processing and writing phases of a step to a remote JVM. Still using a master/worker configuration, the master is responsible for providing the data to the workers for processing and writing. In this topology, the data travels over the wire, causing a heavier network load. It is typically used only when the processing advantages can surpass the overhead of the added network traffic.
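The partitioning topology described above can be mimicked locally: the master hands out only partition metadata (here, an id range), and each worker processes its own slice independently. This stdlib-only sketch (a thread pool standing in for remote JVMs, all names of my choosing) illustrates the shape of the idea, not Spring Batch's actual Partitioner/PartitionHandler API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of the partitioning topology: the "master" only distributes
// partition metadata (an id range); each "worker" reads and processes its
// own slice. No record data crosses the master/worker boundary.
class PartitionSketch {
    record Range(int startInclusive, int endExclusive) {}

    // Master side: split the key space into roughly equal ranges.
    static List<Range> partition(int totalIds, int partitions) {
        int size = (totalIds + partitions - 1) / partitions; // ceiling division
        List<Range> ranges = new ArrayList<>();
        for (int start = 0; start < totalIds; start += size) {
            ranges.add(new Range(start, Math.min(start + size, totalIds)));
        }
        return ranges;
    }

    // Worker side: each worker "processes" its range and reports a count.
    static int processAll(int totalIds, int partitions) {
        ExecutorService workers = Executors.newFixedThreadPool(partitions);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (Range r : partition(totalIds, partitions)) {
                results.add(workers.submit(() -> r.endExclusive() - r.startInclusive()));
            }
            int processed = 0;
            for (Future<Integer> f : results) {
                try {
                    processed += f.get();
                } catch (InterruptedException | ExecutionException e) {
                    throw new RuntimeException(e);
                }
            }
            return processed;
        } finally {
            workers.shutdown();
        }
    }
}
```

The key design point mirrored here is that only the Range metadata travels to the workers, which is exactly why remote partitioning has lower network overhead than remote chunking.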
There are other Stack Overflow answers that discuss these features in further detail (as does the documentation):
Advantages of spring batch
Difference between spring batch remote chunking and remote partitioning
Spring Batch Documentation
I'm dipping my toes into microservices. Is a Spring Boot batch application applicable to the following requirements?
One or more files are read from a specific directory in Linux.
Several operations follow: regex matching, building new files, writing the files, and FTPing them to a location.
Send an email when a process fails.
Using Spring Boot is confirmed; now the question is:
Should I use Spring Batch or just the core Spring framework?
I need to integrate with Control-M to trigger the job. Can Control-M be completely removed by using the Spring Batch library? We don't know when to expect the files in the directory.
I've not seen a POC with these requirements. Would someone provide an example POC or confirm that this could be achieved with Spring Batch?
I would use Spring Batch for that use case. Not only does it provide out of the box components for reading, processing, and writing files, it adds a lot more for error handling, scalability, etc. All of those things you'd probably end up wiring up by yourself if you go without Spring Batch.
As for being launched via Control-M, yes MANY large customers use Control-M to launch their jobs. Unfortunately, I've never done it myself so I cannot provide any details on the mechanics, but if Control-M can either launch a script or call a REST API, you can launch a job with it.
I would suggest you go for Spring Batch, as it has a lot of inbuilt functionality provided to you for reading files and writing to your required location. You will even be able to handle the record-skipping requirement. Your mail-triggering requirement can be handled by Control-M: you just need to decide on one exit code for your handled exception, and on the basis of that exit code you can trigger the mail to the respective members. And there are many other features that will be helpful if you go for Spring Batch.
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 7 years ago.
I want to process a bulk amount of XML data and save it into a database. Which is the best option: Spring Batch or Pentaho Kettle?
I have some checkpoints.
1. Tool is good when schema is known
2. Supports Parallel execution, multiple sessions and error log
3. Faster, less memory and less CPU utilization
4. Supports both inserts and updates
5. Foreign key references for target tables, dropping constraints and add after data load
6. Eliminate duplications
7. block or batch load support
8. headless execution (no-gui for schedule and start)
9. Support multiple input formats
10. Support custom data transformation as pluggable components
11. Transaction control, error handling and logging for future execution
12. Inspecting the Status of the Jobs, Monitoring
13. Integration testing, Sanity testing
14. Scalable, how to load multiple node in parallel
15. Restart Jobs when they crash, automatic restart after failure
16. Tracking Status and Statistics during execution
17. Ability to launch through web or Rest interfaces
I will try to address your points with Spring Batch capabilities:
Tool is good when schema is known
This is the case with Spring Batch. You will be able to use a StaxEventItemReader, which requires an annotated bean (known schema).
Supports Parallel execution, multiple sessions and error log
Spring Batch supports parallel execution and error logging. I'm not sure what you mean by multiple sessions. Here is some info about Spring Batch scalability.
Faster, less memory and less CPU utilization
Spring Batch's performance depends a lot on how you use it. Although it may not be the fastest or most efficient option, it is used in many production environments across the world.
Supports both inserts and updates
Spring Batch database writers support common DBMSs with such operations (JdbcBatchItemWriter, HibernateItemWriter...).
Foreign key references for target tables, dropping constraints and add after data load
I think this will need some manual implementation, but I'm not sure since I haven't met the requirement as of today.
Eliminate duplications
This will be done in your ItemProcessor. Here's an example: processing batch of records using spring batch before writing to DB
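Concretely, Spring Batch's ItemProcessor contract treats a null return value as "filter this item out" - the writer never sees it - which is the usual way to eliminate duplicates. The sketch below applies that convention without the framework interface so it stays self-contained; the key extraction (first CSV field) and class name are assumptions for illustration:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of a de-duplicating processor. Spring Batch's ItemProcessor
// contract filters out any item for which process() returns null; the same
// idea is shown here without implementing the framework interface
// (org.springframework.batch.item.ItemProcessor) so the class stands alone.
class DedupProcessor {
    private final Set<String> seenKeys = new HashSet<>();

    /** Returns the item unchanged the first time its key appears, null afterwards. */
    String process(String item) {
        String key = item.split(",")[0];        // assume the first field is the business key
        return seenKeys.add(key) ? item : null; // Set.add() is false for duplicates
    }
}
```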
block or batch load support
You can configure your writer's commit-interval and the rollback operations with Spring Batch.
headless execution (no-gui for schedule and start)
Spring Batch can be started with a CommandLineJobRunner, or in any other way via a JobLauncher (which then requires some manual implementation).
Support multiple input formats
Spring Batch can read any kind of flat file (FlatFileItemReader), xml file (StaxEventItemReader), queue (JmsItemReader) or database (JdbcCursorItemReader).
Support custom data transformation as pluggable components
Data transformation is achieved through an ItemProcessor. There are out-of-the-box implementations, but most often you will have to write your own implementation to apply your custom logic. As for pluggable components, I'm not sure what you mean.
Transaction control, error handling and logging for future execution
Spring Batch has a whole Retry mechanism and Restartability. You can read more here and here.
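The basic shape of that retry mechanism can be illustrated in plain Java: re-invoke an operation until it succeeds or an attempt budget runs out. This is only a sketch of the semantics (names are mine); Spring Batch, via the spring-retry library, layers retry policies, backoff, and per-record skip on top of this shape:

```java
import java.util.concurrent.Callable;

// Sketch of retry-with-limit semantics: keep re-invoking the operation
// until it succeeds or the attempt budget is exhausted, then surface the
// last failure. Spring Batch applies this kind of policy per record inside
// a fault-tolerant step, so one flaky item does not fail the whole job.
class RetrySketch {
    static <T> T withRetry(Callable<T> operation, int maxAttempts) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                // remember the failure and try again (wrap checked exceptions)
                last = (e instanceof RuntimeException re) ? re : new RuntimeException(e);
            }
        }
        throw last; // attempts exhausted: rethrow the final failure
    }
}
```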
Inspecting the Status of the Jobs, Monitoring
Spring Batch allows you to configure where you store metadata about job status (database, file, RAM...). You will be able to read this data. There is also a second project called spring-batch-admin, which is a GUI for monitoring and control. Read more here.
Integration testing, Sanity testing
Can't answer that.
Scalable, how to load multiple node in parallel
See 11. Also Spring Batch can be integrated with Spring-XD.
Restart Jobs when they crash, automatic restart after failure
See 11.
Tracking Status and Statistics during execution
See 12.
Ability to launch through web or Rest interfaces
Spring Batch can be integrated with Spring Boot to answer these needs.
I hope I answered some of your concerns.
I am reading the Spring user guide and came across the statement below. I am confused by the phrase "let the framework take care of infrastructure". I thought infrastructure meant hardware, but Spring Batch is a framework, so where does infrastructure come into the picture?
Batch developers use the Spring programming model: concentrate on business logic; let the
framework take care of infrastructure
Please help me understand.
If you read the complete documentation, you will find:
Figure: Spring Batch Layered Architecture
This layered architecture highlights three major high level components: Application, Core, and Infrastructure. The application contains all batch jobs and custom code written by developers using Spring Batch. The Batch Core contains the core runtime classes necessary to launch and control a batch job. It includes things such as a JobLauncher, Job, and Step implementations. Both Application and Core are built on top of a common infrastructure. This infrastructure contains common readers and writers, and services such as the RetryTemplate, which are used both by application developers (ItemReader and ItemWriter) and the core framework itself (retry).
spring-batch reference
The Spring Batch framework is designed to cater to batch applications that run on a daily basis in enterprise organizations. It helps to leverage the benefits of the Spring framework along with advanced services. Spring Batch is mainly used to process huge volumes of data. It offers better performance and is highly scalable using different optimization and partition techniques. It also provides advantages in logging/tracing, transaction management, job processing statistics, job restart, steps, and resource management. By using the Spring programming model, I can write the business logic and let the framework take care of the infrastructure.
Spring Batch includes three components: batch application, batch execution environment and batch infrastructure.
The Application component contains all the batch jobs and custom code written using Spring Batch.
The Core component contains the core runtime classes necessary to launch and control a batch job. It includes things such as a JobLauncher, Job, and Step implementations. Both Application and Core are built on top of a common infrastructure.
The Infrastructure contains readers, writers and services which are used both by application and the core framework itself. They include things like ItemReader, ItemWriter and MongoTemplate. To use the Spring Batch framework, you need only to configure and customize the XML files. All existing core services should be easy to replace or extend, without any impact to the infrastructure layer.
-from Devx
I hope this would help you understand how it works.