How does one design a Spring Batch job with a data source, possibly concurrent steps and aggregation at the end? - spring

I am new to Spring Batch and I have some doubts about how to implement a use case. My experience so far with Spring Batch is centered around jobs composed of steps with a reader, processor and writer. I feel, though, that the following use case is beyond my experience, so here goes:
I need to read from an mdb
I need to differentiate between the entries based on a combination of column values (this will yield a maximum of 5 combos).
Processing needs in the end to generate a collection of items of type T.
Everything needs to be merged in the end for some aggregations.
My idea is to avoid reading the MDB multiple times, so I was looking into a way of splitting the data based on the combos and then running the processes, maybe concurrently. With this in mind I read about the Splitter and partitioning components from Spring Batch and Spring Integration.
What I don't exactly know is how to put all of these concepts together.

What do you mean by MDB? A MessageDrivenBean? If the answer is yes - what do you mean by reading from the MDB multiple times? Since MDBs are message-driven, we can't read from them at an arbitrary time, so based on my understanding of your question I'd do it in the following way:
The MDB receives a message and stores the received entry in some DB table - that would be a kind of transition table; such tables are often used when processing financial transactions.
The batch window comes - the job is triggered.
Now you can query the table in any way you want. Since you are looking to split and process the data concurrently, I'd advise using Spring Batch partitioning with a TaskExecutorPartitionHandler executing the step locally in concurrent threads (a configuration sketch follows below). What you need to do is read the data from the database while differentiating on the combination of column values - that should be relatively easy - it's just a matter of constructing the appropriate SQL query.
Processed chunks are aggregated in ItemWriter#write(List<? extends T> items) according to the commit interval; if such aggregation is not enough for you, I'd add another table and a batch step that aggregates the previously processed entries.
Basically that's how batch processing works - you read items, transform them and write them. The next step - if it's not just a simple tasklet - does exactly the same.
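A minimal configuration sketch of that partitioning approach, assuming Spring Batch 4.x Java config; the combo values, the MyRow/MyResult item types and the reader/processor/writer beans are placeholders, not something from the original question:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Configuration
@EnableBatchProcessing
public class ComboPartitionConfig {

    // One partition per column-value combination (max 5 in the question); the names are placeholders.
    @Bean
    public Partitioner comboPartitioner() {
        return gridSize -> {
            Map<String, ExecutionContext> partitions = new HashMap<>();
            for (String combo : List.of("COMBO_A", "COMBO_B", "COMBO_C")) {
                ExecutionContext ctx = new ExecutionContext();
                ctx.putString("combo", combo);
                partitions.put("partition-" + combo, ctx);
            }
            return partitions;
        };
    }

    // Worker step: the reader is expected to be step-scoped and to filter its SQL query
    // on #{stepExecutionContext['combo']} so each thread only sees its own slice of data.
    @Bean
    public Step workerStep(StepBuilderFactory steps,
                           ItemReader<MyRow> reader,
                           ItemProcessor<MyRow, MyResult> processor,
                           ItemWriter<MyResult> writer) {
        return steps.get("workerStep")
                .<MyRow, MyResult>chunk(100)
                .reader(reader)
                .processor(processor)
                .writer(writer)
                .build();
    }

    // Manager step: runs the worker step once per partition in concurrent local threads
    // (a TaskExecutorPartitionHandler is created under the covers).
    @Bean
    public Step partitionedStep(StepBuilderFactory steps, Step workerStep) {
        return steps.get("partitionedStep")
                .partitioner("workerStep", comboPartitioner())
                .step(workerStep)
                .taskExecutor(new SimpleAsyncTaskExecutor())
                .gridSize(5)
                .build();
    }
}
```

The final aggregation would then be a separate step (chunk-oriented or a simple tasklet) that runs after the partitioned step completes.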

Related

A Spring Batch job to produce events for a PubSub preserving source order

I'm considering creating a Spring Batch job that uses rows from a table to create events and push them to a PubSub implementation. The problem here is that the order of the events should be the same as the order of the rows in the table used as the source for the event creation process.
It seems to me now that Spring Batch is not designed for such order preservation, as batches are processed and then written in parallel. The only ugly but probably workable solution for this would be to do all the processing in the reader (so the reader would do reading + processing + writing to PubSub); that could help keep the order inside paginated batches, but even that doesn't seem to guarantee the order of the batches, according to the docs.
Any ideas how the transition ordered rows -> ordered events could be implemented using Spring Batch or, at least, Spring Boot? Thank you in advance!
It seems to me now that Spring Batch is not designed for such order preservation, as batches are processed and then written in parallel.
This is true only for a multi-threaded or a partitioned step. The default (single-threaded) chunk-oriented step implementation processes items in the same order returned by the item reader. So if you make your database reader return items in the order you want, those items will be written to your Pub/Sub broker in the same order.
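As a rough illustration of that point (Spring Batch 4.x, where ItemWriter#write receives a List), here is a single-threaded step whose reader fixes the order with an ORDER BY clause; EventRow and PubSubClient are hypothetical placeholder types:

```java
import javax.sql.DataSource;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.batch.item.database.builder.JdbcCursorItemReaderBuilder;
import org.springframework.jdbc.core.BeanPropertyRowMapper;

public class OrderedEventStepConfig {

    // Reader: the ORDER BY clause is what fixes the order of the events downstream.
    public JdbcCursorItemReader<EventRow> reader(DataSource dataSource) {
        return new JdbcCursorItemReaderBuilder<EventRow>()
                .name("eventRowReader")
                .dataSource(dataSource)
                .sql("SELECT id, payload FROM events ORDER BY id")
                .rowMapper(new BeanPropertyRowMapper<>(EventRow.class))
                .build();
    }

    // Writer: publishes one message per row, iterating the chunk in reader order.
    public ItemWriter<EventRow> pubSubWriter(PubSubClient pubSub) {
        return items -> {
            for (EventRow row : items) {
                pubSub.publish(row.getPayload()); // hypothetical client call
            }
        };
    }

    // No taskExecutor and no partitioning: the step stays single-threaded,
    // so items are written in exactly the order the reader returned them.
    public Step orderedStep(StepBuilderFactory steps, DataSource dataSource, PubSubClient pubSub) {
        return steps.get("orderedStep")
                .<EventRow, EventRow>chunk(100)
                .reader(reader(dataSource))
                .writer(pubSubWriter(pubSub))
                .build();
    }
}
```

Because no taskExecutor, throttling or partitioning is configured, chunks are read, processed and written sequentially, so ordering is preserved across chunk boundaries as well.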

JdbcBatchItemWriterBuilder vs org.springframework.jdbc.core.jdbcTemplate.batchUpdate

I understand that jdbcTemplate.batchUpdate is used for sending several records to the database in one communication.
Let's say I have 1000 records to be updated; instead of 1000 communications from the application to the database, the application will send all 1000 records in one request.
Coming to JdbcBatchItemWriterBuilder, it is used within a combination of tasks in a job.
My question is: if there are 1000 records to be processed (INSERT statements) via JdbcBatchItemWriterBuilder, are all INSERTs executed in one go, or one after another?
If one after another, wouldn't connecting to the database 1000 times using JdbcBatchItemWriterBuilder cause performance issues? How is that handled?
I would like to understand whether Spring Batch performs better than running 1000 INSERT statements using jdbcTemplate.update.
The JdbcBatchItemWriter uses java.sql.PreparedStatement#addBatch and java.sql.Statement#executeBatch internally (See https://github.com/spring-projects/spring-batch/blob/c4010fbffa6b71cbcfe79d523023251ce73666a4/spring-batch-infrastructure/src/main/java/org/springframework/batch/item/database/JdbcBatchItemWriter.java#L189-L195), so there will be a single batch insert for all items of the chunk.
Moreover, this will be executed in a single transaction as described in the Chunk-oriented Processing section of the reference documentation.
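For reference, a minimal sketch of such a writer built with JdbcBatchItemWriterBuilder; the Person type, table and columns are placeholders:

```java
import javax.sql.DataSource;

import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.database.builder.JdbcBatchItemWriterBuilder;

public class PersonWriterConfig {

    // Each chunk handed to this writer becomes a single JDBC batch: one addBatch()
    // per item and one executeBatch() per chunk, executed inside the chunk's transaction.
    public JdbcBatchItemWriter<Person> personWriter(DataSource dataSource) {
        return new JdbcBatchItemWriterBuilder<Person>()
                .dataSource(dataSource)
                .sql("INSERT INTO person (first_name, last_name) VALUES (:firstName, :lastName)")
                .beanMapped() // bind the named parameters to Person getters
                .build();
    }
}
```

So with a commit interval of 1000, the writer behaves much like a single jdbcTemplate.batchUpdate call per chunk rather than 1000 separate jdbcTemplate.update calls.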

Performance issue in getting millions of records from a database and processing them in an ERP in Mule ESB

We are trying to fetch millions of records per day from a database and process them in an ERP system, and we are facing performance issues. Is there any solution for this in the community?
What is the best way to process the records in Mule? Should we use batch, or is there an alternative to it? And if we use batch or any other solution, how can we use it so as not to face performance issues?
Since we don't have details on your specific situation, here are some general ideas. You will definitely need to do performance testing when dealing with large data sets to make sure your flow design is performing well.
Just to clarify, I'm giving options below that show streaming, which are slightly less performant, but will allow you to process large datasets. If you can handle the dataset in memory and you want faster processing, then turn off streaming.
Test your db queries outside of mule to make sure they are performant and tables are properly indexed.
Use streaming db connection. Tweak chunk size for performance testing. (Using this with batch scope is a good combo)
If using on-premise runtime, do performance tuning.
Use batch scope (enterprise edition)
Batch sounds like what you want to do. For each batch job Mule creates a batch job instance, and each instance contains a persistent queue with the batched records. However, it does a deep copy of the MuleEvent containing the flow variables, flow construct, message, processing time, session and exchange pattern, so beware: make sure you keep a light footprint before going into your batch job. If you have put a payload with millions of records into flow variables to do some manipulation, make sure you delete them before you start executing the batch. Mule will load these batch steps in memory and execute them concurrently, so the amount of memory you will need will be the size of the batch job instance (in particular the MuleEvent) multiplied by the number of batch steps.

Import data from a large CSV (or stream of data) to Neo4j efficiently in Ruby

I am new to background processes, so feel free to point it out if I am making wrong assumptions.
I am trying to write a script that imports data into a Neo4j DB from a large CSV file (consider it an endless stream of data). The CSV file only contains two columns - user_a_id and user_b_id - which map the directed relations. A few things to consider:
data might have duplicates
the same user can map to multiple other users, and there is no guarantee when it will show up again.
My current solution: I am using Sidekiq and have one worker read the file in batches and dispatch workers to create edges in the database.
Problems that I am having:
Since I am receiving a stream of data, I cannot pre-sort the file and assign a job that builds the relations for one user.
Since jobs are performed asynchronously, if two workers are working on relations of the same node, I will get a write lock from Neo4j.
Let's say I get around the write lock; if two workers are working on records that are duplicated, I will build duplicate edges.
Possible solution: build a synchronous queue and have only one worker perform the writing (neither Sidekiq nor Resque seems to have that option). But this could be pretty slow, since only one thread is working on the jobs.
Or, I could write my own implementation that creates one worker to build multiple queues of jobs based on user_id (one unique id per queue) and uses Redis to store them, then assigns one worker per queue to write to the database. I would set a maximum number of queues so I don't run out of memory, and delete a queue once it has exhausted all its jobs (rebuilding it if I see the same user_id in the future). This doesn't sound trivial though, so I would prefer to use an existing library before diving into this.
My question is: is there an existing gem that I can use? What is a good practice for handling this?
You have a number of options ;)
If your data really is in a file and not a stream, I would definitely recommend checking out the neo4j-import command that comes with Neo4j. It allows you to import CSV data at speeds on the order of 1 million rows per second. Two caveats: you may need to modify your file format a bit, and you would need to be generating a fresh database (it doesn't import new data into an existing database).
I would also get familiar with the LOAD CSV command. This takes a CSV in any format and lets you write some Cypher commands to transform and import the data. It's not as fast as neo4j-import, but it's pretty fast and it can stream a CSV file from disk or a URL.
Since you're using Ruby, I would also suggest checking out neo4apis. This is a gem that I wrote to make it easier to batch import data so that you're not making a single request for every entry in your file. It allows you to define a class in a sort of DSL with importers. These importers can take any sort of Ruby object and, given that Ruby object, will define what should be imported using add_node and add_relationship methods. Under the covers this generates Cypher queries which are buffered and executed in batches so that you don't have lots of round trips to Neo4j.
I would investigate all of those things first before thinking about doing things asynchronously. If you really do have a never-ending set of data coming in, however, the MERGE clause should help you with any race conditions or locks. It allows you to create objects and relationships only if they don't already exist. It's basically a find_or_create, but at the database level. If you use LOAD CSV you'll probably want MERGE as well, and neo4apis uses MERGE under the covers.
Hope that helps!

Spring Batch Framework

I am not able to decide whether the Spring Batch framework is applicable to the requirement below. I need expert input on this.
Following is my requirement:
Read multiple Oracle tables (at least 10 tables, including both transaction and master tables), do complex calculations based on the business rules, and insert / update / delete records in the transaction tables.
I have identified the following two designs:
Design # 1:
ItemReader: Select eligible records from the key transaction table.
ItemProcessor: Fetch additional details from the DB using the key available in the record retrieved by the ItemReader (this would require multiple DB transactions).
Do the validation and computation and add the details to be written to the DB as objects in a list.
ItemWriter: Write the details available in the objects using a custom ItemWriter (insert / update / delete operations).
With this design, we can achieve parallel processing but increase the number of DB transactions.
Design # 2:
Step # 1
ItemReader: Use a composite ItemReader (a group of ItemReaders) to read all the required tables.
ItemWriter: Save the result sets as lists of objects (one list per table) in the execution context.
Step # 2
ItemReader: Retrieve the lists of objects available in the execution context and group them into one list of objects based on the business processing so that the processor can process them.
ItemProcessor:
Process the chunk of objects returned by the ItemReader.
Do the validation and computation and add the details to be written to the DB as objects in a list.
ItemWriter: Write the details available in the objects using a custom ItemWriter (insert / update / delete operations).
With this design, we REDUCE the number of DB transactions, but we delay the processing until all table records are retrieved and stored in the execution context, i.e. we are not using the parallel processing provided by Spring Batch.
Please advise whether the above is feasible using Spring Batch or whether we need to use a conventional Java program.
The good news is that your problem description matches a very common use case for Spring Batch. The bad news is that the problem description is too generic to allow much meaningful input about the specific design beyond the comments already provided.
Spring Batch brings facilities similar to JCL and ISPF from the mainframe world into the Java context.
Spring Batch provides a framework for organizing and managing the boundaries of your process. It is a natural fit for a lot of ETL and big data operations, but it is not the only way to write these processes.
If your process can be broken down into discrete steps, then Spring Batch is a good choice for you.
The ItemReader should (logically) be an iterator returning a single object representing the start of one logical unit of work (LUW). Each LUW object is read and processed one at a time, and the processed results are assembled into chunks of the size you configure; each chunk is then passed to the writer. In the context of an RDBMS-centric process, the commit happens at the end of the writer's operation.
What happens in each of those pieces of the step is 100% whatever you need (plain old Java). The point of the framework is to free you from the complexity and enable you to solve the problem.
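To make that concrete, here is a hypothetical processor/writer pair in plain Java along the lines of Design #1; KeyRecord, ComputedResult, the tables and the SQL are all invented for illustration (Spring Batch 4.x ItemWriter signature):

```java
import java.util.List;

import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemWriter;
import org.springframework.jdbc.core.JdbcTemplate;

// Processor: one logical unit of work in, one validated/computed result out (plain Java).
public class KeyRecordProcessor implements ItemProcessor<KeyRecord, ComputedResult> {

    private final JdbcTemplate jdbcTemplate;

    public KeyRecordProcessor(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public ComputedResult process(KeyRecord item) {
        // Fetch additional details using the key, apply the business rules, build the result.
        // Returning null here would filter the item out of the chunk.
        Integer detail = jdbcTemplate.queryForObject(
                "SELECT some_value FROM detail_table WHERE key_id = ?",
                Integer.class, item.getKeyId());
        return new ComputedResult(item.getKeyId(), detail);
    }
}

// Writer: receives the whole chunk; the chunk's transaction commits after this method returns.
class ComputedResultWriter implements ItemWriter<ComputedResult> {

    private final JdbcTemplate jdbcTemplate;

    ComputedResultWriter(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public void write(List<? extends ComputedResult> items) {
        for (ComputedResult result : items) {
            // Decide per item whether to insert, update or delete; an update is shown here.
            jdbcTemplate.update(
                    "UPDATE transaction_table SET computed_value = ? WHERE key_id = ?",
                    result.getValue(), result.getKeyId());
        }
    }
}
```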
From my understanding, Spring Batch has nothing to do with database batch operations (or at least the word 'batch' has a different meaning in these two contexts). Spring Batch is used to create processes with multiple steps, and it gives you the chance to restart a process if one of the steps fails (without repeating the previously finished steps).
