Spring Boot batch parallel processing (using annotations)

I want to process millions of records and I am currently using Spring Boot Batch. It works fine with a single thread, but I want to increase the speed of the whole process by implementing parallel processing. Is this achievable without changing the reading and writing order?
Example:
Assume I provide an input text file with 1000 student details, where student numbers run from 1 to 1000. I want to introduce parallel processing by creating 10 threads (100 students per thread) and performing some operation. Once all students are processed, I should produce the text file output based on the input file.
The output file also needs to follow the same order, student numbers 1 to 1000, even though multiple threads run simultaneously.

Preprocess all keys and create a HashMap (studentKey -> studentResponse) and a collection (an ArrayList of studentResponse) in the order that you want to return them. The student responses in the collection are the same studentResponse instances as in the map. Then make your parallel calls, each of which updates the contents of the studentResponse instances in the map according to the key(s) it is processing. The collection is updated too, since it holds the same instances. Finally, process the collection to create your text file.
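A minimal sketch of that idea in plain Java; the StudentResponse holder, the 10-thread/100-student split and the "some operation" are assumptions for illustration only:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class OrderedParallelProcessing {

    // Placeholder response holder; real fields depend on your record format.
    static class StudentResponse {
        final int studentNumber;
        String result;
        StudentResponse(int studentNumber) { this.studentNumber = studentNumber; }
    }

    public static void main(String[] args) throws Exception {
        // 1. Preprocess: a map keyed by student number and a list in output
        //    order, both holding the SAME StudentResponse instances.
        Map<Integer, StudentResponse> byKey = new HashMap<>();
        List<StudentResponse> inOrder = new ArrayList<>();
        for (int i = 1; i <= 1000; i++) {
            StudentResponse response = new StudentResponse(i);
            byKey.put(i, response);
            inOrder.add(response);
        }

        // 2. Parallel phase: 10 threads, 100 students each, each thread only
        //    touching its own range of keys.
        ExecutorService pool = Executors.newFixedThreadPool(10);
        List<Future<?>> futures = new ArrayList<>();
        for (int start = 1; start <= 1000; start += 100) {
            final int from = start, to = start + 99;
            futures.add(pool.submit(() -> {
                for (int key = from; key <= to; key++) {
                    // "some operation" on the student
                    byKey.get(key).result = "processed student " + key;
                }
            }));
        }
        for (Future<?> f : futures) {
            f.get(); // wait for every thread; also makes the writes visible here
        }
        pool.shutdown();

        // 3. Write the output in the original order; no sorting is needed
        //    because the list was built in input order up front.
        for (StudentResponse response : inOrder) {
            System.out.println(response.studentNumber + ": " + response.result);
        }
    }
}
```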

Related

JdbcBatchItemWriterBuilder vs org.springframework.jdbc.core.jdbcTemplate.batchUpdate

I understand jdbcTemplate.batchUpdate is used for sending several records to the database in one communication.
Let's say I have 1000 records to be updated; instead of 1000 communications from the application to the database, the application sends the 1000 records in one request.
Coming to JdbcBatchItemWriterBuilder, it is a combination of tasks in a job.
My question is: if there are 1000 records to be processed (INSERT statements) via JdbcBatchItemWriterBuilder, are all the INSERTs executed in one go, or one after another?
If one after another, does connecting to the database 1000 times using JdbcBatchItemWriterBuilder cause performance issues? How is that handled?
I would like to understand whether Spring Batch performs better than running 1000 INSERT statements using jdbcTemplate.update.
The JdbcBatchItemWriter uses java.sql.PreparedStatement#addBatch and java.sql.Statement#executeBatch internally (See https://github.com/spring-projects/spring-batch/blob/c4010fbffa6b71cbcfe79d523023251ce73666a4/spring-batch-infrastructure/src/main/java/org/springframework/batch/item/database/JdbcBatchItemWriter.java#L189-L195), so there will be a single batch insert for all items of the chunk.
Moreover, this will be executed in a single transaction as described in the Chunk-oriented Processing section of the reference documentation.
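For reference, a JdbcBatchItemWriter is typically configured along these lines; the Person item type and the table/column names are placeholders, not anything from the question:

```java
import javax.sql.DataSource;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.database.builder.JdbcBatchItemWriterBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class WriterConfiguration {

    // Person is assumed to be a simple POJO with getFirstName()/getLastName().
    @Bean
    public JdbcBatchItemWriter<Person> personWriter(DataSource dataSource) {
        return new JdbcBatchItemWriterBuilder<Person>()
                .dataSource(dataSource)
                // One PreparedStatement; each item of the chunk is added with
                // addBatch() and the whole chunk is flushed with executeBatch().
                .sql("INSERT INTO people (first_name, last_name) VALUES (:firstName, :lastName)")
                .beanMapped() // resolve :firstName/:lastName from Person getters
                .build();
    }
}
```

So with a chunk size (commit interval) of 1000, those 1000 INSERTs reach the database as a single JDBC batch inside a single transaction.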

NiFi Create Indexes after Inserting Records into table

I've got my first Process Group, which drops the indexes on a table.
That routes to another Process Group that does the inserts into the table.
After successfully inserting the half million rows, I want to create the indexes on the table and analyze it. This is typical data warehouse methodology. Can anyone please give advice on how to do this?
I've tried setting counters, but I cannot reference counters in Expression Language. I've tried RouteOnAttribute but am getting nowhere. Now I'm digging into the Wait & Notify processors - maybe there's a solution there?
I have gotten Counters to count the flow file SQL insert statements, but I cannot reference the Counter values via Expression Language, i.e. this always returns null: "${InsertCounter}", even though InsertCounter appears to be set properly by the UpdateCounter processor in my flow.
So maybe this approach can be used?
In the Wait processor, set the Target Signal Count to ${fragment.count}.
Set the Release Signal Identifier in both the Notify and Wait processors to ${fragment.identifier}.
Nothing works.
You can use Wait/Notify processors to do that.
I assume you're using ExecuteSQL, SplitAvro? If so, the flow will look like:
Split approach
At the 2nd ProcessGroup
ExecuteSQL: e.g. 1 output FlowFile containing 5,000 records
SplitAvro: creates 5,000 FlowFiles; this processor adds the fragment.identifier and fragment.count (= 5,000) attributes.
split:
XXXX: Do some conversion per record
PutSQL: Insert records individually
Notify: Increase count for the fragment.identifier (Release Signal Identifier) by 1. Executed 5,000 times.
original - to the next ProcessGroup
At the 3rd ProcessGroup
Wait: waiting for fragment.identifier (Release Signal Identifier) to reach fragment.count (Target Signal Count). This route processes the original FlowFile, so executed only once.
PutSQL: Execute a query to create indices and analyze tables
Alternatively, if possible, using Record aware processors would make the flow simpler and more efficient.
Record approach
ExecuteSQL: e.g. 1 output FlowFile containing 5,000 records
Perform record level conversion: With UpdateRecord or LookupRecord, you can do data processing without splitting records into multiple FlowFiles.
PutSQL: Execute a query to create indices and analyze tables. Since the single FlowFile contains all the records, no Wait/Notify is required, and the output FlowFile can be connected to the downstream flow.
I think my suggestion to this question will fit your scenario as well:
How to execute a processor only when another processor is not executing?
Check it out

Import data from a large CSV (or stream of data) to Neo4j efficiently in Ruby

I am new to background processes, so feel free to point it out if I am making wrong assumptions.
I am trying to write a script that imports data into a Neo4j db from a large CSV file (consider it a stream of data, endless). The CSV file only contains two columns - user_a_id and user_b_id, which map the directed relations. A few things to consider:
data might have duplicates
the same user can map to multiple other users, and there is no guarantee when it will show up again.
My current solution: I am using sidekiq and have one worker to read the file in batches and dispatch workers to create edges in the database.
Problems that I am having:
Since I am receiving a stream of data, I cannot pre-sort the file and assign a job that builds the relations for one user.
Since jobs are performed asynchronously, if two workers are working on relations of the same node, I will get a write lock from Neo4j.
Let's say I get around the write lock; if two workers are working on records that are duplicated, I will build duplicated edges.
Possible solution: build a synchronous queue and have only one worker perform the writing (neither sidekiq nor resque seems to have this option). This could be pretty slow since only one thread is working on the jobs.
Or, I could write my own implementation, which creates one worker to build multiple queues of jobs based on user_id (one unique id per queue) and uses redis to store them. Then assign one worker per queue to write to the database. Set a maximum number of queues so I don't run out of memory, and delete a queue once it has exhausted all its jobs (rebuilding it if I see the same user_id in the future). This doesn't sound trivial though, so I would prefer using an existing library before diving into this.
My question is - is there an existing gem that I can use? What is a good practice for handling this?
You have a number of options ;)
If your data really is in a file and not a stream, I would definitely recommend checking out the neo4j-import command which comes with Neo4j. It allows you to import CSV data at speeds on the order of 1 million rows per second. Two caveats: you may need to modify your file format a bit, and you need to be generating a fresh database (it doesn't import new data into an existing database).
I would also get familiar with the LOAD CSV command. It takes a CSV in any format and lets you write Cypher commands to transform and import the data. It's not as fast as neo4j-import, but it's pretty fast, and it can stream a CSV file from disk or a URL.
Since you're using Ruby, I would also suggest checking out neo4apis. This is a gem that I wrote to make it easier to batch import data so that you're not making a single request for every entry in your file. It allows you to define a class in a sort of DSL with importers. These importers can take any sort of Ruby object and, given that Ruby object, define what should be imported using add_node and add_relationship methods. Under the covers this generates Cypher queries which are buffered and executed in batches so that you don't have lots of round trips to Neo4j.
I would investigate all of those things first before thinking about doing things asynchronously. If you really do have a never-ending set of data coming in, however, the MERGE clause should help you with any race conditions or locks. It allows you to create objects and relationships only if they don't already exist. It's basically a find_or_create, but at the database level. If you use LOAD CSV you'll probably want MERGE as well, and neo4apis uses MERGE under the covers.
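The question is Ruby-centric, but just to illustrate the MERGE semantics, here is a minimal sketch using the official Neo4j Java driver; the URL, credentials, User label and FOLLOWS relationship type are assumptions:

```java
import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Session;
import static org.neo4j.driver.Values.parameters;

public class MergeEdgeSketch {
    public static void main(String[] args) {
        // Connection details are placeholders.
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {

            // MERGE is a find-or-create: re-running the same
            // (user_a_id, user_b_id) pair will not create duplicate
            // nodes or edges.
            session.run(
                    "MERGE (a:User {id: $a}) " +
                    "MERGE (b:User {id: $b}) " +
                    "MERGE (a)-[:FOLLOWS]->(b)",
                    parameters("a", 1, "b", 2));
        }
    }
}
```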
Hope that helps!

Spring Batch Framework

I am not able to finalize whether the Spring Batch framework is applicable to the requirement below. I need expert input on this.
Following is my requirement:
Read multiple Oracle tables (at least 10 tables, including both transaction and master tables), do complex calculations based on the business rules, and insert/update/delete records in the transaction tables.
I have identified the following two designs:
Design # 1:
ItemReader: Select eligible records from the key transaction table.
ItemProcessor: Fetch additional details from the DB using the key available in the record retrieved by the ItemReader (this would require multiple DB transactions).
Do the validation and computation and add the details to be written to the DB as objects in a list.
ItemWriter: Write the details held in those objects using a CustomItemWriter (insert/update/delete operations); see the sketch below.
With this design we can achieve parallel processing, but it increases the number of DB transactions.
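For clarity, here is a rough sketch of what I mean by the CustomItemWriter in Design # 1; the ComputedDetail type, the operation flag and the table/column names are just placeholders:

```java
import java.util.List;
import org.springframework.batch.item.ItemWriter;
import org.springframework.jdbc.core.JdbcTemplate;

// Placeholder detail object produced by the ItemProcessor.
class ComputedDetail {
    private final String operation; // "INSERT", "UPDATE" or "DELETE"
    private final long id;
    private final String value;

    ComputedDetail(String operation, long id, String value) {
        this.operation = operation;
        this.id = id;
        this.value = value;
    }

    String getOperation() { return operation; }
    long getId() { return id; }
    String getValue() { return value; }
}

public class CustomItemWriter implements ItemWriter<ComputedDetail> {

    private final JdbcTemplate jdbcTemplate;

    public CustomItemWriter(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public void write(List<? extends ComputedDetail> items) {
        // The whole chunk is written inside one transaction managed by the step.
        for (ComputedDetail item : items) {
            switch (item.getOperation()) {
                case "INSERT":
                    jdbcTemplate.update(
                            "INSERT INTO txn_table (id, value) VALUES (?, ?)",
                            item.getId(), item.getValue());
                    break;
                case "UPDATE":
                    jdbcTemplate.update(
                            "UPDATE txn_table SET value = ? WHERE id = ?",
                            item.getValue(), item.getId());
                    break;
                case "DELETE":
                    jdbcTemplate.update(
                            "DELETE FROM txn_table WHERE id = ?", item.getId());
                    break;
            }
        }
    }
}
```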
Design # 2:
Step # 1
ItemReader: Use a composite item reader (a group of ItemReaders) to read all the required tables.
ItemWriter: Save the result sets as lists of objects (one list per table) in the execution context.
Step # 2
ItemReader: Retrieve the lists of objects available in the execution context and group them into one list of objects, based on the business processing, so that the processor can process them.
ItemProcessor:
Process the chunk of objects returned by the ItemReader.
Do the validation and computation and add the details to be written to the DB as objects in a list.
ItemWriter: Write the details held in those objects using a CustomItemWriter (insert/update/delete operations).
With this design we can REDUCE the number of DB transactions, but we delay the processing until all the table records are retrieved and stored in the execution context, i.e. we are not using the parallel processing provided by Spring Batch.
Please advise whether the above is feasible using Spring Batch, or whether we need to use a conventional Java program.
The good news is that your problem description matches a very common use case for Spring Batch. The bad news is that the problem description is too generic to allow much meaningful input about the specific design beyond the comments already provided.
Spring Batch brings facilities similar to JCL and ISPF from the mainframe world into the Java context.
Spring Batch provides a framework for organizing and managing the boundaries of your process. It is a natural fit for a lot of ETL and big data operations, but it is not the only way to write these processes.
If your process can be broken down into discrete steps, then Spring Batch is a good choice for you.
The ItemReader should (logically) be an iterator returning a single object representing the start of one logical unit of work (LUW). The LUW object is captured by the chunker and assembled into collections of the size you configure, and then passed to the processor. The result of the processor is then passed to the writer. In the context of an RDBMS-centric process, the commit happens at the end of the writer's operation.
What happens in each of those pieces of the step is 100% whatever you need (plain old Java). The point of the framework is to free you from that complexity and enable you to solve the problem.
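As a minimal sketch (Spring Batch 4 style), wiring reader, processor and writer into a chunk-oriented step looks roughly like this; KeyRecord, ResultRecord and the chunk size of 100 are placeholders:

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class LuwStepConfiguration {

    // KeyRecord is the logical unit of work read from the key table,
    // ResultRecord is the computed output; both are placeholder types.
    @Bean
    public Step luwStep(StepBuilderFactory stepBuilderFactory,
                        ItemReader<KeyRecord> reader,
                        ItemProcessor<KeyRecord, ResultRecord> processor,
                        ItemWriter<ResultRecord> writer) {
        return stepBuilderFactory.get("luwStep")
                // chunk size = commit interval: the writer receives up to 100
                // processed items and the transaction commits after write()
                .<KeyRecord, ResultRecord>chunk(100)
                .reader(reader)
                .processor(processor)
                .writer(writer)
                .build();
    }
}
```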
From my understanding, Spring Batch has nothing to do with database batch operations (or at least the word 'batch' has a different meaning in those two contexts). Spring Batch is used to create processes with multiple steps, and it gives you the chance to restart a process if one of the process steps fails (without repeating the previously finished steps).

How does one design a spring batch job with a data source, possible concurrent steps and aggregation in the end?

I am new to Spring Batch and I have some doubts about how to implement a use case. My experience so far with Spring Batch is centered around jobs composed of tasklets with a reader, writer and processor. I feel, though, that the following use case is above my experience, so here goes:
I need to read from an MDB.
I need to differentiate between the entries based on a combination of column values (which will yield a maximum of 5 combinations).
Processing needs, in the end, to generate a collection of items of type T.
Everything needs to be merged in the end for some aggregations.
My idea is to avoid reading the MDB multiple times, so I was looking into a way of splitting the data based on the combinations and then running the processing, maybe concurrently. With this in mind I read about the Splitter and partitioning components of Spring Batch and Spring Integration.
What I don't exactly know is how to put all these concepts together.
What do you mean by MDB? MessageDrivenBean? If the answer is yes - what do you mean by reading from the MDB multiple times? Since MDBs are message-driven, we can't read from them at an arbitrary time, so based on my understanding of your question I'd do it in the following way:
The MDB receives a message and stores the received entry in some DB table - that would be a kind of transition table; such tables are often used during the processing of financial transactions.
The batch window comes - the job is triggered.
Now you can query the table in any way you want. Since you are looking to split and process the data concurrently, I'd advise using Spring Batch partitioning with a TaskExecutorPartitionHandler executing the steps locally in concurrent threads (see the sketch at the end of this answer). What you need to do is read the data from the database, differentiating on the combination of column values - that should be relatively easy; it's just a matter of constructing an appropriate SQL query.
Processed chunks are aggregated in the ItemWriter's write(List<? extends T> items) depending on the commit interval; if such aggregation is not enough for you, I'd add another table and a batch step that aggregates the previously processed entries.
Basically that's how batch processing works - you read items, transform them and write them. The next step - if it's not just a simple tasklet - does exactly the same.
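A minimal sketch of that partitioning setup (Spring Batch 4 style); the combination values, step names, grid size and the workerStep bean are assumptions:

```java
import java.util.HashMap;
import java.util.Map;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.core.partition.support.TaskExecutorPartitionHandler;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Configuration
public class PartitionedJobConfiguration {

    // One partition per column-value combination (max 5 in the question);
    // the values here are placeholders.
    @Bean
    public Partitioner comboPartitioner() {
        return gridSize -> {
            Map<String, ExecutionContext> partitions = new HashMap<>();
            String[] combos = {"COMBO_1", "COMBO_2", "COMBO_3", "COMBO_4", "COMBO_5"};
            for (String combo : combos) {
                ExecutionContext context = new ExecutionContext();
                // The worker step's reader can pick this up via
                // #{stepExecutionContext['combo']} to build its SQL query.
                context.putString("combo", combo);
                partitions.put("partition-" + combo, context);
            }
            return partitions;
        };
    }

    // workerStep is the chunk-oriented reader/processor/writer step,
    // assumed to be defined elsewhere.
    @Bean
    public Step managerStep(StepBuilderFactory stepBuilderFactory,
                            Step workerStep,
                            Partitioner comboPartitioner) throws Exception {
        TaskExecutorPartitionHandler handler = new TaskExecutorPartitionHandler();
        handler.setStep(workerStep);
        handler.setTaskExecutor(new SimpleAsyncTaskExecutor()); // run partitions in concurrent threads
        handler.setGridSize(5);
        handler.afterPropertiesSet();
        return stepBuilderFactory.get("managerStep")
                .partitioner("workerStep", comboPartitioner)
                .partitionHandler(handler)
                .build();
    }
}
```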
