Spring Batch: call a paginated API and write to a unique file per account

I need to create a batch job (with Spring Batch) and I'm in this situation:
1. I need to call an API from which to read the account ids (about 100 accounts).
2. I need to call a paginated API from which to read the data (300k to 1000k objects for each account).
3. I need to add up some values contained in the response.
4. I have to write a file for each account with all the data received in step 2.
At the moment I have created a partition handler and set 10 threads (1 thread per 10 accounts). I call the API (step 2) and put all the objects in memory, do some operations in the processor, and pass the list of objects to the writer, which writes them to the file.
Keeping around 500k objects per thread (times 10 threads) in memory seems like the wrong solution to me. What would be a better one?
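One way to avoid holding ~500k objects per thread is to keep the partition-per-account design but let the reader stream the pages: each chunk then only ever holds one page, the running sums can be accumulated in the processor (or the step's execution context), and the file writer appends chunk by chunk. Below is a minimal sketch of such a reader; AccountApiClient, fetchPage and AccountRecord are hypothetical stand-ins for the real API client and payload, not names from the question.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

import org.springframework.batch.item.ItemReader;

// Hypothetical client and payload types, only so the sketch is self-contained.
interface AccountApiClient {
    List<AccountRecord> fetchPage(String accountId, int pageNumber);
}

record AccountRecord(String id, long value) {}

// Streams one page at a time: only the current page is held in memory, and items
// flow through the chunk-oriented step (processor, writer) one by one.
public class PagedAccountDataReader implements ItemReader<AccountRecord> {

    private final AccountApiClient client;
    private final String accountId;

    private Iterator<AccountRecord> currentPage = Collections.emptyIterator();
    private int nextPageNumber = 0;
    private boolean lastPageReached = false;

    public PagedAccountDataReader(AccountApiClient client, String accountId) {
        this.client = client;
        this.accountId = accountId;
    }

    @Override
    public AccountRecord read() {
        if (!currentPage.hasNext() && !lastPageReached) {
            List<AccountRecord> page = client.fetchPage(accountId, nextPageNumber++);
            lastPageReached = page.isEmpty();
            currentPage = page.iterator();
        }
        // Returning null tells Spring Batch that this partition has no more input.
        return currentPage.hasNext() ? currentPage.next() : null;
    }
}
```

Spring Batch also ships org.springframework.batch.item.data.AbstractPaginatedDataItemReader, which can be extended to get similar page-by-page behaviour with restart support.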

Related

Spring Batch ItemReader using REST endpoint as source

I could not find an example of an ItemReader in Spring Batch that uses a REST endpoint as its source, invoking it thousands of times and reading JSON.
I tried going through https://docs.spring.io/spring-batch/docs/current/reference/html/readersAndWriters.html#itemReader
but I could not see any ItemReader that hits a REST endpoint.
My use case is:
1. Read the DB for 200,000 unique values.
2. Using these values, call a REST endpoint each time to get JSON results.
3. At the same time, using (1), read the unique results from the DB.
4. Combine the results of 2 and 3 for each of the 200,000 elements.
5. Finally, store them in a DB for each element.
6. Include error and exception handling with transactional capabilities.
7. Deploy to Kubernetes.
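There is no ready-made REST-endpoint ItemReader in that chapter of the docs. A common pattern is to read the 200,000 values from the DB with a standard reader (e.g. JdbcCursorItemReader or JdbcPagingItemReader) and make the REST call per item in an ItemProcessor, so the chunk transaction and the step's retry/skip policies also cover the HTTP calls. A minimal sketch, where the endpoint URL and the DetailsJson/EnrichedResult types are made up for illustration:

```java
import org.springframework.batch.item.ItemProcessor;
import org.springframework.web.client.RestTemplate;

// Hypothetical payload types, only so the sketch is self-contained.
record DetailsJson(String payload) {}
record EnrichedResult(Long id, DetailsJson details) {}

// One REST call per item read from the DB; the chunk's transaction and the
// step's retry/skip configuration wrap these calls.
public class RestEnrichingProcessor implements ItemProcessor<Long, EnrichedResult> {

    private final RestTemplate restTemplate = new RestTemplate();

    @Override
    public EnrichedResult process(Long id) {
        DetailsJson details = restTemplate.getForObject(
                "https://example.org/details/{id}", DetailsJson.class, id);
        return new EnrichedResult(id, details);
    }
}
```

The writer can then be a standard JDBC writer that stores the combined result per element.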

Spring Batch: fetch a huge amount of data from DB-A and store it in DB-B

I have the following scenario. In a database A I have a table with a huge number of records (several millions); these records grow rapidly day by day (around 100,000 records per day).
I need to fetch these records, check whether they are valid and import them into my own database. On the first run I should take all the stored records; after that I can take only the newly saved records. I have a timestamp column I can use for this filter, but I can't figure out how to create a JpaPagingItemReader or a JdbcPagingItemReader and pass it a dynamic filter based on the date (e.g. select all records where the timestamp is greater than the job's last execution date).
I'm using Spring Boot, Spring Data JPA and Spring Batch. I'm configuring the Job instance in chunks of size 1000. I can also use a paging query (is it useful if I use chunks?).
I have a microservice (let's call it MSA) with all the business logic needed to check whether records are valid and to insert the valid ones.
I have another service on a separate server; this service contains all the batch operations (let's call it MSB).
I'm wondering what the best approach to the batch is. I was thinking of these solutions:
1. In MSB I duplicate all the entities, repositories and services I use in MSA. Then in MSB I can run all the needed queries.
2. In MSA I create all the REST APIs needed. The ItemProcessor of MSB calls these REST APIs to perform the checks on the items to be processed, and finally the ItemWriter calls the REST API to save the data.
The first solution would avoid the HTTP calls, but it forces me to duplicate all repositories and services between the two microservices. Sadly, I can't use a common project to hold the shared objects.
The second solution, on the other hand, would avoid the code duplication, but it would imply a lot of HTTP calls (above all in the ItemProcessor, to check whether an item is valid or not).
Do you have any other suggestion? Is there a better approach?
Thank you
Angelo
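For the dynamic date filter part of the question, one option is to make the reader step-scoped and bind the last execution date as a job parameter, so each run builds its WHERE clause from that value. This is only a sketch: the source_table, created_at column and SourceRecord type are hypothetical names, not taken from the question.

```java
import java.util.Date;
import java.util.Map;

import javax.sql.DataSource;

import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.database.JdbcPagingItemReader;
import org.springframework.batch.item.database.builder.JdbcPagingItemReaderBuilder;
import org.springframework.batch.item.database.support.SqlPagingQueryProviderFactoryBean;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class SourceReaderConfig {

    // Hypothetical mapping type for the rows read from database A.
    public record SourceRecord(Long id, String payload, java.sql.Timestamp createdAt) {}

    // Step-scoped so the job parameter (the last execution date) is resolved per run.
    @Bean
    @StepScope
    public JdbcPagingItemReader<SourceRecord> sourceReader(
            DataSource dataSourceA,
            @Value("#{jobParameters['lastExecutionDate']}") Date lastExecutionDate) throws Exception {

        SqlPagingQueryProviderFactoryBean provider = new SqlPagingQueryProviderFactoryBean();
        provider.setDataSource(dataSourceA);
        provider.setSelectClause("select id, payload, created_at");
        provider.setFromClause("from source_table");
        provider.setWhereClause("where created_at > :lastExecutionDate"); // the dynamic filter
        provider.setSortKey("id"); // paging needs a unique, stable sort key

        return new JdbcPagingItemReaderBuilder<SourceRecord>()
                .name("sourceReader")
                .dataSource(dataSourceA)
                .queryProvider(provider.getObject())
                .parameterValues(Map.of("lastExecutionDate", lastExecutionDate))
                .pageSize(1000) // aligned with the chunk size mentioned in the question
                .rowMapper((rs, rowNum) -> new SourceRecord(
                        rs.getLong("id"), rs.getString("payload"), rs.getTimestamp("created_at")))
                .build();
    }
}
```

The job launcher would pass the previous run's date as the lastExecutionDate job parameter; whether the checks then happen via duplicated code in MSB or via REST calls to MSA is the trade-off described above.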

How does Spring's server.connection-timeout work internally?

I'm a little confused about how the server.connection-timeout property works in a Spring Boot REST API project.
I have a Spring Boot REST API project with a delete REST API. It basically performs a couple of delete operations on database tables; for example, this delete API deletes some rows in 3 tables as follows.
The delete API gets a customer id as input and executes the following:
1. Delete all records matching the customer id in table A (a delete call to an external DB).
2. Delete all records matching the customer id in table B (a delete call to an external DB).
3. Delete all records matching the customer id in table C (a delete call to an external DB).
My question is: if I set server.connection-timeout to 5 seconds, what does that actually mean?
I have two assumptions:
1. The delete REST API times out after 5 seconds, meaning all 3 external DB calls have to complete within 5 seconds, otherwise the REST API times out.
2. Each external DB call has a 5-second timeout, so 15 seconds in total.
In the worst case, if each of the 3 external DB calls takes 4 seconds, the delete API would take 12 seconds to respond. Is that a valid reading?
I think you are confusing two things. server.connection-timeout is the time the connector waits for another HTTP request before closing the connection.
It doesn't matter how much time it takes to complete a request.
In your case, if server.connection-timeout is 5 seconds, it will not affect deletes #1, #2 or #3 that you mentioned.
In simple terms, connection-timeout does not apply to long-running requests. It applies to the initial connection, when the server waits for the client to request something.
Default: the connector's container-specific default is used. Use a value of -1 to indicate an infinite timeout.
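For reference, a minimal sketch of how the property is set (note that more recent Spring Boot versions replace it with server-specific variants such as server.tomcat.connection-timeout):

```properties
# application.properties (sketch). This only bounds how long the connector keeps an
# idle connection open while waiting for the client to send a request; it does not
# limit how long the delete calls to tables A, B and C may take.
server.connection-timeout=5000
```

Per-request or per-query limits would have to come from elsewhere (e.g. JDBC/query timeouts on the DataSource), not from this property.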

JdbcBatchItemWriterBuilder vs org.springframework.jdbc.core.JdbcTemplate.batchUpdate

I understand that jdbcTemplate.batchUpdate is used for sending several records to the database in one round trip.
Let's say I have 1000 records to be updated: instead of 1000 calls from the application to the database, the application sends the 1000 records in a single request.
Coming to JdbcBatchItemWriterBuilder, as I understand it, it builds the writer used within a step of a job.
My question is: if there are 1000 records to be processed (INSERT statements) via the writer built with JdbcBatchItemWriterBuilder, are all the INSERTs executed in one go, or one after another?
If one after another, doesn't connecting to the database 1000 times with JdbcBatchItemWriterBuilder cause performance issues? How is that handled?
I would like to understand whether Spring Batch performs better than running 1000 INSERT statements using jdbcTemplate.update.
The JdbcBatchItemWriter uses java.sql.PreparedStatement#addBatch and java.sql.Statement#executeBatch internally (see https://github.com/spring-projects/spring-batch/blob/c4010fbffa6b71cbcfe79d523023251ce73666a4/spring-batch-infrastructure/src/main/java/org/springframework/batch/item/database/JdbcBatchItemWriter.java#L189-L195), so there is a single batch insert for all items of a chunk.
Moreover, this is executed in a single transaction, as described in the Chunk-oriented Processing section of the reference documentation.
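To make that concrete, here is a minimal sketch of a writer built with JdbcBatchItemWriterBuilder; the person table, its columns and the Person class are invented for illustration. With a chunk size of 1000, all 1000 INSERTs of a chunk are sent as one JDBC batch in one transaction, not as 1000 separate round trips:

```java
import javax.sql.DataSource;

import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.database.builder.JdbcBatchItemWriterBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class PersonWriterConfig {

    // Hypothetical item type with JavaBean-style getters for beanMapped() binding.
    public static class Person {
        private final String firstName;
        private final String lastName;

        public Person(String firstName, String lastName) {
            this.firstName = firstName;
            this.lastName = lastName;
        }

        public String getFirstName() { return firstName; }
        public String getLastName() { return lastName; }
    }

    @Bean
    public JdbcBatchItemWriter<Person> personWriter(DataSource dataSource) {
        return new JdbcBatchItemWriterBuilder<Person>()
                .dataSource(dataSource)
                .sql("INSERT INTO person (first_name, last_name) VALUES (:firstName, :lastName)")
                .beanMapped() // named parameters are bound from Person's getters
                .build();
    }
}
```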

Spring Boot batch parallel processing (using annotations)

I want to process millions of records and I am currently using Spring Boot with Spring Batch. It works fine with a single thread, but I want to speed up the whole process with parallel processing. Is this achievable without changing the reading and writing order?
E.g.:
Assume I provide an input text file with 1000 student details, where the student numbers run from 1 to 1000. I want to introduce parallel processing with 10 threads (100 students per thread) and do some operation on each. Once all students are processed, I should produce the output text file based on the input file.
The output file also needs to follow the same order, student numbers 1 to 1000, even though multiple threads run simultaneously.
Preprocess all keys and create a HashMap (studentKey -> studentResponse) and a collection (an ArrayList of studentResponse) in the order in which you want to return them. The student responses in the collection are the same studentResponse instances as in the map. Then make your parallel calls, each of which updates the contents of the studentResponse instances in the map according to the key(s) it is processing. The collection is updated too, since it holds the same instances. Finally, process the collection to create your text file.
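A compact sketch of that idea in plain Java (the "processing" here is a placeholder; in a real Spring Batch job the parallel part would typically be a multi-threaded or partitioned step):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class OrderedParallelProcessing {

    // Mutable holder shared by the ordered list and the lookup map.
    static class StudentResponse {
        final int studentNumber;
        volatile String result; // filled in by a worker thread
        StudentResponse(int studentNumber) { this.studentNumber = studentNumber; }
    }

    public static void main(String[] args) throws Exception {
        // 1. Pre-create one response holder per student, in input order.
        List<StudentResponse> ordered = new ArrayList<>();
        Map<Integer, StudentResponse> byNumber = new HashMap<>();
        for (int n = 1; n <= 1000; n++) {
            StudentResponse r = new StudentResponse(n);
            ordered.add(r);
            byNumber.put(n, r);
        }

        // 2. Process in parallel; each task mutates the shared holder for its key.
        ExecutorService pool = Executors.newFixedThreadPool(10);
        List<Future<?>> futures = new ArrayList<>();
        for (Integer n : byNumber.keySet()) {
            futures.add(pool.submit(() -> {
                byNumber.get(n).result = "processed-" + n; // placeholder work
            }));
        }
        for (Future<?> f : futures) {
            f.get(); // wait for every worker to finish
        }
        pool.shutdown();

        // 3. Write the output in the original input order, independent of completion order.
        for (StudentResponse r : ordered) {
            System.out.println(r.studentNumber + " -> " + r.result);
        }
    }
}
```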
