Spring batch fetch huge amount of data from DB-A and store them in DB-B - spring-boot

I have the following scenario. Database A contains a table with a huge number of records (several million), and the table grows very quickly (as many as 100,000 new records per day).
I need to fetch these records, check whether they are valid, and import them into my own database. On the first run I should take all the stored records; after that I can take only the newly saved ones. I have a timestamp column I can use for this filter, but I can't figure out how to create a JpaPagingItemReader or a JdbcPagingItemReader and pass it a dynamic filter based on the date (e.g. select all records whose timestamp is greater than the job's last execution date).
I'm using Spring Boot, Spring Data JPA and Spring Batch. I'm configuring the job to run in chunks of 1000. I could also use a paging query (is that useful if I already use chunks?).
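To make the "timestamp greater than the last execution date" filter concrete, here is a minimal in-memory sketch of keyset paging over such a table. It is only an illustration of the idea, not a Spring Batch reader: SourceRecord, IncrementalPageReader and the list that stands in for the table are all hypothetical names. In a real job the cutoff would come from outside (for example as a job parameter) and the filter would live in the reader's query.

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical row of the source table in database A. */
final class SourceRecord {
    final long id;
    final Instant timestamp;

    SourceRecord(long id, Instant timestamp) {
        this.id = id;
        this.timestamp = timestamp;
    }
}

/** Pages through rows newer than a cutoff, ordered by (timestamp, id). */
final class IncrementalPageReader {
    private final List<SourceRecord> table; // stands in for the table in database A
    private final int pageSize;
    private Instant lastTimestamp;          // position of the last row handed out
    private long lastId = Long.MAX_VALUE;   // MAX_VALUE: rows exactly at the cutoff are skipped

    IncrementalPageReader(List<SourceRecord> table, Instant lastExecution, int pageSize) {
        this.table = table;
        this.lastTimestamp = lastExecution;
        this.pageSize = pageSize;
    }

    /**
     * In-memory equivalent of:
     *   WHERE ts > :lastTs OR (ts = :lastTs AND id > :lastId)
     *   ORDER BY ts, id, limited to pageSize rows.
     * An empty list means there is nothing left to read.
     */
    List<SourceRecord> nextPage() {
        List<SourceRecord> page = table.stream()
                .filter(r -> r.timestamp.isAfter(lastTimestamp)
                        || (r.timestamp.equals(lastTimestamp) && r.id > lastId))
                .sorted(Comparator.comparing((SourceRecord r) -> r.timestamp)
                        .thenComparingLong(r -> r.id))
                .limit(pageSize)
                .collect(Collectors.toList());
        if (!page.isEmpty()) {
            // Remember where this page ended so the next call continues from there.
            SourceRecord last = page.get(page.size() - 1);
            lastTimestamp = last.timestamp;
            lastId = last.id;
        }
        return page;
    }
}
```

For the very first run the cutoff could simply be a date older than any record (for example Instant.EPOCH), so that every row qualifies. Note that page size and chunk size are separate settings: the page size controls how many rows each query fetches, while the chunk size controls how many items are processed and written per transaction.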
I have a micro service (let's call this MSA) with all the business logic needed to check if records are valid and insert the valid records.
I have another service on a separate server. This service contains all the batch operations (let's call it MSB).
I'm wondering what the best approach to the batch is. I was thinking of these solutions:
1) In MSB I duplicate all the entities, repositories and services I use in MSA. MSB can then run all the needed queries itself.
2) In MSA I create all the REST APIs needed. The ItemProcessor of MSB calls these REST APIs to check the items being processed, and the ItemWriter calls a REST API to save the data.
The first solution avoids the HTTP calls, but it forces me to duplicate all repositories and services between the two microservices. Sadly, I can't use a common project to hold the shared objects.
The second solution, on the other hand, avoids the code duplication but implies a lot of HTTP calls (above all in the ItemProcessor, to check whether an item is valid or not).
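One way to picture the HTTP cost of the second solution: if the validation endpoint accepted a whole list of items instead of a single one, the number of calls would drop from one per item to one per chunk. The sketch below only illustrates that arithmetic. ChunkedRemoteValidation and the validateChunk function are hypothetical stand-ins for a bulk REST call to MSA, which the question does not say exists.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class ChunkedRemoteValidation {

    /**
     * Splits the items into chunks and invokes validateChunk once per chunk.
     * validateChunk stands in for a (hypothetical) bulk REST call that
     * returns only the valid items of the list it was given.
     */
    static <T> List<T> validateInChunks(List<T> items, int chunkSize,
                                        Function<List<T>, List<T>> validateChunk) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<T> valid = new ArrayList<>();
        for (int from = 0; from < items.size(); from += chunkSize) {
            int to = Math.min(from + chunkSize, items.size());
            // One remote round trip for the whole sub-list.
            valid.addAll(validateChunk.apply(items.subList(from, to)));
        }
        return valid;
    }
}
```

With a chunk size of 1000, 100,000 new records per day would mean about 100 validation calls instead of 100,000. Since a Spring Batch ItemWriter already receives the items of a chunk together, the writer is a natural place for such a bulk call, assuming MSA can expose one.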
Do you have any other suggestion? Is there a better approach?
Thank you
Angelo

Related

What is the best approach when polling data from a DB and querying the DB again to fetch additional information?

The Spring Boot application that I am working on:
1) polls 1000 messages from table X (this table is populated by another service, s1);
2) for each message, gets the account number and queries table Y for additional information about the account.
I am using Spring Integration to poll messages from table X; for reading the additional account information I am planning to use Spring JDBC.
We are expecting about 10k messages every day.
Is the above approach, querying table Y for each message, a good one?
No, it is not. If all of that data is in the same database, consider writing a proper SELECT that joins those tables in a single query performed by the source polling channel adapter.
Another approach is to implement a stored procedure that does the job for you and returns all the needed data: https://docs.spring.io/spring-integration/reference/html/jdbc.html#stored-procedures.
However, if memory is too limited in your environment to handle that many records at once, or you don't care how fast they are all processed, then an integration flow that splits the polling result and processes the parts in parallel is indeed OK. For that you can use a JdbcOutboundGateway as a service in your flow instead of working with a plain JdbcTemplate: https://docs.spring.io/spring-integration/reference/html/jdbc.html#jdbc-outbound-gateway

JdbcBatchItemWriterBuilder vs org.springframework.jdbc.core.jdbcTemplate.batchUpdate

I understand that jdbcTemplate.batchUpdate is used to send several records to the database in one round trip.
Let's say I have 1000 records to update: instead of 1000 round trips from the application to the database, the application sends all 1000 records in one request.
JdbcBatchItemWriterBuilder, as I understand it, is part of the combination of tasks that make up a job.
My question is: if there are 1000 records to be processed (INSERT statements) via JdbcBatchItemWriterBuilder, are all the INSERTs executed in one go, or one after another?
If one after another, does connecting to the database 1000 times through JdbcBatchItemWriterBuilder cause performance issues? How is that handled?
I would like to understand whether Spring Batch performs better than running 1000 INSERT statements using jdbcTemplate.update.
The JdbcBatchItemWriter uses java.sql.PreparedStatement#addBatch and java.sql.Statement#executeBatch internally (See https://github.com/spring-projects/spring-batch/blob/c4010fbffa6b71cbcfe79d523023251ce73666a4/spring-batch-infrastructure/src/main/java/org/springframework/batch/item/database/JdbcBatchItemWriter.java#L189-L195), so there will be a single batch insert for all items of the chunk.
Moreover, this will be executed in a single transaction as described in the Chunk-oriented Processing section of the reference documentation.
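For readers who have not used the two JDBC methods named above, here is a minimal plain-JDBC sketch of the addBatch/executeBatch pattern. It is not the Spring Batch source; the ChunkBatchInsert class, the person table and its columns are made up for the example.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

final class ChunkBatchInsert {

    // Hypothetical table and column names.
    static final String INSERT_SQL = "INSERT INTO person (id, name) VALUES (?, ?)";

    /**
     * Queues one INSERT per item with addBatch() and submits the whole
     * chunk with a single executeBatch() call.
     * Returns the per-statement update counts reported by the driver.
     */
    static int[] insertChunk(Connection connection, List<String> names) throws SQLException {
        try (PreparedStatement ps = connection.prepareStatement(INSERT_SQL)) {
            long id = 1;
            for (String name : names) {
                ps.setLong(1, id++);
                ps.setString(2, name);
                ps.addBatch();      // queued on the statement, not yet executed
            }
            return ps.executeBatch(); // one submission for all queued rows
        }
    }
}
```

The same prepared statement and the same connection are reused for every item, so there are not 1000 separate connections. How the batch actually travels over the wire depends on the JDBC driver and its settings.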

Implementing static shared counter in microservice architecture

I have a use case where I want to record data in rows and display it to the user.
Multiple users can add these records and they have to be displayed in order of insertion AND - MOST IMPORTANTLY - with a sequence number starting from 1.
I have a Spring Boot microservice architecture at the backend, which obviously means I cannot hold state in my Boot application, as I'm going to have multiple running instances.
Another method was to fetch all existing records in the DB, count them, increment the count by 1 and use that as my sequence number. I would need to do this every time I do an insert.
But the problem with this second approach is parallel requests, which could result in the same sequence number being given to two records.
A third approach is to configure the counter in the DB, but since I am using Cosmos DB, apparently that is not an option either.
Any suggestions as to how I can implement a static, shared counter?
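The race described above (two parallel requests reading the same count) is the classic case for an optimistic "read, increment, write only if unchanged" loop. The sketch below shows that loop against an in-memory stand-in. It assumes the real store offers some form of conditional replace (for example a version or ETag check); CounterDoc, CounterStore and SequenceGenerator are hypothetical names, not a Cosmos DB API.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical counter document: a value plus a version that changes on every write. */
final class CounterDoc {
    final long value;
    final long version;

    CounterDoc(long value, long version) {
        this.value = value;
        this.version = version;
    }
}

/** In-memory stand-in for a store that supports a conditional replace. */
final class CounterStore {
    private final AtomicReference<CounterDoc> doc =
            new AtomicReference<>(new CounterDoc(0, 0));

    CounterDoc read() {
        return doc.get();
    }

    /** Replaces the document only if nobody has written since 'expected' was read. */
    boolean replaceIfUnchanged(CounterDoc expected, CounterDoc updated) {
        return doc.compareAndSet(expected, updated);
    }
}

final class SequenceGenerator {
    private final CounterStore store;

    SequenceGenerator(CounterStore store) {
        this.store = store;
    }

    /** Returns the next sequence number, starting from 1. */
    long next() {
        while (true) {
            CounterDoc current = store.read();
            CounterDoc updated = new CounterDoc(current.value + 1, current.version + 1);
            if (store.replaceIfUnchanged(current, updated)) {
                return updated.value;
            }
            // Another writer won the race: re-read and try again.
        }
    }
}
```

Because the write succeeds only when the document is unchanged, two requests can never be handed the same number, whichever service instance they hit. A number is consumed even if the later insert of the record fails, so gaps are possible with this scheme.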

spring boot business reports

I have an order fulfillment application built mostly with Spring Data REST and Spring Boot. I need to take every product, find it in the orders, and calculate how many were sold in total and what the price is for a given time period. I have other requirements like this.
The data in the report will not look like any of the business entities. It will have sums, totals, a field from one entity, a field from another entity, and so on. It will not be persisted; it will be generated only for the client to consume. It should be pageable too.
What is the correct approach to tackle this? Do I use POJOs to represent the report lines? How do I manage paging with this? Does it make sense to have a repository for each report? Is it possible with HQL to return, from a repository method, a report line that is not a persisted entity?
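As a rough picture of what a "report line that is not an entity" could look like, here is a small plain-Java sketch: a read-only value class per row, an aggregation over order lines for a time period, and a simple page slice. All class and field names (OrderLine, ProductSalesLine, SalesReport) are hypothetical, and the aggregation is done in memory only to keep the example self-contained. In a real application the sums would normally be computed by the query itself, for example with a JPQL constructor expression that selects straight into such a class.

```java
import java.math.BigDecimal;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical order line; field names are illustrative. */
final class OrderLine {
    final String product;
    final int quantity;
    final BigDecimal unitPrice;
    final LocalDate orderedOn;

    OrderLine(String product, int quantity, BigDecimal unitPrice, LocalDate orderedOn) {
        this.product = product;
        this.quantity = quantity;
        this.unitPrice = unitPrice;
        this.orderedOn = orderedOn;
    }
}

/** One row of the report: a plain value object, not a persisted entity. */
final class ProductSalesLine {
    final String product;
    final long totalQuantity;
    final BigDecimal totalPrice;

    ProductSalesLine(String product, long totalQuantity, BigDecimal totalPrice) {
        this.product = product;
        this.totalQuantity = totalQuantity;
        this.totalPrice = totalPrice;
    }
}

final class SalesReport {

    /** Sums quantity and price per product for orders dated in [from, toExclusive). */
    static List<ProductSalesLine> build(List<OrderLine> lines, LocalDate from, LocalDate toExclusive) {
        Map<String, ProductSalesLine> byProduct = new TreeMap<>(); // sorted by product name
        for (OrderLine line : lines) {
            if (line.orderedOn.isBefore(from) || !line.orderedOn.isBefore(toExclusive)) {
                continue; // outside the reporting period
            }
            BigDecimal amount = line.unitPrice.multiply(BigDecimal.valueOf(line.quantity));
            byProduct.merge(line.product,
                    new ProductSalesLine(line.product, line.quantity, amount),
                    (a, b) -> new ProductSalesLine(a.product,
                            a.totalQuantity + b.totalQuantity,
                            a.totalPrice.add(b.totalPrice)));
        }
        return new ArrayList<>(byProduct.values());
    }

    /** Returns the zero-based page of the given size, or an empty list past the end. */
    static <T> List<T> page(List<T> rows, int pageNumber, int pageSize) {
        if (pageNumber < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("pageNumber must be >= 0 and pageSize > 0");
        }
        long start = (long) pageNumber * pageSize;
        int from = (int) Math.min(start, rows.size());
        int to = (int) Math.min(start + pageSize, rows.size());
        return rows.subList(from, to);
    }
}
```

The stable sort order (here by product name) matters for paging: without it, consecutive pages could overlap or skip rows.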

Processing millions of records using Spring Batch, including pattern matching

My use case is as follows:
1) Read 20 million records from a Db2 database, and read the filter criteria from Db2 as well. The criteria involve multiple columns, and some of the columns use patterns, such as column A with the value %EMP%.
2) For each combination of the rules, filter the 20M-record dataset and at the same time update the database column holding a flag that indicates the record is filtered out.
3) At the end of the process, we will invoke an Informatica workflow, which will take the unfiltered records from the 20M and process them.
We do not want the filtering logic in Informatica because it would be expensive, so we are looking for a way to do it with Spring Batch, where we can spawn multiple threads to run the filtering logic.
I am not sure whether Spring Batch is the right candidate for this, but I need some suggestions for implementing it in Java.
Please suggest
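If the pattern matching from step 1 has to happen in Java rather than in SQL, one possible sketch is to translate each LIKE-style pattern into a regular expression once and then test the values on several threads. LikeFilter is a hypothetical helper; it handles only the % and _ wildcards and has no support for a LIKE escape character.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class LikeFilter {

    /**
     * Translates a SQL LIKE pattern into a regex:
     * '%' matches any run of characters, '_' matches exactly one character,
     * everything else is matched literally.
     */
    static Pattern toRegex(String like) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (char c : like.toCharArray()) {
            if (c == '%' || c == '_') {
                if (literal.length() > 0) {
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                regex.append(c == '%' ? ".*" : ".");
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    /** Returns the values that match at least one pattern, i.e. the ones to flag as filtered out. */
    static List<String> filteredOut(List<String> values, List<String> likePatterns) {
        // Compile every pattern once, outside the per-value loop.
        List<Pattern> compiled = likePatterns.stream()
                .map(LikeFilter::toRegex)
                .collect(Collectors.toList());
        return values.parallelStream() // spreads the matching over several threads
                .filter(v -> compiled.stream().anyMatch(p -> p.matcher(v).matches()))
                .collect(Collectors.toList());
    }
}
```

For example, with the pattern %EMP% the values EMPLOYEE and TEMP would be flagged, while MANAGER would not. Matching here is case-sensitive.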
You should consider using Camel routes and Spring Boot.
You could use a Camel JPA consumer to place the records on an ActiveMQ queue. Use a JMS consumer with multiple concurrent consumers to process the records.
Use an aggregation strategy to invoke your Informatica workflow.
I haven't used Spring Batch, so I can't say if it's a better solution, but Spring Boot and Camel are pretty sweet.
