Spring Boot JPA: select millions of records from DB and process the data

I am working on a Spring Boot application where I need to fetch 400,000 rows from the DB and pass them on as a list.
How should I approach this?
I am thinking of splitting the records into groups of 1000 and passing them on.
But in that case, how do I specify the offset in my SQL query? Once I fetch the first 1000 records, how do I fetch records 1001-2000?
Another way is to fetch the records as a stream; in that case I have to find a way to send the stream through a REST GET API from my application whenever someone calls it.
Basically I need to build a REST GET API that passes this data on to whoever is using my API.

You can use OFFSET and FETCH NEXT (or LIMIT, depending on your database).
Example:
SELECT *
FROM t_users
ORDER BY employee_name
OFFSET 1000 ROWS FETCH NEXT 1000 ROWS ONLY;
Now in your case, it will fetch 1000 records each time, and you can pass the OFFSET value dynamically.
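To drive that offset from the REST layer, a Spring Data JPA page request can be exposed directly as a GET parameter. A minimal sketch, where the User entity and UserRepository are hypothetical names:

import java.util.List;
import org.springframework.data.domain.PageRequest;
import org.springframework.data.domain.Sort;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

// User is a hypothetical JPA entity; Spring Data derives paging from the Pageable.
interface UserRepository extends JpaRepository<User, Long> {}

@RestController
public class UserController {

    private final UserRepository userRepository;

    public UserController(UserRepository userRepository) {
        this.userRepository = userRepository;
    }

    // GET /users?page=0 returns records 1-1000, page=1 returns 1001-2000, etc.
    // Spring Data turns the PageRequest into the OFFSET/FETCH query shown above.
    @GetMapping("/users")
    public List<User> getUsers(@RequestParam(defaultValue = "0") int page) {
        return userRepository
                .findAll(PageRequest.of(page, 1000, Sort.by("employeeName")))
                .getContent();
    }
}

The caller walks the pages by incrementing the page parameter, so no 400,000-row list is ever built in memory at once.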

Related

How to push millions of records to db

I have to read 2 million records from a DB and store them in another DB. I tried reading all the data using Spring pagination; I need the best and easiest approach to process the records batch-wise and store them in the other DB.
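A minimal sketch of such a batch-wise copy with Spring Data paging; SourceRecordRepository, TargetRecordRepository and the mapping method are hypothetical names, and each repository is assumed to point at its own DataSource:

import org.springframework.data.domain.Page;
import org.springframework.data.domain.PageRequest;
import org.springframework.data.domain.Pageable;
import org.springframework.data.domain.Sort;
import org.springframework.stereotype.Service;

@Service
public class RecordCopyService {

    private final SourceRecordRepository source; // reads from DB-A
    private final TargetRecordRepository target; // writes to DB-B

    public RecordCopyService(SourceRecordRepository source, TargetRecordRepository target) {
        this.source = source;
        this.target = target;
    }

    // Reads 1000 rows per round trip and writes each batch before fetching
    // the next, so memory stays bounded regardless of the total row count.
    public void copyAll() {
        Pageable pageable = PageRequest.of(0, 1000, Sort.by("id"));
        Page<SourceRecord> page;
        do {
            page = source.findAll(pageable);
            target.saveAll(page.map(TargetRecord::fromSource).getContent());
            pageable = pageable.next();
        } while (page.hasNext());
    }
}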

Spring batch fetch huge amount of data from DB-A and store them in DB-B

I have the following scenario. In a database A I have a table with a huge number of records (several million); these records increase very rapidly day by day (up to 100,000 records a day).
I need to fetch these records, check whether they are valid and import them into my own database. On the first run I should take all the stored records. After that I can take only the new records. I have a timestamp column I can use for this filter, but I can't figure out how to create a JpaPagingItemReader or a JdbcPagingItemReader and pass it a dynamic filter based on the date (e.g. select all records where the timestamp is greater than the job's last execution date).
I'm using Spring Boot, Spring Data JPA and Spring Batch. I'm configuring the Job instance in chunks of size 1000. I can also use a paging query (is that useful if I use chunks?).
I have a microservice (let's call it MSA) with all the business logic needed to check whether records are valid and to insert the valid ones.
I have another service on a separate server. This service contains all the batch operations (let's call it MSB).
I'm wondering what the best approach to the batch is. I was thinking of these solutions:
in MSB I duplicate all the entities, repositories and services I use in MSA. Then in MSB I can make all the needed queries
in MSA I create all the REST APIs needed. The ItemProcessor of MSB will call these REST APIs to perform checks on the items to be processed, and finally in the ItemWriter I'll call the REST API to save the data
The first solution would avoid the HTTP calls, but it forces me to duplicate all the repositories and services between the 2 microservices. Sadly, I can't use a common project in which to place all the common objects.
The second solution, on the other hand, would avoid the code duplication, but it would imply a lot of HTTP calls (above all in the ItemProcessor, to check whether an item is valid or not).
Do you have any other suggestion? Is there a better approach?
Thank you
Angelo
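For the dynamic date filter specifically, a JdbcPagingItemReader can carry a whereClause with a named parameter, and @StepScope delays the bean's creation until the step runs so the value can be injected from a job parameter. A sketch, assuming Spring Batch 4+ and a hypothetical source_record table with a created_at column:

import java.util.Date;
import java.util.Map;
import javax.sql.DataSource;
import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.database.JdbcPagingItemReader;
import org.springframework.batch.item.database.Order;
import org.springframework.batch.item.database.builder.JdbcPagingItemReaderBuilder;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.core.BeanPropertyRowMapper;

@Configuration
public class SourceReaderConfig {

    // StepScope means the 'lastExecution' job parameter is resolved per
    // execution; pass the epoch (or any very old date) on the first run
    // so that everything is taken.
    @Bean
    @StepScope
    public JdbcPagingItemReader<SourceRecord> sourceReader(
            DataSource dataSource,
            @Value("#{jobParameters['lastExecution']}") Date lastExecution) {
        return new JdbcPagingItemReaderBuilder<SourceRecord>()
                .name("sourceReader")
                .dataSource(dataSource)
                .selectClause("SELECT id, payload, created_at")
                .fromClause("FROM source_record")
                .whereClause("WHERE created_at > :lastExecution")
                .parameterValues(Map.of("lastExecution", lastExecution))
                .sortKeys(Map.of("id", Order.ASCENDING))
                .pageSize(1000)
                .rowMapper(new BeanPropertyRowMapper<>(SourceRecord.class))
                .build();
    }
}

The same reader drops into a chunk-oriented step unchanged, so chunking and paging work together: the reader pages through the source table while the step commits per chunk.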

fetch 200k records in single jpa select query within 5 seconds

I want to fetch 200k records in a single JPA select query within 5 seconds. I am selecting one column which is already indexed. Currently it takes more than 5 minutes. Is it possible to select over 100k records in 5 seconds?
This is not possible with Hibernate or a normal native query, since it has to create hundreds of thousands of objects on the Java side and the results need to be sent over the network (serialization and deserialization).
You could try the steps below for fine tuning:
At the DB side you could change the index method: the default is a binary tree (B-tree); instead, set it to the HASH method.
Use parallel threads to retrieve the results in paginated mode (use native SQL).
Hope this gives some input for further fine tuning.
Use this hint to retrieve lakhs of records:
query.setHint("org.hibernate.fetchSize", 5000);
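In context, the hint hangs off the JPA query itself. A sketch, assuming a hypothetical Employee entity and the jakarta.persistence namespace of Spring Boot 3; pairing it with the read-only hint also keeps Hibernate from tracking every result:

import jakarta.persistence.EntityManager;
import jakarta.persistence.TypedQuery;
import java.util.List;

public class NameExporter {

    // Asks the JDBC driver to pull rows in blocks of 5000 instead of the
    // driver default, and marks the results read-only so Hibernate skips
    // dirty checking on them.
    public List<String> fetchNames(EntityManager em) {
        TypedQuery<String> query =
                em.createQuery("SELECT e.name FROM Employee e", String.class);
        query.setHint("org.hibernate.fetchSize", 5000);
        query.setHint("org.hibernate.readOnly", true);
        return query.getResultList();
    }
}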

Spring @Transactional + Isolation.REPEATABLE_READ for Rate Limiting

We are trying out a scenario of rate limiting the total number of JSON records requested in a month to 10,000 for an API.
We store the total count of records in a table against client_id and a timestamp (which is the primary key).
Per request, we fetch the record from the table for that client with a timestamp within that month.
From this record we get the current count, then increment it by the number of records in the current request and update the DB.
Using Spring transactions, the pseudocode is as below:
@Transactional(propagation = Propagation.REQUIRES_NEW, isolation = Isolation.REPEATABLE_READ)
public void updateLimitData(String clientId, int currentRecordCount) {
    // step 1
    startOfMonthTimestamp = getStartOfMonth();
    endOfMonthTimestamp = getEndOfMonth();
    // step 2: read from DB
    latestLimitDetails = fetchFromDB(startOfMonthTimestamp, endOfMonthTimestamp, clientId);
    latestLimitDetails.count += currentRecordCount;
    // step 3: write back
    saveToDB(latestLimitDetails);
}
We want to make sure that when multiple threads access the updateLimitData() method, each thread gets the updated count for a clientId for the month and does not overwrite it incorrectly.
In the above scenario, if multiple threads access updateLimitData() and reach step 3, the first thread updates the count in the DB, then the second thread updates the count in the DB, possibly without the latest value.
I understand from Isolation.REPEATABLE_READ that a write lock is placed on the rows only when the update is issued at step 3 (by which time the other thread has stale data). How can I ensure that threads always get the latest count from the table in a multithreaded scenario?
One solution that came to mind is synchronizing this block, but that will not work in a multi-server scenario.
Please provide a solution.
A transaction alone will not help you unless you lock the table/row while doing this operation (don't do that, as it will hurt performance).
You can migrate this to the database, doing the increment within the database using a stored procedure or function call. This ensures ACID transactional safety, as it is built into the database.
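The same effect is available without a stored procedure by making the read-increment-check a single UPDATE statement. A sketch with Spring Data JPA; the 10,000 cap comes from the question, while the rate_limit table, its columns and the RateLimit entity are hypothetical names:

import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Modifying;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.query.Param;
import org.springframework.transaction.annotation.Transactional;

// RateLimit is a hypothetical entity mapped to the counter table.
public interface RateLimitRepository extends JpaRepository<RateLimit, Long> {

    // The database executes read-increment-check as one atomic statement,
    // so concurrent threads (or servers) cannot overwrite each other's
    // counts. A return value of 0 means the monthly quota would be exceeded
    // (or no counter row exists yet for the month).
    @Modifying
    @Transactional
    @Query(value = "UPDATE rate_limit SET record_count = record_count + :delta "
                 + "WHERE client_id = :clientId AND period_month = :month "
                 + "AND record_count + :delta <= 10000",
           nativeQuery = true)
    int tryConsume(@Param("clientId") String clientId,
                   @Param("month") String month,
                   @Param("delta") int delta);
}

Because the check and the increment happen in one statement, no isolation-level juggling or synchronized block is needed, and it works across multiple servers.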
I recommend doing this using standard Spring Actuator to produce a count of API calls; however, this will mean rewriting your service to use the Actuator endpoint and not the database. You can link this to your gateway/firewall/load balancer to deny access to the API once the quota is reached. This means that your API endpoint stays pure and this logic is removed from your API call. All new APIs you develop will automatically get this functionality.

Order by making very slow the application using oracle

In my application I need to generate a report of the transaction history for all clients. I am using Oracle 12c. I have 300k clients. The client table is related to the client details and transaction history tables. I have written a query to show the transaction history per month; it returns nearly 20 million records.
SELECT C.CLIENT_ID, CD.CLIENT_NAME, ...... FROM CLIENT C, CLIENT_DETAILS CD,
TRANSACTION_HISTORY TH
--Condition part
ORDER BY C.CLIENT_ID
These 3 tables have the right indexes, which work fine. But when fetching the data with ORDER BY to show the customers in order, the query takes 8 hours in the batch process.
I have analysed the cost of the query: it is 80085, but when I remove the ORDER BY the cost drops to 200. So I have removed the ORDER BY for now, but I need to show the customers in order, and I cannot use LIMIT. Is there any way to overcome this?
You can try indexing the client ID in the table, which would speed up fetching the data in that order.
You can use the link for reference: link
Hope this helps.
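Beyond the index suggestion above, a different technique worth naming is keyset-style slicing: page over the indexed CLIENT_ID and sort only one slice at a time, so Oracle never sorts all 20 million rows at once while the output is still in CLIENT_ID order. A sketch with JdbcTemplate, assuming numeric client IDs; the join columns are assumptions based on the table names in the question:

import java.util.List;
import org.springframework.jdbc.core.JdbcTemplate;

public class TransactionHistoryReport {

    private final JdbcTemplate jdbc;

    public TransactionHistoryReport(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    // Walks clients in CLIENT_ID order, 1000 at a time. Each slice query
    // sorts only that slice's rows, so the global 20M-row sort never runs,
    // yet the concatenated output is still ordered by CLIENT_ID.
    public void generate() {
        long lastClientId = 0;
        while (true) {
            List<Long> slice = jdbc.queryForList(
                "SELECT CLIENT_ID FROM CLIENT WHERE CLIENT_ID > ? " +
                "ORDER BY CLIENT_ID FETCH FIRST 1000 ROWS ONLY",
                Long.class, lastClientId);
            if (slice.isEmpty()) {
                break;
            }
            long from = slice.get(0);
            long to = slice.get(slice.size() - 1);
            jdbc.query(
                "SELECT C.CLIENT_ID, CD.CLIENT_NAME, TH.* " +
                "FROM CLIENT C " +
                "JOIN CLIENT_DETAILS CD ON CD.CLIENT_ID = C.CLIENT_ID " +
                "JOIN TRANSACTION_HISTORY TH ON TH.CLIENT_ID = C.CLIENT_ID " +
                "WHERE C.CLIENT_ID BETWEEN ? AND ? " +
                "ORDER BY C.CLIENT_ID",
                rs -> { /* write the row to the report */ },
                from, to);
            lastClientId = to;
        }
    }
}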
