How to push millions of records to db - spring

I have to read 2 million records from one database and store them in another database. I tried reading all the data using Spring pagination. I need the best and easiest approach to write a batch job that processes the records batch-wise and stores them in the other database.
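For this kind of copy job, a chunk-oriented Spring Batch step is the usual fit: a reader on the source database, a writer on the target database, committed every 1,000 items. Below is a minimal sketch using Spring Batch 4-style builders (in Spring Batch 5 the step builder takes a JobRepository and transaction manager instead); the two DataSources and all table/column names are assumptions, not from the question.

import java.util.Map;
import javax.sql.DataSource;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.batch.item.database.builder.JdbcBatchItemWriterBuilder;
import org.springframework.batch.item.database.builder.JdbcCursorItemReaderBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.jdbc.core.ColumnMapRowMapper;

@Bean
public JdbcCursorItemReader<Map<String, Object>> sourceReader(DataSource sourceDataSource) {
    // Streams rows from the source database; table/column names are illustrative.
    return new JdbcCursorItemReaderBuilder<Map<String, Object>>()
            .name("sourceReader")
            .dataSource(sourceDataSource)
            .sql("SELECT id, name, created_at FROM source_record")
            .rowMapper(new ColumnMapRowMapper())   // each row becomes a Map keyed by column name
            .fetchSize(1000)
            .build();
}

@Bean
public JdbcBatchItemWriter<Map<String, Object>> targetWriter(DataSource targetDataSource) {
    // Batched inserts into the target database, one batch per chunk.
    return new JdbcBatchItemWriterBuilder<Map<String, Object>>()
            .dataSource(targetDataSource)
            .sql("INSERT INTO target_record (id, name, created_at) VALUES (:id, :name, :created_at)")
            .columnMapped()                        // bind the named parameters from the Map keys
            .build();
}

@Bean
public Step copyStep(StepBuilderFactory steps,
                     JdbcCursorItemReader<Map<String, Object>> sourceReader,
                     JdbcBatchItemWriter<Map<String, Object>> targetWriter) {
    return steps.get("copyStep")
            .<Map<String, Object>, Map<String, Object>>chunk(1000)  // commit every 1,000 items
            .reader(sourceReader)
            .writer(targetWriter)
            .build();
}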

Related

Spring Batch: fetch a huge amount of data from DB-A and store it in DB-B

I have the following scenario. In database A I have a table with a huge number of records (several million); these records increase very rapidly day by day (as many as 100,000 records per day).
I need to fetch these records, check whether they are valid and import them into my own database. On the first run I should take all the stored records. After that I can take only the newly saved records. I have a timestamp column I can use for this filter, but I can't figure out how to create a JpaPagingItemReader or a JdbcPagingItemReader and pass it a dynamic filter based on the date (e.g. select all records whose timestamp is greater than the job's last execution date).
I'm using Spring Boot, Spring Data JPA and Spring Batch. I'm configuring the Job with chunk-oriented steps of size 1000. I can also use a paging query (is it useful if I use chunks?)
I have a microservice (let's call it MSA) with all the business logic needed to check whether records are valid and to insert the valid ones.
I have another service on a separate server. This service contains all the batch operations (let's call it MSB).
I'm wondering what the best approach to the batch is. I was thinking of these solutions:
in MSB I duplicate all the entities, repositories and services I use in MSA. Then in MSB I can run all the needed queries
in MSA I create all the REST APIs needed. The ItemProcessor of MSB will call these REST APIs to perform checks on the items to be processed, and finally in the ItemWriter I'll call the REST API for saving data
The first solution would avoid the HTTP calls, but it forces me to duplicate all repositories and services between the two microservices. Sadly, I can't use a common project in which to place all the shared objects.
The second solution, on the other hand, would avoid the code duplication, but it would imply a lot of HTTP calls (above all in the ItemProcessor, to check whether an item is valid or not).
Do you have any other suggestion? Is there a better approach?
Thank you
Angelo
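For reference, one common way to handle the dynamic date filter described above is to pass the last execution date as a job parameter and late-bind it into a step-scoped JdbcPagingItemReader. A sketch under that assumption; the table and column names are illustrative only:

import java.util.Map;
import javax.sql.DataSource;
import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.database.JdbcPagingItemReader;
import org.springframework.batch.item.database.Order;
import org.springframework.batch.item.database.builder.JdbcPagingItemReaderBuilder;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.jdbc.core.ColumnMapRowMapper;

@Bean
@StepScope
public JdbcPagingItemReader<Map<String, Object>> newRecordsReader(
        DataSource sourceDataSource,
        @Value("#{jobParameters['lastRun']}") java.util.Date lastRun) {  // supplied when launching the job
    return new JdbcPagingItemReaderBuilder<Map<String, Object>>()
            .name("newRecordsReader")
            .dataSource(sourceDataSource)
            .selectClause("SELECT id, payload, updated_at")
            .fromClause("FROM source_record")
            .whereClause("WHERE updated_at > :lastRun")            // only records newer than the last run
            .sortKeys(Map.of("id", Order.ASCENDING))               // paging needs a unique sort key
            .parameterValues(Map.<String, Object>of("lastRun", lastRun))
            .rowMapper(new ColumnMapRowMapper())
            .pageSize(1000)
            .build();
}

When launching the job, the parameter can be supplied with something like new JobParametersBuilder().addDate("lastRun", lastExecutionDate).toJobParameters().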

How to read multiple tables using Spring Batch

I am looking to read data from multiple tables (in different databases), aggregate the results and create a final result set. In my case, each query returns a list of objects. I searched the web many times and found no link other than Spring Batch How to read multiple table (queries) as Reader and write it as flat file write, but it returns only a single object.
Is there any way we can do this? A working sample example would help a lot.
Example -
One query gives a list of Departments - from an Oracle DB
One query gives a list of Employees - from Postgres
Now I want to build the Employee-Department relationship and send the combined object to the processor for a further lookup against MongoDB, then send the final object to the writer.
The question should rather be "how to join three tables from three different databases and write the result to a file". There is no built-in reader in Spring Batch that reads from multiple tables. You either need to create a custom reader, or decompose the problem at hand into tasks that can be implemented using Spring Batch tasklet/chunk-oriented steps.
I believe you can use the driving query pattern in a single chunk-oriented step. The reader reads employee items, then a processor enriches each item with 1) the department from Postgres and 2) the other info from Mongo. This should work for small/medium datasets. If you have a lot of data, you can use partitioning to parallelize things and improve performance.
Another option, if you want to avoid a query per item, is to load all departments into a cache, for example (I guess there should be fewer departments than employees), and enrich items from the cache rather than with individual queries to the db.
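A rough sketch of that cached-departments variant, assuming hypothetical Employee/Department item types and a JdbcTemplate built on the Postgres DataSource (none of these names come from the question):

import java.util.HashMap;
import java.util.Map;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.annotation.BeforeStep;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.jdbc.core.JdbcTemplate;

// Hypothetical item types, purely for illustration.
record Department(long id, String name) {}
record Employee(long id, String name, long departmentId) {}
record EnrichedEmployee(Employee employee, Department department) {}

public class EmployeeEnrichingProcessor implements ItemProcessor<Employee, EnrichedEmployee> {

    private final JdbcTemplate postgresTemplate;   // points at the Postgres DataSource
    private final Map<Long, Department> departmentsById = new HashMap<>();

    public EmployeeEnrichingProcessor(JdbcTemplate postgresTemplate) {
        this.postgresTemplate = postgresTemplate;
    }

    // Runs once before the step; register the processor as a step listener if it is
    // not picked up automatically by your Spring Batch version.
    @BeforeStep
    public void loadDepartments(StepExecution stepExecution) {
        // Load all departments once instead of issuing one query per employee.
        postgresTemplate.query(
                "SELECT id, name FROM department",
                rs -> departmentsById.put(rs.getLong("id"),
                        new Department(rs.getLong("id"), rs.getString("name"))));
    }

    @Override
    public EnrichedEmployee process(Employee employee) {
        Department department = departmentsById.get(employee.departmentId());
        // A further lookup against MongoDB could be added here before returning.
        return new EnrichedEmployee(employee, department);
    }
}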

Is it possible to return data portion by portion

I have a client, a REST web service and a database. The database has big tables (10 million rows).
I need to search for some data in the db (1 million results), put it in an Excel file with Apache POI and return it to the client. The problem is that retrieving the data from the database and building the file can take longer than 1 hour. Is there a way to return the data portion by portion (retrieve 1,000 rows, return them, retrieve the next 1,000, return them, and so on)?
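One common approach is to page through the query while streaming the workbook to the client, so neither the rows nor the finished file ever sit fully in memory. A sketch using POI's streaming SXSSFWorkbook and Spring's StreamingResponseBody; the table, columns and endpoint are assumptions, and the OFFSET/FETCH syntax varies by database:

import java.util.List;
import java.util.Map;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;
import org.springframework.http.MediaType;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.servlet.mvc.method.annotation.StreamingResponseBody;

@RestController
public class ReportController {

    private final JdbcTemplate jdbcTemplate;

    public ReportController(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @GetMapping(value = "/report", produces = MediaType.APPLICATION_OCTET_STREAM_VALUE)
    public StreamingResponseBody report() {
        return out -> {
            // SXSSFWorkbook keeps only a small window of rows in memory and flushes the rest to disk.
            try (SXSSFWorkbook workbook = new SXSSFWorkbook(100)) {
                Sheet sheet = workbook.createSheet("results");
                int rowIndex = 0;
                int offset = 0;
                List<Map<String, Object>> page;
                do {
                    // Fetch the next 1,000 rows; the OFFSET value is passed dynamically.
                    page = jdbcTemplate.queryForList(
                            "SELECT id, name FROM big_table ORDER BY id " +
                            "OFFSET ? ROWS FETCH NEXT 1000 ROWS ONLY", offset);
                    for (Map<String, Object> record : page) {
                        Row row = sheet.createRow(rowIndex++);
                        row.createCell(0).setCellValue(String.valueOf(record.get("id")));
                        row.createCell(1).setCellValue(String.valueOf(record.get("name")));
                    }
                    offset += 1000;
                } while (!page.isEmpty());
                workbook.write(out);
            }
        };
    }
}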

Spring Boot JPA: select millions of records from db and process the data

I am working on a Spring Boot application where I need to fetch 400,000 rows from the db and pass them on as a list.
How should I approach this?
I am thinking of a way to split the records into groups of 1,000 and pass them on.
But in that case, how will I specify the offset in my SQL query? For example, once I have fetched the first 1,000 records, how do I fetch records 1,001 to 2,000?
Another way is to fetch the records as a stream; in that case I have to find a way to send the stream through a REST GET API from my application whenever someone calls my API.
Basically I need to build a REST GET API that passes this data on to whoever is using my API.
You can use OFFSET and FETCH NEXT (or LIMIT, depending on your database).
Example:
SELECT *
FROM t_users
ORDER BY employee_name
OFFSET 1000 ROWS FETCH NEXT 1000 ROWS ONLY;
In your case it will fetch 1,000 records each time, and you can pass the OFFSET value dynamically.
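If the query goes through Spring Data JPA, you usually don't write the OFFSET by hand at all: a Pageable generates it for you, and incrementing the page number walks through rows 1-1,000, then 1,001-2,000, and so on. A sketch with a hypothetical Employee entity and repository (all names are illustrative only):

import jakarta.persistence.Entity;   // javax.persistence on Spring Boot 2.x
import jakarta.persistence.Id;
import java.util.List;
import java.util.function.Consumer;
import org.springframework.data.domain.Page;
import org.springframework.data.domain.PageRequest;
import org.springframework.data.domain.Sort;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.stereotype.Service;

// Hypothetical entity and repository, purely for illustration.
@Entity
class Employee {
    @Id
    Long id;
    String name;
}

interface EmployeeRepository extends JpaRepository<Employee, Long> {}

@Service
class EmployeeBatchFetcher {

    private final EmployeeRepository employeeRepository;

    EmployeeBatchFetcher(EmployeeRepository employeeRepository) {
        this.employeeRepository = employeeRepository;
    }

    // Hands the 400,000 rows to the caller 1,000 at a time instead of as one big list.
    void fetchInBatches(Consumer<List<Employee>> batchConsumer) {
        int pageNumber = 0;
        Page<Employee> page;
        do {
            // Page 0 holds rows 1-1,000, page 1 holds rows 1,001-2,000, and so on;
            // Spring Data generates the LIMIT/OFFSET (or FETCH) clause for you.
            page = employeeRepository.findAll(PageRequest.of(pageNumber++, 1000, Sort.by("id")));
            batchConsumer.accept(page.getContent());
        } while (page.hasNext());
    }
}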

Windows Azure application: high volume of record insertions

We are meant to develop a web-based application on the Azure platform. I have a basic understanding of it but still have many questions.
The application we are to develop will have a lot of database interaction and will need to insert a large volume of records every day.
What is the best way to interact with the db here: via a queue (i.e. a web role writes to the queue, then a worker role reads the queue and saves the data in the db), or directly to SQL Server?
And should it be a multi-tenant application?
I've been playing around with Windows Azure SQL Database for a little while now, and this is a blog post I wrote about inserting large amounts of data:
http://alexandrebrisebois.wordpress.com/2013/02/18/ingesting-massive-amounts-of-relational-data-with-windows-azure-sql-database-70-million-recordsday/
My recipe is as follows: to insert/update data I used the following dataflow:
◾ Split your data into reasonably sized DataTables
◾ Store the data tables as blobs in Windows Azure Blob Storage Service
◾ Use SqlBulkCopy to insert the data into write (staging) tables (a Java/JDBC sketch of this staging step follows this list)
◾ Once you have accumulated a reasonable number of records in your write tables, merge the records into your read tables using reasonably sized batches. Depending on the complexity and the indexes/triggers present on the read tables, batches should be of about 100,000 to 500,000 records.
◾ Before merging each batch, be sure to remove duplicates by keeping only the most recent records.
◾ Once a batch has been merged, remove the data from the write table. Keeping this table reasonably small is quite important.
◾ Once your data has been merged, be sure to check up on your index fragmentation.
◾ Rinse & repeat
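The post's pipeline is .NET-based and SqlBulkCopy has no direct JDBC equivalent; for readers following along in Java, a rough sketch of the staged-write step using plain batched inserts might look like this (table and column names are assumptions):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;
import javax.sql.DataSource;

// Rough sketch: batched inserts into a "write" (staging) table, to be merged
// into the read tables later in reasonably sized batches.
public class StagingLoader {

    private final DataSource dataSource;

    public StagingLoader(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public void load(List<long[]> rows) throws SQLException {   // hypothetical (id, value) pairs
        try (Connection connection = dataSource.getConnection();
             PreparedStatement insert = connection.prepareStatement(
                     "INSERT INTO write_table (id, value) VALUES (?, ?)")) {
            connection.setAutoCommit(false);
            int pending = 0;
            for (long[] row : rows) {
                insert.setLong(1, row[0]);
                insert.setLong(2, row[1]);
                insert.addBatch();
                if (++pending % 5_000 == 0) {   // flush and commit in reasonably sized batches
                    insert.executeBatch();
                    connection.commit();
                }
            }
            insert.executeBatch();   // flush whatever is left
            connection.commit();
        }
    }
}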
