Improving batch insert/update/delete in Spring Boot

I am working on a Spring Boot project with an Oracle database. My task is importing data from a CSV file. The file contains 10,000 rows; each row corresponds to a row in a table and includes three attributes, among them sCode and isDeleted (isDeleted=1 -> delete, isDeleted=0 -> update or insert). I process the file in batches of 1,000 rows. For each batch, I do the following steps:
Use JPA findAllBySCodeInAndDepartmentId to find all rows that already exist in the table (departmentId is the current user's department; (sCode, departmentId) is unique)
Store the result of the first step in a Map (keyed by sCode)
Collect the rows in the batch that have isDeleted = 1 and exist in the Map, and delete them (I use JPA deleteAll)
Collect the rows that have isDeleted = 0 and exist in the Map, and update them (I use JPA saveAll)
Insert the remaining rows in the batch, which were neither updated nor deleted (not in the Map), into the table (I use JPA saveAll)
It takes around 5 minutes to finish importing the 10,000 rows.
How can I shrink the amount of time to around 1 minute?
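Most of that time goes into per-entity round trips: the findAllBySCodeInAndDepartmentId lookup plus saveAll/deleteAll issuing one statement per row. One common way to cut it is to skip the existence check entirely and push each batch to Oracle as two set-based statements: a MERGE for the isDeleted = 0 rows and a DELETE for the isDeleted = 1 rows, each sent as a single JDBC batch (e.g. via Spring's NamedParameterJdbcTemplate.batchUpdate with 1,000 parameter sets per round trip). A minimal sketch of the two statements, assuming a hypothetical table MY_TABLE with columns S_CODE, DEPARTMENT_ID and VAL:

-- Upsert one isDeleted = 0 row; bind 1,000 parameter sets and send as one JDBC batch
MERGE INTO my_table t
USING (SELECT :sCode AS s_code, :departmentId AS department_id, :val AS val FROM dual) s
ON (t.s_code = s.s_code AND t.department_id = s.department_id)
WHEN MATCHED THEN
  UPDATE SET t.val = s.val
WHEN NOT MATCHED THEN
  INSERT (s_code, department_id, val) VALUES (s.s_code, s.department_id, s.val);

-- Delete one isDeleted = 1 row; likewise batched
DELETE FROM my_table WHERE s_code = :sCode AND department_id = :departmentId;

This removes the SELECT step and JPA's per-entity dirty-checking overhead from the import path entirely.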

Related

How to update records in db table using spring batch

I have a requirement to update database records of a particular table in bulk. Is there any way to update records instead of inserting new ones through Spring Batch? Or any other way?

spring data saveAll very slow

I'm using Spring Data saveAll to save 3,500 records in an Oracle database, but it executes very slowly. Is there a way to do a bulk insert, or any other fast way?
noteRepository.saveAll(noteEntityList); // <- this one is slow for 3000 records
thanks in advance
By default, saveAll does not batch statements; batch processing needs to be enabled.
You need to set the properties below to enable batch processing:
spring.jpa.properties.hibernate.jdbc.batch_size=100
spring.jpa.properties.hibernate.order_inserts=true (if inserts)
OR
spring.jpa.properties.hibernate.order_updates=true (if updates)
The first property tells Hibernate to send statements to the JDBC driver in batches of 100; the second makes Hibernate group the statements by entity so that consecutive statements can actually be batched.
Check this thread for more details
How to do bulk (multi row) inserts with JpaRepository?
Also, if you want to do batch inserts, make sure that if your table has an auto-incremented column (say, as a PK), it is set up as a sequence (not identity), and that the allocationSize (Java) and INCREMENT BY value (DB sequence) are set to the batch size you are trying to persist. Don't set those values to one, or inserts will still be slow, because JPA will keep going back to the DB to get the next value from the sequence.
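For example, a sketch of the database side, assuming a hypothetical sequence NOTE_SEQ and a batch size of 100 (the matching JPA mapping would use a @SequenceGenerator with allocationSize = 100):

-- INCREMENT BY matches allocationSize, so Hibernate makes one sequence
-- round trip and then hands out 100 ids locally before asking again
CREATE SEQUENCE note_seq START WITH 1 INCREMENT BY 100;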

SSIS Incremental Load Performance

I have a table with ~800k records and with ~100 fields.
The table has an ID field which is a unique NVARCHAR(18) type.
The table also has a field called LastModifiedDate which holds the time of the latest change.
I’m trying to perform an incremental load based on the following:
Initial load of all data (happens once)
Loading, based on LastModifiedDate, only recent changed/added records (~30k)
Based on the key field (ID), performing INSERT/UPDATE on recent data to the existing data
(*) assuming records are not deleted
I’m trying to achieve this by doing the following steps:
Truncate the temp table (which holds the recent data)
Extracting the recent data and storing it in the temp table
Extracting the data from the temp table
Using Lookup with the following definitions:
a. Cache mode = Full Cache
b. Connection Type = OLE DB connection manager
c. No matching entries = Ignore failure
Selecting ID from the final table and linking it to the ID field from the temp table, giving the new field an output alias LKP_ID
Using Conditional Split and checking ISNULL(LKP_ID): true means INSERT and false means UPDATE
INSERT means that the data from the temp table will be inserted into the final table, and UPDATE means that an SQL UPDATE statement will be executed based on the temp table data
The final result is correct, BUT the run time is terrible: it takes ~30 minutes or so to complete
The way I would handle this is to use the LastModifiedDate in your source query to get the records from the source table that have changed since the last import.
Then I would import all of those records into an empty staging table on the destination database server.
Then I would execute a stored procedure to do the INSERT/UPDATE of the final destination table from the data in the staging table. A stored procedure on the destination server will perform MUCH faster than using Lookups and Conditional Splits in SSIS.
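A sketch of such a stored procedure, assuming hypothetical table names dbo.FinalTable and dbo.StagingTable and showing only the ID key plus one representative data column:

CREATE PROCEDURE dbo.UpsertFromStaging
AS
BEGIN
    SET NOCOUNT ON;
    -- One set-based MERGE replaces the row-by-row Lookup / Conditional Split path
    MERGE dbo.FinalTable AS tgt
    USING dbo.StagingTable AS src
        ON tgt.ID = src.ID
    WHEN MATCHED THEN
        UPDATE SET tgt.SomeColumn = src.SomeColumn,
                   tgt.LastModifiedDate = src.LastModifiedDate
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (ID, SomeColumn, LastModifiedDate)
        VALUES (src.ID, src.SomeColumn, src.LastModifiedDate);
END

With ~100 fields you would list every column in the UPDATE and INSERT clauses (or generate the statement), but the shape stays the same.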

Why is my Spring Boot entity ID generation failing?

I'm trying to auto-generate IDs for my entity, but the generation isn't working. Instead, it's starting from 1 when there already exists an entry with id "1" in my DB. Why is it not generating id "9" for my new entity?
Typically, when creating a table with GenerationType.IDENTITY on Postgres, Hibernate will set up the id column plus a database sequence to manage this id.
By convention the sequence name will be "tablename_id_seq". E.g., for the table ad_group_action there will be a corresponding sequence ad_group_action_id_seq. You can connect to the database to double-check the actual sequence name created.
The sequence just starts from 1 and increments each time a row is inserted by Hibernate.
But if there are pre-existing rows -- or if rows with existing IDs are inserted "manually" into the table -- those rows can conflict with the sequence.
One solution is to simply reset the sequence (from pgAdmin or another database client) to start at a higher number (say 100), using something like:
ALTER SEQUENCE ad_group_action_id_seq RESTART WITH 100;
Now Hibernate will not conflict with the existing rows (assuming their max id is < 100).
Alternatively, when inserting rows manually, omit the id column and let postgres automatically set them. This way the table and the sequence will always be in sync.
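If you'd rather not pick an arbitrary restart value, you can also re-sync the sequence to the table's current maximum id; a sketch, assuming the table and sequence names above:

-- next nextval() will return max(id) + 1 (or 2 if the table is empty)
SELECT setval('ad_group_action_id_seq', COALESCE((SELECT MAX(id) FROM ad_group_action), 1));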

HSQL simple one column update runs forever

I have a database with about 125,000 rows, each row with a primary key, a couple of int columns, and a couple of varchars.
I've added an int column and I'm trying to populate it before adding a NOT NULL constraint.
The db is persisted in a script file. I've read somewhere that all the affected rows get loaded into memory before the actual update, which means there won't be a disk write for every row. The whole db is about 20 MB, which would mean loading it and doing the update should be reasonably fast, right?
So, no joins, no nested queries, basic update.
I've tried multiple db managers, including the one bundled with the HSQL jar.
update tbl1 set col1 = 1
The query never finishes executing.
It is probably running out of memory.
The easier way to do this operation is to define the column with DEFAULT 1, which does not use much memory regardless of the size of the table. You can even add the NOT NULL constraint at the same time:
ALTER TABLE T ADD COLUMN C INT DEFAULT 1 NOT NULL
