How to increase the speed of bulk inserts with GORM? - go

I am using GORM v2's CreateInBatches and set SkipDefaultTransaction: true as suggested on the "Performance" page, but I find that 350k+ records inserted in batches of 1,000 take almost 3 minutes.
I tried removing the gorm.Model{} fields but didn't see much improvement.
What can I do to increase bulk-insert speed?
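For reference, this is roughly what the setup described above looks like; a minimal sketch in which the Record struct, connection string, and column layout are placeholders standing in for the real schema:

package main

import (
    "gorm.io/driver/postgres"
    "gorm.io/gorm"
)

// Record is a placeholder model standing in for the real table.
type Record struct {
    ID    uint
    Name  string
    Value int
}

func main() {
    dsn := "host=localhost user=postgres dbname=test sslmode=disable" // placeholder DSN

    // Skip the default per-write transaction, as suggested on GORM's "Performance" page.
    db, err := gorm.Open(postgres.Open(dsn), &gorm.Config{SkipDefaultTransaction: true})
    if err != nil {
        panic(err)
    }

    records := make([]Record, 0, 350000)
    // ... fill records from the real data source ...

    // Insert in batches of 1,000, as described in the question.
    if err := db.CreateInBatches(records, 1000).Error; err != nil {
        panic(err)
    }
}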
EDIT for anyone reading this with the same problem: I ended up saving my data to CSV files and importing them with pg_bulkload; I got 1 million rows imported in 1 second (on a server with lots of cores & RAM).
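If you go the same CSV route, the Go side can be as simple as streaming rows out with encoding/csv and then pointing pg_bulkload (or COPY) at the file. A rough sketch, with the file path and row contents made up for illustration:

package main

import (
    "encoding/csv"
    "log"
    "os"
)

// writeCSV dumps pre-stringified rows to a CSV file that pg_bulkload or COPY can ingest.
// Column order must match the target table / control file.
func writeCSV(path string, rows [][]string) error {
    f, err := os.Create(path)
    if err != nil {
        return err
    }
    defer f.Close()

    w := csv.NewWriter(f)
    // WriteAll writes every row and flushes the underlying writer.
    return w.WriteAll(rows)
}

func main() {
    rows := [][]string{{"1", "alice"}, {"2", "bob"}} // placeholder data
    if err := writeCSV("/tmp/records.csv", rows); err != nil {
        log.Fatal(err)
    }
}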

Related

JDBC template batch update with Snowflake database is very slow

I have a Spring Boot application that connects to the Snowflake database and uploads records (approx. 50 columns of different data types). I am using
JdbcTemplate.batchUpdate(insertSql, values, types)
to do the bulk insert. Currently, it is consuming around 100 seconds for 50,000 records. I want to improve the batch performance but am not able to find an optimal solution.
I referred to and tried the solution mentioned in this post, but it didn't help at all. Any suggestions will be highly appreciated.
I moved away from batch insert to the Snowflake COPY command using JDBC. It is lightning fast. With the COPY command, it barely takes 2-3 seconds to load 50,000 records from a CSV file with an XS (extra small) size data warehouse.
Moreover, in case of error, the messages are very clear and can be viewed in information_schema.load_history. Different file formats can be loaded, and there are a variety of options to customize the load process.
In my case, I first load the CSV file to the internal staging area (takes less than 1 second), run the COPY command (takes 1-2 seconds), verify the load status in the information_schema.load_history table (takes a few milliseconds), and proceed accordingly.
This article was also helpful for running the COPY command with JDBC.
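That stage-then-copy flow looks roughly like the sketch below when driven from Go instead of JDBC. This assumes the gosnowflake driver in a version recent enough to support the client-side PUT command; the DSN, file path, table name, and file-format options are placeholders, not details from the original post.

package main

import (
    "database/sql"
    "log"

    _ "github.com/snowflakedb/gosnowflake"
)

func main() {
    // Placeholder DSN; see the gosnowflake docs for the exact format.
    db, err := sql.Open("snowflake", "user:password@account/mydb/public?warehouse=XS_WH")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // 1. Stage the CSV file in the user's internal stage (PUT compresses it to .gz by default).
    if _, err := db.Exec(`PUT file:///tmp/records.csv @~/staged`); err != nil {
        log.Fatal(err)
    }

    // 2. Load the staged file into the target table with COPY INTO.
    if _, err := db.Exec(`COPY INTO my_table FROM @~/staged/records.csv.gz
        FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"')`); err != nil {
        log.Fatal(err)
    }

    // 3. Check the outcome in LOAD_HISTORY and proceed accordingly.
    var status string
    var rowCount int64
    err = db.QueryRow(`SELECT status, row_count FROM information_schema.load_history
        WHERE table_name = 'MY_TABLE' ORDER BY last_load_time DESC LIMIT 1`).Scan(&status, &rowCount)
    if err != nil {
        log.Fatal(err)
    }
    log.Printf("load status: %s, rows loaded: %d", status, rowCount)
}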

Initial ElasticSearch bulk index/insert/upload is really slow, how do I increase the speed?

I'm trying to upload about 7 million documents to ES 6.3 and I've been running into an issue where the bulk upload slows to a crawl at about 1 million docs (I have no documents prior to this in the index).
I have a 3-node ES setup with 16GB of RAM and an 8GB JVM heap per node, 1 index, and 5 shards.
I have turned off refresh ("-1"), set replicas to 0, and increased the index buffer size to 30%.
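For reference, the first two of those settings can be applied with a single call to the _settings endpoint (the index buffer size is a node-level setting in elasticsearch.yml, so it is not included here). A minimal Go sketch; the index name and host are placeholders:

package main

import (
    "bytes"
    "fmt"
    "log"
    "net/http"
)

func main() {
    // Disable refresh and replicas for the duration of the bulk load.
    // Remember to restore them afterwards (e.g. refresh_interval "1s", replicas 1).
    body := []byte(`{"index": {"refresh_interval": "-1", "number_of_replicas": 0}}`)

    req, err := http.NewRequest(http.MethodPut,
        "http://localhost:9200/myindex/_settings", bytes.NewReader(body))
    if err != nil {
        log.Fatal(err)
    }
    req.Header.Set("Content-Type", "application/json")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()
    fmt.Println("status:", resp.Status)
}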
On my upload side I have 22 threads, each sending 150 docs per bulk request. This is just a basic Ruby script using PostgreSQL, ActiveRecord, Net/HTTP (for the network call), and the ES Bulk API directly (no gem).
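For anyone unfamiliar with the raw Bulk API, each request body is newline-delimited JSON: an action line followed by the document source, with a trailing newline at the end. A minimal Go sketch of one such request (index name, type, host, and documents are made up; ES 6.x still expects a _type in the action line):

package main

import (
    "fmt"
    "log"
    "net/http"
    "strings"
)

func main() {
    // Two documents per request, just to show the NDJSON layout.
    // Every line, including the last one, must end with a newline.
    var b strings.Builder
    b.WriteString(`{"index":{"_index":"myindex","_type":"doc"}}` + "\n")
    b.WriteString(`{"title":"first doc","nested":{"field":1}}` + "\n")
    b.WriteString(`{"index":{"_index":"myindex","_type":"doc"}}` + "\n")
    b.WriteString(`{"title":"second doc","nested":{"field":2}}` + "\n")

    resp, err := http.Post("http://localhost:9200/_bulk",
        "application/x-ndjson", strings.NewReader(b.String()))
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()
    fmt.Println("status:", resp.Status)
}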
For all of my nodes and upload machines, the CPU, memory, and SSD disk I/O are low.
I've been able to get about 30k-40k inserts per minute, but that seems really slow to me since others have been able to do 2k-3k per second. My documents do have nested JSON, but they don't seem very large to me (is there a way to check the size of a single doc, or the average?).
I would like to be able to bulk upload these documents in less than 12-24 hours, and it seems like ES should handle that, but once I get to 1 million it slows to a crawl.
I'm pretty new to ES, so any help would be appreciated. I know this seems like a question that has already been asked, but I've tried just about everything that I could find and wonder why my upload speed is so much slower.
I've also checked the logs and only saw some errors about a mapping field that couldn't be changed, but nothing about running out of memory or anything like that.
ES 6.3 is great, but I'm also finding that the API has changed a bunch to 6 and settings that people were using are no longer supported.
I think I found a bottleneck in the active connections to my original database and increased that connection pool, which helped, but it still slows to a crawl at about 1 million records; it got to 2 million over about 8 hours of running.
I also tried an experiment on a big machine that is used to run the upload job, running 80 threads at 1,000 documents per upload. I did some calculations and found that my documents are about 7-10KB each, so each bulk index request is 7-10MB. This reached the 1M document count faster, but once you get there everything slows to a crawl. The machine stats are still really low. I do see output from the threads about every 5 minutes or so in the job logs, at about the same time I see the ES count change.
The ES machines still have low CPU and memory use. The I/O is around 3.85MB/s and the network bandwidth was at 55MB/s, dropping to about 20MB/s.
Any help would be appreciated. I'm not sure if I should try the ES gem and use its bulk insert, which maybe keeps a connection open, or try something totally different for the inserts.
ES 6.3 is great, but I'm also finding that the API has changed a bunch to 6 and settings that people were using are no longer supported.
Could you give an example of a breaking change between 6.0 and 6.3 that is a problem for you? We're really trying to avoid those and I can't recall anything off the top of my head.
I've started profiling that DB and noticed that once you use an offset of about 1 million, the queries start to take a long time.
Deep pagination is terrible performance-wise. There is the great blog post no-offset, which explains
why it's bad: to get results 1,000 to 1,010 you sort the first 1,010 records, throw away 1,000, and then send 10. The deeper the pagination, the more expensive it gets.
how to avoid it: make a unique order for your entries (for example by ID, or combine date and ID, but something that is absolute) and add a condition on where to start. For example, order by ID, fetch the first 10 entries, and keep the ID of the 10th entry for the next iteration. In that one, order by ID again, but with the condition that the ID must be greater than the last one from your previous run, fetch the next 10 entries, and again remember the last ID. Repeat until done, as sketched below.
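A minimal sketch of that keyset approach in Go with database/sql (the driver, table, and column names are made up for illustration; the original poster's fetch script is in Ruby):

package main

import (
    "database/sql"
    "log"

    _ "github.com/lib/pq"
)

func main() {
    db, err := sql.Open("postgres", "postgres://localhost/mydb?sslmode=disable")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    lastID := 0
    for {
        // Keyset pagination: no OFFSET, just "everything after the last ID we saw".
        rows, err := db.Query(
            `SELECT id, payload FROM documents WHERE id > $1 ORDER BY id LIMIT 1000`, lastID)
        if err != nil {
            log.Fatal(err)
        }

        count := 0
        for rows.Next() {
            var id int
            var payload string
            if err := rows.Scan(&id, &payload); err != nil {
                log.Fatal(err)
            }
            lastID = id // remember the last ID for the next batch
            count++
            // ... hand the row off to the bulk indexer here ...
        }
        rows.Close()
        if err := rows.Err(); err != nil {
            log.Fatal(err)
        }

        if count == 0 {
            break // no more rows
        }
    }
}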
Generally, with your setup you really shouldn't have a problem inserting more than 1 million records. I'd look into the part that is fetching the data first.

JDBC prepared statement with a large number of columns causing a performance bottleneck. How do ETL tools circumvent this issue?

I'm mainly a DB guy and have not used Java for bulk loading etc., as that was done by ETL tools or DB-internal tools.
But if I understand correctly, these tools are written in Java/C++ etc. and use JDBC/ODBC to implement the operation.
Recently, in one project, I tried to load bulk data using JDBC and observed the following.
We have one million records, 1.5 GB of data, and the table has 360 columns.
I am reading from table A and trying to insert into the target table in batches of 5k records. Source and target are both Oracle.
The project uses Spring JDBC. Here I have used plain JDBC standalone to test and debug the performance issue.
The logic, described in pseudo code:
PreparedStatement stmt = conn.prepareStatement(
    "insert into target values (?, ?, ... /* 360 placeholders */)");
ResultSet rs = srcStmt.executeQuery("select * from table_a");
int count = 0;
while (rs.next()) {
    stmt.setString(1, rs.getString("column1"));
    // ...
    // ... repeated for all 360 columns ...
    stmt.addBatch();
    if (++count % 5000 == 0) {
        stmt.executeBatch();
    }
}
stmt.executeBatch(); // flush the remaining rows
Main issue:
For every 5K records, the set statements take more than 1 minute in total.
So loading just 1.5 GB, or 1 million records, will take approximately 4 hours.
I am doing it in a single thread, but I feel the data volume is very low for that kind of runtime.
Is there any better way to implement this?
How do ETL tools, say Informatica etc., implement this internally?
The other issue is: sometimes executeBatch() for a table with a similar number of columns and more volume per record writes 5k records in one go. In other cases it writes only 100 records at a time even though executeBatch() is called after 5k rows, and the write also takes an eternity for 1 million records.
One more thing: if, instead of reading from the result set, I bind hard-coded values, as in
for (int i = 1; i <= 1000000; i++) {
    stmt.setString(1, "123456789"); // hard-coded value instead of rs.getString(...)
    // ...
    // ... repeated for all 360 columns ...
    stmt.addBatch();
    if (i % 5000 == 0) {
        stmt.executeBatch();
    }
}
then it takes around 4 seconds to bind every 5k rows and 2-3 seconds for each executeBatch(). So in 20 minutes I am able to load 1 million records, around 6-7 GB of data.

SSIS File Load WAY TOO SLOW in Large Destination Table

This is my first question. I've searched a lot of info from different sites, but none of it was conclusive.
Problem:
Daily I'm loading a flat file with an SSIS package executed in a scheduled job in SQL Server 2005, but it's taking TOO MUCH TIME (like 2 1/2 hours), and the file just has like 300 rows and is approximately a 50 MB file. This is driving me crazy, because it is affecting the performance of my server.
This is the Scenario:
-My package is just a Data Flow Task that has a Flat File Source and an OLE DB Destination, that's all!!!
-The Data Access Mode is set to FAST LOAD.
-The table has just 3 indexes, and they are nonclustered.
-My destination table has 366,964,096 records so far and 32 columns
-I haven't set FastParse on any of the output columns yet (I want to try something else first).
So I've just started to make some tests:
-Rebuilt/reorganized the indexes in the destination table (they were way too fragmented), but this didn't help me much
-Created another table with the same structure but without all the indexes and executed the job with the SSIS package loading to this new table, and IT JUST TOOK LIKE 1 MINUTE!!!
So I'm confused, is there something I'm missing???
-Is the SSIS package writing the whole large table to a buffer and then writing it to disk? Otherwise, why the BIG difference in time?
-Are the indexes affecting the insertion time?
-Should I load the file into this new table as a temporary table and then do a BULK INSERT into the destination table with the records ordered? 'Cause I thought that the Data Flow Task was much faster than BULK INSERT, but at this point I don't know anymore.
Thanks in advance.
One thing I might look at is whether the large table has any triggers which are causing it to be slower on insert. Also, if the clustered index is on a field that will require a good bit of rearranging of the data during the load, that could cause an issue as well.
In SSIS packages, using a merge join (which requires sorting) can cause slowness, but from your description it doesn't appear you did that. I mention it only in case you were doing that and didn't mention it.
If it works fine without the indexes, perhaps you should look into those. What are the data types? How many are there? Maybe you could post their definitions?
You could also take a look at the fill factor of your indexes - especially the clustered index. Having a high fill factor could cause excessive IO on your inserts.
Well, I rebuilt the indexes with another fill factor (80%) like Sam told me, and the time dropped significantly. It took 30 minutes instead of almost 3 hours!!!
I will keep testing to fine-tune the DB. Also, I didn't even have to create a clustered index; I guess with a clustered index the time will drop a lot more.
Thanks to all, I hope this helps someone in the same situation.

Oracle SQL*loader running in direct mode is much slower than conventional path load

In the past few days I've been playing around with Oracle's SQL*Loader in an attempt to bulk load data into Oracle. After trying out different combinations of options, I was surprised to find that the conventional path load runs much quicker than the direct path load.
A few facts about the problem:
Number of records to load is 60K.
Number of records in target table, before load, is 700 million.
Oracle version is 11g r2.
The data file contains date, character (ascii, no conversion required), integer, float. No blob/clob.
Table is partitioned by hash. Hash function is same as PK.
Parallelism for the table is set to 4 while the server has 16 CPUs.
The index is locally partitioned. Parallelism for the index (from ALL_INDEXES) is 1.
There's only 1 PK and 1 index on the target table. The PK constraint is built using the index.
A check on the index partitions revealed that record distribution among partitions is pretty even.
Data file is delimited.
APPEND option is used.
Select and delete of the loaded data through SQL is pretty fast, almost instant response.
With conventional path, loading completes in around 6 seconds.
With direct path load, loading takes around 20 minutes. The worst run takes 1.5 hours to complete, yet the server was not busy at all.
If skip_index_maintenance is enabled, direct path load completes in 2-3 seconds.
I've tried quite a number of options, but none of them gives a noticeable improvement: UNRECOVERABLE, SORTED INDEXES, MULTITHREADING (I am running SQL*Loader on a multi-CPU server). None of them improve the situation.
Here's the wait event I kept seeing during the time SQL*Loader runs in direct mode:
Event: db file sequential read
P1/2/3: file#, block#, blocks (check from dba_extents that it is an index block)
Wait class: User I/O
Does anyone have any idea what has gone wrong with the direct path load? Or is there anything I can check further to really dig into the root cause of the problem? Thanks in advance.
I guess you are falling foul of this:
"When loading a relatively small number of rows into a large indexed table
During a direct path load, the existing index is copied when it is merged with the new index keys. If the existing index is very large and the number of new keys is very small, then the index copy time can offset the time saved by a direct path load."
from When to Use a Conventional Path Load in: http://download.oracle.com/docs/cd/B14117_01/server.101/b10825/ldr_modes.htm
