Optimizing massive insert performance...?

Given: SQL Server 2008 R2. Quite speedy data discs. Log discs lagging.
Required: LOTS LOTS LOTS of inserts. Like 10,000 to 30,000 rows per second into a simple table with two indices. Inserts have an intrinsic order and will not repeat; as such, the order of inserts need not be maintained in the short term (i.e. multiple parallel inserts are ok).
So far: accumulating data into a queue. Regularly (async threadpool) emptying up to 1024 entries into a work item that gets queued. Threadpool (custom class) has 32 possible threads. Opens 32 connections.
Problem: performance is off by a factor of 300.... only about 100 to 150 rows are inserted per second. Log wait time is up to 40% - 45% of processing time (ms per second) in sql server. Server cpu load is low (4% to 5% or so).
Not usable: bulk insert. The data must be written to disc as close to real time as possible. This is pretty much an archival process for data running through the system, but there are queries which need access to the data regularly. I could try dumping the rows to disc and using bulk upload 1-2 times per second.... will give this a try.
Does anyone have a smart idea? My next step is moving the log to a fast disc set (128 GB modern SSD) to see what happens then. The significant performance boost will probably change things quite a bit. But even then.... the question is whether / what is feasible.
So, please fire away with the smart ideas.

Ok, answering myself. Going to give SqlBulkCopy a try, batching up to 65536 entries and flushing them out every second in an async fashion. Will report on the gains.

I'm going through the exact same issue here, so I'll go through the steps I'm taking to improve my performance.
Separate the log and the data file onto different spindle sets.
Use the simple recovery model.
You didn't mention any indexing requirements other than the fact that the order of inserts isn't important; in this case, clustered indexes on anything other than an identity column shouldn't be used.
Start scaling your concurrency again from 1 and stop when performance flattens out; anything over this will likely hurt performance.
Rather than dropping to disk to bcp, and as you are using SQL Server 2008, consider inserting multiple rows at a time; this statement inserts three rows in a single SQL call:
INSERT INTO table VALUES ( 1,2,3 ), ( 4,5,6 ), ( 7,8,9 )
I was topping out at ~500 distinct inserts per second from a single thread. After ruling out the network and CPU (0 on both client and server), I assumed that disk io on the server was to blame, however inserting in batches of three got me 1500 inserts per second which rules out disk io.
It's clear that the MS client library has an upper limit baked into it (and a dive into reflector shows some hairy async completion code).
Batching in this way, waiting for x events to be received before calling insert, has me now inserting at ~2700 inserts per second from a single thread which appears to be the upper limit for my configuration.
Note: if you don't have a constant stream of events arriving at all times, you might consider adding a timer that flushes your inserts after a certain period (so that you see the last event of the day!)
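The batching and timer-flush idea above can be sketched roughly as follows. This is a minimal illustration, not the answerer's actual code: `build_multi_row_insert`, `InsertBatcher`, the three-column row shape and the `flush` callback are all hypothetical names chosen for the example, and the `?` placeholder style depends on the database driver in use.

```python
import time
from typing import Callable, List, Sequence, Tuple

Row = Tuple[int, int, int]  # assumed row shape, matching the three-value example above


def build_multi_row_insert(table: str, rows: Sequence[Row]) -> Tuple[str, List[int]]:
    """Build one parameterized multi-row INSERT statement plus its flat parameter list."""
    if not rows:
        raise ValueError("rows must not be empty")
    placeholders = ", ".join("(?, ?, ?)" for _ in rows)
    params = [value for row in rows for value in row]
    return f"INSERT INTO {table} VALUES {placeholders}", params


class InsertBatcher:
    """Collects rows and hands them to `flush` when the batch is full or too old."""

    def __init__(
        self,
        flush: Callable[[List[Row]], None],
        max_rows: int = 100,
        max_age_seconds: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._flush = flush
        self._max_rows = max_rows
        self._max_age = max_age_seconds
        self._clock = clock
        self._rows: List[Row] = []
        self._oldest = 0.0

    def add(self, row: Row) -> None:
        """Buffer one row; flush immediately once the batch is full."""
        if not self._rows:
            self._oldest = self._clock()  # remember when this batch started
        self._rows.append(row)
        if len(self._rows) >= self._max_rows:
            self.flush()

    def tick(self) -> None:
        """Call periodically (for example from a timer) to flush a stale partial batch."""
        if self._rows and self._clock() - self._oldest >= self._max_age:
            self.flush()

    def flush(self) -> None:
        """Hand the buffered rows to the callback and start a new batch."""
        if self._rows:
            rows, self._rows = self._rows, []
            self._flush(rows)
```

In a real writer the `flush` callback would call `build_multi_row_insert` and execute the result on an open connection. On SQL Server, keep batches modest: a single VALUES list accepts at most 1000 rows, and a parameterized request is limited to 2100 parameters.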

Some suggestions for increasing insert performance:
Increase ADO.NET BatchSize
Choose the target table's clustered index wisely, so that inserts won't lead to clustered index node splits (e.g. autoinc column)
Insert into a temporary heap table first, then issue one big "insert-by-select" statement to push all that staging table data into the actual target table
Apply SqlBulkCopy
Choose "Bulk Logged" recovery model instead of "Full" recovery model
Place a table lock before inserting (if your business scenario allows for it)
Taken from Tips For Lightning-Fast Insert Performance On SqlServer
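The staging-table suggestion in the list above might look roughly like this. The sketch uses Python's built-in sqlite3 module purely as a runnable stand-in for SQL Server; the table names `staging` and `target`, the two-column layout and the function `load_via_staging` are assumptions made for illustration.

```python
import sqlite3
from typing import Iterable, Tuple


def load_via_staging(conn: sqlite3.Connection, rows: Iterable[Tuple[int, str]]) -> int:
    """Insert rows into an unindexed staging table, then move them with one INSERT ... SELECT."""
    cur = conn.cursor()
    # The staging table has no keys or indexes, so the row-by-row inserts stay cheap.
    cur.execute("CREATE TEMP TABLE IF NOT EXISTS staging (id INTEGER, payload TEXT)")
    cur.executemany("INSERT INTO staging (id, payload) VALUES (?, ?)", rows)
    moved = cur.execute("SELECT COUNT(*) FROM staging").fetchone()[0]
    # One set-based statement pushes everything into the indexed target table.
    cur.execute("INSERT INTO target (id, payload) SELECT id, payload FROM staging")
    cur.execute("DELETE FROM staging")
    conn.commit()
    return moved
```

Whether this pays off on a real server depends on the target table's indexes and on how often the staging table is drained, so it is worth measuring before adopting it.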

Related

Doctrine caching

I am working on a big DB-driven application that sometimes needs a huge data import. Data is imported from Excel spreadsheets, and at the start of the process (for about 500 rows) the data is processed relatively quickly, but later it slows down significantly. The import generates 6 linked entities per row of Excel, which are flushed after processing every line. My guess is that all those entities are getting cached by Doctrine and just build up. My idea is to clear out all that cache every 200 rows, but I could not find how to clear it from within the code (console is not an option at this stage). Any assistance or links would be much appreciated.

I suppose that the cause may lie not in Doctrine but in the database transaction log buffer size. The documentation says
A large log buffer enables large transactions to run without a need to write the log to disk before the transactions commit. Thus, if you have big transactions, making the log buffer larger saves disk I/O.
Most likely you insert your data in one big transaction. When the buffer is full, it is written to disk which is normally slower.
There are several possible solutions.
Increase buffer size so that the transaction fits into the buffer.
Split the transaction into several parts that fit into the buffer.
In the second case keep in mind that each transaction needs time as well, so wrapping each insert in a separate transaction will reduce performance as well.
I recommend to wrap about 500 rows in a transaction because this seems to be a size that fits in the buffer.
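The "split the transaction into several parts" option can be sketched as below. The original context is Doctrine (PHP); this is only a language-neutral illustration in Python with sqlite3 as a stand-in database, and the table `import_rows` and the helper names are made up for the example. The chunk size of 500 is the figure suggested above, not a measured optimum.

```python
import sqlite3
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def chunked(iterable: Iterable, size: int) -> Iterator[List]:
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def insert_in_transactions(
    conn: sqlite3.Connection,
    rows: Iterable[Tuple[int]],
    rows_per_transaction: int = 500,
) -> int:
    """Insert rows, committing once per chunk instead of once per row or once overall."""
    commits = 0
    for chunk in chunked(rows, rows_per_transaction):
        with conn:  # commits on success, rolls back if the chunk raises
            conn.executemany("INSERT INTO import_rows (n) VALUES (?)", chunk)
        commits += 1
    return commits
```

If a chunk fails, only that chunk is rolled back, so the import needs a way to resume from the last committed chunk.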

Postgresql Performance tip for scattered data

I am trying to improve the performance of my database, whose simplified set-up is the following:
EDIT
One table with 3 columns (id_device, timestamp, data) with a composite btree index (id_device, timestamp)
1k devices sending data every minute
The inserts are quite fast, since PostgreSQL merely writes the rows in the order they are received. However, when trying to get a lot of data with consecutive timestamps for a given device, the query is not so fast. The way I understand it is that, due to the way the data is collected, there is never more than one row of a given device on each page of the table. Therefore, if I want to get 10k rows with consecutive timestamps for a given device, PostgreSQL has to fetch 10k pages from disk. Besides, since this operation can be done on any of the 1k devices, those pages are not going to be kept in RAM.
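The page-count reasoning above can be checked with a small model. It assumes rows are stored strictly in arrival order, that every device reports exactly once per minute, and a hypothetical 100 rows per page (the real figure depends on row width); the function names are invented for the illustration.

```python
def pages_touched_heap(device: int, minutes: int, n_devices: int, rows_per_page: int) -> int:
    """Pages holding one device's rows when rows are stored in arrival order."""
    # In minute m, device d lands at position m * n_devices + d.
    return len({(m * n_devices + device) // rows_per_page for m in range(minutes)})


def pages_touched_clustered(device: int, minutes: int, rows_per_page: int) -> int:
    """Pages holding one device's rows when each device's rows are stored contiguously."""
    first = device * minutes
    return len({(first + m) // rows_per_page for m in range(minutes)})


# 1k devices, 10k consecutive rows for one device, assumed 100 rows per page
heap_pages = pages_touched_heap(42, 10_000, 1_000, 100)
clustered_pages = pages_touched_clustered(42, 10_000, 100)
```

Under these assumptions the arrival-order layout touches 10,000 pages and the contiguous layout touches 100, which is consistent with the improvement seen after CLUSTER.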
I have tried to CLUSTER the table, and it does indeed solve the performance issue, but this operation is incredibly long (~1 day) and it locks the entire table, so I discarded this solution.
I have read about partitioning, but that would mean a lot of scripting if I need to add a new table every time a new device is connected, and it seems a bit bug-prone to me.
I am rather confident that this set-up is not particularly original, so is there any advice I could use?
Thanks for reading,
Guillaume

I'm guessing your index also has low selectivity, because you're indexing device_id first (which has only 1000 distinct values) rather than timestamp first.
Depends on what you do with the data you fetch, but maybe the solution could be batching the operation, such as fetching the data for a predetermined period and processing data for all 1000 devices in one go.

Running time / Memory issue while copying from excel to SQL tables using Talend

I am copying data from excel sheet to the SQL tables.
At this time it is around 2000 rows distributed across 18 tables.
The problem with my job is that it takes too long: around 2.5 minutes.
The other issue I am facing is with memory. I tried to copy around 250,000 rows and I couldn't run the job with the basic settings. I had to increase the Xms and Xmx allocation.
How do I solve these issues?

You should start your job with a tMSSQLConnection (I think that's the DBMS you're using) and then finish it with a tMSSQLCommit component and see if that helps at all as it could be that Talend is opening a large amount of connections to the database rather than pooling them.
Increasing the commit size will also help speed up bulk loads but obviously if anything fails to commit it will lose the entire commit.
As well as this, as long as you have no race conditions and don't care in what order tables are inserted to or updated then you could parallelise the whole job with either a tParallelize component or by enabling multi thread executions in the Extra tab under the Job window.
Sometimes the memory usage in the job can be improved by splitting the process down into separate jobs and linking them as child jobs in one large wrapper parent job with tRunJob components. This will also make the job more manageable.
Finally, there are a couple of options in the advanced settings of each database output component that allow you to increase the batch size (although this will increase the memory usage) and also to enable parallel connections, which can greatly improve performance by utilising more database server cores.
Your memory issues are unlikely to be resolved short of re-engineering your job to deal only with smaller chunks of data at a time: commit each part and then grab the next lot.
This could be done by using a tFilterRow component and only selecting the first x records (by some filter condition, if the data set has none you could always add one by first preprocessing everything to give every row a Numeric.Sequence), processing it and putting it in your table and then picking the next x records and so on.

Use "Single Insert Query" in the MSSQL output. Make sure you're using the correct batch size.
The batch size should be LESS THAN OR EQUAL to: 2000 / column count. This could speed up the load.
However, I'm not sure about the memory errors. I think Talend tries to read the Excel inputs into memory as a whole, so the bigger the Excel files, the more memory you need.
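The batch-size rule above is simple integer arithmetic, shown here as a tiny helper. The function name and the `parameter_budget` argument are hypothetical; the default of 2000 is the figure quoted in the answer, presumably chosen to stay under SQL Server's limit of 2100 parameters per request.

```python
def max_batch_size(column_count: int, parameter_budget: int = 2000) -> int:
    """Largest row count per insert such that rows * columns stays within the budget."""
    if column_count < 1:
        raise ValueError("column_count must be at least 1")
    # Integer division rounds down; never return less than one row per batch.
    return max(1, parameter_budget // column_count)
```

For example, a 10-column table gives a batch size of 200, and a 3-column table gives 666.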

Storing arrays of integers in database

I am creating a database that will store 100.000 (and probably more in the future) users. While this obviously happens in a table with 1 row per user, every user can (and will) store hundreds of items. In programming language this would mean the user has 2 arrays (or one 2-dimensional array) of integers: a column for the itemid's and a column for the amounts.
My instincts tell me to create a table to hold all these items, with rows like (userid, itemid, amount). However this would result in a huge table. 200.000 users with 250 items each... that's 50 million entries in one table. This, plus the fact that the table will undergo continuous and rapid change, frightens me. (How rapid? I estimate up to 100 modifications per second.)
Typically there will be anywhere between 100 and 2000 users, all adding and removing items, and modifying amounts. These actions can and will happen in programming code. It would go as follows:
User starts session, program loads all the users items from the database
User modifies the item list
Every few minutes, the changes are saved into the database
When the user ends the session, it is also saved into the database
It is worth noting that there is a maximum to the number of items a user can store.
Are there any alternatives to using a separate table? Perhaps save the values in a formatted text string? Or is this one of the instances where using a MySQL database is actually a Bad Idea™?
Thank you for your time and insights.

"My instincts tell me to create a table to hold all these items"
Your instincts are right.
1) avoid premature optimisation
2) don't break the rules of normalization unless you've got a very good and real reason to do so
3) why do you suspect that the multi-table approach will be faster?
"that's 50 million entries in one table"
So what? Even if you only have an index on userid, a single big table will not be noticeably slower than one table per user (in practice, with 200,000 users, it will be much, much faster, since the DBMS cannot comfortably keep an open file handle for each table!).
"I estimate up to 100 modifications per second"
Should be possible using MySQL and fairly basic hardware, but if it were me, and I wanted a bit of headroom, I'd go with a pair of mirrored SATA disks, tables on one mirror, indexes on the other.
The only issue I'd be concerned about (which applies regardless of which of the 2 models you choose) is supporting 2000 concurrent connections. Do the connections have to be concurrent? Or can each user download a working set (optionally using an optimistic locking strategy) and close off the connection, then push back the changes on a new connection? If not, then you'll probably want a good whack of memory and CPU.
But leaving aside whether to use one big table or lots of little ones, if this is the only use for the data, and access is not concurrent to particular data items, then why bother with a relational database at all? NoSQL or a shared filesystem might work just as well.

Putting data into one field as an array is almost always a mistake. It makes querying the data much harder and much more time-consuming, as well as much less likely to use indexes. It is OK if the values are just text and you would never need to find one or more elements of the array, but in my experience this situation is rarely encountered. Modern databases can handle 50 million records without even breaking a sweat. That's a small table in database terms.

It should be OK to do it as you described using two tables. The database should be able to handle millions of records.
The important points to look at:
1- Optimize your queries as much as possible.
2- Create the appropriate index(es) to speed up your queries.
3- Use InnoDB if you have concurrent read/update operations as it supports row-level locking as opposed to MyISAM.
4- Provide good hardware to support the database server.
5- Run the database server on a dedicated server if affordable.
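The (userid, itemid, amount) layout that the answers recommend might be sketched like this. The question is about MySQL; Python's built-in sqlite3 is used here only so the example runs as-is, and the table name `user_items` and the helper functions are assumptions made for illustration.

```python
import sqlite3
from typing import Dict


def create_schema(conn: sqlite3.Connection) -> None:
    """One row per (user, item); the composite key doubles as the lookup index on userid."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS user_items (
               userid INTEGER NOT NULL,
               itemid INTEGER NOT NULL,
               amount INTEGER NOT NULL,
               PRIMARY KEY (userid, itemid)
           )"""
    )


def load_items(conn: sqlite3.Connection, userid: int) -> Dict[int, int]:
    """Session start: fetch all of one user's items as {itemid: amount}."""
    cur = conn.execute(
        "SELECT itemid, amount FROM user_items WHERE userid = ?", (userid,)
    )
    return dict(cur)


def save_items(conn: sqlite3.Connection, userid: int, items: Dict[int, int]) -> None:
    """Periodic save: replace the user's rows with the in-memory state in one transaction."""
    with conn:
        conn.execute("DELETE FROM user_items WHERE userid = ?", (userid,))
        conn.executemany(
            "INSERT INTO user_items (userid, itemid, amount) VALUES (?, ?, ?)",
            [(userid, itemid, amount) for itemid, amount in items.items()],
        )
```

Replacing all of a user's rows on each save keeps the sketch short; writing only the changed rows would reduce write volume at the cost of tracking changes in the application.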

SQL Server - Merging large tables without locking the data

I have a very large set of data (~3 million records) which needs to be merged with updates and new records on a daily schedule. I have a stored procedure that actually breaks up the record set into 1000 record chunks and uses the MERGE command with temp tables in an attempt to avoid locking the live table while the data is updating. The problem is that it doesn't exactly help. The table still "locks up" and our website that uses the data receives timeouts when attempting to access the data. I even tried splitting it up into 100 record chunks and even tried a WAITFOR DELAY '000:00:5' to see if it would help to pause between merging the chunks. It's still rather sluggish.
I'm looking for any suggestions, best practices, or examples on how to merge large sets of data without locking the tables.
Thanks

Change your front end to use NOLOCK or READ UNCOMMITTED when doing the selects.
You can't NOLOCK a MERGE, INSERT, or UPDATE, as the records must be locked in order to perform the update. However, you can NOLOCK the SELECTs.
Note that you should use this with caution. If dirty reads are okay, then go ahead. However, if the reads require the updated data then you need to go down a different path and figure out exactly why merging 3M records is causing an issue.
I'd be willing to bet that most of the time is spent reading data from the disk during the merge command and/or working around low memory situations. You might be better off simply stuffing more ram into your database server.
The ideal is to have enough RAM to pull the whole database into memory as needed. For example, if you have a 4GB database, then make sure you have 8GB of RAM, in an x64 server of course.

I'm afraid that I've quite the opposite experience. We were performing updates and insertions where the source table had only a fraction of the number of rows as the target table, which was in the millions.
When we combined the source table records across the entire operational window and then performed the MERGE just once, we saw a 500% increase in performance. My explanation for this is that you are paying for the up front analysis of the MERGE command just once instead of over and over again in a tight loop.
Furthermore, I am certain that merging 1.6 million rows (source) into 7 million rows (target), as opposed to 400 rows into 7 million rows over 4000 distinct operations (in our case) leverages the capabilities of the SQL server engine much better. Again, a fair amount of the work is in the analysis of the two data sets and this is done only once.
Another question I have to ask as well is whether you are aware that the MERGE command performs much better with indexes on both the source and target tables. I would like to refer you to the following link:
http://msdn.microsoft.com/en-us/library/cc879317(v=SQL.100).aspx

From personal experience, the main problem with MERGE is that since it does page lock it precludes any concurrency in your INSERTs directed to a table. So if you go down this road it is fundamental that you batch all updates that will hit a table in a single writer.
For example: we had a table on which INSERT took a crazy 0.2 seconds per entry, most of this time seemingly being wasted on transaction latching. We switched over to using MERGE, and some quick tests showed that it allowed us to insert 256 entries in 0.4 seconds or even 512 in 0.5 seconds. We tested this with load generators and all seemed to be fine, until it hit production and everything blocked to hell on the page locks, resulting in a much lower total throughput than with the individual INSERTs.
The solution was to not only batch the entries from a single producer in a MERGE operation, but also to batch the batch from producers going to individual DB in a single MERGE operation through an additional level of queue (previously also a single connection per DB, but using MARS to interleave all the producers call to the stored procedure doing the actual MERGE transaction), this way we were then able to handle many thousands of INSERTs per second without problem.
Having the NOLOCK hints on all of your front-end reads is an absolute must, always.
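The "additional level of queue" described above, where batches from several producers are combined so that a single writer issues one MERGE, can be sketched as follows. This is only a guess at the shape of such a component; `drain_into_one_batch` and `max_rows` are hypothetical names, and the actual MERGE call is left out.

```python
import queue
from typing import List


def drain_into_one_batch(pending: "queue.Queue[list]", max_rows: int = 1000) -> List:
    """Combine queued producer batches into one batch for the single writer.

    Stops once at least `max_rows` rows are collected or the queue is empty,
    so the result may overshoot `max_rows` by part of the last batch taken.
    """
    combined: List = []
    while len(combined) < max_rows:
        try:
            batch = pending.get_nowait()
        except queue.Empty:
            break
        combined.extend(batch)
    return combined
```

The single writer would call this in a loop and pass each non-empty result to one MERGE, so that concurrent producers never compete for page locks on the target table.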
