I'm using the LOAD CSV command to import nodes and relationships into Neo4j. For better performance I'm also using USING PERIODIC COMMIT, because the files I import are large (roughly 50 million records each).
I want to know whether it is better to use USING PERIODIC COMMIT 1000, USING PERIODIC COMMIT 5000, or an even larger number of records per batch.
Is the fastest way to use a big number, or the opposite?
PS: The machine has a lot of free RAM.
Thanks
Bigger numbers will make the process faster. The reasoning: a bigger batch size results in fewer commits, and consequently fewer disk I/O operations.
Example: with 1,000 records, USING PERIODIC COMMIT 50 results in 20 disk write operations (1,000 records / 50). Changing to USING PERIODIC COMMIT 100 results in 10 disk write operations (1,000 records / 100).
I have been working on something similar; my dataset contains some 700k data points.
I have seen that USING PERIODIC COMMIT 100000 takes more time to insert the data points into the database than USING PERIODIC COMMIT 50000.
So, in my case smaller numbers make the process faster, and with larger numbers it throws an exception about not having enough memory to perform the current task.
I'm running a job using Spring Batch 4.2.0 with Postgres (11.2) as the backend, all wrapped in a Spring Boot app. I have 5 steps, each of which uses a simple partitioning strategy to divide the data by id ranges and read it into partitions that are processed by separate threads. The table has about 18M rows; each step reads all 18M rows, changes a few fields, and writes them back. The issue I'm facing is that the queries that pull data into each thread scan by id range, like:
select field_1, field_2, field_66 from table where id >= 1 and id < 10000.
In this case each thread processes 10,000 rows at a time. When there's no traffic, the query takes less than a second to read all 10,000 rows. But when the job runs, there are about 70 threads reading that data in, and the query gets progressively slower, up to almost a minute and a half. Any ideas where to start troubleshooting this?
I do see autovacuum running in the background for almost the whole duration of the job. The app definitely has enough memory to hold all that data (about 6GB max heap). Postgres has shared_buffers of 2GB and max_wal_size of 2GB, but I'm not sure if that in itself is sufficient. Another thing I see is loads of COMMIT queries hanging around when checking pg_stat_activity, usually as many as the number of partitions. So instead of 70 connections being used by 70 partitions, there are 140 connections used up, with 70 of them running COMMIT. As time progresses these COMMITs get progressively slower too.
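For reference, a rough way to count those sessions from pg_stat_activity (the standard state and query columns are used; the grouping below is just illustrative):

-- Count backend sessions by state and by the start of the statement they run,
-- to see how many connections are sitting on COMMIT while the job is running.
SELECT state,
       left(query, 30) AS query_start,
       count(*)        AS sessions
FROM pg_stat_activity
WHERE datname = current_database()  -- assumes the check runs in the job's database
GROUP BY state, left(query, 30)
ORDER BY sessions DESC;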
You are probably hitting https://github.com/spring-projects/spring-batch/issues/3634.
This issue has been fixed and will be part of version 4.2.3 planned to be released this week.
I want to ask about the number of Hive partitions and how it will impact performance.
Let me illustrate this with a real example:
I have an external table that is expected to receive around 500M rows per day from multiple sources, and it will have 5 partition columns.
For one day, that resulted in 250 partitions; with a 1-year retention that grows to around 75K partitions, which I suppose is a huge number. When I checked, Hive can go up to 10K, but after that performance is going to be bad (and someone told me that partitions should not exceed 1K per table).
Regarding the queries that will select from this table:
50% of them will use the exact order of the partition columns.
25% will use only the first 1-3 partition columns and not the other 2.
25% will use only the 1st partition column.
So do you think this may work well even with 1-month retention? Or would only the start date be enough as a partition column, assuming a normal distribution across the other 4 columns (say 500M rows / 250 partitions, which gives about 2M rows per partition)?
I would go with 3 partition columns, since that will a) exactly match ~50% of your query profiles, and b) substantially reduce (prune) the number of scanned partitions for the other 50%. At the same time, you won't be pressured to increase your Hive MetaStore (HMS) heap memory and beef up the HMS backend database to work efficiently with 250 x 364 = 91,000 partitions.
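As a rough sketch of what that could look like (table, column, and location names below are made up for illustration; the two dropped partition columns become ordinary columns):

-- Hypothetical layout: keep the 3 most-used partition columns,
-- demote the other 2 former partition columns to regular columns.
CREATE EXTERNAL TABLE events (
    event_id   BIGINT,
    payload    STRING,
    source_sub STRING,   -- was partition column 4
    channel    STRING    -- was partition column 5
)
PARTITIONED BY (
    event_date STRING,   -- used by all of your query profiles
    source     STRING,
    region     STRING
)
STORED AS ORC
LOCATION '/data/events';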
Since the 10K limit was introduced, significant efforts have been made to improve partition-related operations in HMS. See, for example, JIRA HIVE-13884, which provides the motivation for keeping that number low and describes how high partition counts are being addressed:
The PartitionPruner requests either all partitions or partitions based on filter expression. In either scenarios, if the number of partitions accessed is large there can be significant memory pressure at the HMS server end.
... PartitionPruner [can] first fetch the partition names (instead of partition specs) and throw an exception if number of partitions exceeds the configured value. Otherwise, fetch the partition specs.
Note that partition specs (mentioned above), together with the statistics gathered per partition (always recommended for efficient querying), constitute the bulk of the data that HMS has to store and cache for good performance.
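To illustrate the pruning side with the same hypothetical events table: queries that filter on the leading partition columns only read the matching directories, while the demoted columns simply become row-level predicates:

-- Matches the "exact order of partitions" profile: prunes to a single partition.
SELECT count(*)
FROM events
WHERE event_date = '2020-06-01'
  AND source     = 'app'
  AND region     = 'eu';

-- Matches the "only 1-3 partitions" profile: still prunes on the first two
-- partition columns; channel is now just an ordinary filter.
SELECT count(*)
FROM events
WHERE event_date = '2020-06-01'
  AND source     = 'app'
  AND channel    = 'push';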
I have a multi-threaded application that uses 10 threads, each of which on average results in an insert of 40K rows into a table. These inserts occur 24/7 with no pauses.
I ran some performance tests and noted the following:
With CACHE 20 and single threaded, each insert took about 3.5 seconds on average
With CACHE 20 and 10 threads, each insert took 100 seconds on average
After removing the primary key and the sequence, each insert, regardless of the number of threads used, took 3.1 seconds.
With CACHE 400000 and 10 threads, each insert took 5.6 seconds on average. (Incidentally, the average originally was 8, then dropped down to 5.6 over time)
I'm performing an INSERT like this:
INSERT INTO foo (id, bar, baz)
SELECT foo_id_seq.nextval, bar, baz
FROM (
    SELECT bar, baz
    FROM ...
)
Given my constraints of 10 threads processing 40K records each on average, how can I calculate the optimal cache size of a sequence?
I'm tempted to set the cache size to (10 threads * 40K records) = 400,000, but I'm worried about trade-offs that I haven't read about in the docs.
Moreover, the insert with a 400K cache size is still 100% slower than the insert with no sequence/PK. Granted, this is an acceptable time.
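Concretely, I would first check the current setting and then bump the cache, something along these lines (foo_id_seq is the sequence from the insert above; 400000 is just the candidate value I'm considering, not a recommendation):

-- Check the current cache size of the sequence.
SELECT sequence_name, cache_size, last_number
FROM user_sequences
WHERE sequence_name = 'FOO_ID_SEQ';

-- Candidate change: 10 threads * 40K rows = 400,000 cached values.
ALTER SEQUENCE foo_id_seq CACHE 400000;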
The docs say:
The CACHE clause preallocates a set of sequence numbers and keeps them in memory so that sequence numbers can be accessed faster. When the last of the sequence numbers in the cache has been used, the database reads another set of numbers into the cache.
Sequence numbers can be kept in the sequence cache in the System Global Area (SGA). Sequence numbers can be accessed more quickly in the sequence cache than they can be read from disk.
Follow these guidelines for fast access to all sequence numbers: Be sure the sequence cache can hold all the sequences used concurrently by your applications. Increase the number of values for each sequence held in the sequence cache.
I think that with a 3.5-second insert time, your cache size is largely irrelevant! I would look at where the time is being spent, and I would start with an execution plan (or preferably a SQL Monitor report) for the query.
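For example, assuming you can find the SQL_ID of the INSERT ... SELECT (the literal below is a placeholder), a text SQL Monitor report can be pulled like this:

-- Replace the placeholder sql_id with the one for your insert statement.
SELECT DBMS_SQLTUNE.REPORT_SQL_MONITOR(
           sql_id       => 'abc123xyz',
           type         => 'TEXT',
           report_level => 'ALL') AS report
FROM dual;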
Given: SQL Server 2008 R2. Quite fast data disks; lagging log disks.
Required: LOTS and LOTS of inserts, on the order of 10,000 to 30,000 rows per second into a simple table with two indices. The inserts have an intrinsic order and will not repeat, so the order of inserts does not have to be maintained in the short term (i.e. multiple parallel inserts are OK).
So far: I accumulate data into a queue and regularly (via an async thread pool) empty up to 1,024 entries into a work item that gets queued. The thread pool (a custom class) has 32 possible threads and opens 32 connections.
Problem: performance is off by a factor of 300; only about 100 to 150 rows are inserted per second. Log wait time is up to 40%-45% of processing time (ms per second) in SQL Server. Server CPU load is low (4% to 5% or so).
Not usable: bulk insert. The data must be written to disk in as close to real time as possible. This is pretty much an archival process for data running through the system, but there are queries which need regular access to the data. I could try dumping the rows to disk and doing a bulk upload 1-2 times per second... I will give this a try.
Anyone have a smart idea? My next step is moving the log to a fast disk set (a modern 128GB SSD) to see what happens then. That should give a significant performance boost, but even then the question is whether, and what, is feasible.
So please fire away with the smart ideas.
OK, answering myself: I'm going to give SqlBulkCopy a try, batching up to 65,536 entries and flushing them out every second in an async fashion. I will report on the gains.
I'm going through the exact same issue here, so I'll go through the steps I'm taking to improve my performance.
Separate the log and the data file onto different spindle sets.
Use the simple recovery model.
You didn't mention any indexing requirements other than the fact that the order of inserts isn't important; in that case, a clustered index on anything other than an identity column shouldn't be used.
Start scaling your concurrency again from 1 and stop when your performance flattens out; anything over this will likely hurt performance.
Rather than dropping to disk for bcp, and since you are using SQL Server 2008, consider inserting multiple rows at a time; this statement inserts three rows in a single SQL call:
INSERT INTO table VALUES ( 1,2,3 ), ( 4,5,6 ), ( 7,8,9 )
I was topping out at ~500 distinct inserts per second from a single thread. After ruling out the network and CPU (0 on both client and server), I assumed that disk I/O on the server was to blame; however, inserting in batches of three got me 1,500 inserts per second, which rules out disk I/O.
It's clear that the MS client library has an upper limit baked into it (and a dive into Reflector shows some hairy async completion code).
Batching in this way, waiting for x events to be received before calling insert, now has me inserting at ~2,700 inserts per second from a single thread, which appears to be the upper limit for my configuration.
Note: if you don't have a constant stream of events arriving at all times, you might consider adding a timer that flushes your inserts after a certain period (so that you see the last event of the day!)
Some suggestions for increasing insert performance:
Increase ADO.NET BatchSize
Choose the target table's clustered index wisely, so that inserts won't lead to clustered index node splits (e.g. autoinc column)
Insert into a temporary heap table first, then issue one big "insert-by-select" statement to push all of that staging table's data into the actual target table (see the sketch after this list)
Apply SqlBulkCopy
Choose "Bulk Logged" recovery model instad of "Full" recovery model
Place a table lock before inserting (if your business scenario allows for it)
Taken from Tips For Lightning-Fast Insert Performance On SqlServer
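A minimal sketch of the staging-heap idea from that list (all table and column names are made up; the TABLOCK hint is the "table lock" suggestion and assumes your scenario can tolerate it):

-- Hypothetical staging heap: no clustered index, no constraints, cheap to append to.
CREATE TABLE dbo.EventsStaging (
    EventTime DATETIME2    NOT NULL,
    Payload   VARCHAR(400) NOT NULL
);

-- Writers append to the heap around the clock; a periodic job then pushes
-- everything staged into the real table in one set-based statement.
BEGIN TRAN;
INSERT INTO dbo.Events WITH (TABLOCK) (EventTime, Payload)
SELECT EventTime, Payload
FROM dbo.EventsStaging;
TRUNCATE TABLE dbo.EventsStaging;  -- a real job would swap two staging tables
COMMIT TRAN;                       -- so rows arriving mid-flush aren't lost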
I have a very large set of data (~3 million records) which needs to be merged with updates and new records on a daily schedule. I have a stored procedure that actually breaks up the record set into 1000 record chunks and uses the MERGE command with temp tables in an attempt to avoid locking the live table while the data is updating. The problem is that it doesn't exactly help. The table still "locks up" and our website that uses the data receives timeouts when attempting to access the data. I even tried splitting it up into 100 record chunks and even tried a WAITFOR DELAY '000:00:5' to see if it would help to pause between merging the chunks. It's still rather sluggish.
I'm looking for any suggestions, best practices, or examples on how to merge large sets of data without locking the tables.
Thanks
Change your front end to use NOLOCK or READ UNCOMMITTED when doing the selects.
You can't use NOLOCK with MERGE, INSERT, or UPDATE, as the records must be locked in order to perform the update. However, you can NOLOCK the SELECTs.
Note that you should use this with caution. If dirty reads are okay, then go ahead. However, if the reads require the updated data then you need to go down a different path and figure out exactly why merging 3M records is causing an issue.
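For example, on a hypothetical front-end query (table and column names are made up; @CustomerId stands in for however the front end parameterises the lookup), the hint goes on each table reference, or you can set the isolation level for the whole batch:

-- Per-table hint: dirty reads allowed on this table reference only.
SELECT OrderId, Status
FROM dbo.Orders WITH (NOLOCK)
WHERE CustomerId = @CustomerId;

-- Or for the whole session/batch:
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT OrderId, Status
FROM dbo.Orders
WHERE CustomerId = @CustomerId;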
I'd be willing to bet that most of the time is spent reading data from the disk during the merge command and/or working around low memory situations. You might be better off simply stuffing more ram into your database server.
An ideal amount would be enough RAM to pull the whole database into memory as needed. For example, if you have a 4GB database, make sure you have 8GB of RAM, on an x64 server of course.
I'm afraid I've had quite the opposite experience. We were performing updates and insertions where the source table had only a fraction of the number of rows of the target table, which was in the millions.
When we combined the source table records across the entire operational window and then performed the MERGE just once, we saw a 500% increase in performance. My explanation for this is that you are paying for the up front analysis of the MERGE command just once instead of over and over again in a tight loop.
Furthermore, I am certain that merging 1.6 million rows (source) into 7 million rows (target), as opposed to 400 rows into 7 million rows over 4,000 distinct operations (in our case), leverages the capabilities of the SQL Server engine much better. Again, a fair amount of the work is in the analysis of the two data sets, and this is done only once.
Another question I have to ask is whether you are aware that the MERGE command performs much better with indexes on both the source and target tables. I would like to refer you to the following link:
http://msdn.microsoft.com/en-us/library/cc879317(v=SQL.100).aspx
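For what it's worth, a minimal skeleton of that kind of single large MERGE (names are illustrative; the key points are merging the whole combined source in one statement and indexing the join key on both tables):

-- Both dbo.StagedSource and dbo.Target are indexed on Id.
MERGE dbo.Target AS t
USING dbo.StagedSource AS s
    ON t.Id = s.Id
WHEN MATCHED THEN
    UPDATE SET t.Value     = s.Value,
               t.UpdatedAt = s.UpdatedAt
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Id, Value, UpdatedAt)
    VALUES (s.Id, s.Value, s.UpdatedAt);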
From personal experience, the main problem with MERGE is that since it takes page locks, it precludes any concurrency in your INSERTs directed at a table. So if you go down this road, it is fundamental that you batch all updates that will hit a table into a single writer.
For example: we had a table on which INSERT took a crazy 0.2 seconds per entry, most of this time seemingly wasted on transaction latching, so we switched over to MERGE. Some quick tests showed that it allowed us to insert 256 entries in 0.4 seconds, or even 512 in 0.5 seconds. We tested this with load generators and all seemed fine, until it hit production and everything blocked to hell on the page locks, resulting in a much lower total throughput than with the individual INSERTs.
The solution was not only to batch the entries from a single producer into one MERGE operation, but also to batch the batches from all producers going to an individual DB into a single MERGE operation, through an additional level of queueing (previously also a single connection per DB, but using MARS to interleave all the producers' calls to the stored procedure doing the actual MERGE transaction). This way we were able to handle many thousands of INSERTs per second without problems.
Having the NOLOCK hints on all of your front-end reads is an absolute must, always.