SaveChanges takes anywhere from 40 to 60 seconds to save ~8000 context changes - performance

Setup:
Entity Framework 4 with lazy loading enabled (model-first, table-per-hierarchy).
The number of tables is about 40 (and no table has more than 15-20 fields).
SQL Server Express 2008 (not R2).
No database triggers or anything of that kind exist - the database is only used for storage. All the logic is in the code.
The database size at the moment is approx. 2 GB.
(Primary keys are Guids and are generated in code via Guid.NewGuid() - if this matters.)
Saving a complex operation result (which produces a complex object graph) takes anywhere from 40 to 60 seconds (the number returned by SaveChanges is approx. 8000 - mostly added objects and some modified ones).
Saving the same operation result with an empty (or an almost empty) database usually takes around 1 second on the same computer.
The only variable that seems to affect this issue is the database size. But please note that I am only measuring the Context.SaveChanges() call (so even if I have some weird sluggish queries somewhere, that should not affect this issue).
Any suggestions as to why this operation may last this long are appreciated.
Update 1
Just to clarify - the code that takes 40-60 seconds to execute is (it takes this long only when the DB size is around 2gb):
Stopwatch sw = Stopwatch.StartNew();
int count = objectContext.SaveChanges(); // this method is not overridden
Debug.Write(sw.ElapsedMilliseconds); // prints out 40000 - 60000 ms
Debug.Write(count); // I am testing with exactly the same operation and the
// result always gives the same count for it (8460)
The same operation with an empty DB takes around 1000 ms (while still giving the same count - 8460). Thus the question would be - how could database size affect SaveChanges()?
Update 2
Running a perf profiler shows that the main bottleneck (from "code perspective") is the following method:
Method: static SNINativeMethodWrapper.SNIReadSync
Called: 3251 times
Avg: 10.56 ms
Max: 264.25 ms
Min: 0.01 ms
Total: 34338.51 ms
Update 3
There are non-clustered indexes for all PKs and FKs in the database. We are using random Guids as surrogate keys (not sequential), so fragmentation is always at very high levels. I tried executing the operation in question right after rebuilding all DB indexes (fragmentation was less than 2-3% for all indexes), but it did not seem to improve the situation in any way.
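For illustration, a COMB-style sequential GUID generator could look roughly like this (a sketch only, not the code used in this project; it overwrites the bytes SQL Server sorts first with a timestamp so new keys land near the end of the index instead of scattering across it):
using System;

public static class SequentialGuid
{
    // Illustrative COMB-style generator: the last 6 bytes of a random GUID are
    // overwritten with a timestamp so newly generated keys sort roughly in insert
    // order, which keeps index fragmentation down compared to plain Guid.NewGuid().
    public static Guid NewGuid()
    {
        byte[] guidBytes = Guid.NewGuid().ToByteArray();

        long ms = DateTime.UtcNow.Ticks / TimeSpan.TicksPerMillisecond;
        byte[] msBytes = BitConverter.GetBytes(ms);
        if (BitConverter.IsLittleEndian)
            Array.Reverse(msBytes);

        // SQL Server orders uniqueidentifiers by the last byte group first,
        // so place the big-endian timestamp in bytes 10..15.
        Array.Copy(msBytes, 2, guidBytes, 10, 6);

        return new Guid(guidBytes);
    }
}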
In addition, I must say that during the operation in question one table involved in the process has approximately 4 million rows (this table gets lots of inserts). SQL Profiler shows that inserts into that table can last anywhere from 1 to 200 ms (this is a "spike"). Again, this does not seem to change when the indexes are freshly rebuilt.
In any case, it seems (at the moment) that the problem is on the SQL Server side of the application, since the main thing taking up time is that SNIReadSync method. Correct me if I am being completely ignorant.

It is hard to guess without a profiler, but 8000 records is definitely a lot. EF 4 usually works fine with up to a couple of hundred tracked objects. I would not be surprised if it turns out that change tracking takes most of this time. EF 5 and 6 have some performance optimizations, so if you cannot decrease the number of tracked objects somehow, you could experiment with them.
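On EF 6 with the DbContext API, the usual mitigations look roughly like this (a hedged sketch; MyDbContext, Results and the results list are illustrative names, and the question's ObjectContext model would first need a DbContext wrapper):
using System;
using System.Linq;

const int chunkSize = 500;
for (int i = 0; i < results.Count; i += chunkSize)
{
    // A fresh context per chunk keeps the tracked set small.
    using (var context = new MyDbContext())
    {
        // DetectChanges scans every tracked entity; disabling it avoids O(n) work per Add.
        context.Configuration.AutoDetectChangesEnabled = false;
        context.Configuration.ValidateOnSaveEnabled = false;

        foreach (var item in results.Skip(i).Take(chunkSize))
            context.Results.Add(item);

        context.SaveChanges();
    }
}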

Related

Entity Framework Core - queries are executed fast but SaveChanges completes way after

In my current task, I'm trying to insert 100 users with approximately 20 properties each. The logger shows EF Core splitting these inserts into 4 different queries, and each query execution takes up to 100 ms. Even though all queries execute in under 1 second, it takes the application code around 10 seconds to step over SaveChanges.
Things that have been considered and implemented:
There is only a single SaveChanges call.
There are no additional relations with the user object. Single entity, single table.
All mappings were validated a couple of times to confirm that entity property types match column types.
For a record count as low as 100, this is unacceptable, as you may agree.
Which direction should I look at to understand the underlying problem?
Thanks in advance.
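One direction to check: whether the time goes into change tracking or into the batched commands themselves. A hedged sketch (AppDbContext, User/users and the connection string are illustrative names, and LogTo assumes EF Core 5+):
using System;
using System.Diagnostics;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Logging;

var options = new DbContextOptionsBuilder<AppDbContext>()
    .UseSqlServer("Server=.;Database=AppDb;Trusted_Connection=True;",
        sql => sql.MaxBatchSize(100))                // try a single batch for all 100 rows
    .LogTo(Console.WriteLine, LogLevel.Information)  // EF Core 5+: logs per-command timings
    .Options;

using var context = new AppDbContext(options);
context.ChangeTracker.AutoDetectChangesEnabled = false; // rule out change-tracking overhead for pure inserts
context.Users.AddRange(users);

var sw = Stopwatch.StartNew();
int count = context.SaveChanges();
Console.WriteLine($"{count} rows in {sw.ElapsedMilliseconds} ms");
Comparing the logged per-command timings with the stopwatch result shows whether the 10 seconds is spent in SQL Server or in the application side of SaveChanges.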

Will Spring Batch prevent my program from grinding to a halt on 94 million transactions if the Garbage Collection is an issue?

This may look like a similar question to Performance optimization for processing of 115 million records for inserting into Oracle but I feel it's a different problem, and the other question does not have a definitive answer because of some lack of clarity.
I am loading a netCDF file consisting of the following variables and dimensions into three tables in a database, to collect data from multiple data sources:
Variables:
Time: 365 entries in hours since Jan 1, 1900
Latitude: 360 entries, center of 1/2 degree latitude bands
Longitude: 720 entries, center of 1/2 degree longitude bands
Precipitation: 3 Dimensional Array Time, Lat, Lon in dimensions
The three tables I am constructing are like so:
UpdateLog:
uid year updateTime
Location:
lid lat lon
(hidden MtM table) UpdateLog_Location:
uid lid
Precipitation:
pid lid uid month day amount
If you do the math, the Location (and hidden table) will have around 250k entries each for this one file (it's just the year 2017) and the Precipitation table will have up to 94 million entries.
Right now, I am just using Spring Boot, trying to read in the data and update the tables starting with Location.
When I have a batch size of 1, the database started off updating fairly quickly, but over time bogged down. I didn't have any sort of profiling set up at the time, so I wasn't sure why.
When I set it to 500, I could clearly see the steps slowing down with each update, but it started off much faster than with a batch size of 1.
I set it to 250,000 and it updated the first 250,000 entries in about 3 minutes, whereas with a batch size of 1, 72 hours wouldn't even have come close. However, once I started profiling the program, I noticed something. This seems to be a problem not with the database (35-40 seconds is all it took to commit all those entries) but with Java, as it seems the garbage collection isn't keeping up with all the old POJOs.
Now, I have been looking at 2 possible solutions to this problem: Spring Batch, and a direct CSV import into MariaDB. I'd prefer the former to keep things unified if possible. However, I've noticed that Spring Batch also has me create POJOs for each of the items.
Will Spring Batch remedy this problem for me? Can I fix this with a thread manager and multi-threading the operation so I can have multiple GCs running at once? Or should I just do the direct CSV file import to MariaDB?
The problem is that even if I can get this one file done in a few days, we are building a database of historical weather of all types. There will be many more files to import, and I want to set up a workable framework we can use for each of them. There's even 116 more years of data for this one data source!
Edit: Adding some metrics from the run last night that support my belief that the problem is the garbage collection.
194880 nanoseconds spent acquiring 1 JDBC connections;
0 nanoseconds spent releasing 0 JDBC connections;
1165541217 nanoseconds spent preparing 518405 JDBC statements;
60891115221 nanoseconds spent executing 518403 JDBC statements;
2167044053 nanoseconds spent executing 2 JDBC batches;
0 nanoseconds spent performing 0 L2C puts;
0 nanoseconds spent performing 0 L2C hits;
0 nanoseconds spent performing 0 L2C misses;
6042527312343 nanoseconds spent executing 259203 flushes (flushing a total of 2301027603 entities and 4602055206 collections);
5673283917906 nanoseconds spent executing 259202 partial-flushes (flushing a total of 2300518401 entities and 2300518401 collections)
As you can see, it is spending 2 orders of magnitude longer flushing memory than actually doing the work.
4 tables? I would make 1 table with 4 columns, even if the original data were not that way:
dt DATETIME -- y/m/d:h
lat SMALLINT
lng SMALLINT
amount ...
PRIMARY KEY (dt, lat, lng)
And, I would probably do all the work directly in SQL.
LOAD DATA INFILE into whatever matches the file(s).
Run some SQL statements to convert to the schema above.
Add any desired secondary indexes to the above table.
(In one application, I converted hours into a MEDIUMINT, which is only 3 bytes. I needed that type of column in far more than 94M rows across several tables.)
At best, your lid would be a 3-byte MEDIUMINT with two 2-byte SMALLINTs behind it. The added complexity probably outweighs a mere 94MB savings.
Total size: about 5GB. Not bad.
I've noticed that Spring Batch also has me create POJOs for each of the items.
Spring Batch does not force you to parse data and map it to POJOs. You can use the PassThroughLineMapper and process items in their raw format (even in binary if you want).
I would recommend using partitioning in your use case.
I'd like to thank those who assisted me as I have found several answers to my question and I will outline them here.
The problem stemmed from the fact that Hibernate ends up creating 1,000 garbage collection jobs per POJO and is not a very good system for batch processing. Any good remedy for large batches will avoid using Hibernate altogether.
The first method of doing so that I found utilizes Spring Boot without Hibernate. By creating my own bulk save method in my repository interface, I was able to bind it to a SQL insert query directly, without needing a POJO or using Hibernate to create the query. Here is an example of how to do that:
@Query(value = "insert ignore into location (latitude, longitude) values(:latitude, :longitude)",
nativeQuery = true)
public void bulkSave(@Param("latitude") float latitude, @Param("longitude") float longitude);
Doing this greatly reduced the garbage collection overhead, allowing the process to run without slowing down over time. However, while an order of magnitude faster, this was still far too slow for my purposes, taking 3 days for 94 million lines.
Another method shown to me was to use Spring Batch to send the queries in bulk instead of one at a time. Because my data source was unusual (not a flat file), I had to handle the data and feed it into an ItemReader one entry at a time to make it appear to come directly from a file. This also improved speed, but I found a much faster method before I tried this.
The fastest method I found was to write the tables I wanted out to a CSV file, compress it, and then transmit the resulting file to the database server, where it could be decompressed and imported directly. This can be done for the above table with the following SQL command:
LOAD DATA
INFILE 'location.csv' IGNORE
INTO TABLE Location
COLUMNS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(latitude, longitude)
SET id = NULL;
This process took 2-3 minutes to create the files, 5 minutes to compress the 2.2 GB of files, 5 minutes to decompress them, and 15 minutes to load them in; transmission of the file will depend on your network capabilities. At around 30 minutes plus network transfer time, this was by far the fastest method of importing the large amount of data I needed into the database, though it may require more work on your part depending on your situation.
So those are the 3 possible solutions to this problem that I discovered. The first uses the same framework and allows easy understanding and implementation of the solution. The second uses an extension of the framework and allows for larger transfers in the same period. The final one is by far the fastest and is useful if the amount of data is egregious, but requires work on your part to build the software to do so.

SQLCE performance on Windows Phone very poor

I'm writing this thread as I've fought this problem for three whole days now!
Basically, I have a program that reads a big CSV file and uses it as input to a local SQL CE database.
For every row in this CSV file (which represents some sort of object, let's call it "dog"), I need to know whether this dog already exists in the database.
If it already exists, don't add it to the database.
If it doesn't exist, add a new row to the database.
The problem is, every query takes around 60 milliseconds (in the beginning, when the database is empty) and it goes up to about 80ms when the database is around 1000 rows big.
When I have to go through 1000 rows (which in my opinion is not much), this takes around 70000 ms = 1 minute and 10 seconds (just to check if the database is up to date) - way too slow! Considering this amount will probably some day be more than 10000 rows, I cannot expect my user to wait for over 10 minutes before his DB is synchronized.
I've tried using a compiled query instead, but that does not improve performance.
The field I'm searching on is a string (which is the primary key), and it's indexed.
If it's necessary, I can update this thread with code so you can see what I do.
SQL CE on Windows Phone isn't the fastest of creatures but you can optimise it:
This article covers a number of things you can do: WP7 Local DB Best Practices.
They also provide a WP7 project that can be downloaded so you can play with the code.
On top of this article, I'd suggest changing your PK from a string to an int; strings take up more space than ints, so your index will be larger and take more time to load from isolated storage. Certainly in SQL Server, searches on strings are slower than searches on ints/longs.
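With the LINQ to SQL mapping the local database uses, that PK change could look roughly like this (a sketch; Dog and ExternalKey are illustrative names, and a secondary index on the string column would be declared as the linked article describes):
using System.Data.Linq.Mapping;

[Table]
public class Dog
{
    // Surrogate integer key: a smaller index that is cheaper to load and compare
    // than a string primary key.
    [Column(IsPrimaryKey = true, IsDbGenerated = true, CanBeNull = false)]
    public int Id { get; set; }

    // The former string key kept as an ordinary column; rows are looked up by this,
    // ideally via a secondary index rather than the primary key.
    [Column(CanBeNull = false)]
    public string ExternalKey { get; set; }
}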

MongoDB C# cursor performance issue

I installed the latest MongoDB 64-bit DB and the official C# driver as of 13 March 2012. I am getting some unexpected performance results with cursors.
The following code will retrieve and loop through 500,000 records at about 26.8k records/sec on my Core 2 Duo 2 GHz laptop:
var query = Query.EQ("_H._t", "Car");
var cursor = mc.FindAs<RoctObj>(query);
double priceTot = 0d;
foreach (RoctObj item in cursor)
{
Car car = (Car)item._H;
priceTot += car.Price;
}
That seems reasonable. Next, I adjusted the query so that only 721 results are returned. The code takes over 1.1 seconds longer to execute than if the foreach segment is replaced with:
long i = cursor.Count();
Given the speed of the first example, 721 records should only take a fraction of a second to iterate. I know there are some other overheads, but they shouldn't be that bad. I don't understand why I am getting the extra 1.1 seconds.
Any ideas?
EDIT
Here is the alternate query. Note that the query time isn't the question. It's the iteration time.
var query = Query.And(
Query.LTE("_H.Price", BsonDouble.Create(80000d)).GTE(BsonDouble.Create(40000d)),
Query.LTE("_H.Cylinders", BsonDouble.Create(8d)).GTE(BsonDouble.Create(4d)),
Query.LTE("_H.Capacity", BsonDouble.Create(3000d)).GTE(BsonDouble.Create(2000d)),
Query.LTE("_H.TopSpeed", BsonDouble.Create(200d)).GTE(BsonDouble.Create(100d))
);
Calling cursor.Count() transfers no data from the server to your application. It sends a command to the server, the count is performed on the server, and only a tiny packet comes back containing the numeric result of the count.
I'm not sure why iterating over the documents takes that much longer than a simple count. One reason could be that the server is able to compute the count using only an index, whereas when you actually iterate over the documents the server has to fetch every single document from disk if it is not already paged into memory.
It is unlikely to be any bottleneck in the C# driver deserialization code as that is quite fast.
If you can provide a sample program that demonstrates the observed behavior I would be happy to try and reproduce your results.
MongoDB does not return all the results at once; it returns a cursor that reads data from the database as your application asks for it (i.e., during your iterations), which may be why it is slower.
Running count() simply returns the number of matches found, without any document data.
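One way to test whether the extra ~1.1 s is document fetching and deserialization rather than the query itself is to project only what the loop needs and time the iteration (a sketch against the 1.x driver used above; RoctObj, Car, _H and Price come from the question, the rest is illustrative):
using System.Diagnostics;
using MongoDB.Driver.Builders;

var cursor = mc.FindAs<RoctObj>(query)
               .SetFields(Fields.Include("_H.Price", "_H._t")) // keep the discriminator so _H still deserializes as Car
               .SetBatchSize(1000);                            // fewer round trips for small result sets

var sw = Stopwatch.StartNew();
double priceTot = 0d;
foreach (RoctObj item in cursor)
{
    priceTot += ((Car)item._H).Price;
}
Debug.WriteLine("{0} ms, total {1}", sw.ElapsedMilliseconds, priceTot);
If the projected iteration is much faster than the full one, the cost is in fetching/deserializing whole documents rather than in the query or the cursor machinery.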

NHibernate slow when reading large bag

I am having performance problems where an aggregate has a bag with a large number of entities (1000+). Usually it contains at most 50 entities, but sometimes a lot more.
Using NHibernate Profiler I see that fetching the 1123 records of this bag from the database takes 18 ms, but it takes NHibernate 1079 ms to process them. The problem is that each of those 1123 records has one or two additional records. I fetch these using fetch="subselect"; fetching these additional records takes 16 ms from the database and 2527 ms of processing by NHibernate. So this action alone takes about 3.5 seconds, which is way too expensive.
I read that this is because updating the first-level cache gets slow when loading a lot of entities. But what is a lot? NHibernate Profiler says that I have 1145 entities loaded by 31 queries (which is in my case the absolute minimum). This number of loaded entities does not seem like a lot to me.
In the current project we are using NHibernate v3.1.0.4000
I agree, 1000 entities aren't too many. Are you sure that the time isn't spent in one of the constructors or property setters? You could pause the debugger a few times during the load to take a random sample of where it spends the time.
Also make sure that you use the reflection optimizer (I think it's turned on by default).
I assume that you measure the time of the query itself. If you measure the whole transaction, it most certainly spends the time in flushing the session. Avoid flushing by setting the FlushMode to Never (only if there aren't any changes in the session to be stored) or by using a StatelessSession.
A wild guess: Removing the batch-size setting may even make it faster because it doesn't need to assign the entities to the corresponding collections.
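As a concrete illustration of the flushing/first-level-cache suggestions above (a sketch only; sessionFactory, Order, OrderLine and orderId are placeholder names, not the asker's model):
using NHibernate;

// 1) Read-only unit of work: stop NHibernate from dirty-checking and flushing the session.
using (var session = sessionFactory.OpenSession())
{
    session.FlushMode = FlushMode.Never;   // NHibernate 3.x; safe only if nothing in the session is modified
    var order = session.Get<Order>(orderId);
    var lineCount = order.Lines.Count;     // triggers the bag load plus the subselect fetch
}

// 2) Or bypass the first-level cache entirely with a stateless session
//    (no lazy loading there, so query the children directly).
using (var stateless = sessionFactory.OpenStatelessSession())
{
    var lines = stateless.CreateQuery("from OrderLine l where l.Order.Id = :id")
                         .SetParameter("id", orderId)
                         .List<OrderLine>();
}
If the read-only or stateless variant is dramatically faster, the time is indeed going into session bookkeeping rather than into materializing the entities.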
