Loading a huge entity tree with EF - performance

I need to load a model consisting of roughly 20 tables from the database with Entity Framework.
So there are probably a few ways of doing this:
1. Use one huge Include call
2. Use many Include calls while manually iterating the model
3. Use many IsLoaded and Load calls
Here's what happens with each of these options:
1. EF creates a HUGE query, puts a very heavy load on the DB, and then again on mapping the results back into the model. So not really an option.
2. The database gets called a lot, again with pretty big queries.
3. The database gets called even more, but this time with small loads.
All of these options weigh heavily on performance, yet I do need to load all of that data (it feeds calculations for drawing).
So what can I do?
a) Heavy operation => heavy load => do nothing :)
b) Review design => but how?
c) A magical option that will make all these problems go away

When you need to load a lot of data from a lot of different tables, there is no "magic" solution that makes all the problems go away. But in addition to what you have already discussed, you should consider projection. If you don't need every single property of an entity, it is often cheaper to project just the information you do need, for example:
from parent in MyEntities.Parents
select new
{
    ParentName = parent.Name,
    Children = from child in parent.Children
               select new
               {
                   ChildName = child.Name
               }
}
One other thing to keep in mind is that for very large queries, the cost of compiling the query can often exceed the cost of executing it. Only profiling can tell you whether this is the problem here. If it turns out to be, consider using CompiledQuery.
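For illustration, a minimal compiled-query sketch, assuming the same MyEntities context and a Parents set as in the projection above (System.Data.Objects.CompiledQuery is the Entity Framework entry point; System.Data.Linq has the LINQ to SQL equivalent):
using System;
using System.Data.Objects;
using System.Linq;

static class ParentQueries
{
    // Compiled once; the LINQ-to-Entities translation cost is paid the first
    // time the delegate runs and skipped on every call after that.
    // MyEntities (an ObjectContext) and Parent are assumed generated model types.
    public static readonly Func<MyEntities, string, IQueryable<Parent>> ByName =
        CompiledQuery.Compile((MyEntities ctx, string name) =>
            ctx.Parents.Where(p => p.Name == name));
}

// Usage: var parents = ParentQueries.ByName(context, "Smith").ToList();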

You might analyze the ratio of queries to updates. If the model is mostly written once and everything after that is a query, then maybe you should store an XML representation of the model in the database as a "shadow" of the model. You should be able to either read the entire XML column in at once fairly quickly, or else do your calculations (or at least fetch the values needed for the calculations) using XQuery.
This assumes SQL Server 2005 or above.
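As a rough sketch of that idea (ModelShadow, ModelXml and DrawingModel are hypothetical names; the point is one round trip for the whole model instead of a 20-table join):
using System.Data.SqlClient;
using System.IO;
using System.Xml.Serialization;

// Store the whole model as one XML document in an xml column...
static void SaveShadow(SqlConnection conn, int modelId, DrawingModel model)
{
    var writer = new StringWriter();
    new XmlSerializer(typeof(DrawingModel)).Serialize(writer, model);

    using (var cmd = new SqlCommand(
        "UPDATE ModelShadow SET ModelXml = @xml WHERE ModelId = @id", conn))
    {
        cmd.Parameters.AddWithValue("@xml", writer.ToString());
        cmd.Parameters.AddWithValue("@id", modelId);
        cmd.ExecuteNonQuery();
    }
}

// ...and read it back with a single-row query.
static DrawingModel LoadShadow(SqlConnection conn, int modelId)
{
    using (var cmd = new SqlCommand(
        "SELECT ModelXml FROM ModelShadow WHERE ModelId = @id", conn))
    {
        cmd.Parameters.AddWithValue("@id", modelId);
        var xml = (string)cmd.ExecuteScalar();
        return (DrawingModel)new XmlSerializer(typeof(DrawingModel))
            .Deserialize(new StringReader(xml));
    }
}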

You could consider caching your data in memory instead of getting it from the database each time.
I would recommend the Enterprise Library Caching Application Block: http://msdn.microsoft.com/en-us/library/dd203099.aspx
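A minimal sketch of the Caching Application Block usage (a cache manager must be configured in web.config; DrawingModel and LoadModelFromDatabase are stand-ins for your own types and load code):
using Microsoft.Practices.EnterpriseLibrary.Caching;

ICacheManager cache = CacheFactory.GetCacheManager();

// Hit the database only on a cache miss; later requests are served from memory.
var model = (DrawingModel)cache.GetData("drawing-model");
if (model == null)
{
    model = LoadModelFromDatabase();   // your existing EF load
    cache.Add("drawing-model", model);
}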

Related

Query duration in NHibernate Profiler

I have an ASP.NET MVC application which uses Fluent NHibernate to access an Oracle database. I also use NHibernate Profiler to monitor the queries generated by NHibernate. I have one query which is really simple (selecting all rows from a table with 4 string columns). It is used for creating a report in CSV format.
My problem is that the query is taking a very long time to run, and I would like to get a bit more insight into the durations displayed by NHProf. With 65,000 rows it takes 10-20 seconds, even though the "Database only" duration shows something like 20 ms. Network lag should not account for much of this time, because the servers are on the same gigabit LAN. I don't expect people to be able to pinpoint exactly where the bottleneck is, but I would like to know more about how to read the duration measurements in NHibernate Profiler.
What is included in the "Database only" part, and what is included in the "Total time"? Does the total time also include the processing done after populating the C# objects, so that this time actually covers the entire HTTP request? Knowing more about this would hopefully let me eliminate some factors.
This is what the NHibernate mapping class looks like:
Table("V_TICKET_DETAILS");
CompositeId()
.KeyProperty(x => x.TicketId, "TICKET_ID")
.KeyProperty(x => x.Key, "COLUMN_NAME")
.KeyProperty(x => x.Parent, "PARENT_NAME");
Map(x => x.Value, "COLUMN_VALUE");
And the query generated by nh profiler is like this:
SELECT this_.TICKET_ID as TICKET1_35_0_,
this_.COLUMN_NAME as COLUMN2_35_0_,
this_.PARENT_NAME as PARENT3_35_0_,
this_.COLUMN_VALUE as COLUMN4_35_0_
FROM V_TICKET_DETAILS this_
The view is really simple, only joining two tables on a 2-digit integer.
I am by no means a database expert, so I would be happy for all comments that would point me in the correct direction.
The total time is for the call to the NHibernate query only.
However, in addition to the time spent in the database, it includes the time it takes NHibernate to populate your entities (hydration), and that's likely your culprit.
I've had a similar problem, perhaps some of the suggestions there may help you.
The bottom line is that nHib is not really intended to load large datasets.
If none of the suggestions I got helped you, I would suggest a couple of things:
1. It's unlikely that your user needs to view 65,000 rows of data at the same time. Perhaps you can find a way to filter the data so that the result set is smaller (and more readable).
2. Otherwise, if it's, as you say, a special case that only occurs when you generate a report, you don't have to use NHibernate; you can just use, say, good ol' ADO.NET classes...
There is also IStatelessSession, which is intended for exactly such situations. It doesn't have a session cache, which saves a lot of work, so it should be a lot faster.
using (var session = factory.OpenStatelessSession())
{
    // e.g. fetch the report rows with no first-level cache and no change tracking
    // (TicketDetail is a placeholder for whatever your mapped entity is called)
    var rows = session.CreateCriteria(typeof(TicketDetail)).List<TicketDetail>();
}

Best strategy for retrieving large dynamically-specified tables on an ASP.NET page

Looking for a bit of advice on how to optimise one of our projects. We have an ASP.NET/C# system that retrieves data from a SQL Server 2008 database and presents it in a DevExpress ASPxGridView. The data that's retrieved can come from one of a number of databases - all of which are slightly different and are added and removed regularly. The user is presented with a list of live "companies", and the data is retrieved from the corresponding database.
At the moment, data is being retrieved using a standard SqlDataSource and a dynamically-created SQL SELECT statement. There are a few JOINs in the statement, as well as optional WHERE constraints, again dynamically-created depending on the database and the user's permission level.
All of this works great (honest!), apart from performance. When it comes to some databases, there are several hundreds of thousands of rows, and retrieving and paging through the data is quite slow (the databases are already properly indexed). I've therefore been looking at ways of speeding the system up, and it seems to boil down to two choices: XPO or LINQ.
LINQ seems to be the popular choice, but I'm not sure how easy it will be to implement with a system that is so dynamic in nature - would I need to create "definitions" for each database that LINQ could access? I'm also a bit unsure about creating the LINQ queries dynamically too, although looking at a few examples that part at least seems doable.
XPO, on the other hand, seems to allow me to create a XPO Data Source on the fly. However, I can't find too much information on how to JOIN to other tables.
Can anyone offer any advice on which method - if any - is the best to try and retro-fit into this project? Or is the dynamic SQL model currently used fundamentally different from LINQ and XPO and best left alone?
Before you go and change the whole way that your app talks to the database, have you had a look at the following:
Run your code through a performance profiler (such as Redgate's performance profiler), the results are often surprising.
If you are constructing the SQL string on the fly, are you using .NET best practices such as String.Concat("str1", "str2") instead of "str1" + "str2"? Remember, multiple small gains add up to big gains.
Have you thought about having a summary table or database that is periodically updated (say every 15 minutes; you might need to run a service to update this data automatically) so that you are only hitting one database? New connections to databases are quite expensive.
Have you looked at the query plans for the SQL that you are running? Today, I moved a dynamically created SQL string to a sproc (only 1 param changed) and shaved 5-10 seconds off the running time (it was being called 100-10,000 times depending on some conditions).
Just a warning if you do use LINQ: I have seen some developers who decided to use LINQ write more inefficient code because they did not know what they were doing (pulling 36,000 records when they needed to check for 1, for example). These things are very easily overlooked.
Just something to get you started on and hopefully there is something there that you haven't thought of.
Cheers,
Stu
As far as I understand, you are talking about so-called server mode, where all data manipulations are done on the DB server instead of passing all records to the web server and processing them there. In this mode the grid works very fast with data sources that contain hundreds of thousands of records. If you want to use this mode, you should create either the corresponding LINQ classes or XPO classes. If you decide to use LINQ-based server mode, the LINQServerModeDataSource provides the Selecting event, which can be used to set a custom IQueryable and KeyExpression. I would suggest that you use LINQ in your application. I hope this information is helpful to you.
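Roughly, the wiring described above looks like the sketch below. The data context, table, key expression and helper names are placeholders; the event-args type comes from the DevExpress LINQ server-mode API mentioned in the answer.
// Markup (sketch):
// <dx:LinqServerModeDataSource ID="GridSource" runat="server" OnSelecting="GridSource_Selecting" />
// <dx:ASPxGridView ID="Grid" runat="server" DataSourceID="GridSource" />

protected void GridSource_Selecting(object sender,
    DevExpress.Data.Linq.LinqServerModeDataSourceSelectEventArgs e)
{
    // Pick the database for the company the user selected.
    var context = new CompanyDataContext(GetConnectionString(SelectedCompanyId));

    e.KeyExpression = "Id";                                // key of the result set
    e.QueryableSource = context.Orders                     // the grid receives an IQueryable,
        .Where(o => o.CompanyId == SelectedCompanyId);     // so paging/sorting stay on the server
}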
I guess there are two points where performance might be tweaked in this case. I'll assume that you're accessing the database directly rather than through some kind of secondary layer.
First, you don't say how you're displaying the data itself. If you're loading thousands of records into a grid, that will take time no matter how fast everything else is. Obviously the trick here is to show a subset of the data and allow the user to page, etc. If you're not doing this then that might be a good place to start.
Second, you say that the tables are properly indexed. If this is the case, and assuming that you're not loading 1,000 records into the page at once but retrieving only subsets at a time, then you should be OK.
But if you're only doing an ExecuteQuery() against a SQL connection to get a dataset back, I don't see how LINQ or anything else will help you. I'd say that the problem is obviously on the DB side.
So to solve the problem with the database you need to profile the different SELECT statements you're running against it, examine the query plan and identify the places where things are slowing down. You might want to start by using the SQL Server Profiler, but if you have a good DBA, sometimes just looking at the query plan (which you can get from Management Studio) is usually enough.
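On the paging point above, a minimal sketch of only ever materializing one page (Ticket is a hypothetical entity; an explicit ordering is needed for Skip/Take to translate deterministically):
// Materialize one page at a time instead of the whole result set.
IQueryable<Ticket> query = context.Tickets.OrderBy(t => t.Id);
List<Ticket> page = query
    .Skip(pageIndex * pageSize)
    .Take(pageSize)
    .ToList();   // only pageSize rows cross the wire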

Entity Framework associations killing performance

Here is the performance test I am looking at. I have 8 different entities that are mapped table-per-type. Some of the entities contain over 100 thousand rows.
This particular application does several recursive calculations on the client so I think it may be best to preload the data instead of lazy loading.
If there are no associations I can load the entire database in about 3 seconds. As I add associations in any way the performance starts to drastically decline.
I am loading all the data the same way (just calling ToList() on the entity set attached to the context). I ran the test with EDMX-generated classes and with self-tracking entities and had similar results.
I am sure that if I were to deal with the associations myself, similar to how I would in a DataSet, the performance problem would go away. On the other hand, I am pretty sure this is not how the Entity Framework was intended to be used. Any thoughts or ideas?
Loading entities with relationships is going to be much slower than loading entities without them, even if the related entities are not fetched at load time, since EF needs to create the complex objects used to track the relationships in one case versus perhaps a simple value type like an int in the other. How much slower are you seeing it?
But ...
Preloading 100 thousand rows sounds like a really bad idea. When you do ToList() you have eliminated any chance that EF and SQL can do any kind of optimized query against your data. Are your calculations such that you always need to examine all the data? Have you tried it without preloading and examined the queries it is generating? Have you tried using .Include to just include the related objects you know you will need?
EF will be smart about caching if you give it the chance.
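For example, a sketch of loading only the related objects and rows one calculation needs, rather than whole tables (the entity and navigation names here are invented, since the question doesn't name its entities):
// Eager-load just the associations the calculation touches, and only the
// relevant rows; EF translates this into one filtered query with joins.
var assemblies = context.Assemblies
    .Include("Components.Measurements")
    .Where(a => a.ProjectId == projectId)
    .ToList();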

Should I store reference data in my application memory, or in the database?

I am faced with the choice of where to store some reference data (essentially drop-down values) for my application. This data will not change (or if it does, I am fine with needing to restart the application) and will be frequently accessed as part of an AJAX autocomplete widget (so there may be several queries against this data by one user filling out one field).
Suppose each record looks something like this:
category
effective_date
expiration_date
field_A
field_B
field_C
field_D
The autocomplete query will need to check the input string against 4 fields in each record and discrete parameters against the category and effective/expiration dates, so if this were a SQL query, it would have a where clause that looks something like:
... WHERE category = ?
AND effective_date < ?
AND expiration_date > ?
AND (colA LIKE ? OR colB LIKE ? OR colC LIKE ?)
I feel like this might be a rather inefficient query, but I suppose I don't know enough about how databases optimize their indexes, etc. I do know that a lot of really smart people work really hard to make database engines really fast at this exact type of thing.
The alternative I see is to store it in my application memory. I could have a list of these records for each category, and then iterate over each record in the category to see if the filter criteria is met. This is definitely O(n), since I need to examine every record in the category.
Has anyone faced a similar choice? Do you have any insight to offer?
EDIT: Thanks for the insight, folks. Sending the entire data set down to the client is not really an option, since the data set is so large (several MB).
Definitely cache it in memory if it's not changing during the lifetime of the application. You're right, you don't want to be going back to the database for each call, because it's completely unnecessary.
There can be debate about exactly how much to cache on the server (I tend to cache as little as possible until I really need to), but for information that will not change and will be accessed repeatedly, you should almost always cache it in the Application object.
Given the number of directions you're coming at this data from (filtering on 6 or more columns), I'm not sure how much more you'll be able to optimize the information in memory. The first thing I would try is to store it in a list in the Application object and query it using LINQ to Objects. If there is one field that is used significantly more than the others, try using a Dictionary keyed on that field instead of a list. If performance continues to be a problem, try storing it in a DataSet and setting indexes on it (but of course you lose some code simplicity and maintainability that way).
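A minimal sketch of that first option, assuming the records have been loaded once into a list held in the Application object at startup (RefRecord and its field names mirror the hypothetical schema in the question):
// e.g. in Application_Start: Application["RefData"] = LoadAllRefRecords();
var all = (List<RefRecord>)HttpContext.Current.Application["RefData"];

var now = DateTime.Now;
var matches = all
    .Where(r => r.Category == category
             && r.EffectiveDate < now
             && r.ExpirationDate > now
             && (r.FieldA.IndexOf(input, StringComparison.OrdinalIgnoreCase) >= 0
              || r.FieldB.IndexOf(input, StringComparison.OrdinalIgnoreCase) >= 0
              || r.FieldC.IndexOf(input, StringComparison.OrdinalIgnoreCase) >= 0))
    .Take(10)   // the autocomplete only needs the first few hits
    .ToList();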
I do not think there is a one-size-fits-all answer to your question. Depending on the data size and usage patterns, the answer will vary. More than that, the answer may change over time.
This is why in my development I built an intermediate layer which allows me to change how the caching is done through configuration alone (with no code changes). Every so often we analyze various stats (cache hit ratio, etc.) and decide whether we want to change the cache behavior.
BTW, there is also a third layer - you can push your static data to the browser and cache it there too.
Can you just hard-wire it into the program (as long as you stick to DRY)? Changing it only requires a rebuild.

Entity framework and performance

I am trying to develop my first web project using Entity Framework. While I love the way you can use LINQ instead of writing SQL, I do have some severe performance issues. I have a lot of unhandled data in a table which I would like to do a few transformations on and then insert into another table. I run through all the objects and then insert them into my new table. I need to do some small comparisons (which is why I need to insert the data into another table), but for these performance tests I have removed them. The following code (with approximately 12-15 properties to set) took 21 seconds, which is quite a long time. Is it usually this slow, and what might I be doing wrong?
DataLayer.MotorExtractionEntities mee = new DataLayer.MotorExtractionEntities();
List<DataLayer.CarsBulk> carsBulkAll = ((from c in mee.CarsBulk select c).Take(100)).ToList();
foreach (DataLayer.CarsBulk carBulk in carsBulkAll)
{
    DataLayer.Car car = new DataLayer.Car();
    car.URL = carBulk.URL;
    car.color = carBulk.SellerCity.ToString();
    car.year = // ... more properties are set this way
    mee.AddToCar(car);
}
mee.SaveChanges();
You cannot create batch updates using Entity Framework.
Imagine you need to update rows in a table with a SQL statement like this:
UPDATE table SET col1 = @a WHERE col2 = @b
Using SQL this is just one round trip to the server. Using Entity Framework, you have (at least) one round trip to the server to load all the data, then you modify the rows on the client, and then it sends them back row by row.
This will slow things down, especially if your network connection is limited and if you have more than just a couple of rows.
So for this kind of update, a stored procedure is still a lot more efficient.
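Whether the statement lives in a stored procedure or is sent directly as below, the point is the single round trip. A plain ADO.NET sketch (connection string, values, and the table/column names from the UPDATE above are placeholders):
using System.Data.SqlClient;

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("UPDATE table SET col1 = @a WHERE col2 = @b", conn))
{
    // One statement, one round trip - nothing is loaded, tracked, or sent back row by row.
    cmd.Parameters.AddWithValue("@a", newValue);
    cmd.Parameters.AddWithValue("@b", filterValue);
    conn.Open();
    cmd.ExecuteNonQuery();
}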
I have been experimenting with the entity framework quite a lot and I haven't seen any real performance issues.
Which row of your code is causing the big delay? Have you tried debugging it and measuring which method takes the most time?
Also, the complexity of your database structure could slow the Entity Framework down a bit, but not to the degree you are describing. Are there any 'infinite loops' (circular references) in your DB structure? Without the DB structure it is really hard to say what's wrong.
Can you try the same in straight SQL?
The problem might be related to your database and not the Entity Framework. For example, if you have massive indexes and lots of check constraints, inserting can become slow.
I've also seen problems at insert with databases which had never been backed-up. The transaction log could not be reclaimed and was growing insanely, causing a single insert to take a few seconds.
Trying this in SQL directly would tell you if the problem is indeed with EF.
I think I solved the problem. I have been running the app locally, while the database is in another country (a neighboring one, but nevertheless). I tried deploying the application to the server and running it from there, and it then took only 2 seconds instead of 20. I then tried to transfer 1,000 records, which took 26 seconds - quite a big update - though I don't know if this is the "regular" speed for saving 1,000 records to the database?