Memory/Efficiency with Linq and large data sets

So you know the background I'm coming from, I've been a professional programmer for over twelve years. My best language by far is C#, but I've done C, C++, and most recently Objective-C. I've done a lot of work accessing data in databases, but I haven't done as much UI work as most people (except in iOS).
Recently I've begun using the Entity Framework in C# for a job, and I must say I wish I'd discovered it sooner. I wouldn't say it's the best thing since sliced bread, but it's pretty damned close. After using it for a while, it got me thinking about best practices and usage as compared to the old-school method of using IDbConnection and IDbCommand for everything.
I was coding for a situation where I was going to be listing the contents of a table of users from a database in a bound data grid, with the intention of giving the user the ability to do standard CRUD stuff. I started off by making a User class and an IUserManager interface with a corresponding implementation. Each user is assigned to a department, and naturally there'd need to be a way to perform CRUD on departments too, so I added a Department class, an IDepartmentManager interface, and an implementation for that too. I set it up so that the grid bound to the results of the .GetAll() method on the IUserManager interface. Then I started filling in the guts.
I don't have the code in front of me any more, but I basically used IDbConnection to tap into the datastore with an IDbCommand using a SQL query. Then I called command.ExecuteReader() and iterated the .Read() method on the IDataReader object. Using the ordinal for each column, I pulled out the data, validated it, slipped it into a User class, and added the instance to a Dictionary that the method would then return. All the DB classes are of course IDisposable, so wrapping them in a using takes care of cleaning up the mess.
Pretty standard stuff, I've done it a bazillion times.
That's when I realized that the departmentId I was pulling from the DB wasn't what I wanted to display in my grid. Telling someone 'this guy is in department 7' isn't as useful as saying 'this guy is in accounting'. So I first toyed with modding my query to get both the departmentId and name, and storing the name on the user object for display later. Then I decided to give the user a Department class instance that would be populated and that it would hang onto during its lifetime. That's when I converted the guts to LINQ.
public Dictionary<int, User> GetAll()
{
    var result = new Dictionary<int, User>();
    using (var datastore = new myEntities())
    {
        result = (from user in datastore.userInfoes
                  join department in datastore.userDepartmentInfoes on user.departmentID equals department.departmentID
                  select new User()
                  {
                      UserIndex = user.id,
                      FirstName = user.firstName,
                      LastName = user.lastName,
                      Department = new Department()
                      {
                          DepartmentId = user.departmentID.Value,
                          DepartmentName = department.departmentName,
                      },
                      Username = user.userName,
                  }
                 ).ToDictionary(x => x.UserIndex, x => x);
    }
    return result;
}
That's where I started thinking (read: over-analysing probably)
The implementation I had would work just fine. It would even work pretty well for a small dataset. It'll even work fine for a largish dataset (say 10,000). Even if you counted every person in the company I currently work for five times over, you'd have fewer than a thousand people.
But what if for a second I worked for a really big honking company that had 10 million employees? That would result in the departmentName strings being duplicated potentially millions of times.
That also got me thinking that, unlike iOS's MVC implementation, this particular situation wasn't going to query just enough users to fill the screen and then handle paging and stuff. As soon as the calling code refreshes the data binding, it is going to pull all 10 million users at once and pass back the collection. That's going to be slow.
So that leaves me with the idea in my head that this method is both slow and inefficient with larger data sets. Not only that but the fact that there might be 2 million instances of 'Accounting' held with this data set it is going to be a major memory hog. We're also kind of defeating the purpose of a relational database here because of the Department class inside the User. In the DB you just have a departmentId int foreign key referencing an entry in another table. The link only occurs when you cross reference to the other table and even then there's really only one 'Accounting' string at any one time. In the above code you're going to have a whole lot of 'Accounting' strings floating around waiting to be cleaned up.
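To make the duplication concern concrete, here is a minimal sketch in Python (chosen only so it runs standalone; the class and function names are hypothetical and not part of the code above). It illustrates the alternative of loading each department once and letting every user reference the same instance. Whether the LINQ query above really materializes a separate string per row depends on the provider, so treat this as an illustration of the sharing idea, not a measurement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Department:
    department_id: int
    name: str


@dataclass
class User:
    user_index: int
    username: str
    department: Department


def build_users(user_rows, department_rows):
    """Attach one shared Department instance to every user in that department."""
    # Each department is created exactly once and looked up by id afterwards.
    departments = {dep_id: Department(dep_id, name) for dep_id, name in department_rows}
    return [User(uid, uname, departments[dep_id]) for uid, uname, dep_id in user_rows]


# Hypothetical sample rows: (id, name) and (id, username, department id).
department_rows = [(7, "Accounting"), (8, "Engineering")]
user_rows = [(1, "alice", 7), (2, "bob", 8), (3, "carol", 7)]
users = build_users(user_rows, department_rows)

# alice and carol reference the very same Department object, not two copies.
shared = users[0].department is users[2].department
```

With this shape the number of department objects grows with the number of departments, not with the number of users.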
An MVC scenario would basically 'know' that it takes X number of entries to fill the grid's viewable area. It would only query X at a time starting from index Y and as the user navigated it would query and display additional records as needed. That's a heck of a lot better than querying all 10 million and letting them hang out somewhere whether they're displayed or not.
Like I said, I may very well be over-analysing this. I might also be incorrect in some of my assumptions about the way LINQ works. But in the interest of learning I figured I had to ask: What is the best way to do something like this? Is this sort of thing OK for small datasets? Would the whole thing be better off as an MVC implementation rather than pulling in the entire dataset to be displayed in the grid?

If you need the whole set of data in memory, you will have to load it anyway. I am sure you will not list 10 million users in a grid, right? The technique that comes to mind is paging. Check this article from MSDN with examples.
As for the department objects, does your UserInfo have a foreign key to the department? If so, you should just have userInfo.Department available to you and no joins are needed.
If you bind the department data to the grid columns, why have a property of type Department? I assume your User class is something you bind to the UI. Flatten it out into:
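class User
{
    // Property types are inferred from the question's code.
    public string Username { get; set; }
    public int UserIndex { get; set; }
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public int DepartmentId { get; set; }
    public string DepartmentName { get; set; }
}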
What is the purpose of GetAll()? You return a dictionary and it feels like you need to enable lookups by id. Or do you use the result to enumerate the users?
For lookups, consider asking the database for a single user's data when needed. Add caching later if it makes sense.
For enumeration, do not return a dictionary, which is an all-in-memory object. Return an IEnumerable with yielded (paged?) results or, even better, an IQueryable, so that calling GetAll() doesn't execute the SQL call right away and the calling code can narrow the query by adding the necessary filters.
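The "yielded (paged?) results" idea can be sketched as a generator that runs no query until it is iterated and then fetches one page at a time. This is a Python illustration using the standard sqlite3 module, not Entity Framework; the table layout, the function name iter_users, and the assumption that ids are positive integers are all hypothetical.

```python
import sqlite3


def iter_users(conn, page_size=100):
    """Yield user rows lazily, fetching one page per query (keyset paging on id)."""
    last_id = 0  # assumes ids are positive integers
    while True:
        page = conn.execute(
            "SELECT id, user_name FROM users WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, page_size),
        ).fetchall()
        if not page:
            return
        yield from page
        last_id = page[-1][0]  # continue after the last id already seen


# Hypothetical in-memory table with 250 users.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, user_name TEXT)")
conn.executemany(
    "INSERT INTO users (id, user_name) VALUES (?, ?)",
    [(i, f"user{i}") for i in range(1, 251)],
)

rows = iter_users(conn, page_size=100)        # no query has run yet
first_three = [next(rows) for _ in range(3)]  # only the first page is fetched
```

A caller that stops after the first screenful never pays for the remaining pages, which is the property the dictionary-returning GetAll() lacks.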

Related

Should I extract functionality from this model class into a form class? (ActiveRecord Pattern)

I am in the midst of designing an application following the MVC paradigm. I'm using the SQLAlchemy expression language (not the ORM), and Pyramid, if anyone was curious.
So, for a user class that represents a user on the system, I have several accessor methods for various pieces of data like the avatar_url, name, about, etc. I have a method called getuser which looks up a user in the db (by name or id), retrieves the user's row, and encapsulates it with the user class.
However, should I have to make this lookup every time I create a user class? What if a user is viewing her control panel and wants to change avatars, and sends an XHR? Isn't it a waste to have to create a user object and look up the user's row when they won't even be using the data retrieved, but simply want to make a change to a subset of the columns? I doubt this lookup is negligible despite indexing, because of waiting for I/O, correct?
More generally, isn't it inefficient to have to query a database and load all a model class's data to make any change (even small ones)?
I'm thinking I should just create a separate form class (since every change made is via some form), and have specific form classes inherit from it, where these setter methods will be implemented. What do you think?
EX: Class: Form <- Class: Change_password_form <- function: change_usr_pass
I'd really appreciate some advice on creating a proper design. Thanks.

SQLAlchemy ORM has some facilities which would simplify your task. It looks like you're having to re-invent quite some wheels already present in the ORM layer: "I have a method called getuser which looks up a user in the db(by name or id), retrieves the users row, and encapsulates it with the user class" - this is what ORM does.
With ORM, you have a Session, which, apart from other things, serves as a cache for ORM objects, so you can avoid loading the same model more than once per transaction. You'll find that you need to load User object to authenticate the request anyway, so not querying the table at all is probably not an option.
You can also configure some attributes to be lazily loaded, so some rarely needed or bulky properties are only loaded when you access them.
You can also configure relationships to be eagerly loaded in a single query, which may save you from doing hundreds of small separate queries. I mean, in your current design, how many queries would the below code initiate:
for user in get_all_users():
    print(user.get_avatar_uri())
    print(user.get_name())
    print(user.get_about())
From your description it sounds like it may require 1 + (num_users*3) queries. With SQLAlchemy ORM you could load everything in a single query.
The conclusion is: fetching a single object from a database by its primary key is a reasonably cheap operation, and you should not worry about it unless you're building something the size of Facebook. What you should worry about is making hundreds of small separate queries where one larger query would suffice. This is the area where SQLAlchemy ORM is very, very good.
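The 1 + (num_users*3) arithmetic can be checked with a small counting sketch. It uses the standard sqlite3 module instead of SQLAlchemy so that it runs without extra packages; the CountingDb wrapper, the table layout, and the sample rows are hypothetical and only stand in for the accessor-per-attribute design described in the question.

```python
import sqlite3


class CountingDb:
    """Tiny wrapper that counts how many queries are sent to the database."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.query_count = 0

    def query(self, sql, params=()):
        self.query_count += 1
        return self.conn.execute(sql, params).fetchall()


db = CountingDb()
db.conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, avatar_uri TEXT, about TEXT)"
)
db.conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    [(1, "molly", "/a/1.png", "hi"), (2, "sam", "/a/2.png", "hello"), (3, "kim", "/a/3.png", "hey")],
)

# Accessor-per-attribute style: one query for the ids, then one per attribute per user.
for (user_id,) in db.query("SELECT id FROM users"):
    for column in ("avatar_uri", "name", "about"):
        db.query(f"SELECT {column} FROM users WHERE id = ?", (user_id,))
naive_queries = db.query_count  # 1 + 3 users * 3 attributes

# Single-query style: everything needed comes back in one round trip.
db.query_count = 0
rows = db.query("SELECT id, avatar_uri, name, about FROM users")
single_queries = db.query_count
```

For three users the first style issues ten queries and the second issues one; the gap grows linearly with the number of users.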
Now, regarding "isn't it a waste to have to create a user object, and look up the users row when they wont even be using the data retrieved; but simply want to make a change to subset of the columns" - I understand you're thinking about something like
class ChangePasswordForm(...):
    def _change_password(self, user_id, new_password):
        session.execute("UPDATE users ...", user_id, new_password)
    def save(self, request):
        self._change_password(request['user_id'], request['password'])
versus
class ChangePasswordForm(...):
    def save(self, request):
        user = getuser(request['user_id'])
        user.change_password(request['password'])
The former example will issue just one query; the latter will have to issue a SELECT, build the User object, and then issue an UPDATE. The former may seem to be "twice as efficient", but in a real application the difference may be negligible. Moreover, you will often need to fetch the object from the database anyway, either to do validation (the new password cannot be the same as the old password), permission checks (is user Molly allowed to edit the description of Photo #12343?), or logging.
If you think that the difference of doing the extra query is going to be important (millions of users constantly editing their profile pictures) then you probably need to do some profiling and see where the bottlenecks are.

Read up on the SOLID principles, paying particular attention to the S (single responsibility), as it answers your question.
Create a single class to perform user existence check, and inject it into any class that requires that functionality.
Also, you need to create a data persistence class to store the user's data, so that the database doesn't have to be queried every time.

Azure Tables, PartitionKeys and RowKeys functionality

I'm just getting started with Azure Tables. I haven't played with them before, so I wanted to check them out.
My understanding is that I should be thinking of this as object storage, rather than a database, which is cool. But I'm a bit confused on a couple points...
First, if I have one to many object relationships, what should the partitionkey of the root object look like? For example, let's say I have a University object, which is one to many to Student objects, and say Student objects are one to many to Classes. For a new student, should its partitionkey be 'universityId'? Or 'universityId + studentId'? I read in the msdn docs that the RowKey is supposed to be an id specific to the item I am adding, which also sounds like studentId.
And then would both the partitionkey and rowkey for a new University just be universityId?
I also read that Azure Tables are not for storing lists. I take it that does not refer to storing an object that contains a List...?
Does anyone have any links to code samples using ASP.NET MVC 3 or 4 and Razor with Azure Tables? This is my end goal; it would be cool to see what someone who actually knows what they are doing does :)
Thanks!

You're definitely right that Azure Tables is closer to an object store than a database. You do have some ability to query on non-key columns, and to do logic in queries. But you shouldn't plan on using those features for anything performance critical.
Queries are only fast if you specify at least a PartitionKey (and preferably a RowKey or a range of RowKeys), and that heavily influences how you lay out your tables. The decisions you make at the beginning will have big performance implications later. As a rough analogy, I like to think about them like a SQL Server table with the primary key as (PartitionKey + RowKey) that can never have another index. That's not completely accurate, but it'll get you thinking in the right direction.
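The "single index on (PartitionKey + RowKey)" analogy can be modelled with a toy in-memory structure. This is a Python sketch of the mental model only; it does not use or imitate the Azure SDK, and ToyTable and its method names are hypothetical.

```python
class ToyTable:
    """In-memory stand-in: entities are indexed only by (partition_key, row_key)."""

    def __init__(self):
        self._partitions = {}  # partition_key -> {row_key: entity}

    def insert(self, partition_key, row_key, entity):
        self._partitions.setdefault(partition_key, {})[row_key] = entity

    def get(self, partition_key, row_key):
        # Point lookup: both keys are known, so this is two dictionary hits.
        return self._partitions[partition_key][row_key]

    def query_partition(self, partition_key):
        # Only one partition is touched.
        return list(self._partitions.get(partition_key, {}).values())

    def scan(self, predicate):
        # Filtering on a non-key property has to visit every entity.
        return [
            entity
            for rows in self._partitions.values()
            for entity in rows.values()
            if predicate(entity)
        ]


# Hypothetical layout: PartitionKey = university id, RowKey = student id.
table = ToyTable()
table.insert("uni-1", "student-1", {"name": "Ann", "year": 2})
table.insert("uni-1", "student-2", {"name": "Bo", "year": 3})
table.insert("uni-2", "student-9", {"name": "Cy", "year": 2})

ann = table.get("uni-1", "student-1")
uni1_students = table.query_partition("uni-1")
second_years = table.scan(lambda entity: entity["year"] == 2)
```

The key choice decides which of these three paths a given question can use, which is why the answer keeps asking how the students will be queried.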
First, if I have one to many object relationships, what should the partitionkey of the root object look like?
I would probably use the UniversityId as the PartitionKey. That's generally a safe place to start.
For a new student, should its partitionkey be 'universityId'? Or 'universityId + studentId'?
How do you plan to query the students? If you're always going to have their UniversityId & StudentId I would probably make them the PartitionKey and RowKey, respectively. If you're mostly going to query based on StudentId, I would use that as the PartitionKey instead.
would both the partitionkey and rowkey for a new University just be universityId?
That's a viable choice. You can also use a constant value (eg "UNIVERSITY") for the RowKey, if you've really got nothing else to put there.
I also read that Azure Tables are not for storing lists. I take it that does not refer to storing an object that contains a List...?
I'm not entirely sure what that means. Clearly you can store a collection of objects in a table; that's what they're for. You can't directly store a list in an entity property. So if your Student has a property of type List, that can't be stored directly. But you could serialize it to XML or binary, and store that.
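As a sketch of the serialize-then-store idea, the snippet below flattens a list-valued property into a single string and restores it on the way back. It is written in Python and uses JSON rather than the XML or binary formats mentioned above, purely for brevity; the entity shape and the helper names are hypothetical.

```python
import json


def to_entity(student):
    """Flatten a student dict so that every property is a simple value."""
    entity = dict(student)
    entity["classes"] = json.dumps(student["classes"])  # list -> single string
    return entity


def from_entity(entity):
    """Reverse of to_entity: turn the stored string back into a list."""
    student = dict(entity)
    student["classes"] = json.loads(entity["classes"])
    return student


# Hypothetical student with a list-valued property.
student = {"student_id": "s-42", "name": "Ann", "classes": ["Math 101", "Physics 201"]}
entity = to_entity(student)
restored = from_entity(entity)
```

The trade-off is that the serialized property is opaque to the store, so it cannot be filtered on in a query.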
I don't have any code samples handy, unfortunately. This may be a good time to abstract your data logic into its own layer, rather than putting it in your MVC controllers. We've found that a well-abstracted data layer can make unit testing your logic very easy. If you create some interfaces for your tables, it's very easy to create mock objects using just a List and some LINQ.

JSF session issue

I have a situation where I have a list of records, say 10,000. I am using a datatable with paging (10 records per page). I wanted to put that list in the session as:
facesContext........put("mylist", mylist);
And in the getters of the mylist, I have
public List<MyClass> getMyList() {
    if (mylist == null) {
        mylist = (List<MyClass>) FacesContext......getSessionMap().get("mylist");
    }
    return mylist;
}
Now the problem is that whenever I click on a paging button to go to the second page, only the first records are displayed.
I know I am missing something, and I have a few questions:
Is this way of putting the list in the session correct?
Is this the way I should be retrieving the list in my case?
Thanks in advance...

Something entirely different: I strongly recommend not putting those 10,000 records in the session scope. That is plainly inefficient. If 100 users are visiting your datatable, those records would be duplicated in memory for every user. This makes no sense. Just leave them in the database and write SQL queries so that they return exactly the rows you want to display per request. That's the job the DB is designed for. If the data model is well designed (indexes on columns involved in WHERE and, if necessary, ORDER BY clauses), then it's certainly faster than hauling the entire table into Java's memory for each visitor.
You can find more insights and code examples in this article.
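A query that returns exactly one page per request might look like the sketch below. It is written in Python with the standard sqlite3 module rather than Java and JSF, only to keep it self-contained; the table, the column names, and fetch_page are hypothetical, and LIMIT/OFFSET syntax varies between databases.

```python
import sqlite3

PAGE_SIZE = 10


def fetch_page(conn, page_number, page_size=PAGE_SIZE):
    """Return only the rows for one page (page_number starts at 1)."""
    offset = (page_number - 1) * page_size
    return conn.execute(
        "SELECT id, label FROM records ORDER BY id LIMIT ? OFFSET ?",
        (page_size, offset),
    ).fetchall()


# Hypothetical table holding the 10,000 records from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, label TEXT)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [(i, f"record {i}") for i in range(1, 10001)],
)

page_two = fetch_page(conn, 2)  # ten rows, ids 11 through 20
```

Only ten rows are held in application memory per request, regardless of how many visitors are paging through the table.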

Entity Framework: Doing large queries

I'm probably addressing one of the bigger usability-issues in EF.
I need to perform a calculation on a very big part of a model. For example, say we need a Building, with all of its doors, the categories of those doors. But I'd also need the windows, furniture, roof etc.
And imagine that my logic also depends on more coupled tables behind those categories (subcategories etc.).
We need most of this model at a lot of points in the code, so I'd need to have the whole model filled and linked up by EF.
For doing this, we are simply querying the ObjectContext and using type-safe includes.
But this gets impractical and error-prone.
Does anyone have suggestions for tackling this kind of problem?

Use projection to get only the values you need, especially if you don't intend to update everything. You probably don't need every property of a piece of furniture, etc. So instead of retrieving the entity itself, project what you want:
from b in Context.Buildings
where b.Id == 123
select new
{
    Name = b.Name,
    Rooms = from r in b.Rooms
            select new
            {
                XDimension = r.XDimension,
                // etc.
            }
};
Now you no longer have to worry about whether something is loaded; the stuff you need is loaded, and the stuff you don't need is not. The generated SQL will be dramatically simpler, as well.

Using LINQ with stored procedure that returns multiple instances of the same entity per row

Our development policy dictates that all database accesses are made via stored procedures, and this is creating an issue when using LINQ.
The scenario discussed below has been somewhat simplified, in order to make the explanation easier.
Consider a database that has 2 tables.
Orders (OrderID (PK), InvoiceAddressID (FK), DeliveryAddressID (FK) )
Addresses (AddressID (PK), Street, ZipCode)
The resultset returned by the stored procedure has to rename the address related columns, so that the invoice and delivery addresses are distinct from each other.
OrderID  InvAddrID  DelAddrID  InvStreet  DelStreet  InvZipCode  DelZipCode
1        27         46         Main St    Back St    abc123      xyz789
This, however, means that LINQ has no idea what to do with these columns in the resultset, as they no longer match the property names in the Address entity.
The frustrating thing about this is that there seems to be no way to define which resultset columns map to which Entity properties, even though it is possible (to a certain extent) to map entity properties to stored procedure parameters for the insert/update operations.
Has anybody else had the same issue?
I'd imagine that this would be a relatively common scenario from a schema point of view, but the stored procedure seems to be the key factor here.

Have you considered creating a view like the below for the stored procedure to select from? It would add complexity, but allow LINQ to see the Entity the way you wanted.
Create view OrderAddress as
Select o.OrderID
    ,i.AddressID as InvID
    ,d.AddressID as DelID
    ...
from Orders o
left join Addresses i
    on o.InvoiceAddressID = i.AddressID
left join Addresses d
    on o.DeliveryAddressID = d.AddressID

LINQ is a bit fussy about querying data; it wants the schema to match. I suspect you're going to have to bring that back into an automatically generated type, and do the mapping to your entity type afterwards in LINQ to Objects (i.e. after AsEnumerable() or similar), as it doesn't like you creating instances of the mapped entities manually inside a query.
Actually, I would recommend challenging the requirement in one respect: rather than SPs, consider using UDFs to query data; they work similarly in terms of being owned by the database, but they are composable at the server (paging, sorting, joinable, etc).
(this bit is a bit random; take it with a pinch of salt)
UDFs can be associated with entity types if the schema matches, so another option (I haven't tried it) would be to have a GetAddress(id) udf, and a "main" udf, and join them:
var qry = from row in ctx.MainUdf(id)
          select new {
              Order = ctx.GetOrder(row.OrderId),
              InvoiceAddress = ctx.GetAddress(row.InvoiceAddressId),
              DeliveryAddress = ctx.GetAddress(row.DeliveryAddressId) };
(where the udf just returns the ids - actually, you might have the join to the other udfs, making it even worse).
or something - might be too messy for serious consideration, though.

If you know exactly what columns your result set will include, you should be able to create a new entity type that has properties for each column in the result set. Rather than trying to pack the data into an Order, for example, you can pack it into an OrderWithAddresses, which has exactly the structure your stored procedure would expect. If you're using LINQ to Entities, you should even be able to indicate in your .edmx file that an OrderWithAddresses is an Order with two additional properties. In LINQ to SQL you will have to specify all of the columns as if it were an entirely unrelated data type.
If your columns get generated dynamically by the stored procedure, you will need to try a different approach: Create a new stored procedure that only pulls data from the Orders table, and one that only pulls data from the addresses table. Set up your LINQ mapping to use these stored procedures instead. (Of course, the only reason you're using stored procs is to comply with your company policy). Then, use LINQ to join these data. It should be only slightly less efficient, but it will more appropriately reflect the actual structure of your data, which I think is better programming practice.
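The "pull orders and addresses separately, then join in the application" approach can be sketched as follows. This is plain Python over in-memory rows rather than LINQ; the dictionary keys and join_orders are hypothetical, and the sample values are taken from the result set shown in the question.

```python
# Rows as two separate stored procedures might return them (hypothetical shapes).
orders = [{"order_id": 1, "invoice_address_id": 27, "delivery_address_id": 46}]
addresses = [
    {"address_id": 27, "street": "Main St", "zip_code": "abc123"},
    {"address_id": 46, "street": "Back St", "zip_code": "xyz789"},
]


def join_orders(orders, addresses):
    """Attach the invoice and delivery addresses to each order by id lookup."""
    by_id = {address["address_id"]: address for address in addresses}
    return [
        {
            "order_id": order["order_id"],
            # .get() leaves None in place when an address id has no match.
            "invoice_address": by_id.get(order["invoice_address_id"]),
            "delivery_address": by_id.get(order["delivery_address_id"]),
        }
        for order in orders
    ]


joined = join_orders(orders, addresses)
```

Because each address row keeps its original column names, no renaming is needed to tell the invoice and delivery addresses apart.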

I think I understand what you're after, but I could be wildly off...
If you mock up classes in a DBML (right-click -> new -> class) that are the same structure as your source tables, you could simply create new objects based on what is read from the stored procedure. Using LINQ to objects, you could still query your selection. It's more code, but it's not that hard to do. For example, mock up your DBML like this:
Pay attention to the associations http://geeksharp.com/screens/orders-dbml.png
Make sure you pay attention to the associations I added. You can expand "Parent Property" and change the name of those associations to "InvoiceAddress" and "DeliveryAddress." I also changed the child property names to "InvoiceOrders" and "DeliveryOrders" respectively. Notice the stored procedure up top called "usp_GetOrders." Now, with a bit of code, you can map the columns manually. I know it's not ideal, especially if the stored proc doesn't expose every member of each table, but it can get you close:
public List<Order> GetOrders()
{
    // our DBML classes
    List<Order> dbOrders = new List<Order>();
    using (OrderSystemDataContext db = new OrderSystemDataContext())
    {
        // call stored proc
        var spOrders = db.usp_GetOrders();
        foreach (var spOrder in spOrders)
        {
            Order ord = new Order();
            Address invAddr = new Address();
            Address delAddr = new Address();
            // set all the properties
            ord.OrderID = spOrder.OrderID;
            // add the invoice address
            invAddr.AddressID = spOrder.InvAddrID;
            invAddr.Street = spOrder.InvStreet;
            invAddr.ZipCode = spOrder.InvZipCode;
            ord.InvoiceAddress = invAddr;
            // add the delivery address
            delAddr.AddressID = spOrder.DelAddrID;
            delAddr.Street = spOrder.DelStreet;
            delAddr.ZipCode = spOrder.DelZipCode;
            ord.DeliveryAddress = delAddr;
            // add to the collection
            dbOrders.Add(ord);
        }
    }
    // at this point I have a List of orders I can query...
    return dbOrders;
}
Again, I realize this seems cumbersome, but I think the end result is worth a few extra lines of code.

This isn't very efficient at all, but if all else fails, you could try making two procedure calls from the application: one to get the invoice address and another to get the delivery address.
