Am I abusing LINQ to Objects? - performance

I think queries with LINQ to Objects end up very readable and nice. For example:
from person in db.Persons.ToList()
where person.MessageableBy(currentUser) ...
where MessageableBy is a method that can't be translated into a store expression (SQL):
public bool MessageableBy(Person sender)
{
    // Sender is a system admin
    if (sender.IsSystemAdmin())
        return true;

    // Sender is a domain admin of this person's domain
    if (sender.Domain.DomainId == this.Domain.DomainId && this.Domain.HasAdmin(sender))
        return true;

    // One of this person's groups is messageable by the sender
    foreach (Group group in this.Groups)
    {
        if (group.MessageableBy(sender))
            return true;
    }

    // The person is attorney of someone messageable
    if (this.IsAttorney)
    {
        foreach (Person pupil in this.Pupils)
            if (pupil.MessageableBy(sender))
                return true;
    }

    return false;
}
The problem is that I don't think this is going to scale. I'm already noticing slowdowns with only a few entries in the database, so I can't imagine how it will behave with a large one.
So the question is:
Should I mix LINQ to Entities with LINQ to Objects (i.e. apply some of the "where" clauses to the ICollection and some to the .ToList() result of that)? Or should I use only LINQ to Entities, ending up with one very large query?

.ToList() will actually execute the query and fetch all the data in that table, which is not something you'd want unless you know for sure it'll always be a few records. So yes, you should do more in the where clause before calling .ToList().
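For example, a minimal sketch using the question's names (the Domain comparison is a hypothetical stand-in for whatever part of the check can be expressed as a store expression):

// Translatable predicates run on the database server...
var candidates = db.Persons
    .Where(p => p.Domain.DomainId == currentUser.Domain.DomainId)
    .ToList(); // ...so only the pre-filtered rows are materialized

// ...and the untranslatable method runs in memory over that smaller set.
var messageable = candidates.Where(p => p.MessageableBy(currentUser)).ToList();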

I largely agree with your initial analysis. Mixing LINQ to Objects and LINQ to Entities is fine, but it requires retrieving more data than necessary, and could therefore lead to scaling problems down the road.
Remember to design your data model to support the critical queries. Perhaps a user could be a person, and a person could have a self-relationship that determines who can message whom. This is just a simple thought, to inspire you to consider other ways of representing your data so that the MessageableBy method can be realized in the query itself.
In the meantime, if it isn't causing performance problems, then I would consider this issue more in terms of model design.

Although this simply paraphrases the statements made by earlier respondents, I believe it is important enough to truly emphasize:
It is critical for DB application performance to perform as much calculation as possible, and particularly as much filtering and aggregation as possible, on the DB server prior to sending the resulting data to the client.


latestOfMany() on a belongsToMany() relationship

I've been using latestOfMany() on my hasMany() relations to define them as hasOne() for quite a while now. Lately I've been in need of a similar application, but for belongsToMany() relationships. Unfortunately, Laravel doesn't have this feature.
My schema is as follows:
Document: id, upload_date, identifier_code
Person: id, name
DocumentPerson (pivot): id, document_id, person_id, token
My objective is to define a relationship fetching the first document (by upload_date) of a Person. As you can see, it's a many-to-many relationship.
What I have tried so far:
public function firstDocument()
{
    return $this->hasOne(DocumentPerson::class)->oldestOfMany('document.upload_date');
    // This was my safe bet, but oldestOfMany()/ofMany() don't allow aggregating on a relationship column.
}

public function firstDocument()
{
    return $this->belongsToMany(Document::class)->oldestOfMany('upload_date');
}

public function firstDocument()
{
    return $this->belongsToMany(Document::class)->oldest()->limit(1);
}

public function firstDocument()
{
    return $this->hasOneThrough(Document::class, DocumentPerson::class, 'id', 'document_id', 'id', 'person_id')->latestOfMany('upload_date');
}
At this point I'm almost positive the current relationship infrastructure doesn't support something like this, so I'm weighing alternative methods to solve it. My two choices:
Add a column called first_document_id to the Person table and go through it with belongsTo(); simple and fast performance-wise. The downside is that I'd have to implement a lot of event listeners to make sure it always stays consistent with the actual relationships. What if a Document's upload_date is updated, etc.? (Basically, database inconsistency.)
Add an order column on the pivot (document_person) table which holds the order of the related Documents by upload_date. This way I can do hasOne(DocumentPerson::class)->oldestOfMany('order'); (or just ofMany()) and be done with it. This one also poses a risk of database inconsistency.
It's fair to say I'm at a crossroads here. Any idea or suggestion is welcome and appreciated. Thank you. Please read the restrictions below to avoid suggesting things that aren't feasible for my situation.
Restrictions:
(Please)
It should strictly be a relationship. I'll be using it in various places; it definitely has to be a relationship so I can eager load and query it. My next objective involves querying by this relationship, so this is imperative.
Don't suggest accessors, it won't do well with my case.
Don't suggest collection methods, it needs to be done in query.
Don't suggest ->limit() or ->take() or ->first(), those are prone to cause inconsistent results with eager loading.
Update 1
Q: Why does the first document of a person have to be a relationship?
A: Because further down the line I'll be querying it in various instances. Example queries where it'll be utilized:
Get all the users whose first document (by upload_date) was uploaded between 2022-01-01 and 2022-06-08 (along with 10 other scopes and filters).
Get all the users whose first document has an identifier_code starting with "Lorem" and an id bigger than 100.
These are just to name a few; there are many cases where I really have to query it in various fashions. This is the reason I desperately need it to be a relationship, so I can query it with ease using Person::whereHas('firstDocument', function($subQuery){ return $subQuery->someScope1()->anotherScope2()->where(...); });
If I only needed to display it, then sure, eager loading with a closure would do well, or even collection methods or accessors would suffice. But since the ability to query it is the requirement, a relationship is of the essence. Keep in mind the Person table has around 500k records, hence the need to query it on the database layer.
Alright, here's the solution I elected to go with (among my choices explained in the question): I implemented the "order column on the pivot table" approach, because it scales better and is rather flexible compared to the other options. It allows querying for the last document, first document, third document, etc., and it doesn't even require any aggregate functions (MAX/MIN, as ->latestOfMany() applies), which is a performance boost. Given these constraints, this solution was the way to go. Here's how I applied it, in case someone else is considering something similar.
Currently the only noticeable downside to this approach is the inability to access any additional pivot data.
Added a new column for the order:
//migration
$table->unsignedTinyInteger('document_upload_date_order')->nullable()->after('token');
$table->index('document_upload_date_order');//for performance
Person.php (Model)
//... other stuff
public function personalDocuments()
{
    // My old relationship, which I'll still keep for display/index purposes.
    return $this->belongsToMany(Document::class)->withPivot('token')->where('type_slug', 'personal');
}

// NEW RELATIONSHIP
public function firstDocument()
{
    // Eloquent relationship; allows querying and eager loading
    return $this->hasOneThrough(
        Document::class,
        DocumentPerson::class, // pivot model for the pivot table
        'person_id',
        'id',
        'id',
        'document_id')
        ->where('document_upload_date_order', 1); // magic here
}
SomeService.php
public function determineDocumentUploadDateOrders(Person $person)
{
    $sortLogic = [
        ['upload_date', 'asc'],
        ['created_at', 'asc'],
    ];
    $documentsOrdered = $person->documents->sortBy($sortLogic)->values(); // values() re-indexes the array keys
    foreach ($documentsOrdered as $index => $document) {
        // Updating through the pivot table's ORM model
        DocumentPerson::where('id', $document->pivot->id)->update([
            'document_upload_date_order' => $index + 1,
            'document_id' => $document->id,
            'person_id' => $document->pivot->person_id,
        ]);
    }
}
I hooked determineDocumentUploadDateOrders() into various event listeners and model events, so whenever an association/disassociation occurs, or the upload_date of a document changes, I simply call determineDocumentUploadDateOrders() with the corresponding Person, and this way the order is always kept in sync with the actual data.
I implemented it fully and it is providing consistent results with great performance. Of course, it brought a bit of overhead in keeping things in sync, but nonetheless it did the job while meeting the requirements. Honestly, I found this approach far more reliable than some unofficial Eloquent relationships and similar alternatives.
I encountered a similar situation years back. The best workaround in a situation like this is to use the eager-limit package by @staudenmeir.
Load the trait use \Staudenmeir\EloquentEagerLimit\HasEagerLimit; on both models (parent and related model), then try the code below:
public function firstDocument()
{
    return $this->documents()->latest()->limit(1);
}

public function documents()
{
    return $this->belongsToMany(Document::class);
}
Just to add: eager loading with a limit does not work with built-in Laravel Eloquent; you would have to build your own raw queries to achieve it, which can turn into a nightmare. That eager-limit package from staudenmeir should have been merged into the Laravel source code 😆

Memory/Efficiency with Linq and large data sets

So you know the background I'm coming from: I've been a professional programmer for over twelve years. My best language by far is C#, but I've done C, C++, and most recently Objective-C. I've done a lot of work accessing data in databases, but I haven't done as much UI work as most people (except in iOS).
Recently I've begun using the Entity Framework in C# for a job, and I must say I wish I'd discovered it sooner. I wouldn't say it's the best thing since sliced bread, but it's pretty damned close. After using it for a while it got me thinking about best practices and usage compared to the old-school method of using IDbConnections and IDbCommands for everything.
I was coding for a situation where I was going to be listing the contents of a table of users from a database in a bound data grid, with the intention of giving the user the ability to do standard CRUD stuff. I started off by making a User class and an IUserManager interface with a corresponding implementation. Each user is assigned to a department, and naturally there'd need to be a way to perform CRUD on departments too, so I added a Department class, an IDepartmentManager interface, and an implementation for that as well. I set it up so that the grid bound to the results of the .GetAll() method on the IUserManager interface. Then I started filling in the guts.
I don't have the code in front of me any more, but I basically used an IDbConnection to tap into the datastore with an IDbCommand running a SQL query. Then I called command.ExecuteReader() and iterated the .Read() method on the IDataReader object. Using the ordinal for each column I pulled out the data, validated it, and slipped it into a User class, then added the class to a Dictionary that the method would return. All the DB classes are of course IDisposable, so wrapping them in a using takes care of cleaning up the mess.
Pretty standard stuff, I've done it a bazillion times.
That's when I realized that the departmentId I was pulling from the DB wasn't what I wanted to display in my grid. Telling someone 'this guy is in department 7' isn't as useful as saying 'this guy is in accounting'. So I first toyed with modifying my query to get both the departmentId and the name, and storing the name on the user object for display later. Then I decided to give the user a populated Department class instance that it would hang onto during its lifetime. That's when I converted the guts to LINQ.
public Dictionary<int, User> GetAll()
{
    var result = new Dictionary<int, User>();
    using (var datastore = new myEntities())
    {
        result = (from user in datastore.userInfoes
                  join department in datastore.userDepartmentInfoes
                      on user.departmentID equals department.departmentID
                  select new User()
                  {
                      UserIndex = user.id,
                      FirstName = user.firstName,
                      LastName = user.lastName,
                      Department = new Department()
                      {
                          DepartmentId = user.departmentID.Value,
                          DepartmentName = department.departmentName,
                      },
                      Username = user.userName,
                  }).ToDictionary(x => x.UserIndex, x => x);
    }
    return result;
}
That's where I started thinking (read: probably over-analyzing).
The implementation I had would work just fine. It would work pretty well for a small dataset, and even fine for a largish one (say 10,000 rows). Even if you counted every person in the company I currently work for five times over, you'd have less than a thousand people.
But what if for a second I worked for a really big honking company that had 10 million employees? That would result in the departmentName strings being duplicated potentially millions of times.
That also got me thinking that, unlike iOS's MVC implementation, this particular setup wasn't going to query just enough users to fill the screen and then handle paging and so on. As soon as the calling code refreshed the data binding, it was going to pull all 10 million users at once and pass back the collection. That's going to be slow.
So that leaves me with the idea that this method is both slow and inefficient with larger data sets. Not only that, but with potentially 2 million instances of 'Accounting' held in this data set, it is going to be a major memory hog. We're also kind of defeating the purpose of a relational database here because of the Department class inside the User. In the DB you just have a departmentId int foreign key referencing an entry in another table. The link only occurs when you cross-reference the other table, and even then there's really only one 'Accounting' string at any one time. In the above code you're going to have a whole lot of 'Accounting' strings floating around waiting to be cleaned up.
An MVC scenario would basically 'know' that it takes X entries to fill the grid's viewable area. It would only query X at a time, starting from index Y, and as the user navigated it would query and display additional records as needed. That's a heck of a lot better than querying all 10 million and letting them hang out somewhere whether they're displayed or not.
Like I said, I may very well be over-analyzing this. I might also be incorrect in some of my assumptions about the way LINQ works. But in the interest of learning, I figured I had to ask: what is the best way to do something like this? Is this sort of thing OK for small datasets? Would the whole thing be better off as an MVC implementation rather than pulling in the entire dataset to be displayed in the grid?
If you need the whole set of data in memory, you will have to load it anyway; but I am sure you will not list 10 million users in a grid, right? The technique that comes up is paging. Check this article from MSDN with examples.
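For example, a minimal Skip/Take sketch against the myEntities context from your code (pageIndex and pageSize are hypothetical parameters):

// EF requires an explicit ordering before Skip/Take can be translated.
public List<User> GetPage(int pageIndex, int pageSize)
{
    using (var datastore = new myEntities())
    {
        return datastore.userInfoes
            .OrderBy(u => u.id)
            .Skip(pageIndex * pageSize)
            .Take(pageSize)
            .Select(u => new User { UserIndex = u.id, FirstName = u.firstName, LastName = u.lastName })
            .ToList();
    }
}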
As for the Department objects: does your UserInfo have a foreign key to the department? If so, you should just have userInfo.Department available to you, and no joins are needed.
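Assuming the model does expose such a navigation property, the join disappears from the query; a sketch:

var result = (from user in datastore.userInfoes
              select new User()
              {
                  UserIndex = user.id,
                  FirstName = user.firstName,
                  LastName = user.lastName,
                  Username = user.userName,
                  Department = new Department()
                  {
                      DepartmentId = user.departmentID.Value,
                      DepartmentName = user.Department.departmentName, // navigation property; EF adds the join for you
                  },
              }).ToDictionary(x => x.UserIndex, x => x);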
If you bind the department data to the grid columns, why have a property of the Department type? I assume your User class is something you bind to the UI. Flatten it out into:
class User
{
    public string Username { get; set; }
    public int UserIndex { get; set; }
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public int DepartmentId { get; set; }
    public string DepartmentName { get; set; }
}
What is the purpose of GetAll()? You return a dictionary, and it feels like you need to enable lookups by id. Or do you use the result to enumerate the users?
For lookups, consider talking to the database to get a single user's data when needed. Implement caching on top if it makes sense.
For enumeration, do not return a dictionary; that is an all-in-memory object. Return an IEnumerable with yielded (paged?) results, or even better an IQueryable, so that calling GetAll() doesn't execute the SQL call right away and the calling code can scope the call down by adding the necessary filters.
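A sketch of that last option; note that the context can no longer be wrapped in a using inside the method, since the query must still be alive when the caller enumerates it:

private readonly myEntities datastore = new myEntities(); // disposed with the owning manager instead

public IQueryable<User> GetAll()
{
    return datastore.userInfoes.Select(u => new User
    {
        UserIndex = u.id,
        FirstName = u.firstName,
        LastName = u.lastName,
        Username = u.userName,
    });
}

// The caller composes filters/paging that still execute as SQL:
// var page = manager.GetAll().Where(u => u.LastName == "Smith").Take(50).ToList();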

Is it possible to map Linq queries from one Data Model to a query over a different data model?

I would like to provide an OData interface for my application. The examples that I have seen use EF to map the LINQ queries to SQL queries.
IMHO this approach pretty much exposes the physical database model to the world (I know EF/NH give some flexibility, but it is limited).
What I would like the be able to do, is the following:
Define my Data Contract via some DTOs.
Have an OData service that will let users query over my Data Contract DTOs.
Have some translation layer to translate the queries over the DTOs into queries over, let's say, an EF model or NH.
Execute the translated query.
Map the results back to my Data Contracts.
Am I out of my mind or is there a solution to this problem?
I have two models: the "contract" model and the "persisted" model. The persisted model is what Entity Framework is mapped to. The Get method returns an IQueryable of the contract type, which is just something along the lines of:
return dbContext.PersistedCustomers.Select(x => new Customer { Name = x.OtherName, ... });
At least when using DbContext (as opposed to ObjectContext), Where criteria based on the contract model get translated automatically into Where criteria on the persisted model and executed against the database. Hopefully the differences between the two aren't so complex that you need some weird data massaging. I'm sure there are limits to the reversal it does.
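In other words, callers can keep composing over the contract type and the predicate still executes as SQL; a small usage sketch (GetCustomers is a hypothetical name for the Get method above):

// 'Name' is rewritten to 'OtherName' when EF composes the expression trees.
var startsWithA = GetCustomers()
    .Where(c => c.Name.StartsWith("A"))
    .ToList();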
One way of doing it would be to create a ViewModel that represents your Model and then use AutoMapper to map between them. You can use it like this:
var address = _Context.Addresses.Where(p => p.AddressID == addressID).Single();
AddressVM result = Mapper.Map<AddressVM>(address);
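For Mapper.Map to work, the Address -> AddressVM map has to be registered once at startup; with AutoMapper's classic static API that looks something like this (newer AutoMapper versions use an instance-based MapperConfiguration instead):

// One-time configuration, e.g. in Application_Start.
Mapper.Initialize(cfg => cfg.CreateMap<Address, AddressVM>());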

Performance issue using foreach in LINQ

I am using an IList<Employee> where I get more than 5,000 records by using LINQ. What could be better? empdetailsList has 5,000 entries.
Example:
foreach (Employee emp in empdetailsList)
{
    // No need to new up an Employee first; GetFeeDetails returns one.
    Employee employee = Details.GetFeeDetails(emp.Emplid);
}
The above example takes a lot of time iterating over each employee's details, where I need to get the corresponding fees list.
Can anybody suggest what to do?
Linq to SQL/Linq to Entities use a deferred execution pattern. As soon as you call For Each or anything else that indirectly calls GetEnumerator, that's when your query gets translated into SQL and performed against the database.
The trick is to make sure your query is completely and correctly defined before that happens. Use Where(...), and the other Linq filters to reduce as much as possible the amount of data the query will retrieve. These filters are built into a single query before the database is called.
Linq to SQL/Linq to Entities also both use Lazy Loading. This is where if you have related entities (like Sales Order --> has many Sales Order Lines --> has 1 Product), the query will not return them unless it knows it needs to. If you did something like this:
Dim orders = entities.SalesOrders
For Each o In orders
    For Each ol In o.SalesOrderLines
        Console.WriteLine(ol.Product.Name)
    Next
Next
You will get awful performance, because at the time of calling GetEnumerator (the start of the For Each), the query engine doesn't know you need the related entities, so it "saves time" by ignoring them. If you observe the database activity, you'll then see hundreds or thousands of database round trips as each related entity is retrieved one at a time.
To avoid this problem, if you know you'll need related entities, use the Include() method in Entity Framework. If you've got it right, when you profile the database activity you should only see a single query being made, and every item being retrieved by that query should be used for something by your application.
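In C# (matching the question's code), a minimal sketch of the eager-loaded version of the loop above, using the string overload of Include and the entity names from the example:

// One SQL round trip: orders, their lines, and each line's product.
var orders = entities.SalesOrders.Include("SalesOrderLines.Product");

foreach (var o in orders)
    foreach (var ol in o.SalesOrderLines)
        Console.WriteLine(ol.Product.Name);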
If the call to Details.GetFeeDetails(emp.Emplid) involves another round trip of some sort, then that's the issue. In that case I would suggest altering your query to return the fee details with the original IList<Employee> query.

LINQ Projection in Entity Framework

I posted a couple of questions about filtering in an eager-loading query, and I guess EF does not support filtering inside of an Include statement, so I came up with this.
I want to perform a simple query where I get a ChildProduct by SKU number and its PriceTiers filtered by IsActive.
Dim ChildProduct = ChildProductRepository.Query.
    Where(Function(x) x.Sku = Sku).
    Select(Function(x) New With {
        .ChildProduct = x,
        .PriceTiers = x.PriceTiers.
            Where(Function(y) y.IsActive).
            OrderBy(Function(y) y.QuantityStart)
    }).Select(Function(x) x.ChildProduct).Single
Is there a more efficient way of doing this? Am I on the right track at all? It does work.
Another thing I really don't understand is why this works. Do you just have to load an object graph, and EF will pick up on that and see that these collections belong to the ChildProduct even though they are inside of an anonymous type?
Also, what are the standards for formatting a long LINQ expression?
Is there a more efficient way of doing this? Am I on the right track at all?
Nope, that's about the way you do this in EF and yes, you're on the right track.
Another thing I really don't understand is why does this work?
This is considered to be a bit of a hack, but it works because EF analyzes the whole expression and generates one query (it would look about the same as if you just used Include, but with the PriceTiers collection filtered). As a result, you get your ChildProducts with the PriceTiers populated (and correctly filtered). Obviously, you don't need the PriceTiers property of your anonymous class (you discard it by just selecting x.ChildProduct), but adding it to the LINQ query tells EF to add the join and the extra where to the generated SQL. As a result, the ChildProduct contains all you need.
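For reference, the same pattern in C# (a sketch using the question's names; AsEnumerable() makes the switch from the store query to LINQ to Objects explicit, and relationship fix-up fills ChildProduct.PriceTiers as long as change tracking is on):

var childProduct = ChildProductRepository.Query
    .Where(x => x.Sku == sku)
    .Select(x => new
    {
        ChildProduct = x,
        // Filtering here becomes part of the single generated SQL query.
        PriceTiers = x.PriceTiers
            .Where(y => y.IsActive)
            .OrderBy(y => y.QuantityStart)
    })
    .AsEnumerable()              // materialize the anonymous results
    .Select(x => x.ChildProduct) // discard the wrapper
    .Single();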
If this functionality is critical, create a stored procedure and link Entity Framework to it.
