Getting max value on server (Entity Framework) - performance

I'm using EF Core but I'm not really an expert with it, especially when it comes to details like querying tables in a performant manner...
So what I try to do is simply get the max-value of one column from a table with filtered data.
What I have so far is this:
protected override void ReadExistingDBEntry()
{
using Model.ResultContext db = new();
// Filter Tabledata to the Rows relevant to us. the whole Table may contain 0 rows or millions of them
IQueryable<Measurement> dbMeasuringsExisting = db.Measurements
.Where(meas => meas.MeasuringInstanceGuid == Globals.MeasProgInstance.Guid
&& meas.MachineId == DBMatchingItem.Id);
if (dbMeasuringsExisting.Any())
{
// the max value we're interested in. Still dbMeasuringsExisting could contain millions of rows
iMaxMessID = dbMeasuringsExisting.Max(meas => meas.MessID);
}
}
The equivalent SQL to what I want would be something like this.
select max(MessID)
from Measurement
where MeasuringInstanceGuid = Globals.MeasProgInstance.Guid
and MachineId = DBMatchingItem.Id;
While the above code works (it returns the correct value), I think it has a performance issue when the database table is getting larger, because the max filtering is done at the client-side after all rows are transferred, or am I wrong here?
How to do it better? I want the database server to filter my data. Of course I don't want any SQL script ;-)

This can be addressed by typing the return as nullable so that you do not get a returned error and then applying a default value for the int. Alternatively, you can just assign it to a nullable int. Note, the assumption here of an integer return type of the ID. The same principal would apply to a Guid as well.
int MaxMessID = dbMeasuringsExisting.Max(p => (int?)p.MessID) ?? 0;
There is no need for the Any() statement as that causes an additional trip to the database which is not desirable in this case.

Related

Slow query over large collection

I'm working on an audit log which saves sessions in RavenDB. Initially, the website for querying the audit logs was responsive enough but as the amount of logged data has increased, the search page became unusable (it times out before returning using default settings - regardless of the query used). Right now we have about 45mil sessions in the table that gets queried but steady state is expected to be around 150mil documents.
The problem is that with this much live data, playing around to test things has become impractical. I hope some one can give me some ideas what would be the most productive areas to investigate.
The index looks like this:
public AuditSessions_WithSearchParameters()
{
Map = sessions => from session in sessions
select new Result
{
ApplicationName = session.ApplicationName,
SessionId = session.SessionId,
StartedUtc = session.StartedUtc,
User_Cpr = session.User.Cpr,
User_CprPersonId = session.User.CprPersonId,
User_ApplicationUserId = session.User.ApplicationUserId
};
Store(r => r.ApplicationName, FieldStorage.Yes);
Store(r => r.StartedUtc, FieldStorage.Yes);
Store(r => r.User_Cpr, FieldStorage.Yes);
Store(r => r.User_CprPersonId, FieldStorage.Yes);
Store(r => r.User_ApplicationUserId, FieldStorage.Yes);
}
The essense of the query is this bit:
// Query input paramters
var fromDateUtc = fromDate.ToUniversalTime();
var toDateUtc = toDate.ToUniversalTime();
sessionQuery = sessionQuery
.Where(s =>
s.ApplicationName == applicationName &&
s.StartedUtc >= fromDateUtc &&
s.StartedUtc <= toDateUtc
);
var totalItems = Count(sessionQuery);
var sessionData =
sessionQuery
.OrderByDescending(s => s.StartedUtc)
.Skip((page - 1) * PageSize)
.Take(PageSize)
.ProjectFromIndexFieldsInto<AuditSessions_WithSearchParameters.ResultWithAuditSession>()
.Select(s => new
{
s.SessionId,
s.SessionGroupId,
s.ApplicationName,
s.StartedUtc,
s.Type,
s.ResourceUri,
s.User,
s.ImpersonatingUser
})
.ToList();
First, to determine the number of pages of results, I count the number of results in my query using this method:
private static int Count<T>(IRavenQueryable<T> results)
{
RavenQueryStatistics stats;
results.Statistics(out stats).Take(0).ToArray();
return stats.TotalResults;
}
This turns out to be very expensive in itself, so optimizations are relevant both here and in the rest of the query.
The query time is not related to the amount of result items in any relevant way. If I use a different value for the applicationName parameter than any of the results, it is just as slow.
One area of improvement could be to use sequential IDs for the sessions. For reasons not relevant to this post, I found it most practical to use guid based ids. I'm not sure if I can easily change IDs of the existing values (with this much data) and I would prefer not to drop the data (but might if the expected impact is large enough). I understand that sequential ids result in better behaving b-trees for the indexes, but I have no idea how significant the impact is.
Another approach could be to include a timestamp in the id and query for documents with ids starting with the string matching enough of the time to filter the result. An example id could be AuditSessions/2017-12-31-24-31-42/bc835d6c-2fba-4591-af92-7aab96339d84. This also requires me to update or drop all the existing data. This of course also has the benefits of mostly sequential ids.
A third approach could be to move old data into a different collection over time, in recognition of the fact that you would most often look at the most recent data. This requires a background job and support for querying across collection time boundaries. It also has the issue that the collection with the old sessions is still slow if you need to access it.
I'm hoping there is something simpler than these solutions, such as modifying the query or the indexed fields in a way that avoids a lot of work.
At a glance, it is probably related to the range query on the StartedUtc.
I'm assuming that you are using exact numbers, so you have a LOT of distinct values there.
If you can, you can dramatically reduce the cost by changing the index to index on a second / minute granularity (which is usually what you are querying on), and then use Ticks, which allow us to use numeric range query.
StartedUtcTicks = new Datetime(session.StartedUtc.Year, session.StartedUtc.Month, session.StartedUtc.Day, session.StartedUtc.Hour, session.StartedUtc.Minute, session.StartedUtc.Second).Ticks,
And then query by the date ticks.

How to combine collection of linq queries into a single sql request

Thanks for checking this out.
My situation is that I have a system where the user can create custom filtered views which I build into a linq query on the request. On the interface they want to see the counts of all the views they have created; pretty straight forward. I'm familiar with combining multiple queries into a single call but in this case I don't know how many queries I have initially.
Does anyone know of a technique where this loop combines the count queries into a single query that I can then execute with a ToList() or FirstOrDefault()?
//TODO Performance this isn't good...
foreach (IMeetingViewDetail view in currentViews)
{
view.RecordCount = GetViewSpecificQuery(view.CustomFilters).Count();
}
Here is an example of multiple queries combined as I'm referring to. This is two queries which I then combine into an anonymous projection resulting in a single request to the sql server.
IQueryable<EventType> eventTypes = _eventTypeService.GetRecords().AreActive<EventType>();
IQueryable<EventPreferredSetup> preferredSetupTypes = _eventPreferredSetupService.GetRecords().AreActive<EventPreferredSetup>();
var options = someBaseQuery.Select(x => new
{
EventTypes = eventTypes.AsEnumerable(),
PreferredSetupTypes = preferredSetupTypes.AsEnumerable()
}).FirstOrDefault();
Well, for performance considerations, I would change the interface from IEnumerable<T> to a collection that has a Count property. Both IList<T> and ICollection<T> have a count property.
This way, the collection object is keeping track of its size and you just need to read it.
If you really wanted to avoid the loop, you could redefine the RecordCount to be a lazy loaded integer that calls GetViewSpecificQuery to get the count once.
private int? _recordCount = null;
public int RecordCount
{
get
{
if (_recordCount == null)
_recordCount = GetViewSpecificQuery(view.CustomFilters).Count;
return _recordCount.Value;
}
}

NHibernate IQueryable doesn't seem to delay execution

I'm using NHibernate 3.2 and I have a repository method that looks like:
public IEnumerable<MyModel> GetActiveMyModel()
{
return from m in Session.Query<MyModel>()
where m.Active == true
select m;
}
Which works as expected. However, sometimes when I use this method I want to filter it further:
var models = MyRepository.GetActiveMyModel();
var filtered = from m in models
where m.ID < 100
select new { m.Name };
Which produces the same SQL as the first one and the second filter and select must be done after the fact. I thought the whole point in LINQ is that it formed an expression tree that was unravelled when it's needed and therefore the correct SQL for the job could be created, saving my database requests.
If not, it means all of my repository methods have to return exactly what is needed and I can't make use of LINQ further down the chain without taking a penalty.
Have I got this wrong?
Updated
In response to the comment below: I omitted the line where I iterate over the results, which causes the initial SQL to be run (WHERE Active = 1) and the second filter (ID < 100) is obviously done in .NET.
Also, If I replace the second chunk of code with
var models = MyRepository.GetActiveMyModel();
var filtered = from m in models
where m.Items.Count > 0
select new { m.Name };
It generates the initial SQL to retrieve the active records and then runs a separate SQL statement for each record to find out how many Items it has, rather than writing something like I'd expect:
SELECT Name
FROM MyModel m
WHERE Active = 1
AND (SELECT COUNT(*) FROM Items WHERE MyModelID = m.ID) > 0
You are returning IEnumerable<MyModel> from the method, which will cause in-memory evaluation from that point on, even if the underlying sequence is IQueryable<MyModel>.
If you want to allow code after GetActiveMyModel to add to the SQL query, return IQueryable<MyModel> instead.
You're running IEnumerable's extension method "Where" instead of IQueryable's. It will still evaluate lazily and give the same output, however it evaluates the IQueryable on entry and you're filtering the collection in memory instead of against the database.
When you later add an extra condition on another table (the count), it has to lazily fetch each and every one of the Items collections from the database since it has already evaluated the IQueryable before it knew about the condition.
(Yes, I would also like to be the extensive extension methods on IEnumerable to instead be virtual members, but, alas, they're not)

Is possible in siena to order by a calculated field?

I'm trying to get a query returned ordered on a filed which is calculated in Play.
This is the query I'm using.
return all().order("points").fetch();
where points is defined as
public Integer points;
and is retrieve thanks to this getter
public int getPoints(){
List<EventVote> votesP = votes.filter("isPositive", true).fetch();
List<EventVote> votesN = votes.filter("isPositive", false).fetch();
this.points= votesP.size()-votesN.size();
return this.points;
}
The getter is correctly called when I do
int votes=objectWithPoints.points;
I have the feeling I'm pretending a bit too much out of siena, but I would love this to work (or some similar code). Currently it just skips the order condition. Ordering on any other field works correctly.
I think you're true when you say you await a bit too much :)
The Siena query all().order("points").fetch() performs a request to the DB.
So it will order the values stored into the DB not into your program.
From what you say, I see that you have a getter getPoints which computes a value.
Yet, if you don't store this value into the database, the ordering can't be performed by Siena.
So either you compute the value, set it in your object and save the object to the DB.
objectWithPoints.points = getPoints();
objectWithPoints.save();
Either you order values by yourself in your program after computing them.

LINQ performance

I am reading records from database and check some conditions and store in List<Result>. Result is a class. Then performing LINQ query in List<Result> like grouping, counting etc. So there may be chance that min 50,000 records in List<Result>, so in this whether its better to go for LINQ (or) reinsert the records to db and perform the queries?
Why not store it in an IQueryable instead of a List and using LINQ to SQL or LINQ to Entities, the actual dataset will never be pulled into memory, and the queries will actually go down to the database to run.
Example:
Database db = new Database(); // this is what L2E gives you...
var children = db.Person.Where(p => p.Age < 21); // no actual database query performed
// will do : "select count(*) from Person where Age < 21"
int numChildren = children.Count();
var grouped = children.GroupBy(p => p.Age); // no actual query
int youngest = children.Min(p => p.Age); // performs query
int numYoungest = youngest.Count(p => p.Age == youngest); // performs query.
var youngestNames = children.Where(p => p.Age == youngest).Select(p => p.Name); // no query
var anArray = youngestNames.ToArray(); // performs query
string names = string.join(", ", anArray); // no query of course
I'm currently asking the same kind of thing right now. I don't really know the exact answer either, but from what I know, LINQ is not well know to be fast on objects. Also, since List is not indexed, when you do advance query on them, the backend will probably need to do a lot of computing to get what you asked for. Also, this code is generic, so it means slower execution.
The best thing would be, if you are able, do everything in one query, or even do a startproc to do your processing. Or another possibility, if you are always checking the same initial condition, create a view and do your query directly on this table (instead of reinserting from the client). I think that if you have more than 50,000 results, probably using a list is not a good idea (Memory and Performance).
It probably doesn't answer your question directly, but other than doing benchmark, you won't know. It really depends on what you are doing with the data.

Resources