I have a parent-child table relationship. In a repository, I'm doing this:
return (from p in _ctx.Parents
.Include( "Children" )
select p).AsQueryable<Parent>();
Then in a filter, I want to filter the parent by a list of child ids:
IQueryable<Parent> qry; // from above
List<int> ids; // huge list (8500)
var filtered =
from p in qry.Where( p => p.Children.Any(c => ids.Contains(c.ChildId)) ) select p;
My list of ids is huge. This generates a simple SQL statement that does have a huge list of ids "in (1,2,3...)", but it takes no appreciable time to run by itself. EF, however, takes about a full minute just to generate the statement. I proved this by setting a breakpoint and calling:
((ObjectQuery<Parent>)filtered).ToTraceString();
This takes all the time. Is the problem in my last linq statement? I don't know any other way to do the equivalent of Child.ChildId in (ids). And even if my linq statement is bad, how in the world should this take so long?
Unfortunately, building queries in Linq to Entities is a pretty heavy hit, but I've found it usually saves time due to the ability to build queries from their component pieces before actually hitting the database.
It is likely that the way they implement the Contains method uses an algorithm that assumes that Contains is generally used for a relatively small set of data. According to my tests, the amount of time it takes per ID in the list begins to skyrocket at around 8000.
So it might help to break your query into pieces. Group them into groups of 1000 or less, and concatenate a bunch of Where expressions.
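// Groups by id value (id / 1000), so each group holds at most 1000 distinct ids.
var idGroups = ids.GroupBy(i => i / 1000).Select(g => g.ToList()).ToList();
var q = Parents.Include("Children").AsQueryable();
// Seed with an empty query; seeding with q itself would return every parent.
var newQ = idGroups.Aggregate(q.Where(p => false),
(s, g) => s.Concat(
q.Where(w => w.Children.Any(wi => g.Contains(wi.ChildId)))));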
This speeds things up significantly, but it might not be significantly enough for your purposes, in which case you'll have to resort to a stored procedure. Unfortunately, this particular use case just doesn't fit into "the box" of expected Entity Framework behavior. If your list of ids could begin as a query from the same Entity Context, Entity Framework would have worked fine.
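As a side note on the grouping step: grouping by id value works when the ids are dense, but sparse ids end up in many small groups. A minimal in-memory sketch of batching by position instead is shown below. The `Batch` helper and the sample ids are hypothetical, and nothing here touches Entity Framework; it only illustrates how the list could be split before the Where expressions are built.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Split a list into consecutive batches of at most batchSize items,
// grouping by position in the list rather than by id value.
static List<List<int>> Batch(IReadOnlyList<int> source, int batchSize) =>
    source.Select((value, index) => (value, index))
          .GroupBy(x => x.index / batchSize, x => x.value)
          .Select(g => g.ToList())
          .ToList();

// Hypothetical sparse ids: 8500 values spaced 7 apart.
var ids = Enumerable.Range(1, 8500).Select(i => i * 7).ToList();
var batches = Batch(ids, 1000);

Console.WriteLine($"{batches.Count} batches, last one has {batches[^1].Count} ids");
```

Each batch can then be used in its own Where expression, as in the Aggregate call above.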
Re-write your query in Lambda syntax and it will cut the time by as much as 3 seconds (or at least it did for my EF project).
return _ctx.Parents.Include( "Children" ).AsQueryable<Parent>();
and
IQueryable<Parent> qry; // from above
List<int> ids; // huge list (8500)
var filtered = qry.Where( p => p.Children.Any(c => ids.Contains(c.ChildId)) );
I have inherited a project that uses dynamic proxies for EF6. The repository returns an IQueryable<proxyObject>.
I can watch SQL profiler and see it returns 6000+ records.. great! so far so good.
At this point I then create 3 lists against that dataset (say 3 X 2000 records)
Because each of those has filter logic I can also see a call to the db to return the list. Great! So far 4 calls to the DB and 6000 records.
THE PROBLEM
Every time I run this group by... I get 2000 calls to the DB! One call for each record in Sublist. My guess is that it needs to inflate the object each time? It's terribly slow, however.
var lts = Sublist.GroupBy(p => p.proxyObject.ProvinceCode)
.Select(n => new CountModel()
{
TypeName = n.Key,
ItemCount = n.Count()
}).ToList();
PresentationModel.AddRange(lts);
I ended up resolving this issue by selecting into a new POCO directly. What was happening was that, because the entity was a combination of a bunch of underlying repos, EF was creating a ton of individual queries and then rolling them up.
var last3Months =
ProxyEntity.Where(l => DateTime.Now <= l.EffectiveDate)
.Select(l => new ModelMicro()
{
x= l.x,
y= l.y,
z= l.z
});
This resulted in one single call to the db returning 6000 rows that I could then shift to where I needed them. I then grouped on z in my underlying return. There are most likely ways to make it better.
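A minimal sketch of that last step, assuming the rows have already been projected into plain values and loaded with a single query. The tuple shape and the sample province codes are hypothetical stand-ins for the POCO; the grouping runs in memory, so it triggers no further database calls.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical rows, standing in for the projected POCOs returned by one query.
var rows = new List<(string ProvinceCode, int Value)>
{
    ("ON", 1), ("QC", 2), ("ON", 3), ("BC", 4), ("ON", 5)
};

// Group and count in memory; no lazy-loading proxies are involved here.
var counts = rows
    .GroupBy(r => r.ProvinceCode)
    .Select(g => (TypeName: g.Key, ItemCount: g.Count()))
    .OrderBy(c => c.TypeName, StringComparer.Ordinal)
    .ToList();

foreach (var c in counts)
    Console.WriteLine($"{c.TypeName}: {c.ItemCount}");
```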
I have searched endlessly for a solution here and cannot find one, please help!
I have three objects (A, B, and C). A has a lookup to B, and B is the master to C (detail). Both A and C have many records related to each B record.
I want to have a job run that gets a subset of records from object C (it will usually be around 5,000 records). Then go through each of those and get the records on Object A that lookup to the same Object B record, summarize an Object A number field, and put that on the C record.
I have successfully gotten this to work on a small scale, with fewer than 100 Object C records. But each Object C record requires a new SOQL query, since I am iterating through them in a for loop after I get all the Object C records. Plus, I know it is not best practice to ever have a query in a loop.
How can I get this to work? Since the records share the relationship with Object B, is there another way to get the data from the Object A records that match? Or is there some way to pull two lists, one Object C and one Object A. Then summarize the Object A records and line the lists up some how?
Thanks in advance!
Code:
public class nightlyJob {
public static void updateNumbers(){
integer I = 29;
List<ObjectC__c> CUpdateList = new List<ObjectC__c>();
List<ObjectC__c> CpullList =
[SELECT ID, Index__c, ObjectB__r.id
FROM ObjectC__c
WHERE Index__c = :I];
for(ObjectC__c s : CpullList){
List<ObjectA__c> AList =
[SELECT ObjectB__c, Number__c
FROM ObjectA__c
WHERE ObjectB__c = :s.ObjectB__r.Id];
decimal NumSum = 0;
for(ObjectA__c a : AList){
NumSum = a.Number__c + NumSum;
}
s.Num__c = NumSum;
CUpdateList.add(s);
}
update CUpdateList;
}
}
It looks like you are really missing several fundamental concepts at the moment.
The biggest problem you are up against in SFDC development is that "database" operations are very expensive and are strictly limited. It's not just a matter of "best practice": if in a single transaction you exceed these limits -- number of SOQL calls, number of records returned, number of records updated, number of DML statements, etc. -- your transaction will fail. For details, search online for "Salesforce Execution Governors and Limits".
You can write code that works within these limitations, but there is a bit of a learning curve.
First, learn to use collections with SOQL queries to get your SOQL queries out of loops. This is a.k.a. "bulkification" and it is fundamental to SFDC development:
List<ObjectC__c> CpullList =
[SELECT ID, Index__c, ObjectB__r.id
FROM ObjectC__c
WHERE Index__c = :I];
// Create a map with the results of this query.
// key=ObjectC__c.Id, value = ObjectC__c record
Map<Id, ObjectC__c> objCmap = new Map<Id, ObjectC__c>(CpullList);
// Build a set of all the Object_B id's from this result set
Set<Id> objBids = new Set<Id>();
for (ObjectC__c record : CpullList) {
objBids.add(record.ObjectB__r.id);
}
// Now you can use only one SOQL query instead of a loop
List<ObjectA__c> AList = [SELECT ObjectB__c, Number__c
FROM ObjectA__c
WHERE ObjectB__c in:objBids];
Next, use "SOQL aggregate functions" whenever you can. Example: in your code here, you could use "SUM()" and "group by" instead of performing these calculations with loops:
// Get the sum of ObjectA__c.Number__c for each Object B in objBIds
AggregateResult[] groupedResults = [select ObjectB__c,
sum(Number__c) sumA
from ObjectA__c
where ObjectB__c in: objBids
group by ObjectB__c];
for (AggregateResult ar : groupedResults) {
System.debug('Object B Id' + ar.get('Objectb__c'));
System.debug('Sum of ObjectA__c.Number__c' + ar.get('sumA'));
// Here, you might want to build a Map<Id, Integer> sumAmap:
// key=Object B ID, value=sumA
// and then use it along with objCmap to build a collection of Object C's
// for your update statement...
}
You can continue this process and apply these ideas to make the code more efficient.
But even after you have your methods working as efficiently as possible, you may still run into limits due to the number of records you're dealing with. At that point, you will need to learn about the Batchable interface, the Queueable interface and @future calls (ways to process a larger number of records, split across transactions). That's really too much information to cover in a single SO answer.
I have this Linq query that translates very oddly to SQL. I get the correct results but there must be a better way. So question 1 is:
Why is it that in SQL I get no group by, no count and all of the
columns are returned instead of just 2; and then the results in C# are correct? (I checked with profiler).
and question 2 is:
I would like to modify the query slightly so that I get also the
results where count is 0. At the moment I only get where counts > 0
because of the group by.
LINQ:
List<Tuple<string, int>> countPerType = db1.Audits
.OrderBy(p => p.CreatedBy)
.GroupBy(o => new { o.Type, o.CreatedBy })
.ToList()
.Select(g => new Tuple<string, int>(g.Select(f => f.CreatedBy + ',' + f.Type).FirstOrDefault(),
(int?)g.Count() ?? 0))
.ToList();
Note that if I remove the .ToList() in the middle, I get exception "only parameterless constructors and initializers are supported in linq to entities".
Thanks for your input
You run into several problems. I think the cause of this is that you aren't aware of the difference between queries that are IEnumerable (AsEnumerable) and queries that are IQueryable (AsQueryable).
An IEnumerable query contains all the information needed to enumerate over the elements in the query. The query will be executed by your process.
An IQueryable query contains an Expression and a Provider. The Provider knows who will execute the query, and how to communicate with this executor. Quite often the executor will be a database, but it can be other things, like internet queries, JSON files etc.
In your case the executor will be a database, and the language will be SQL.
When the GetEnumerator() function of your IQueryable is called, the Provider is ordered to translate the Expression into the language that the executor knows. The translated query is sent to the executor and the returned data is put into an Enumerator (not IEnumerable!)
Of course SQL does not know what a System.Tuple is, nor does it know functions like String.operator+
Therefore your Provider can't translate your expression into SQL. That is the reason you have to do your first ToList()
You can't use any of your own functions in an IQueryable query, and only a limited number of .NET functions.
See this list of supported and unsupported Linq methods
It is not advisable to use ToList() at this stage of your query, because it enumerates all elements of your sequence, while in fact you only need an enumerator. It could be that during the rest of your query you'd only want a few elements. In that case it would be a waste to enumerate over all of them to create a list, and then to enumerate again to do the rest of your LINQ.
Instead of ToList() use Enumerable.AsEnumerable(). It does not enumerate anything by itself: it only makes the rest of the query run as LINQ to Objects in your process, so the data is fetched from the database when you start enumerating. This will allow you to call local functions in the rest of your query.
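The difference can be observed with a small in-memory sketch. The `Source` iterator below is a hypothetical stand-in for a remote sequence (it is not Entity Framework); it counts how many elements are actually produced when only the first three are wanted.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

int produced = 0;

// Hypothetical stand-in for a remote source: counts every element it yields.
IEnumerable<int> Source()
{
    for (int i = 1; i <= 1000; i++)
    {
        produced++;
        yield return i;
    }
}

// ToList() enumerates the whole sequence before Take(3) runs.
produced = 0;
var viaList = Source().ToList().Take(3).ToArray();
int producedWithToList = produced;

// AsEnumerable() enumerates nothing by itself; Take(3) pulls only what it needs.
produced = 0;
var viaEnumerable = Source().AsEnumerable().Take(3).ToArray();
int producedWithAsEnumerable = produced;

Console.WriteLine($"ToList: {producedWithToList} produced, AsEnumerable: {producedWithAsEnumerable} produced");
```

Both variants return the same three elements; only the amount of work done to get them differs.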
Another problem is that you transport way more data to local memory than you plan to use. One of the slower parts of database queries is the transport of data to your process. You should minimize the amount of data.
You took all Audits, and created groups of Audits that have the same values for (Type, CreatedBy). In other words: all Audits in the same group have the same values for (Type, CreatedBy). This value is also the Key of the group.
You don't want all Audits locally, you only want the Key of the group and the number of elements in this group (= the number of audits that have (Type, CreatedBy) equal to the key).
This is the only data you need to transport to local memory: Type, CreatedBy and the number of audits in the group:
var result = db1.Audits.GroupBy(o => new { o.Type, o.CreatedBy })
.Select(group => new
{
Type = group.Key.Type,
CreatedBy = group.Key.CreatedBy,
AuditCount = group.Count(),
})
.OrderBy(item => item.CreatedBy)
// the data that is left is the data you need locally
// bring to local memory:
.AsEnumerable()
// if you want you can put Type and CreatedBy into one string
.Select(item => new
{
AuditType = item.Type + item.CreatedBy,
AuditCount = item.AuditCount,
});
I chose not to put the result in a Tuple, because you would lose the help from the compiler if you mix up fields. But if you really want a Tuple, suit yourself.
I'm working on an audit log which saves sessions in RavenDB. Initially, the website for querying the audit logs was responsive enough but as the amount of logged data has increased, the search page became unusable (it times out before returning using default settings - regardless of the query used). Right now we have about 45mil sessions in the table that gets queried but steady state is expected to be around 150mil documents.
The problem is that with this much live data, playing around to test things has become impractical. I hope someone can give me some ideas about which areas would be the most productive to investigate.
The index looks like this:
public AuditSessions_WithSearchParameters()
{
Map = sessions => from session in sessions
select new Result
{
ApplicationName = session.ApplicationName,
SessionId = session.SessionId,
StartedUtc = session.StartedUtc,
User_Cpr = session.User.Cpr,
User_CprPersonId = session.User.CprPersonId,
User_ApplicationUserId = session.User.ApplicationUserId
};
Store(r => r.ApplicationName, FieldStorage.Yes);
Store(r => r.StartedUtc, FieldStorage.Yes);
Store(r => r.User_Cpr, FieldStorage.Yes);
Store(r => r.User_CprPersonId, FieldStorage.Yes);
Store(r => r.User_ApplicationUserId, FieldStorage.Yes);
}
The essence of the query is this bit:
// Query input parameters
var fromDateUtc = fromDate.ToUniversalTime();
var toDateUtc = toDate.ToUniversalTime();
sessionQuery = sessionQuery
.Where(s =>
s.ApplicationName == applicationName &&
s.StartedUtc >= fromDateUtc &&
s.StartedUtc <= toDateUtc
);
var totalItems = Count(sessionQuery);
var sessionData =
sessionQuery
.OrderByDescending(s => s.StartedUtc)
.Skip((page - 1) * PageSize)
.Take(PageSize)
.ProjectFromIndexFieldsInto<AuditSessions_WithSearchParameters.ResultWithAuditSession>()
.Select(s => new
{
s.SessionId,
s.SessionGroupId,
s.ApplicationName,
s.StartedUtc,
s.Type,
s.ResourceUri,
s.User,
s.ImpersonatingUser
})
.ToList();
First, to determine the number of pages of results, I count the number of results in my query using this method:
private static int Count<T>(IRavenQueryable<T> results)
{
RavenQueryStatistics stats;
results.Statistics(out stats).Take(0).ToArray();
return stats.TotalResults;
}
This turns out to be very expensive in itself, so optimizations are relevant both here and in the rest of the query.
The query time is not related to the amount of result items in any relevant way. If I use a different value for the applicationName parameter than any of the results, it is just as slow.
One area of improvement could be to use sequential IDs for the sessions. For reasons not relevant to this post, I found it most practical to use guid based ids. I'm not sure if I can easily change IDs of the existing values (with this much data) and I would prefer not to drop the data (but might if the expected impact is large enough). I understand that sequential ids result in better behaving b-trees for the indexes, but I have no idea how significant the impact is.
Another approach could be to include a timestamp in the id and query for documents with ids starting with the string matching enough of the time to filter the result. An example id could be AuditSessions/2017-12-31-24-31-42/bc835d6c-2fba-4591-af92-7aab96339d84. This also requires me to update or drop all the existing data. This of course also has the benefits of mostly sequential ids.
A third approach could be to move old data into a different collection over time, in recognition of the fact that you would most often look at the most recent data. This requires a background job and support for querying across collection time boundaries. It also has the issue that the collection with the old sessions is still slow if you need to access it.
I'm hoping there is something simpler than these solutions, such as modifying the query or the indexed fields in a way that avoids a lot of work.
At a glance, it is probably related to the range query on the StartedUtc.
I'm assuming that you are using exact numbers, so you have a LOT of distinct values there.
If you can, you can dramatically reduce the cost by changing the index to index at second or minute granularity (which is usually what you are querying on), and then use Ticks, which allows us to use a numeric range query.
StartedUtcTicks = new DateTime(session.StartedUtc.Year, session.StartedUtc.Month, session.StartedUtc.Day, session.StartedUtc.Hour, session.StartedUtc.Minute, session.StartedUtc.Second).Ticks,
And then query by the date ticks.
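A minimal sketch of the truncation itself, independent of RavenDB: the `ToSecondTicks` helper is a hypothetical name, and it computes the same value as the DateTime constructor call above by dropping the sub-second part of the tick count. The query bounds would be truncated the same way before comparing against the indexed ticks field.

```csharp
using System;

// Truncate a timestamp to whole seconds and return its tick count.
// Timestamps within the same second map to the same indexed value.
static long ToSecondTicks(DateTime utc) =>
    utc.Ticks - (utc.Ticks % TimeSpan.TicksPerSecond);

// Two hypothetical session start times that differ only in milliseconds.
var a = new DateTime(2017, 12, 31, 23, 31, 42, 120, DateTimeKind.Utc);
var b = new DateTime(2017, 12, 31, 23, 31, 42, 980, DateTimeKind.Utc);

long ticksA = ToSecondTicks(a);
long ticksB = ToSecondTicks(b);

Console.WriteLine(ticksA == ticksB); // both fall in the same one-second bucket
```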
I am reading records from a database, checking some conditions and storing the results in a List<Result>, where Result is a class. I then run LINQ queries on the List<Result>, such as grouping and counting. There may be at least 50,000 records in the List<Result>. In that case, is it better to use LINQ, or to reinsert the records into the db and run the queries there?
Why not store it in an IQueryable instead of a List? Using LINQ to SQL or LINQ to Entities, the actual dataset will never be pulled into memory, and the queries will actually go down to the database to run.
Example:
Database db = new Database(); // this is what L2E gives you...
var children = db.Person.Where(p => p.Age < 21); // no actual database query performed
// will do : "select count(*) from Person where Age < 21"
int numChildren = children.Count();
var grouped = children.GroupBy(p => p.Age); // no actual query
int youngest = children.Min(p => p.Age); // performs query
int numYoungest = children.Count(p => p.Age == youngest); // performs query.
var youngestNames = children.Where(p => p.Age == youngest).Select(p => p.Name); // no query
var anArray = youngestNames.ToArray(); // performs query
string names = string.Join(", ", anArray); // no query of course
I'm currently asking the same kind of thing myself. I don't know the exact answer either, but from what I know, LINQ to Objects is not known for being fast. Also, since a List is not indexed, when you run advanced queries on it, the backend will probably need to do a lot of computing to get what you asked for. Also, this code is generic, which means slower execution.
The best thing would be, if you are able, to do everything in one query, or even to use a stored procedure for your processing. Another possibility, if you are always checking the same initial condition, is to create a view and run your query directly on that view (instead of reinserting from the client). I think that if you have more than 50,000 results, using a list is probably not a good idea (memory and performance).
This probably doesn't answer your question directly, but without benchmarking, you won't know. It really depends on what you are doing with the data.
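On the point that a List is not indexed: if the records do stay in memory and are queried repeatedly by the same key, one option is to build an index once with ToLookup. The sketch below uses hypothetical (Category, Value) tuples in place of the Result class, and whether it is fast enough for a given workload would still need to be benchmarked.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for List<Result>: 50,000 (Category, Value) records.
var results = Enumerable.Range(0, 50_000)
    .Select(i => (Category: "cat" + (i % 5), Value: i))
    .ToList();

// One pass over the list builds a hash-based index keyed by Category.
var byCategory = results.ToLookup(r => r.Category);

// Later queries by key no longer scan the whole list.
int countCat3 = byCategory["cat3"].Count();
int maxCat0 = byCategory["cat0"].Max(r => r.Value);
int missing = byCategory["nope"].Count(); // unknown keys yield an empty sequence

Console.WriteLine($"cat3: {countCat3} records, max of cat0: {maxCat0}, unknown key: {missing}");
```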