How to intercept LINQ filters

In LINQ to Objects, is there any way to identify which entities/objects qualified or were disqualified at each filter?
For example, let's say I have an entity called "Product" (Id, Name). If I feed 100 Products into a LINQ query that has 5 "where" conditions and get 20 Products as output, is there any way to identify which product got filtered out at which where condition?

This can probably be generalized, but you can do the following; I just don't see the use case for it.
Use ToLookup() to partition your queries. The "disqualified" items end up under the false group, and you can continue your query with the true group.
e.g.,
var numbers = Enumerable.Range(0, 100);
var p1 = numbers.ToLookup(n1 => n1 < 50);
// p1[false] -> [ 50, 51, 52, ... ]
var p2 = p1[true].ToLookup(n2 => n2 % 2 == 0);
// p2[false] -> [ 1, 3, 5, 7, ... ]
var p3 = p2[true]... // and so on
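As a sketch of the generalization hinted at above (the helper name, the tuple shape, and the product fields in the usage comment are my own, not from the original answer), you can run each item through a list of named predicates and group it under the first filter that rejects it:
using System;
using System.Collections.Generic;
using System.Linq;

static ILookup<string, T> PartitionByFilter<T>(
    IEnumerable<T> source,
    params (string Name, Func<T, bool> Predicate)[] filters)
{
    // Each item is keyed by the name of the first predicate it fails;
    // items that pass every predicate land under "passed".
    return source.ToLookup(item =>
        filters.FirstOrDefault(f => !f.Predicate(item)).Name ?? "passed");
}

// Usage (illustrative fields):
// var partitions = PartitionByFilter(products,
//     ("hasName", p => !string.IsNullOrEmpty(p.Name)),
//     ("evenId", p => p.Id % 2 == 0));
// var survivors = partitions["passed"];
// var rejectedByEvenId = partitions["evenId"];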

While everybody so far has shown you how to do the obvious thing, i.e. group your data into chunks matching one condition, nobody has addressed what you're actually asking:
identify which product got filtered at which where condition
The tough question is: how are you going to log the where condition?
It may seem trivial to do this, but you're going to fail, because you won't be able to obtain a string representation of the Func object that was used. It is a delegate, i.e. compiled code, and you'd have to reverse-engineer it at runtime to obtain the source code. That in itself is hard enough, but if the compiler chose to optimize the code you're lost.
Only if you're willing to create your own extension method(s) that use expressions instead of Funcs will you be able to log the where condition, because an expression consists of tokens that can easily be 'ToString-ed' at runtime.
For instance:
public static IEnumerable<T> WhereEx<T>(this IEnumerable<T> sequence, Expression<Func<T, bool>> condition)
{
    var logString = condition.Body.ToString();
    foreach (T item in sequence.Where(condition.Compile()))
    {
        yield return item;
        // logging hook here, this one simply dumps in LINQPad.
        string.Format("Item '{0}' meets '{1}'", item, logString).Dump();
    }
}
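For example, the query from the question could then be written like this (a sketch; the Product fields are illustrative):
var filtered = products
    .WhereEx(p => p.Id % 2 == 0)
    .WhereEx(p => p.Name.StartsWith("A"))
    .ToList();
// Each item that survives a filter is reported together with that filter's text,
// e.g. "Item 'Product 42' meets '((p.Id % 2) == 0)'".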
Now you're really intercepting the filter. But you have no clue where in the code the filter is applied. If you also want to log stack frames, performance will likely become an issue (reflection!), because it's already under pressure by compiling the expression each time. Then logging should be an async operation involving a thread-safe logging queue.
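A minimal sketch of such an asynchronous logging hook, purely as an illustration (it assumes a BlockingCollection-backed queue with one background consumer; none of this is part of the original answer):
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

static class FilterLog
{
    private static readonly BlockingCollection<string> Messages = new BlockingCollection<string>();

    // One background consumer drains the queue, so producers never block on I/O.
    private static readonly Task Consumer = Task.Run(() =>
    {
        foreach (var message in Messages.GetConsumingEnumerable())
            Console.WriteLine(message); // replace with your real logger
    });

    public static void Post(string message) => Messages.Add(message);
}
Inside WhereEx, the logging line would then call FilterLog.Post(...) instead of Dump().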


How to get a SOQL query out of a for loop

Code added.
I have searched endlessly for a solution here and cannot find one, please help!
I have three objects (A, B, and C). A has a lookup to B, and B is the master to C (detail). Both A and C have many records related to each B record.
I want to have a job run that gets a subset of records from object C (it will usually be around 5,000 records). Then go through each of those and get the records on Object A that lookup to the same Object B record, summarize an Object A number field, and put that on the C record.
I have successfully gotten this to work at a small scale, <100 Object C records. But each Object C record requires a new SOQL query, since I am iterating through them in a for loop after I get all the Object C records. I also know it is not best practice to ever have a query in a loop.
How can I get this to work? Since the records share the relationship with Object B, is there another way to get the data from the Object A records that match? Or is there some way to pull two lists, one of Object C and one of Object A, then summarize the Object A records and line the lists up somehow?
Thanks in advance!
Code:
public class nightlyJob {
    public static void updateNumbers(){
        integer I = 29;
        List<ObjectC__c> CUpdateList = new List<ObjectC__c>();
        List<ObjectC__c> CpullList =
            [SELECT ID, Index__c, ObjectB__r.id
             FROM ObjectC__c
             WHERE Index__c = :I];
        for(ObjectC__c s : CpullList){
            List<ObjectA__c> AList =
                [SELECT ObjectB__c, Number__c
                 FROM ObjectA__c
                 WHERE ObjectB__c = :s.ObjectB__r.Id];
            decimal NumSum = 0;
            for(ObjectA__c a : AList){
                NumSum = a.Number__c + NumSum;
            }
            s.Num__c = NumSum;
            CUpdateList.add(s);
        }
        update CUpdateList;
    }
}
It looks like you are really missing several fundamental concepts at the moment.
The biggest problem you are up against in SFDC development is that "database" operations are very expensive and are strictly limited. It's not just a matter of "best practice": if in a single transaction you exceed these limits -- number of SOQL calls, number of records returned, number of records updated, number of DML statements, etc. -- your transaction will fail. For details, search online for "Salesforce Execution Governors and Limits".
You can write code that works within these limitations, but there is a bit of a learning curve.
First, learn to use collections with SOQL queries to get your SOQL queries out of loops. This is also known as "bulkification" and it is fundamental to SFDC development:
List<ObjectC__c> CpullList =
    [SELECT ID, Index__c, ObjectB__r.id
     FROM ObjectC__c
     WHERE Index__c = :I];
// Create a map with the results of this query.
// key = ObjectC__c.Id, value = ObjectC__c record
Map<Id, ObjectC__c> objCmap = new Map<Id, ObjectC__c>(CpullList);
// Build a set of all the Object B ids from this result set
Set<Id> objBids = new Set<Id>();
for (ObjectC__c record : CpullList) {
    objBids.add(record.ObjectB__r.id);
}
// Now you can use only one SOQL query instead of a loop
List<ObjectA__c> AList = [SELECT ObjectB__c, Number__c
                          FROM ObjectA__c
                          WHERE ObjectB__c IN :objBids];
Next, use "SOQL aggregate functions" whenever you can. Example: in your code here, you could use "SUM()" and "group by" instead of performing these calculations with loops:
// Get the sum of ObjectA__c.Number__c for each Object B in objBIds
AggregateResult[] groupedResults = [SELECT ObjectB__c,
                                           SUM(Number__c) sumA
                                    FROM ObjectA__c
                                    WHERE ObjectB__c IN :objBids
                                    GROUP BY ObjectB__c];
for (AggregateResult ar : groupedResults) {
    System.debug('Object B Id: ' + ar.get('ObjectB__c'));
    System.debug('Sum of ObjectA__c.Number__c: ' + ar.get('sumA'));
    // Here, you might want to build a Map<Id, Decimal> sumAmap:
    //   key = Object B Id, value = sumA
    // and then use it along with objCmap to build a collection of Object C's
    // for your update statement...
}
You can continue this process and apply these ideas to make the code more efficient.
But even after you have your methods working as efficiently as possible, you may still run into limits due to the number of records you're dealing with. At that point, you will need to learn about the Batchable interface, the Queueable interface and @future calls (how to process a larger number of records, split across transactions). That's really too much information to cover in a single SO answer.

How Should Complex ReQL Queries be Composed?

Are there any best practices or ReQL features that help with composing complex ReQL queries?
In order to illustrate this, imagine a fruits table. Each document has the following structure.
{
    "id": 123,
    "name": "name",
    "colour": "colour",
    "weight": 5
}
If we wanted to retrieve all green fruits, we might use the following query.
r
.db('db')
.table('fruits')
.filter({colour: 'green'})
However, in more complex cases, we might wish to use a variety of complex command combinations. Bespoke queries could be written for each case, but they could be difficult to maintain and could violate the Don't Repeat Yourself (DRY) principle. Instead, we might wish to write queries that chain custom, reusable commands, thus allowing complex queries to be composed in a modular fashion. This might take the following form.
r
.db('db')
.table('fruits')
.custom(component)
The component could be a function which accepts the last entity in the command chain as its argument and returns an extended chain, as follows.
function component(chain)
{
    return chain
        .filter({colour: 'green'});
}
This is not so much a feature proposal as an illustration of the problem of complex queries, although such a feature does seem intuitively useful.
Personally, my own efforts in resolving this problem have involved the creation of a compose utility function. It takes an array of functions as its main argument. Each function is called, passed a part of the query chain, and is expected to return an amended version of the query chain. Once the iteration is complete, a composition of the query components is returned. This can be viewed below.
function compose(queries, parameters)
{
    if (queries.length > 1)
    {
        let composition = queries[0](parameters);
        for (let index = 1; index < queries.length; index++)
        {
            let query = queries[index];
            composition = query(composition, parameters);
        }
        return composition;
    }
    else
    {
        throw 'Must be two or more queries.';
    }
}

function startQuery()
{
    return RethinkDB;
}

function filterQuery1(query)
{
    return query.filter({name: 'Grape'});
}

function filterQuery2(query)
{
    return query.filter({colour: 'Green'});
}

function filterQuery3(query)
{
    return query.orderBy(RethinkDB.desc('created'));
}

let composition = compose([startQuery, filterQuery1, filterQuery2, filterQuery3]);
composition.run(connection);
It would be great to know whether something like this exists, whether there are best practices to handle such cases, or whether this is an area where ReQL could benefit from improvements.
In the RethinkDB docs, they state it clearly: all ReQL queries are chainable.
Queries are constructed by making function calls in the programming
language you already know. You don’t have to concatenate strings or
construct specialized JSON objects to query the database. All ReQL
queries are chainable. You begin with a table and incrementally chain
transformers to the end of the query using the . operator
You do not have to compose anything extra on top of this; it only makes your code more implicit, harder to read, and is ultimately unnecessary.
The simple way is to assign the RethinkDB query and filter to variables; any time you need to add more complex logic, add it directly to those variables, then run() the query once it is complete.
Suppose I have to search a list of products with different filter inputs and paginate the results. The following code is written in JavaScript (simple code, for illustration only).
let sorterDirection = 'asc';
let sorterColumnName = 'created_date';
var buildFilter = r.row('app_id').eq(appId).and(r.row('status').eq('public'));
// if there is no condition to start with, you could use r.expr(true)
// append every filter onto the buildFilter var if its input is provided
if (escapedKeyword != "") {
    buildFilter = buildFilter.and(r.row('name').default('').downcase().match(escapedKeyword));
}
// you may have different filters to add; do the same to append them onto buildFilter.
// start to build the query
let query = r.table('yourTableName').filter(buildFilter);
query.orderBy(r[sorterDirection](sorterColumnName))
    .slice(pageIndex * pageSize, (pageIndex * pageSize) + pageSize)
    .run();

Why is Entity Framework's AsEnumerable() downloading all data from the server?

What is the explanation for EF downloading all result rows when AsEnumerable() is used?
What I mean is that this code:
context.Logs.AsEnumerable().Where(x => x.Id % 2 == 0).Take(100).ToList();
will download all the rows from the table before passing any row to the Where() method and there could be millions of rows in the table.
What I would like it to do, is to download only enough to gather 100 rows that would satisfy the Id % 2 == 0 condition (most likely just around 200 rows).
Couldn't EF do on demand loading of rows like you can with plain ADO.NET using Read() method of SqlDataReader and save time and bandwidth?
I suppose that it does not work like that for a reason and I'd like to hear a good argument supporting that design decision.
NOTE: This is a completely contrived example and I know normally you should not use EF this way, but I found this in some existing code and was just surprised my assumptions turned out to be incorrect.
The short answer: The reason for the different behaviors is that, when you use IQueryable directly, a single SQL query can be formed for your entire LINQ query; but when you use IEnumerable, the entire table of data must be loaded.
The long answer: Consider the following code.
context.Logs.Where(x => x.Id % 2 == 0)
context.Logs is of type IQueryable<Log>. IQueryable<Log>.Where is taking an Expression<Func<Log, bool>> as the predicate. The Expression represents an abstract syntax tree; that is, it's more than just code you can run. Think of it as being represented in memory, at runtime, like this:
Lambda (=>)
    Parameters
        Variable: x
    Body
        Equals (==)
            Modulo (%)
                PropertyAccess (.)
                    Variable: x
                    Property: Id
                Constant: 2
            Constant: 0
The LINQ-to-Entities engine can take context.Logs.Where(x => x.Id % 2 == 0) and mechanically convert it into a SQL query that looks something like this:
SELECT *
FROM "Logs"
WHERE "Logs"."Id" % 2 = 0;
If you change your code to context.Logs.Where(x => x.Id % 2 == 0).Take(100), the SQL query becomes something like this:
SELECT *
FROM "Logs"
WHERE "Logs"."Id" % 2 = 0
LIMIT 100;
This is entirely because the LINQ extension methods on IQueryable use Expression instead of just Func.
Now consider context.Logs.AsEnumerable().Where(x => x.Id % 2 == 0). The IEnumerable<Log>.Where extension method is taking a Func<Log, bool> as a predicate. That is only runnable code. It cannot be analyzed to determine its structure; it cannot be used to form a SQL query.
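To see the difference concretely, the same lambda text produces two very different things depending on the declared type (a small sketch using the Log entity from the question; only the expression tree can be inspected and translated):
using System;
using System.Linq.Expressions;

Expression<Func<Log, bool>> asTree = x => x.Id % 2 == 0; // an analyzable expression tree
Func<Log, bool> asDelegate = x => x.Id % 2 == 0;         // opaque, compiled code

Console.WriteLine(asTree.Body);       // prints ((x.Id % 2) == 0)
Console.WriteLine(asDelegate.Method); // prints only the generated method's signature
IQueryable<Log>.Where receives the first form and can translate it to SQL; IEnumerable<Log>.Where receives the second form and can only invoke it in memory.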
Entity Framework and Linq use lazy loading. It means (among other things) that they will not run the query until they need to enumerate the results: for instance using ToList() or AsEnumerable(), or if the result is used as an enumerator (in a foreach for instance).
Instead, it builds a query using predicates, and returns IQueryable objects to further "pre-filter" the results before actually returning them. You can find more info here, for instance. Entity Framework will actually build a SQL query depending on the predicates you have passed it.
In your example:
context.Logs.AsEnumerable().Where(x => x.Id % 2 == 0).Take(100).ToList();
From the Logs table in the context, it fetches everything, returns an IEnumerable with the results, then filters that result, takes the first 100, and finally materializes them as a List.
On the other hand, just removing the AsEnumerable solves your problem:
context.Logs.Where(x => x.Id % 2 == 0).Take(100).ToList();
Here it will build a query/filter on the result, then only once the ToList() is executed, query the database.
It also means that you can dynamically build a complex query without actually running it on the DB until the end, for instance:
var logs = context.Logs.Where(a); // first filter
if (something) {
    logs = logs.Where(b); // second filter
}
var results = logs.Take(100).ToList(); // only here is the query actually executed
Update
As mentioned in your comment, you seem to already know what I just wrote, and are just asking for a reason.
It's even simpler: since AsEnumerable casts the results to another type (an IQueryable<T> to an IEnumerable<T> in this case), it has to convert all the result rows first, so it has to fetch the data first. It's basically a ToList in this case.
Clearly, you understand why it's better to avoid using AsEnumerable() the way you do in your question.
Also, some of the other answers have made it very clear why calling AsEnumerable() changes the way the query is performed and read. In short, it's because you are then invoking the IEnumerable<T> extension methods rather than the IQueryable<T> extension methods, the latter allowing you to combine predicates before executing the query in the database.
However, I still feel that this doesn't answer your actual question, which is a legitimate question. You said (emphasis mine):
What I mean is that this code:
context.Logs.AsEnumerable().Where(x => x.Id % 2 == 0).Take(100).ToList();
will download all the rows from the table before passing any row to the Where() method and there could be millions of rows in the table.
My question to you is: what made you conclude that this is true?
I would argue that, because you are using IEnumerable<T> instead of IQueryable<T>, it's true that the query being performed in the database will be a simple:
select * from logs
... without any predicates, unlike what would have happened if you had used IQueryable<T> to invoke Where and Take.
However, the AsEnumerable() method call does not fetch all the rows at that moment, as other answers have implied. In fact, this is the implementation of the AsEnumerable() call:
public static IEnumerable<TSource> AsEnumerable<TSource>(this IEnumerable<TSource> source)
{
    return source;
}
There is no fetching going on there. In fact, even the calls to IEnumerable<T>.Where() and IEnumerable<T>.Take() don't actually start fetching any rows at that moment. They simply set up wrapping IEnumerables that will filter results as they are iterated. The fetching and iterating of the results really only begins when ToList() is called.
So when you say:
Couldn't EF do on demand loading of rows like you can with plain ADO.NET using Read() method of SqlDataReader and save time and bandwidth?
... again, my question to you would be: doesn't it do that already?
If your table had 1,000,000 rows, I would still expect your code snippet to only fetch up to 100 rows that satisfy your Where condition, and then stop fetching rows.
To prove the point, try running the following little program:
static void Main(string[] args)
{
    var list = PretendImAOneMillionRecordTable().Where(i => i < 500).Take(10).ToList();
}

private static IEnumerable<int> PretendImAOneMillionRecordTable()
{
    for (int i = 0; i < 1000000; i++)
    {
        Console.WriteLine("fetching {0}", i);
        yield return i;
    }
}
... when I run it, I only get the following 10 lines of output:
fetching 0
fetching 1
fetching 2
fetching 3
fetching 4
fetching 5
fetching 6
fetching 7
fetching 8
fetching 9
It doesn't iterate through the whole set of 1,000,000 "rows" even though I am chaining Where() and Take() calls on IEnumerable<T>.
Now, you do have to keep in mind that, for your little EF code snippet, if you test it using a very small table, it may actually fetch all the rows at once, if all the rows fit within the value for SqlConnection.PacketSize. This is normal. When SqlDataReader.Read() is called, it doesn't fetch just a single row at a time; to reduce the number of network round trips, it will always try to fetch a batch of rows at once. I wonder if this is what you observed, and whether it misled you into thinking that AsEnumerable() was causing all rows to be fetched from the table.
Even though you will find that your example doesn't perform nearly as badly as you thought, that is not a reason to avoid IQueryable. Using IQueryable to construct more complex database queries will almost always provide better performance, because you can then benefit from database indexes, etc., to fetch results more efficiently.
AsEnumerable() eagerly loads the DbSet<T> Logs
You probably want something like
context.Logs.Where(x => x.Id % 2 == 0).AsEnumerable();
The idea here is that you're applying a predicate filter to the collection before actually loading it from the database.
An impressive subset of the world of LINQ is supported by EF. It will translate your beautiful LINQ queries into SQL expressions behind the scenes.
I have come across this before.
The query against the context is not executed until something forces enumeration. Because you have written
context.Logs.AsEnumerable()
it is assumed that you have finished composing the query, so it is executed and all rows are returned.
If you changed this to:
context.Logs.Where(x => x.Id % 2 == 0).AsEnumerable()
It would compile a SQL statement that would get only the rows where Id % 2 == 0.
Similarly if you did
context.Logs.Where(x => x.Id % 2 == 0).Take(100).ToList();
that would create a statement that would get the top 100...
I hope that helps.
LINQ to Entities builds a store expression from all the LINQ methods in the chain before the query is enumerated.
When you use Where() and then AsEnumerable(), like this:
context.Logs.Where(...).AsEnumerable()
the Where() knows that the previous call in the chain has a store expression, so it appends its predicate to it and execution stays deferred.
A different overload of Where is called if you write this:
context.Logs.AsEnumerable().Where(...)
Here the Where() only knows that its source is an enumeration (it could be any kind of "enumerable" collection), and the only way it can apply its condition is by iterating over the collection through the IEnumerable implementation of the DbSet class, which must retrieve the records from the database first.
I don't think you should ever use this:
context.Logs.AsEnumerable().Where(x => x.Id % 2 == 0).Take(100).ToList();
The correct way of doing things would be:
context.Logs.AsQueryable().Where(x => x.Id % 2 == 0).Take(100).ToList();
Answer with explanations here:
What's the difference(s) between .ToList(), .AsEnumerable(), AsQueryable()?
Why use AsQueryable() instead of List()?

How to avoid Query Plan re-compilation when using IEnumerable.Contains in Entity Framework LINQ queries?

I have the following LINQ query executed using Entity Framework (v6.1.1):
private IList<Customer> GetFullCustomers(IEnumerable<int> customersIds)
{
    IQueryable<Customer> fullCustomerQuery = GetFullQuery();
    return fullCustomerQuery.Where(c => customersIds.Contains(c.Id)).ToList();
}
This query is translated into fairly nice SQL:
SELECT
[Extent1].[Id] AS [Id],
[Extent1].[FirstName] AS [FirstName]
-- ...
FROM [dbo].[Customer] AS [Extent1]
WHERE [Extent1].[Id] IN (1, 2, 3, 5)
However, I get a very significant performance hit on a query compilation phase. Calling:
ELinqQueryState.GetExecutionPlan(MergeOption? forMergeOption)
takes ~50% of the time of each request. Digging deeper, it turned out that the query gets re-compiled every time I pass different customersIds.
According to the MSDN article, this is expected behavior, because the IEnumerable used in the query is considered volatile and is part of the SQL that is cached. That's why the SQL is different for every different combination of customersIds, and it always has a different hash that is used to get the compiled query from the cache.
Now the question is: How can I avoid this re-compilation while still querying with multiple customersIds?
This is a great question. First of all, here are a couple of workarounds that come to mind (they all require changes to the query):
First workaround
This one maybe a bit obvious and unfortunately not generally applicable: If the selection of items you would need to pass over to Enumerable.Contains already exists in a table in the database, you can write a query that calls Enumerable.Contains on the corresponding entity set in the predicate instead of bringing the items into memory first. An Enumerable.Contains call over data in the database should result in some kind of JOIN-based query that can be cached. E.g. assuming no navigation properties between Customers and SelectedCustomers, you should be able to write the query like this:
var q = db.Customers.Where(c =>
db.SelectedCustomers.Select(s => s.Id).Contains(c.Id));
The syntax of the query with Any is a bit simpler in this case:
var q = db.Customers.Where(c =>
db.SelectedCustomers.Any(s => s.Id == c.Id));
If you don't already have the necessary selection data stored in the database, you probably won't want the overhead of having to store it, so you should consider the next workaround.
Second workaround
If you know beforehand that you will have a relatively manageable maximum number of elements in the list you can replace Enumerable.Contains with a tree of OR-ed equality comparisons, e.g.:
var list = new [] {1,2,3};
var q = db.Customers.Where(c =>
list[0] == c.Id ||
list[1] == c.Id ||
list[2] == c.Id );
This should produce a parameterized query that can be cached. If the list varies in size from query to query, this should produce a different cache entry for each list size. Alternatively you could use a list with a fixed size and pass some sentinel value that you know will never match the value argument, e.g. 0, -1, or alternatively just repeat one of the other values. In order to produce such predicate expression programmatically at runtime based on a list, you might want to consider using something like PredicateBuilder.
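As a small sketch of the fixed-size-plus-sentinel variant (my own illustration; the method name is made up, the Customer entity is the one from the examples above, and -1 is assumed never to occur as a real Id), the captured locals translate to SQL parameters, so every call reuses the same SQL text and cached plan:
static IQueryable<Customer> WhereIdIn(IQueryable<Customer> customers, IList<int> ids)
{
    const int maxSlots = 3;   // fixed parameter count baked into the cached plan
    const int sentinel = -1;  // assumed never to appear as a real Customer.Id
    if (ids.Count > maxSlots)
        throw new ArgumentException("This sketch only handles up to three ids.");

    int p0 = ids.Count > 0 ? ids[0] : sentinel;
    int p1 = ids.Count > 1 ? ids[1] : sentinel;
    int p2 = ids.Count > 2 ? ids[2] : sentinel;

    // The captured locals become SQL parameters, so the generated SQL text
    // (and its cached plan) is identical regardless of which ids are passed.
    return customers.Where(c => c.Id == p0 || c.Id == p1 || c.Id == p2);
}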
Potential fixes and their challenges
On one hand, changes necessary to support caching of this kind of query using CompiledQuery explicitly would be pretty complex in the current version of EF. The key reason is that the elements in the IEnumerable<T> passed to the Enumerable.Contains method would have to translate into a structural part of the query for the particular translation we produce, e.g.:
var list = new [] {1,2,3};
var q = db.Customers.Where(c => list.Contains(c.Id)).ToList();
The enumerable “list” looks like a simple variable in C#/LINQ but it needs to be translated to a query like this (simplified for clarity):
SELECT * FROM Customers WHERE Id IN(1,2,3)
If list changes to new [] {5,4,3,2,1}, we would have to generate the SQL query again:
SELECT * FROM Customers WHERE Id IN(5,4,3,2,1)
As a potential solution, we have talked about leaving generated SQL queries open with some kind of special placeholder, e.g. storing in the query cache something that just says
SELECT * FROM Customers WHERE Id IN(<place holder>)
At execution time, we could pick this SQL from the cache and finish the SQL generation with the actual values. Another option would be to leverage a Table-Valued Parameter for the list if the target database can support it. The first option would probably work ok only with constant values, the latter requires a database that supports a special feature. Both are very complex to implement in EF.
Auto compiled queries
On the other hand, for automatic compiled queries (as opposed to explicit CompiledQuery) the issue becomes somewhat artificial: in this case we compute the query cache key after the initial LINQ translation, hence any IEnumerable<T> argument passed should have already been expanded into DbExpression nodes: a tree of OR-ed equality comparisons in EF5, and usually a single DbInExpression node in EF6. Since the query tree already contains a distinct expression for each distinct combination of elements in the source argument of Enumerable.Contains (and therefore for each distinct output SQL query), it is possible to cache the queries.
However even in EF6 these queries are not cached even in the auto compiled queries case. The key reason for that is that we expect the variability of elements in a list to be high (this has to do with the variable size of the list but is also exacerbated by the fact that we normally don't parameterize values that appear as constants to the query, so a list of constants will be translated into constant literals in SQL), so with enough calls to a query with Enumerable.Contains you could produce considerable cache pollution.
We have considered alternative solutions to this as well, but we haven't implemented any yet. So my conclusion is that you would be better off with the second workaround in most cases if as I said, you know the number of elements in the list will remain small and manageable (otherwise you will face performance issues).
Hope this helps!
As of now, this is still a problem in Entity Framework Core when using the SQL Server Database Provider.
💡 Still on Entity Framework 6 (non-core)? Skip to the next section.
I wrote QueryableValues to solve this problem in a flexible and performant way; with it you can compose the values from an IEnumerable<T> in your query, as if it were another entity in your DbContext.
In contrast to other solutions out there, QueryableValues achieves this level of performance by:
Resolving with a single round-trip to the database.
Preserving the query's execution plan regardless of the provided values.
Usage example:
// Sample values.
IEnumerable<int> values = Enumerable.Range(1, 10);
// Using a Join.
var myQuery1 =
from e in dbContext.MyEntities
join v in dbContext.AsQueryableValues(values) on e.Id equals v
select new
{
e.Id,
e.Name
};
// Using Contains.
var myQuery2 =
from e in dbContext.MyEntities
where dbContext.AsQueryableValues(values).Contains(e.Id)
select new
{
e.Id,
e.Name
};
You can also compose complex types!
It's available as a nuget package and the project can be found here. It's distributed under the MIT license.
The benchmarks speak for themselves.
An Alternative for Entity Framework 6 (non-core)
🎉 NEW! QueryableValues EF6 Edition has arrived!
I'll explain how to manually provide some of the functionality of QueryableValues on this legacy version of Entity Framework, specifically, the ability to compose an IEnumerable<int> with any of your entities in the same way that QueryableValues does on EF Core. You can use this same technique to support collections of other simple types like long, string, etc.
Requirements
Must use the SQL Server provider
Must use the database-first strategy OR you already have a way to map a TVF using the code-first strategy
Instructions Summary
Create a method that takes an IEnumerable<int> and returns XML.
Create a TVF in your database that takes XML and returns a rowset.
Add the TVF to the EDMX using the designer.
Encapsulate the code that glues the functions created on step 1 and 2 and return an IQueryable<int>.
Use the IQueryable<int> in your queries as desired.
Instructions
1. Create a method that takes an IEnumerable<int> and returns XML
This method will serialize the provided values as XML, so later on it can be transmitted as a parameter in your query.
static string GetXml<T>(IEnumerable<T> values)
{
    var sb = new StringBuilder();
    using (var stringWriter = new System.IO.StringWriter(sb))
    {
        var settings = new System.Xml.XmlWriterSettings
        {
            ConformanceLevel = System.Xml.ConformanceLevel.Fragment
        };
        using (var xmlWriter = System.Xml.XmlWriter.Create(stringWriter, settings))
        {
            xmlWriter.WriteStartElement("R");
            foreach (var value in values)
            {
                xmlWriter.WriteStartElement("V");
                xmlWriter.WriteValue(value);
                xmlWriter.WriteEndElement();
            }
            xmlWriter.WriteEndElement();
        }
    }
    return sb.ToString();
}
If the above method is provided with new[] { 1, 2, 3 }, it will return an XML string with the following structure:
<R><V>1</V><V>2</V><V>3</V></R>
2. Create a TVF in your database that takes XML and returns a rowset
The following table-valued function (TVF) will take the XML created by the previous function and project it as a rowset with a single column (V), that can then be used from SQL Server's side in your query. Must be created in the database associated with your EDMX file, so it can be added to your EDMX model in the next step.
CREATE FUNCTION dbo.udf_GetIntValuesFromXml
(
    @Values XML
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
(
    SELECT I.value('. cast as xs:integer?', 'int') AS V
    FROM @Values.nodes('/R/V') N(I)
)
When provided with the <R><V>1</V><V>2</V><V>3</V></R> XML, the above function will return the following rowset:
V
1
2
3
3. Add the TVF to the EDMX using the designer
Table-Valued Functions (TVFs) - EF Docs
After adding this function to your EDMX model, be sure to save the changes to the EDMX file so that your DbContext-generated code is up to date.
4. Encapsulate the code that glues the functions created on step 1 and 2 and return an IQueryable<int>
The following code encapsulates the XML serializer function explained above and everything else you need on the .NET side to make this work:
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class QueryableValuesClassicDbContextExtensions
{
    private static string GetXml<T>(IEnumerable<T> values)
    {
        var sb = new StringBuilder();
        using (var stringWriter = new System.IO.StringWriter(sb))
        {
            var settings = new System.Xml.XmlWriterSettings
            {
                ConformanceLevel = System.Xml.ConformanceLevel.Fragment
            };
            using (var xmlWriter = System.Xml.XmlWriter.Create(stringWriter, settings))
            {
                xmlWriter.WriteStartElement("R");
                foreach (var value in values)
                {
                    xmlWriter.WriteStartElement("V");
                    xmlWriter.WriteValue(value);
                    xmlWriter.WriteEndElement();
                }
                xmlWriter.WriteEndElement();
            }
        }
        return sb.ToString();
    }

    public static IQueryable<int> AsQueryableValues(this IQueryableValuesClassicDbContext dbContext, IEnumerable<int> values)
    {
        return dbContext.GetIntValuesFromXml(GetXml(values));
    }
}

public interface IQueryableValuesClassicDbContext
{
    IQueryable<int> GetIntValuesFromXml(string xml);
}
The IQueryableValuesClassicDbContext interface is intended to be explicitly implemented on your DbContext class to provide access to the TVF that was added to the EDMX model.
You can do this by creating a partial class for your DbContext. For example, if your DbContext name is TestDbContext:
using System.Linq;

partial class TestDbContext : IQueryableValuesClassicDbContext
{
    IQueryable<int> IQueryableValuesClassicDbContext.GetIntValuesFromXml(string xml)
    {
        return udf_GetIntValuesFromXml(xml).Select(i => i.Value);
    }
}
5. Use the IQueryable<int> in your queries as desired (via AsQueryableValues)
using (var db = new TestDbContext())
{
    var valuesQuery = db.AsQueryableValues(new[] { 1, 2, 3, 4, 5 });

    var resultsUsingContains = db.MyEntity
        .Where(i => valuesQuery.Contains(i.MyEntityID))
        .Select(i => new { i.MyEntityID, i.PropA })
        .ToList();

    var resultsUsingJoin = (
        from i in db.MyEntity
        join v in valuesQuery on i.MyEntityID equals v
        select new { i.MyEntityID, i.PropA }
        )
        .ToList();
}
Below is the T-SQL generated behind the scenes for the above EF queries. As you can see, it's completely parameterized.
exec sp_executesql N'SELECT
    [Extent1].[MyEntityID] AS [MyEntityID],
    [Extent1].[PropA] AS [PropA]
    FROM [dbo].[MyEntity] AS [Extent1]
    WHERE EXISTS (SELECT
        1 AS [C1]
        FROM [dbo].[udf_GetIntValuesFromXml](@Values) AS [Extent2]
        WHERE ([Extent2].[V] = [Extent1].[MyEntityID]) AND ([Extent2].[V] IS NOT NULL)
    )',N'@Values nvarchar(4000)',@Values=N'<R><V>1</V><V>2</V><V>3</V><V>4</V><V>5</V></R>'
exec sp_executesql N'SELECT
    [Extent1].[MyEntityID] AS [MyEntityID],
    [Extent1].[PropA] AS [PropA]
    FROM [dbo].[MyEntity] AS [Extent1]
    INNER JOIN [dbo].[udf_GetIntValuesFromXml](@Values) AS [Extent2] ON [Extent1].[MyEntityID] = [Extent2].[V]',N'@Values nvarchar(4000)',@Values=N'<R><V>1</V><V>2</V><V>3</V><V>4</V><V>5</V></R>'
Limitations
The provided IEnumerable<int> is enumerated at query build time, not at execution time.
The final query cannot reference more than one IQueryable<T> returned by the AsQueryableValues extension method. This is another limitation around composing the same TVF more than once. EF will create two parameters with the same name, which is illegal and you will get the following error:
A parameter named 'Values' already exists in the parameter collection. Parameter names must be unique in the parameter collection.
An incorrect type is used for the XML parameter of the TVF (notice the use of nvarchar instead of xml in the T-SQL above). This is a deficiency in the EF infrastructure (ObjectParameter) that's used to compose the TVF. Not using the correct parameter type has a detrimental effect on performance due to the implicit casting that must be done by SQL Server.
Conclusion
Despite the limitations, this is still a robust solution when compared to not using parameterized T-SQL queries. To understand the underlying issue that this mitigates you can continue reading here.
Legal Stuff
Feel free to use the code and examples above as you wish. I'm releasing it under the MIT license:
MIT License
Copyright (c) Carlos Villegas (yv989c)
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
I had this exact challenge. Here is how I tackled this problem for either strings or longs in an extension method for IQueryables.
To limit query plan cache pollution we always build the query with a multiple of m (configurable) parameters: 1 * m, 2 * m, etc. So if the setting is 15, the query plans will have 15, 30, 45, etc. parameters, depending on the number of elements in the Contains (which we don't know in advance, but is probably less than 100). This limits the number of query plans to 3 as long as the biggest Contains has at most 45 elements.
The remaining parameters are filled with a placeholder value that (we know) doesn't exist in the database, in this case '-1'.
Resulting query part:
... WHERE [Filter1].[SomeProperty] IN (@p__linq__0, @p__linq__1, (...) , @p__linq__19)
... @p__linq__0='SomeSearchText1', @p__linq__1='SomeSearchText2', @p__linq__2='-1',
(...) , @p__linq__19='-1'
Usage:
ICollection<string> searchtexts = .....ToList();
//or
//ICollection<long> searchIds = .....ToList();
//this is the setting that is relevant for the resulting multitude of possible queryplans
int itemsPerSet = 15;
IQueryable<MyEntity> myEntities = (from c in dbContext.MyEntities
                                   select c)
    .WhereContains(d => d.SomeProperty, searchtexts, "-1", itemsPerSet);
The extension method:
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

namespace MyCompany.Something.Extensions
{
    public static class IQueryableExtensions
    {
        public static IQueryable<T> WhereContains<T, U>(this IQueryable<T> source, Expression<Func<T, U>> propertySelector, ICollection<U> identifiers, U placeholderThatDoesNotExistsAsValue, int cacheLevel)
        {
            if (!(propertySelector.Body is MemberExpression))
            {
                throw new ArgumentException("propertySelector must be a MemberExpression", nameof(propertySelector));
            }
            var propertyExpression = propertySelector.Body as MemberExpression;
            var propertyName = propertyExpression.Member.Name;
            return WhereContains(source, propertyName, identifiers, placeholderThatDoesNotExistsAsValue, cacheLevel);
        }

        public static IQueryable<T> WhereContains<T, U>(this IQueryable<T> source, string propertyName, ICollection<U> identifiers, U placeholderThatDoesNotExistsAsValue, int cacheLevel)
        {
            return source.Where(ContainsPredicateBuilder<T, U>(identifiers, propertyName, placeholderThatDoesNotExistsAsValue, cacheLevel));
        }

        public static Expression<Func<T, bool>> ContainsPredicateBuilder<T, U>(ICollection<U> ids, string propertyName, U placeholderValue, int cacheLevel = 20)
        {
            if (cacheLevel < 1)
            {
                throw new ArgumentException("cacheLevel must be greater than or equal to 1", nameof(cacheLevel));
            }
            Expression<Func<T, bool>> predicate;
            var propertyIsNullable = Nullable.GetUnderlyingType(typeof(T).GetProperty(propertyName).PropertyType) != null;
            // fill a list of cacheLevel parameters for the property, equal to the selected items and padded with the placeholder value to fill the list.
            Expression finalExpression = Expression.Constant(false);
            var parameter = Expression.Parameter(typeof(T), "x");
            /* factor makes sure that this query part contains a multiple of m parameters (i.e. 20, 40, 60, ...),
             * so the number of query plans is limited even if lots of users have more than m items selected */
            int factor = Math.Max(1, (int)Math.Ceiling((double)ids.Count / cacheLevel));
            for (var i = 0; i < factor * cacheLevel; i++)
            {
                U id = placeholderValue;
                if (i < ids.Count)
                {
                    id = ids.ElementAt(i);
                }
                var temp = new { id };
                var constant = Expression.Constant(temp);
                var field = Expression.Property(constant, "id");
                var member = Expression.Property(parameter, propertyName);
                if (propertyIsNullable)
                {
                    member = Expression.Property(member, "Value");
                }
                var expression = Expression.Equal(member, field);
                finalExpression = Expression.OrElse(finalExpression, expression);
            }
            predicate = Expression.Lambda<Func<T, bool>>(finalExpression, parameter);
            return predicate;
        }
    }
}
This is really a huge problem, and there's no one-size-fits-all answer. However, when most lists are relatively small, diverga's "Second Workaround" works well. I've built a library distributed as a NuGet package to perform this transformation with as little modification to the query as possible:
https://github.com/bchurchill/EFCacheContains
It's been tested out in one project, but feedback and user experiences would be appreciated! If any issues come up please report on github so that I can follow-up.

Optimizing away OrderBy() when using Any()

So I have a fairly standard LINQ-to-Object setup.
var query = expensiveSrc.Where(x=> x.HasFoo)
.OrderBy(y => y.Bar.Count())
.Select(z => z.FrobberName);
// ...
if (!condition && !query.Any())
return; // seems to enumerate and sort entire enumerable
// ...
foreach (var item in query)
// ...
This enumerates everything twice. Which is bad.
var queryFiltered = expensiveSrc.Where(x=> x.HasFoo);
var query = queryFiltered.OrderBy(y => y.Bar.Count())
.Select(z => z.FrobberName);
if (!condition && !queryFiltered.Any())
return;
// ...
foreach (var item in query)
// ...
Works, but is there a better way?
Would there be any non-insane way to "enlighten" Any() to bypass the non-required operations? I think I remember this sort of optimisation going into EduLinq.
Why not just get rid of the redundant:
if (!query.Any())
return;
It really doesn't seem to be serving any purpose - even without it, the body of the foreach won't execute if the query yields no results. So with the Any() check in, you save nothing in the fast path, and enumerate twice in the slow path.
On the other hand, if you must know if there were any results found after the end of the loop, you might as well just use a flag:
bool itemFound = false;
foreach (var item in query)
{
    itemFound = true;
    ... // Rest of the loop body goes here.
}
if (itemFound)
{
    // ...
}
Or you could use the enumerator directly if you're really concerned about the redundant flag-setting in the loop body:
using (var erator = query.GetEnumerator())
{
    bool itemFound = erator.MoveNext();
    if (itemFound)
    {
        do
        {
            // Do something with erator.Current;
        } while (erator.MoveNext());
    }
    // Do something with itemFound
}
There is not much information that can be extracted from an enumerable, so maybe it's better to turn the query into an IQueryable? This Any extension method walks down its expression tree skipping all irrelevant operations, then it turns the important branch into a delegate that can be called to obtain an optimized IQueryable. Standard Any method applied to it explicitly to avoid recursion. Not sure about corner cases, and maybe it makes sense to cache compiled queries, but with simple queries like yours it seems to work.
static class QueryableHelper {
    public static bool Any<T>(this IQueryable<T> source) {
        var e = source.Expression;
        while (e is MethodCallExpression) {
            var mce = e as MethodCallExpression;
            switch (mce.Method.Name) {
                case "Select":
                case "OrderBy":
                case "ThenBy": break;
                default: goto dun;
            }
            e = mce.Arguments.First();
        }
        dun:
        var d = Expression.Lambda<Func<IQueryable<T>>>(e).Compile();
        return Queryable.Any(d());
    }
}
Queries themselves must be modified like this:
var query = expensiveSrc.AsQueryable()
.Where(x=> x.HasFoo)
.OrderBy(y => y.Bar.Count())
.Select(z => z.FrobberName);
Would there be any non-insane way to "enlighten" Any() to bypass the non-required operations? I think I remember this sort of optimisation going into EduLinq.
Well I'm not going to ignore any question which mentions Edulinq :)
In this case, Edulinq might well be faster than LINQ to Objects, as its OrderBy implementation is as lazy as it can be - it only sorts as much as it needs to in order to retrieve the elements it returns.
However, fundamentally it still has to read the whole sequence in before it returns anything. After all, the last element in the sequence could be the first one which has to be returned.
If you're in control of the whole stack, you could make Any() detect that it's being called on your "known" IOrderedEnumerable implementation, and go straight to the original source. Note that this does create a change in the observed behaviour though - if iterating over the whole sequence throws an exception (or has any other side effect) then that side-effect would be lost by the optimization. You could argue that's okay, of course - what counts as "valid" optimization in LINQ is a decidedly tricky area.
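A rough sketch of that idea, purely as an illustration (the wrapper type and its Any overload are hypothetical and do not exist in LINQ to Objects; they only show where the shortcut would live, and they skip the source's side effects as discussed above):
using System;
using System.Collections.Generic;
using System.Linq;

// A custom ordered wrapper that remembers the unsorted source it was built from.
class KnownOrderedEnumerable<T, TKey> : IEnumerable<T>
{
    private readonly Func<T, TKey> _keySelector;
    public IEnumerable<T> UnsortedSource { get; }

    public KnownOrderedEnumerable(IEnumerable<T> source, Func<T, TKey> keySelector)
    {
        UnsortedSource = source;
        _keySelector = keySelector;
    }

    public IEnumerator<T> GetEnumerator() => UnsortedSource.OrderBy(_keySelector).GetEnumerator();
    System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator() => GetEnumerator();
}

static class KnownOrderedExtensions
{
    // Ordering never changes whether a sequence is empty, so Any() can skip the
    // sort and ask the original, unsorted source directly.
    public static bool Any<T, TKey>(this KnownOrderedEnumerable<T, TKey> ordered)
        => ordered.UnsortedSource.Any();
}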
One other possibility which is pretty horrible but which would solve this particular problem would be to make the iterator returned from the IOrderedEnumerable just take the first value of MoveNext() from the source. That's enough for the normal implementation of Any, and at that point we don't need to know what the first element is. We could defer the actual sorting until the first time the Current property is used.
That's a pretty special-case optimization though - and one which I'd be wary to implement. I think Ani's approach is the better one - just use the fact that iterating over query using foreach will never go into the body of the loop if the query results are empty.
Edit (revised): This answer addresses the issue of the query executing twice, which I believe is the key issue. See below why:
Making Any() smarter is something that only the Linq implementers can do, IMO... Or it would be some dirty adventure using reflection.
Using a class as shown below, you can cache the output of the original enumerable, and let it be enumerated twice:
public class CachedEnumerable<T>
{
    public CachedEnumerable(IEnumerable<T> enumerable)
    {
        _source = enumerable.GetEnumerator();
    }

    public IEnumerable<T> Enumerate()
    {
        int itemIndex = 0;
        while (true)
        {
            if (itemIndex < _cache.Count)
            {
                yield return _cache[itemIndex];
                itemIndex++;
                continue;
            }
            if (!_source.MoveNext())
                yield break;
            var current = _source.Current;
            _cache.Add(current);
            yield return current;
            itemIndex++;
        }
    }

    private List<T> _cache = new List<T>();
    private IEnumerator<T> _source;
}
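For instance, the query from the question could be wrapped like this (a sketch using the class above, assuming FrobberName is a string):
var cached = new CachedEnumerable<string>(
    expensiveSrc.Where(x => x.HasFoo)
                .OrderBy(y => y.Bar.Count())
                .Select(z => z.FrobberName));

if (!condition && !cached.Enumerate().Any())
    return; // first (partial) enumeration; whatever was pulled is now cached

foreach (var item in cached.Enumerate())
{
    // the second enumeration replays the cache, then continues from the source
}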
This way you keep the lazy aspect of LINQ and keep the code readable and generic. It will be slower than using IEnumerator<> directly. There are lots of opportunities to extend and optimize this class, such as a policy for discarding old items, getting rid of the coroutine, etc., but that is beyond the point of this question I think.
Oh, and the class is not thread safe as it is now. This wasn't asked, but I can imagine people trying. I think thread safety could be added easily, provided the source enumerable has no thread affinity.
Why would this be optimal?
Let's consider the two possibilities: the enumeration either contains elements or it does not.
If it contains elements, this approach is optimal, as the query is only run once.
If it contains no elements, you might be tempted to eliminate the OrderBy and Select parts of your query, as they add no value. But if there are zero items after the Where() clause, there are zero items to sort, which will cost zero time (well, almost). The same goes for the Select() clause.
What if this is not fast enough yet? In that case my strategy would be to bypass LINQ. Now, I really love LINQ, but its elegance comes at a price. So for every 100 uses of LINQ, there will typically be one or two computations that are important to execute really fast, which I write with good old for loops and lists; see the sketch below. Part of mastering a technology is recognizing where it is not appropriate. LINQ is no exception to that rule.
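For illustration, a non-LINQ version of the query from the question might look roughly like this (assuming the same members as in the question, and that FrobberName is a string):
var keys = new List<int>();
var names = new List<string>();
foreach (var x in expensiveSrc)
{
    if (!x.HasFoo) continue;
    keys.Add(x.Bar.Count());   // sort key
    names.Add(x.FrobberName);  // projected value
}

if (!condition && names.Count == 0)
    return;

// Sort the names by their keys without enumerating expensiveSrc again.
var keyArray = keys.ToArray();
var nameArray = names.ToArray();
Array.Sort(keyArray, nameArray);

foreach (var item in nameArray)
{
    // ...
}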
Try this:
var items = expensiveSrc.Where(x=> x.HasFoo)
.OrderBy(y => y.Bar.Count())
.Select(z => z.FrobberName).ToList();
// ...
if (!condition && items.Count == 0)
return; // Just check the count
// ...
foreach (var item in items)
// ...
The query is executed just once.
but I've lost the streaming/lazy loading that's half the point of linq
Lazy loading (deferred execution) and two LINQ queries with disparate results cannot be optimized (reduced) to one query execution.
Why are you not using .ToArray()?
var query = expensiveSrc.Where(x=> x.HasFoo)
.OrderBy(y => y.Bar.Count())
.Select(z => z.FrobberName).ToArray();
If there are no elements, sorting and selecting should not add much overhead. And if you are sorting, you need a cache to store the data in anyway, so the extra cost of .ToArray() should not be much.
If you decompile the OrderedEnumerable class, you find that it builds an int[] array containing the references anyway, so by using .ToArray() (or .ToList()) you just create one extra reference array.
BUT
If expensiveSrc comes from a database, other strategies could be better: if the ordering can be done in the database, sorting everything in memory instead adds quite a lot of overhead, because the data is stored twice.
