LINQ uses a deferred execution model: the resulting sequence is not produced when the LINQ operators are called. Instead, the operators return an object that yields the elements of the sequence only when we enumerate it.
While I understand how deferred queries work, I'm having some trouble understanding the benefits of deferred execution:
1) I've read that a deferred query executes only when you actually need the results, and that this can be of great benefit. So what is this benefit?
2) Another advantage of deferred queries is that if you define a query once, then each time you enumerate the results you will get different results if the data has changed.
a) But as the code below shows, we can achieve the same effect (each time we enumerate the resource, we get a different result if the data changed) even without using deferred queries:
List<string> sList = new List<string>( new[]{ "A","B" });
foreach (string item in sList)
Console.WriteLine(item); // Q1 outputs AB
sList.Add("C");
foreach (string item in sList)
Console.WriteLine(item); // Q2 outputs ABC
3) Are there any other benefits of deferred execution?
The main benefit is that this allows filtering operations, the core of LINQ, to be much more efficient. (This is effectively your item #1).
For example, take a LINQ query like this:
var results = collection.Select(item => item.Foo).Where(foo => foo < 3).ToList();
With deferred execution, the above iterates your collection one time, and each time an item is requested during the iteration, performs the map operation, filters, then uses the results to build the list.
If you were to make LINQ fully execute each time, each operation (Select / Where) would have to iterate through the entire sequence. This would make chained operations very inefficient.
Personally, I'd say your item #2 above is more of a side effect than a benefit. While it is sometimes beneficial, it also causes confusion at times, so I would just consider it "something to understand" and not tout it as a benefit of LINQ.
In response to your edit:
In your particular example, in both cases Select would iterate collection and return an IEnumerable I1 of type item.Foo. Where() would then enumerate I1 and return IEnumerable<> I2 of type item.Foo. I2 would then be converted to List.
This is not true - deferred execution prevents this from occurring.
In my example, the return type is IEnumerable<T>, which means that it's a collection that can be enumerated, but, due to deferred execution, it isn't actually enumerated.
When you call ToList(), the entire collection is enumerated. Conceptually, the result ends up looking something like this (though, of course, the real implementation is different):
List<Foo> results = new List<Foo>();
foreach(var item in collection)
{
// "Select" does a mapping
var foo = item.Foo;
// "Where" filters
if (!(foo < 3))
continue;
// "ToList" builds results
results.Add(foo);
}
Deferred execution causes the sequence itself to only be enumerated (foreach) one time, when it's used (by ToList()). Without deferred execution, it would look more like (conceptually):
// Select
List<Foo> foos = new List<Foo>();
foreach(var item in collection)
{
foos.Add(item.Foo);
}
// Where
List<Foo> foosFiltered = new List<Foo>();
foreach(var foo in foos)
{
if (foo < 3)
foosFiltered.Add(foo);
}
// ToList
List<Foo> results = new List<Foo>();
foreach(var item in foosFiltered)
{
results.Add(item);
}
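The interleaving described above can be observed directly. The following is a minimal sketch, not part of the original answer: the class name DeferredOrderDemo and the sample data are made up for illustration, and the lambdas log each call so the order of execution becomes visible.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical demo class (not from the original answer).
public static class DeferredOrderDemo
{
    // Returns the order in which the Select and Where lambdas actually ran.
    public static List<string> Trace()
    {
        var log = new List<string>();
        var source = new[] { 1, 5, 2 };

        // Building the query runs neither lambda.
        var query = source
            .Select(x => { log.Add("select " + x); return x; })
            .Where(x => { log.Add("where " + x); return x < 3; });

        log.Add("query built");

        // ToList enumerates the source once; each item passes through
        // Select and then Where before the next item is requested.
        List<int> results = query.ToList();
        log.Add("results " + string.Join(",", results));
        return log;
    }

    public static void Main()
    {
        foreach (var line in Trace())
            Console.WriteLine(line);
    }
}
```

The log starts with "query built", followed by alternating select/where entries for 1, 5 and 2, rather than three selects followed by three wheres.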
Another benefit of deferred execution is that it allows you to work with infinite series. For instance:
public static IEnumerable<ulong> FibonacciNumbers()
{
yield return 0;
yield return 1;
ulong previous = 0, current = 1;
while (true)
{
ulong next = checked(previous + current);
yield return next;
previous = current;
current = next;
}
}
(Source: http://chrisfulstow.com/fibonacci-numbers-iterator-with-csharp-yield-statements/)
You can then do the following:
var firstTenOddFibNumbers = FibonacciNumbers().Where(n=>n%2 == 1).Take(10);
foreach (var num in firstTenOddFibNumbers)
{
Console.WriteLine(num);
}
Prints:
1
1
3
5
13
21
55
89
233
377
Without deferred execution, Where would have to consume the whole infinite sequence before Take could run: you would get an OverflowException, or, if the addition weren't checked, it would run forever because the values wrap around (and calling ToList on it would eventually cause an OutOfMemoryException).
An important benefit of deferred execution is that you receive up-to-date data. This may be a hit on performance (especially if you are dealing with absurdly large data sets) but equally the data might have changed by the time your original query returns a result. Deferred execution makes sure you will get the latest information from the database in scenarios where the database is updated rapidly.
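The same re-evaluation can be seen with an in-memory source instead of a database. This is a small sketch under that assumption; the class ReEvaluationDemo and its data are hypothetical. It contrasts a deferred query, which is re-run on every enumeration, with a ToList() snapshot, which is not.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical demo class (not from the original answers).
public static class ReEvaluationDemo
{
    public static (int before, int after, int snapshot) Run()
    {
        var numbers = new List<int> { 1, 2, 3 };

        // Deferred: only the recipe is stored, nothing is filtered yet.
        IEnumerable<int> small = numbers.Where(n => n < 3);

        // Immediate: ToList runs the query now and copies the results.
        List<int> snapshot = small.ToList();

        int before = small.Count();   // runs the query over 1, 2, 3
        numbers.Add(0);               // the underlying data changes
        int after = small.Count();    // runs the query again over 1, 2, 3, 0

        return (before, after, snapshot.Count);
    }

    public static void Main()
    {
        var (before, after, snapshot) = Run();
        Console.WriteLine($"before={before} after={after} snapshot={snapshot}");
    }
}
```

The deferred query sees the added element on its second enumeration, while the snapshot still holds the two elements it copied earlier.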
I'm having trouble understanding why multiple calls of Contains return different values for the same parameter on the same enumerable.
While I understand that the collection can be modified, thus changing the result in a subsequent call, this can be ruled out here.
Consider the following (stripped-down) code in an MVC view.
The purpose of this will be to display a list of checkboxes (as there's no HTML-helper for that), and determining through the model's properties which ones should be checked when opening the view.
@foreach (var d in Model.AllDomains) {
bool isChecked = Model.Project.Domains.Contains(d.ID);
<input @(isChecked ? "checked=\"checked\" " : "")type="checkbox" value="@d.ID" />
// more stuff here
}
Changing this to use an actual List makes the whole thing work as expected:
var tmp = Model.Project.Domains.ToList();
@foreach (var d in Model.AllDomains) {
bool isChecked = tmp.Contains(d.ID);
<input @(isChecked ? "checked=\"checked\" " : "")type="checkbox" value="@d.ID" />
// more stuff here
}
The following is the model that is bound to my view (again simplified to make it more readable):
public ProjectVM GetByID(int id) {
return new ProjectVM {
Project = new Project {
... // Other properties here
Domains = from d in MyObjectModel.Projects[id].Domains
select d.ID
},
AllDomains = from d in MyObjectModel.Domains
orderby d.Name
select new {
ID = d.ID,
Name = d.Name
}
};
}
Now, while I know from debugging that Model.Project.Domains contains the correct number of entries, as well as the correct values, calling .Contains() on it returns an arbitrary result: either true or false.
In fact, if I put the line with the Contains() call into the debugger's "Watch" tab multiple times, even with a hard-coded argument (e.g. 4), the result alternates between true and false with every call.
What is happening here, what am I overlooking?
Because of the way that Model.Project.Domains is instantiated, its actual type is a WhereSelectEnumerableIterator<T>, but this implements IEnumerable<T> so that shouldn't be an issue...
It seems that the root cause of the problem was a sloppy/unusual implementation of the enumerator in the foundation classes of our object model, which made GetEnumerator() return an iterator that had already been used in the previous call.
Since Contains() stops iterating over the collection after the first match is found, it would return false on such an enumerator if the sought value was in the part that had already been searched in the previous iteration.
A negative result of Contains() caused the enumerator to reset internally, which explained the "toggling" result described in my original post.
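The original foundation classes are not shown, so the following is only a guess at the shape of the bug: a hypothetical SharedEnumeratorSequence whose GetEnumerator() always hands out the same enumerator instance, which rewinds only after running off the end. Under those assumptions it reproduces the toggling result.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

// Hypothetical reconstruction of the bug; the real classes were not posted.
public sealed class SharedEnumeratorSequence : IEnumerable<int>
{
    private readonly SharedEnumerator _shared;

    public SharedEnumeratorSequence(params int[] items)
    {
        _shared = new SharedEnumerator(items);
    }

    // BUG: every caller gets the same, possibly half-consumed, enumerator.
    public IEnumerator<int> GetEnumerator() => _shared;

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();

    private sealed class SharedEnumerator : IEnumerator<int>
    {
        private readonly int[] _items;
        private int _index = -1;

        public SharedEnumerator(int[] items) { _items = items; }

        public int Current => _items[_index];
        object IEnumerator.Current => Current;

        public bool MoveNext()
        {
            if (++_index < _items.Length) return true;
            _index = -1;          // only running off the end rewinds
            return false;
        }

        public void Reset() => _index = -1;
        public void Dispose() { } // disposing does NOT rewind
    }
}

public static class SharedEnumeratorDemo
{
    public static void Main()
    {
        var domains = new SharedEnumeratorSequence(1, 4, 7);
        Console.WriteLine(domains.Contains(4)); // stops at 4, position is kept
        Console.WriteLine(domains.Contains(4)); // resumes after 4, finds nothing, rewinds
        Console.WriteLine(domains.Contains(4)); // starts from the beginning again
    }
}
```

A correct implementation returns a fresh enumerator (or a freshly reset one) from every GetEnumerator() call, which is also why copying the data with ToList() hid the problem.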
I am querying 200k records and using up all the server's memory (no surprise). I am new to LINQ so I found the following code that should help me but I don't know how to use it:
public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> collection, int batchSize)
{
List<T> nextbatch = new List<T>(batchSize);
foreach (T item in collection)
{
nextbatch.Add(item);
if (nextbatch.Count == batchSize)
{
yield return nextbatch;
nextbatch = new List<T>(batchSize);
}
}
if (nextbatch.Count > 0)
yield return nextbatch;
}
Source: http://goo.gl/aQZIj
Here is my code which creates the "out of memory" error. How do I incorporate the new Batch function into my code?
var crmMetrics = _crmDbContext.tpm_metricsSet.Where(a => a.ModifiedOn >= lastRunDate);
foreach (var crmMetric in crmMetrics)
{
metric = new Metric();
metric.ProductKey = crmMetric.tpm_Product.Id;
dbContext.Metrics.Add(metric);
dbContext.SaveChanges();
}
It's an extension method, so if it is part of a static class and there is a reference to the class's namespace in your code you could do:
var crmMetricsBatches = _crmDbContext.tpm_metricsSet
.Where(a => a.ModifiedOn >= lastRunDate)
.AsEnumerable() // !!
.Batch(20);
Except it wouldn't help. With .AsEnumerable(), you still fetch all the data into memory, only now in chunks of 20. The call is there because you can't use the method directly against IQueryable: Entity Framework would try to translate it to SQL but of course has no clue how to do that.
As said by TGH, Skip and Take are better suited to this:
var crmMetricsPage = _crmDbContext.tpm_metricsSet
.Where(a => a.ModifiedOn >= lastRunDate)
.OrderBy(a => a.??) // some property you choose
.Skip(pageNo * pageSize)
.Take(pageSize);
where pageNo runs from 0 to one less than the number of pages you need. Skip and Take are expressions, and EF knows how to convert them to SQL. The OrderBy is required so that EF knows where to start skipping.
In this process, called paging, you always get pageSize records at a time. The number of queries is greater, but resources are spared. One condition is that you can determine a pageSize in advance. I don't know if this fits with your logic.
If you can't use paging, you should try to narrow the filter (Where(a => a.ModifiedOn >= lastRunDate)), e.g. by fetching the data in batches of one day or one week.
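The Skip/Take paging described above could be wrapped in a loop along these lines. This is a sketch only: PagingSketch and Pages are hypothetical names, and the demo runs against an in-memory AsQueryable() source rather than a real Entity Framework context, so it shows the control flow, not the generated SQL.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

// Hypothetical helper (not from the original answer).
public static class PagingSketch
{
    // Fetches one page per query until the source is exhausted.
    public static IEnumerable<List<T>> Pages<T, TKey>(
        IQueryable<T> source,
        Expression<Func<T, TKey>> orderKey,
        int pageSize)
    {
        // A stable ordering is needed so that Skip knows where to start.
        var ordered = source.OrderBy(orderKey);

        for (int pageNo = 0; ; pageNo++)
        {
            // Each ToList() call executes one query for one page.
            List<T> page = ordered.Skip(pageNo * pageSize).Take(pageSize).ToList();
            if (page.Count == 0) yield break;

            yield return page;

            // A short page means there is nothing left to fetch.
            if (page.Count < pageSize) yield break;
        }
    }

    public static void Main()
    {
        var source = Enumerable.Range(1, 5).AsQueryable();
        foreach (var page in Pages(source, n => n, 2))
            Console.WriteLine(string.Join(",", page));
    }
}
```

With a real context, only one page of entities is materialized per query; whether the context should also be recreated between pages to release tracked entities depends on the application and is not covered here.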
I would use Linq's Skip and Take to get the batches
Check this out:
http://www.c-sharpcorner.com/UploadFile/3d39b4/take-and-skip-operator-in-linq-to-sql/
In this code:
static bool Spin(int WaitTime)
{
Console.WriteLine("Running task {0} : thread {1}]",
Task.CurrentId, Thread.CurrentThread.ManagedThreadId);
Thread.Sleep(WaitTime);
return true;
}
public void DemoPLINQLong()
{
var SomeBigNumber = 1000000;
var sequence = Enumerable.Range(0, SomeBigNumber);
var sw = new Stopwatch();
sw.Start();
sequence.Where(i => Spin(SomeBigNumber));
sw.Stop();
var synchTime = sw.Elapsed;
sw.Restart();
sequence.Where(i => Spin(SomeBigNumber));
sw.Stop();
var asynchTime = sw.Elapsed;
Console.WriteLine("Synchronous: {0} Asynchronous: {1}",
synchTime.ToString(), asynchTime.ToString());
}
The results are consistent:
Synchronous: 00:00:00.0021800 Asynchronous: 00:00:00.0000076
Why is the second LINQ query hundreds of times faster? Is there some kind of caching going on? How?
.NET compiles each method to native code the first time it is executed; this is known as just-in-time (JIT) compilation. Subsequent calls to the same code reuse the compiled result, which is why the first run of nearly anything is frequently much slower than later runs of the same code.
A couple of side notes about the posted code:
Not sure what the "Synchronous" and "Asynchronous" terms are referring to; both examples are the exact same thing and there is nothing Asynchronous about them.
If you're not aware, none of the LINQ is being evaluated in the example, because of LINQ's deferred execution. You can see this if you change sequence.Where(i => Spin(SomeBigNumber)) to sequence.Where(i => Spin(SomeBigNumber)).ToList(). There, ToList() forces evaluation of the LINQ predicate, and the Console.WriteLine in the Spin method will write to the console.
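A quick way to confirm that the predicate never runs is to count its invocations. The sketch below uses a hypothetical NotEvaluatedDemo class and a counter in place of the Spin method, so it finishes instantly.

```csharp
using System;
using System.Linq;

// Hypothetical demo class (not from the original answer).
public static class NotEvaluatedDemo
{
    public static (int callsBeforeToList, int callsAfterToList) Run()
    {
        int calls = 0;

        // Building the query does not invoke the predicate.
        var query = Enumerable.Range(0, 5).Where(i => { calls++; return true; });
        int before = calls;

        // ToList enumerates the query, invoking the predicate once per element.
        query.ToList();

        return (before, calls);
    }

    public static void Main()
    {
        var (before, after) = Run();
        Console.WriteLine($"calls before ToList: {before}, after: {after}");
    }
}
```

This is why the stopwatch in the question measures only the cost of constructing the query objects, not the cost of running the predicate.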
Is there a (logical/performance) difference to writing:
ATable.Where(x=> condition1 && condition2 && condition3)
or
ATable.Where(x=>condition1).Where(x=>condition2).Where(x=>condition3)
I've been using the former but realised that with the latter, I can read and copy parts of a query out to use somewhere else easier.
Any thoughts?
Short answer
You should do what you feel is more readable and maintainable in your application as both will evaluate to the same collection.
Long answer (quite long)
Linq To Objects
ATable.Where(x=> condition1 && condition2 && condition3)
For this example, since there is only one predicate, the compiler only needs to generate one delegate and one compiler-generated method.
From Reflector:
if (CS$<>9__CachedAnonymousMethodDelegate4 == null)
{
CS$<>9__CachedAnonymousMethodDelegate4 = new Func<ATable, bool>(null, (IntPtr) <Main>b__0);
}
Enumerable.Where<ATable>(tables, CS$<>9__CachedAnonymousMethodDelegate4).ToList<ATable>();
The compiler generated method:
[CompilerGenerated]
private static bool <Main>b__0(ATable m)
{
return ((m.Prop1 && m.Prop2) && m.Prop3);
}
As you can see there is only one call into Enumerable.Where<T> with the delegate as expected since there was only one Where extension method.
ATable.Where(x=>condition1).Where(x=>condition2).Where(x=>condition3)
Now for this example a lot more code is generated.
if (CS$<>9__CachedAnonymousMethodDelegate5 == null)
{
CS$<>9__CachedAnonymousMethodDelegate5 = new Func<ATable, bool>(null, (IntPtr) <Main>b__1);
}
if (CS$<>9__CachedAnonymousMethodDelegate6 == null)
{
CS$<>9__CachedAnonymousMethodDelegate6 = new Func<ATable, bool>(null, (IntPtr) <Main>b__2);
}
if (CS$<>9__CachedAnonymousMethodDelegate7 == null)
{
CS$<>9__CachedAnonymousMethodDelegate7 = new Func<ATable, bool>(null, (IntPtr) <Main>b__3);
}
Enumerable.Where<ATable>(Enumerable.Where<ATable>(Enumerable.Where<ATable>(tables, CS$<>9__CachedAnonymousMethodDelegate5), CS$<>9__CachedAnonymousMethodDelegate6), CS$<>9__CachedAnonymousMethodDelegate7).ToList<ATable>();
Since we have three chained extension methods, we also get three Func<ATable, bool> delegates and three compiler-generated methods.
[CompilerGenerated]
private static bool <Main>b__1(ATable m)
{
return m.Prop1;
}
[CompilerGenerated]
private static bool <Main>b__2(ATable m)
{
return m.Prop2;
}
[CompilerGenerated]
private static bool <Main>b__3(ATable m)
{
return m.Prop3;
}
Now this looks like it should be slower, since heck, there is a ton more code. However, since all execution is deferred until GetEnumerator() is called, I doubt any noticeable difference will present itself.
Some gotchas that could affect performance
Any call to GetEnumerator in the chain will cause the collection to be iterated. ATable.Where().ToList().Where().ToList() iterates the collection with the first predicate when the first ToList is called, and then iterates the resulting list again with the second ToList. Try to defer the GetEnumerator call to the very last moment to reduce the number of times the collection is iterated.
Linq To Entities
Since we are using IQueryable<T> now, our compiler-generated code is a bit different, as we are using Expression<Func<T, bool>> instead of our normal Func<T, bool>.
Example with everything in one Where:
var allInOneWhere = entityFrameworkEntities.MovieSets.Where(m => m.Name == "The Matrix" && m.Id == 10 && m.GenreType_Value == 3);
This generates one heck of a statement.
IQueryable<MovieSet> allInOneWhere = Queryable.Where<MovieSet>(entityFrameworkEntities.MovieSets, Expression.Lambda<Func<MovieSet, bool>>(Expression.AndAlso(Expression.AndAlso(Expression.Equal(Expression.Property(CS$0$0000 = Expression.Parameter(typeof(MovieSet), "m"), (MethodInfo) methodof(MovieSet.get_Name)), ..tons more stuff...ParameterExpression[] { CS$0$0000 }));
The most notable thing is that we end up with one expression tree that is broken down into Expression.AndAlso pieces. And, as expected, we have only one call to Queryable.Where.
var chainedWhere = entityFrameworkEntities.MovieSets.Where(m => m.Name == "The Matrix").Where(m => m.Id == 10).Where(m => m.GenreType_Value == 3);
I won't even bother pasting in the compiler code for this; it's way too long. But in short, we end up with three nested calls, Queryable.Where(Queryable.Where(Queryable.Where())), and three expressions. This again is expected, as we have three chained Where clauses.
Generated Sql
Like IEnumerable<T>, IQueryable<T> also does not execute until the enumerator is called. Because of this we can be happy to know that both produce exactly the same SQL statement:
SELECT
[Extent1].[AtStore_Id] AS [AtStore_Id],
[Extent1].[GenreType_Value] AS [GenreType_Value],
[Extent1].[Id] AS [Id],
[Extent1].[Name] AS [Name]
FROM [dbo].[MovieSet] AS [Extent1]
WHERE (N'The Matrix' = [Extent1].[Name]) AND (10 = [Extent1].[Id]) AND (3 = [Extent1].[GenreType_Value])
Some gotchas that could affect performance
Any call to GetEnumerator in the chain will cause a call out to SQL, e.g. ATable.Where().ToList().Where() will actually query SQL for all records matching the first predicate and then filter the list with LINQ to Objects using the second predicate.
Since you mention extracting the predicates to use elsewhere, make sure they are in the form of Expression<Func<T, bool>> and not simply Func<T, bool>. The first can be parsed as an expression tree and converted into valid SQL; the second will cause ALL objects to be returned, and the Func<T, bool> will then execute on that collection in memory.
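The difference between the two predicate forms shows up in overload resolution. The sketch below is an illustration with made-up names (Movie, PredicateReuseDemo) and an in-memory AsQueryable() source standing in for an Entity Framework set: the Expression version binds to Queryable.Where and stays an IQueryable, while the Func version binds to Enumerable.Where and silently becomes a plain in-memory sequence.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

// Hypothetical entity type for the illustration.
public sealed class Movie
{
    public string Name { get; set; } = "";
    public int Id { get; set; }
}

// Hypothetical demo class (not from the original answer).
public static class PredicateReuseDemo
{
    // Reusable predicate as an expression tree: a provider can translate it.
    public static readonly Expression<Func<Movie, bool>> IsMatrixExpr =
        m => m.Name == "The Matrix";

    // Reusable predicate as a compiled delegate: opaque to a provider.
    public static readonly Func<Movie, bool> IsMatrixFunc =
        m => m.Name == "The Matrix";

    private static IQueryable<Movie> Source() =>
        new List<Movie>
        {
            new Movie { Id = 10, Name = "The Matrix" },
            new Movie { Id = 11, Name = "Alien" },
        }.AsQueryable();

    // Binds to Queryable.Where, so the result is still composable as a query.
    public static bool ExpressionStaysQueryable()
    {
        object result = Source().Where(IsMatrixExpr);
        return result is IQueryable<Movie>;
    }

    // Binds to Enumerable.Where, so the rest of the chain runs in memory.
    public static bool FuncStaysQueryable()
    {
        object result = Source().Where(IsMatrixFunc);
        return result is IQueryable<Movie>;
    }

    public static void Main()
    {
        Console.WriteLine(ExpressionStaysQueryable());
        Console.WriteLine(FuncStaysQueryable());
    }
}
```

Both forms return the same rows here because the source is already in memory; against a database, only the first form lets the filter run on the server.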
I hope this was a bit helpful to answer your question.
I ran into a scenario where LINQ to SQL acts very strangely. I would like to know if I'm doing something wrong. But I think there is a real possibility that it's a bug.
The code pasted below isn't my real code. It is a simplified version I created for this post, using the Northwind database.
A little background: I have a method that takes an IQueryable of Product and a "filter object" (which I will describe in a minute). It should run some "Where" extension methods on the IQueryable, based on the "filter object", and then return the IQueryable.
The so-called "filter object" is a System.Collections.Generic.List of an anonymous type of this structure: { column = fieldEnum, id = int }
The fieldEnum is an enum of the different columns of the Products table that I would possibly like to use for the filtering.
Instead of explaining further how my code works, it's easier if you just take a look at it. It's simple to follow.
enum filterType { supplier = 1, category }
public IQueryable<Product> getIQueryableProducts()
{
NorthwindDataClassesDataContext db = new NorthwindDataClassesDataContext();
IQueryable<Product> query = db.Products.AsQueryable();
//this section is just for the example. It creates a Generic List of an Anonymous Type
//with two objects. In real life I get the same kind of collection, but it isn't hard coded like here
var filter1 = new { column = filterType.supplier, id = 7 };
var filter2 = new { column = filterType.category, id = 3 };
var filterList = (new[] { filter1 }).ToList();
filterList.Add(filter2);
foreach(var oFilter in filterList)
{
switch (oFilter.column)
{
case filterType.supplier:
query = query.Where(p => p.SupplierID == oFilter.id);
break;
case filterType.category:
query = query.Where(p => p.CategoryID == oFilter.id);
break;
default:
break;
}
}
return query;
}
So here is an example. Let's say the List contains two items of this anonymous type, { column = filterType.supplier, id = 7 } and { column = filterType.category, id = 3 }.
After running the code above, the underlying SQL query of the IQueryable object should contain:
WHERE SupplierID = 7 AND CategoryID = 3
But in reality, after the code runs the SQL that gets executed is
WHERE SupplierID = 3 AND CategoryID = 3
I tried defining query as a property and setting a breakpoint on the setter, thinking I could catch what's changing it when it shouldn't be. But everything was supposedly fine. So instead I just checked the underlying SQL after every command. I realized that the first Where runs fine, and query stays fine (meaning SupplierID = 7) until right after the foreach loop runs the second time. Right after oFilter becomes the second anonymous type item, and not the first, the 'query' SQL changes to SupplierID = 3. So what must be happening here under the hood is that instead of just remembering that SupplierID should equal 7, LINQ to SQL remembers that SupplierID should equal oFilter.id. But oFilter is the name of a single item of a foreach loop, and it means something different after the loop iterates.
I have only glanced at your question, but I am 90% sure that you should read the first section of On lambdas, capture, and mutability (which includes links to 5 similar SO questions) and all will become clear.
The basic gist is that the variable oFilter in your example has been captured in the closure by reference and not by value. That means that once the loop finishes iterating, the variable refers to the last item, so the value evaluated at lambda execution time is the final one as well.
The cure is to insert a new variable inside the foreach loop whose scope is only that iteration rather than the whole loop:
foreach(var oFilter in filterList)
{
var filter = oFilter; // add this
switch (oFilter.column) // this doesn't have to change, but can for consistency
{
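case filterType.supplier:
query = query.Where(p => p.SupplierID == filter.id); // use `filter` here
break;
case filterType.category:
query = query.Where(p => p.CategoryID == filter.id); // and here
break;
}
}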
Now each closure is over a different filter variable that is declared anew inside of each loop, and your code will run as expected.
Working as designed. The issue you are confronting is the clash between lexical closure and mutable variables.
What you probably want to do is
foreach(var oFilter in filterList)
{
var o = oFilter;
switch (o.column)
{
case filterType.supplier:
query = query.Where(p => p.SupplierID == o.id);
break;
case filterType.category:
query = query.Where(p => p.CategoryID == o.id);
break;
default:
break;
}
}
When compiled to IL, the variable oFilter is declared once and reused on every iteration. What you need is a variable declared separately for each iteration, so that each closure captures its own copy, which is what o is there for.
While you're at it, get rid of that bastardized Hungarian notation :P.
I think this is the clearest explanation I've ever seen: http://blogs.msdn.com/ericlippert/archive/2009/11/12/closing-over-the-loop-variable-considered-harmful.aspx:
Basically, the problem arises because we specify that the foreach loop is a syntactic sugar for
{
IEnumerator<int> e = ((IEnumerable<int>)values).GetEnumerator();
try
{
int m; // OUTSIDE THE ACTUAL LOOP
while(e.MoveNext())
{
m = (int)(int)e.Current;
funcs.Add(()=>m);
}
}
finally
{
if (e != null) ((IDisposable)e).Dispose();
}
}
If we specified that the expansion was
try
{
while(e.MoveNext())
{
int m; // INSIDE
m = (int)(int)e.Current;
funcs.Add(()=>m);
}
then the code would behave as expected.
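The two expansions can be compared in a small runnable sketch. Since C# 5 the compiler treats the foreach variable as if it were declared inside the loop, so the sketch below uses an explicitly shared variable to reproduce the older behaviour; the class name ClosureCaptureDemo and the sample values 7 and 3 are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical demo class (not from the quoted article).
public static class ClosureCaptureDemo
{
    // All lambdas capture the same variable, mimicking "int m" declared OUTSIDE the loop.
    public static List<int> SharedVariable()
    {
        var funcs = new List<Func<int>>();
        int m = 0;
        foreach (int value in new[] { 7, 3 })
        {
            m = value;
            funcs.Add(() => m);   // captures the variable m, not its current value
        }
        // Every lambda reads m after the loop has finished.
        return funcs.Select(f => f()).ToList();
    }

    // Each lambda captures its own variable, mimicking "int m" declared INSIDE the loop.
    public static List<int> PerIterationCopy()
    {
        var funcs = new List<Func<int>>();
        foreach (int value in new[] { 7, 3 })
        {
            int copy = value;     // a fresh variable per iteration
            funcs.Add(() => copy);
        }
        return funcs.Select(f => f()).ToList();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(",", SharedVariable()));   // 3,3
        Console.WriteLine(string.Join(",", PerIterationCopy())); // 7,3
    }
}
```

The 3,3 result corresponds to the SupplierID = 3 AND CategoryID = 3 query from the question, and the 7,3 result to the intended one.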
The problem is that you're not appending to the query, you're replacing it each time through the foreach statement.
You want something like the PredicateBuilder - http://www.albahari.com/nutshell/predicatebuilder.aspx