Linq Select into New Object Performance - performance

I am new to Linq, using C#. I got a big surprise when I executed the following:
var scores = objects.Select( i => new { @object = i, // `object` is a keyword, so it needs the @ prefix
score1 = i.algorithm1(),
score2 = i.algorithm2(),
score3 = i.algorithm3() } );
double avg2 = scores.Average( i => i.score2); // algorithm1() is called for every object
double cutoff2 = avg2 + scores.Select( i => i.score2).StdDev(); // algorithm1() is called for every object
double avg3 = scores.Average( i => i.score3); // algorithm1() is called for every object
double cutoff3 = avg3 + scores.Select( i => i.score3).StdDev(); // algorithm1() is called for every object
foreach( var s in scores.Where( i => i.score2 > cutoff2 | i.score3 > cutoff3 ).OrderBy( i => i.score1 )) // algorithm1() is called for every object
{
Debug.Log(String.Format ("{0} {1} {2} {3}\n", s.@object, s.score1, s.score2/avg2, s.score3/avg3));
}
The attributes in my new objects seem to store the function calls rather than the values. Each time I try to access an attribute, the original function is called. I assume this is a huge waste of time? How can I avoid this?

Yes, you've discovered that LINQ uses deferred execution. This is a normal part of LINQ, and very handy indeed for building up queries without actually executing anything until you need to - which in turn is great for pipelines of multiple operations over potentially huge data sources which can be streamed.
For more details about how LINQ to Objects works internally, you might want to read my Edulinq blog series - it's basically a reimplementation of the whole of LINQ to Objects, one method at a time. Hopefully by the end of that you'll have a much clearer idea of what to expect.
If you want to materialize the query, you just need to call ToList or ToArray to build an in-memory copy of the results:
var scores = objects.Select( i => new { @object = i, // `object` is a keyword, so it needs the @ prefix
score1 = i.algorithm1(),
score2 = i.algorithm2(),
score3 = i.algorithm3() } ).ToList();
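To make the effect of deferred execution visible, here is a minimal sketch that counts how often a projection runs. The names DeferredExecutionDemo, CountCalls and expensiveScore are hypothetical stand-ins for the question's algorithm1() and friends, not part of the original code:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredExecutionDemo
{
    // Returns how many times the (hypothetical) expensive function ran
    // while the query was consumed by two aggregates.
    public static int CountCalls(bool materialize)
    {
        int calls = 0;
        Func<int, double> expensiveScore = x => { calls++; return x * 1.5; };
        var inputs = new List<int> { 1, 2, 3 };

        // Nothing runs here: Select only describes the projection.
        var query = inputs.Select(i => new { Item = i, Score = expensiveScore(i) });

        // ToList() runs the projection once per element and stores the results.
        var scores = materialize ? query.ToList() : query;

        double avg = scores.Average(s => s.Score); // enumerates `scores`
        double max = scores.Max(s => s.Score);     // enumerates `scores` again
        Console.WriteLine($"avg={avg}, max={max}");
        return calls;
    }

    public static void Main()
    {
        Console.WriteLine(CountCalls(materialize: false)); // 6: projection re-run per enumeration
        Console.WriteLine(CountCalls(materialize: true));  // 3: projection run once by ToList()
    }
}
```

The anonymous objects themselves hold plain values; it is the un-materialized query that is re-executed each time it is enumerated.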

Related

LINQ: Improving performance of "query to find all dictionaries from list of dictionaries where given key has at least one value from list of values"

I tried searching for existing questions, but I could not find anything, so I apologize if this is a duplicate question.
I have the following piece of code. It runs in a loop for different values of key and listOfValues (listOfDict does not change and is built only once; key and listOfValues vary for each iteration). The code currently works, but the profiler shows that 50% of the execution time is spent in this LINQ query. Can I improve performance, perhaps by using a different LINQ construct?
// List of dictionary that allows multiple values against one key.
List<Dictionary<string, List<string>>> listOfDict = BuildListOfDict();
// Following code & LINQ query runs in a loop.
List<string> listOfValues = BuildListOfValues();
string key = GetKey();
// LINQ query to find all dictionaries from listOfDict
// where given key has at least one value from listOfValues.
List<Dictionary<string, List<string>>> result = listOfDict
.Where(dict => dict[key]
.Any(lhs => listOfValues.Any(rhs => lhs == rhs)))
.ToList();
Using HashSet will perform significantly better. You can create a HashSet<string> like so:
IEnumerable<string> strings = ...;
var hashSet = new HashSet<string>(strings);
I assume you can change your methods to return HashSets and make them run like this:
List<Dictionary<string, HashSet<string>>> listOfDict = BuildListOfDict();
HashSet<string> listOfValues = BuildListOfValues();
string key = GetKey();
List<Dictionary<string, HashSet<string>>> result = listOfDict
.Where(dict => listOfValues.Overlaps(dict[key]))
.ToList();
Here HashSet's instance method Overlaps is used. HashSet is optimized for set operations like this. In a test using one dictionary of 200 elements this runs in 3% of the time compared to your method.
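As a small illustration of the difference, the sketch below puts the nested Any check next to HashSet<string>.Overlaps on toy data. The class and method names (OverlapsDemo, SharesValueNested, SharesValueHashed) are made up for this example, and it only shows that both give the same answer; it makes no timing claim:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OverlapsDemo
{
    // Original approach: every element of lhs is compared against every element of rhs.
    public static bool SharesValueNested(List<string> lhs, List<string> rhs) =>
        lhs.Any(l => rhs.Any(r => l == r));

    // HashSet approach: each element of `other` is probed against the set's hash table.
    public static bool SharesValueHashed(HashSet<string> set, IEnumerable<string> other) =>
        set.Overlaps(other);

    public static void Main()
    {
        var dictValues = new List<string> { "red", "green", "blue" };
        var wanted = new List<string> { "black", "blue" };
        var wantedSet = new HashSet<string>(wanted);

        Console.WriteLine(SharesValueNested(dictValues, wanted));    // True
        Console.WriteLine(SharesValueHashed(wantedSet, dictValues)); // True
    }
}
```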
UPDATED: Per @GertArnold, switched from Any/Contains to HashSet.Overlaps for a slight performance improvement.
Depending on whether listOfValues or the average value for a key is longer, you can either convert listOfValues to a HashSet<string> or build your list of dictionaries to have a HashSet<string> for each value:
// optimize testing against listOfValues
var valHS = listOfValues.ToHashSet();
var result2 = listOfDict.Where(dict => valHS.Overlaps(dict[key]))
.ToList();
// change structure to optimize query
var listOfDict2 = listOfDict.Select(dict => dict.ToDictionary(kvp => kvp.Key, kvp => kvp.Value.ToHashSet())).ToList();
var result3 = listOfDict2.Where(dict => dict[key].Overlaps(listOfValues))
.ToList();
Note: if the query is repeated with differing listOfValues, it probably makes more sense to build the HashSet in the dictionaries once, rather than computing a HashSet from each listOfValues.
@LasseVågsætherKarlsen's suggestion in the comments to invert the structure intrigued me, so with a further refinement to handle the multiple keys, I created an index structure and tested lookups. With my test harness, this is about twice as fast as using a HashSet for one of the List<string>s and four times faster than the original method:
var listOfKeys = listOfDict.First().Select(d => d.Key);
var lookup = listOfKeys.ToDictionary(k => k, k => listOfDict.SelectMany(d => d[k].Select(v => (v, d))).ToLookup(vd => vd.v, vd => vd.d));
Now to filter for a particular key and list of values:
var result4 = listOfValues.SelectMany(v => lookup[key][v]).Distinct().ToList();
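The inverted index can be tried on toy data with the self-contained sketch below. It assumes, as the snippet above does, that every dictionary has the same keys as the first one; the names InvertedIndexDemo, BuildIndex, Find and the Dict alias are introduced only for this example:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Dict = System.Collections.Generic.Dictionary<string, System.Collections.Generic.List<string>>;

public static class InvertedIndexDemo
{
    // key -> (value -> dictionaries that hold that value under that key).
    // Assumption: all dictionaries share the keys of the first one.
    public static Dictionary<string, ILookup<string, Dict>> BuildIndex(List<Dict> listOfDict)
    {
        return listOfDict.First().Keys.ToDictionary(
            k => k,
            k => listOfDict
                .SelectMany(d => d[k].Select(v => new { Value = v, Owner = d }))
                .ToLookup(x => x.Value, x => x.Owner));
    }

    // A lookup returns an empty sequence for unknown values, so no Contains check is needed.
    // Distinct() removes dictionaries that matched more than one value.
    public static List<Dict> Find(
        Dictionary<string, ILookup<string, Dict>> index, string key, IEnumerable<string> values)
    {
        return values.SelectMany(v => index[key][v]).Distinct().ToList();
    }

    public static void Main()
    {
        var d1 = new Dict { ["color"] = new List<string> { "red", "blue" } };
        var d2 = new Dict { ["color"] = new List<string> { "green" } };
        var index = BuildIndex(new List<Dict> { d1, d2 });

        var hits = Find(index, "color", new[] { "red", "blue", "black" });
        Console.WriteLine(hits.Count); // 1: d1 matched twice but is returned once
    }
}
```

The index is built once, which matches the question's setup where listOfDict does not change between iterations.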

LINQ Out of Memory Error

I am querying 200k records and using up all the server's memory (no surprise). I am new to LINQ so I found the following code that should help me but I don't know how to use it:
public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> collection, int batchSize)
{
List<T> nextbatch = new List<T>(batchSize);
foreach (T item in collection)
{
nextbatch.Add(item);
if (nextbatch.Count == batchSize)
{
yield return nextbatch;
nextbatch = new List<T>(batchSize);
}
}
if (nextbatch.Count > 0)
yield return nextbatch;
}
Source: http://goo.gl/aQZIj
Here is my code which creates the "out of memory" error. How do I incorporate the new Batch function into my code?
var crmMetrics = _crmDbContext.tpm_metricsSet.Where(a => a.ModifiedOn >= lastRunDate);
foreach (var crmMetric in crmMetrics)
{
metric = new Metric();
metric.ProductKey = crmMetric.tpm_Product.Id;
dbContext.Metrics.Add(metric);
dbContext.SaveChanges();
}
It's an extension method, so if it is part of a static class and there is a reference to the class's namespace in your code you could do:
var crmMetricsBatches = _crmDbContext.tpm_metricsSet
.Where(a => a.ModifiedOn >= lastRunDate)
.AsEnumerable() // !!
.Batch(20);
Except it wouldn't help. With .AsEnumerable(), you still fetch all the data into memory, only now in chunks of 20. This is because you can't use the method directly against IQueryable: Entity Framework will try to translate it to SQL but of course has no clue how to do that.
As TGH said, Skip and Take are better suited to this:
var crmMetricsPage = _crmDbContext.tpm_metricsSet
.Where(a => a.ModifiedOn >= lastRunDate)
.OrderBy(a => a.??) // some property you choose
.Skip(pageNo * pageSize)
.Take(pageSize);
where pageNo counts from 0 to the number of pages minus 1. Skip and Take are expressions, and EF knows how to convert these to SQL. The OrderBy is required for EF to know where to start skipping.
In this process, called paging, you always get pageSize records at a time. The number of queries is greater, but resources are spared. One condition is that you can determine a pageSize in advance. I don't know if this fits with your logic.
If you can't use paging you should try to narrow the filter (Where(a => a.ModifiedOn >= lastRunDate)), e.g. try to get the data in batches of one day or one week.
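A paging loop built on Skip and Take might look like the sketch below. It is not EF code: an in-memory IQueryable<int> stands in for the database set, and PagingDemo and ReadPages are hypothetical names. Against a real provider each iteration would issue its own query:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

public static class PagingDemo
{
    // Yields one page at a time; every page is a separate OrderBy/Skip/Take query.
    public static IEnumerable<List<T>> ReadPages<T, TKey>(
        IQueryable<T> source, Expression<Func<T, TKey>> orderKey, int pageSize)
    {
        for (int pageNo = 0; ; pageNo++)
        {
            List<T> page = source.OrderBy(orderKey)
                                 .Skip(pageNo * pageSize)
                                 .Take(pageSize)
                                 .ToList();
            if (page.Count == 0) yield break;       // nothing left
            yield return page;
            if (page.Count < pageSize) yield break; // short page means it was the last one
        }
    }

    public static void Main()
    {
        // In-memory stand-in for the filtered table (assumption for this sketch).
        IQueryable<int> rows = Enumerable.Range(1, 7).AsQueryable();

        foreach (var page in ReadPages(rows, r => r, pageSize: 3))
            Console.WriteLine(string.Join(",", page)); // 1,2,3 then 4,5,6 then 7
    }
}
```

Only one page is held by the loop at a time, which is the point of paging; the page size has to be chosen up front, as noted above.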
I would use LINQ's Skip and Take to get the batches.
Check this out:
http://www.c-sharpcorner.com/UploadFile/3d39b4/take-and-skip-operator-in-linq-to-sql/

Truncating a collection using Linq query

I want to extract part of a collection to another collection.
I can easily do this with a for loop, but my LINQ query is not working.
I am a neophyte in LINQ, so please help me correct the query (if possible with an explanation or a link to a beginner's tutorial).
Legacy way of doing it:
Collection<string> testColl1 = new Collection<string> {"t1", "t2", "t3", "t4"};
Collection<string> testColl2 = new Collection<string>();
for (int i = 0; i < newLength; i++)
{
testColl2.Add(testColl1[i]);
}
Where testColl1 is the source & testColl2 is the desired truncated collection of count = newLength.
I have used the following linq queries, but none of them are working ...
var result = from t in testColl1 where t.Count() <= newLength select t;
var res = testColl1.Where(t => t.Count() <= newLength);
Use Enumerable.Take:
var testColl2 = testColl1.Take(newLength).ToList();
Note that there's a semantic difference between your for loop and the version using Take. The for loop will throw an ArgumentOutOfRangeException (that is what the Collection<T> indexer throws) if there are fewer than newLength items in testColl1, whereas the Take version will silently ignore this fact and just return as many items as there are, up to newLength.
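The difference can be checked with a short sketch. TakeDemo, TakeCount and LoopThrows are names invented for this example; the collection mirrors testColl1 from the question:

```csharp
using System;
using System.Collections.ObjectModel;
using System.Linq;

public static class TakeDemo
{
    // Take never throws for a too-large count; it returns what is available.
    public static int TakeCount(Collection<string> source, int newLength) =>
        source.Take(newLength).Count();

    // The index-based loop fails once i runs past the end of the source.
    public static bool LoopThrows(Collection<string> source, int newLength)
    {
        try
        {
            var copy = new Collection<string>();
            for (int i = 0; i < newLength; i++)
                copy.Add(source[i]);
            return false;
        }
        catch (ArgumentOutOfRangeException)
        {
            return true;
        }
    }

    public static void Main()
    {
        var testColl1 = new Collection<string> { "t1", "t2", "t3", "t4" };

        Console.WriteLine(TakeCount(testColl1, 2));   // 2
        Console.WriteLine(TakeCount(testColl1, 10));  // 4: silently capped
        Console.WriteLine(LoopThrows(testColl1, 2));  // False
        Console.WriteLine(LoopThrows(testColl1, 10)); // True
    }
}
```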
The correct way is by using Take:
var result = testColl1.Take(newLength);
An equivalent way using Where is:
var result = testColl1.Where((item, i) => i < newLength); // the element comes first, then its index
These expressions will produce an IEnumerable, so you might also want to attach a .ToList() or .ToArray() at the end.
Both ways return at most newLength items, which matches a loop that runs while i < newLength (e.g. if newLength == 0 no items should be returned).
You could convert the for loop to something like this:
testColl1.Take(newLength)
Use Take:
var result = testColl1.Take(newLength);
This extension method returns the first N elements from the collection where N is the parameter you pass, in this case newLength.

Getting count with NHibernate + Linq + Future

I want to do paging with NHibernate when writing a Linq query. It's easy to do something like this:
return session.Query<Payment>()
.OrderByDescending(payment => payment.Created)
.Skip((page - 1)*pageSize)
.Take(pageSize)
.ToArray();
But with this I don't get any info about the total number of items. And if I just do a simple .Count(), that will generate a new call to the database.
I found this answer which solved it by using future. But it uses Criteria. How can I do this with Linq?
The difficulty with using Futures with LINQ is that operations like Count execute immediately.
As @vandalo found out, Count() after ToFuture() actually runs the Count in memory, which is bad.
The only way to get the count in a future LINQ query is to use GroupBy on an invariant field. A good choice would be something that is already part of your filters (like an "IsActive" property).
Here's an example assuming you have such a property in Payment:
//Create base query. Filters should be specified here.
var query = session.Query<Payment>().Where(x => x.IsActive == 1);
//Create a sorted, paged, future query,
//that will execute together with other statements
var futureResults = query.OrderByDescending(payment => payment.Created)
.Skip((page - 1) * pageSize)
.Take(pageSize)
.ToFuture();
//Create a Count future query based on the original one.
//The paged query will be sent to the server in the same roundtrip.
var futureCount = query.GroupBy(x => x.IsActive)
.Select(x => x.Count())
.ToFutureValue();
//Get the results.
var results = futureResults.ToArray();
var count = futureCount.Value;
Of course, the alternative is doing two roundtrips, which is not that bad anyway. You can still reuse the original IQueryable, which is useful when you want to do paging in a higher-level layer:
//Create base query. Filters should be specified here.
var query = session.Query<Payment>();
//Create a sorted, paged query,
var pagedQuery = query.OrderByDescending(payment => payment.Created)
.Skip((page - 1) * pageSize)
.Take(pageSize);
//Get the count from the original query
var count = query.Count();
//Get the results.
var results = pagedQuery.ToArray();
Update (2011-02-22): I wrote a blog post about this issue and a much better solution.
The following blog post has an implementation of ToFutureValue that works with LINQ.
http://sessionfactory.blogspot.com.br/2011/02/getting-row-count-with-future-linq.html
It has a small error on the following line, which must be changed from this:
var provider = (NhQueryProvider)source.Provider;
To this:
var provider = (INhQueryProvider)source.Provider;
After applying the change you can use the queries in this way:
var query = session.Query<Foo>();
var futureCount = query.ToFutureValue(x => x.Count());
var page = query.Skip(pageIndex * pageSize).Take(pageSize).ToFuture();
var query = Session.QueryOver<Payment>()
.OrderByDescending(payment => payment.Created)
.Skip((page -1 ) * pageSize)
.Take(pageSize);
This is something I just discovered that the Linq to NH handles just fine: ToRowCountQuery removes Take/Skip from the query and does a future row count.
var rowCount = query.ToRowCountQuery().FutureValue<int>();
var result = query.Future();
var asArray = result.ToArray();
var count = rowCount.Value; // Value is a property on IFutureValue<T>
OK, it seems this should work in your case, but I have not tested it:
return session.QueryOver<Payment>()
.Skip((page - 1) * pageSize)
.Take(pageSize)
.SelectList(r => r.SelectCount(f => f.Id))
.List<object[]>().First();
Test first before upvoting ;)
UPD: sorry, as I understand you now, you need to get Count of all items. Then you need to run the query without paging:
return session.QueryOver<Payment>()
.SelectList(r => r.SelectCount(f => f.Id))
.List<object[]>().First();

LINQ to SQL bug (or very strange feature) when using IQueryable, foreach, and multiple Where

I ran into a scenario where LINQ to SQL acts very strangely. I would like to know if I'm doing something wrong. But I think there is a real possibility that it's a bug.
The code pasted below isn't my real code. It is a simplified version I created for this post, using the Northwind database.
A little background: I have a method that takes an IQueryable of Product and a "filter object" (which I will describe in a minute). It should run some "Where" extension methods on the IQueryable, based on the "filter object", and then return the IQueryable.
The so-called "filter object" is a System.Collections.Generic.List of an anonymous type of this structure: { column = fieldEnum, id = int }
The fieldEnum is an enum of the different columns of the Products table that I would possibly like to use for the filtering.
Instead of explaining further how my code works, it's easier if you just take a look at it. It's simple to follow.
enum filterType { supplier = 1, category }
public IQueryable<Product> getIQueryableProducts()
{
NorthwindDataClassesDataContext db = new NorthwindDataClassesDataContext();
IQueryable<Product> query = db.Products.AsQueryable();
//this section is just for the example. It creates a Generic List of an Anonymous Type
//with two objects. In real life I get the same kind of collection, but it isn't hard coded like here
var filter1 = new { column = filterType.supplier, id = 7 };
var filter2 = new { column = filterType.category, id = 3 };
var filterList = (new[] { filter1 }).ToList();
filterList.Add(filter2);
foreach(var oFilter in filterList)
{
switch (oFilter.column)
{
case filterType.supplier:
query = query.Where(p => p.SupplierID == oFilter.id);
break;
case filterType.category:
query = query.Where(p => p.CategoryID == oFilter.id);
break;
default:
break;
}
}
return query;
}
So here is an example. Let's say the List contains two items of this anonymous type, { column = fieldEnum.Supplier, id = 7 } and { column = fieldEnum.Category, id = 3}.
After running the code above, the underlying SQL query of the IQueryable object should contain:
WHERE SupplierID = 7 AND CategoryID = 3
But in reality, after the code runs the SQL that gets executed is
WHERE SupplierID = 3 AND CategoryID = 3
I tried defining query as a property and setting a breakpoint on the setter, thinking I could catch what's changing it when it shouldn't be. But everything was supposedly fine. So instead I just checked the underlying SQL after every command. I realized that the first Where runs fine, and query stays fine (meaning SupplierID = 7) until right after the foreach loop runs the second time. Right after oFilter becomes the second anonymous type item, and not the first, the 'query' SQL changes to Supplier = 3. So what must be happening here under-the-hood is that instead of just remembering that Supplier should equal 7, LINQ to SQL remembers that Supplier should equal oFilter.id. But oFilter is a name of a single item of a foreach loop, and it means something different after it iterates.
I have only glanced at your question, but I am 90% sure that you should read the first section of On lambdas, capture, and mutability (which includes links to 5 similar SO questions) and all will become clear.
The basic gist of it is that the variable oFilter in your example has been captured in the closure by reference and not by value. That means that once the loop finishes iterating, the variable refers to the last item, so the value as evaluated at lambda execution time is the final one as well.
The cure is to insert a new variable inside the foreach loop whose scope is only that iteration rather than the whole loop:
foreach(var oFilter in filterList)
{
var filter = oFilter; // add this
switch (oFilter.column) // this doesn't have to change, but can for consistency
{
case filterType.supplier:
query = query.Where(p => p.SupplierID == filter.id); // use `filter` here
break;
Now each closure is over a different filter variable that is declared anew inside of each loop, and your code will run as expected.
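The capture behaviour can be reproduced without a database. The sketch below uses a for loop rather than foreach, because from C# 5 onward the foreach loop variable is created fresh for each iteration, so the original foreach form no longer shows the problem on current compilers; a for loop variable is still shared across iterations. ClosureDemo and its method names are hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ClosureDemo
{
    // Every lambda captures the same variable i, which is 3 when the lambdas finally run.
    public static List<int> SharedVariable()
    {
        var funcs = new List<Func<int>>();
        for (int i = 0; i < 3; i++)
            funcs.Add(() => i);
        return funcs.Select(f => f()).ToList();
    }

    // Copying into a variable declared inside the loop body gives each lambda its own value.
    public static List<int> CopyPerIteration()
    {
        var funcs = new List<Func<int>>();
        for (int i = 0; i < 3; i++)
        {
            int copy = i;
            funcs.Add(() => copy);
        }
        return funcs.Select(f => f()).ToList();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(",", SharedVariable()));   // 3,3,3
        Console.WriteLine(string.Join(",", CopyPerIteration())); // 0,1,2
    }
}
```

The same reasoning applies to the Where lambdas in the question: they are evaluated when the query executes, not when they are added.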
Working as designed. The issue you are confronting is the clash between lexical closure and mutable variables.
What you probably want to do is
foreach(var oFilter in filterList)
{
var o = oFilter;
switch (o.column)
{
case filterType.supplier:
query = query.Where(p => p.SupplierID == o.id);
break;
case filterType.category:
query = query.Where(p => p.CategoryID == o.id);
break;
default:
break;
}
}
When compiled to IL, the variable oFilter is declared once and used multiple times. What you need is a variable declared separately for each use of that variable within a closure, which is what o is now there for.
While you're at it, get rid of that bastardized Hungarian notation :P.
I think this is the clearest explanation I've ever seen: http://blogs.msdn.com/ericlippert/archive/2009/11/12/closing-over-the-loop-variable-considered-harmful.aspx:
Basically, the problem arises because we specify that the foreach loop is a syntactic sugar for
{
IEnumerator<int> e = ((IEnumerable<int>)values).GetEnumerator();
try
{
int m; // OUTSIDE THE ACTUAL LOOP
while(e.MoveNext())
{
m = (int)(int)e.Current;
funcs.Add(()=>m);
}
}
finally
{
if (e != null) ((IDisposable)e).Dispose();
}
}
If we specified that the expansion was
try
{
while(e.MoveNext())
{
int m; // INSIDE
m = (int)(int)e.Current;
funcs.Add(()=>m);
}
then the code would behave as expected.
The problem is that you're not appending to the query, you're replacing it each time through the foreach statement.
You want something like the PredicateBuilder - http://www.albahari.com/nutshell/predicatebuilder.aspx

Resources