Optimizing away OrderBy() when using Any() - linq

So I have a fairly standard LINQ-to-Object setup.
var query = expensiveSrc.Where(x=> x.HasFoo)
.OrderBy(y => y.Bar.Count())
.Select(z => z.FrobberName);
// ...
if (!condition && !query.Any())
return; // seems to enumerate and sort entire enumerable
// ...
foreach (var item in query)
// ...
This enumerates everything twice. Which is bad.
var queryFiltered = expensiveSrc.Where(x=> x.HasFoo);
var query = queryFiltered.OrderBy(y => y.Bar.Count())
.Select(z => z.FrobberName);
if (!condition && !queryFiltered.Any())
return;
// ...
foreach (var item in query)
// ...
Works, but is there a better way?
Would there be any non-insane way to "enlighten" Any() to bypass the non-required operations? I think I remember this sort of optimisation going into EduLinq.

Why not just get rid of the redundant:
if (!query.Any())
return;
It really doesn't seem to be serving any purpose - even without it, the body of the foreach won't execute if the query yields no results. So with the Any() check in, you save nothing in the fast path, and enumerate twice in the slow path.
On the other hand, if you must know if there were any results found after the end of the loop, you might as well just use a flag:
bool itemFound = false;
foreach (var item in query)
{
    itemFound = true;
    // ... Rest of the loop body goes here.
}
if (itemFound)
{
    // ...
}
Or you could use the enumerator directly if you're really concerned about the redundant flag-setting in the loop body:
using (var erator = query.GetEnumerator())
{
    bool itemFound = erator.MoveNext();
    if (itemFound)
    {
        do
        {
            // Do something with erator.Current;
        } while (erator.MoveNext());
    }

    // Do something with itemFound
}

There is not much information that can be extracted from an enumerable, so maybe it's better to turn the query into an IQueryable? This Any extension method walks down the expression tree, skipping all operations that cannot affect the result, then compiles the remaining branch into a delegate that can be called to obtain an optimized IQueryable. The standard Any method is applied to it explicitly to avoid recursion. I'm not sure about corner cases, and maybe it makes sense to cache the compiled queries, but with simple queries like yours it seems to work.
static class QueryableHelper {
    public static bool Any<T>(this IQueryable<T> source) {
        var e = source.Expression;
        while (e is MethodCallExpression) {
            var mce = e as MethodCallExpression;
            switch (mce.Method.Name) {
                case "Select":
                case "OrderBy":
                case "ThenBy":
                    break;
                default:
                    goto dun;
            }
            e = mce.Arguments.First();
        }
    dun:
        var d = Expression.Lambda<Func<IQueryable<T>>>(e).Compile();
        return Queryable.Any(d());
    }
}
Queries themselves must be modified like this:
var query = expensiveSrc.AsQueryable()
.Where(x=> x.HasFoo)
.OrderBy(y => y.Bar.Count())
.Select(z => z.FrobberName);
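With the query built over an IQueryable, the emptiness check can then use the helper. A sketch of how that might look (calling the helper explicitly here just sidesteps any ambiguity between QueryableHelper.Any and Queryable.Any; how it resolves as an extension method depends on which namespaces are in scope):

if (!condition && !QueryableHelper.Any(query))
    return; // Where() still runs, but the OrderBy()/Select() calls are stripped first, per the helper above
// ...
foreach (var item in query)
    // ...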

Would there be any non-insane way to "enlighten" Any() to bypass the non-required operations? I think I remember this sort of optimisation going into EduLinq.
Well I'm not going to ignore any question which mentions Edulinq :)
In this case, Edulinq might well be faster than LINQ to Objects, as its OrderBy implementation is as lazy as it can be - it only sorts as much as it needs to in order to retrieve the elements it returns.
However, fundamentally it still has to read the whole sequence in before it returns anything. After all, the last element in the sequence could be the first one which has to be returned.
If you're in control of the whole stack, you could make Any() detect that it's being called on your "known" IOrderedEnumerable implementation, and go straight to the original source. Note that this does create a change in the observed behaviour though - if iterating over the whole sequence throws an exception (or has any other side effect) then that side-effect would be lost by the optimization. You could argue that's okay, of course - what counts as "valid" optimization in LINQ is a decidedly tricky area.
One other possibility which is pretty horrible but which would solve this particular problem would be to make the iterator returned from the IOrderedEnumerable just take the first value of MoveNext() from the source. That's enough for the normal implementation of Any, and at that point we don't need to know what the first element is. We could defer the actual sorting until the first time the Current property is used.
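For illustration only, a rough sketch of that idea might look like the following (purely hypothetical; this is not how Edulinq or LINQ to Objects is implemented, the enclosing IOrderedEnumerable wrapper is omitted, and it peeks at the source with Any() rather than reusing a single enumerator, so the source is iterated twice when results are actually consumed):

using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

// Hypothetical: an ordered enumerator whose first MoveNext() only checks
// whether the source has any elements; the sort is deferred until Current
// (or a second MoveNext()) is actually needed.
sealed class LazilySortedEnumerator<TSource, TKey> : IEnumerator<TSource>
{
    private readonly IEnumerable<TSource> _source;
    private readonly Func<TSource, TKey> _keySelector;
    private List<TSource> _sorted;   // filled on demand
    private int _index = -1;

    public LazilySortedEnumerator(IEnumerable<TSource> source, Func<TSource, TKey> keySelector)
    {
        _source = source;
        _keySelector = keySelector;
    }

    public bool MoveNext()
    {
        if (_index < 0)
        {
            // First call: Any() only needs to know whether something is there.
            if (!_source.Any())
                return false;
            _index = 0;
            return true;
        }
        EnsureSorted();
        return ++_index < _sorted.Count;
    }

    public TSource Current
    {
        get
        {
            EnsureSorted();          // the real sorting happens here
            return _sorted[_index];
        }
    }

    private void EnsureSorted()
    {
        if (_sorted == null)
            _sorted = _source.OrderBy(_keySelector).ToList();
    }

    object IEnumerator.Current => Current;
    public void Reset() => throw new NotSupportedException();
    public void Dispose() { }
}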
That's a pretty special-case optimization though - and one which I'd be wary to implement. I think Ani's approach is the better one - just use the fact that iterating over query using foreach will never go into the body of the loop if the query results are empty.

Edit (revised): This answer addresses the issue of the query executing twice, which I believe is the key issue. See below for why:
Making Any() smarter is something that only the Linq implementers can do, IMO... Or it would be some dirty adventure using reflection.
Using a class as shown below, you can cache the output of the original enumerable so that it can be enumerated twice while the source query runs only once:
public class CachedEnumerable<T>
{
    public CachedEnumerable(IEnumerable<T> enumerable)
    {
        _source = enumerable.GetEnumerator();
    }

    public IEnumerable<T> Enumerate()
    {
        int itemIndex = 0;
        while (true)
        {
            if (itemIndex < _cache.Count)
            {
                yield return _cache[itemIndex];
                itemIndex++;
                continue;
            }
            if (!_source.MoveNext())
                yield break;
            var current = _source.Current;
            _cache.Add(current);
            yield return current;
            itemIndex++;
        }
    }

    private List<T> _cache = new List<T>();
    private IEnumerator<T> _source;
}
This way you keep the lazy aspect of LINQ, and keep the code readable and generic. It will be slower than using IEnumerator<> directly. There are lots of opportunities to extend and optimize this class, such as a policy for discarding old items, getting rid of the coroutine, etc. But that is beyond the scope of this question, I think.
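For example, a hypothetical usage with the original query might look like this (assuming FrobberName is a string):

var cached = new CachedEnumerable<string>(
    expensiveSrc.Where(x => x.HasFoo)
                .OrderBy(y => y.Bar.Count())
                .Select(z => z.FrobberName));

if (!condition && !cached.Enumerate().Any())
    return; // the sort runs here, once; the first item lands in the cache

foreach (var item in cached.Enumerate()) // replays the cache, then continues from the source
{
    // ...
}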
Oh, and the class is not thread safe as it is now. This wasn't asked, but I can imagine people trying. I think this could be easily added, if the source enumerable has no thread affinity.
Why would this be optimal?
Let's consider two possibilities: the enumeration either contains elements or it does not.
If it contains elements, this approach is optimal, as the query is only run once.
If it contains no elements, you would be tempted to eliminate the OrderBy and Select parts of your query, as they add no value. But if there are zero items after the Where() clause, there are zero items to sort, which will cost zero time (well, almost). The same goes for the Select() clause.
What if this is not fast enough yet? In that case my strategy would be to bypass LINQ. Now, I really love LINQ, but its elegance comes at a price. So for every hundred uses of LINQ, there will typically be one or two computations that must execute really fast, which I write with good old for loops and lists. Part of mastering a technology is recognizing where it is not appropriate. LINQ is no exception to that rule.

Try this:
var items = expensiveSrc.Where(x=> x.HasFoo)
.OrderBy(y => y.Bar.Count())
.Select(z => z.FrobberName).ToList();
// ...
if (!condition && items.Count == 0)
return; // Just check the count
// ...
foreach (var item in items)
// ...
The query is executed just once.

but I've lost the streaming/lazy loading that's half the point of linq
Lazy loading (deferred execution) and two LINQ queries with disparate results cannot be optimized (reduced) into a single query execution.

Why are you not using .ToArray()?
var query = expensiveSrc.Where(x=> x.HasFoo)
.OrderBy(y => y.Bar.Count())
.Select(z => z.FrobberName).ToArray();
If there are no elements, the sorting and selecting should not add much overhead. If you are sorting, you need a buffer to store the data in anyway, so the extra cost of .ToArray() should not be much.
If you decompile the OrderedEnumerable class, you will find that it builds an int[] array of references internally, so by using .ToArray() (or .ToList()) you just create one more reference array.
BUT
If expensiveSrc comes from a database, other strategies could be better. If the ordering can be done in the database, materializing the results locally would add quite a lot of overhead, because the data is then stored twice.

Related

How Should Complex ReQL Queries be Composed?

Are there any best practices or ReQL features that help with composing complex ReQL queries?
In order to illustrate this, imagine a fruits table. Each document has the following structure.
{
    "id": 123,
    "name": "name",
    "colour": "colour",
    "weight": 5
}
If we wanted to retrieve all green fruits, we might use the following query.
r
.db('db')
.table('fruits')
.filter({colour: 'green'})
However, in more complex cases, we might wish to use a variety of complex command combinations. In such cases, bespoke queries could be written for each case, but this could be difficult to maintain and could violate the Don't Repeat Yourself (DRY) principle. Instead, we might wish to write bespoke queries which could chain custom commands, thus allowing complex queries to be composed in a modular fashion. This might take the following form.
r
.db('db')
.table('fruits')
.custom(component)
The component could be a function which accepts the last entity in the command chain as its argument and returns something, as follows.
function component(chain)
{
    return chain
        .filter({colour: 'green'});
}
This is not so much a feature proposal as an illustration of the problem of complex queries, although such a feature does seem intuitively useful.
Personally, my own efforts in resolving this problem have involved the creation of a compose utility function. It takes an array of functions as its main argument. Each function is called, passed a part of the query chain, and is expected to return an amended version of the query chain. Once the iteration is complete, a composition of the query components is returned. This can be viewed below.
function compose(queries, parameters)
{
    if (queries.length > 1)
    {
        let composition = queries[0](parameters);
        for (let index = 1; index < queries.length; index++)
        {
            let query = queries[index];
            composition = query(composition, parameters);
        }
        return composition;
    }
    else
    {
        throw 'Must be two or more queries.';
    }
}

function startQuery()
{
    return RethinkDB;
}

function filterQuery1(query)
{
    return query.filter({name: 'Grape'});
}

function filterQuery2(query)
{
    return query.filter({colour: 'Green'});
}

function filterQuery3(query)
{
    return query.orderBy(RethinkDB.desc('created'));
}

let composition = compose([startQuery, filterQuery1, filterQuery2, filterQuery3]);
composition.run(connection);
It would be great to know whether something like this exists, whether there are best practices to handle such cases, or whether this is an area where ReQL could benefit from improvements.
In the RethinkDB docs, they state it clearly: all ReQL queries are chainable.
Queries are constructed by making function calls in the programming language you already know. You don't have to concatenate strings or construct specialized JSON objects to query the database. All ReQL queries are chainable. You begin with a table and incrementally chain transformers to the end of the query using the . operator.
You do not have to compose another layer that just obscures your code; it makes the code more difficult to read and is ultimately unnecessary.
The simple way is to assign the RethinkDB query and filter to variables; any time you need to add more complex logic, add it directly to these variables, then run() the query once it is complete.
Suppose I have to search a list of products with different filter inputs and paginate the results. The following code is written in JavaScript (simple code for illustration only).
let sorterDirection = 'asc';
let sorterColumnName = 'created_date';
var buildFilter = r.row('app_id').eq(appId).and(r.row('status').eq('public'))
// if there is no condition to start up, you could use r.expr(true)
// append every filter into the buildFilter var if they are positive
if (escapedKeyword != "") {
    buildFilter = buildFilter.and(r.row('name').default('').downcase().match(escapedKeyword))
}
// you may have different filter to add, do the same to append them into buildFilter.
// start to make query
let query = r.table('yourTableName').filter(buildFilter);
query.orderBy(r[sorterDirection](sorterColumnName))
.slice(pageIndex * pageSize, (pageIndex * pageSize) + pageSize).run();

Using LINQ to change values in collection

I think I'm not understanding LINQ well. I want to do:
foreach (MyObject myObject in myObjectCollection)
{
    myObject.MyProperty = newValue;
}
Just change all values for a property in all elements of my collection.
Using LINQ, wouldn't it be done this way?
myObjectCollection.Select(myObject => myObject.MyProperty = newValue)
It doesn't work. The property value is not changed. Why?
Edit:
Sorry, guys. Certainly, foreach is the right way. But, in my case, I must repeat the foreach over many collections, and I didn't want to repeat the loop. So, finally, I have found an 'intermediate' solution, the ForEach method, quite similar to Select:
myObjectCollection.ForEach(myObject => myObject.MyProperty = newValue)
Anyway may be it's not as clear as the more simple:
foreach (MyObject myObject in myObjectCollection) myObject.MyProperty = newValue;
First off, this is not a good idea. See below for arguments against it.
It doesn't work. The property value is not changed. Why?
It doesn't work because Select() doesn't actually iterate through the collection until you enumerate its results, and it requires an expression that evaluates to a value.
If you make your expression return a value, and add something that will fully evaluate the query, such as ToList(), to the end, then it will "work", ie:
myObjectCollection.Select(myObject => { myObject.MyProperty = newValue; return myObject;}).ToList();
That being said, ToList() has some disadvantages - mainly, it's doing a lot of extra work (to create a List<T>) that's not needed, which adds a large cost. Avoiding it would require enumerating the collection:
foreach(var obj in myObjectCollection.Select(myObject => { myObject.MyProperty = newValue; return myObject; }))
{ }
Again, I wouldn't recommend this. At this point, the Select option is far uglier, more typing, etc. It's also a violation of the expectations involved with LINQ - LINQ is about querying, which suggests there shouldn't be side effects when using LINQ, and the entire purpose here is to create side effects.
But then, at this point, you're better off (with less typing) doing it the "clear" way:
foreach (var obj in myObjectCollection)
{
obj.MyProperty = newValue;
}
This is shorter, very clear in its intent, and very clean.
Note that you can add a ForEach<T> extension method which performs an action on each object, but I would still recommend avoiding that. Eric Lippert wrote a great blog post about the subject that is worth a read: "foreach" vs "ForEach".
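For reference, a minimal sketch of such an extension might look like this (it is not part of LINQ or the BCL; the name and shape are just an illustration):

using System;
using System.Collections.Generic;

public static class EnumerableExtensions
{
    // Performs an action on each element; note this exists purely for side effects.
    public static void ForEach<T>(this IEnumerable<T> source, Action<T> action)
    {
        foreach (var item in source)
            action(item);
    }
}

// Usage: myObjectCollection.ForEach(o => o.MyProperty = newValue);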
As sircodesalot mentioned, you can't use LINQ to do something like that. Remember that LINQ is a querying language, which means all you can do with it is query. Making changes must be done with other logic.
What you could do, if you don't want to do it the first way and your collection is already a List<T> (not just an IEnumerable<T>), is use the List<T>.ForEach method to do what you're asking.
One other point I should mention is that the Select method projects each element into some specific shape and returns the result as an IEnumerable. So, if you wanted to grab only a specific property from a collection, that is what you would use it for. That's all it does.

How to combine collection of linq queries into a single sql request

Thanks for checking this out.
My situation is that I have a system where the user can create custom filtered views which I build into a LINQ query on the request. On the interface they want to see the counts of all the views they have created; pretty straightforward. I'm familiar with combining multiple queries into a single call, but in this case I don't know how many queries I have initially.
Does anyone know of a technique where this loop combines the count queries into a single query that I can then execute with a ToList() or FirstOrDefault()?
//TODO Performance this isn't good...
foreach (IMeetingViewDetail view in currentViews)
{
    view.RecordCount = GetViewSpecificQuery(view.CustomFilters).Count();
}
Here is an example of multiple queries combined as I'm referring to. This is two queries which I then combine into an anonymous projection resulting in a single request to the sql server.
IQueryable<EventType> eventTypes = _eventTypeService.GetRecords().AreActive<EventType>();
IQueryable<EventPreferredSetup> preferredSetupTypes = _eventPreferredSetupService.GetRecords().AreActive<EventPreferredSetup>();
var options = someBaseQuery.Select(x => new
{
    EventTypes = eventTypes.AsEnumerable(),
    PreferredSetupTypes = preferredSetupTypes.AsEnumerable()
}).FirstOrDefault();
Well, for performance considerations, I would change the interface from IEnumerable<T> to a collection that has a Count property. Both IList<T> and ICollection<T> have a count property.
This way, the collection object is keeping track of its size and you just need to read it.
If you really wanted to avoid the loop, you could redefine the RecordCount to be a lazy loaded integer that calls GetViewSpecificQuery to get the count once.
private int? _recordCount = null;
public int RecordCount
{
    get
    {
        if (_recordCount == null)
            _recordCount = GetViewSpecificQuery(view.CustomFilters).Count();
        return _recordCount.Value;
    }
}

FirstOrDefault() causes lazy loading or eager loading for linq to sql

What is the default behavior of FirstOrDefault() when used with LINQ to SQL?
For example:
int value = (from p in context.tableX
             select p.Id).FirstOrDefault(); // will the value be initialized here, or
if (value > 0)                              // will the query be executed here????
{
    // do something
}
Thanks
What is the default behavior of FirstOrDefault() when used with LINQ to SQL?
It eagerly computes the result of the query. The easiest way to reason about this is to realize that the return type is int, not IEnumerable<int>; an IEnumerable<int> can defer execution until GetEnumerator is called, but an int has no such mechanism.
The phrasing of your question suggests that you're also asking if there is a way to change this behavior. There is, but not directly through FirstOrDefault or any mechanisms within LINQ. But you can defer using Lazy<T>. No compiler handy, so forgive me if this doesn't compile but it should get you very close.
Lazy<int> value = new Lazy<int>(
    () => {
        var query =
            from p in context.tableX
            select p.Id;
        var result = query.FirstOrDefault();
        return result;
    }
);
if(value.Value > 0) { // execution will be deferred until here
//
}
All standard LINQ operators which return a single, non-enumerable result are executed immediately at the point where the query is declared. So FirstOrDefault, Count, Sum and other operators which return a single value are executed immediately.
Here is a nice MSDN article: Classification of Standard Query Operators by Manner of Execution.
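To illustrate the distinction, a small LINQ to Objects sketch (the same classification applies to LINQ to SQL):

var numbers = new List<int> { 1, 2, 3 };

var deferred = numbers.Where(n => n > 1);   // deferred: nothing is enumerated yet
int count    = numbers.Count(n => n > 1);   // immediate: executes now, count == 2
int first    = numbers.FirstOrDefault();    // immediate: executes now, first == 1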
Eager loading!
If you think about it, it just returns a plain int - an int can't possibly represent "a way to go and get an int". (That's what Lazy<int> is for...)
It becomes eager loading when you use such extension methods on the enumerable result. If you don't use those extension methods, it will be lazy loading, and you won't actually fetch the values until you enumerate the LINQ result.

What's better for creating distinct data structures: HashSet or Linq's Distinct()?

I'm wondering whether I can get a consensus on which method is the better approach to creating a distinct set of elements: a C# HashSet or using IEnumerable's .Distinct(), which is a Linq function?
Let's say I'm looping through query results from the DB with DataReader, and my options are to add the objects I construct to a List<SomeObject> or to a HashSet<SomeObject>. With the List option, I would wind up having to do something like:
myList = myList.Distinct().ToList<SomeObject>();
With the HashSet, my understanding is that adding elements to it takes care of the non-duplication by itself, assuming you've overridden the GetHashCode() and Equals() methods in SomeObject. I'm concerned mainly with the risks and performance aspects of the options.
Thanks.
Anthony Pegram has said it best. Use the right tool for the job. I say this because a Distinct or HashSet isn't that different when it comes to performance. Use a HashSet when the collection should always hold only distinct items; it also tells the programmer that you can't add duplicates to it. Use a normal List<T> and .Distinct() on it when you will have to add duplicates and remove duplicates later. The intention matters.
In general,
a) a HashSet may not do any good if you're adding new objects from the db and you haven't specified a custom Equals of your own. Every object from the db can be a new instance for your hashset (if you are just new-ing them up), and that will lead to duplicates in the collection. In that case use a normal List<T>. (See the sketch after this list for what defining your own equality can look like.)
b) If you do have an equality comparer defined for the hashset, and your collection should always hold only distinct objects, use a hashset.
c) If you do have an equality comparer defined for the hashset, and you want only distinct objects from the db but the collection need not always hold only distinct objects (i.e. duplicates need to be added later), a faster approach is to get the items from the db into a hashset and then return a regular list from that hashset.
d) The best thing you can do is give the task of removing duplicates to the database; that's the right tool, and that's first class!
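For item (a), a minimal sketch of what specifying your own equality can look like (SomeObject and its members are made up for illustration):

// Hypothetical SomeObject with value-based equality, so HashSet<SomeObject>
// and Distinct() can recognize duplicates coming back from the database.
public class SomeObject
{
    public int Id { get; set; }
    public string Name { get; set; }

    public override bool Equals(object obj)
    {
        var other = obj as SomeObject;
        return other != null && other.Id == Id && other.Name == Name;
    }

    public override int GetHashCode()
    {
        return Id.GetHashCode() ^ (Name ?? string.Empty).GetHashCode();
    }
}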
As for performance differences, in my testing I always found HashSet to be faster, but only marginally. That's to be expected, considering that with the List approach you have to first add everything and then call Distinct on it.
Test method: Starting with two general functions,
public static void Benchmark(Action method, int iterations = 10000)
{
    Stopwatch sw = new Stopwatch();
    sw.Start();
    for (int i = 0; i < iterations; i++)
        method();
    sw.Stop();
    MsgBox.ShowDialog(sw.Elapsed.TotalMilliseconds.ToString());
}

public static List<T> Repeat<T>(this ICollection<T> lst, int count)
{
    if (count < 0)
        throw new ArgumentOutOfRangeException("count");

    var ret = Enumerable.Empty<T>();
    for (var i = 0; i < count; i++)
        ret = ret.Concat(lst);
    return ret.ToList();
}
Implementation:
var d = Enumerable.Range(1, 100).ToList().Repeat(100);
HashSet<int> hash = new HashSet<int>();

Benchmark(() =>
{
    hash.Clear();
    foreach (var item in d)
    {
        hash.Add(item);
    }
});
~3300 ms
var d = Enumerable.Range(1, 100).ToList().Repeat(100);
List<int> list = new List<int>();

Benchmark(() =>
{
    list.Clear();
    foreach (var item in d)
    {
        list.Add(item);
    }
    list = list.Distinct().ToList();
});
~5800 ms
A difference of 2.5 seconds is not bad for a list of 10000 objects when iterated another 10000 times. For normal cases the difference will be hardly noticeable.
Possibly the best approach for you, with your current design:
var d = Enumerable.Range(1, 100).ToList().Repeat(100);
HashSet<int> hash = new HashSet<int>();
List<int> list = new List<int>();

Benchmark(() =>
{
    hash.Clear();
    foreach (var item in d)
    {
        hash.Add(item);
    }
    list = hash.ToList();
});
~3300 ms
There isn't any significant difference, see.
Partly unrelated - after posting this answer, I was curious to know what the best approach is to removing duplicates from a normal list.
var d = Enumerable.Range(1, 100).ToList().Repeat(100);
HashSet<int> hash = new HashSet<int>();
List<int> list = new List<int>();

Benchmark(() =>
{
    hash = new HashSet<int>(d);
});
~3900 ms
var d = Enumerable.Range(1, 100).ToList().Repeat(100);
List<int> list = new List<int>();

Benchmark(() =>
{
    list = d.Distinct().ToList();
});
~3200 ms
Here the right tool, Distinct, is faster than the hackish HashSet! Perhaps it's the overhead of creating a hash set.
I have tested with various other combinations like reference types, without duplicates in original list etc. The results are consistent.
What's better is whatever is most expressive in describing your intention. The internal implementation details are more or less going to be the same; the difference is "who's writing the code?"
If your intention is to create from the ground up a distinct collection of items from a source that is not a collection of said items, I would argue for the HashSet<T>. You have to create the item, you have to build the collection, you might as well build the right one from the beginning.
Otherwise, if you already have a collection of items and you want to eliminate duplicates, I would argue for invoking Distinct(). You already have a collection, you just want an expressive way to get the distinct items out of it.
"Better" is a tricky word to use - it can mean so many different things to different people.
For readability, I would go for Distinct() as I personally find this more comprehensible.
For performance, I suspect a hand-crafted HashSet implementation might perform mildly quicker - but I doubt it would be very different as the internal implementation of Distinct will no doubt itself use some form of hashing.
For what I think of as "best" implementation... I think you should use Distinct but somehow push this down to the database layer - i.e. change the underlying database SELECT before you fill the DataReader.
For large collections HashSet is likely to be faster. It relies on the hashcode of the objects to quickly determine whether or not an element already exists in the set.
In practice, it (most likely) won't matter (but you should measure if you care).
I instinctively guessed at first that HashSet would be faster, because of the fast hash checking it uses. However, I looked up the current (4.0) implementation of Distinct in the reference sources, and it uses a similar Set class (which also relies on hashing) under the covers. Conclusion: there is no practical performance difference.
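Conceptually, Distinct() behaves roughly like the following (a simplified sketch, not the actual framework source):

static IEnumerable<T> DistinctSketch<T>(IEnumerable<T> source)
{
    var seen = new HashSet<T>();
    foreach (var item in source)
    {
        if (seen.Add(item)) // Add returns false when the item is already present
            yield return item;
    }
}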
For your case, I would go with .Distinct for readability - it clearly conveys the intent of the code. However, I agree with one of the other answers that you should probably perform this operation in the DB if possible.
If you're looping through the results of a DbReader, adding your results to a HashSet would be better than adding them to a List and then doing a Distinct on that. You would save one iteration. (Distinct internally uses a HashSet.)
The implementation of Distinct may use HashSet. Take a look at Jon Skeet's Edulinq implementation.
