Deferred execution of List<T> using LINQ

Suppose I have a List<T> with 1000 items in it.
I'm then passing this to a method that filters this List.
As it drops through the various cases (there could be 50 of them, for example), the List<T> may have up to 50 different LINQ Where() operations performed on it.
I'm interested in this running as quickly as possible. Therefore, I don't want this List<T> filtered each time a Where() is performed on it.
Essentially I need it to defer the actual manipulation of the List<T> until all the filters have been applied.
Is this deferral done natively by the compiler? Does it only happen when I call .ToList() on the IEnumerable that List<T>.Where() returns? Or should I perform the Where() operations on X (where X = List.AsQueryable())?
Hope this makes sense.

Yes, deferred execution is supported natively. Each time you apply a query operator or lambda expression to your list, the query object just stores the expression; nothing is executed until you enumerate the query, for example by calling .ToList() on it.

Each call to Where will create a new object which knows about your filter and the sequence it's being called on.
When this new object is asked for a value (and I'm being deliberately fuzzy between an iterator and an iterable here) it will ask the original sequence for the next value, check the filter, and either return the value or iterate back, asking the original sequence for the next value etc.
So if you call Where 50 times (as in list.Where(...).Where(...).Where(...)), you end up with something which needs to go up and down the call stack at least 50 times for each item returned. How much performance impact will that have? I don't know: you should measure it.
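The same layered laziness is easy to observe in Java streams (used here purely as an illustrative analogue of the C# pipeline, since counting invocations is concise in Java). This sketch shows that building the chain executes nothing, and that each surviving element passes through every filter:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

public class DeferredDemo {
    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        Stream<Integer> pipeline = List.of(1, 2, 3, 4, 5).stream()
                .filter(n -> { calls.incrementAndGet(); return n % 2 == 0; })
                .filter(n -> { calls.incrementAndGet(); return n > 2; });

        // Building the pipeline executed nothing:
        System.out.println(calls.get()); // 0

        // The terminal operation pulls items through both filters:
        System.out.println(pipeline.count()); // 1 (only 4 survives both filters)
        System.out.println(calls.get()); // 7 (5 first-filter + 2 second-filter calls)
    }
}
```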
One possible alternative is to build an expression tree and then compile it down into a delegate at the end, and then call Where. This would certainly be a bit more effort, but it could end up being more efficient. Effectively, it would let you change this:
list.Where(x => x.SomeValue == 1)
    .Where(x => x.SomethingElse != null)
    .Where(x => x.FinalCondition)
    .ToList()
into
list.Where(x => x.SomeValue == 1 && x.SomethingElse != null && x.FinalCondition)
    .ToList()
If you know that you're just going to be combining a lot of "where" filters together, this may end up being more efficient than going via IQueryable<T>. As ever, check performance of the simplest possible solution before doing something more complicated though.
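For comparison in Java terms (a sketch only, not the C# expression-tree approach described above), combining several filters ahead of time is what Predicate.and provides, so the stream runs a single filter stage instead of a chain:

```java
import java.util.List;
import java.util.function.Predicate;

public class CombinedFilter {
    public static void main(String[] args) {
        Predicate<Integer> even = n -> n % 2 == 0;
        Predicate<Integer> positive = n -> n > 0;
        Predicate<Integer> small = n -> n < 100;

        // One combined predicate means one filter stage instead of three.
        Predicate<Integer> combined = even.and(positive).and(small);

        List<Integer> result = List.of(-2, 3, 8, 150, 42).stream()
                .filter(combined)
                .toList();
        System.out.println(result); // [8, 42]
    }
}
```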

There is so much failure in the question and the comments. The answers are good but don't hit hard enough to break through the failure.
Suppose you have a list and a query.
List<T> source = new List<T>(){ /*1000 items*/ };
IEnumerable<T> query = source.Where(filter1);
query = query.Where(filter2);
query = query.Where(filter3);
...
query = query.Where(filter10);
Is [lazy evaluation] done natively by the compiler?
No. Lazy evaluation is due to the implementation of Enumerable.Where
This method is implemented by using deferred execution. The immediate return value is an object that stores all the information that is required to perform the action. The query represented by this method is not executed until the object is enumerated either by calling its GetEnumerator method directly or by using foreach in Visual C# or For Each in Visual Basic.
"what speed penalty there is on calling List.AsQueryable().ToList()"
Don't call AsQueryable, you only need to use Enumerable.Where.
"thus won't prevent a 50-calls-deep call stack"
Depth of call stack is much much less important than having a highly effective filter first. If you can reduce the number of elements early, you reduce the number of method calls later.
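That ordering point can be made concrete. This Java sketch (illustrative only; the filters are hypothetical) counts how often the downstream filter runs depending on whether the highly selective filter comes first:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;

public class FilterOrder {
    public static void main(String[] args) {
        // Selective filter first: the counted downstream filter sees only 10 elements.
        AtomicInteger selectiveFirst = new AtomicInteger();
        IntStream.range(0, 1000)
                .filter(n -> n % 100 == 0) // keeps 10 of 1000 elements
                .filter(n -> { selectiveFirst.incrementAndGet(); return n % 2 == 0; })
                .count();

        // Loose filter first: the counted filter runs on every element.
        AtomicInteger looseFirst = new AtomicInteger();
        IntStream.range(0, 1000)
                .filter(n -> { looseFirst.incrementAndGet(); return n % 2 == 0; })
                .filter(n -> n % 100 == 0)
                .count();

        System.out.println(selectiveFirst.get()); // 10
        System.out.println(looseFirst.get());     // 1000
    }
}
```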

Java-8 stream expression to 'OR' several enum values together

I am aggregating a bunch of enum values (different from the ordinal values) in a foreach loop.
int output = 0;
for (TestEnum testEnum : setOfEnums) {
    output |= testEnum.getValue();
}
Is there a way to do this in streams API?
If I use a lambda like this in a Stream<TestEnum> :
setOfEnums.stream().forEach(testEnum -> (output |= testEnum.getValue());
I get a compile time error that says, 'variable used in lambda should be effectively final'.
Predicate represents a boolean-valued function; what you need here is the reduce method of the stream to aggregate the enum values.
Assuming you have a HashSet named setOfEnums:
final int initialValue = 0;
int output = setOfEnums.stream()
        .map(TestEnum::getValue)
        .reduce(initialValue, (e1, e2) -> e1 | e2);
You need to reduce the stream of enums like this:
int output = Arrays.stream(TestEnum.values())
        .mapToInt(TestEnum::getValue)
        .reduce(0, (acc, value) -> acc | value);
I like the recommendations to use reduction, but perhaps a more complete answer would illustrate why it is a good idea.
In a lambda expression, you can reference variables like output that are in scope where the lambda expression is defined, but you cannot modify the values. The reason for that is that, internally, the compiler must be able to implement your lambda, if it chooses to do so, by creating a new function with your lambda as its body. The compiler may choose to add parameters as needed so that all of the values used in this generated function are available in the parameter list. In your case, such a function would definitely have the lambda's explicit parameter, testEnum, but because you also reference the local variable output in the lambda body, it could add that as a second parameter to the generated function. Effectively, the compiler might generate this function from your lambda:
private void generatedFunction1(TestEnum testEnum, int output) {
    output |= testEnum.getValue();
}
As you can see, the output parameter is a copy of the output variable used by the caller, and the OR operation would only be applied to the copy. Since the original output variable wouldn't be modified, the language designers decided to prohibit modification of values passed implicitly to lambdas.
To get around the problem in the most direct way, setting aside for the moment that the use of reduction is a far better approach, you could wrap the output variable in a wrapper (e.g. an int[] array of size 1, or an AtomicInteger). The wrapper's reference would be passed by value to the generated function, and since you would now update the contents of output, not the value of output, output remains effectively final, so the compiler won't complain. For example:
AtomicInteger output = new AtomicInteger();
setOfEnums.stream().forEach(testEnum -> output.set(output.get() | testEnum.getValue()));
or, since we're using AtomicInteger, we may as well make it thread-safe in case you later choose to use a parallel Stream,
AtomicInteger output = new AtomicInteger();
setOfEnums.stream().forEach(testEnum -> output.getAndUpdate(prev -> prev | testEnum.getValue()));
Now that we've gone over an answer that most resembles what you asked about, we can talk about the superior solution of using reduction, that other answers have already recommended.
There are two kinds of reduction offered by Stream: stateless reduction (reduce()) and stateful reduction (collect()). To visualize the difference, consider a conveyor belt delivering hamburgers, where your goal is to collect all of the hamburger patties into one big hamburger. With stateful reduction, you start with a new hamburger bun, collect the patty out of each hamburger as it arrives, and add it to the stack of patties in the bun you set up to collect them. In stateless reduction, you start with an empty hamburger bun (called the "identity", since that empty bun is what you end up with if the conveyor belt is empty), and as each hamburger arrives on the belt, you make a copy of the previously accumulated burger and add the patty from the one that just arrived, discarding the previous accumulated burger.
The stateless reduction may seem like a huge waste, but there are cases when copying the accumulated value is very cheap. One such case is when accumulating primitive types -- primitive types are very cheap to copy, so stateless reduction is ideal when crunching primitives in applications such as summing, ORing, etc.
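To make the two kinds concrete, here is a small sketch contrasting stateless reduce with mutable collect for the OR-accumulation case (the int[1] container is just a stand-in for any mutable accumulator):

```java
import java.util.stream.IntStream;

public class ReduceVsCollect {
    public static void main(String[] args) {
        // Stateless reduction: each step combines the accumulated value
        // and the next element into a brand-new value.
        int viaReduce = IntStream.of(1, 2, 4).reduce(0, (acc, v) -> acc | v);

        // Stateful (mutable) reduction: one container is updated in place.
        int[] viaCollect = IntStream.of(1, 2, 4).collect(
                () -> new int[1],           // supplier: the empty "bun"
                (acc, v) -> acc[0] |= v,    // accumulator
                (a, b) -> a[0] |= b[0]);    // combiner (used for parallel streams)

        System.out.println(viaReduce);     // 7
        System.out.println(viaCollect[0]); // 7
    }
}
```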
So, using stateless reduction, your example might become:
int output = setOfEnums.stream()
        .mapToInt(TestEnum::getValue) // or .mapToInt(testEnum -> testEnum.getValue())
        .reduce(0, (resultSoFar, value) -> resultSoFar | value);
Some points to ponder:
Your original for loop is probably faster than using streams, except perhaps if your set is very large and you use parallel streams. Don't use streams for the sake of using streams. Use them if they make sense.
In my first example, I showed the use of Stream.forEach(). If you ever find yourself creating a Stream and just calling forEach(), it is more efficient just to call forEach() on the collection directly.
You didn't mention what kind of Set you are using, but I hope you are using EnumSet<TestEnum>. Because it is implemented as a bit field, it performs much better (O(1)) than any other kind of Set for all operations, even copying. EnumSet.noneOf(TestEnum.class) creates an empty set, EnumSet.allOf(TestEnum.class) gives you a set of all enum values, etc.
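Putting these points together, a complete runnable sketch (the enum and its flag values are hypothetical, chosen so each constant owns one bit):

```java
import java.util.EnumSet;

public class EnumFlagDemo {
    // Hypothetical enum: each constant carries a distinct bit flag.
    enum TestEnum {
        READ(1), WRITE(2), EXECUTE(4);

        private final int value;
        TestEnum(int value) { this.value = value; }
        int getValue() { return value; }
    }

    public static void main(String[] args) {
        EnumSet<TestEnum> setOfEnums = EnumSet.of(TestEnum.READ, TestEnum.EXECUTE);
        int output = setOfEnums.stream()
                .mapToInt(TestEnum::getValue)
                .reduce(0, (acc, v) -> acc | v);
        System.out.println(output); // 5 (READ | EXECUTE)
    }
}
```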

Method returns one or more, should it return an Array when there is only one item?

Let's say we have a Ruby method like this:
# Pseudocode
def get(globbed)
a_items = Dir.glob(globbed)
a_items.length == 1 ? a_items.first : a_items
end
The method is meant to return a String containing information about the items in question. If there are many items, it will return an Array. The ternary makes it so that if there is only one item, it just returns that String.
What is the best practice here? Should such a method always return an Array even if there is only one item?
It should always return an array. Returning different things means that whatever method that calls this method would also have to have a condition. That is not good. Whenever you can get rid of a condition, you should. A condition should only be used as a last resort.
As a real example, the jQuery library built on top of JavaScript has the notion of selectors, expressed in the form $(...). This can result in multiple matching DOM objects, or a single one. But jQuery always returns an array-like collection even if only one DOM object matches. That keeps things simple.
It's always about use cases. You have to define what's the responsibility of that method and then decide what makes sense to do.
In this specific case, I would say that, unless there isn't any specific reason to return different types, you should choose the way that is simpler, both to test and to read.
Always returning an array in this case means clearer method interface:
"The method returns an array with the directory content"
instead of the more convoluted
"The method returns an array of directory content if there more than
one object, otherwise return the single object."
So, clarity first of all.
Also, testing becomes easier: the cyclomatic complexity of the routine is lower.
There are cases where uniformity of return types can't be achieved. Just think of Ruby's Array#index method: it wouldn't be possible to distinguish between "object not found" and "index 0" if this practice were applied there.
Conclusion: here I don't see any reason why to make the method more complex by distinguishing the two cases, so.. KISS.
Ruby provides blocks, yield, and iterators to make working with arrays and hashes easy, and it's good practice to use the same code whether there is one element or several. Example:
a_items.each { |element| file_treatment(element) }

C# Performance of LINQ vs. foreach iterator block

1) Do these generate the same byte code?
2) If not, is there any gain in using one over the other in certain circumstances?
// LINQ select statement
return from item in collection
select item.Property;
// foreach in an iterator block
foreach (item in collection)
yield return item.Property;
They don't generate the same code, but they boil down to the same thing: you get an object implementing IEnumerable<typeof(Property)>. The difference is that LINQ provides the iterator from its library (in this case most likely a WhereSelectArrayIterator or WhereSelectListIterator), whereas in the second example you generate an iterator block yourself that dissects a collection. An iterator block method is always, by way of compiler magic, compiled as a separate class implementing IEnumerable<typeof(yield)>, which you don't see but instantiate implicitly when you call the iterator block method.
Performance-wise, #1 should be slightly (but just slightly) faster for indexable collections, because when you loop through the resulting IEnumerable you go directly from your foreach into collection retrieval in an optimized LINQ iterator. In example #2 you go from foreach into your iterator block's foreach and from there into collection retrieval, and your performance depends mostly on how smart the compiler is at optimizing the yield logic. In any case, I would imagine that for any complex collection mechanism the cost of retrieval marginalizes this difference.
IMHO, I would always go with #1, if nothing else it saves me from having to write a separate method just for iterating.
No they don't generate the same byte code. The first one returns a pre-existing class in the framework. The second one returns a compiler generated state machine that returns the items from the collection. That state machine is very similar to the class that exists in the framework already.
I doubt there's much performance difference between the two. Both are doing very similar things in the end.

Ruby equivalent of C#'s 'yield' keyword, or, creating sequences without preallocating memory

In C#, you could do something like this:
public IEnumerable<int> GetItems()
{
    for (int i = 0; i < 10000000; i++) {
        yield return i;
    }
}
This returns an enumerable sequence of 10 million integers without ever allocating a collection in memory of that length.
Is there a way of doing an equivalent thing in Ruby? The specific example I am trying to deal with is the flattening of a rectangular array into a sequence of values to be enumerated. The return value does not have to be an Array or Set, but rather some kind of sequence that can only be iterated/enumerated in order, not by index. Consequently, the entire sequence need not be allocated in memory concurrently. In .NET, this is IEnumerable and IEnumerable<T>.
Any clarification on the terminology used here in the Ruby world would be helpful, as I am more familiar with .NET terminology.
EDIT
Perhaps my original question wasn't really clear enough -- I think the fact that yield has very different meanings in C# and Ruby is the cause of confusion here.
I don't want a solution that requires my method to use a block. I want a solution that has an actual return value. A return value allows convenient processing of the sequence (filtering, projection, concatenation, zipping, etc).
Here's a simple example of how I might use get_items:
things = obj.get_items.select { |i| !i.thing.nil? }.map { |i| i.thing }
In C#, any method returning IEnumerable that uses a yield return causes the compiler to generate a finite state machine behind the scenes that caters for this behaviour. I suspect something similar could be achieved using Ruby's continuations, but I haven't seen an example and am not quite clear myself on how this would be done.
It does indeed seem possible that I might use Enumerable to achieve this. A simple solution would be to use an Array (which includes the module Enumerable), but I do not want to create an intermediate collection with N items in memory when it's possible to just provide them lazily and avoid any memory spike at all.
If this still doesn't make sense, then consider the above code example. get_items returns an enumeration, upon which select is called. What is passed to select is an instance that knows how to provide the next item in the sequence whenever it is needed. Importantly, the whole collection of items hasn't been calculated yet. Only when select needs an item will it ask for it, and the latent code in get_items will kick into action and provide it. This laziness carries along the chain, such that select only draws the next item from the sequence when map asks for it. As such, a long chain of operations can be performed on one data item at a time. In fact, code structured in this way can even process an infinite sequence of values without any kinds of memory errors.
So, this kind of laziness is easily coded in C#, and I don't know how to do it in Ruby.
I hope that's clearer (I'll try to avoid writing questions at 3AM in future.)
It's supported by Enumerator since Ruby 1.9 (and back-ported to 1.8.7).
Cliche example:
fib = Enumerator.new do |y|
  y.yield i = 0
  y.yield j = 1
  while true
    k = i + j
    y.yield k
    i = j
    j = k
  end
end
100.times { puts fib.next() }
Your specific example is equivalent to 10000000.times, but let's assume for a moment that the times method didn't exist and you wanted to implement it yourself, it'd look like this:
class Integer
  def my_times
    return enum_for(:my_times) unless block_given?
    i = 0
    while i < self
      yield i
      i += 1
    end
  end
end

10000.my_times # Returns an Enumerable which will let
               # you iterate over the numbers from 0 to 10000 (exclusive)
Edit: To clarify my answer a bit:
In the above example my_times can be (and is) used without a block and it will return an Enumerable object, which will let you iterate over the numbers from 0 to n. So it is exactly equivalent to your example in C#.
This works using the enum_for method. The enum_for method takes as its argument the name of a method, which will yield some items. It then returns an instance of class Enumerator (which includes the module Enumerable), which when iterated over will execute the given method and give you the items which were yielded by the method. Note that if you only iterate over the first x items of the enumerable, the method will only execute until x items have been yielded (i.e. only as much as necessary of the method will be executed) and if you iterate over the enumerable twice, the method will be executed twice.
In 1.8.7+ it has become conventional to define methods which yield items so that, when called without a block, they return an Enumerator which lets the user iterate over those items lazily. This is done by adding the line return enum_for(:name_of_this_method) unless block_given? to the beginning of the method, as I did in my example.
Without having much ruby experience, what C# does in yield return is usually known as lazy evaluation or lazy execution: providing answers only as they are needed. It's not about allocating memory, it's about deferring computation until actually needed, expressed in a way similar to simple linear execution (rather than the underlying iterator-with-state-saving).
A quick google turned up a ruby library in beta. See if it's what you want.
C# ripped the 'yield' keyword right out of Ruby; see Implementing Iterators for more.
As for your actual problem: presumably you have an array of arrays and you want to create a one-way iteration over the complete length of the list? It's worth looking at array.flatten as a starting point; if the performance is alright, then you probably don't need to go much further.

How does coding with LINQ work? What happens behind the scenes?

For example:
m_lottTorqueTools = (From t In m_lottTorqueTools _
                     Where Not t.SlotNumber = toolTuple.SlotNumber _
                     And Not t.StationIndex = toolTuple.StationIndex).ToList
What algorithm occurs here? Is there a nested for loop going on in the background? Does it construct a hash table for these fields? I'm curious.
Query expressions are translated into extension method calls, usually. (They don't have to be, but 99.9% of queries use IEnumerable<T> or IQueryable<T>.)
The exact algorithm of what that method does varies from method to method. Your sample query wouldn't use any hash tables, but joins or grouping operations do, for example.
The simple Where call translates to something like this in C# (using iterator blocks, which aren't available in VB at the moment as far as I'm aware):
public static IEnumerable<T> Where<T>(this IEnumerable<T> source,
                                      Func<T, bool> predicate)
{
    // Argument checking omitted
    foreach (T element in source)
    {
        if (predicate(element))
        {
            yield return element;
        }
    }
}
The predicate is provided as a delegate (or an expression tree if you're using IQueryable<T>) and is called on each item in the sequence. The results are streamed and execution is deferred - in other words, nothing happens until you start asking for items from the result, and even then it only does as much as it needs to in order to provide the next result. Some operators aren't deferred (basically the ones which return a single value instead of a sequence) and some buffer the input (e.g. Reverse has to read to the end of the sequence before it can return any results, because the last result it reads is the first one it has to yield).
It's beyond the scope of a single answer to give details of every single LINQ operator I'm afraid, but if you have questions about specific ones I'm sure we can oblige.
I should add that if you're using LINQ to SQL or another provider that's based on IQueryable<T>, things are rather different. The Queryable class builds up the query (with the help of the provider, which implements IQueryable<T> to start with) and then the query is generally translated into a more appropriate form (e.g. SQL) by the provider. The exact details (including buffering, streaming etc) will entirely depend on the provider.
LINQ in general has a lot going on behind the scenes. With any query, it is first translated into an expression tree using an IQueryProvider, and from there the query is typically compiled into CIL code and a delegate is generated pointing to this function, which you are essentially using whenever you call the query. That's an extremely simplified overview; if you want to read a great article on the subject, I recommend you look at How LINQ Works - Creating Queries. Jon Skeet also posted a good answer on this site to this question.
What happens not only depends on the methods, but also depends on the LINQ provider in use. In the case where an IQueryable<T> is being returned, it is the LINQ provider that interprets the expression tree and processes it however it likes.
