LINQ to Objects Optimization Techniques?

What LINQ to Objects optimization techniques do you use or have you seen in the wild?
While waiting for "yield foreach" and other language/compiler optimizations to arrive in C# in 201x, I'm interested in doing everything possible to make using LINQ everywhere less of a performance pain.
One pattern I've seen so far is creating custom IEnumerable implementations for specific combinators, so that the source enumerable is not re-enumerated several times.
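For illustration, here is a minimal sketch of that pattern (the MinMax name and shape are my own invention, not from the question): a custom combinator that computes the minimum and maximum in a single pass, where calling Min() and Max() separately would enumerate the source twice.

using System;
using System.Collections.Generic;

static class Combinators
{
    // Single-pass min/max over any sequence; throws on an empty source,
    // mirroring the behavior of Enumerable.Min/Max.
    public static (int Min, int Max) MinMax(this IEnumerable<int> source)
    {
        using (var e = source.GetEnumerator())
        {
            if (!e.MoveNext())
                throw new InvalidOperationException("Sequence contains no elements.");
            int min = e.Current, max = e.Current;
            while (e.MoveNext())
            {
                if (e.Current < min) min = e.Current;
                if (e.Current > max) max = e.Current;
            }
            return (min, max);
        }
    }
}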

One that I've spotted a few times - don't use:
if (query.Count() > 0)
... use this instead:
if (query.Any())
That way it only needs to find the first match.
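For reference, this is roughly why Any() is cheap; a simplified sketch, not the actual BCL source (the real Count() also special-cases ICollection<T>, but for a deferred query it must walk the whole sequence):

using System.Collections.Generic;

static class AnySketch
{
    // Stops after asking for at most one element, instead of walking
    // the whole sequence the way a naive Count() > 0 would.
    public static bool AnyItems<T>(this IEnumerable<T> source)
    {
        using (var e = source.GetEnumerator())
            return e.MoveNext();
    }
}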
EDIT: You may also be interested in a blog post I recently wrote about optimisations which could be in LINQ to Objects but aren't (or weren't in .NET 3.5).
Additionally, if you're going to do a lot of x.Contains(y) operations and x is the result of an existing query (i.e. it's not already some optimised collection), you should probably consider building a HashSet<T> from x, to avoid re-running the query (a linear scan producing x's results) on every Contains call.
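A hedged example of that advice (expensiveQuery and candidates are hypothetical stand-ins):

using System.Collections.Generic;
using System.Linq;

class HashSetDemo
{
    static void Main()
    {
        // Deferred query: enumerating it repeats the work every time.
        IEnumerable<int> expensiveQuery =
            Enumerable.Range(0, 1000000).Where(n => n % 7 == 0);
        int[] candidates = { 3, 14, 21, 98 };

        // Materialize once; each Contains is then O(1) on average,
        // instead of re-running the query per candidate.
        var x = new HashSet<int>(expensiveQuery);
        var matches = candidates.Where(y => x.Contains(y)).ToList();
    }
}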

Related

When to use references versus types versus boxes and slices versus vectors as arguments and return types?

I've been working with Rust for the past few days to build a new library (related to abstract algebra) and I'm struggling with some of the best practices of the language. For example, I implemented a longest common subsequence function taking &[&T] for the sequences. I figured this was the Rust convention, as it avoided copying the data (T, which may not be easily copy-able, or may be big). When changing my algorithm to work with simpler &[T]s, which I needed elsewhere in my code, I was forced to add the Copy type constraint, since it needed to copy the Ts and not just a reference.
So my higher-level question is: what are the best practices for passing data between threads and structures in long-running processes, such as a server that responds to queries requiring big data crunching? Any specificity at all would be extremely helpful, as I've found very little. Do you generally want to pass parameters by reference? Do you generally want to avoid returning references, as I read in the Rust book? Is it better to work with &[&T] or &[T] or Vec<T> or Vec<&T>, and why? Is it better to return a Box<T> or a T? I realize the word "better" here is considerably ill-defined, but I hope you'll understand my meaning: what pitfalls should I consider when defining functions and structures, to avoid realizing my stupidity later and having to refactor everything?
Perhaps another way to put it is, what "algorithm" should my brain follow to determine where I should use references vs. boxes vs. plain types, as well as slices vs. arrays vs. vectors? I hesitate to start using references and Box<T> returns everywhere, as I think that'd get me a sort of "Java in Rust" effect, and that's not what I'm going for!

Sequence vs LazyList

I can't wrap my head around the differences between seq and LazyList. They're both lazy and potentially infinite; while seq<'T> is IEnumerable<'T> from the .NET framework, LazyList is included in the F# PowerPack. In practice, I encounter sequences much more often than LazyLists.
What are their differences in terms of performance, usage, readability, etc.? And why does LazyList have such a bad reputation compared to seq?
LazyList computes each element only once regardless of how many times the list is traversed. In this way, it's closer to a sequence returned from Seq.cache (rather than a typical sequence). But, other than caching, LazyList behaves exactly like a list: it uses a list structure under the hood and supports pattern matching. So you might say: use LazyList instead of seq when you need list semantics and caching (in addition to laziness).
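To see what that caching means in IEnumerable terms, here is a minimal, non-thread-safe C# sketch of the idea behind Seq.cache (the real LazyList is an F# lazy cons list, not this, so treat it only as an analogy):

using System.Collections.Generic;

static class EnumerableCache
{
    public static IEnumerable<T> Cached<T>(this IEnumerable<T> source)
    {
        var buffer = new List<T>();          // elements computed so far
        var e = source.GetEnumerator();      // shared across all traversals
        return Iterate();

        IEnumerable<T> Iterate()
        {
            for (int i = 0; ; i++)
            {
                if (i == buffer.Count)
                {
                    if (!e.MoveNext())
                        yield break;         // source exhausted
                    buffer.Add(e.Current);   // compute each element only once
                }
                yield return buffer[i];
            }
        }
    }
}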
Regarding both being infinite, seq's memory usage is constant while LazyList's is linear.
These docs may be worth a read.
In addition to Daniel's answer, I think the main practical difference is in how you process LazyList and seq structures (or computations).
If you want to process a LazyList, you would typically write a recursive function using pattern matching (quite similar to processing normal F# lists).
If you want to process a seq, you can either use built-in functions or write imperative code that calls GetEnumerator and then uses the returned enumerator in a loop (which may be written as a recursive function, but it will mutate the enumerator). You cannot use the usual head/tail style (Seq.head and Seq.tail), because that is extremely inefficient: seq does not keep the evaluated elements, so each call has to re-iterate from the start.
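Since seq<'T> is just IEnumerable<'T>, that imperative style looks like this (sketched in C# terms; an F# version would use the same GetEnumerator/MoveNext calls):

using System.Collections.Generic;

static class SeqProcessing
{
    // Manual enumerator loop: one forward pass, mutating a single enumerator,
    // rather than the head/tail style that restarts the sequence each time.
    public static int SumPositive(IEnumerable<int> xs)
    {
        int total = 0;
        using (var e = xs.GetEnumerator())
            while (e.MoveNext())
                if (e.Current > 0)
                    total += e.Current;
        return total;
    }
}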
Regarding the reputation of seq and LazyList, I think the F# library design takes a pragmatic approach: since seq is actually .NET's IEnumerable, it is quite convenient for .NET programming (and it is also nice that you can treat other collections as seq). Lazy lists are not needed as frequently, so the normal F# list and seq are sufficient in most scenarios.

Are there any cases where LINQ's .Where() will be faster than O(N)?

Think the title describes my thoughts pretty well :)
I've seen a lot of people lately who swear by LINQ, and while I also believe it's awesome, I think you shouldn't be confused about the fact that on most (all?) IEnumerable types, its performance is not that great. Am I wrong in thinking this? Especially for queries where you nest Where()s on large datasets?
Sorry if this question is a bit vague, I just want to confirm my thoughts in that you should be "cautious" when using LINQ.
[EDIT] Just to clarify - I'm thinking in terms of Linq to Objects here :)
It depends on the provider. For LINQ to Objects, it's going to be O(n), but LINQ to SQL or Entities might ultimately use indexes to beat that. For Objects, if you need the functionality of Where, you're probably going to need O(n) anyway; LINQ will almost certainly have a bigger constant factor, largely due to the function calls.
It depends on how you are using it and what you compare it to.
I've seen many implementations using foreach loops that would have been much faster with LINQ, e.g. because they forget to break or because they return too many items. The trick is that the lambda expressions are only executed when an item is actually consumed; when you have First at the end, the whole query can end up as just a single call.
So when you chain Wheres, an item that does not pass the first condition is never tested against the second one. It's similar to the && operator, which does not evaluate its right-hand side when the left-hand side is false.
You could say it's always O(N), but N is not the number of items in the source; it's the minimal number of items required to produce the target set. That's a pretty good optimization, IMHO.
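A small runnable illustration of that short-circuiting (the numbers are just an example):

using System;
using System.Linq;

class ShortCircuitDemo
{
    static void Main()
    {
        var numbers = Enumerable.Range(1, 1000000);
        var first = numbers
            .Where(n => n % 3 == 0)   // items failing here never reach the next Where
            .Where(n => n % 5 == 0)
            .First();                 // stops the whole pipeline at the first hit

        Console.WriteLine(first);     // 15 - only 15 source items were ever examined
    }
}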
Here's a project that promises to introduce indexing for LINQ2Objects. This should deliver better asymptotic behavior: http://i4o.codeplex.com/

Most powerful and unexpected benefit of Linq in .NET OOP/D?

Since learning about LINQ and gaining experience with it, I find myself leaning on it more and more. It's changing how I think about my classes. Some of these changes were expected (e.g. using collections more), but others were unexpected (e.g. getting initial data for a class as an XElement and sometimes just keeping it there, processing it lazily).
What is the most powerful and unexpected benefit of Linq to .NET OOP/D? I am thinking of Linq-to-objects and Linq-to-xml in particular, but include Linq-to-Entities/SQL too in so far as it has changed your class strategy.
I've noticed a couple of significant benefits from using LINQ:
Maintainability - it's much easier to understand what code does when you read a semantic transformation using LINQ, rather than some confusing looping constructs hand-written by a developer.
Performance - Because of LINQ's deferred and streaming execution, you often end up with code that performs better, either by distributing the workload or by avoiding unnecessary transformations (particularly when only a subset of results is consumed). In the future, as multicore processing becomes more significant, I expect that many LINQ methods will evolve to support native parallel processing (think multi-core sort), which should help keep .NET applications scalable in the multi-core future.
There are a couple of other nice benefits:
Awareness of Iterator Generators: Once developers learn about LINQ, some of them go on to learn how it works. This helps generate awareness of the yield return syntax in C#, which is a powerful way of writing concise and correct sequence iterators (see the sketch after this list).
Focus on business problems: LINQ frees developers to focus on solving the underlying business problems, rather than trying to optimize loops and algorithms to run in the fewest cycles, or use the least number of lines of code. This goes beyond just the productivity of having a library of powerful sequence transformation routines.
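As an example of that yield return style, here is a lazy, potentially infinite sequence written as a concise iterator (a generic sketch, not taken from any answer above):

using System.Collections.Generic;

static class Iterators
{
    // Infinite Fibonacci sequence; consumers pull only what they need,
    // e.g. Fibonacci().Take(10).ToList().
    public static IEnumerable<long> Fibonacci()
    {
        long a = 0, b = 1;
        while (true)
        {
            yield return a;
            long next = a + b;
            a = b;
            b = next;
        }
    }
}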
I feel the code is easier to maintain and easier to test, compared to having the solution in SQL stored procedures.
Combining LINQ with extension methods, I get something like this (maybe it should use some kind of fluent interface...):
return source.Growth().ShareOfChangeDate();
where Growth and ShareOfChangeDate are extensions that I can easily write unit tests for,
and, as LBushkin notes, the line above is something I can show the customer when we discuss the requirements.
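Since the answer doesn't show the bodies of those extensions, here is an invented sketch of the pattern only; Growth's real logic is the author's own:

using System.Collections.Generic;
using System.Linq;

public static class SeriesExtensions
{
    // Hypothetical body: period-over-period change of a numeric series,
    // illustrating a small, chainable, individually unit-testable extension.
    public static IEnumerable<decimal> Growth(this IEnumerable<decimal> source)
    {
        return source.Zip(source.Skip(1), (prev, next) => next - prev);
    }
}

// e.g. new[] { 1m, 3m, 6m }.Growth() yields { 2, 3 }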
Issues
I feel I get less control over the generated SQL, and it's a little bit of magic to find performance problems...

Is LINQ to Everything a good abstraction?

There is a proliferation of new LINQ providers. It is really quite astonishing: an elegant combination of lambda expressions, anonymous types, and generics, with some syntactic sugar on top to make for easy reading. Everything is LINQed now, from SQL to web services like Amazon to streaming sensor data to parallel processing. It seems like someone is creating an IQueryable<T> for everything, but these data sources can have radically different performance, latency, availability, and reliability characteristics.
It gives me a little pause that LINQ makes those performance details transparent to the developer. Is LINQ a solid general purpose abstraction or a RAD tool or both?
To me, LINQ is just a way to make code more readable, and hence more maintainable. LINQ does nothing more than take standard methods and integrate them into the language (hence the name: language-integrated query).
It's nothing but a syntax element around normal interfaces and methods; there is no "magic" here, and LINQ-to-something really should (IMO) be treated like any other third-party API: you need to understand the costs and benefits of using it, just like any other technology.
That being said, it's a very nice syntax helper, and it does a lot to make code cleaner, simpler, and more maintainable. I believe that's where its true strengths lie.
I see this as similar, in its design, to the model of multiple storage engines in an RDBMS accepting a common(-ish) SQL language... but with the added benefit of integration into the application language's semantics. Of course it is good!
I have not used it that much, but it looks sensible and clear when performance and layers of abstraction are not in a position to negatively impact the development process (and you trust that standards and models won't change wildly).
It is just an interface and implementation that may fit your needs. As with all interfaces, abstractions, libraries, and implementations, the question is always the same: does it fit?
I suppose - no.
LINQ is just a convenient syntax, not a general RAD tool. In big projects with complex logic, I have noticed that developers make more errors with LINQ than they would have made writing the same instructions in the .NET 2.0 manner. The code is produced faster and is smaller, but it is harder to find bugs. Sometimes it is not obvious at first glance at what point the queried collection turns from IQueryable into IEnumerable... I would say that LINQ requires more skilled and disciplined developers.
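A self-contained illustration of that IQueryable/IEnumerable switch, using an in-memory queryable as a stand-in for a real LINQ to SQL / Entities table:

using System;
using System.Linq;

class Order { public decimal Total { get; set; } }

class ProviderSwitchDemo
{
    static void Main()
    {
        IQueryable<Order> orders = new[]
        {
            new Order { Total = 5m }, new Order { Total = 50m }
        }.AsQueryable();

        // Stays IQueryable: a real provider could translate this into SQL.
        var remote = orders.Where(o => o.Total < 10);

        // AsEnumerable() silently switches to LINQ to Objects: against a real
        // database, everything after this line filters in memory, client-side.
        var local = orders.AsEnumerable().Where(o => o.Total < 10);

        Console.WriteLine(remote.Count()); // 1
        Console.WriteLine(local.Count());  // 1 - same result, different cost
    }
}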
Also, the SQL-like syntax is fine for functional programming, but it is a sidestep from object-oriented thinking. Sometimes when you see two very similar LINQ queries, they look like copy-paste code, but refactoring is not always possible (or is possible only by sacrificing some performance).
I heard that MS is not going to develop LINQ to SQL further, and will give more priority to Entities (see "Is the ADO.NET Team Abandoning LINQ to SQL?"). Isn't that a signal that LINQ is not a panacea for everyone?
If you are thinking about building a connector to "something", you can build it without LINQ and, if you like, provide LINQ as an optional wrapper around it, like LINQ to Entities. Your customers can then decide whether to use LINQ or not, depending on their needs, required performance, etc.
p.s.
.NET 4.0 will come with dynamics, and I expect that everybody will start using them just as they use LINQ... without taking into consideration that code simplicity, quality, and performance may suffer.
