How does coding with LINQ work? What happens behind the scenes? - linq

For example:
m_lottTorqueTools = (From t In m_lottTorqueTools _
Where Not t.SlotNumber = toolTuple.SlotNumber _
And Not t.StationIndex = toolTuple.StationIndex).ToList
What algorithm occurs here? Is there a nested for loop going on in the background? Does it construct a hash table for these fields? I'm curious.

Query expressions are translated into extension method calls, usually. (They don't have to be, but 99.9% of queries use IEnumerable<T> or IQueryable<T>.)
The exact algorithm of what that method does varies from method to method. Your sample query wouldn't use any hash tables, but joins or grouping operations do, for example.
The simple Where call translates to something like this in C# (using iterator blocks, which aren't available in VB at the moment as far as I'm aware):
public static IEnumerable<T> Where(this IEnumerable<T> source,
Func<T, bool> predicate)
{
// Argument checking omitted
foreach (T element in source)
{
if (predicate(element))
{
yield return element;
}
}
}
The predicate is provided as a delegate (or an expression tree if you're using IQueryable<T>) and is called on each item in the sequence. The results are streamed and execution is deferred - in other words, nothing happens until you start asking for items from the result, and even then it only does as much as it needs to in order to provide the next result. Some operators aren't deferred (basically the ones which return a single value instead of a sequence) and some buffer the input (e.g. Reverse has to read to the end of the sequence before it can return any results, because the last result it reads is the first one it has to yield).
It's beyond the scope of a single answer to give details of every single LINQ operator I'm afraid, but if you have questions about specific ones I'm sure we can oblige.
I should add that if you're using LINQ to SQL or another provider that's based on IQueryable<T>, things are rather different. The Queryable class builds up the query (with the help of the provider, which implements IQueryable<T> to start with) and then the query is generally translated into a more appropriate form (e.g. SQL) by the provider. The exact details (including buffering, streaming etc) will entirely depend on the provider.

LINQ in general has a lot going on behind the scenes. With any query, it is first translated into an expression tree using an IQueryableProvider and from there the query is typically compiled into CIL code and a delegate is generated pointing to this function, which you are essentially using whenever you call the query. That's an extremely simplified overview - if you want to read a great article on the subject, I recommend you look at How LINQ Works - Creating Queries. Jon Skeet also posted a good answer on this site to this question.

What happens not only depends on the methods, but also depends on the LINQ provider in use. In the case where an IQueryable<T> is being returned, it is the LINQ provider that interprets the expression tree and processes it however it likes.

Related

Compiler/Interpreter design: should built in methods have their own Node or should a lookup table be used?

I'm designing a interpreter which uses recursive descent and I've got to the point where I'm starting to implement built in methods.
One example of a method I'm implementing is the print() method which outputs to the console, just like python's print() method and Java's System.out.println().
However it has come to my attention that there are multiple ways of implementing these built in methods. I'm sure there is many more but I have determined 2 viable ways of achieving this and I'm trying to establish which way is the best practise. For context below is the different layers I'm using within my interpreter which is loosely based off of https://www.geeksforgeeks.org/introduction-of-compiler-design/ and other tutorials I've come across.
Lexer
Parser
Semantic Analyser
Interpreter / Code Generator
1. Creating a AST Node for each individual built in method.
This method entails programming the parser to generate a node for each individual method. This means that a unique node will exist for each method. For example:
The parser will look to generate a node when a TPRINT token is found in the lexer.
print : TPRINT TLPAREN expr TRPAREN {$$ = new Print($3);}
;
And this is what the print class looks like.
class Print : public Node {
public:
virtual VariableValue visit_Semantic(SemanticAnalyzer* analyzer) override;
virtual VariableValue visit_Interpreter(Interpreter* interpreter) override;
Node* value;
Print(Node* cvalue) {
value = cvalue;
}
}
From there I define the visit_Semantic and visit_interpreter methods, and visit them using recursion from the top node.
I can think of a couple advantages/disadvantages to using this method:
Advantages
When the code generator walks the tree and visits the Print node's visit_interpreter method, it can straight away execute the response directly as it is programmed into it's visit methods.
Disadvantages
I will have to write a lot of copy paste code. I will have to create a node for each separate method and define it's parser grammar.
2. Creating a Generic AST Node for a method call Node and then use a lookup table to determine which method is being called.
This involves creating one generic node MethodCall and grammar to determine if a method has been called, with some unique identifier such as a string for the method it is referring to. Then, when the MethodCall's visit_Interpreter or visit_Semantic method is called, it looks up in a table which code to execute.
methcall : TIDENTIFIER TLPAREN call_params TRPAREN {$$ = new MethodCall($1->c_str(), $3);}
;
MethodCall Node. Here the unique identifier is std::string methodName:
class MethodCall : public Node {
public:
virtual VariableValue visit_Semantic(SemanticAnalyzer* analyzer) override;
virtual VariableValue visit_Interpreter(Interpreter* interpreter) override;
std::string methodName;
ExprList *params;
MethodCall(std::string cmethodName, ExprList *cparams) {
params = cparams;
methodName = cmethodName;
}
};
Advantages:
One generic grammar/node for all method calls. This makes it more readable
Disadvantages:
At some point the unique identifier std::string methodName will have to be compared in a lookup table to determine a response for it. This is not as efficient as directly programming in a response to a Node's visit methods.
Which practise is the best way of handling methods within a compiler/interpreter? Are there different practises all together which are better, or are there any other disadvantages/advantages I am missing?
I'm fairly new to compiler/interpreter design so please ask me to clarify if I have got some terminology wrong.
As far as I see it, you have to split things up into methods somewhere. The question is, do you want to implement this as part of the parser definition (solution 1) or do you want to implement this on the C++ side (solution 2).
Personally, I would prefer to keep the parser definition simple and move this logic to the C++ side, that is solution 2.
From a runtime perspective of solution 2, I wouldn't worry too much about that. But in the end, it depends on how frequently that method is called and how many identifiers you have. Having just a few identifiers is different from comparing say hundreds of strings in an "else if" manner.
You could first implement it the simple straight-forward way, i.e. match the string identifiers in an "else if" manner, and see if you run into runtime issues.
If you experience runtime issues, you could use a hash function. The "hardcore" way would be to implement an optimal hash function yourself and check hash function optimality offline, since you know the set of string identifiers. But for your application that would be probably be rather overkill or for academic purposes, and I would recommend to just use unordered_map from the STL (which uses hashing under the hood, see also How std::unordered_map is implemented) to map your string identifiers to index numbers, so that you can implement your jump table with an efficient switch operation on those index numbers.
You should most definitely use a table lookup. It makes things much easier for you. Also, think about the functions users define! Then you'll definitely need a table.

Groovy 1.8 :: LINQ Applied

UPDATE 8/31/2011
Guillaume Laforge has nearly done it:
http://gaelyk.appspot.com/tutorial/app-engine-shortcuts#query
Looks like he's doing an AST transform to pull off the:
alias as Entity
bit. Cool stuff, Groovy 1.8 + AST transform = LINQ-esque queries on the JVM. GL's solution needs more work as far as I can see to pull off full query capabilities (like sub queries, join using(field) syntax and the like), but for his Gaelyk project apparently not necessary.
EDIT
As a workaround to achieving pure LINQ syntax, I have decided to def the aliases. Not a huge deal, and removes a major hurdle that would likely require complex AST transforms to pull off.
So, instead of:
from c as Composite
join t as Teams
...
I now define the aliases (note: need to cast to get auto complete on fields):
def(Teams t,Composite c,Schools s) = [Teams.new(),Composite.new(),Schools.new()]
and use map syntax for from, join, etc.
from c:Composite
join t:Teams
...
To solve issue #2 (see original below), add instance level getProperty methods to each pogo alias (whose scope is limited to ORM closure in which it is called, nice). We just return the string property name as we're building an sql statement.
[t,c,s].each{Object o-> o.metaClass.getProperty = { String k-> k } }
Making "good" progress ;-)
Now to figure out what to do about "=", that is tricky since set property is void. May have to go with eq, neq, gt, etc. but would really prefer the literal symbols, makes for closer-to-sql readability.
If interested, LINQ is doing quite a bit behind the scenes. Jon Skeet (praise his name) has a nice SO reply:
How LINQ works internally?
ORIGINAL
Have been checking out LINQ, highly impressed.
// LINQ example
var games =
from t in Teams
from g in t.Games
where g.gameID = 212
select new { g.gameDate,g.gameTime };
// Seeking Groovy Nirvana
latest { Integer teamID->
from c as Composite
join t as Teams
join s as Schools on ( schoolID = {
from Teams
where t.schoolID = s.schoolID } )
where t.teamID = "$teamID"
select c.location, c.gameType, s.schoolName
group c.gameID
order c.gameDate, c.gameTime
}
The proposed Groovy version compiles fine and if I def the aliases c,t,s with their corresponding POGO, I get strong typed IDE auocomplete on fields, nice. However, nowhere near LINQ, where there are no (visible) variable definitions other than the query itself, totally self contained and strongly typed, wow.
OK, so can it be done in Groovy? I think (hope) yes, but am hung up on 2 issues:
1) How to implicitly populate alias variable without def'ing? Currently I am overriding asType() on String so in "from c as Composite", c gets cast to Composite. Great, but the IDE "thinks" that in closure scope undefined c is a string and thus no autocomplete on POGO fields ;-(
2) Since #1 is not solved, I am def'ing the aliases as per above so I can get autocomplete. Fine, hacked (compared to LINQ), but does the trick. Problem here is that in "select c.location, c.gameType...", I'd like the fields to not be evaluated but simply return "c.location" to the ORM select method, and not null (which is its default value). getProperty() should work here, but I need it to only apply to pogo fields when called from ORM scope (e.g. orm field specific methods like select, order, group, etc.). Bit lost there, perhaps there's a way to annotate orm methods, or only invoke "special" pogo getProperty via orm method calls (which is the closure's delegate in above nirvana query).
Should point out that I am not looking to create a comprehensive LINQ for Groovy, but this one particular subset of LINQ I would love to see happen.
One of the biggest reasons for Guillaume to use an AST transform is because of the problem with "=". Even if you use == for the compare as normally done in Groovy, from the compareTo method that is called for it, you cannot make out a difference between ==, !=, <=, >=, <, >. There are two possible paths for this in later versions of Groovy that are in discussion. One is to use for each of those compares a different method, the other is to store a minimal AST which you can access at runtime. This goes in the direction of C# and is a quite powerful tool. The problem is more about how to do this efficiently.

C# Performance of LINQ vs. foreach iterator block

1) Do these generate the same byte code?
2) If not, is there any gain in using one over the other in certain circumstances?
// LINQ select statement
return from item in collection
select item.Property;
// foreach in an iterator block
foreach (item in collection)
yield return item.Property;
They don't generate the same code but boil down to the same thing, you get an object implementing IEnumerable<typeof(Property)>. The difference is that linq provides iterator from its library (in this case most likely by using WhereSelectArrayIterator or WhereSelectListIterator) whereas in the second example you yourself generate an iterator block that dissects a collection. An iterator block method is always, by ways of compiler magic, compiled as a separate class implementing IEnumerable<typeof(yield)> which you don't see but instantiate implicitly when you call iterator block method.
Performance wise, #1 should be slightly (but just slightly) faster for indexable collections because when you loop through the resulting IEnumerable you go directly from your foreach into collection retrieval in an optimized linq iterator. In example #2 you go from foreach into your iterator block's foreach and from there into collection retrieval and your performance depends mostly on how smart the compiler is at optimizing yield logic. In any case I would imagine that for any complex collection mechanism the cost of retrieval marginalizes this difference.
IMHO, I would always go with #1, if nothing else it saves me from having to write a separate method just for iterating.
No they don't generate the same byte code. The first one returns a pre-existing class in the framework. The second one returns a compiler generated state machine that returns the items from the collection. That state machine is very similar to the class that exists in the framework already.
I doubt there's much performance difference between the two. Both are doing very similar things in the end.

Deferred execution of List<T> using Linq

Suppose I have a List<T> with 1000 items in it.
I'm then passing this to a method that filters this List.
As it drops through the various cases (for example there could be 50), List<T> may have up to 50 various Linq Where() operations performed on it.
I'm interested in this running as quickly as possible. Therefore, I don't want this List<T> filtered each time a Where() is performed on it.
Essentially I need it to defer the actual manipulation of the List<T> until all the filters have been applied.
Is this done natively by the compiler? Or just when I call .ToList() on the IEnumerable that a List<T>.Where() returns, or should I perform the Where() operations on X (Where X = List.AsQueryable())?
Hope this makes sense.
Yes, deferred execution is supported natively. Every time you apply a query or lambda expression on your List the query stores all the expressions that is executed only when you call .ToList() on the Query.
Each call to Where will create a new object which knows about your filter and the sequence it's being called on.
When this new object is asked for a value (and I'm being deliberately fuzzy between an iterator and an iterable here) it will ask the original sequence for the next value, check the filter, and either return the value or iterate back, asking the original sequence for the next value etc.
So if you call Where 50 times (as in list.Where(...).Where(...).Where(...), you end up with something which needs to go up and down the call stack at least 50 times for each item returned. How much performance impact will that have? I don't know: you should measure it.
One possible alternative is to build an expression tree and then compile it down into a delegate at the end, and then call Where. This would certainly be a bit more effort, but it could end up being more efficient. Effectively, it would let you change this:
list.Where(x => x.SomeValue == 1)
.Where(x => x.SomethingElse != null)
.Where(x => x.FinalCondition)
.ToList()
into
list.Where(x => x.SomeValue == 1 && x.SomethingElse != null && x.FinalCondition)
.ToList()
If you know that you're just going to be combining a lot of "where" filters together, this may end up being more efficient than going via IQueryable<T>. As ever, check performance of the simplest possible solution before doing something more complicated though.
There is so much failure in the question and the comments. The answers are good but don't hit hard enough to break through the failure.
Suppose you have a list and a query.
List<T> source = new List<T>(){ /*10 items*/ };
IEnumerable<T> query = source.Where(filter1);
query = query.Where(filter2);
query = query.Where(filter3);
...
query = query.Where(filter10);
Is [lazy evaluation] done natively by the compiler?
No. Lazy evaluation is due to the implementation of Enumerable.Where
This method is implemented by using deferred execution. The immediate return value is an object that stores all the information that is required to perform the action. The query represented by this method is not executed until the object is enumerated either by calling its GetEnumerator method directly or by using foreach in Visual C# or For Each in Visual Basic.
speed penalty there is on calling List.AsQueryable().ToList()
Don't call AsQueryable, you only need to use Enumerable.Where.
thus won't prevent a 50 calls deep call stack
Depth of call stack is much much less important than having a highly effective filter first. If you can reduce the number of elements early, you reduce the number of method calls later.

How LINQ works internally?

I love using LINQ in .NET, but I want to know how that works internally?
It makes more sense to ask about a particular aspect of LINQ. It's a bit like asking "How Windows works" otherwise.
The key parts of LINQ are for me, from a C# perspective:
Expression trees. These are representations of code as data. For instance, an expression tree could represent the notion of "take a string parameter, call the Length property on it, and return the result". The fact that these exist as data rather than as compiled code means that LINQ providers such as LINQ to SQL can analyze them and convert them into SQL.
Lambda expressions. These are expressions like this:
x => x * 2
(int x, int y) => x * y
() => { Console.WriteLine("Block"); Console.WriteLine("Lambda"); }
Lambda expressions are converted either into delegates or expression trees.
Anonymous types. These are expressions like this:
new { X=10, Y=20 }
These are still statically typed, it's just the compiler generates an immutable type for you with properties X and Y. These are usually used with var which allows the type of a local variable to be inferred from its initialization expression.
Query expressions. These are expressions like this:
from person in people
where person.Age < 18
select person.Name
These are translated by the C# compiler into "normal" C# 3.0 (i.e. a form which doesn't use query expressions). Overload resolution etc is applied afterwards, which is absolutely key to being able to use the same query syntax with multiple data types, without the compiler having any knowledge of types such as Queryable. The above expression would be translated into:
people.Where(person => person.Age < 18)
.Select(person => person.Name)
Extension methods. These are static methods which can be used as if they were instance methods of the type of the first parameter. For example, an extension method like this:
public static int CountAsciiDigits(this string text)
{
return text.Count(letter => letter >= '0' && letter <= '9');
}
can then be used like this:
string foo = "123abc456";
int count = foo.CountAsciiDigits();
Note that the implementation of CountAsciiDigits uses another extension method, Enumerable.Count().
That's most of the relevant language aspects. Then there are the implementations of the standard query operators, in LINQ providers such as LINQ to Objects and LINQ to SQL etc. I have a presentation about how it's reasonably simple to implement LINQ to Objects - it's on the "Talks" page of the C# in Depth web site.
The way providers such as LINQ to SQL work is generally via the Queryable class. At their core, they translate expression trees into other query formats, and then construct appropriate objects with the results of executing those out-of-process queries.
Does that cover everything you were interested in? If there's anything in particular you still want to know about, just edit your question and I'll have a go.
LINQ is basically a combination of C# 3.0 discrete features of these:
local variable type inference
auto properties (not implemented in VB 9.0)
extension methods
lambda expressions
anonymous type initializers
query comprehension
For more information about the journey to get there (LINQ), see this video of Anders in LANGNET 2008:
http://download.microsoft.com/download/c/e/5/ce5434ca-4f54-42b1-81ea-7f5a72f3b1dd/1-01%20-%20CSharp3%20-%20Anders%20Hejlsberg.wmv
In simple a form, the compiler takes your code-query and converts it into a bunch of generic classes and calls. Underneath, in case of Linq2Sql, a dynamic SQL query gets constructed and executed using DbCommand, DbDataReader etc.
Say you have:
var q = from x in dc.mytable select x;
it gets converted into following code:
IQueryable<tbl_dir_office> q =
dc.mytable.Select<tbl_dir_office, tbl_dir_office>(
Expression.Lambda<Func<mytable, mytable>>(
exp = Expression.Parameter(typeof(mytable), "x"),
new ParameterExpression[] { exp }
)
);
Lots of generics, huge overhead.
Basically linq is a mixture of some language facilities (compiler) and some framework extensions. So when you write linq queries, they get executed using appropriate interfaces such as IQuerable. Also note that the runtime has no role in linq.
But it is difficult to do justice to linq in a short answer. I recommend you read some book to get yourself in it. I am not sure about the book that tells you internals of Linq but Linq in Action gives a good handson about it.

Resources