How to outer join in F# using FLinq?

The question pretty much says it all. I have a big FLinq query of the following form:
for alias1 in table1 do
    for alias2 in table2 do
        if alias1.Id = alias2.foreignId
Using this form, how can I do a left outer join between these two tables?

I think you can use the groupJoin function available in the Query module. Here is an example using Northwind with Products as the primary table and Categories as the table with foreign key:
open System.Linq
<@ Query.groupJoin
     db.Products db.Categories
     (fun p -> p.CategoryID.Value)
     (fun c -> c.CategoryID)
     (fun p cats ->
        // Here we get a sequence of all categories (which may be empty)
        let cat = cats.FirstOrDefault()
        // 'cat' will be either a Category or 'null' value
        p.ProductName, if cat = null then "(none)" else cat.CategoryName) @>
|> query
There are definitely nicer ways of expressing this using the seq { .. } syntax and by implementing join-like behavior using nested for loops. Unfortunately, the quotations-to-LINQ translator will probably not support these. (Personally, I would prefer writing the code using nested for loops and using if to check for an empty collection.)
I was just looking at some improvements in the PowerPack library as part of some contracting work for the F# team, so this will hopefully improve in the future... (but no promises!)

Perhaps you should create a view in the database that performed the left outer join, and then LINQ over that view.

I ended up creating separate queries for each outer join and calling them at certain points when looping through the result set of the outermost query.

Related

LINQ - Deferred Execution in Subqueries

My understanding is that the use of scalar or conversion functions causes immediate execution of a LINQ query. It is also my understanding that subqueries are executed upon demand of the outer query which would typically be once per element. For the following example would I be right in saying that the inner query is executed immediately? If so, as this would produce a scalar value how would this affect how the outer query operates?
IEnumerable<string> outerQuery = names.Where ( n => n.Length == names
.OrderBy(n2 => n2.Length).Select(n2 => n2.Length).First());
I would expect the above query to operate in a similar way as below, ie as if there wasn't a subquery.
int val = names.OrderBy(n2 => n2.Length).Select(n2 => n2.Length).First();
IEnumerable<string> outerQuery = names.Where ( n => n.Length == val );
This example was taken from Joseph and Ben Albahari's C# 4.0 in a Nutshell (Chp 8 P331/332) and my confusion stems from the accompanying diagram which appears to show that the subquery is being evaluated each time the outer query iterates through the elements of names.
Could someone clarify how LINQ works in this setup? Any help would be appreciated!
For the following example would I be right in saying that the inner query is executed immediately?
No, the inner query will be executed for each item in names when the outer query is enumerated. If you want it to be executed only once, use the second code sample.
EDIT: as LukeH pointed out, this is only true of Linq to Objects. Other Linq providers (Linq to SQL, Entity Framework...) might be able to optimize this automatically
What is names? If it's a collection (and you use LINQ to Objects), then the "subquery" will be executed for each outer query item. If it's actually a query object, then the result depends on the actual IQueryable.Provider. For example, with LINQ to SQL you will get a SQL query with a scalar subquery, and in most cases the subquery will actually be executed only once.
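To see the LINQ to Objects behaviour described above, here is a minimal sketch that counts how often the subquery runs. The sample names and the shortestLength helper are my own assumptions for illustration; they are not from the book.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sample data (an assumption, not taken from the book).
string[] names = { "Tom", "Dick", "Harry", "Mary", "Jay" };

int subqueryRuns = 0;

// Hypothetical helper wrapping the subquery so that each evaluation is counted.
Func<int> shortestLength = () =>
{
    subqueryRuns++;
    return names.OrderBy(n2 => n2.Length).Select(n2 => n2.Length).First();
};

// Building the query runs nothing yet: Where is deferred.
IEnumerable<string> outerQuery = names.Where(n => n.Length == shortestLength());
int runsBeforeEnumeration = subqueryRuns;

// Enumerating evaluates the predicate, and with it the subquery, once per element.
List<string> result = outerQuery.ToList();
int runsAfterEnumeration = subqueryRuns;

Console.WriteLine($"{runsBeforeEnumeration} -> {runsAfterEnumeration}: {string.Join(", ", result)}");
```

The counter stays at zero until the outer query is enumerated, then increases by one for every element of names. Hoisting the subquery into a local variable, as in the second code sample of the question, avoids the repeated work.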

When to prefer joins expressed with SelectMany() over joins expressed with the join keyword in Linq

LINQ lets you express inner joins either with the join keyword or with
SelectMany() (i.e. a couple of from keywords) combined with a where clause:
var personsToState = from person in persons
join state in statesOfUS
on person.State equals state.USPS
select new { person, State = state.Name };
foreach (var item in personsToState)
{
System.Diagnostics.Debug.WriteLine(item);
}
// The same query can be expressed with the query operator SelectMany(), which is
// expressed as two from clauses and a single where clause connecting the sequences.
var personsToState2 = from person in persons
from state in statesOfUS
where person.State == state.USPS
select new { person, State = state.Name };
foreach (var item in personsToState2)
{
System.Diagnostics.Debug.WriteLine(item);
}
My question: when does it make sense to use the join style and when the where style?
Does one style have performance advantages over the other?
For local queries Join is more efficient due to its keyed lookup as Athari mentioned, however for LINQ to SQL (L2S) you'll get more mileage out of SelectMany. In L2S a SelectMany ultimately uses some type of SQL join in the generated SQL depending on your query.
Take a look at questions 11 & 12 of the LINQ Quiz by Joseph/Ben Albahari, authors of C# 4.0 In a Nutshell. They show samples of different types of joins and they state:
With LINQ to SQL, SelectMany-based
joins are the most flexible, and can
perform both equi and non-equi joins.
Throw in DefaultIfEmpty, and you can
do left outer joins as well!
In addition, Matt Warren has a detailed blog post on this topic as it pertains to IQueryable / SQL here: LINQ: Building an IQueryable provider - Part VII.
Back to your question of which to use, you should use whichever query is more readable and allows you to easily express yourself and construct your end goal clearly. Performance shouldn't be an initial concern unless you are dealing with large collections and have profiled both approaches. In L2S you have to consider the flexibility SelectMany offers you depending on the way you need to pair up your data.
Join is more efficient: it uses the Lookup class (a variation of Dictionary with multiple values for a single key) to find matching values.
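For local (LINQ to Objects) queries, the two styles and the keyed lookup mentioned above can be compared side by side. This is only a sketch over made-up data; the tuple shapes and sample values are assumptions, and the explicit ToLookup version is an approximation of what Join does internally, not its actual implementation.

```csharp
using System;
using System.Linq;

// Made-up sample data; property names mirror the question.
var persons = new[]
{
    (Name: "Ann", State: "CA"),
    (Name: "Bob", State: "NY"),
    (Name: "Cid", State: "CA"),
};
var statesOfUS = new[]
{
    (USPS: "CA", Name: "California"),
    (USPS: "NY", Name: "New York"),
    (USPS: "TX", Name: "Texas"),
};

// join style: the inner sequence is indexed by key once, then probed per person.
var joinStyle = (from person in persons
                 join state in statesOfUS on person.State equals state.USPS
                 select person.Name + " -> " + state.Name).ToList();

// where style: every person is compared against every state.
var whereStyle = (from person in persons
                  from state in statesOfUS
                  where person.State == state.USPS
                  select person.Name + " -> " + state.Name).ToList();

// Roughly the same idea as Join, written out with an explicit Lookup.
var statesByCode = statesOfUS.ToLookup(state => state.USPS);
var lookupStyle = persons
    .SelectMany(person => statesByCode[person.State],
                (person, state) => person.Name + " -> " + state.Name)
    .ToList();

Console.WriteLine(string.Join("; ", joinStyle));
```

All three produce the same pairs in the same order; the difference for in-memory data is only how many comparisons are needed to get there.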

Translate an IQueryable instance to LINQ syntax in a string

I would like to find out if anyone has existing work surrounding formatting an IQueryable instance back into a LINQ C# syntax inside a string. It'd be a nice-to-have feature for an internal LINQ-to-SQL auditing framework I'm building. Once my framework gets the IQueryable instance from a data repository method, I'd like to output something like:
This LINQ query:
from ce in db.EiClassEnrollment
join c in db.EiCourse on ce.CourseID equals c.CourseID
join cl in db.EiClass on ce.ClassID equals cl.ClassID
join t in db.EiTerm on ce.TermID equals t.TermID
join st in db.EiStaff on cl.Instructor equals st.StaffID
where (ce.StudentID == studentID) && (ce.TermID == termID) && (cl.Campus == campusID)
select new { ce, cl, t, c, st };
Generates the following LINQ-to-SQL query:
DECLARE @p0 int;
DECLARE @p1 int;
DECLARE @p2 int;
SET @p0 = 777;
SET @p1 = 778;
SET @p2 = 779;
SELECT [t0].[ClassEnrollmentID], ..., [t4].[Name]
FROM [dbo].[ei_ClassEnrollment] AS [t0]
INNER JOIN [dbo].[ei_Course] AS [t1] ON [t0].[CourseID] = [t1].[CourseID]
INNER JOIN [dbo].[ei_Class] AS [t2] ON [t0].[ClassID] = [t2].[ClassID]
INNER JOIN [dbo].[ei_Term] AS [t3] ON [t0].[TermID] = [t3].[TermID]
INNER JOIN [dbo].[ei_Staff] AS [t4] ON [t2].[Instructor] = [t4].[StaffID]
WHERE ([t0].[StudentID] = @p0) AND ([t0].[TermID] = @p1) AND ([t2].[Campus] = @p2)
I already have the SQL output working as you can see. I just need to find a way to get the IQueryable to translate into a string representing its original LINQ syntax (with an acceptable translation loss). I'm not afraid of writing it myself, but I'd like to see if anyone else has done this first.
Every IQueryable can be represented as an Expression object. Lambda expressions have a Body property representing the body of the lambda. While parsing your sources, you may be able to take each expression and output its body, which should be normalized.
The best approach to this would be to read up on expression trees in C#. I think you may be able to use a visitor pattern over an IQueryable<T> type to recover the C# syntax. I know there are some implementations available for Expression<Func<T>>, but I can't recall ever seeing this done for a LINQ query.
UPDATE I got curious about this and started doing some research. You can access the underlying Expression Tree through the Expression property of an IQueryable<>. It looks like you would need to implement a LINQ provider that renders C# instead of SQL. This is very far from trivial. In fact I think it would be difficult to justify the amount of work that would be required unless this is an educational (non-commercial) project. But if you're undaunted, here is what looks like an excellent tutorial on LINQ providers. All the source code is available on Codeplex too.
I've done my own implementation for this since I could not find any existing work that was freely available or in source form. I put up a quick blog post about my work and included the entire C# source code for it. You can do some pretty neat stuff with it. Feel free to check it out.
http://bittwiddlers.org/?p=120
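As a starting point for anyone trying this, here is a small sketch of reading the expression tree through the Expression property of an IQueryable, as mentioned above. It uses an in-memory array via AsQueryable() instead of a LINQ-to-SQL table (an assumption to keep it self-contained), and it only lists the chained operators; rendering full C# query syntax needs far more than this.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

// In-memory stand-in for a real data source (assumption for illustration).
IQueryable<string> query = new[] { "Tom", "Harry", "Mary" }
    .AsQueryable()
    .Where(n => n.Length > 3)
    .OrderBy(n => n);

// Each chained operator is a MethodCallExpression whose first argument is the
// previous stage of the query, so we can walk from the outermost call inwards.
var operators = new List<string>();
Expression node = query.Expression;
while (node is MethodCallExpression call)
{
    operators.Add(call.Method.Name);
    node = call.Arguments[0];
}
operators.Reverse(); // innermost operator first, i.e. the order it was written in

// The built-in ToString() gives a rough method-syntax rendering of the tree.
string rendered = query.Expression.ToString();

Console.WriteLine(string.Join(" -> ", operators));
Console.WriteLine(rendered);
```

A real translator would subclass ExpressionVisitor and handle each node type (lambdas, member access, constants, anonymous types), which is where most of the work lies.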

Can I force the auto-generated Linq-to-SQL classes to use an OUTER JOIN?

Let's say I have an Order table which has a FirstSalesPersonId field and a SecondSalesPersonId field. Both of these are foreign keys that reference the SalesPerson table. For any given order, either one or two salespersons may be credited with the order. In other words, FirstSalesPersonId can never be NULL, but SecondSalesPersonId can be NULL.
When I drop my Order and SalesPerson tables onto the "Linq to SQL Classes" design surface, the class builder spots the two FK relationships from the Order table to the SalesPerson table, and so the generated Order class has a SalesPerson field and a SalesPerson1 field (which I can rename to SalesPerson1 and SalesPerson2 to avoid confusion).
Because I always want to have the salesperson data available whenever I process an order, I am using DataLoadOptions.LoadWith to specify that the two salesperson fields are populated when the order instance is populated, as follows:
dataLoadOptions.LoadWith<Order>(o => o.SalesPerson1);
dataLoadOptions.LoadWith<Order>(o => o.SalesPerson2);
The problem I'm having is that Linq to SQL is using something like the following SQL to load an order:
SELECT ...
FROM Order O
INNER JOIN SalesPerson SP1 ON SP1.salesPersonId = O.firstSalesPersonId
INNER JOIN SalesPerson SP2 ON SP2.salesPersonId = O.secondSalesPersonId
This would make sense if there were always two salesperson records, but because there is sometimes no second salesperson (secondSalesPersonId is NULL), the INNER JOIN causes the query to return no records in that case.
What I effectively want here is to change the second INNER JOIN into a LEFT OUTER JOIN. Is there a way to do that through the UI for the class generator? If not, how else can I achieve this?
(Note that because I'm using the generated classes almost exclusively, I'd rather not have something tacked on the side for this one case if I can avoid it).
Edit: per my comment reply, the SecondSalesPersonId field is nullable (in the DB, and in the generated classes).
The default behaviour actually is a LEFT JOIN, assuming you've set up the model correctly.
Here's a slightly anonymized example that I just tested on one of my own databases:
using System;
using System.Data.Linq;
using System.Linq;

class Program
{
    static void Main(string[] args)
    {
        using (TestDataContext context = new TestDataContext())
        {
            DataLoadOptions dlo = new DataLoadOptions();
            dlo.LoadWith<Place>(p => p.Address);
            context.LoadOptions = dlo;
            var places = context.Places.Where(p => p.ID >= 100 && p.ID <= 200);
            foreach (var place in places)
            {
                // The loop variable is 'place'; print its ID and address ID.
                Console.WriteLine("{0}, {1}", place.ID, place.AddressID);
            }
        }
    }
}
This is just a simple test that prints out a list of places and their address IDs. Here is the query text that appears in the profiler:
SELECT [t0].[ID], [t0].[Name], [t0].[AddressID], ...
FROM [dbo].[Places] AS [t0]
LEFT OUTER JOIN (
SELECT 1 AS [test], [t1].[AddressID],
[t1].[StreetLine1], [t1].[StreetLine2],
[t1].[City], [t1].[Region], [t1].[Country], [t1].[PostalCode]
FROM [dbo].[Addresses] AS [t1]
) AS [t2] ON [t2].[AddressID] = [t0].[AddressID]
WHERE ([t0].[PlaceID] >= @p0) AND ([t0].[PlaceID] <= @p1)
This isn't exactly a very pretty query (your guess is as good as mine as to what that 1 as [test] is all about), but it's definitely a LEFT JOIN and doesn't exhibit the problem you seem to be having. And this is just using the generated classes; I haven't made any changes.
Note that I also tested this on a dual relationship (i.e. a single Place having two Address references, one nullable, one not), and I get the exact same results. The first (non-nullable) gets turned into an INNER JOIN, and the second gets turned into a LEFT JOIN.
It has to be something in your model, like changing the nullability of the second reference. I know you say it's configured as nullable, but maybe you need to double-check? If it's definitely nullable then I suggest you post your full schema and DBML so somebody can try to reproduce the behaviour that you're seeing.
If you make the secondSalesPersonId field in the database table nullable, LINQ-to-SQL should properly construct the Association object so that the resulting SQL statement will do the LEFT OUTER JOIN.
UPDATE:
Since the field is nullable, your problem may be in explicitly declaring dataLoadOptions.LoadWith<>(). I'm running a similar situation in my current project where I have an Order, but the order goes through multiple stages. Each stage corresponds to a separate table with data related to that stage. I simply retrieve the Order, and the appropriate data follows along, if it exists. I don't use the dataLoadOptions at all, and it does what I need it to do. For example, if the Order has a purchase order record, but no invoice record, Order.PurchaseOrder will contain the purchase order data and Order.Invoice will be null. My query looks something like this:
DC.Orders.Where(a => a.Order_ID == id).SingleOrDefault();
I try not to micromanage LINQ-to-SQL...it does 95% of what I need straight out of the box.
UPDATE 2:
I found this post that discusses the use of DefaultIfEmpty() in order to populate child entities with null if they don't exist. I tried it out with LINQPad on my database and converted that example to lambda syntax (since that's what I use):
ParentTable.GroupJoin
(
ChildTable,
p => p.ParentTable_ID,
c => c.ChildTable_ID,
(p, aggregate) => new { p = p, aggregate = aggregate }
)
.SelectMany (a => a.aggregate.DefaultIfEmpty (),
(a, c) => new
{
ParentTableEntity = a.p,
ChildTableEntity = c
}
)
From what I can figure out from this statement, the GroupJoin expression relates the parent and child tables, while the SelectMany expression aggregates the related child records. The key appears to be the use of the DefaultIfEmpty, which forces the inclusion of the parent entity record even if there are no related child records. (Thanks for compelling me to dig into this further...I think I may have found some useful stuff to help with a pretty huge report I've got on my pipeline...)
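The same GroupJoin / SelectMany / DefaultIfEmpty shape can be tried on in-memory data to see what it does to parents without children. This is a sketch with made-up tuples (the names and values are assumptions), run as LINQ to Objects rather than LINQ to SQL.

```csharp
using System;
using System.Linq;

// Made-up data: parent 2 has no children.
var parents = new[] { (Id: 1, Name: "P1"), (Id: 2, Name: "P2") };
var children = new[] { (ParentId: 1, Name: "C1a"), (ParentId: 1, Name: "C1b") };

var rows = parents
    // Pair each parent with the (possibly empty) sequence of its children.
    .GroupJoin(children,
               p => p.Id,
               c => c.ParentId,
               (p, matches) => new { p, matches })
    // Flatten again; DefaultIfEmpty() yields a single null for a parent with
    // no children, so that parent still produces one row.
    .SelectMany(a => a.matches.Select(c => c.Name).DefaultIfEmpty(),
                (a, childName) => a.p.Name + ":" + (childName ?? "(none)"))
    .ToList();

Console.WriteLine(string.Join(", ", rows));
```

Without the DefaultIfEmpty() call, the parent with no children would simply disappear from the output, which is the inner-join behaviour.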
UPDATE 3:
If the goal is to keep it simple, then it looks like you're going to have to reference those salesperson fields directly in your Select() expression. The reason you're having to use LoadWith<>() in the first place is because the tables are not being referenced anywhere in your query statement, so the LINQ engine won't automatically pull that information in.
As an example, given this structure:
MailingList            ListCompany
===========            ===========
List_ID (PK)           ListCompany_ID (PK)
ListCompany_ID (FK)    FullName (string)
I want to get the name of the company associated with a particular mailing list:
MailingLists.Where(a => a.List_ID == 2).Select(a => a.ListCompany.FullName)
If that association has NOT been made, meaning that the ListCompany_ID field in the MailingList table for that record is equal to null, this is the resulting SQL generated by the LINQ engine:
SELECT [t1].[FullName]
FROM [MailingLists] AS [t0]
LEFT OUTER JOIN [ListCompanies] AS [t1] ON [t1].[ListCompany_ID] = [t0].[ListCompany_ID]
WHERE [t0].[List_ID] = @p0

Linq Expression Syntax - How to make it more readable?

I am in the process of writing something that will use Linq to combine results from my database, via Linq2Sql and an in-memory list of objects in order to find out which of my in-memory objects match something on the database.
I've come up with the query in both expression and query syntax.
Expression Syntax
var query = order.Items.Join(productNonCriticalityList,
i => i.ProductID,
p => p.ProductID,
(i, p) => i);
Query Syntax
var query =
from p in productNonCriticalityList
join i in order.Items
on p.ProductID equals i.ProductID
select i;
I realise that we have all the code completion goodness with expression syntax, and I do actually use that more. Mainly because it's easier to create re-usable chunks of filter code that can be chained together to form more complex filters.
But for a join the latter seems far more readable to me, but maybe that is because I am used to writing T-SQL.
So, am I missing a trick or is it just a matter of getting used to it?
I agree with the other responders that the exact question you're asking is simply a matter of preference. Personally, I mix the two forms depending upon which is clearer for the specific query that I'm writing.
If I have one comment though, I would say that the query looks like it might load all of the items from the order. That might be fine for a single order one time, but if you're looping through lots of orders, it might be more efficient to load all of the items for all of the orders in one go (you might want to additionally filter by date or customer, or whatever, though). If you do that, you might get better results by switching the query around:
var productIds = (from p in productNonCriticalityList
                  orderby p.ProductID
                  select p.ProductID).Distinct();
var orderItems = from i in dc.OrderItems
                 where productIds.Contains(i.ProductID)
                 // && additional filtering here
                 select i;
It's a bit backwards at first glance, but it could save you from loading in all the order items and also from sending lots of queries. It works because the where productIds.Contains(...) call can be converted to where i.ProductID in (1, 2, 3, 4, 5) in SQL. Of course, you'd have to judge it based on the expected number of order items, and the number of product IDs.
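Here is the shape of that Contains-based filter on in-memory data, as a sketch. The tuples and values are made up, and because this runs as LINQ to Objects there is no SQL involved; with LINQ to SQL the same where clause over a local list is what gets turned into the IN (...) predicate.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Made-up stand-ins for the in-memory list and the OrderItems table.
var productNonCriticalityList = new[]
{
    (ProductID: 3, Name: "Widget"),
    (ProductID: 5, Name: "Gadget"),
    (ProductID: 3, Name: "Widget v2"),
};
var orderItems = new[]
{
    (OrderID: 1, ProductID: 3),
    (OrderID: 1, ProductID: 4),
    (OrderID: 2, ProductID: 5),
    (OrderID: 3, ProductID: 9),
};

// Collect the distinct IDs once, up front.
List<int> productIds = productNonCriticalityList
    .Select(p => p.ProductID)
    .Distinct()
    .ToList();

// Keep only the order items whose product is in the ID list.
var matchingItems = (from i in orderItems
                     where productIds.Contains(i.ProductID)
                     select i).ToList();

Console.WriteLine(string.Join(", ", matchingItems));
```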
It really all comes down to preference. Some people just hate the idea of query-like syntax in their code. I for one appreciate the query syntax; it is declarative and quite readable. Like you said though, the chainability of the first example is a nice thing to have. I guess for my money I would keep it as a query until I felt I needed to begin chaining calls.
I used to feel the same way. Now I find query syntax easier to read and write, particularly when things get complicated. As much as it irked me to type it the first time, 'let' does wonderful things in ways that would not be readable in Expression Syntax.
I prefer the Query syntax when it's complex and the Expression syntax when it's a simple query.
If a DBA were to read the C# code to see what SQL we are using, they would understand and digest the Query syntax easier.
Taking a simple example:
Query
var col = from o in orders
orderby o.Cost ascending
select o;
Expression
var col2 = orders.OrderBy(o => o.Cost);
To me, the Expression syntax is an easier choice to understand here.
Another example:
Query
var col9 = from o in orders
orderby o.CustomerID, o.Cost descending
select o;
Expression
var col6 = orders.OrderBy(o => o.CustomerID).
ThenByDescending(o => o.Cost);
Both are easy to read and understand, however if the query was
//returns same results as above
var col5 = from o in orders
orderby o.Cost descending
orderby o.CustomerID
select o;
//NOTE the ordering of the orderby's
That looks a little confusing to me, as the fields are in a different order and it appears a little backwards.
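To check that the two forms really agree, here is a small sketch over made-up orders (the data is an assumption). It runs as LINQ to Objects, where the equivalence relies on OrderBy being a stable sort: the second orderby re-sorts by CustomerID and keeps the earlier cost ordering among equal customers.

```csharp
using System;
using System.Linq;

// Made-up orders: two customers with two orders each.
var orders = new[]
{
    (CustomerID: 2, Cost: 10m),
    (CustomerID: 1, Cost: 5m),
    (CustomerID: 1, Cost: 20m),
    (CustomerID: 2, Cost: 30m),
};

// One ordering with a primary and a secondary key.
var thenByForm = orders
    .OrderBy(o => o.CustomerID)
    .ThenByDescending(o => o.Cost)
    .ToList();

// Two separate orderby clauses: the LAST one is the primary key.
var twoOrderBys = (from o in orders
                   orderby o.Cost descending
                   orderby o.CustomerID
                   select o).ToList();

Console.WriteLine(string.Join(", ", thenByForm));
Console.WriteLine(string.Join(", ", twoOrderBys));
```

Even though the results match here, the single orderby with two keys (or OrderBy followed by ThenByDescending) states the intent directly and sorts only once.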
For Joins
Query
var col = from c in customers
join o in orders on
c.CustomerID equals o.CustomerID
select new
{
c.CustomerID,
c.Name,
o.OrderID,
o.Cost
};
Expression:
var col2 = customers.Join(orders,
c => c.CustomerID,o => o.CustomerID,
(c, o) => new
{
c.CustomerID,
c.Name,
o.OrderID,
o.Cost
}
);
I find that Query is better.
My summary would be use whatever looks easiest and fastest to understand given the query at hand. There is no golden rule of which to use. However, if there are a lot of joins, I'd go with Query syntax.
Well, both statements are equivalent, so you could use either, depending on the surrounding code and which is more readable. In my project I decide which syntax to use based on those two conditions.
Personally I would write the expression syntax in one line, but this is a matter of taste.
