LINQ Include vs Join. Are they equivalent? - linq

I have used join in LINQ to join two tables. What is the difference between a join and Include? From what I can see, they both behave the same.
Include vs. Join

An Include is intended to retain the original object structures and graphs. A Join is needed to project a flattened representation of the object graph or to join types which are not naturally related through the graph (e.g. joining the customer's city with a shipping facility's city).
Compare the following:
db.Customers.Include("Orders")
This generates an IEnumerable<Customer>, each element of which may contain its corresponding list of Orders, in an object graph like this:
Customer 1
Order
Order
Order
Customer 2
Order
Order
In contrast, if you do the same with a join projecting into an anonymous type you could get the following:
from c in db.Customers
join o in db.Orders on c.CustomerId equals o.CustomerId
select new {c, o}
This produces a new IEnumerable<Anonymous<Customer, Order>> where the customer is repeated for each order.
{ Customer1, orderA }
{ Customer1, orderB }
{ Customer1, orderC }
{ Customer2, orderD }
{ Customer2, orderE }
While both may issue the same request to the database, the resulting type may be quite different.
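The difference in shape can be sketched with plain LINQ to Objects. This is only an in-memory illustration, not what Entity Framework does internally; the Customer and Order classes and the ShapeDemo helper are hypothetical stand-ins for the entity classes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory stand-ins for the entity classes.
public class Order { public int OrderId; public int CustomerId; }

public class Customer
{
    public int CustomerId;
    public List<Order> Orders = new List<Order>();
}

public static class ShapeDemo
{
    // Builds the graph shape: two customers with three and two orders.
    public static List<Customer> SampleCustomers()
    {
        var c1 = new Customer { CustomerId = 1 };
        c1.Orders.Add(new Order { OrderId = 10, CustomerId = 1 });
        c1.Orders.Add(new Order { OrderId = 11, CustomerId = 1 });
        c1.Orders.Add(new Order { OrderId = 12, CustomerId = 1 });

        var c2 = new Customer { CustomerId = 2 };
        c2.Orders.Add(new Order { OrderId = 20, CustomerId = 2 });
        c2.Orders.Add(new Order { OrderId = 21, CustomerId = 2 });

        return new List<Customer> { c1, c2 };
    }

    // Builds the flattened shape: one element per (customer, order) pair.
    public static int FlatRowCount(List<Customer> customers)
    {
        var orders = customers.SelectMany(c => c.Orders);
        var flat = from c in customers
                   join o in orders on c.CustomerId equals o.CustomerId
                   select new { c, o };
        return flat.Count();
    }

    public static void Main()
    {
        var customers = SampleCustomers();
        Console.WriteLine(customers.Count);         // graph: one element per customer
        Console.WriteLine(FlatRowCount(customers)); // flat: one element per order
    }
}
```

With the graph shape the customer appears once and owns its orders; with the join the customer reference is repeated on every row.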

In a sense, yes. Include is implemented as a join. Depending on the nullability of the included link it is an inner or left join.
You can always build an include yourself by using a join, like this:
db.Users.Select(u => new { u, u.City })
This is an "include" for the user's city. It manifests itself as a SQL join.

Use Include if you simply need all Orders for some Customers. A good example in a blog application is always displaying all Comments below each Article.
A join, by contrast, is more helpful if you need some Customers and want to filter them using data contained in the Orders entity. For example, you might want to pick out the Articles whose Comments contain vulgar words in order to report them to the police.
Also, if your Orders entity contains a lot of data (many columns) that takes up a lot of memory and you don't need all of it, a join that projects only the needed columns can be much more efficient. What counts as "a lot of data" or "many columns" is always an open question, though, so measuring first is the best choice.

Related

Joining two tables and returning multiple records as one row using LINQ

I'm trying to write a LINQ expression that will join two tables and return data in a format similar to what is possible using MySql's GROUP_CONCAT. I tried searching around on Google and SO, but all the results I found used MSSQL or were only using one table. The expression I have written now looks like this:
from d in division
join o in office on d.Id equals o.DivisionId
select new
{
id = d.Id,
cell = new string[] { d.DivisionName, o.OfficeName }
}
As expected, this returns a list of every division and what offices correspond to that division. The only problem is that since most divisions will have more than one office, I get a division back for each office in said division. Essentially I'm seeing results like this:
Division1: Office1
Division1: Office2
Division1: Office3
Division2: Office1
When I want to see:
Division1: Office1, Office2, Office3
Division2: Office1
I remember doing something a while ago with MySql that used GROUP_CONCAT, but I can't figure out what the equivalent of that would be using LINQ. I tried writing a method which had an IEnumerable<Office> parameter and built a string using the Aggregate extension method, but the way I have my LINQ expression written now, each Office is passed in rather than an IEnumerable<Office>. Is there a better way to approach this problem than what I'm doing now? I'm rather new to LINQ expressions, so I apologize if this is trivial.
You want a group join, e.g.
from d in division
join o in office on d.Id equals o.DivisionId into offices
select new
{
id = d.Id,
divisionName = d.DivisionName,
officeNames = offices.Select(o => o.OfficeName)
}
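To get the comma-separated string that GROUP_CONCAT would produce, the grouped office names can be passed to string.Join. The following is a minimal in-memory sketch; the Division and Office classes and the GroupConcatDemo helper are hypothetical. Against a database provider, string.Join may not translate to SQL, so the concatenation might have to run after the rows are materialized (for example after AsEnumerable()).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory versions of the two tables.
public class Division { public int Id; public string DivisionName; }
public class Office { public int DivisionId; public string OfficeName; }

public static class GroupConcatDemo
{
    // Group join: one result per division, with its offices concatenated.
    public static List<string> Describe(IEnumerable<Division> divisions,
                                        IEnumerable<Office> offices)
    {
        var query = from d in divisions
                    join o in offices on d.Id equals o.DivisionId into divisionOffices
                    select d.DivisionName + ": " +
                           string.Join(", ", divisionOffices.Select(o => o.OfficeName));
        return query.ToList();
    }

    public static List<string> Sample()
    {
        var divisions = new List<Division>
        {
            new Division { Id = 1, DivisionName = "Division1" },
            new Division { Id = 2, DivisionName = "Division2" }
        };
        var offices = new List<Office>
        {
            new Office { DivisionId = 1, OfficeName = "Office1" },
            new Office { DivisionId = 1, OfficeName = "Office2" },
            new Office { DivisionId = 1, OfficeName = "Office3" },
            new Office { DivisionId = 2, OfficeName = "Office1" }
        };
        return Describe(divisions, offices);
    }

    public static void Main()
    {
        foreach (var line in Sample())
        {
            Console.WriteLine(line);
        }
        // Division1: Office1, Office2, Office3
        // Division2: Office1
    }
}
```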

Can LINQ ToArray return a strongly-typed array in this example?

I've contrived this example because it's an easily digested version of the actual problem I'm trying to solve. Here are the classes and their relationships.
First we have a Country class that contains a Dictionary of State objects indexed by a string (their name or abbreviation for example). The contents of the State class are irrelevant:
class Country
{
Dictionary<string, State> states;
}
class State { ... }
We also have a Company class which contains a Dictionary of zero or more BranchOffice objects also indexed by state names or abbreviations.
class Company
{
Dictionary<string, BranchOffice> branches;
}
class BranchOffice { ... }
The instances we're working with are one Country object and an array of Company objects:
Country usa;
Company[] companies;
What I want is an array of the State objects which contain a branch. The LINQ I wrote is below. First it grabs all the companies which actually contain a branch, then joins to the list of states by comparing the keys of both lists.
The problem is that ToArray returns an anonymous type. I understand why anonymous types can't be cast to strong types. I'm trying to figure out whether I could change something to get back a strongly typed array. (And I'm open to suggestions about better ways to write the LINQ overall.)
I've tried casting to BranchOffice all over the place (up front, at list2, at the final select, and other less-likely candidates).
BranchOffice[] offices =
(from cm in companies
where cm.branches.Count > 0
select new {
list2 =
(from br in cm.branches
join st in usa.states on br.Key equals st.Key
select st.Value
)
}
).ToArray();
You can do:
select new MyClassOfSomeType {
..
}
For selection, you can give it a custom class type. You can also then use ToList. With ArrayList, if you need to keep it loosely typed, you can then make it strongly typed later using Cast<>, though only for any select result that doesn't generate an anonymous class.
HTH.
If I understand the problem correctly, you want just the states that have branch offices in them, not the branches too. If so, one possible LINQ query is the following:
State[] offices =
(from cm in companies
where cm.branches.Count > 0
from br in cm.branches
join st in usa.states on br.Key equals st.Key
select st.Value
).Distinct().ToArray();
If you want both the states and the branches, then you will have to do a group by, and the result will be an IEnumerable<IGrouping<State, BranchOffice>>, which you can process afterwards.
var statesAndBranches =
from cm in companies
where cm.branches.Count > 0
from br in cm.branches
join st in usa.states on br.Key equals st.Key
group br.Value by st.Value into g
select g;
Just one more thing: even though you have states and branches declared as dictionaries, they are used here as IEnumerable (from keyValuePair in dictionary), so you will not get any performance benefit from them.
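If the keyed lookup is what you are after, one option is to drop the join and probe the states dictionary directly. This is only a sketch under assumptions: the fields are made public and initialized so the example compiles, and the Name field and the DictionaryLookupDemo helper are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified versions of the question's classes (fields made public here).
public class State { public string Name; }
public class BranchOffice { }

public class Country
{
    public Dictionary<string, State> states = new Dictionary<string, State>();
}

public class Company
{
    public Dictionary<string, BranchOffice> branches = new Dictionary<string, BranchOffice>();
}

public static class DictionaryLookupDemo
{
    // Collects the distinct branch keys, then looks each one up by key.
    public static State[] StatesWithBranches(Country usa, IEnumerable<Company> companies)
    {
        return companies
            .SelectMany(cm => cm.branches.Keys)
            .Distinct()
            .Where(key => usa.states.ContainsKey(key))
            .Select(key => usa.states[key])
            .ToArray();
    }

    public static State[] Sample()
    {
        var usa = new Country();
        usa.states["MI"] = new State { Name = "Michigan" };
        usa.states["OH"] = new State { Name = "Ohio" };
        usa.states["TX"] = new State { Name = "Texas" };

        var first = new Company();
        first.branches["MI"] = new BranchOffice();
        first.branches["OH"] = new BranchOffice();

        var second = new Company();
        second.branches["MI"] = new BranchOffice();

        return StatesWithBranches(usa, new[] { first, second });
    }

    public static void Main()
    {
        foreach (var state in Sample())
        {
            Console.WriteLine(state.Name);
        }
    }
}
```

The result is already a strongly typed State[] because the final Select projects State values rather than an anonymous type.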

When to prefer joins expressed with SelectMany() over joins expressed with the join keyword in Linq

LINQ allows you to express inner joins either with the join keyword or with
SelectMany() (i.e. a couple of from clauses) combined with a where clause:
var personsToState = from person in persons
join state in statesOfUS
on person.State equals state.USPS
select new { person, State = state.Name };
foreach (var item in personsToState)
{
System.Diagnostics.Debug.WriteLine(item);
}
// The same query can be expressed with the query operator SelectMany(), which is
// expressed as two from clauses and a single where clause connecting the sequences.
var personsToState2 = from person in persons
from state in statesOfUS
where person.State == state.USPS
select new { person, State = state.Name };
foreach (var item in personsToState2)
{
System.Diagnostics.Debug.WriteLine(item);
}
My question: when does it make sense to use the join style and when the where style?
Does one style have performance advantages over the other?
For local queries Join is more efficient due to its keyed lookup as Athari mentioned, however for LINQ to SQL (L2S) you'll get more mileage out of SelectMany. In L2S a SelectMany ultimately uses some type of SQL join in the generated SQL depending on your query.
Take a look at questions 11 & 12 of the LINQ Quiz by Joseph/Ben Albahari, authors of C# 4.0 In a Nutshell. They show samples of different types of joins and they state:
With LINQ to SQL, SelectMany-based
joins are the most flexible, and can
perform both equi and non-equi joins.
Throw in DefaultIfEmpty, and you can
do left outer joins as well!
In addition, Matt Warren has a detailed blog post on this topic as it pertains to IQueryable / SQL here: LINQ: Building an IQueryable provider - Part VII.
Back to your question of which to use, you should use whichever query is more readable and allows you to easily express yourself and construct your end goal clearly. Performance shouldn't be an initial concern unless you are dealing with large collections and have profiled both approaches. In L2S you have to consider the flexibility SelectMany offers you depending on the way you need to pair up your data.
Join is more efficient because it uses the Lookup class (a variation of Dictionary with multiple values for a single key) to find matching values.
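The keyed lookup can be made visible by building the Lookup by hand with ToLookup. This is a rough sketch of the idea for local (LINQ to Objects) queries, not the framework's actual source; the Person and UsState classes, the Name field on Person, and the LookupJoinDemo helper are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical element types matching the property names used in the question.
public class Person { public string Name; public string State; }
public class UsState { public string USPS; public string Name; }

public static class LookupJoinDemo
{
    // Index the inner sequence once, then probe it by key for each person.
    public static List<string> JoinViaLookup(IEnumerable<Person> persons,
                                             IEnumerable<UsState> statesOfUS)
    {
        ILookup<string, UsState> statesByCode = statesOfUS.ToLookup(s => s.USPS);

        // A missing key yields an empty sequence, so unmatched persons drop out,
        // which is the behavior of an inner join.
        return (from person in persons
                from state in statesByCode[person.State]
                select person.Name + " -> " + state.Name).ToList();
    }

    public static List<string> Sample()
    {
        var persons = new List<Person>
        {
            new Person { Name = "Ann", State = "MI" },
            new Person { Name = "Bob", State = "TX" },
            new Person { Name = "Cy", State = "ZZ" } // no matching state
        };
        var states = new List<UsState>
        {
            new UsState { USPS = "MI", Name = "Michigan" },
            new UsState { USPS = "TX", Name = "Texas" }
        };
        return JoinViaLookup(persons, states);
    }

    public static void Main()
    {
        foreach (var line in Sample())
        {
            Console.WriteLine(line);
        }
        // Ann -> Michigan
        // Bob -> Texas
    }
}
```

The where-style query, by contrast, compares every person with every state, which is why it tends to be slower on large local collections.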

Can I force the auto-generated Linq-to-SQL classes to use an OUTER JOIN?

Let's say I have an Order table which has a FirstSalesPersonId field and a SecondSalesPersonId field. Both of these are foreign keys that reference the SalesPerson table. For any given order, either one or two salespersons may be credited with the order. In other words, FirstSalesPersonId can never be NULL, but SecondSalesPersonId can be NULL.
When I drop my Order and SalesPerson tables onto the "Linq to SQL Classes" design surface, the class builder spots the two FK relationships from the Order table to the SalesPerson table, and so the generated Order class has a SalesPerson field and a SalesPerson1 field (which I can rename to SalesPerson1 and SalesPerson2 to avoid confusion).
Because I always want to have the salesperson data available whenever I process an order, I am using DataLoadOptions.LoadWith to specify that the two salesperson fields are populated when the order instance is populated, as follows:
dataLoadOptions.LoadWith<Order>(o => o.SalesPerson1);
dataLoadOptions.LoadWith<Order>(o => o.SalesPerson2);
The problem I'm having is that Linq to SQL is using something like the following SQL to load an order:
SELECT ...
FROM Order O
INNER JOIN SalesPerson SP1 ON SP1.salesPersonId = O.firstSalesPersonId
INNER JOIN SalesPerson SP2 ON SP2.salesPersonId = O.secondSalesPersonId
This would make sense if there were always two salesperson records, but because there is sometimes no second salesperson (secondSalesPersonId is NULL), the INNER JOIN causes the query to return no records in that case.
What I effectively want here is to change the second INNER JOIN into a LEFT OUTER JOIN. Is there a way to do that through the UI for the class generator? If not, how else can I achieve this?
(Note that because I'm using the generated classes almost exclusively, I'd rather not have something tacked on the side for this one case if I can avoid it).
Edit: per my comment reply, the SecondSalesPersonId field is nullable (in the DB, and in the generated classes).
The default behaviour actually is a LEFT JOIN, assuming you've set up the model correctly.
Here's a slightly anonymized example that I just tested on one of my own databases:
class Program
{
static void Main(string[] args)
{
using (TestDataContext context = new TestDataContext())
{
DataLoadOptions dlo = new DataLoadOptions();
dlo.LoadWith<Place>(p => p.Address);
context.LoadOptions = dlo;
var places = context.Places.Where(p => p.ID >= 100 && p.ID <= 200);
foreach (var place in places)
{
Console.WriteLine("{0}, {1}", place.ID, place.AddressID);
}
}
}
}
This is just a simple test that prints out a list of places and their address IDs. Here is the query text that appears in the profiler:
SELECT [t0].[ID], [t0].[Name], [t0].[AddressID], ...
FROM [dbo].[Places] AS [t0]
LEFT OUTER JOIN (
SELECT 1 AS [test], [t1].[AddressID],
[t1].[StreetLine1], [t1].[StreetLine2],
[t1].[City], [t1].[Region], [t1].[Country], [t1].[PostalCode]
FROM [dbo].[Addresses] AS [t1]
) AS [t2] ON [t2].[AddressID] = [t0].[AddressID]
WHERE ([t0].[PlaceID] >= @p0) AND ([t0].[PlaceID] <= @p1)
This isn't exactly a very pretty query (your guess is as good as mine as to what that 1 AS [test] is all about), but it's definitely a LEFT JOIN and doesn't exhibit the problem you seem to be having. And this is just using the generated classes; I haven't made any changes.
Note that I also tested this on a dual relationship (i.e. a single Place having two Address references, one nullable, one not), and I get the exact same results. The first (non-nullable) gets turned into an INNER JOIN, and the second gets turned into a LEFT JOIN.
It has to be something in your model, like changing the nullability of the second reference. I know you say it's configured as nullable, but maybe you need to double-check? If it's definitely nullable then I suggest you post your full schema and DBML so somebody can try to reproduce the behaviour that you're seeing.
If you make the secondSalesPersonId field in the database table nullable, LINQ-to-SQL should properly construct the Association object so that the resulting SQL statement will do the LEFT OUTER JOIN.
UPDATE:
Since the field is nullable, your problem may be in explicitly declaring dataLoadOptions.LoadWith<>(). I'm running a similar situation in my current project where I have an Order, but the order goes through multiple stages. Each stage corresponds to a separate table with data related to that stage. I simply retrieve the Order, and the appropriate data follows along, if it exists. I don't use the dataLoadOptions at all, and it does what I need it to do. For example, if the Order has a purchase order record, but no invoice record, Order.PurchaseOrder will contain the purchase order data and Order.Invoice will be null. My query looks something like this:
DC.Orders.Where(a => a.Order_ID == id).SingleOrDefault();
I try not to micromanage LINQ-to-SQL...it does 95% of what I need straight out of the box.
UPDATE 2:
I found this post that discusses the use of DefaultIfEmpty() in order to populate child entities with null if they don't exist. I tried it out with LINQPad on my database and converted that example to lambda syntax (since that's what I use):
ParentTable.GroupJoin
(
ChildTable,
p => p.ParentTable_ID,
c => c.ChildTable_ID,
(p, aggregate) => new { p = p, aggregate = aggregate }
)
.SelectMany (a => a.aggregate.DefaultIfEmpty (),
(a, c) => new
{
ParentTableEntity = a.p,
ChildTableEntity = c
}
)
From what I can figure out from this statement, the GroupJoin expression relates the parent and child tables, while the SelectMany expression aggregates the related child records. The key appears to be the use of the DefaultIfEmpty, which forces the inclusion of the parent entity record even if there are no related child records. (Thanks for compelling me to dig into this further...I think I may have found some useful stuff to help with a pretty huge report I've got on my pipeline...)
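The same GroupJoin / SelectMany / DefaultIfEmpty pattern can be tried on in-memory collections to see the left outer join behavior. This is a small sketch with hypothetical Parent and Child classes and a hypothetical LeftJoinDemo helper; it does not show the SQL a provider would generate.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory stand-ins for the parent and child tables.
public class Parent { public int Id; public string Name; }
public class Child { public int ParentId; public string Name; }

public static class LeftJoinDemo
{
    public static List<string> LeftJoin(IEnumerable<Parent> parents,
                                        IEnumerable<Child> children)
    {
        return parents
            // Pair each parent with the (possibly empty) group of its children.
            .GroupJoin(children, p => p.Id, c => c.ParentId,
                       (p, matches) => new { p, matches })
            // DefaultIfEmpty supplies a single null child for empty groups,
            // so parents without children still produce one row.
            .SelectMany(x => x.matches.DefaultIfEmpty(),
                        (x, c) => x.p.Name + ": " + (c == null ? "(none)" : c.Name))
            .ToList();
    }

    public static List<string> Sample()
    {
        var parents = new List<Parent>
        {
            new Parent { Id = 1, Name = "P1" },
            new Parent { Id = 2, Name = "P2" }
        };
        var children = new List<Child>
        {
            new Child { ParentId = 1, Name = "C1" },
            new Child { ParentId = 1, Name = "C2" }
        };
        return LeftJoin(parents, children);
    }

    public static void Main()
    {
        foreach (var line in Sample())
        {
            Console.WriteLine(line);
        }
        // P1: C1
        // P1: C2
        // P2: (none)
    }
}
```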
UPDATE 3:
If the goal is to keep it simple, then it looks like you're going to have to reference those salesperson fields directly in your Select() expression. The reason you're having to use LoadWith<>() in the first place is because the tables are not being referenced anywhere in your query statement, so the LINQ engine won't automatically pull that information in.
As an example, given this structure:
MailingList ListCompany
=========== ===========
List_ID (PK) ListCompany_ID (PK)
ListCompany_ID (FK) FullName (string)
I want to get the name of the company associated with a particular mailing list:
MailingLists.Where(a => a.List_ID == 2).Select(a => a.ListCompany.FullName)
If that association has NOT been made, meaning that the ListCompany_ID field in the MailingList table for that record is equal to null, this is the resulting SQL generated by the LINQ engine:
SELECT [t1].[FullName]
FROM [MailingLists] AS [t0]
LEFT OUTER JOIN [ListCompanies] AS [t1] ON [t1].[ListCompany_ID] = [t0].[ListCompany_ID]
WHERE [t0].[List_ID] = @p0

LINQ to Entity Framework: return sorted list of related rows

Table Category (c) has a 1:many relationship with Table Question: a Category can have many Questions but a Question belongs only to one category. The questions are also ranked in terms of difficulty.
I would like a LINQ to EF query that would return all Categories with related Questions but the Questions should be sorted ascending in terms of difficulty (lets say on the Question.Rank column).
My query so far is missing the sorting of the questions:
var results = from c in context.Category.Include("Question") select c;
What needs to be added to get the questions sorted?
The standard way to order records is to use the orderby statement. See:
http://srtsolutions.com/blogs/billwagner/archive/2006/03/29/ordering-linq-results.aspx
Your problem is a little different since you are trying to order a sub list.
If you use orderby together with LoadWith, it should work:
http://social.msdn.microsoft.com/Forums/en-US/linqprojectgeneral/thread/ec54792f-1ffb-45c3-9525-797c96023de9
http://social.msdn.microsoft.com/forums/en-US/linqtosql/thread/adbd8e6a-2679-4d03-98fe-c4ed7726f95d/
With some luck the code should look something like this:
var results = from c in context.Category
select new { Category = c, Question = c.Question.OrderBy(o => o.Rank) };
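The projection approach can be checked against in-memory data first. The sketch below uses hypothetical Category and Question classes (with a Name field added for display) and a hypothetical SortedChildrenDemo helper; whether a given Entity Framework version translates the same projection to SQL is an assumption you would need to verify.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory stand-ins for the two entities.
public class Question { public int Rank; }

public class Category
{
    public string Name;
    public List<Question> Question = new List<Question>();
}

public static class SortedChildrenDemo
{
    // Projects each category together with its questions ordered by Rank.
    public static Dictionary<string, List<int>> RanksByCategory(IEnumerable<Category> categories)
    {
        var results = from c in categories
                      select new { Category = c, Question = c.Question.OrderBy(q => q.Rank) };

        return results.ToDictionary(
            r => r.Category.Name,
            r => r.Question.Select(q => q.Rank).ToList());
    }

    public static Dictionary<string, List<int>> Sample()
    {
        var math = new Category { Name = "Math" };
        math.Question.Add(new Question { Rank = 3 });
        math.Question.Add(new Question { Rank = 1 });
        math.Question.Add(new Question { Rank = 2 });
        return RanksByCategory(new[] { math });
    }

    public static void Main()
    {
        foreach (var pair in Sample())
        {
            Console.WriteLine(pair.Key + ": " + string.Join(",", pair.Value));
        }
        // Math: 1,2,3
    }
}
```

Note that the sorted questions live in the projected result, not in the Category.Question collection itself, which keeps its original order.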

Resources