LINQ - Using where or join - Performance difference?

Based on this question:
What is difference between Where and Join in linq?
My question is the following:
Is there a performance difference in the following two statements:
from order in myDB.OrdersSet
from person in myDB.PersonSet
from product in myDB.ProductSet
where order.Persons_Id == person.Id && order.Products_Id == product.Id
select new { order.Id, person.Name, person.SurName, product.Model, UrunAdı = product.Name };
and
from order in myDB.OrdersSet
join person in myDB.PersonSet on order.Persons_Id equals person.Id
join product in myDB.ProductSet on order.Products_Id equals product.Id
select new { order.Id, person.Name, person.SurName, product.Model, UrunAdı = product.Name };
I would always use the second one, simply because it's clearer.
My question now is: is the first one slower than the second one?
Does it build a Cartesian product and then filter it with the where clauses?
Thank you.

It entirely depends on the provider you're using.
With LINQ to Objects, it will absolutely build the Cartesian product and filter afterwards.
For out-of-process query providers such as LINQ to SQL, it depends on whether it's smart enough to realise that it can translate it into a SQL join. Even if LINQ to SQL doesn't, it's likely that the query engine actually performing the query will do so - you'd have to check with the relevant query plan tool for your database to see what's actually going to happen.
Side-note: multiple "from" clauses don't always result in a Cartesian product - the contents of one "from" can depend on the current element of earlier ones, e.g.
from file in files
from line in ReadLines(file)
...
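For completeness, here is a self-contained LINQ to Objects sketch of that dependent-from pattern (the file names here are made up; File.ReadLines streams each file lazily):
using System;
using System.IO;
using System.Linq;

class DependentFromDemo
{
    static void Main()
    {
        // Hypothetical log files; each must exist for ReadLines to succeed.
        var files = new[] { "a.log", "b.log" };

        // The second "from" depends on the current element of the first,
        // so this compiles to SelectMany - no Cartesian product is built.
        var query =
            from file in files
            from line in File.ReadLines(file)
            select $"{file}: {line}";

        foreach (var entry in query)
            Console.WriteLine(entry);
    }
}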

My question now is: is the first one slower than the second one? Does it build a Cartesian product and then filter it with the where clauses?
If the collections are in memory, then yes. There is no query optimizer for LINQ to Objects - it simply does what the programmer asks, in the order it is asked.
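To see that concretely, here is a minimal in-memory sketch with made-up Person/Order types; the multiple-from form compares every pair, while join builds a lookup over the inner sequence:
using System;
using System.Linq;

record Person(int Id, string Name);
record Order(int Id, int PersonId);

class JoinShapesDemo
{
    static void Main()
    {
        var people = new[] { new Person(1, "Ada"), new Person(2, "Linus") };
        var orders = new[] { new Order(10, 1), new Order(11, 2) };

        // Shape 1: cross join then filter - O(orders x people) comparisons.
        var crossThenFilter =
            from o in orders
            from p in people
            where o.PersonId == p.Id
            select new { o.Id, p.Name };

        // Shape 2: Enumerable.Join - builds a hash lookup on the inner
        // sequence, roughly O(orders + people).
        var hashJoin =
            from o in orders
            join p in people on o.PersonId equals p.Id
            select new { o.Id, p.Name };

        Console.WriteLine(crossThenFilter.SequenceEqual(hashJoin)); // True
    }
}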
If the collections are in a database (which the myDB variable suggests), then no. The query is translated into SQL and sent off to the database, where there is a query optimizer. This optimizer will generate an execution plan. Since both queries ask for the same logical result, it is reasonable to expect the same efficient plan to be generated for both. The only ways to be certain are to
inspect the execution plans
or measure the IO (SET STATISTICS IO ON).
Is there a performance difference
If you find yourself in a scenario where you have to ask, you should cultivate tools with which to measure and discover the truth for yourself. Measure - not ask.
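In that spirit, a crude Stopwatch sketch for the in-memory case (for a database-backed query, inspect the execution plan instead - wall-clock timing mostly measures the round trip):
using System;
using System.Diagnostics;
using System.Linq;

class MeasureDemo
{
    static void Main()
    {
        var left = Enumerable.Range(0, 2000).ToArray();
        var right = Enumerable.Range(0, 2000).ToArray();

        var sw = Stopwatch.StartNew();
        // Cartesian product + filter: 4,000,000 comparisons.
        var viaWhere = (from l in left from r in right where l == r select l).Count();
        Console.WriteLine($"from+where: {sw.ElapsedMilliseconds} ms, {viaWhere} rows");

        sw.Restart();
        // Hash join: one pass over each sequence.
        var viaJoin = (from l in left join r in right on l equals r select l).Count();
        Console.WriteLine($"join:       {sw.ElapsedMilliseconds} ms, {viaJoin} rows");
    }
}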

Related

How to implement conditional relationship clause in cypher for Spring Neo4j query?

I have a cypher query called from Java (Spring Neo4j) that I'm using, which works.
The basic query is straightforward - it asks what nodes lead to another node.
The gist of the challenge is that there is a set of three items - contributor, tag, and source - by which the query might need to filter its results.
Two of these - tag and source - bring with them a relationship that needs to be tested. But if these are not provided, then that relationship does not need to be tested.
The Neo4j browser warns me that the query I'm using will be inefficient, for probably obvious reasons:
@Query("MATCH (p:BoardPosition)<-[:PARENT]-(n:BoardPosition)<-[:PARENT*0..]-(c:BoardPosition),
        (t:Tag), (s:JosekiSource) WHERE
        id(p) = {TargetID} AND
        ({ContributorID} IS NULL OR c.contributor = {ContributorID}) AND
        ({TagID} IS NULL OR ((c)-->(t) AND id(t) = {TagID})) AND
        ({SourceID} IS NULL OR ((c)-->(s) AND id(s) = {SourceID}))
        RETURN DISTINCT n")
List<BoardPosition> findFilteredVariations(@Param("TargetID") Long targetId,
                                           @Param("ContributorID") Long contributorId,
                                           @Param("TagID") Long tagId,
                                           @Param("SourceID") Long sourceId);
(Java multi-line query string concatenation removed from the code for clarity)
It would appear that this would be more efficient if the conditional inclusion of the relationship-test could be in the MATCH clause itself. But that does not appear possible.
Is there some in-cypher way I can improve the efficiency of this?
FWIW: the Neo4j browser warning is:
This query builds a cartesian product between disconnected patterns.
If a part of a query contains multiple disconnected patterns, this
will build a cartesian product between all those parts. This may
produce a large amount of data and slow down query processing.

Compound "from" clause in Linq query in Entity Framework

I've been working with Entity Framework for a few weeks now, and I have been working with LINQ to Objects and LINQ to SQL for years. A lot of the time, I like to write LINQ statements like this:
from student in db.Students
from score in student.Scores
where score > 90
select student;
With other forms of LINQ, this returns distinct students who have at least one score greater than 90. However, in EF this query returns one student for every score greater than 90.
Does anyone know if this behavior can be replicated in unit tests? Is it possible this is a bug in EF?
I don't like that SQL-like syntax (officially called "query syntax"), especially when you start nesting queries.
var students = db.Students
    .Where(student => student.Scores.Any(score => score > 90))
    .ToList();
This snippet, using the method syntax, does the same thing. I find it far more readable. It's more explicit in the order of operations used.
And as far as I have experienced, EF hasn't yet shown a bug with its selection using method syntax.
Edit
To actually answer your problem:
However, in EF this query returns one student for every score greater than 90.
I think this is due to a JOIN statement in the final SQL that will be run. This is why I avoid query syntax: it becomes very hard to differentiate between what you want to retrieve (students) and what you want to filter with (scores).
Much like you would in SQL, you are joining the data from students and scores, and then running a filter over that joined collection. It then becomes harder to separate that result back into a collection of distinct students. I think this is the main cause of your issue. It's not a bug per se, but I think EF can only handle it one way.
Alternative solutions to the above:
If it returns one student per score over 90, take the distinct students from the result - it should be the same result set (see the sketch below).
Use more explicit parentheses () and formatting to nest separate query-syntax statements.
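A sketch of that first alternative, reusing the question's db.Students model (the identifiers are assumptions taken from the question, not a verified schema):
// Keep the original query shape, then collapse the duplicate students.
var distinctStudents = (from student in db.Students
                        from score in student.Scores
                        where score > 90
                        select student)
                       .Distinct()
                       .ToList();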
Note: I'm not saying it can't be done with query syntax. I am well aware that most of this answer is opinion based.

making inner join linq or using [table].[joiningTable].column is the same?

I have recently started working with LINQ, and I was wondering: suppose I have two related tables, Project (with a foreign key AccessLevelId) and AccessLevel, and I just want to select values from both tables. There are two ways I can select values from these tables.
The one I commonly use is:
(from P in DataContext.Projects
 join AL in DataContext.AccessLevel
     on P.AccessLevelId equals AL.AccessLevelId
 select new
 {
     ProjectName = P.Name,
     Access = AL.AccessName
 })
Another way of doing this would be:
(from P in DataContext.Projects
 select new
 {
     ProjectName = P.Name,
     Access = P.AccessLevel.AccessName
 })
What I wanted to know is: which of these ways is more efficient if we increase the number of tables to, say, 5 or 6, with 1 or 2 of them containing thousands of records?
You should take a look at the SQL that gets generated. You have to understand that there are several potential performance bottlenecks in a LINQ query (in this case I assume LINQ to SQL); the usual main bottleneck is the SQL query executed on the server.
Typically SQL Server has a very good optimizer, so given logically equivalent queries, however refactored, the performance is pretty uniform.
However, in your case there is a very real difference between the two queries. A project with no access level would not appear in the first query, while the second query would return it with a null AccessName. In effect, you would be comparing a LEFT JOIN to an INNER JOIN.
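If you want the explicit-join version to keep those projects as well, the usual query-syntax pattern is a group join with DefaultIfEmpty(); a sketch reusing the question's DataContext names (assumptions from the question, not a verified model):
// An explicit LEFT OUTER JOIN in query syntax, assuming the question's
// DataContext.Projects / DataContext.AccessLevel model.
var query =
    from P in DataContext.Projects
    join AL in DataContext.AccessLevel
        on P.AccessLevelId equals AL.AccessLevelId into levels
    from lvl in levels.DefaultIfEmpty()
    select new
    {
        ProjectName = P.Name,
        Access = lvl != null ? lvl.AccessName : null
    };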
TL;DR: for SQL Server via LINQ to SQL or Entity Framework, queries that do the same thing should give similar performance. However, your queries are far from similar.

Should I apply string manipulation after or before joining tables in Oracle

I have two tables that I need to inner join; one table has a relatively small number of records compared to the other. I need to apply some string manipulation to the smaller table, and my question is: can I apply the string function after the join, or should I apply it in a subquery and then join the subselect to the bigger table?
An example would be something like this:
Option 1:
SELECT SUBSTR("SMALL_TABLE"."COL_NAME",x,y) "NEW_COL" FROM "BIG_TABLE"
JOIN "SMALL_TABLE" ON ...
Option 2:
SELECT "NEW_COL"
FROM "BIG_TABLE"
JOIN
(
SELECT SUBSTR("SMALL_TABLE"."COL_NAME",x,y) "NEW_COL" FROM "SMALL_TABLE"
) "T"
ON ...
Which is better for performance, option 1 or option 2?
I am using Oracle 11g.
Regardless of how you structure the query, Oracle's optimizer is free to evaluate the function before or after the join. Assuming that the string manipulation is only done as part of the projection step (i.e. it is done only in the SELECT clause and is not used as a predicate in the WHERE clause), I would expect that Oracle would apply the SUBSTR before joining the tables if you used either formulation because it would then have to apply the function to fewer rows (though it can probably treat the SUBSTR as a deterministic call and cache the results if it applies the function after the join).
As with any query optimization question, the first step is always to generate a query plan and see if the different queries actually produce different plans. I would expect the plans to be identical and, thus, the performance to be identical. But there are any number of reasons that one of the two options might produce different plans on your system given your optimizer statistics, initialization parameters, etc.
It is generally better to apply the operations before doing the join, and then join and query for the final result - this is a classic query-tuning technique.
By doing so in your case, you will perform fewer operations when joining, as you will have eliminated the useless rows beforehand.
Lots of examples here: http://beginner-sql-tutorial.com/sql-query-tuning.htm
and this is the best one I could find: http://www.cse.iitb.ac.in/~sudarsha/db-book/slide-dir/ch14.ppt

Linq - Is operations order relevant?

I have some slow LINQ queries and need to optimize them. I have read about compiled queries and about setting the merge option to NoTracking for my read-only operations.
But I think my problem is that I have too many Includes, so the number of joins done in the DB is huge.
context.ExampleEntity
.Include("A")
.Include("B")
.Include("D.E.F")
.Include("G.H")
.Include("I.J")
.Include("K.M")
.Include("K.N")
.Include("O.P")
.Include("Q.R")
.Where(a => condition1 || complexCondition2)
My doubt is: if I put the Where before the Includes, would this filter the ExampleEntity objects before making all the joins? I'm not sure how LINQ queries are translated to SQL.
"Yes".
Each sub-query passes its results to the next. Moving the Where first will filter, then perform the Includes against a potentially smaller set.
Whether that makes sense in the context of your specific query is up to you to decide.
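Concretely, the reordering suggested above would look like this (reusing the question's hypothetical context, conditions, and include paths; depending on your EF version you may need the Include extension method from System.Data.Entity for Include to compile after a Where):
// Sketch: filter first, then declare the Includes, using the question's
// hypothetical context and condition names.
var results = context.ExampleEntity
    .Where(a => condition1 || complexCondition2)
    .Include("A")
    .Include("B")
    .Include("D.E.F")
    .Include("G.H")
    .Include("I.J")
    .Include("K.M")
    .Include("K.N")
    .Include("O.P")
    .Include("Q.R")
    .ToList();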
