Generated group by sql query does not make sense - linq

I'm using Entity Framework Core for MySQL, and I've been running a complex LINQ query that I'm trying to optimise.
I turned on logging in the MySQL server to view the queries that result from the LINQ queries.
Oddly, none of it made sense: my complex query, which joins 5 tables and has multiple group by, where, and order by clauses, showed up in the logs as 5 separate "select all columns from table" statements.
So I tried a simple group by statement on one table. The resulting SQL log showed "Select all_columns from table_name order by groupbyid".
Can anyone explain what happened here?
Thanks in advance.
More info as requested:
LINQ query:
var queryCommand = (from p in _context.TableExtract group p by p.tableExtractPersonId);
queryCommand.ToList();
Resulting MySQL log:
SELECT .... [very long list of column names]
FROM TableExtract AS p
ORDER BY p.tableExtractPersonId
I've tried two different Entity Framework providers, MySql.Data.EntityFrameworkCore (v8.0.17) and Pomelo.EntityFrameworkCore.MySql (v2.2.20), with the same results. I've also tried .NET Core 3.0 and got the same results. I'm going to try .NET Standard next.

Ok. I found it:
var queryCommand = (from p in _context.TableExtract group p by p.tableExtractPersonId into g select g.Key);
This forces LINQ to evaluate the query as a SQL GROUP BY. Otherwise, apparently, it does its own thing with the group by: judging by the log, it fetches every row ordered by the key and groups them on the client.
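To make the difference concrete, here is a minimal sketch in Python with sqlite3 (not EF Core itself; the table contents are made up and the column set is reduced). It contrasts the pattern seen in the log, where every row is fetched ordered by the key and grouped in memory, with a real SQL GROUP BY:

```python
import sqlite3
from itertools import groupby

# Hypothetical stand-in for the TableExtract table from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TableExtract (id INTEGER PRIMARY KEY, tableExtractPersonId INTEGER)"
)
conn.executemany(
    "INSERT INTO TableExtract (tableExtractPersonId) VALUES (?)",
    [(3,), (1,), (3,), (2,), (1,)],
)

# Client-side grouping: fetch every row ordered by the key, then group in memory.
rows = conn.execute(
    "SELECT id, tableExtractPersonId FROM TableExtract ORDER BY tableExtractPersonId"
).fetchall()
client_keys = [key for key, _group in groupby(rows, key=lambda r: r[1])]

# Server-side grouping: the database returns one row per key.
server_keys = [
    r[0]
    for r in conn.execute(
        "SELECT tableExtractPersonId FROM TableExtract "
        "GROUP BY tableExtractPersonId ORDER BY tableExtractPersonId"
    )
]

rows_transferred_client = len(rows)         # every row crosses the wire
rows_transferred_server = len(server_keys)  # only one row per group does
print(client_keys, server_keys, rows_transferred_client, rows_transferred_server)
```

Both approaches give the same keys, but the first one moves the whole table to the client, which is why it matters for optimisation.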

Related

Query in DB2 vs Oracle

There is a query with multiple inner joins. It involves two views, one of which is based on four tables, and there are four tables in total (including the two views).
The same query, with the same amount of data in the source tables, runs in both Oracle and DB2. Surprisingly, in DB2 it takes 2 minutes to load 3 million records, while in Oracle it takes two hours. The same indexes exist on all source tables in both environments. Do views behave differently when used in joins in the two environments (Oracle vs DB2)?
Here is a dummy query:
INSERT INTO TABLE_A
SELECT
adf.column1,
adf.column2,
dd.column3,
SUM(otl.column4) column4,
SUM(otl.column5) column5,
(Case when SUM(otl.column5) = 0 then 0
else round(CAST(SUM(otl.column4) AS DECIMAL(19,2)) /abs(CAST(SUM(otl.column4) AS DECIMAL(18,2))),4)
end) taxl_unrlz_cgl_pct
FROM
view_a adf
INNER JOIN table_b hr on hr.hh_ref_id = adf.hh_ref_id
AND hr.col_typ_cd = 'FIRM'
AND hr.col_end_dt = TO_DATE('1/1/2900','MM/DD/YYYY')
INNER JOIN dw.table_c ar on ar.colb_id = adf.colb_id
AND ar.col_cd = '#'
AND ar.col_num BETWEEN 10000000 AND 89999999
AND ar.col_dt IS NULL
INNER JOIN table_d dd on dd.col_id = adf.col_id
INNER JOIN view2 otl ON otl.cola_id = ar.cola_id
GROUP BY adf.column1, adf.column2, dd.column3;
Technically, both DB2 and Oracle will try to rewrite the query in the most efficient way possible, using the query you have coded as the starting point. But one of the common (though not frequent) issues I have seen with multi-table views is the DBMS not being able to rewrite the query in terms of the underlying tables. Depending on the complexity of the view itself, and sometimes on the additional joins, the DBMS may fail to do that rewrite and therefore cannot use the indexes on the tables underneath the view. When this happens, the view acts like a materialized table (work table) and the query does a table scan on that materialized table.
There is no consistent pattern for when this can happen, so you will need to check case by case.
Since you mention 2 hours vs 2 minutes, that is most probably what is happening here. You will need to check the access path on both Oracle and DB2. You will also need to make sure that statistics are up to date and that the access path is based on the latest statistics in each DBMS. Otherwise it won't be an apples-to-apples comparison.
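As an illustration of what "checking the access path" looks like, here is a small sketch using SQLite from Python as a stand-in. Oracle and DB2 have their own explain facilities with different output, so treat this only as the general idea. The table name is borrowed from the dummy query; the data and the index name ix_table_b_hh are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_b (hh_ref_id INTEGER, col_typ_cd TEXT)")
conn.executemany(
    "INSERT INTO table_b VALUES (?, ?)",
    [(i, "FIRM" if i % 2 else "OTHER") for i in range(1000)],
)


def access_path(sql):
    """Return the plan detail strings that SQLite reports for a query."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


query = "SELECT * FROM table_b WHERE hh_ref_id = 42"

# No index yet, so the only available access path is a full scan.
plan_without_index = access_path(query)

# Add an index and refresh statistics before looking at the plan again.
conn.execute("CREATE INDEX ix_table_b_hh ON table_b (hh_ref_id)")
conn.execute("ANALYZE")
plan_with_index = access_path(query)

print(plan_without_index)
print(plan_with_index)
```

If the plan for the slow system shows a scan where the fast one shows an index lookup on the same table, that is the kind of difference to look for.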

Oracle SQL sub query vs inner join

First, I read about the SELECT statement in the Oracle docs.
I have some questions about Oracle's SELECT behaviour when my query contains a select, a join, and a where clause.
See the information below:
My sample table:
[ P_IMAGE_ID ]
IMAGE_ID (PK)
FILE_NAME
FILE_TYPE
...
...
[ P_IMG_TAG ]
IMG_TAG_ID (PK)
IMAGE_ID (FK)
TAG
...
...
My requirement: get the distinct images whose tag is "70702".
Method 1: Select -> Join -> Where -> Distinct
SELECT DISTINCT PID.IMAGE_ID
, PID.FILE_NAME
FROM P_IMAGE_ID PID
INNER JOIN P_IMG_TAG PTAG
ON PTAG.IMAGE_ID = PID.IMAGE_ID
WHERE PTAG.TAG = '70702';
I think the query behaviour should be like:
join tables -> apply where clause -> distinct select
I use Oracle SQL developer to get the explain plan:
Method 1 cost 76.
Method 2: Select -> Where -> Where -> Distinct
SELECT DISTINCT PID.IMAGE_ID
, PID.FILE_NAME
FROM P_IMAGE_ID PID
WHERE PID.IMAGE_ID IN
(
SELECT PTAG.IMAGE_ID
FROM P_IMG_TAG PTAG
WHERE PTAG.TAG = '70702'
);
I think the second query's behaviour should be like:
apply where clause -> apply where clause -> distinct select
I used Oracle SQL Developer to get the explain plan again:
Method 2 costs 76 too. Why?
I believed that applying the where clause first, to reduce the work the database does and avoid joining the tables, would perform better than the join query. But now that I have tested it, I am confused: why are the costs of the 2 methods equal?
Or have I misunderstood something?
My questions:
1. Why are the costs of the 2 methods above equal?
2. If the sub-select on Tag = '70702' returns thousands or millions of rows or more, is the table join the better choice?
3. If the sub-select on Tag = '70702' returns few rows, is the sub-select the better choice because it reduces the data to process?
4. When I use method 1 (Select -> Join -> Where -> Distinct), does the database join the tables before applying the where clause?
5. Someone told me that if I move the condition Tag = '70702' into the join clause (i.e. INNER JOIN P_IMG_TAG PTAG ON PTAG.IMAGE_ID = PID.IMAGE_ID AND PTAG.TAG = '70702'), performance may be better. Is that right?
I read the topics "subselect vs outer join" and "subquery or inner join", but both are about SQL Server, and I'm not sure the same applies to Oracle.
The DBMS takes your query and executes something. But it doesn't execute steps that correspond to SQL statement parts in the order they appear in an SQL statement.
Read about "relational query optimization", which could just as well be called "relational query implementation". Eg for Oracle.
Any language processor takes declarations and calls as input and implements the described behaviour in terms of internal data structures and operations, maybe through one or more levels of "intermediate code" running on a "virtual machine", eventually down to physical machines. But even just staying in the input language, SQL queries can be rearranged into other SQL queries that return the same value but perform significantly better under simple and general implementation assumptions. Just as you know that your question's queries always return the same thing for a given database, the DBMS can know. Part of how it knows is that there are many rules for taking a relational algebra expression and generating a different but same-valued expression. Certain rewrite rules apply under certain limited circumstances. There are rules that take into consideration SQL-level relational things like primary keys, unique columns, foreign keys and other constraints. Other rules use implementation-oriented SQL-level things like indexes and statistics. This is the "relational query rewriting" part of relational query optimization.
Even when two different but equivalent queries generate different plans, the cost can be similar because the plans are so similar. Here, both a HASH and SORT index are UNIQUE. (It would be interesting to know what the few top plans were for each of your queries. It is quite likely that those few are the same for both, but that the plan that is more directly derived from the particular input expression is the one that is offered when there's little difference.)
The way to get the DBMS to find good query plans is to write the most natural expression of a query that you can find.
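To see that the formulations in the question really are interchangeable from the result's point of view, here is a small sketch using SQLite from Python (not Oracle, and with made-up rows). It runs the join, the IN sub-select, and the variant with the tag condition moved into the ON clause, and compares what comes back:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE P_IMAGE_ID (IMAGE_ID INTEGER PRIMARY KEY, FILE_NAME TEXT);
CREATE TABLE P_IMG_TAG (
    IMG_TAG_ID INTEGER PRIMARY KEY,
    IMAGE_ID INTEGER REFERENCES P_IMAGE_ID (IMAGE_ID),
    TAG TEXT
);
INSERT INTO P_IMAGE_ID VALUES (1, 'a.png'), (2, 'b.png'), (3, 'c.png');
INSERT INTO P_IMG_TAG VALUES
    (1, 1, '70702'), (2, 1, '70702'), (3, 2, '11111'), (4, 3, '70702');
""")

# Method 1: join, then filter in WHERE.
join_where = """
SELECT DISTINCT PID.IMAGE_ID, PID.FILE_NAME
FROM P_IMAGE_ID PID
INNER JOIN P_IMG_TAG PTAG ON PTAG.IMAGE_ID = PID.IMAGE_ID
WHERE PTAG.TAG = '70702'
"""

# Method 2: filter through an IN sub-select.
in_subquery = """
SELECT DISTINCT PID.IMAGE_ID, PID.FILE_NAME
FROM P_IMAGE_ID PID
WHERE PID.IMAGE_ID IN (SELECT PTAG.IMAGE_ID FROM P_IMG_TAG PTAG WHERE PTAG.TAG = '70702')
"""

# Variant: tag condition moved into the ON clause of the inner join.
join_on = """
SELECT DISTINCT PID.IMAGE_ID, PID.FILE_NAME
FROM P_IMAGE_ID PID
INNER JOIN P_IMG_TAG PTAG ON PTAG.IMAGE_ID = PID.IMAGE_ID AND PTAG.TAG = '70702'
"""

# Sort in Python so the comparison does not depend on row order.
results = [sorted(conn.execute(q).fetchall()) for q in (join_where, in_subquery, join_on)]
print(results[0])
```

Because all three describe the same result, an optimizer is free to pick the same plan for each of them, which is consistent with seeing identical costs.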

Combine relation with custom SQL

I'd like to generate the following SQL with rails / arel:
SELECT * FROM GROUPS
WHERE id = 10
CONNECT BY PARENT_ID = ID
I don't want to use plain SQL except for the last clause, which is Oracle-specific (the real query is much more complex and I don't want to perform endless string concatenations).
What I've tried so far:
Group.where(id: 10).joins('CONNECT BY PARENT_ID=ID')
This does not work because it places the custom SQL before the WHERE clause (as it assumes it's a join).
So the actual question is: how do I add a custom bit of SQL to a query after the WHERE clause?

Return objects missing records in many to many using EF 5.0

Given the schema below, I'm trying to build an EF query that returns Contacts that are missing required Forms. Each Contact has a ContactType that is related to a collection of FormTypes. Every Contact is required to have at least one Form (in ContactForm) of the FormTypes related to its ContactType.
The query that EF generates from the linq query below works against Sql Server, but not against Oracle.
var query = ctx.Contacts.Where(c => c.ContactType.FormTypes.Select(ft => ft.FormTypeID)
.Except(c.Forms.Select(f => f.FormTypeID)).Any());
I'm in the process of refactoring a data layer so that all of the EF queries that work against Sql Server will also work against Oracle using Devart's dotConnect data provider.
The error that Oracle is throwing is ORA-00904: "Extent1"."ContactID": invalid identifier.
The problem is that Oracle apparently doesn't support referencing a table column from a query in a nested subquery of level 2 and deeper. The line that Oracle throws on is in the Except (or minus) sub query that is referencing "Extent1"."ContactID". "Extent1" is the alias for Contact that is defined at the top level of the query. Here is Devart's explanation of the Oracle limitation.
The way that I've resolved this issue for many queries is by re-writing them to move relationships between tables out of the Where() predicate into the main body of the query using SelectMany() and in some cases Join(). This tends to flatten the query being sent to the database server and minimizes or eliminates the sub queries produced by EF. Here is a similar issue solved using a left outer join.
The column "Extent1"."ContactID" exists and the naming syntax of the query that EF and Devart produce is not the issue.
Any ideas on how to re-write this query will be much appreciated. The objective is a query that returns Contacts missing Forms of a FormType required by the Contact's ContactType that works against Oracle and Sql Server.
The following entity framework query returns all the ContactIDs for Contacts missing FormTypes required by their ContactType when querying against both Sql Server and Oracle.
var contactNeedsFormTypes =
from c in Contacts
from ft in c.ContactType.FormTypes
select new { ft.FormTypeID, c.ContactID};
var contactHasFormTypes =
from c in Contacts
from f in c.Forms
select new { c.ContactID, f.FormTypeID};
var contactsMissingFormTypes =
from n in contactNeedsFormTypes
join h in contactHasFormTypes
on new {n.ContactID, n.FormTypeID} equals new {h.ContactID, h.FormTypeID}
into jointable
where jointable.Count()==0
select n.ContactID;
contactsMissingFormTypes.Distinct();
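The same "needs" vs "has" idea can be written as a flat anti-join in plain SQL. The sketch below uses SQLite from Python with a hypothetical, simplified schema (the link table ContactTypeFormType and all the rows are assumptions, and this is not the SQL that EF or Devart generates). The point is that no subquery refers back to an outer alias, which is the construct Oracle rejected:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical, simplified schema; names are assumptions for illustration.
CREATE TABLE Contact (ContactID INTEGER PRIMARY KEY, ContactTypeID INTEGER);
CREATE TABLE ContactTypeFormType (ContactTypeID INTEGER, FormTypeID INTEGER);
CREATE TABLE ContactForm (ContactID INTEGER, FormTypeID INTEGER);

INSERT INTO Contact VALUES (1, 10), (2, 10), (3, 20);
INSERT INTO ContactTypeFormType VALUES (10, 100), (10, 101), (20, 200);
INSERT INTO ContactForm VALUES (1, 100), (1, 101), (2, 100), (3, 999);
""")

# "needs" pairs every contact with each form type its contact type requires.
# The LEFT JOIN keeps pairs with no matching form; IS NULL selects exactly those.
missing_sql = """
SELECT DISTINCT n.ContactID
FROM (
    SELECT c.ContactID, r.FormTypeID
    FROM Contact c
    JOIN ContactTypeFormType r ON r.ContactTypeID = c.ContactTypeID
) AS n
LEFT JOIN ContactForm h
    ON h.ContactID = n.ContactID AND h.FormTypeID = n.FormTypeID
WHERE h.ContactID IS NULL
ORDER BY n.ContactID
"""

contacts_missing_forms = [row[0] for row in conn.execute(missing_sql)]
print(contacts_missing_forms)
```

In this sample data, contact 1 has both required forms, contact 2 lacks form type 101, and contact 3 only has a form of a type it does not need.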

Performing simultaneous unrelated queries in EF4 without a stored procedure?

I have a page that pulls together aggregate data from two different tables. I would like to perform these queries in parallel to reduce the latency without having to introduce a stored procedure that would do both.
For example, I currently have this:
ViewBag.TotalUsers = DB.Users.Count();
ViewBag.TotalPosts = DB.Posts.Count();
// Page displays both values but has two trips to the DB server
I'd like something akin to:
var info = DB.Select(db => new {
TotalUsers = db.Users.Count(),
TotalPosts = db.Posts.Count() });
// Page displays both values using one trip to DB server.
that would generate a query like this
SELECT (SELECT COUNT(*) FROM Users) AS TotalUsers,
(SELECT COUNT(*) FROM Posts) AS TotalPosts
Thus, I'm looking for a single query to hit the DB server. I'm not asking how to parallelize two separate queries using Tasks or Threads.
Obviously I could create a stored procedure that got back both values in a single trip, but I'd like to avoid that if possible as it's easier to add additional stats purely in code rather than having to keep refreshing the DB import.
Am I missing something? Is there a nice pattern in EF to say that you'd like several disparate values that can all be fetched in parallel?
This will return the counts using a single select statement, but there is an important caveat. You'll notice that the EF-generated sql uses cross joins, so there must be a table (not necessarily one of the ones you are counting), that is guaranteed to have rows in it, otherwise the query will return no results. This isn't an ideal solution, but I don't know that it's possible to generate the sql in your example since it doesn't have a from clause in the outer query.
The following code counts records in the Addresses and People tables in the Adventure Works database, and relies on StateProvinces to have at least 1 record:
var r = from x in StateProvinces.Top("1")
let ac = Addresses.Count()
let pc = People.Count()
select new { AddressCount = ac, PeopleCount = pc };
and this is the SQL that is produced:
SELECT
1 AS [C1],
[GroupBy1].[A1] AS [C2],
[GroupBy2].[A1] AS [C3]
FROM
(
SELECT TOP (1) [c].[StateProvinceID] AS [StateProvinceID]
FROM [Person].[StateProvince] AS [c]
) AS [Limit1]
CROSS JOIN
(
SELECT COUNT(1) AS [A1]
FROM [Person].[Address] AS [Extent2]
) AS [GroupBy1]
CROSS JOIN
(
SELECT COUNT(1) AS [A1]
FROM [Person].[Person] AS [Extent3]
) AS [GroupBy2]
and the results from the query when it's run in SSMS:
C1 C2 C3
----------- ----------- -----------
1 19614 19972
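The caveat about the anchor table is easy to reproduce outside EF. Here is a minimal sketch using SQLite from Python (the Users and Posts tables come from the question; the Anchor table and all the rows are made up). It compares the scalar-subquery form the question asked for with the cross-join shape, first with an empty anchor table and then with one row in it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Users (id INTEGER PRIMARY KEY);
CREATE TABLE Posts (id INTEGER PRIMARY KEY);
CREATE TABLE Anchor (id INTEGER PRIMARY KEY);  -- hypothetical, empty on purpose
INSERT INTO Users (id) VALUES (1), (2), (3);
INSERT INTO Posts (id) VALUES (1), (2);
""")

# Scalar subqueries: always exactly one row, no anchor table needed.
scalar = conn.execute(
    "SELECT (SELECT COUNT(*) FROM Users) AS TotalUsers, "
    "(SELECT COUNT(*) FROM Posts) AS TotalPosts"
).fetchall()

# Cross-join shape, similar in spirit to the generated SQL above.
cross_join_sql = """
SELECT u.n, p.n
FROM (SELECT id FROM Anchor LIMIT 1) AS a
CROSS JOIN (SELECT COUNT(*) AS n FROM Users) AS u
CROSS JOIN (SELECT COUNT(*) AS n FROM Posts) AS p
"""

# With an empty anchor table the cross join has nothing to multiply with.
empty_anchor = conn.execute(cross_join_sql).fetchall()

# Once the anchor table has a row, the counts come back.
conn.execute("INSERT INTO Anchor (id) VALUES (1)")
filled_anchor = conn.execute(cross_join_sql).fetchall()

print(scalar, empty_anchor, filled_anchor)
```

So the cross-join version is only safe when the anchor table is guaranteed to be non-empty, as noted above.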
You should be able to accomplish what you want with Parallel LINQ (PLINQ). You can find an introduction here.
It seems like there's no good way to do this (yet) in EF4. You can either:
1. Use the technique described by adrift, which will generate a slightly awkward query.
2. Use ExecuteStoreQuery<T>, where T is some dummy class that you create with property getters/setters matching the names of the columns from the query. The disadvantage of this approach is that you can't directly use your entity model and have to resort to SQL. In addition, you have to create these dummy entities.
3. Use a MultiQuery class that combines several queries into one. This is similar to NHibernate's futures hinted at by StanK in the comments. This is a little hack-ish and it doesn't seem to support scalar-valued queries (yet).

Resources