Oracle SQL sub query vs inner join - oracle

At first, I seen the select statement on Oracle Docs.
I have some question about oracle select behaviour, when my query contain select,join,where.
see this below for information:
My sample table:
[ P_IMAGE_ID ]
IMAGE_ID (PK)
FILE_NAME
FILE_TYPE
...
...
[ P_IMG_TAG ]
IMG_TAG_ID (PK)
IMAGE_ID (FK)
TAG
...
...
My requirement are: get distinct of image when it's tag is "70702".
Method 1: Select -> Join -> Where -> Distinct
SELECT DISTINCT PID.IMAGE_ID
, PID.FILE_NAME
FROM P_IMAGE_ID PID
INNER JOIN P_IMG_TAG PTAG
ON PTAG.IMAGE_ID = PID.IMAGE_ID
WHERE PTAG.TAG = '70702';
I think the query behaviour should be like:
join table -> hint where cause -> distinct select
I use Oracle SQL developer to get the explain plan:
Method 1 cost 76.
Method 2: Select -> Where -> Where -> Distinct
SELECT DISTINCT PID.IMAGE_ID
, PID.FILE_NAME
FROM P_IMAGE_ID PID
WHERE PID.IMAGE_ID IN
(
SELECT PTAG.IMAGE_ID
FROM P_IMG_TAG PTAG
WHERE PTAG.TAG = '70702'
);
I think the second query behaviour should be like:
hint where cause -> hint where cause -> distinct select
I use Oracle SQL developer to get the explain plan too:
Method 2 cost 76 too. Why?
I believe when I try where cause first for reduce the database process and avoid join table that query performance should be better than the table join query, but now when I test it, I am confused, why 2 method cost are equal ?
Or am I misunderstood something ?
List of my question here:
Why 2 method above cost are equal ?
If the result of sub select Tag = '70702' more than thousand or million or more, use join table should be better alright ?
If the result of sub select Tag = '70702' are least, use sub select for reduce data query process is better alright ?
When I use method 1 Select -> Join -> Where -> Distinct mean the database process table joining before hint where cause alright ?
Someone told me when i move hint cause Tag = '70702' into join cause
(ie. INNER JOIN P_IMG_TAG PTAG ON PAT.IMAGE_ID = PID.IMAGE_ID AND PTAG.TAG = '70702' ) it's performance may be better that's alright ?
I read topic subselect vs outer join and subquery or inner join but both are for SQL Server, I don't sure that may be like Oracle database.

The DBMS takes your query and executes something. But it doesn't execute steps that correspond to SQL statement parts in the order they appear in an SQL statement.
Read about "relational query optimization", which could just as well be called "relational query implementation". Eg for Oracle.
Any language processor takes declarations and calls as input and implements the described behaviour in terms of internal data structures and operations, maybe through one or more levels of "intermediate code" running on a "virtual machine", eventually down to physical machines. But even just staying in the input language, SQL queries can be rearranged into other SQL queries that return the same value but perform significantly better under simple and general implementation assumptions. Just as you know that your question's queries always return the same thing for a given database, the DBMS can know. Part of how it knows is that there are many rules for taking a relational algebra expression and generating a different but same-valued expression. Certain rewrite rules apply under certain limited circumstances. There are rules that take into consideration SQL-level relational things like primary keys, unique columns, foreign keys and other constraints. Other rules use implementation-oriented SQL-level things like indexes and statistics. This is the "relational query rewriting" part of relational query optimization.
Even when two different but equivalent queries generate different plans, the cost can be similar because the plans are so similar. Here, both a HASH and SORT index are UNIQUE. (It would be interesting to know what the few top plans were for each of your queries. It is quite likely that those few are the same for both, but that the plan that is more directly derived from the particular input expression is the one that is offered when there's little difference.)
The way to get the DBMS to find good query plans is to write the most natural expression of a query that you can find.

Related

Replacing NOT IN with NOT EXISTS and an OUTER JOIN in Oracle Database 12c

I understand that the performance of our queries is improved when we use EXISTS and NOT EXISTS in the place of IN and NOT IN, however, is performance improved further when we replace NOT IN with an OUTER JOIN as opposed to NOT EXISTS?
For example, the following query selects all models from a PRODUCT table that are not in another table called PC. For the record, no model values in the PRODUCT or PC tables are null:
select model
from product
where not exists(
select *
from pc
where product.model = pc.model);
The following OUTER JOIN will display the same results:
select product.model
from product left join pc
on pc.model = product.model
where pc.model is null;
Seeing as these both return the same values, which option should we use to better improve the performance of our queries?
The query plan will tell you. It will depend on the data and tables. In the case of OUTER JOIN and NOT EXISTS they are the same.
However to your opening sentence, NOT IN and NOT EXISTS are not the same if NULL is accepted on model. In this case you say model cannot be null so you might find they all have the same plan anyway. However when making this assumption, the database must be told there cannot be nulls (using NOT NULL) as opposed to there simply not being any. If you don't it will make different plans for each query which may result in different performance depending on your actual data. This is generally true and particularly true for ORACLE which does not index NULLs.
Check out EXPLAIN PLAN

Oracle join using Hint USE_NL USE_HASH

What is the best way to Force execution plan to do only nested loop joins for all tables using Hint USE_NL in once case,
And in other case to do only Hash Join using USE_HASH hint for all tables
I want to run both query and see which has low cost in execution plan and use, please suggest
My doubt is in which sequence i should put for all 4 tables inside HINT like below
USE_NL(bl1_gain_adj,customers,bl1_gain,bl1_reply_code)
SELECT bl1_gain_adj.adj_seq_no,
bl1_gain_adj.amount_currency ,
bl1_gain_adj.gain_seq_no,
customers.loan_key,
customers.customer_key,
FROM
bl1_gain_adj,
customers,
bl1_gain,
bl1_reply_code
WHERE
bl1_gain.loan_key = customers.loan_key
AND bl1_gain.customer_key = customers.customer_key
AND bl1_gain.receiver_customer = customers.customer_no
AND bl1_gain.cycle_seq_no = customers.cycle_seq_no
AND bl1_reply_code.gain_code = bl1_gain.gain_code
AND bl1_reply_code.revenue_code = 'RC'
AND bl1_gain_adj.gain_seq_no = bl1_gain.gain_seq_no
AND bl1_gain_adj.customer_key = bl1_gain.customer_key;
Records in tables
---------------
bl1_gain_adj = 100 records
customers = 10 Million records
bl1_gain = 1 Million records
bl1_reply_code = 100 million records
Keeping aside the choice of the most appropriate hint for your query (if any), the order you write the table names/aliases in the USE_NL hint does not matter.
According to Oracle documentation:
Note that USE_NL(table1 table2) is not considered a multi-table hint
because it is a shortcut for USE_NL(table1) and USE_NL(table2)
About USE_NL, Oracle says:
The USE_NL hint instructs the optimizer to join each specified table
to another row source with a nested loops join, using the specified
table as the inner table.
That is, if you write USE_NL(table1 table2 table3 table4) this means "use all these tables as inner tables in a nested loop join"; if your query only has these 4 tables, the hint will be ignored for at least one table: to use a table as inner, we need another table to use as outer, so it's impossible to use all the tables as inner.
LEADING does something different, regarding the order in which tables are scanned:
The LEADING hint instructs the optimizer to use the specified set of
tables as the prefix in the execution plan.

Performing simultaneous unrelated queries in EF4 without a stored procedure?

I have a page that pulls together aggregate data from two different tables. I would like to perform these queries in parallel to reduce the latency without having to introduce a stored procedure that would do both.
For example, I currently have this:
ViewBag.TotalUsers = DB.Users.Count();
ViewBag.TotalPosts = DB.Posts.Count();
// Page displays both values but has two trips to the DB server
I'd like something akin to:
var info = DB.Select(db => new {
TotalUsers = db.Users.Count(),
TotalPosts = db.Posts.Count());
// Page displays both values using one trip to DB server.
that would generate a query like this
SELECT (SELECT COUNT(*) FROM Users) AS TotalUsers,
(SELECT COUNT(*) FROM Posts) AS TotalPosts
Thus, I'm looking for a single query to hit the DB server. I'm not asking how to parallelize two separate queries using Tasks or Threads
Obviously I could create a stored procedure that got back both values in a single trip, but I'd like to avoid that if possible as it's easier to add additional stats purely in code rather than having to keep refreshing the DB import.
Am I missing something? Is there a nice pattern in EF to say that you'd like several disparate values that can all be fetched in parallel?
This will return the counts using a single select statement, but there is an important caveat. You'll notice that the EF-generated sql uses cross joins, so there must be a table (not necessarily one of the ones you are counting), that is guaranteed to have rows in it, otherwise the query will return no results. This isn't an ideal solution, but I don't know that it's possible to generate the sql in your example since it doesn't have a from clause in the outer query.
The following code counts records in the Addresses and People tables in the Adventure Works database, and relies on StateProvinces to have at least 1 record:
var r = from x in StateProvinces.Top("1")
let ac = Addresses.Count()
let pc = People.Count()
select new { AddressCount = ac, PeopleCount = pc };
and this is the SQL that is produced:
SELECT
1 AS [C1],
[GroupBy1].[A1] AS [C2],
[GroupBy2].[A1] AS [C3]
FROM
(
SELECT TOP (1) [c].[StateProvinceID] AS [StateProvinceID]
FROM [Person].[StateProvince] AS [c]
) AS [Limit1]
CROSS JOIN
(
SELECT COUNT(1) AS [A1]
FROM [Person].[Address] AS [Extent2]
) AS [GroupBy1]
CROSS JOIN
(
SELECT COUNT(1) AS [A1]
FROM [Person].[Person] AS [Extent3]
) AS [GroupBy2]
and the results from the query when it's run in SSMS:
C1 C2 C3
----------- ----------- -----------
1 19614 19972
You should be able to accomplish what you want with Parallel LINQ (PLINQ). You can find an introduction here.
It seems like there's no good way to do this (yet) in EF4. You can either:
Use the technique described by adrift which will generate a slightly awkward query.
Use the ExecuteStoreQuery where T is some dummy class that you create with property getters/setters matching the name of the columns from the query. The disadvantage of this approach is that you can't directly use your entity model and have to resort to SQL. In addition, you have to create these dummy entities.
Use the a MultiQuery class that combines several queries into one. This is similar to NHibernate's futures hinted at by StanK in the comments. This is a little hack-ish and it doesn't seem to support scalar valued queries (yet).

Why the Select * FROM Table where ID NOT IN ( list of int ids) query is slow in sql server ce?

well this problem is general in sql server ce
i have indexes on all the the fields.
also the same query but with ID IN ( list of int ids) is pretty fast.
i tried to change the query to OUTER Join but this just make it worse.
so any hints on why this happen and how to fix this problem?
That's because the index is not really helpful for that kind of query, so the database has to do a full table scan. If the query is (for some reason) slower than a simple "SELECT * FROM TABLE", do that instead and filter the unwanted IDs in the program.
EDIT: by your comment, I recognize you use a subquery instead of a list. Because of that, there are three possible ways to do the same (hopefully one of them is faster):
Original statement:
select * from mytable where id not in (select id from othertable);
Alternative 1:
select * from mytable where not exists
(select 1 from othertable where mytable.id=othertable.id);
Alternative 2:
select * from mytable
minus
select mytable.* from mytable in join othertable on mytable.id=othertable.id;
Alternative 3: (ugly and hard to understand, but if everything else fails...)
select * from mytable
left outer join othertable on (mytable.id=othertable.id)
where othertable.id is null;
This is not a problem in SQL Server CE, but overall database.
The OPERATION IN is sargable and NOT IN is nonsargable.
What this mean ?
Search ARGument Able, thies mean that DBMS engine can take advantage of using index, for Non Search ARGument Ablee the index can't be used.
The solution might be using filter statement to remove those IDs
More in SQL Performance Tuning by Peter Gulutzan.
ammoQ is right, index does not help much with your query. Depending on distribution of values in your ID column you could optimise the query by specifying which IDs to select rather than not to select. If you end up requesting say more than ~25% of the table index will not be used anyway though because for nonclustered indexed (which is the only type of indexes which SQL CE supports if memory serves) it would be cheaper to scan the table. Otherwise (if the query is actually selective) you could re-write query with ID ranges to select ('union all' may work better than 'or' to combine ranges if SQL CE supports 'union all', not sure)

Table Join Efficiency Question

When joining across tables (as in the examples below), is there an efficiency difference between joining on the tables or joining subqueries containing only the needed columns?
In other words, is there a difference in efficiency between these two tables?
SELECT result
FROM result_tbl
JOIN test_tbl USING (test_id)
JOIN sample_tbl USING (sample_id)
JOIN (SELECT request_id
FROM request_tbl
WHERE request_status='A') USING(request_id)
vs
SELECT result
FROM (SELECT result, test_id FROM result_tbl)
JOIN (SELECT test_id, sample_id FROM test_tbl) USING(test_id)
JOIN (SELECT sample_id FROM sample_tbl) USING(sample_id)
JOIN (SELECT request_id
FROM request_tbl
WHERE request_status='A') USING(request_id)
The only way to find out for sure is to run both with tracing turned on and then look at the trace file. But in all probability they will be treated the same: the optimizer will merge all the inline views into the main statement and come up with the same query plan.
It doesn't matter. It may actually be WORSE since you are taking control away from the optimizer which generally knows best.
However, remember if you are doing a JOIN and only including a column from one of the tables that it is QUITE OFTEN better to re-write it as a series of EXISTS statements -- because that's what you really mean. JOINs (with some exceptions) will join matching rows which is a lot more work for the optimizer to do.
e.g.
SELECT t1.id1
FROM table1 t1
INNER JOIN table2 ON something = something
should almost always be
SELECT id1
FROM table1 t1
WHERE EXISTS( SELECT *
FROM table2
WHERE something = something )
For simple queries the optimizer may reduce the query plans into identical ones. Check it out on your DBMS.
Also this is a code smell and probably should be changed:
JOIN (SELECT request_id
FROM request_tbl
WHERE request_status='A')
to
SELECT result
FROM request
WHERE EXISTS(...)
AND request_status = 'A'
No difference.
You can tell by running EXPLAIN PLAN on both those statements - Oracle knows that all you want is the "result" column, so it only does the minimum necessary to get the data it needs - you should find that the plans will be identical.
The Oracle optimiser does, sometimes, "materialize" a subquery (i.e. run the subquery and keep the results in memory for later reuse), but this is rare and only occurs when the optimiser believes this will result in a performance improvement; in any case, Oracle will do this "materialization" whether you specified the columns in the subqueries or not.
Obviously if the only place the "results" column is stored is in the blocks (along with the rest of the data), Oracle has to visit those blocks - but it will only keep the relevant info (the "result" column and other relevant columns, e.g. "test_id") in memory when processing the query.

Resources