Force "TOP 100 PERCENT" in EF "sub-query" queryable - linq

Update: I was mistaken about TOP 100 PERCENT generating a better query plan (the plan is still much better for a reasonably sized TOP N; the underlying issue probably has to do with parameter sniffing).
While I still think this focused question has merit, it is not "a useful solution" for my problem², and might not be for yours.
I am running into some queries which SQL Server optimizes poorly. The statistics appear correct, and SQL Server chooses the 'worse' plan that performs a seek over millions of records even though the estimated and actual values are the same - but this question is not about that¹.
The problematic queries are of the simplified form:
select * from x
join y on ..
join z on ..
where z.q = ..
However (apparently because I know the cardinalities better), the following form consistently results in a much better query plan:
select * from x
join (
    -- the result set here is 'quite small'
    select top 100 percent *
    from y
    join z on ..
    where z.q = ..
) t on ..
In L2S the Take function can be used to limit to a top N, but the "problem" I have with this approach is that it requires a finite/fixed N, so some query could hypothetically just break instead of merely running really slowly with the forced materialization.
While I could choose a 'very large' value for the top N, this, ironically (with respect to the initial problem), increases the SQL query execution time as the value of N increases. The intermediate result is only expected to be a few dozen to a few hundred records. The current code runs a TOP 100 and then, if that is detected to contain too many results, runs the query again without the limit: but this feels like a kludge .. on top of a kludge.
The question is then: can an EF/L2E/LINQ query generate the equivalent of a TOP 100 PERCENT on an EF Queryable?
(Forcing materialization via ToList is not an option because the result should be an EF Queryable and remain in LINQ to Entities, not LINQ to Objects.)
While I am currently dealing with EF4, if this is [only] possible in a later version of EF I would accept that as an answer - it is useful knowledge and does answer the question asked.
¹ If wishing to answer with "don't do that" or an "alternative", please make it a secondary answer or an aside alongside an answer to the actual question being asked. Otherwise, feel free to use the comments.
² In addition to TOP 100 PERCENT not generating a better query plan, I forgot to include the 'core issue' at stake, which is bad parameter sniffing (the instance is SQL Server 2005).
The following query takes a very long time to complete, while direct variable substitution runs "in the blink of an eye", indicating a parameter sniffing issue.
declare @x int
set @x = 19348659

select
    op.*
from OrderElement oe
join OrderRatePlan rp on oe.OrdersElementID = rp.OrdersElementID
join OrderPrice op on rp.OrdersRatePlanID = op.OrdersRatePlanID
where oe.OrdersProductID = @x
The kludged-but-workable query:
select
    op.*
from OrderPrice op
join (
    -- Choosing a 'small value of N' runs fast; it slows down as the
    -- value of N increases, and N >> 1000 simply "takes too long".
    -- Using TOP 100 PERCENT also "takes too long".
    select top 100
        rp.*
    from OrderElement oe
    join OrderRatePlan rp on oe.OrdersElementID = rp.OrdersElementID
    where oe.OrdersProductID = @x
) rp
    on rp.OrdersRatePlanID = op.OrdersRatePlanID

Yes, you can do your own query.
db.SqlQuery<something>("SELECT * FROM x ...");
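For reference, a raw-SQL sketch of what such a hand-written query might look like, reusing the Order* tables from the question (note the update above: SQL Server is free to discard TOP 100 PERCENT during optimization, so this is an illustration rather than a guaranteed fix):
select
    op.*
from OrderPrice op
join (
    -- the derived table the question wants to keep 'as is'
    select top 100 percent
        rp.*
    from OrderElement oe
    join OrderRatePlan rp on oe.OrdersElementID = rp.OrdersElementID
    where oe.OrdersProductID = @x   -- @x supplied as a query parameter
) rp
    on rp.OrdersRatePlanID = op.OrdersRatePlanID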

Related

Most efficient way to select in bulk from a multi million records table

I'm interested in getting and doing some processing on all the entities A returned by a query of the form:
SELECT * FROM A a WHERE a.id not in (select b.id from B)
Where A is a "complex" entity in the sense that it inherits (InheritanceType.JOINED) from other entities and several of its attributes are other entities (@OneToOne and @ManyToOne).
The query itself takes a few minutes to yield results, hence my desire to execute it as few times as possible.
Here are the different approaches I tried to get those A elements as efficiently as possible:
Pagination using setFirstResult/setMaxResults
Does the job, but pretty slowly, as the query seems to be executed every time (around 50 elements processed/sec).
Getting IDs first, A objects next
Keeping all the IDs in memory is doable, so I execute once
SELECT a.id FROM A a WHERE a.id not in (select b.id from B)
and then SELECT a FROM A a WHERE a.id = :id, which runs relatively fast as the id column is indexed. This is currently the most efficient solution (around 100 elements processed/sec).
Using ScrollableResults
I had high hopes for this solution, but it ended up being slower than the other alternatives, leaving me at around 20 elements processed/sec...
As a neophyte, I don't know what other options to investigate, or if I did something wrong in any of my attempts.
Hence my questions:
Are there (factually) other approaches to efficiently tackle this kind of problem?
Is it normal that ScrollableResults performed so poorly? Is there something I should have paid attention to while implementing this solution?
EDIT:
Here's the execution plan

Performance tuning tips - PL/SQL/SQL - Database

We are facing a performance issue in production. The MV refresh program has been running for a long time, almost 13 to 14 hours.
The MV refresh program refreshes 5 MVs, and one of those MVs is the one running long.
Below is the MV script that runs long.
SELECT rcvt.transaction_id,
       rsh.shipment_num,
       rsh.shipped_date,
       rsh.expected_receipt_date,
       (SELECT rcvt1.transaction_date
          FROM rcv_transactions rcvt1
         WHERE rcvt1.po_line_id = rcvt.po_line_id
           AND rcvt1.transaction_type = 'RETURN TO VENDOR'
           AND rcvt1.parent_transaction_id = rcvt.transaction_id
       ) transaction_date
  FROM rcv_transactions rcvt,
       rcv_shipment_headers rsh,
       rcv_shipment_lines rsl
 WHERE 1 = 1
   AND rcvt.shipment_header_id = rsl.shipment_header_id
   AND rcvt.shipment_line_id = rsl.shipment_line_id
   AND rsl.shipment_header_id = rsh.shipment_header_id
   AND rcvt.transaction_type = 'RECEIVE';
The shipment table contains millions of records, and the above query extracts almost 60 to 70% of the data. We suspect the data volume is the reason.
We are trying to improve the performance of the above script, so we added a date filter to restrict the data:
SELECT rcvt.transaction_id,
       rsh.shipment_num,
       rsh.shipped_date,
       rsh.expected_receipt_date,
       (SELECT rcvt1.transaction_date
          FROM rcv_transactions rcvt1
         WHERE rcvt1.po_line_id = rcvt.po_line_id
           AND rcvt1.transaction_type = 'RETURN TO VENDOR'
           AND rcvt1.parent_transaction_id = rcvt.transaction_id
       ) transaction_date
  FROM rcv_transactions rcvt,
       rcv_shipment_headers rsh,
       rcv_shipment_lines rsl
 WHERE 1 = 1
   AND rcvt.shipment_header_id = rsl.shipment_header_id
   AND rcvt.shipment_line_id = rsl.shipment_line_id
   AND rsl.shipment_header_id = rsh.shipment_header_id
   AND rcvt.transaction_type = 'RECEIVE'
   AND TRUNC(rsh.creation_date) >= NVL(TRUNC((sysdate - profile_value), 'MM'), TRUNC(rsh.creation_date));
For a 1-year profile it shows some improvement, but for a 2-year range it is worse than the previous query.
Any suggestions to improve the performance? Please help.
I'd pull out that scalar subquery into a regular outer join.
Costing for scalar subqueries can be poor and you are forcing it to do a lot of single record lookups (presumably via index) rather than giving it other options.
"The main query then has a scalar subquery in the select list.
Oracle therefore shows two independent plans in the plan table. One for the driving query – which has a cost of two, and one for the scalar subquery, which has a cost of 2083 each time it executes.
But Oracle does not “know” how many times the scalar subquery will run (even though in many cases it could predict a worst-case scenario), and does not make any cost allowance whatsoever for its execution in the total cost of the query."
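A sketch of pulling that scalar subquery out into an outer join, as suggested above (this assumes the RETURN TO VENDOR lookup matches at most one row per RECEIVE transaction, which the original scalar subquery already requires; if it can match more, a DISTINCT or aggregation would be needed):
SELECT rcvt.transaction_id,
       rsh.shipment_num,
       rsh.shipped_date,
       rsh.expected_receipt_date,
       rcvt1.transaction_date
  FROM rcv_transactions rcvt
  JOIN rcv_shipment_lines rsl
    ON rcvt.shipment_header_id = rsl.shipment_header_id
   AND rcvt.shipment_line_id = rsl.shipment_line_id
  JOIN rcv_shipment_headers rsh
    ON rsl.shipment_header_id = rsh.shipment_header_id
  -- the outer join replaces the scalar subquery in the SELECT list
  LEFT JOIN rcv_transactions rcvt1
    ON rcvt1.po_line_id = rcvt.po_line_id
   AND rcvt1.parent_transaction_id = rcvt.transaction_id
   AND rcvt1.transaction_type = 'RETURN TO VENDOR'
 WHERE rcvt.transaction_type = 'RECEIVE';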

Optimizing DB2 query - why does bash "time" output stay the same?

I have a DB2 query on the TPC-H schema, which I'm trying to optimize with indexes. I can tell from db2expln output that the estimated costs are significantly lower (300%) when indexes are available.
However I'm also trying to measure execution time and it doesn't change significantly when the query is executed using indexes.
I'm working in an SSH terminal, executing my queries and writing output to a file like
(time db2 "SELECT Q.name FROM
(SELECT custkey, name FROM customer WHERE nationkey = 22) Q WHERE Q.custkey IN
(SELECT custkey FROM orders B WHERE B.orderkey IN
(SELECT orderkey FROM lineitem WHERE receiptdate BETWEEN '1992-06-11' AND '1992-07-11'))") &> output.txt
I did 10 measurements each: 1) without indexes, 2) with an index on lineitem.receiptdate, 3) with indexes on lineitem.receiptdate and customer.nationkey, then calculated the average time and standard deviation; all are within the same range.
I executed RUNSTATS ON TABLE schemaname.tablename AND DETAILED INDEXES ALL after index creation.
I read this post about the output of the time command; from what I understand, sys + user time should be relevant for my measurement. There is no change in sys + user time, nor in real time.
sys + user is around 44 ms, real is around 1s.
Any hints why I cannot see a change in time? Am I interpreting time output wrong? Are the optimizer estimations in db2expln misleading?
Disclaimer: I'm supposed to give a presentation about this at university, so it's technically homework, but as it's more of a comprehension question and not "please make my code work" I hope it's appropriate to post it here. Also, I know the query could be simplified but my question is not about this.
The optimizer estimates timerons (measurement of internal costs) and these timerons cannot be translated one to one into query execution time. So a difference of 300% in timerons does not mean you will see a 300% difference in runtime.
To measure the time of one or more SQL statements, I recommend using db2batch with the option
-i complete
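Something like the following, where tpchdb and query.sql are placeholder names for the database and the file holding the statement to be timed:
-- query.sql, run for example with: db2batch -d tpchdb -f query.sql -i complete
SELECT Q.name FROM
  (SELECT custkey, name FROM customer WHERE nationkey = 22) Q WHERE Q.custkey IN
  (SELECT custkey FROM orders B WHERE B.orderkey IN
    (SELECT orderkey FROM lineitem WHERE receiptdate BETWEEN '1992-06-11' AND '1992-07-11'));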
SELECT f1.name
FROM customer f1
WHERE f1.nationkey = 22
  AND EXISTS (
    SELECT *
    FROM orders f2
    INNER JOIN lineitem f3 ON f2.orderkey = f3.orderkey
    WHERE f1.custkey = f2.custkey
      AND f3.receiptdate BETWEEN '1992-06-11' AND '1992-07-11'
  )

performance of rand()

I have heard that I should avoid using 'order by rand()', but I really need to use it. Contrary to what I have been hearing, the following query comes back very fast.
select
cp1.img_id as left_id,
cp1.img_filename as left_filename,
cp1.facebook_name as left_facebook_name,
cp2.img_id as right_id,
cp2.img_filename as right_filename,
cp2.facebook_name as right_facebook_name
from
challenge_photos as cp1
cross join
challenge_photos as cp2
where
(cp1.img_id < cp2.img_id)
and
(cp1.img_id,cp2.img_id) not in ((0,0))
and
(cp1.img_status = 1 and cp2.img_status = 1)
order by rand() limit 1
Is this query considered 'okay', or should I use one of the alternatives that turn up when searching for "alternative to rand()"?
It's usually a performance thing. You should avoid, as much as possible, per-row functions since they slow down your queries.
That means things like uppercase(name), salary * 1.1 and so on. It also includes rand(). It may not be an immediate problem (at 10,000 rows) but, if you ever want your database to scale, you should keep it in mind.
The two main issues are the fact that you're performing a per-row function and then having to do a full sort on the output before selecting the first row. The DBMS cannot use an index if you sort on a random value.
But, if you need to do it (and I'm not making judgement calls there), then you need to do it. Pragmatism often overcomes dogmatism in the real world :-)
A possibility, if performance ever becomes an issue, is to get a count of the records with something like:
select count(*) from ...
then choose a random value on the client side and use a:
limit <start>, <count>
clause in another select, adjusting for the syntax used by your particular DBMS. This should remove the sorting issue and the transmission of unneeded data across the wire.
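A sketch of that two-step approach against the query above (12345 stands in for a random offset the client picks between 0 and count - 1; MySQL's LIMIT offset, count syntax is assumed):
-- 1) count the candidate pairs once
select count(*)
from challenge_photos as cp1
cross join challenge_photos as cp2
where (cp1.img_id < cp2.img_id)
  and (cp1.img_id, cp2.img_id) not in ((0,0))
  and (cp1.img_status = 1 and cp2.img_status = 1);

-- 2) pick a random offset client-side and fetch the single row at that offset
select
    cp1.img_id as left_id,
    cp1.img_filename as left_filename,
    cp1.facebook_name as left_facebook_name,
    cp2.img_id as right_id,
    cp2.img_filename as right_filename,
    cp2.facebook_name as right_facebook_name
from challenge_photos as cp1
cross join challenge_photos as cp2
where (cp1.img_id < cp2.img_id)
  and (cp1.img_id, cp2.img_id) not in ((0,0))
  and (cp1.img_status = 1 and cp2.img_status = 1)
limit 12345, 1;   -- random offset chosen by the client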

How to inline a variable in PL/SQL?

The Situation
I have some trouble with my query execution plan for a medium-sized query over a large amount of data in Oracle 11.2.0.2.0. In order to speed things up, I introduced a range filter that does roughly something like this:
PROCEDURE DO_STUFF(
org_from VARCHAR2 := NULL,
org_to VARCHAR2 := NULL)
-- [...]
JOIN organisations org
ON (cust.org_id = org.id
AND ((org_from IS NULL) OR (org_from <= org.no))
AND ((org_to IS NULL) OR (org_to >= org.no)))
-- [...]
As you can see, I want to restrict the JOIN of organisations using an optional range of organisation numbers. Client code can call DO_STUFF with (supposed to be fast) or without (very slow) the restriction.
The Trouble
The trouble is, PL/SQL will create bind variables for the above org_from and org_to parameters, which is what I would expect in most cases:
-- [...]
JOIN organisations org
ON (cust.org_id = org.id
AND ((:B1 IS NULL) OR (:B1 <= org.no))
AND ((:B2 IS NULL) OR (:B2 >= org.no)))
-- [...]
The Workaround
Only in this case, I measured the query execution plan to be a lot better when I just inline the values, i.e. when the query executed by Oracle is actually something like
-- [...]
JOIN organisations org
ON (cust.org_id = org.id
AND ((10 IS NULL) OR (10 <= org.no))
AND ((20 IS NULL) OR (20 >= org.no)))
-- [...]
By "a lot", I mean 5-10x faster. Note that the query is executed very rarely, i.e. once a month. So I don't need to cache the execution plan.
My questions
How can I inline values in PL/SQL? I know about EXECUTE IMMEDIATE, but I would prefer to have PL/SQL compile my query, and not do string concatenation.
Did I just measure something that happened by coincidence, or can I assume that inlining variables is indeed better (in this case)? The reason I ask is that I think bind variables force Oracle to devise a general execution plan, whereas inlined values would allow for analysing very specific column and index statistics. So I can imagine that this is not just a coincidence.
Am I missing something? Maybe there is an entirely other way to achieve query execution plan improvement, other than variable inlining (note I have tried quite a few hints as well but I'm not an expert on that field)?
In one of your comments you said:
"Also I checked various bind values.
With bind variables I get some FULL
TABLE SCANS, whereas with hard-coded
values, the plan looks a lot better."
There are two paths. If you pass in NULL for the parameters then you are selecting all records. Under those circumstances a Full Table Scan is the most efficient way of retrieving data. If you pass in values then indexed reads may be more efficient, because you're only selecting a small subset of the information.
When you formulate the query using bind variables the optimizer has to take a decision: should it presume that most of the time you'll pass in values or that you'll pass in nulls? Difficult. So look at it another way: is it more inefficient to do a full table scan when you only need to select a sub-set of records, or to do indexed reads when you need to select all records?
It seems as though the optimizer has plumped for full table scans as being the least inefficient operation to cover all eventualities.
Whereas when you hard-code the values, the Optimizer knows immediately that 10 IS NULL evaluates to FALSE, and so it can weigh the merits of using indexed reads to find the desired sub-set of records.
So, what to do? As you say this query is only run once a month I think it would only require a small change to business processes to have separate queries: one for all organisations and one for a sub-set of organisations.
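A sketch of that "two separate queries" idea (result_cur is a hypothetical SYS_REFCURSOR out-parameter and customers stands in for the table elided with [...] in the question; the mixed case of only one bound supplied would need its own branch):
IF org_from IS NULL AND org_to IS NULL THEN
    OPEN result_cur FOR            -- all organisations: a full scan is fine here
        SELECT cust.*
          FROM customers cust
          JOIN organisations org ON cust.org_id = org.id;
ELSE
    OPEN result_cur FOR            -- restricted range: indexed access on org.no
        SELECT cust.*
          FROM customers cust
          JOIN organisations org
            ON cust.org_id = org.id
           AND org.no BETWEEN org_from AND org_to;
END IF;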
"Btw, removing the :R1 IS NULL clause
doesn't change the execution plan
much, which leaves me with the other
side of the OR condition, :R1 <=
org.no where NULL wouldn't make sense
anyway, as org.no is NOT NULL"
Okay, so the thing is you have a pair of bind variables which specify a range. Depending on the distribution of values, different ranges might suit different execution plans. That is, this range would (probably) suit an indexed range scan...
WHERE org.no BETWEEN 10 AND 11
...whereas this is likely to be more fitted to a full table scan...
WHERE org.no BETWEEN 10 AND 1199999
That is where Bind Variable Peeking comes into play.
(depending on distribution of values, of course).
Since the query plans are actually consistently different, that implies that the optimizer's cardinality estimates are off for some reason. Can you confirm from the query plans that the optimizer expects the conditions to be insufficiently selective when bind variables are used? Since you're using 11.2, Oracle should be using adaptive cursor sharing, so it shouldn't be a bind variable peeking issue (assuming you are calling the version with bind variables many times with different NO values in your testing).
Are the cardinality estimates on the good plan actually correct? I know you said that the statistics on the NO column are accurate but I would be suspicious of a stray histogram that may not be updated by your regular statistics gathering process, for example.
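One way to compare the estimated and actual row counts for the cursor that was actually executed is a sketch like the following (the sql_id value is a placeholder; ALLSTATS LAST needs the gather_plan_statistics hint or statistics_level = ALL, and +PEEKED_BINDS shows which bind values the plan was optimized for):
SELECT *
  FROM TABLE(dbms_xplan.display_cursor(
                 sql_id => 'abcd1234efgh5',
                 format => 'ALLSTATS LAST +PEEKED_BINDS'));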
You could always use a hint in the query to force a particular index to be used (though using a stored outline or optimizer plan stability would be preferable from a long-term maintenance perspective). Any of those options would be preferable to resorting to dynamic SQL.
One additional test to try, however, would be to replace the SQL 99 join syntax with Oracle's old syntax, i.e.
SELECT <<something>>
FROM <<some other table>> cust,
organization org
WHERE cust.org_id = org.id
AND ( ((org_from IS NULL) OR (org_from <= org.no))
AND ((org_to IS NULL) OR (org_to >= org.no)))
That obviously shouldn't change anything, but there have been parser issues with the SQL 99 syntax so that's something to check.
It smells like Bind Peeking, but I am only on Oracle 10, so I can't claim the same issue exists in 11.
This looks a lot like a need for Adaptive Cursor Sharing, combined with SQLPlan stability.
I think what is happening is that the optimizer_capture_sql_plan_baselines parameter is true, and the same for optimizer_use_sql_plan_baselines. If that is the case, the following happens:
The first time a query is parsed and started, it gets a new plan.
The second time, this plan is stored in the SQL plan baselines as an accepted plan.
All following runs of this query use this plan, regardless of what the bind variables are.
If Adaptive Cursor Sharing is already active, the optimizer will generate a new/better plan and store it in the SQL plan baselines, but it is not able to use it until someone accepts this newer plan as an acceptable alternative plan. Check dba_sql_plan_baselines and see if your query has entries with accepted = 'NO' and verified = null.
You can use dbms_spm.evolve_sql_plan_baseline to evolve the new plan and have it automatically accepted if the performance of the plan is at least 1.5 times better than without the new plan.
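A sketch of those checks (the LIKE pattern and the sql_handle value are placeholders to be replaced with values for the actual statement):
SELECT sql_handle, plan_name, enabled, accepted
  FROM dba_sql_plan_baselines
 WHERE sql_text LIKE '%organisations%';

DECLARE
    evolve_report CLOB;
BEGIN
    -- verify the unaccepted plan and accept it if it performs well enough
    evolve_report := dbms_spm.evolve_sql_plan_baseline(
                         sql_handle => 'SQL_123456789abcdef0',
                         verify     => 'YES',
                         commit     => 'YES');
    dbms_output.put_line(evolve_report);
END;
/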
I hope this helps.
I added this as a comment, but will offer it up here as well. Hope this isn't overly simplistic, and looking at the detailed responses I may be misunderstanding the exact problem, but anyway...
It seems your organisations table has a column no (org.no) that is defined as a number. In your hard-coded example, you use numbers to do the compares:
JOIN organisations org
ON (cust.org_id = org.id
AND ((10 IS NULL) OR (10 <= org.no))
AND ((20 IS NULL) OR (20 >= org.no)))
In your procedure, you are passing in varchar2:
PROCEDURE DO_STUFF(
org_from VARCHAR2 := NULL,
org_to VARCHAR2 := NULL)
So to compare VARCHAR2 to NUMBER, Oracle has to do implicit conversions, and this may cause the full scans.
Solution: change the procedure to pass in numbers.
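A sketch of that change (assuming org.no is indeed a NUMBER column, the parameters are simply retyped so no implicit conversion is needed):
PROCEDURE DO_STUFF(
    org_from NUMBER := NULL,
    org_to   NUMBER := NULL)
-- [...]
    JOIN organisations org
      ON (cust.org_id = org.id
          AND ((org_from IS NULL) OR (org_from <= org.no))
          AND ((org_to IS NULL) OR (org_to >= org.no)))
-- [...]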
