Performance tuning tips - PL/SQL / SQL - Oracle Database

We are facing a performance issue in production. The MV refresh program is running for a long time, almost 13 to 14 hours.
The MV refresh program refreshes 5 MVs, and one of those MVs accounts for most of the run time.
Below is the query behind the MV that is running long.
SELECT rcvt.transaction_id,
       rsh.shipment_num,
       rsh.shipped_date,
       rsh.expected_receipt_date,
       (SELECT rcvt1.transaction_date
          FROM rcv_transactions rcvt1
         WHERE rcvt1.po_line_id = rcvt.po_line_id
           AND rcvt1.transaction_type = 'RETURN TO VENDOR'
           AND rcvt1.parent_transaction_id = rcvt.transaction_id
       ) transaction_date
  FROM rcv_transactions rcvt,
       rcv_shipment_headers rsh,
       rcv_shipment_lines rsl
 WHERE 1 = 1
   AND rcvt.shipment_header_id = rsl.shipment_header_id
   AND rcvt.shipment_line_id = rsl.shipment_line_id
   AND rsl.shipment_header_id = rsh.shipment_header_id
   AND rcvt.transaction_type = 'RECEIVE';
The shipment tables contain millions of records, and the query above extracts almost 60 to 70% of that data; we suspect the data volume is the reason.
To improve the performance of the script, we added a date filter to restrict the data:
SELECT rcvt.transaction_id,
       rsh.shipment_num,
       rsh.shipped_date,
       rsh.expected_receipt_date,
       (SELECT rcvt1.transaction_date
          FROM rcv_transactions rcvt1
         WHERE rcvt1.po_line_id = rcvt.po_line_id
           AND rcvt1.transaction_type = 'RETURN TO VENDOR'
           AND rcvt1.parent_transaction_id = rcvt.transaction_id
       ) transaction_date
  FROM rcv_transactions rcvt,
       rcv_shipment_headers rsh,
       rcv_shipment_lines rsl
 WHERE 1 = 1
   AND rcvt.shipment_header_id = rsl.shipment_header_id
   AND rcvt.shipment_line_id = rsl.shipment_line_id
   AND rsl.shipment_header_id = rsh.shipment_header_id
   AND rcvt.transaction_type = 'RECEIVE'
   AND TRUNC(rsh.creation_date) >= NVL(TRUNC((SYSDATE - profile_value), 'MM'), TRUNC(rsh.creation_date));
With a 1-year profile it shows some improvement, but with a 2-year range it is worse than the original query.
Any suggestions to improve the performance?
Please help.

I'd pull that scalar subquery out into a regular outer join (a sketch follows the quote below).
Costing for scalar subqueries can be poor and you are forcing it to do a lot of single record lookups (presumably via index) rather than giving it other options.
"The main query then has a scalar subquery in the select list.
Oracle therefore shows two independent plans in the plan table. One for the driving query – which has a cost of two, and one for the scalar subquery, which has a cost of 2083 each time it executes.
But Oracle does not “know” how many times the scalar subquery will run (even though in many cases it could predict a worst-case scenario), and does not make any cost allowance whatsoever for its execution in the total cost of the query."
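A sketch of that rewrite, keeping the original table aliases. Note the semantics only match if there is at most one RETURN TO VENDOR row per receive transaction, which is what the scalar subquery already assumes (it would raise ORA-01427 otherwise):

SELECT rcvt.transaction_id,
       rsh.shipment_num,
       rsh.shipped_date,
       rsh.expected_receipt_date,
       rtv.transaction_date
  FROM rcv_transactions rcvt
  JOIN rcv_shipment_lines rsl
    ON rcvt.shipment_header_id = rsl.shipment_header_id
   AND rcvt.shipment_line_id = rsl.shipment_line_id
  JOIN rcv_shipment_headers rsh
    ON rsl.shipment_header_id = rsh.shipment_header_id
  LEFT JOIN rcv_transactions rtv
    ON rtv.po_line_id = rcvt.po_line_id
   AND rtv.parent_transaction_id = rcvt.transaction_id
   AND rtv.transaction_type = 'RETURN TO VENDOR'
 WHERE rcvt.transaction_type = 'RECEIVE';

Written this way, the optimizer can consider a hash outer join over the whole set instead of millions of single-row index lookups.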

Related

Optimizing DB2 query - why does bash "time" output stay the same?

I have a DB2 query on the TPC-H schema, which I'm trying to optimize with indexes. I can tell from the db2expln output that the estimated costs are significantly lower (roughly 3x) when indexes are available.
However, I'm also trying to measure execution time, and it doesn't change significantly when the query is executed using the indexes.
I'm working in an SSH terminal, executing my queries and writing the output to a file like this:
(time db2 "SELECT Q.name FROM
(SELECT custkey, name FROM customer WHERE nationkey = 22) Q WHERE Q.custkey IN
(SELECT custkey FROM orders B WHERE B.orderkey IN
(SELECT orderkey FROM lineitem WHERE receiptdate BETWEEN '1992-06-11' AND '1992-07-11'))") &> output.txt
I did 10 measurements each: 1) without indexes, 2) with an index on lineitem.receiptdate, 3) with indexes on lineitem.receiptdate and customer.nationkey,
then calculated the average time and standard deviation; all are within the same range.
I executed RUNSTATS ON TABLE schemaname.tablename AND DETAILED INDEXES ALL after index creation.
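For reference, the indexes for cases 2) and 3) were created along these lines (the index names here are just placeholders):

CREATE INDEX li_receiptdate_ix ON lineitem (receiptdate);
CREATE INDEX cust_nationkey_ix ON customer (nationkey);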
I read this post about the output of the time command; from what I understand, sys + user time should be the relevant measurement for me. There is no change in sys + user time, nor in real time.
sys + user is around 44 ms, real is around 1 s.
Any hints as to why I cannot see a change in time? Am I interpreting the time output wrong? Are the optimizer estimates in db2expln misleading?
Disclaimer: I'm supposed to give a presentation about this at university, so it's technically homework, but as it's more of a comprehension question and not "please make my code work" I hope it's appropriate to post it here. Also, I know the query could be simplified but my question is not about this.
The optimizer estimates timerons (a measure of internal cost), and timerons cannot be translated one-to-one into query execution time. So a 3x difference in timerons does not mean you will see a 3x difference in runtime.
To measure the time of one or more SQL statements, I recommend using db2batch with the option
-i complete
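A minimal invocation might look like this (the database name and statement file are placeholders):

db2batch -d TPCH -f query.sql -i complete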
As a side note, the query itself can also be written as:
SELECT f1.name
FROM customer f1
WHERE f1.nationkey = 22
  AND EXISTS (
        SELECT *
        FROM orders f2
        INNER JOIN lineitem f3 ON f2.orderkey = f3.orderkey
        WHERE f1.custkey = f2.custkey
          AND f3.receiptdate BETWEEN '1992-06-11' AND '1992-07-11'
      )

Force "TOP 100 PERCENT" in EF "sub-query" queryable

Update: I was mistaken about TOP 100 PERCENT generating a better query plan (the plan is still much better for a reasonably sized TOP N, and probably has to do with parameter sniffing).
While I still think this focused question has merit, it is not "a useful solution" for my problem [2], and might not be for yours.
I am running into some queries which SQL Server optimizes poorly. The statistics appear correct, and SQL Server chooses the 'worse' plan, performing a seek over millions of records even though the estimated and actual values are the same - but this question is not about that [1].
The problematic queries are of the simplified form:
select * from x
join y on ..
join z on ..
where z.q = ..
However (and since I know the cardinalities better, apparently) the following form consistently results in a much better query plan:
select * from x
join (
-- the result set here is 'quite small'
select top 100 percent *
from y
join z on ..
where z.q = ..) t on ..
In L2S the Take function can be used to limit to TOP N, but the "problem" I have with that approach is that it requires a finite/fixed N, such that some query could hypothetically just break instead of merely running really slowly with the forced materialization.
While I could choose a 'very large' value for the TOP N, this, ironically (with respect to the initial problem), increases the SQL query execution time as the value of N increases. The intermediate result is only expected to be a few dozen to a few hundred records. The current code I have runs a TOP 100 and then, if that was detected to contain too many results, runs the query again without the limit: but this feels like a kludge .. on top of a kludge.
The question is then: can an EF/L2E/LINQ query generate the equivalent of a TOP 100 PERCENT on an EF Queryable?
(Forcing materialization via ToList is not an option because the result should be an EF Queryable and remain in LINQ to Entities, not LINQ to Objects.)
While I am currently dealing with EF4, if this is [only] possible in a later version of EF I would accept such as an answer - such is useful knowledge and does answer the question asked.
[1] If wishing to answer with "don't do that" or an "alternative", please make it a secondary answer or aside, alongside an answer to the actual question being asked. Otherwise, feel free to use the comments.
[2] In addition to TOP 100 PERCENT not generating a better query plan, I forgot to include the 'core issue' at stake, which is bad parameter sniffing (the instance is SQL Server 2005).
The following query takes a very long time to complete, while direct variable substitution runs "in the blink of an eye", indicating an issue with parameter sniffing.
declare @x int
set @x = 19348659

select
    op.*
from OrderElement oe
join OrderRatePlan rp on oe.OrdersElementID = rp.OrdersElementID
join OrderPrice op on rp.OrdersRatePlanID = op.OrdersRatePlanID
where oe.OrdersProductID = @x
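For comparison, the "direct variable substitution" version that runs quickly is presumably just the same statement with the literal inlined:

select
    op.*
from OrderElement oe
join OrderRatePlan rp on oe.OrdersElementID = rp.OrdersElementID
join OrderPrice op on rp.OrdersRatePlanID = op.OrdersRatePlanID
where oe.OrdersProductID = 19348659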
The kludged-but-workable query:
select
    op.*
from OrderPrice op
join (
    -- Choosing a 'small value of N' runs fast, and it slows down as the
    -- value of N increases; N >> 1000 simply "takes too long".
    -- Using TOP 100 PERCENT also "takes too long".
    select top 100
        rp.*
    from OrderElement oe
    join OrderRatePlan rp on oe.OrdersElementID = rp.OrdersElementID
    where oe.OrdersProductID = @x
) rp
    on rp.OrdersRatePlanID = op.OrdersRatePlanID
Yes, you can do your own query.
db.SqlQuery<something>("SELECT * FROM x ...");

Improve SQLite anti-join performance

Check out the update at the bottom of this question: the cause of the unexpected variance in query times noted below has been identified as a sqliteman quirk.
I have the following two tables in a SQLite DB (the structure might seem pointless, I know, but bear with me):
+-----------------------+
| source |
+-----------------------+
| item_id | time | data |
+-----------------------+
+----------------+
| target |
+----------------+
| item_id | time |
+----------------+
-- Both tables have a multi-column index on (item_id, time)
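(Created with something along these lines; the names match the si and ti covering indexes that appear in the query plans further down.)

CREATE INDEX si ON source (item_id, time);
CREATE INDEX ti ON target (item_id, time);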
The source table contains around 500,000 rows. There will never be more than one matching record in the target table; in practice, it is likely that almost all source rows will have a matching target row.
I am attempting to perform a fairly standard anti-join to find all records in source without corresponding rows in target, but am finding it difficult to create a query with an acceptable execution time.
The query I am using is:
SELECT
source.item_id,
source.time,
source.data
FROM source
LEFT JOIN target USING (item_id, time)
WHERE target.item_id IS NULL;
Just the LEFT JOIN without the WHERE clause takes around 200ms to complete, with it this increases to 5000ms.
While I originally noticed the slow query from within my consuming application the timings above were obtained by executing the statements directly from within sqliteman.
Is there a particular reason why this seemingly simple clause so dramatically increases execution time, and is there some way I can restructure the query to improve it?
I have also tried the following, with the same result (I imagine the underlying query plan is the same):
SELECT
source.item_id,
source.time,
source.data
FROM source
WHERE NOT EXISTS (
SELECT 1 FROM target
WHERE target.item_id = source.item_id
AND target.time = source.time
);
Thanks very much!
Update
Terribly sorry: it turns out that these apparent results are actually due to a quirk of sqliteman.
It seems sqliteman arbitrarily limits the number of rows returned to 256, and loads more dynamically as you scroll through them. This makes a query over a large dataset appear much quicker than it actually is, making it a poor choice for estimating query performance.
Nonetheless, is there any obvious way to improve the performance of this query, or am I simply hitting the limits of what SQLite is capable of?
This is the query plan of your query (either one):
0|0|0|SCAN TABLE source
0|1|1|SEARCH TABLE target USING COVERING INDEX ti (item_id=? AND time=?)
This is pretty much as efficient as possible:
Every row in source must be checked, by
searching for a matching row in target.
It might be possible to make one little improvement.
The source rows are probably not ordered, so the target search will do a lookup at a random position in the index.
If we can force the source scan to be in index order, the target lookups will be in order too, which makes it more likely for these index pages to already be in the cache.
SQLite will use the source index if we do not use any columns not in the index, i.e., if we drop the data column:
> EXPLAIN QUERY PLAN
SELECT source.item_id, source.time
FROM source
LEFT JOIN target USING (item_id, time)
WHERE target.item_id IS NULL;
0|0|0|SCAN TABLE source USING COVERING INDEX si
0|1|1|SEARCH TABLE target USING COVERING INDEX ti (item_id=? AND time=?)
This might not help much.
But if it helps, and if you want the other columns in source, you can do this by doing the join first, and then looking up the source rows by their rowid (the extra lookup should not hurt if you have very few results):
SELECT *
FROM source
WHERE rowid IN (SELECT source.rowid
FROM source
LEFT JOIN target USING (item_id, time)
WHERE target.item_id IS NULL)

Postgres optimize UPDATE

I have to do a somewhat complicated data import. I need to run a number of UPDATEs, which currently update over 3 million rows in one query. Each query takes about 30-45 seconds (some of them even 4-5 minutes). My question is whether I can speed this up. Where can I read about it, and what kind of indexes, on which columns, can I create to improve those updates? I don't need an exact answer, so I am not showing the tables. I am just looking for material to learn from.
Two things:
1) Post an EXPLAIN ANALYZE of your UPDATE query.
2) If your UPDATE does not need to be atomic, then you may want to consider breaking apart the number of rows affected by your UPDATE. To minimize the number of "lost rows" due to exceeding the Free Space Map, consider the following approach:
1. BEGIN;
2. UPDATE ... LIMIT N; -- or some predicate that would limit the number of rows (e.g. WHERE username ILIKE 'a%')
3. COMMIT;
4. VACUUM table_being_updated;
5. Repeat steps 1-4 until all rows are updated.
6. ANALYZE table_being_updated;
I suspect you're updating every row in your table and don't need all rows to be visible with the new value at the end of a single transaction, so breaking the UPDATE up into smaller transactions as above is a good approach.
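A rough sketch of one such batch, using placeholder table and column names (PostgreSQL's UPDATE has no LIMIT clause, so a key-range predicate stands in for it):

BEGIN;
UPDATE big_table t
   SET a = s.b
  FROM staging s
 WHERE t.id = s.id
   AND t.id BETWEEN 1 AND 100000;  -- next batch: 100001 to 200000, and so on
COMMIT;
VACUUM big_table;
-- repeat with the next id range; run ANALYZE big_table once all batches are done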
And yes, an INDEX on the relevant columns specified in the UPDATE's predicate will dramatically help. Again, post an EXPLAIN ANALYZE if you need further assistance.
If by "a number of UPDATEs" you mean one UPDATE command per updated row, then the problem is that all of the target table's indexes will be updated and all constraints will be checked for each updated row. If that is the case, then try instead to update all rows with a single UPDATE:
update t
set a = t2.b
from t2
where t.id = t2.id
If the imported data is in a text file, then insert it into a temp table first and update from there. See my answer here.
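A sketch of that approach, with hypothetical table names and file path:

CREATE TEMP TABLE t2 (id integer PRIMARY KEY, b text);
COPY t2 FROM '/path/to/import.txt';  -- or \copy t2 from ... when running via psql
ANALYZE t2;

UPDATE t
   SET a = t2.b
  FROM t2
 WHERE t.id = t2.id;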

performance of rand()

I have heard that I should avoid using order by rand(), but I really need to use it. Contrary to what I have been hearing, the following query comes back very fast.
select
cp1.img_id as left_id,
cp1.img_filename as left_filename,
cp1.facebook_name as left_facebook_name,
cp2.img_id as right_id,
cp2.img_filename as right_filename,
cp2.facebook_name as right_facebook_name
from
challenge_photos as cp1
cross join
challenge_photos as cp2
where
(cp1.img_id < cp2.img_id)
and
(cp1.img_id,cp2.img_id) not in ((0,0))
and
(cp1.img_status = 1 and cp2.img_status = 1)
order by rand() limit 1
Is this query considered 'okay', or should I use one of the queries I can find by searching for "alternative to rand()"?
It's usually a performance thing. You should avoid, as much as possible, per-row functions since they slow down your queries.
That means things like uppercase(name), salary * 1.1 and so on. It also includes rand(). It may not be an immediate problem (at 10,000 rows) but, if you ever want your database to scale, you should keep it in mind.
The two main issues are that you're evaluating a per-row function and that you then have to do a full sort of the output before selecting the first row. The DBMS cannot use an index when you sort on a random value.
But, if you need to do it (and I'm not making judgement calls there), then you need to do it. Pragmatism often overcomes dogmatism in the real world :-)
A possibility, if performance ever becomes an issue, is to get a count of the records with something like:
select count(*) from ...
then choose a random value on the client side and use a:
limit <start>, <count>
clause in another select, adjusting for the syntax used by your particular DBMS. This should remove the sorting issue and the transmission of unneeded data across the wire.
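A sketch against the table from the question, using MySQL's LIMIT syntax and picking a single photo (the pairing logic would sit on top of this; the ? placeholder is an assumption about how the client binds the random offset):

-- 1) count the qualifying rows once
select count(*) from challenge_photos where img_status = 1;

-- 2) on the client, compute offset = floor(rand() * count)

-- 3) fetch one row at that offset instead of sorting the whole set
select img_id, img_filename, facebook_name
from challenge_photos
where img_status = 1
order by img_id
limit ?, 1;  -- bind the random offset here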
