Optimizing DB2 query - why does bash "time" output stay the same? - bash

I have a DB2 query on the TPC-H schema, which I'm trying to optimize with indexes. I can tell from db2expln output that the estimated costs are significantly lower (300%) when indexes are available.
However I'm also trying to measure execution time and it doesn't change significantly when the query is executed using indexes.
I'm working in an SSH terminal, executing my queries and writing output to a file like
(time db2 "SELECT Q.name FROM
(SELECT custkey, name FROM customer WHERE nationkey = 22) Q WHERE Q.custkey IN
(SELECT custkey FROM orders B WHERE B.orderkey IN
(SELECT orderkey FROM lineitem WHERE receiptdate BETWEEN '1992-06-11' AND '1992-07-11'))") &> output.txt
I did 10 measurements each: 1) without indexes, 2) with index on lineitem.receiptdate, 3) with indexes on lineitem.receiptdate and customer.nationkey,
calculated average time and standard deviation, all are within the same range.
I executed RUNSTATS ON TABLE schemaname.tablename AND DETAILED INDEXES ALL after index creation.
I read this post about the output of the time command; from what I understand, sys + user time should be the relevant measurement for me. There is no change in the combined sys + user time, and none in real time either.
sys + user is around 44 ms, real is around 1s.
Any hints why I cannot see a change in time? Am I interpreting time output wrong? Are the optimizer estimations in db2expln misleading?
Disclaimer: I'm supposed to give a presentation about this at university, so it's technically homework, but as it's more of a comprehension question and not "please make my code work" I hope it's appropriate to post it here. Also, I know the query could be simplified but my question is not about this.

The optimizer estimates timerons (a unit of internal cost), and timerons cannot be translated one-to-one into query execution time. So a difference of 300% in timerons does not mean you will see a 300% difference in runtime.
To measure the time of one or more SQL statements I recommend using db2batch with the option
-i complete
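A minimal invocation might look like this (the database name and the file holding the statement are placeholders, not taken from the question):
db2batch -d <your-db> -f query.sql -i complete
With -i complete, db2batch reports the prepare, execute and fetch times separately, which is usually more informative than wrapping the CLP call in the shell's time.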

Independent of the timing question, the query can also be written with EXISTS instead of the nested IN subqueries:
SELECT f1.name FROM customer f1
WHERE f1.nationkey = 22 AND EXISTS (
  SELECT * FROM orders f2 INNER JOIN lineitem f3 ON f2.orderkey = f3.orderkey
  WHERE f1.custkey = f2.custkey AND f3.receiptdate BETWEEN '1992-06-11' AND '1992-07-11'
)

Related

How to make a view use an index

The view has several joins, but no WHERE clauses. It helped our developers to have all the data they needed in one Appian object that could easily be used in the "low code" later on. In most cases, Appian adds conditions to query the data on the view in a subsequent WHERE clause, like below:
query: [Report on Record Type], order by: [[Sort[histoDateAction desc], Sort[id asc]]],
filters:[((histoDateAction >= TypedValue[it=9,v=2022-10-08 22:00:00.0])
AND (histoDateAction < TypedValue[it=9,v=2022-10-12 22:00:00.0])
AND (histoUtilisateur = TypedValue[it=3,v=miwem6]))
]) (APNX-1-4198-000) (APNX-1-4205-031)
Now that we are starting to have data in the database, performance is degrading. Judging from the execution plan, the reason seems to be that the query does not use indexes when the data is queried.
Here is how the query for view VIEW_A looks:
SELECT
<columns> (not much transformation here)
FROM A
LEFT JOIN R r1 ON r1.id = A.id_type1
LEFT JOIN R r2 ON r2.id = A.id_type2
LEFT JOIN R r3 ON r3.id = A.id_type3
LEFT JOIN U ON U.id = A.id_user        -- <500>
LEFT JOIN C ON C.id = A.id_customer    -- <50000>
LEFT JOIN P ON P.id = A.id_prestati    -- <100000>
and in the current case, Appian adds the clauses below:
where A.DATE_ACTION < to_date('2022-10-12 22:00:00', 'YYYY-MM-DD HH24:MI:SS')
and A.DATE_ACTION >= to_date('2022-10-08 22:00:00', 'YYYY-MM-DD HH24:MI:SS')
and A.USER_ACTION = 'miwem6'
Typically, when I show the explain plan for VIEW_A WHERE <conditions>, I get a cost of around 6,000, whereas when I show the explain plan for the <code of the view> WHERE <clause>, the cost is 30.
Is it possible to use some Oracle hint to tell it: "Some day, someone will query this adding a WHERE clause on some columns, so don't be a stupid engine and use indexes when the time comes"?
First, this isn't a great architecture. I can't tell you how many times folks have pulled me in to diagnose performance problems caused by unpredictably dynamic queries that add an assortment of unforeseeable WHERE predicates.
But if you have to do this, you can increase your likelihood of using indexes by lowering their cost. Like this:
SELECT /*+ opt_param('optimizer_index_cost_adj',1) */
<columns> (not much transformation here)
FROM A . . .
If you know for sure that nested loops + index use is the way you want to access everything, you can even disable the CBO entirely:
SELECT /*+ rule */
<columns> (not much transformation here)
FROM A . . .
But of course it's on you to ensure that there's an index on every high-cardinality column that your system may use to significantly filter the desired rows. That's not every column, but it sounds like it may be quite a few.
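As an illustration only (the index name is made up, and the columns are the ones Appian filters on in the question), that could mean something like:
CREATE INDEX a_date_user_ix ON A (DATE_ACTION, USER_ACTION);
A composite index that matches the appended predicates gives the optimizer a cheap access path to choose once the WHERE clause is added on top of the view.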
Oh, and one more thing... please ignore COST. By definition, Oracle always chooses what it computes as the lowest-cost plan. When it makes wrong choices, it's because its computation of cost is incorrect. Therefore, by definition, if you are having problems, the COST numbers you see are wrong. Ignore them.

Performance tuning tips - PL/SQL / SQL - Database

We are facing a performance issue in production. The MV refresh program is running long, almost 13 to 14 hours.
The MV refresh program tries to refresh 5 MVs. Among them, one MV is running for a long time.
Below is the MV script that is running long.
SELECT rcvt.transaction_id,
rsh.shipment_num,
rsh.shipped_date,
rsh.expected_receipt_date,
(select rcvt1.transaction_date from rcv_transactions rcvt1
where rcvt1.po_line_id = rcvt.po_line_id
AND rcvt1.transaction_type = 'RETURN TO VENDOR'
and rcvt1.parent_transaction_id=rcvt.transaction_id
)transaction_date
FROM rcv_transactions rcvt,
rcv_shipment_headers rsh,
rcv_shipment_lines rsl
WHERE 1 =1
AND rcvt.shipment_header_id =rsl.shipment_header_id
AND rcvt.shipment_line_id =rsl.shipment_line_id
AND rsl.shipment_header_id =rsh.shipment_header_id
AND rcvt.transaction_type = 'RECEIVE';
The shipment table contains millions of records and the above query extracts almost 60 to 70% of the data. We suspect the data volume is the reason.
We are trying to improve the performance of the above script, so we added a date filter to restrict the data.
SELECT rcvt.transaction_id,
rsh.shipment_num,
rsh.shipped_date,
rsh.expected_receipt_date,
(select rcvt1.transaction_date from rcv_transactions rcvt1
where rcvt1.po_line_id = rcvt.po_line_id
AND rcvt1.transaction_type = 'RETURN TO VENDOR'
and rcvt1.parent_transaction_id=rcvt.transaction_id
)transaction_date
FROM rcv_transactions rcvt,
rcv_shipment_headers rsh,
rcv_shipment_lines rsl
WHERE 1 =1
AND rcvt.shipment_header_id =rsl.shipment_header_id
AND rcvt.shipment_line_id =rsl.shipment_line_id
AND rsl.shipment_header_id =rsh.shipment_header_id
AND rcvt.transaction_type = 'RECEIVE'
AND TRUNC(rsh.creation_date) >= NVL(TRUNC((sysdate - profile_value),'MM'),TRUNC(rsh.creation_date) );
With a 1-year profile it shows some improvement, but with a 2-year range it is worse than the previous query.
Any suggestions to improve the performance?
Please help.
I'd pull that scalar subquery out into a regular outer join (a sketch follows the quote below).
Costing for scalar subqueries can be poor, and you are forcing it to do a lot of single-record lookups (presumably via an index) rather than giving it other options.
"The main query then has a scalar subquery in the select list.
Oracle therefore shows two independent plans in the plan table. One for the driving query – which has a cost of two, and one for the scalar subquery, which has a cost of 2083 each time it executes.
But Oracle does not “know” how many times the scalar subquery will run (even though in many cases it could predict a worst-case scenario), and does not make any cost allowance whatsoever for its execution in the total cost of the query."
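A sketch of that rewrite, using the tables and columns from the question (untested; see the caveat below):
SELECT rcvt.transaction_id,
       rsh.shipment_num,
       rsh.shipped_date,
       rsh.expected_receipt_date,
       rtv.transaction_date
FROM rcv_transactions rcvt
JOIN rcv_shipment_lines rsl
  ON rcvt.shipment_header_id = rsl.shipment_header_id
 AND rcvt.shipment_line_id = rsl.shipment_line_id
JOIN rcv_shipment_headers rsh
  ON rsl.shipment_header_id = rsh.shipment_header_id
LEFT JOIN rcv_transactions rtv  -- the former scalar subquery, now an outer join
  ON rtv.po_line_id = rcvt.po_line_id
 AND rtv.parent_transaction_id = rcvt.transaction_id
 AND rtv.transaction_type = 'RETURN TO VENDOR'
WHERE rcvt.transaction_type = 'RECEIVE';
One caveat: if more than one RETURN TO VENDOR row can match a transaction, the scalar subquery would raise ORA-01427 while the join would return duplicate rows, so the two forms are only equivalent when that lookup yields at most one row.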

More efficient query to avoid OutOfMemoryError in Hive

I'm getting an exception in Hive:
java.lang.OutOfMemoryError: GC overhead limit exceeded.
Searching around, I've found that this happens because 98% of all CPU time of the process is going to garbage collection (whatever that means?). Is the core of my issue in my query? Should I be writing it in a different way to avoid this kind of problem?
I'm trying to count how many of a certain phone type have an active 'Use' in a given time period. Is there a way to do this logic differently, that would run better?
select count(a.imei)
from
(Select distinct imei
from pingdata
where timestamp between TO_DATE("2016-06-01") AND TO_DATE("2016-07-17")
and ((SUBSTR(imei,12,2) = "04") or (SUBSTR(imei,12,2) = "05")) ) a
join
(SELECT distinct imei
FROM eventdata
where timestamp between TO_DATE("2016-06-01") AND TO_DATE("2016-07-17")
AND event = "Use" AND clientversion like '3.2%') b
on a.imei=b.imei
Thank you
Applying distinct to each dataset before joining them is safer, because joining on non-unique keys will duplicate data.
I would recommend partitioning your datasets by the to_date(timestamp) field (yyyy-MM-dd) so that partition pruning works with your where clause (check that it does). Also partition by the event field if the datasets are too big and contain a lot of data where event <> 'Use'.
It's important to know at which stage it fails, so study the exception as well. If it fails on the mappers, optimize your subqueries (add partitions, as mentioned above). If it fails on the reducer (the join), try to improve the join, for example by reducing the bytes per reducer:
set hive.exec.reducers.bytes.per.reducer=67108864; -- or even less
If it fails on the writer (OrcWriter), try partitioning the output table by a substring of imei and adding distribute by substr(imei...) at the end of the query to reduce pressure on the reducers. Or add one more column with low cardinality and an even distribution to spread the data across more reducers evenly:
distribute by substr(imei...), col2
Make sure the partition column is in the distribute by. This will reduce the number of files written by each reducer and help to get rid of the OOM.
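A minimal sketch of the partitioning idea (the partitioned table name and the dt column are made up; only imei, clientversion, event and the date range come from the question):
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

CREATE TABLE eventdata_part (imei STRING, clientversion STRING)
PARTITIONED BY (dt STRING, event STRING)
STORED AS ORC;

-- repartition the existing data; the partition columns must come last in the SELECT
INSERT OVERWRITE TABLE eventdata_part PARTITION (dt, event)
SELECT imei, clientversion, to_date(`timestamp`) AS dt, event
FROM eventdata;
With this layout a predicate such as dt BETWEEN '2016-06-01' AND '2016-07-17' AND event = 'Use' is answered by partition pruning instead of a full scan of eventdata.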
In order to improve performance, looking at your query I would partition the Hive tables by yyyy, mm, dd, or by the first two digits of imei. You will have to decide the partitioning key according to how you query these tables and the amount of data, but I would vote for yyyy, mm, dd, which should give you a tremendous improvement in performance. See improving-query-performance-using-partitioning.
But for now, this should give you some improvements:
Select count(distinct(pd.imei))
from pingdata pd join eventdata ed on pd.imei=ed.imei
where
TO_DATE(pd.timestamp) between '2016-06-01' AND '2016-07-17'
and pd.timestamp = ed.timestamp
and SUBSTR(pd.imei,12,2) in ('04','05')
and ed.event = 'Use' AND ed.clientversion like '3.2%';
If the TO_DATE(timestamp) values are inserted on the same day, in other words if both values always share the same date, then the pd.timestamp = ed.timestamp condition should be excluded:
Select count(distinct(pd.imei))
from pingdata pd join eventdata ed on pd.imei=ed.imei
where
TO_DATE(pd.timestamp) between '2016-06-01' AND '2016-07-17'
and SUBSTR(pd.imei,12,2) in ('04','05')
and ed.event = 'Use' AND ed.clientversion like '3.2%';
Try running both queries and compare results. Do let us know the differences and if you find this helpful.

Force "TOP 100 PERCENT" in EF "sub-query" queryable

Update: I was mistaken about top 100 percent generating a better query plan (the plan is still much better for a reasonably sized top N, and probably has to do with parameter sniffing).
While I still think this focused question has merit, it is not "a useful solution" for my problem2, and might not be for yours ..
I am running into some queries which SQL Server optimizes poorly. The statistics appear correct, and SQL Server chooses the 'worse' plan that performs a seek over millions of records even though the estimated and actual values are the same - but this question is not about that1.
The problematic queries are of the simplified form:
select * from x
join y on ..
join z on ..
where z.q = ..
However (and since I know the cardinalities better, apparently) the following form consistently results in a much better query plan:
select * from x
join (
-- the result set here is 'quite small'
select top 100 percent *
from y
join z on ..
where z.q = ..) t on ..
In L2S the Take function can be used to limit to the top N, but the "problem" I have with this approach is that it requires a finite/fixed N, such that some query could hypothetically just break, instead of just running really slowly with the forced materialization.
While I could choose a 'very large' value for the top N, this, ironically (wrt the initial problem), increases the SQL query execution time as the value of N increases. The intermediate result is only expected to be a few dozen to a few hundred records. The current code I have runs a top 100 and then, if that is detected to contain too many results, runs the query again without the limit: but this feels like a kludge .. on top of a kludge.
The question is then: can an EF/L2E/LINQ query generate the equivalent of a top 100 percent on an EF Queryable?
(Forcing materialization via ToList is not an option because the result should be an EF Queryable and remain in LINQ to Entities, not LINQ to Objects.)
While I am currently dealing with EF4, if this is [only] possible in a later version of EF I would accept such as an answer - such is useful knowledge and does answer the question asked.
1 If wishing to answer with "don't do that" or an "alternative", please make it a secondary answer or an aside along with an answer to the actual question being asked. Otherwise, feel free to use the comments.
2 In addition to top 100 percent not generating a better query plan, I forgot to include the 'core issue' at stake, which is bad parameter sniffing (the instance is SQL Server 2005).
The following query takes a very long time to complete, while direct variable substitution runs "in the blink of an eye", indicating an issue with parameter sniffing.
declare @x int
set @x = 19348659
select
op.*
from OrderElement oe
join OrderRatePlan rp on oe.OrdersElementID = rp.OrdersElementID
join OrderPrice op on rp.OrdersRatePlanID = op.OrdersRatePlanID
where oe.OrdersProductID = @x
The kludged-but-workable query
select
op.*
from OrderPrice op
join (
-- Choosing a 'small value of N' runs fast, and it slows down as the
-- value of N increases; N >> 1000 simply "takes too long".
-- Using TOP 100 PERCENT also "takes too long".
select top 100
rp.*
from OrderElement oe
join OrderRatePlan rp on oe.OrdersElementID = rp.OrdersElementID
where oe.OrdersProductID = @x
) rp
on rp.OrdersRatePlanID = op.OrdersRatePlanID
Yes, you can do your own query.
db.SqlQuery<something>("SELECT * FROM x ...");

performance of rand()

I have heard that I should avoid using 'order by rand()', but I really need to use it. Unlike what I have been hearing, the following query comes up very fast.
select
cp1.img_id as left_id,
cp1.img_filename as left_filename,
cp1.facebook_name as left_facebook_name,
cp2.img_id as right_id,
cp2.img_filename as right_filename,
cp2.facebook_name as right_facebook_name
from
challenge_photos as cp1
cross join
challenge_photos as cp2
where
(cp1.img_id < cp2.img_id)
and
(cp1.img_id,cp2.img_id) not in ((0,0))
and
(cp1.img_status = 1 and cp2.img_status = 1)
order by rand() limit 1
is this query considered 'okay'? or should I use queries that I can find by searching "alternative to rand()" ?
It's usually a performance thing. You should avoid, as much as possible, per-row functions since they slow down your queries.
That means things like uppercase(name), salary * 1.1 and so on. It also includes rand(). It may not be an immediate problem (at 10,000 rows) but, if you ever want your database to scale, you should keep it in mind.
The two main issues are the fact that you're performing a per-row function and then having to do a full sort on the output before selecting the first row. The DBMS cannot use an index if you sort on a random value.
But, if you need to do it (and I'm not making judgement calls there), then you need to do it. Pragmatism often overcomes dogmatism in the real world :-)
A possibility, if performance ever becomes an issue, is to get a count of the records with something like:
select count(*) from ...
then choose a random value on the client side and use a:
limit <start>, <count>
clause in another select, adjusting for the syntax used by your particular DBMS. This should remove the sorting issue and the transmission of unneeded data across the wire.
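A minimal sketch of that approach against the table from the question (single-table case for brevity; 42 stands in for the random offset chosen on the client, MySQL LIMIT syntax):
SELECT COUNT(*) FROM challenge_photos WHERE img_status = 1;

-- pick a random offset between 0 and count - 1 on the client, then:
SELECT img_id, img_filename, facebook_name
FROM challenge_photos
WHERE img_status = 1
ORDER BY img_id   -- a stable order keeps the offset meaningful
LIMIT 42, 1;
For the pair query in the question the same idea applies, but both steps need the same join and WHERE conditions.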
