I have recently came across a case in which the overall SELECT statement was very fast (0.06 sec) but this part took 20 seconds:
COALESCE(T1.trees, T1.trees, T1.flowers) ASC
While used in: ORDER BY COALESCE(T2.trees, T2.trees, T2.flowers) ASC, T1.Shirts)
T1 & T2 are aliases to the same table while T2 is used for left outer Join..
Any alternatives for the COALESCE(T2.trees, T2.trees, T2.flowers) ASC part that will present better performance?
(as I read on other posts that probably do the the fact that COALESCE(x, y) doesn't make full use of indexes). So assume that trees column is indexed.)
Cheers,
udifel
Related
I am facing a big performance problem when trying to get a list of objects with pagination from an oracle11g database.
As far as I know and as much as I have checked online, the only way to achieve pagination in oracle11g is the following :
Example : [page=1, size=100]
SELECT * FROM
(
SELECT pagination.*, rownum r__ FROM
(
select * from "TABLE_NAME" t
inner join X on X.id = t.id
inner join .....
where ......
order
) pagination
WHERE rownum <= 200
)
WHERE r__ > 100
The problem in this query, is that the most inner query fetching data from the table "TABLE_NAME" is returning a huge amount of data and causing the overall query to take 8 seconds (there are around 2 Million records returned after applying the where clause, and it contains 9 or 10 join clause).
The reason of this is that the most inner query is fetching all the data that respects the where clause and then the second query is getting the 200 rows, and the third to exclude the first 100 to get the second pages' data we need.
Isn't there a way to do that in one query, in a way to fetch the second pages' data that we need without having to do all these steps and cause performance issues?
Thank you!!
It depends on your sorting options (order by ...): database needs to sort whole dataset before applying outer where rownum<200 because of your order by clause.
It will fetch only 200 rows if you remove your order by clause. In some cases oracle can avoid sort operations (for example, if oracle can use some indexes to get requested data in the required order). Btw, Oracle uses optimized sorting operations in case of rownum<N predicates: it doesn't sort full dataset, it just gets top N records instead.
You can investigate sort operations deeper using sort trace event: alter session set events '10032 trace name context forever, level 10';
Furthermore, sometimes it's better to use analytic functions like
select *
from (
select
t1.*
,t2.*
,row_number()over([partition by ...] order by ...) rn
from t1
,t2
where ...
)
where rn <=200
and rn>=100
because in some specific cases Oracle can transform your query to push sorting and sort filter predicates to the earliest possible steps.
I am executing the below MERGE statement for Insert Update operation.
It is working fine for 1 to 2 million records but for more than 4 to 5 billion records it takes 6 to 7 hours to complete.
Can anyone suggest some alternative or performance tips for Merge Statement
merge into employee_payment ep
using (
select
p.pay_id vista_payroll_id,
p.pay_date pay_dte,
c.client_id client_id,
c.company_id company_id,
case p.uni_ni when 0 then null else u.unit_id end unit_id,
p.pad_seq pay_dist_seq_nbr,
ph.payroll_header_id payroll_header_id,
p.pad_id vista_paydist_id,
p.pad_beg_payperiod pay_prd_beg_dt,
p.pad_end_payperiod pay_prd_end_d
from
stg_paydist p
inner join company c on c.vista_company_id = p.emp_ni
inner join payroll_header ph on ph.vista_payroll_id = p.pay_id
left outer join unit u on u.vista_unit_id = p.uni_ni
where ph.deleted = '0'
) ps
on (ps.vista_paydist_id = ep.vista_paydist_id)
when matched then
update
set ep.vista_payroll_id = ps.vista_payroll_id,
ep.pay_dte = ps.pay_dte,
ep.client_id = ps.client_id,
ep.company_id = ps.company_id,
ep.unit_id = ps.unit_id,
ep.pay_dist_seq_nbr = ps.pay_dist_seq_nbr,
ep.payroll_header_id = ps.payroll_header_id
when not matched then
insert (
ep.employee_payment_id,
ep.vista_payroll_id,
ep.pay_dte,
ep.client_id,
ep.company_id,
ep.unit_id,
ep.pay_dist_seq_nbr,
ep.payroll_header_id,
ep.vista_paydist_id
) values (
seq_employee_payments.nextval,
ps.vista_payroll_id,
ps.pay_dte,
ps.client_id,
ps.company_id,
ps.unit_id,
ps.pay_dist_seq_nbr,
ps.payroll_header_id,
ps.vista_paydist_id
) log errors into errorlog (v_batch || 'EMPLOYEE_PAYMENT') reject limit unlimited;
Try using the Oracle hints:
MERGE /*+ append leading(PS) use_nl(PS EP) parallel (12) */
Try to using hints to optimize inner using query.
Processing lots of data takes lots of time...
Here are some things that may help you (assuming there is not a probolem with bad execution plan):
Adding a where-clause in the UPDATE-part to only update records when the values are actually different. If you are merging the same data over and over again and only a smaller subset of the data is actually modified, this will improve performance.
If you indeed are processing the same data over and over again, investigate whether you can add some modification flag/date to only process new records since last time.
Depending on the kind of environment and when/who is updating your source tables, investigate whether a truncate-insert approach is beneficial. Remember to set the indexes unusuable on before hand.
I think your best bet here is to exploit the patterns in your data. This is something oracle does not know about, so you may have to get creative.
I was working on a similar problem and a good solution i found was to break the query up.
The primary reason big table merges are a bad idea is because of the in memory storage of the result of the using query. Because the PGA gets filled up pretty quickly so the database starts using the temporary table space of sort operations and joins. The temp tablespace being on disk is excruciatingly slow. The use of excessive temp table space can be easily avoided by splitting the query into two queries.
So the below query
merger into emp e
using (
select a,b,c,d from (/* big query here */)
) ec
on /*conditions*/
when matched then
/* rest of merge logic */
can become
create table temp_big_query as select a,b,c,d from (/* big query here */);
merger into emp e
using (
select a,b,c,d from temp_big_query
) ec
on /*conditions*/
when matched then
/* rest of merge logic */
if the using query also has CTEs and sub queries try breaking that query up to use more temp tables like the one shown above. Also avoid using parallel hints because they mostly tend to slow the query down unless the query itself has something that can be done in parallel, try using indexes instead instead as much as possible parallel should be used as the last option for optimization.
I know some references are missing please feel free to comment and add references or point out mistakes in my answer.
I have four tables; two for current data, two for archive data. One of the archive tables has tens of millions of rows. All tables have a couple narrow indexes and are very similar.
Given the following queries:
SELECT (SELECT COUNT(*) FROM A)
UNION SELECT (SELECT COUNT(*) FROM B)
UNION SELECT (SELECT COUNT(*) FROM C_LargeTable)
UNION SELECT (SELECT COUNT(*) FROM D);
A, B and D perform index scans. C_LargeTable uses a seq scan and the query takes about 20 seconds to execute. Table D has millions of rows as well, but is only about 10% of the size of C_LargeTable
If I then modify my query to execute using the following logic, which sufficiently narrows counts, I still get the same results, the index is used and the query takes about 5 seconds, or 1/4th of the time
...
SELECT (SELECT COUNT(*) FROM C_LargeTable WHERE idx_col < 'G')
+ (SELECT COUNT(*) FROM C_LargeTable WHERE idx_col BETWEEN 'G' AND 'Q')
+ (SELECT COUNT(*) FROM C_LargeTable WHERE idx_col > 'Q')
...
It does not makes sense to me to have the I/O overhead of a full table scan for a count when perfectly good indexes exist and there is a covering primary key which would ensure uniqueness. My understanding of postgres is that a PRIMARY KEY isn't like a SQL Server clustering index in that it determines a sort, but it implicitly creates a btree index to ensure uniqueness, which I assume should require significantly less I/O than a full table scan.
Is this potentially an indication of an optimization that I may need to perform to organize data within C_LargeTable?
There isn't a covering index on the primary key because PostgreSQL doesn't support them (true up to and including 9.4 anyway).
The heap scan is required because of MVCC visibility. The index doesn't contain visibility information. Pg can do an index scan, but it still has to check visibility info from the heap, and with an index scan that'd be random I/O to read the whole table, so a seqscan will be much faster.
Make sure you run 9.2 or newer, and that autovacuum is configured to run frequently on the table. You should then be able to do an index-only scan where the visibility map is used. This only works under limited circumstances as Horse notes; see the wiki page on count and on index-only scans. If you aren't letting autovacuum run regularly enough the visibility map will be outdated and Pg won't be able to do an index-only scan.
In future, make sure you post explain or preferably explain analyze output with any queries.
This question already has answers here:
COUNT(*) vs. COUNT(1) vs. COUNT(pk): which is better? [duplicate]
(5 answers)
Closed 8 years ago.
I'm running a query like this in MSSQL2008:
select count(*)
from t1
inner join t2 on t1.id = t2.t1_id
inner join t3 on t1.id = t3.t1_id
Assume t1.id has a NOT NULL constraint. Since they're inner joins and t1.id can never be null, using count(t1.id) instead of count(*) should produce the exact same end result. My question is: Would the performance be the same?
I'm also wondering whether the joins could affect this. I realize that adding or removing a join will affect both performance and the length of the result set. Suppose that without changing the join pattern, you set count to target only one table. Would it make any difference? In other words, is there a difference between these two queries:
select count(*) from t1 inner join t2 on t1.id = t2.t1_id
select count(t1.*) from t1 inner join t2 on t1.id = t2.t1_id
COUNT(id) vs. COUNT(*) in MySQL answers this question for MySQL, but I couldn't find answers for MS-SQL specifically, and I can't find anything at all that takes the join factor into account.
NOTE: I tried to find this information on both Google and SO, but it was difficult to figure out how to word my search.
I tried a few SELECT COUNT(*) FROM MyTable vs. SELECT COUNT(SomeColumn) FROM MyTable with various sizes of tables, and where the SomeColumn once is a clustering key column, once it's in a non-clustered index, and once it's in no index at all.
In all cases, with all sizes of tables (from 300'000 rows to 170 million rows), I never see any difference in terms of either speed nor execution plan - in all cases, the COUNT is handled by doing a clustered index scan --> i.e. scanning the whole table, basically. If there is a non-clustered index involved, then the scan is on that index - even when doing a SELECT COUNT(*)!
There doesn't seem to be any difference in terms of speed or approach how those things are counted - to count them all, SQL Server just needs to scan the whole table - period.
Tests were done on SQL Server 2008 R2 Developer Edition
select count(*) will be slower as it attempts to fetch everything. Specifying a column (PK or any other indexed column) will speed up things as the query engine knows ahead of time what it is looking for. It'll also use an index as opposed to going against the table.
When joining across tables (as in the examples below), is there an efficiency difference between joining on the tables or joining subqueries containing only the needed columns?
In other words, is there a difference in efficiency between these two tables?
SELECT result
FROM result_tbl
JOIN test_tbl USING (test_id)
JOIN sample_tbl USING (sample_id)
JOIN (SELECT request_id
FROM request_tbl
WHERE request_status='A') USING(request_id)
vs
SELECT result
FROM (SELECT result, test_id FROM result_tbl)
JOIN (SELECT test_id, sample_id FROM test_tbl) USING(test_id)
JOIN (SELECT sample_id FROM sample_tbl) USING(sample_id)
JOIN (SELECT request_id
FROM request_tbl
WHERE request_status='A') USING(request_id)
The only way to find out for sure is to run both with tracing turned on and then look at the trace file. But in all probability they will be treated the same: the optimizer will merge all the inline views into the main statement and come up with the same query plan.
It doesn't matter. It may actually be WORSE since you are taking control away from the optimizer which generally knows best.
However, remember if you are doing a JOIN and only including a column from one of the tables that it is QUITE OFTEN better to re-write it as a series of EXISTS statements -- because that's what you really mean. JOINs (with some exceptions) will join matching rows which is a lot more work for the optimizer to do.
e.g.
SELECT t1.id1
FROM table1 t1
INNER JOIN table2 ON something = something
should almost always be
SELECT id1
FROM table1 t1
WHERE EXISTS( SELECT *
FROM table2
WHERE something = something )
For simple queries the optimizer may reduce the query plans into identical ones. Check it out on your DBMS.
Also this is a code smell and probably should be changed:
JOIN (SELECT request_id
FROM request_tbl
WHERE request_status='A')
to
SELECT result
FROM request
WHERE EXISTS(...)
AND request_status = 'A'
No difference.
You can tell by running EXPLAIN PLAN on both those statements - Oracle knows that all you want is the "result" column, so it only does the minimum necessary to get the data it needs - you should find that the plans will be identical.
The Oracle optimiser does, sometimes, "materialize" a subquery (i.e. run the subquery and keep the results in memory for later reuse), but this is rare and only occurs when the optimiser believes this will result in a performance improvement; in any case, Oracle will do this "materialization" whether you specified the columns in the subqueries or not.
Obviously if the only place the "results" column is stored is in the blocks (along with the rest of the data), Oracle has to visit those blocks - but it will only keep the relevant info (the "result" column and other relevant columns, e.g. "test_id") in memory when processing the query.