I have a series of extremely similar queries that I run against a table of 1.4 billion records (with indexes), the only problem is that at least 10% of those queries take > 100x more time to execute than others.
I ran an explain plan and noticed that for the fast queries (roughly 90%) Oracle is using an index range scan; for the slow ones, it's using a full index scan.
Is there a way to force Oracle to do an index range scan?
To "force" Oracle to use an index range scan, simply use an optimizer hint INDEX_RS_ASC. For example:
CREATE TABLE mytable (a NUMBER NOT NULL, b NUMBER NOT NULL, c CHAR(10)) NOLOGGING;
INSERT /*+ APPEND */ INTO mytable(a,b,c)
SELECT level, mod(level,100)+1, 'a' FROM dual CONNECT BY level <= 1E6;
CREATE INDEX myindex_ba ON mytable(b, a);
EXECUTE dbms_stats.gather_table_stats(NULL,'mytable');
SELECT /*+ FULL(m) */ b FROM mytable m WHERE b=10; -- full table scan
SELECT /*+ INDEX_RS_ASC(m) */ b FROM mytable m WHERE b=10; -- index range scan
SELECT /*+ INDEX_FFS(m) */ b FROM mytable m WHERE b=10; -- index fast full scan
Whether this will make your query actually run faster depends on many factors like the selectivity of the indexed value or the physical order of the rows in your table. For instance, if you change the query to WHERE b BETWEEN 10 AND <xxx>, the following costs appear in the execution plans on my machine:
b BETWEEN 10 AND ...      10     20     40     80
FULL                     749    750    751    752
INDEX_RS_ASC              29    325    865   1943
INDEX_FFS                597    598    599    601
If you change the query very slightly to select not only the indexed column b but also other, non-indexed columns, the costs change dramatically:
b BETWEEN 10 AND ...      10     20     40     80
FULL                     749    750    751    754
INDEX_RS_ASC            3352  40540 108215 243563
INDEX_FFS               3352  40540 108215 243563
I suggest the following approach:
Get an explain plan on the slow statement
Using an INDEX hint, get an explain plan on using the index
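With the demo table and index from the answer above (the names are just placeholders for your own objects), the two plans can be compared like this:
EXPLAIN PLAN FOR SELECT * FROM mytable WHERE b = 10;
SELECT * FROM TABLE(dbms_xplan.display);

EXPLAIN PLAN FOR SELECT /*+ INDEX(mytable myindex_ba) */ * FROM mytable WHERE b = 10;
SELECT * FROM TABLE(dbms_xplan.display);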
You'll notice that the cost of the INDEX plan is greater. This is why Oracle is not choosing the index plan. The cost is Oracle's estimate based on the statistics it has and various assumptions.
If the estimated cost of a plan is greater, but it actually runs quicker, then the estimate is wrong. Your job is to figure out why the estimate is wrong and correct that. Then Oracle will choose the right plan for this statement and others on its own.
To figure out why it's wrong, look at the number of expected rows in the plan. You will probably find one of these is off by an order of magnitude. This might be due to non-uniformly distributed column values, old statistics, columns that correlate with each other, etc.
To resolve this, you can get Oracle to collect better statistics and hint it with better starting assumptions. Then it will estimate accurate costs and come up with the fastest plan.
If you post more information I might be able to comment further.
If you want to know why the optimizer takes the decisions it does you need to use the 10053 trace.
SQL> alter session set events '10053 trace name context forever, level 1';
Then run explain plans for a sample fast query and a sample slow query. In the user dump directory you will get trace files detailing the decision trees which the CBO goes through. Somewhere in those files you will find the reasons why it chooses a full index scan over an index range scan.
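A minimal sketch of the whole sequence (the SELECT is only a placeholder for one of your real statements):
ALTER SESSION SET tracefile_identifier = 'cbo_10053';   -- tags the trace file name so it is easy to find
ALTER SESSION SET events '10053 trace name context forever, level 1';

EXPLAIN PLAN FOR
SELECT * FROM mytable WHERE b = 10;

ALTER SESSION SET events '10053 trace name context off';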
I'm not saying the trace files are an easy read. The best resource for understanding them is Wolfgang Breitling's excellent whitepaper "A look under the hood of CBO" (PDF)
You can use Oracle SQL hints to force the use of a specific index or to exclude an index.
Check the documentation:
http://psoug.org/reference/hints.html
http://www.adp-gmbh.ch/ora/sql/hints/index.html
For example:
select /*+ index(emp_alias ix_emp) */ * from scott.emp emp_alias
I have seen hints ignored by Oracle.
Recently, our DBA used "optimizer_index_cost_adj" and the index was used.
This is an Oracle initialization parameter, but you can also set it at the session level.
The default value is 100; we used 10.
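For reference, this is how it can be set for the current session only (10 is just the value that worked for us; it makes index access paths look ten times cheaper to the optimizer, so tune it for your own system):
ALTER SESSION SET optimizer_index_cost_adj = 10;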
I am investigating a problem where our application takes too much time to get data from an Oracle database. In my investigation, I found that the slowness traces to the join between tables and to the aggregate function SUM.
This may look simple, but I am not good at SQL query optimization.
The query is below:
SELECT T1.TONNES, SUM(R.TONNES) AS TOTAL_TONNES
FROM   RECLAIMED R,
       (SELECT DELIVERY_OUT_ID, SUM(TONNES) AS TONNES
        FROM RECLAIMED
        WHERE DELIVERY_IN_ID = 53773
        GROUP BY DELIVERY_OUT_ID) T1
WHERE  R.DELIVERY_OUT_ID = T1.DELIVERY_OUT_ID
GROUP BY T1.TONNES
SUM(R.TONNES) is the total tonnes per delivery out.
SUM(TONNES) is the total tonnes per delivery in.
My table looks like
I have 16 million entries in this table, and trying multiple delivery_in_id values, the query takes about 6 seconds on average to come back.
I have a similar database (a complete copy, but with only 4 million entries), and there the same query comes back in less than 1 second.
They both have the same indexes, so I am confident the indexes are not the problem.
I am certain it is just the data: the first database (16 million) is much heavier. I have a feeling that once this query is optimized, the problem will be solved.
Open for suggestions : )
Are the two databases on the same server? If not, first compare the machine configuration, settings, and running applications.
If there are no differences, check whether you have NULL values in the column you want to SUM; use the NVL function in your query if there are some.
Also, you may analyze the index (or rebuild it), which cleans it up. (It's quite fast and safe for your data.)
If that does not help, check whether the tablespace of your table is full. It might have some impact... but I am not sure.
;-)
I solved the performance problem by updating the stored procedure: it is optimized by filtering the first table before joining the second table. Below is the query from the updated stored procedure:
SELECT R.DELIVERY_IN_ID, R.DELIVERY_OUT_ID, SUM(R.TONNES),
(SELECT SUM(TONNES) AS TONNES FROM RECLAIMED WHERE DELIVERY_OUT_ID=R.DELIVERY_OUT_ID) AS TOTAL_TONNES
FROM
CTSBT_RECLAIMED R
WHERE DELIVERY_IN_ID=53733
GROUP BY DELIVERY_IN_ID, R.DELIVERY_OUT_ID
The result in timing/performance is huge in my case, since I am joining a huge table (16M). This query now takes less than a second. The slowness, I suspect, is due to one table having no index (see T1), and even though in my case it contains only about 20 items, that is enough to slow the query down because it is compared against 16 million entries.
The optimized query filters the 16 million rows first and merges with T1 afterwards.
Is there a better way to optimize this? Probably. But I am happy with this result and solved what I intended to solve. Now moving on.
Thanks for those who commented.
I have to sum a huge amount of data with aggregation and a WHERE clause, using the query below.
What I am doing is this: I have three tables; one contains terms, the second contains user terms, and the third contains the correlation factor between a term and a user term.
I want to calculate the similarity between the sentence the user inserted and the already existing sentences, and keep the results greater than 0.5, by summing the correlation factors between the sentences' terms.
The problem is that this query takes more than 15 minutes, because the tables are huge.
Any suggestions to improve performance?
INSERT INTO plag_sentence_similarity
SELECT plag_terms.sentence_id,
       plag_user_terms.sentence_id,
       LEAST(SUM(plag_term_correlations3.correlation_factor) / plag_terms.sentence_length,
             SUM(plag_term_correlations3.correlation_factor) / plag_user_terms.sentence_length),
       plag_terms.isn,
       plag_user_terms.isn
FROM   plag_term_correlations3,
       plag_terms,
       plag_user_terms
WHERE  plag_terms.term_root      = plag_term_correlations3.term1
AND    plag_user_terms.term_root = plag_term_correlations3.term2
AND    plag_user_terms.isn       = 123
GROUP BY plag_user_terms.sentence_id,
         plag_terms.sentence_id,
         plag_terms.isn,
         plag_terms.sentence_length,
         plag_user_terms.sentence_length,
         plag_user_terms.isn
HAVING LEAST(SUM(plag_term_correlations3.correlation_factor) / plag_terms.sentence_length,
             SUM(plag_term_correlations3.correlation_factor) / plag_user_terms.sentence_length) > 0.5;
plag_terms contains more than 50 million records and plag_term_correlations3 contains 500,000.
If you have a sufficient amount of free disk space, then create a materialized view
over the join of the three tables
fast-refreshable on commit (don't use the ANSI join syntax here, even if tempted to do so, or the mview won't be fast-refreshable ... a strange bug in Oracle)
with query rewrite enabled
properly physically organized for quick calculations
The query rewrite is optional. If you can modify the above insert-select, then you can just select from the materialized view instead of selecting from the join of the three tables.
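A minimal sketch of such a setup, assuming materialized view logs can be created on all three base tables (the materialized view name and the column aliases are made up; fast refresh of a join-only materialized view requires the logs and the rowid of every base table in the select list):
CREATE MATERIALIZED VIEW LOG ON plag_terms WITH ROWID;
CREATE MATERIALIZED VIEW LOG ON plag_user_terms WITH ROWID;
CREATE MATERIALIZED VIEW LOG ON plag_term_correlations3 WITH ROWID;

CREATE MATERIALIZED VIEW plag_sim_mv
  REFRESH FAST ON COMMIT
  ENABLE QUERY REWRITE
AS
SELECT t.ROWID              t_rid,      -- rowids of all base tables: required for fast refresh
       u.ROWID              u_rid,
       c.ROWID              c_rid,
       t.sentence_id        t_sentence_id,
       u.sentence_id        u_sentence_id,
       t.isn                t_isn,
       u.isn                u_isn,
       t.sentence_length    t_sentence_length,
       u.sentence_length    u_sentence_length,
       c.correlation_factor correlation_factor
FROM   plag_term_correlations3 c,       -- old-style join syntax on purpose, as noted above
       plag_terms t,
       plag_user_terms u
WHERE  t.term_root = c.term1
AND    u.term_root = c.term2;
The insert-select can then aggregate from plag_sim_mv alone instead of joining the three tables.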
As for the physical organization, consider
hash partitioning by Plag_User_Terms.ISN (with a sufficiently high number of partitions; don't hesitate to use e.g. 1024 partitions if that seems reasonable) if you want to do a bulk calculation over all values of ISN (see the sketch after this list)
single-table hash clustering by Plag_User_Terms.ISN if you want to retain your calculation over a single ISN
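As an illustration of the first option, a hypothetical hash-partitioned version of the user-terms table; the column list is guessed from the columns the query uses, so adjust it to the real table definition:
CREATE TABLE plag_user_terms_part (
  isn              NUMBER NOT NULL,
  sentence_id      NUMBER NOT NULL,
  term_root        VARCHAR2(100),
  sentence_length  NUMBER
)
PARTITION BY HASH (isn) PARTITIONS 1024;   -- 1024 hash partitions keyed on ISN, as suggested above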
If you don't have spare disk space, then just hint your query to
either use nested loops joins, since the number of rows processed seems to be quite low (judging by the estimates in the execution plan)
or full-scan the plag_term_correlations3 table in parallel
Bottom line: constrain your tables with foreign keys, check constraints, not-null constraints, unique constraints, everything! The Oracle optimizer is capable of using most of this information to its advantage, as are the people who tune SQL queries.
I had an Oracle query as below that took 10 minutes or longer to run:
select
r.range_text as duration_range,
nvl(count(c.call_duration),0) as calls,
nvl(SUM(call_duration),0) as total_duration
from
call_duration_ranges r
left join
big_table c
on c.call_duration BETWEEN r.range_lbound AND r.range_ubound
and c.aaep_src = 'MAIN_SOURCE'
and c.calltimestamp_local >= to_date('01-02-2014 00:00:00' ,'dd-MM-yyyy HH24:mi:ss')
AND c.calltimestamp_local <= to_date('28-02-2014 23:59:59','dd-MM-yyyy HH24:mi:ss')
and c.destinationnumber LIKE substr( 'abc:1301#company.com:5060;user=phone',1,8) || '%'
group by
r.range_text
order by
r.range_text
If I changed the date part of the query to:
(c.calltimestamp_local+0) >= to_date('01-02-2014 00:00:00','dd-MM-yyyy HH24:mi:ss')
AND (c.calltimestamp_local+0) <= to_date('28-02-2014 23:59:59','dd-MM-yyyy HH24:mi:ss')
It runs in 2 seconds. I did this based on another post, to avoid using the date index. Seems counter-intuitive though -- the index slowing things down so much.
I ran the explain plan and it seems identical between the original and the updated query. The only difference is that the MERGE JOIN operation is 16,269 bytes in the old query and 1,218 bytes in the new query. Cardinality is actually higher in the old query as well. And I don't see an "INDEX" operation in the explain plan of either query, except for the index on the destinationnumber field.
So why is the index slowing down the query so much? What can I do to the index? I don't think using the "+0" is the best solution going forward...
Querying for two days of data, suppressing use of destinationnumber index:
0 SELECT STATEMENT ALL_ROWS 329382 1218 14
1 SORT GROUP BY 329382 1218 14
2 MERGE JOIN OUTER 329381 1218 14
3 SORT JOIN 4 308 14
4 TABLE ACCESS FULL CALL_DURATION_RANGES ANALYZED 3 308 14
5 FILTER
6 SORT JOIN 329377 65 1
7 TABLE ACCESS BY GLOBAL INDEX ROWID BIG_TABLE ANALYZED 329376 65 1
8 INDEX RANGE SCAN IDX_CDR_CALLTIMESTAMP_LOCAL ANALYZED 1104 342104
Querying for 2 days using destinationnumber index:
0 SELECT STATEMENT ALL_ROWS 11 1218 14
1 SORT GROUP BY 11 1218 14
2 MERGE JOIN OUTER 10 1218 14
3 SORT JOIN 4 308 14
4 TABLE ACCESS FULL CALL_DURATION_RANGES ANALYZED 3 308 14
5 FILTER
6 SORT JOIN 6 65 1
7 TABLE ACCESS BY GLOBAL INDEX ROWID BIG_TABLE ANALYZED 5 65 1
8 INDEX RANGE SCAN IDX_DESTINATIONNUMBER_PART ANALYZED 4 4
Querying for one month, suppressing destinationnumber index--full scan:
0 SELECT STATEMENT ALL_ROWS 824174 1218 14
1 SORT GROUP BY 824174 1218 14
2 MERGE JOIN OUTER 824173 1218 14
3 SORT JOIN 4 308 14
4 TABLE ACCESS FULL CALL_DURATION_RANGES ANALYZED 3 308 14
5 FILTER
6 SORT JOIN 824169 65 1
7 PARTITION RANGE ALL 824168 65 1
8 TABLE ACCESS FULL BIG_TABLE ANALYZED 824168 65 1
Seems counter-intuitive though -- the index slowing things down so much.
Counter-intuitive only if you don't understand how indexes work.
Indexes are good for retrieving individual rows. They are not suited to retrieving large numbers of records. You haven't bothered to provide any metrics, but it seems likely your query is touching a large number of rows, in which case a full table scan or other set-based operation will be much more efficient.
Tuning date range queries is tricky, because it's very hard for the database to know how many records lie between the two bounds, no matter how up-to-date our statistics are. (Even more tricky to tune when the date bounds can vary - one day is a different matter from one month or one year.) So often we need to help the optimizer by using our knowledge of our data.
I don't think using the "+0" is the best solution going forward...
Why not? People have been using that technique to avoid using an index in a specific query for literally decades.
However, there are more modern solutions. The undocumented cardinality hint is one:
select /*+ cardinality(big_table,10000) */
... should be enough to dissuade the optimizer from using an index - provided you have accurate statistics gathered for all the tables in the query.
Alternatively you can force the optimizer to do a full table scan with ...
select /*+ full(big_table) */
Anyway, there's nothing you can do to the index to change the way databases work. You could make things faster with partitioning, but I would guess if your organisation had bought the Partitioning option you'd be using it already.
These are the reasons using an index slows down a query:
A full tablescan would be faster. This happens if a substantial fraction of rows has to be retrieved. The concrete numbers depend on various factors, but as a rule of thumb in common situations using an index is slower if you retrieve more than 10-20% of your rows.
Using another index would be even better, because fewer rows are left after the first stage. Using a certain index on a table usually means that other indexes cannot be used.
Now it is the optimizer's job to decide which variant is best. To perform this task, it has to estimate (among other things) how many rows are left after applying certain filtering clauses. This estimate is based on the table's statistics and is usually quite good. It even takes skewed data into account, but it might be off if your statistics are outdated or you have a rather uncommon distribution of data. For example, if you computed your statistics before the data for February was inserted, the optimizer might wrongly conclude that there are only a few (if any) rows left after applying the date range filter.
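For instance, simply refreshing the statistics gives the optimizer current information about the date column (a minimal sketch, assuming BIG_TABLE is in your own schema):
BEGIN
  dbms_stats.gather_table_stats(
    ownname => USER,
    tabname => 'BIG_TABLE',
    cascade => TRUE);   -- also refresh the statistics of the table's indexes
END;
/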
Using combined indexes on several columns might also be an option, depending on your data.
Another note on the "skewed data" issue: there are cases where the optimizer detects skewed data in column A if you have an index on column A, but not if you only have a combined index on columns A and B, because the combination might make the distribution more even. This is one of the few cases where an index on (A, B) does not make an index on A redundant.
APC's answer shows how to use hints to steer the optimizer in the right direction if it still produces wrong plans even with correct statistics.
My application allows users to collect measurement data as part of an experiment, and needs to have the ability to report on all of the measurements ever taken.
Below is a very simplified version of the tables I have:
CREATE TABLE EXPERIMENTS(
EXPT_ID INT,
EXPT_NAME VARCHAR2(255 CHAR)
);
CREATE TABLE USERS(
USER_ID INT,
EXPT_ID INT
);
CREATE TABLE SAMPLES(
SAMPLE_ID INT,
USER_ID INT
);
CREATE TABLE MEASUREMENTS(
MEASUREMENT_ID INT,
SAMPLE_ID INT,
MEASUREMENT_PARAMETER_1 NUMBER,
MEASUREMENT_PARAMETER_2 NUMBER
);
In my database there are 2000 experiments, each of which has 18 users. Each user has 6 samples to measure, and would do 100 measurements per sample.
This means that there are 2000 * 18 * 6 * 100 = 21600000 measurements currently stored in the database.
I'm trying to write a query that will get the AVG() of measurement parameter 1 and 2 for each user - that would return about 36,000 rows.
The query I have is extremely slow - I've left it running for over 30 minutes and it doesn't come back with anything. My question is: is there an efficient way of getting the averages? And is it actually possible to get results back for this amount of data in a reasonable time, say 2 minutes? Or am I being unrealistic?
Here's (again a simplified version) the query I have:
SELECT
E.EXPT_ID,
U.USER_ID,
AVG(MEASUREMENT_PARAMETER_1) AS AVG_1,
AVG(MEASUREMENT_PARAMETER_2) AS AVG_2
FROM
EXPERIMENTS E,
USERS U,
SAMPLES S,
MEASUREMENTS M
WHERE
U.EXPT_ID = E.EXPT_ID
AND S.USER_ID = U.USER_ID
AND M.SAMPLE_ID = S.SAMPLE_ID
GROUP BY E.EXPT_ID, U.USER_ID
This will return a row for each expt_id/user_id combination and the average of the 2 measurement parameters.
For your query, in any case, the DBMS needs to read the complete measurements table. This is by far the biggest part of the data to read, and the part that takes the most time if the query is optimized well (more on that later). That means the minimum runtime of your query is roughly the time it takes to read the complete measurements table from wherever it is stored. You can get a rough estimate by checking how much data that is (in MB or GB) and how long it would take to read that amount of data from the hard disk (or wherever the table lives). If your query runs slower than that by a factor of 5 or more, you can be sure there is room for optimization.
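A quick way to get that rough estimate is to check the segment size in the data dictionary (a sketch, assuming the table is in your own schema):
SELECT ROUND(bytes / 1024 / 1024) AS size_mb
FROM   user_segments
WHERE  segment_name = 'MEASUREMENTS';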
There is a vast amount of information (tutorials, individual hints which can be invaluable, and lists of general practices) about how to optimize Oracle queries. You will not get through all of it quickly. But if you provide the execution plan of your query (this is what Oracle's query optimizer thinks is the best way to fulfil your query), we will be able to spot the steps which can be optimized and suggest solutions.
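To produce that execution plan, something like the following should work (using the simplified query from the question):
EXPLAIN PLAN FOR
SELECT e.expt_id, u.user_id,
       AVG(m.measurement_parameter_1) AS avg_1,
       AVG(m.measurement_parameter_2) AS avg_2
FROM   experiments e, users u, samples s, measurements m
WHERE  u.expt_id = e.expt_id
AND    s.user_id = u.user_id
AND    m.sample_id = s.sample_id
GROUP BY e.expt_id, u.user_id;

SELECT * FROM TABLE(dbms_xplan.display);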
I created an index for one table, a simple index just like that:
CREATE INDEX IDX_TRANSACAO_NOVA_STATUS ON TRANSACAO_NOVA(STATUS) TABLESPACE COMVENIF;
This table has 1,000K records inside, and the status column has just 5 or 6 possible values. After creating the index I expected that the query below would perform better:
select * from transacao_nova tn where tn.status = 'XXX'
But the explain plan still shows me a full scan with a cost of 16,000.
Any help? I'm not a DBA, but I need to improve this performance.
Thanks in advance.
If there are only 5 or 6 different status values and a million records, the query optimizer may be deciding it is not worth using the index for a range scan that would still return a substantial fraction of the records in the table.
You might look into using an index-clustered table for this application.
If the data in the status column is skewed (not uniform: some values appear very often and others very rarely), you can speed up queries for the rare values by refreshing the statistics (and verifying that a histogram is calculated for the status column). This will make Oracle use the index in the cases where it is more efficient.
http://docs.oracle.com/cd/E11882_01/server.112/e16638/stats.htm#autoId12
Be aware that automatically determining if a column needs a histogram is not a good idea as it may lead to inconsistent behaviour. It is better to manually specify histograms when needed. Also, histograms affect every query that uses those columns, so they should be collected with care.
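A minimal sketch of gathering such a histogram on STATUS explicitly (SIZE 254 requests up to 254 buckets; with only 5 or 6 distinct values, each value gets its own bucket):
BEGIN
  dbms_stats.gather_table_stats(
    ownname    => USER,
    tabname    => 'TRANSACAO_NOVA',
    method_opt => 'FOR COLUMNS STATUS SIZE 254');   -- explicit histogram on STATUS
END;
/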
You might need to generate new statistics on the table.
http://docs.oracle.com/cd/B19306_01/server.102/b14211/stats.htm
A common mistake is to assume that an index range scan will be better than a full scan because you only want some "small" fraction of the total rows in the table. But if the rows you want are scattered throughout the table's storage extents, locating them by an index lookup can be slower than just scanning the entire table. I can't say for sure that's the case in your situation, but it's a possibility.
For a more in-depth discussion of this topic I recommend this paper.