I am trying to do performance tuning on a SQL query in Oracle 12c which is using a window partition. There's an index created on HUB_POL_KEY, PIT_EFF_START_DT on the table PIT. While running the explain plan with /*+ gather_plan_statistics */ hint, I observed there's a Window Sort Step in the Explain Plan which is having an Estimated Row Count of 5000K and an Actual Row Count of 1100. I executed DBMS_STATS.GATHER_TABLE_STATS on the table. When I checked in USER_TAB_COLUMNS table, I see there's no histogram generated for HUB_POL_KEY, PIT_EFF_START_DT. However, there's histogram existing for all other columns.
SQL Query
SELECT
PIT.HUB_POL_KEY,
NVL(LEAD(PIT.PIT_EFF_START_DT) OVER (PARTITION BY PIT.HUB_POL_KEY ORDER BY PIT.PIT_EFF_START_DT) ,TO_DATE('31.12.9999', 'DD.MM.YYYY')) EFF_END_DT
FROM PIT
1st Try:
EXEC DBMS_STATS.GATHER_TABLE_STATS('stg','PIT');
2nd Try:
EXEC DBMS_STATS.GATHER_TABLE_STATS('stg','PIT', method_opt=>('FOR COLUMNS SIZE 254 (HUB_POL_KEY,PIT_EFF_START_DT)'));
Checking Histogram:
SELECT HISTOGRAM FROM USER_TAB_COLUMNS
WHERE TABLE_NAME = 'PIT'
AND COLUMN_NAME IN ('HUB_POL_KEY','PIT_EFF_START_DT') --NONE
Table Statistics:
SELECT COUNT(*) FROM PIT --5570253
SELECT COLUMN_NAME,NUM_DISTINCT,NUM_BUCKETS,HISTOGRAM FROM USER_TAB_COL_STATISTICS
WHERE TABLE_NAME = 'PIT'
AND COLUMN_NAME IN ('HUB_POL_KEY','PIT_EFF_START_DT')
+------------------+--------------+-------------+-----------+
| COLUMN_NAME | NUM_DISTINCT | NUM_BUCKETS | HISTOGRAM |
+------------------+--------------+-------------+-----------+
| HUB_POL_KEY | 4703744 | 1 | NONE |
| PIT_EFF_START_DT | 154416 | 1 | NONE |
+------------------+--------------+-------------+-----------+
What am I missing here? Why is the bucket size 1 even when I am running the gather_table_stat procedure with method_opt specifying a size?
The correct syntax as per Oracle documentation should be method_opt=>('FOR COLUMNS (HUB_POL_KEY,PIT_EFF_START_DT) SIZE 254'). Trying it did not create the histogram stats as expected, though (maybe a bug ¯_(ツ)_/¯).
On the other side using method_opt=>('FOR ALL COLUMNS SIZE 254') or method_opt=>('FOR COLUMNS <column_name> SIZE 254') is working fine.
Probably a workaround would be then to gather stats for columns separately:
EXEC DBMS_STATS.GATHER_TABLE_STATS('stg','PIT', method_opt=>('FOR COLUMNS HUB_POL_KEY SIZE 254'));
EXEC DBMS_STATS.GATHER_TABLE_STATS('stg','PIT', method_opt=>('FOR COLUMNS PIT_EFF_START_DT SIZE 254'));
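After gathering the column statistics separately, one way to check whether the estimate for the Window Sort step has moved closer to reality is to re-run the query with the same hint and read the plan from the cursor cache. This is only a sketch: it reuses the query from the question and assumes that all rows are fetched before the plan is displayed (otherwise the actual row counts are incomplete) and, in SQL*Plus, that serveroutput is off so the NULL arguments pick up the right statement.

```sql
--Re-run the query from the question with runtime statistics enabled.
SELECT /*+ gather_plan_statistics */
PIT.HUB_POL_KEY,
NVL(LEAD(PIT.PIT_EFF_START_DT) OVER (PARTITION BY PIT.HUB_POL_KEY ORDER BY PIT.PIT_EFF_START_DT) ,TO_DATE('31.12.9999', 'DD.MM.YYYY')) EFF_END_DT
FROM PIT;

--Display the plan of the last statement executed in this session.
--E-Rows is the optimizer's estimate, A-Rows is what actually happened.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL, NULL, 'ALLSTATS LAST'));
```

Comparing the E-Rows and A-Rows columns on the WINDOW SORT line shows whether the new statistics made any difference.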
In Oracle I can get the size of a table. I would like to estimate the size of a view (non materialized). Is it possible?
I know that views don't have any data per se, but we are moving the data to our data lake and would like to estimate it. Knowing the size we will be able to optimize our resources and speed up the process
You can use EXPLAIN PLAN to estimate the number of bytes and rows that will be returned by reading the entire view. But keep in mind that these numbers are only estimates, they depend on having current statistics, and they will be less accurate for more complicated queries.
For example, on my system, EXPLAIN PLAN estimates that a somewhat complicated metadata view will return 34 MB and 75,590 rows, whereas the actual values are roughly 14 MB and 85,402 rows.
Commands:
explain plan for select * from dba_objects;
select * from table(dbms_xplan.display);
Results:
Plan hash value: 3423780594
------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 75590 | 34M| 134K (1)| 00:00:06 |
| 1 | VIEW | DBA_OBJECTS | 75590 | 34M| 134K (1)| 00:00:06 |
| 2 | UNION-ALL | | | | | |
...
Estimate multiple views in a single query
With a few tricks you can create estimates for multiple views all within a single query. This solution requires Oracle 12.1 or higher. The WITH FUNCTION syntax is a bit odd, and some IDEs struggle with it, so you might have to play around with the semicolon and slash at the end.
--Create sample views from data dictionary views.
create or replace view view1 as select * from all_tables;
create or replace view view2 as select * from all_tab_privs;
create or replace view view3 as select * from all_objects;
--Get the estimated size of each query. The actual values will differ for your database.
with function get_bytes(p_view_name varchar2) return number is
v_bytes number;
--(Needed because "explain plan" is technically DML, which normally shouldn't be executed inside a query.)
pragma autonomous_transaction;
begin
--Create an explain plan for reading everything from the view.
execute immediate replace(q'[explain plan set statement_id = '#VIEW_NAME#' for select * from #VIEW_NAME#]', '#VIEW_NAME#', p_view_name);
--Get the size in bytes.
--Latest plan information. (In case the explain plan was generated multiple times.)
select max(bytes)
into v_bytes
from
(
--Plan information.
select bytes, timestamp, max(timestamp) over () latest_timestamp
from plan_table
where statement_id = p_view_name and id = 0
)
where latest_timestamp = timestamp;
--As part of the AUTONOMOUS_TRANSACTION feature, the function must either commit or rollback.
rollback;
return v_bytes;
end;
select view_name, round(get_bytes(view_name) / 1024 / 1024, 1) mb
from user_views
order by mb desc, view_name;
/
Results:
VIEW_NAME MB
------------ ----------
VIEW3 2.4
VIEW1 .8
VIEW2 .7
I am trying to come up with an example showing that indexes can have a dramatic (orders of magnitude) effect on query execution time. After hours of trial and error I am still at square one. Namely, the speed-up is not large even when the execution plan shows using the index.
Since I realized that I need a large table for the index to make a difference, I wrote the following script (using Oracle 11g Express):
CREATE TABLE many_students (
student_id NUMBER(11),
city VARCHAR(20)
);
DECLARE
nStudents NUMBER := 1000000;
nCities NUMBER := 10000;
curCity VARCHAR(20);
BEGIN
FOR i IN 1 .. nStudents LOOP
curCity := ROUND(DBMS_RANDOM.VALUE()*nCities, 0) || ' City';
INSERT INTO many_students
VALUES (i, curCity);
END LOOP;
COMMIT;
END;
I then tried quite a few queries, such as:
select count(*)
from many_students M
where M.city = '5467 City';
and
select count(*)
from many_students M1
join many_students M2 using(city);
and a few other ones.
I have seen this post and think that my queries satisfy the requirements stated in the replies there. However, none of the queries I tried showed dramatic improvement after building an index: create index myindex on many_students(city);
Am I missing some characteristic that distinguishes a query for which an index makes a dramatic difference? What is it?
The test case is a good start but it needs a few more things to get a noticeable performance difference:
Realistic data sizes. One million rows of two small values is a small table. With a table that small the performance difference between a good and a bad execution plan may not matter much.
The script below doubles the table size until it reaches 64 million rows. It takes about 20 minutes on my machine. (To make it go quicker for larger sizes, you could make the table nologging and add an /*+ append */ hint to the insert.)
--Increase the table to 64 million rows. This took 20 minutes on my machine.
insert into many_students select * from many_students;
insert into many_students select * from many_students;
insert into many_students select * from many_students;
insert into many_students select * from many_students;
insert into many_students select * from many_students;
insert into many_students select * from many_students;
commit;
--The table has about 1.375GB of data. The actual size will vary.
select bytes/1024/1024/1024 gb from dba_segments where segment_name = 'MANY_STUDENTS';
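The quicker variant mentioned above (nologging plus the append hint) might look roughly like the sketch below. It is untested against the timings quoted here. Note that a direct-path insert has to be committed before the same session can read or modify the table again, so the commit moves inside the repeated step instead of coming once at the end.

```sql
--Reduce redo generation for direct-path loads into this table.
alter table many_students nologging;

--Repeat this pair of statements once per doubling.
--The commit is required after each direct-path (append) insert
--before the table can be queried or loaded again in this session.
insert /*+ append */ into many_students select * from many_students;
commit;
```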
Gather statistics. Always gather statistics after large table changes. The optimizer cannot do its job well unless it has table, column, and index statistics.
begin
dbms_stats.gather_table_stats(user, 'MANY_STUDENTS');
end;
/
Use hints to force a good and bad plan. Optimizer hints should usually be avoided. But they are helpful for quickly comparing different plans, because they let you force a specific plan.
For example, this will force a full table scan:
select /*+ full(M) */ count(*) from many_students M where M.city = '5467 City';
But you'll also want to verify the execution plan:
explain plan for select /*+ full(M) */ count(*) from many_students M where M.city = '5467 City';
select * from table(dbms_xplan.display);
Flush the cache. Caching is probably the main culprit behind the index and full table scan queries taking the same amount of time. If the table fits entirely in memory then the time to read all the rows may be almost too small to measure. The number could be dwarfed by the time to parse the query or to send a simple result across the network.
This command will force Oracle to remove almost everything from the buffer cache. This will help you test a "cold" system. (You probably do not want to run this statement on a production system.)
alter system flush buffer_cache;
However, that won't flush the operating system or SAN cache. And maybe the table really would fit in memory on production. If you need to test a fast query it may be necessary to put it in a PL/SQL loop.
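A minimal sketch of such a PL/SQL loop, assuming 1000 iterations is enough to get a measurable time (the iteration count is an arbitrary choice, not a recommendation). DBMS_UTILITY.GET_TIME returns elapsed time in hundredths of a second, and the result is only printed if server output is enabled (set serveroutput on in SQL*Plus).

```sql
declare
	v_count number;
	--Elapsed-time counter in hundredths of a second.
	v_start number := dbms_utility.get_time;
begin
	--Run the fast query many times so the total is large enough to measure.
	for i in 1 .. 1000 loop
		select count(*) into v_count
		from many_students M
		where M.city = '5467 City';
	end loop;

	dbms_output.put_line('Total seconds: ' || (dbms_utility.get_time - v_start) / 100);
end;
/
```

Run the same block with the /*+ full(M) */ hint added to compare the two plans. Keep in mind that after the first iteration the data is cached, so this measures the "warm" case.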
Multiple, alternating runs. There are many things happening in the background, like caching and other processes. It's so easy to get bad results because something unrelated changed on the system.
Maybe the first run takes extra long to put things in a cache. Or maybe some huge job was started between queries. To avoid those issues, alternate running the two queries. Run them five times, throw out the highs and lows, and compare the averages.
For example, copy and paste the statements below five times and run them. (If using SQL*Plus, run set timing on first.) I already did that and posted the times I got in a comment before each line.
--Seconds: 0.02, 0.02, 0.03, 0.234, 0.02
alter system flush buffer_cache;
select count(*) from many_students M where M.city = '5467 City';
--Seconds: 4.07, 4.21, 4.35, 3.629, 3.54
alter system flush buffer_cache;
select /*+ full(M) */ count(*) from many_students M where M.city = '5467 City';
Testing is hard. Putting together decent performance tests is difficult. The above rules are only a start.
This might seem like overkill at first. But it's a complex topic. And I've seen so many people, including myself, waste a lot of time "tuning" something based on a bad test. Better to spend the extra time now and get the right answer.
An index really shines when the database doesn't need to go to every row in a table to get your results. So COUNT(*) isn't the best example. Take this for example:
alter session set statistics_level = 'ALL';
create table mytable as select * from all_objects;
select * from mytable where owner = 'SYS' and object_name = 'DUAL';
---------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
---------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 300 |00:00:00.01 | 12 |
| 1 | TABLE ACCESS FULL| MYTABLE | 1 | 19721 | 300 |00:00:00.01 | 12 |
---------------------------------------------------------------------------------------
So, here, the database does a full table scan (TABLE ACCESS FULL), which means it has to visit every row in the table, which means it has to load every block of the table from disk. Lots of I/O. The optimizer estimated 19721 rows (E-Rows), far more than the query actually returns.
Compare that with this:
create index myindex on mytable( owner, object_name );
select * from mytable where owner = 'SYS' and object_name = 'JOB$';
select * from table( dbms_xplan.display_cursor( null, null, 'ALLSTATS LAST' ));
----------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads |
----------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.01 | 3 | 2 |
| 1 | TABLE ACCESS BY INDEX ROWID| MYTABLE | 1 | 2 | 1 |00:00:00.01 | 3 | 2 |
|* 2 | INDEX RANGE SCAN | MYINDEX | 1 | 1 | 1 |00:00:00.01 | 2 | 2 |
----------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("OWNER"='SYS' AND "OBJECT_NAME"='JOB$')
Here, because there's an index, it does an INDEX RANGE SCAN to find the rowids for the table that match our criteria. Then, it goes to the table itself (TABLE ACCESS BY INDEX ROWID) and looks up only the rows we need and can do so efficiently because it has a rowid.
And even better, if you happen to be looking for something that is entirely in the index, the scan doesn't even have to go back to the base table. The index is enough:
select count(*) from mytable where owner = 'SYS';
select * from table( dbms_xplan.display_cursor( null, null, 'ALLSTATS LAST' ));
------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads |
------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.01 | 46 | 46 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.01 | 46 | 46 |
|* 2 | INDEX RANGE SCAN| MYINDEX | 1 | 8666 | 9294 |00:00:00.01 | 46 | 46 |
------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("OWNER"='SYS')
Because my query involved only the owner column, and that's contained in the index, it never needs to go back to the base table to look anything up there. So the index scan is enough, and then it does an aggregation to count the rows. This scenario is a little less than perfect, because the index is on (owner, object_name) and not just owner, but it's definitely better than doing a full table scan on the main table.
I have a table in Oracle with two columns. The first column sometimes contains duplicate values, each of which corresponds to a different value in the second column. How can I write a query that shows only unique values of the first column and all possible values from the second column?
The table looks somewhat like below
COLUMN_1 | COLUMN_2
NUMBER_1 | 4
NUMBER_2 | 4
NUMBER_3 | 1
NUMBER_3 | 6
NUMBER_4 | 3
NUMBER_4 | 4
NUMBER_4 | 5
NUMBER_4 | 6
You can use listagg() if you are using Oracle 11g Release 2 or higher, like
SELECT
COLUMN_1,
LISTAGG(COLUMN_2, '|') WITHIN GROUP (ORDER BY COLUMN_2) "ListValues"
FROM table1
GROUP BY COLUMN_1
Else, see this link for an alternative for lower versions
Oracle equivalent of MySQL group_concat
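To try the LISTAGG query above without creating a table, here is a self-contained sketch that builds the sample rows from the question in a WITH clause. The table1 name comes from the answer; the list_values alias and the final ORDER BY are additions for readability.

```sql
with table1 as (
	--Sample rows copied from the question.
	select 'NUMBER_1' column_1, 4 column_2 from dual union all
	select 'NUMBER_2', 4 from dual union all
	select 'NUMBER_3', 1 from dual union all
	select 'NUMBER_3', 6 from dual union all
	select 'NUMBER_4', 3 from dual union all
	select 'NUMBER_4', 4 from dual union all
	select 'NUMBER_4', 5 from dual union all
	select 'NUMBER_4', 6 from dual
)
select
	column_1,
	listagg(column_2, '|') within group (order by column_2) list_values
from table1
group by column_1
order by column_1;
```

With this data, NUMBER_3 comes back as 1|6 and NUMBER_4 as 3|4|5|6, while NUMBER_1 and NUMBER_2 each show a single 4.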
I'm trying to optimize a set of stored procs which are going against many tables including this view. The view is as such:
We have TBL_A (id, hist_date, hist_type, other_columns) with two types of rows: hist_type 'O' vs. hist_type 'N'. The view self joins table A to itself and transposes the N rows against the corresponding O rows. If no N row exists for the O row, the O row values are repeated. Like so:
CREATE OR REPLACE FORCE VIEW V_A (id, hist_date, hist_type, other_columns_o, other_columns_n) AS
select
o.id, o.hist_date, o.hist_type,
o.other_columns as other_columns_o,
case when n.id is not null then n.other_columns else o.other_columns end as other_columns_n
from
TBL_A o left outer join TBL_A n
on o.id=n.id and o.hist_date=n.hist_date and n.hist_type = 'N'
where o.hist_type = 'O';
TBL_A has a unique index on: (id, hist_date, hist_type). It also has a unique index on: (hist_date, id, hist_type) and this is the primary key.
The following query is at issue (in a stored proc, with x declared as TYPE_TABLE_OF_NUMBER):
select b.id BULK COLLECT into x from TBL_B b where b.parent_id = input_id;
select v.id from v_a v
where v.id in (select column_value from table(x))
and v.hist_date = input_date
and v.status_new = 'CLOSED';
This query ignores the index on id column when accessing TBL_A and instead does a range scan using the date to pick up all the rows for the date. Then it filters that set using the values from the array. However if I simply give the list of ids as a list of numbers the optimizer uses the index just fine:
select v.id from v_a v
where v.id in (123, 234, 345, 456, 567, 678, 789)
and v.hist_date = input_date
and v.status_new = 'CLOSED';
The problem also doesn't exist when going against TBL_A directly (and I have a workaround that does that, but it's not ideal). Is there a way to get the optimizer to first retrieve the array values and use them as predicates when accessing the table? Or a good way to restructure the view to achieve this?
Oracle does not use the index because it assumes select column_value from table(x) returns 8168 rows.
Indexes are faster for retrieving small amounts of data. At some point it's faster to scan the whole table than repeatedly walk the index tree.
Estimating the cardinality of a regular SQL statement is difficult enough. Creating an accurate estimate for procedural code is almost impossible. But I don't know where they came up with 8168. Table functions are normally used with pipelined functions in data warehouses, so a sorta-large number makes sense.
Dynamic sampling can generate a more accurate estimate and likely generate a plan that will use the index.
Here's an example of a bad cardinality estimate:
create or replace type type_table_of_number as table of number;
explain plan for
select * from table(type_table_of_number(1,2,3,4,5,6,7));
select * from table(dbms_xplan.display(format => '-cost -bytes'));
Plan hash value: 1748000095
-------------------------------------------------------------------------
| Id | Operation | Name | Rows | Time |
-------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 8168 | 00:00:01 |
| 1 | COLLECTION ITERATOR CONSTRUCTOR FETCH| | 8168 | 00:00:01 |
-------------------------------------------------------------------------
Here's how to fix it:
explain plan for select /*+ dynamic_sampling(2) */ *
from table(type_table_of_number(1,2,3,4,5,6,7));
select * from table(dbms_xplan.display(format => '-cost -bytes'));
Plan hash value: 1748000095
-------------------------------------------------------------------------
| Id | Operation | Name | Rows | Time |
-------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 7 | 00:00:01 |
| 1 | COLLECTION ITERATOR CONSTRUCTOR FETCH| | 7 | 00:00:01 |
-------------------------------------------------------------------------
Note
-----
- dynamic statistics used: dynamic sampling (level=2)
The following query performs very poorly, due to the "order by". My goal is to get only a small subset of the resultset (using ROWNUM, for example). However, when I add "order by" it goes through the entire resultset performing an index lookup for each record, which makes it extremely slow. Without sorting the query is about 100 times faster when I limit the resultset to, for example, 1000 records.
QUERY:
SELECT text_field
from mytable where
contains(text_field,'ABC', 1)>0
order by another_field;
THIS IS HOW I CREATED THE INDEX:
CREATE INDEX myindex ON mytable (text_field) INDEXTYPE IS ctxsys.context FILTER BY another_field
EXECUTION PLAN:
---------------------------------------------------------------
| Id | Operation | Name |
---------------------------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | SORT ORDER BY | |
| 2 | TABLE ACCESS BY INDEX ROWID| MYTABLE |
|* 3 | DOMAIN INDEX | MYINDEX |
---------------------------------------------------------------
I also used CTXCAT instead of CONTEXT, and no improvement. I think the problem is, when I want the results sorted (only top 1000), it performs an index lookup for each record in the "entire" resultset. Is there a way to avoid that?
Thank you.
To have the ordering applied before the rownum filter, you need to use an in-line view:
SELECT text_field
from (
SELECT text_field
from mytable where
contains(text_field,'ABC', 1)>0
order by another_field
)
where rownum <= 1000;
With your index in place Oracle should optimise this to do as little work as possible. You should see 'sort order by stopkey' and 'count stopkey' steps in the plan, which is Oracle being clever and knowing it only needs to get 1000 values from the index.
If you don't use the in-line view but just add the rownum filter to your original query, Oracle will still optimise it, but, as you state, it will order the first 1000 random (or indeterminate, anyway) rows it finds, because of the sequence of operations it performs.
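For contrast, the variant without the in-line view would look something like this sketch, which reuses the table and column names from the question. It is shown only to illustrate the ordering problem and is not the recommended form.

```sql
--ROWNUM is evaluated before ORDER BY here, so this picks up to 1000
--arbitrary matching rows and sorts only those. It does NOT return the
--first 1000 rows in another_field order.
SELECT text_field
from mytable where
contains(text_field,'ABC', 1)>0
and rownum <= 1000
order by another_field;
```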