How Oracle uses statistics data - oracle

In previous question I got comment about Oracle statistics:
Oracle doesn't know that 50M is greater than the number of rows. Sure, it has statistics, but they could be old and wrong - and Oracle would never allow itself to deliver an incorrect result only because the statistics are wrong
I was pretty sure that Oracle relies on statistics when preparing query execution plan. Before version 10 it was recommended to refresh statistics from time to time and from 10g Oracle gathers statistics automatically.
Can somebody explain how much Oracle query analyzer relies on statistics data?

Oracle uses statistics a lot, to generate query execution plans. What it does not (and should not) do is use those statistics in a way that will affect query results, which is what you were trying to do with "ROWNUM < 50000000". The statistics may be out of date, or missing. However, this will only mean that Oracle may be slow to generate the correct result, it does not mean that Oracle will return an incorrect result.
If Oracle worked as you hoped, then it might decide that "ROWNUM < 50000000" meant "get all rows" even though the table now contained 60,000,000 rows (but had out of date stats saying it contained only 49,000,000). Fortunately it does not.

Statistics are VERY important for the query optimizer. They should be gathered on a regular basis either automatically or manually.
When executing a query Oracle will produce a pool of available execution plans in order to satisfy your query. Those execution plans are the same from the standpoint that they will return you the same exact result, it's just the road to getting there may be far more efficient for one plan over another. To determine this efficiency, Oracle uses the stats generated on the objects used in each of the execution plans to determine their individual costs. If those stats are non-existent or are stale, the cost associated with each plan will be less accurate and therefore the optimal plan may not be chosen.
Here are some of the key stats that Oracle uses for determining this cost:
Table statistics
* Number of rows
* Number of blocks
* Average row length
Column statistics
* Number of distinct values (NDV) in column
* Number of nulls in column
* Data distribution (histogram)
* Extended statistics
Index statistics
* Number of leaf blocks
* Levels
* Clustering factor
System statistics
* I/O performance and utilization
* CPU performance and utilization

Statistics are used by the oracle cost based optimizer (CBO) to calculate the relative costs of different ways of executing a query so that the most appropriate one can be chosen.
On the whole this works very well, and is being continually improved. For example in 11g you can gather multicolumn histograms that help greatly with queries having predicates on correlated columns (eg. strongly correlated like month of birth and star sign, or more weakly correlted like gender and height).
However it is not perfect. For example estimating the cardinality of the result set of a join between two tables is reasonably accurate, as is estimating the cardinality from a filter operation, but combining the two requires a lot of estimation that can easily be inaccurate. In some cases these issues can be worked around with hints, or with the use of global temporary tables for intermediate result sets.
Another problem of statistics is that changing them can change the execution plan, so there is more of a movement recently to either discourage continual gathering of statistics, or to analyse the impact of changes to statistics before implementing them.
Look for the Jonathan Lewis book -- it is a very thorough treatment of the subject.

Related

Oracle SQL table and index statistics

I've been doing some reading on gathering table and index statistics on Oracle databases but it's left me ... confused.
For the sake of argument, let's assume Oracle 11gR2 as the RDBMS. Regarding gathering table and index statistics, when should it be done, which is the preferred way of doing it, and does Oracle really automatically gather the necessary statistics for us?
Regarding the first point: when should it be done. I've read that, as a rule of thumb, gathering table and index statistics should be done after around 10% of the table's records have been modified (inserted, updated, etc) since the last time the table was analyzed.
Regarding the second point: which is the preferred way of doing it. If we want to calculate both table and index statistics, does executing DBMS_STATS.GATHER_TABLE_STATS with default options, assuming the table is not partitioned, suffice?
Regarding the third point:does Oracle really gather the necessary statistics automatically for us. If this is the case, should i not worry abouth gathering table statistics (see points 1 and 2)?
Thanks in advance.
EDIT: Following the comment by ammoQ, i realized that the question is not clear in what the use case really is, here. My question is about tables that aren' "manipulated" via a user's actions, i.e manually, rather via procedures typically ran by database jobs. Take my example, for instance. My ETL process loads several tables on a daily basis and it does so in approximately 1 hour. Of that 1 hour, about half is spent analyzing the tables themselves. Thus, the tables area analyzed daily, following insertions or updates. This seems overkill, hence the question.
In general, you need to have statistics that are representative (not necessarily accurate) and that give you the right execution plan. By default, Oracle will run a statistics collection job, during the nightly batch window. That may be fine for some applications, but if you have a data warehouse, which presumably includes a regular data load process, then managing the stats should be part of that process. Note that I have said "managing" and not "collecting" statistics. That's just my way of saying that there are other options for statistics in addition to just gathering statistics, although gathering statistics would be where I would start.
There are also things that can be done to optimize statistics gathering, incremental statistics for example. The other thing that is very important is is to use the AUTO Sample size when gathering stats. Do not specify a percentage, not even 100%. The reason is that auto sample size enables a number of internal optimizations and capabilities that are disabled if you do not use AUTO sample size.
So, taking your specific points
10% staleness is pretty random, and is just a number used by the auto stats.
dbms_stats.gather_table_stats() with default values is the preferred method. One parameter that I may change would be the DEGREE, to enable stats gathering in parallel
In 12c, basic stats are gathered on load into an empty table (or empty partition). Stats are built on indexes when indexes are created. So to reiterate what I said above, stats gathering should be part of your ELT process.
I hope that makes sense and helps.

Oracle explain plan cost interpretation

I would like to know if the explain plan cost in Oracle always determine if a specific query is most efficient (in terms of performance, resource usage, disk access, ect) than other?
My question is because I have two tables.
One with a local partition index.
The other with global partition index.
Both have the same structure and the same data. Then i have a query, the cost is significantly different. The global partition index one has a very small cost, and the local partition index one a very high cost.However when i run the queries in SQL Developer, the response time is higher for the table with global partition index.
Thanks.
Cost is not comparable across two different SQL statements. it cannot and should not be inferred that higher cost = higher runtime or IO/CPU usage.
Cost is just an internal ranking that oracle applies when its calculating all possible sql plans for a specific sql statement.
As you've seen, a low cost for one sql is longer to run than a high cost sql. the cost numbers are affected by a great many things such as sql hints (first_rows etc), table statistics, system level statistics (load stats, or setting different numbers on optimizer_index_cost_adj/optimizer_index_caching etc).
Always tune SQL by IO/CPU (ie actual resource usage). ignore "cost" really.
also see here: http://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTION_ID:313416745628

when I try to use Oracle Statistics my cost grows?

I have a big query on for tables and I want to optimize it.
The weird part is that when I get the execution plan without statistics it says something like 1.2M. however if I get statistics for one of the tables involved in the query, my cost lowers to 4k. But if I ask for statistics in the other tables the cost grows to 50k, so I am not sure what's happening.
Can anyone explain a reason why giving more statistics actually increases query cost?
The Cost Based Optimiser uses as much information as you can give it in order to calculate the cost of a plan. If you update (i.e. change) the statistics it uses, then obviously that will change the calculated cost of the plan.
It's not actually the gathering of stats that causes the cost to grow - it's how those stats have changed (whether up or down) that causes the calculated cost to change.
In the absence of statistics, Oracle may use heuristics, guesswork or a quick sample of the data (depending on the settings in your instance).
Generally, the better (more accurate or representative) the statistics, the more accurate the cost calculation.
The cost based optimizer has it's challenges. There are rounding errors that can have quite an impact on decisions that it makes. This is one of the reasons that SQL Plan Stability, introduced in 11g is so nice. Forget about 10g, if you can, or prepare for long debugging sessions.
At the first use, a plan is generated based on the current statistics and executed. If SQL is repeated, the SQL and the plan are stored in a baseline. In the maintenance window, the most expensive plans are re evaluated and in many cases, a better plan can be provided. This is possible because at runtime, the optimizer is limited in the time it is given to search for a plan. In the maintenance window, a lot more time can be spent to find the best plan.
In 11g the peeking is also fixed and a single SQL can now have multiple plans, based on the values of the bind variables.
The query cost is based on many factors, where IO is a very important factor.
How are your tables filled and where are the high water marks located? A table that is filled and emptied constantly can have it's high watermark far away....
There are lots of bugs in the optimizer, lots of options, controlled by hidden parameters. You could try to use them to tweek the behaviour. Upgrading to 11g might be a lot smarter as it solves lots of performance problems for many applications.

Oracle: Difference in execution plans between databases

I am comparing queries my development and production database.
They are both Oracle 9i, but almost every single query has a completely different execution plan depending on the database.
All tables/indexes are the same, but the dev database has about 1/10th the rows for each table.
On production, the query execution plan it picks for most queries is different from development, and the cost is somtimes 1000x higher. Queries on production also seem to be not using the correct indexes for queries in some cases (full table access).
I have ran dbms_utility.analyze schema on both databases recently as well in the hopes the CBO would figure something out.
Is there some other underlying oracle configuration that could be causing this?
I am a developer mostly so this kind of DBA analysis is fairly confusing at first..
1) The first thing I would check is if the database parameters are equivalent across Prod and Dev. If one of the parameters that affects the decisions of the Cost Based Optimizer is different then all bets are off. You can see the parameter in v$parameter view;
2) Having up to date object statistics is great but keep in mind the large difference you pointed out - Dev has 10% of the rows of Prod. This rowcount is factored into how the CBO decides the best way to execute a query. Given the large difference in row counts I would not expect plans to be the same.
Depending on the circumstance the optimizer may choose to Full Table Scan a table with 20,000 rows (Dev)where it may decide an index is lower cost on the table that has 200,000 rows (Prod). (Numbers just for demonstration, the CBO uses costing algorighms for determining what to FTS and what to Index scan, not absolute values).
3) System statistics also factor into the explain plans. This is a set of statistics that represent CPU and disk i/o characteristics. If your hardware on both systems is different then I would expect your System Statistics to be different and this can affect the plans. Some good discussion from Jonathan Lewis here
You can view system stats via the sys.aux_stats$ view.
Now I'm not sure why different plans are a bad thing for you... if stats are up to date and parameters set correctly you should be getting decent performance from either system no matter what the difference in size...
but it is possible to export statistics from your Prod system and load them into your Dev system. This make your Prod statistics available to your Dev database.
Check the Oracle documentation for the DBMS_STATS package, specifically the EXPORT_SCHEMA_STATS, EXPORT_SYSTEM_STATS, IMPORT_SCHEMA_STATS, IMPORT_SYSTEM_STATS procedures. Keep in mind you may need to disable the 10pm nightly statistics jobs on 10g/11g... or you can investigate Locking statistics after import so they are not updated by nightly jobs.

Understanding the results of Execute Explain Plan in Oracle SQL Developer

I'm trying to optimize a query but don't quite understand some of the information returned from Explain Plan. Can anyone tell me the significance of the OPTIONS and COST columns? In the OPTIONS column, I only see the word FULL. In the COST column, I can deduce that a lower cost means a faster query. But what exactly does the cost value represent and what is an acceptable threshold?
The output of EXPLAIN PLAN is a debug output from Oracle's query optimiser. The COST is the final output of the Cost-based optimiser (CBO), the purpose of which is to select which of the many different possible plans should be used to run the query. The CBO calculates a relative Cost for each plan, then picks the plan with the lowest cost.
(Note: in some cases the CBO does not have enough time to evaluate every possible plan; in these cases it just picks the plan with the lowest cost found so far)
In general, one of the biggest contributors to a slow query is the number of rows read to service the query (blocks, to be more precise), so the cost will be based in part on the number of rows the optimiser estimates will need to be read.
For example, lets say you have the following query:
SELECT emp_id FROM employees WHERE months_of_service = 6;
(The months_of_service column has a NOT NULL constraint on it and an ordinary index on it.)
There are two basic plans the optimiser might choose here:
Plan 1: Read all the rows from the "employees" table, for each, check if the predicate is true (months_of_service=6).
Plan 2: Read the index where months_of_service=6 (this results in a set of ROWIDs), then access the table based on the ROWIDs returned.
Let's imagine the "employees" table has 1,000,000 (1 million) rows. Let's further imagine that the values for months_of_service range from 1 to 12 and are fairly evenly distributed for some reason.
The cost of Plan 1, which involves a FULL SCAN, will be the cost of reading all the rows in the employees table, which is approximately equal to 1,000,000; but since Oracle will often be able to read the blocks using multi-block reads, the actual cost will be lower (depending on how your database is set up) - e.g. let's imagine the multi-block read count is 10 - the calculated cost of the full scan will be 1,000,000 / 10; Overal cost = 100,000.
The cost of Plan 2, which involves an INDEX RANGE SCAN and a table lookup by ROWID, will be the cost of scanning the index, plus the cost of accessing the table by ROWID. I won't go into how index range scans are costed but let's imagine the cost of the index range scan is 1 per row; we expect to find a match in 1 out of 12 cases, so the cost of the index scan is 1,000,000 / 12 = 83,333; plus the cost of accessing the table (assume 1 block read per access, we can't use multi-block reads here) = 83,333; Overall cost = 166,666.
As you can see, the cost of Plan 1 (full scan) is LESS than the cost of Plan 2 (index scan + access by rowid) - which means the CBO would choose the FULL scan.
If the assumptions made here by the optimiser are true, then in fact Plan 1 will be preferable and much more efficient than Plan 2 - which disproves the myth that FULL scans are "always bad".
The results would be quite different if the optimiser goal was FIRST_ROWS(n) instead of ALL_ROWS - in which case the optimiser would favour Plan 2 because it will often return the first few rows quicker, at the cost of being less efficient for the entire query.
The CBO builds a decision tree, estimating the costs of each possible execution path available per query. The costs are set by the CPU_cost or I/O_cost parameter set on the instance. And the CBO estimates the costs, as best it can with the existing statistics of the tables and indexes that the query will use. You should not tune your query based on cost alone. Cost allows you to understand WHY the optimizer is doing what it does. Without cost you could figure out why the optimizer chose the plan it did. Lower cost does not mean a faster query. There are cases where this is true and there will be cases where this is wrong. Cost is based on your table stats and if they are wrong the cost is going to be wrong.
When tuning your query, you should take a look at the cardinality and the number of rows of each step. Do they make sense? Is the cardinality the optimizer is assuming correct? Is the rows being return reasonable. If the information present is wrong then its very likely the optimizer doesn't have the proper information it needs to make the right decision. This could be due to stale or missing statistics on the table and index as well as cpu-stats. Its best to have stats updated when tuning a query to get the most out of the optimizer. Knowing your schema is also of great help when tuning. Knowing when the optimizer chose a really bad decision and pointing it in the correct path with a small hint can save a load of time.
Here is a reference for using EXPLAIN PLAN with Oracle: http://download.oracle.com/docs/cd/B19306_01/server.102/b14211/ex_plan.htm), with specific information about the columns found here: http://download.oracle.com/docs/cd/B19306_01/server.102/b14211/ex_plan.htm#i18300
Your mention of 'FULL' indicates to me that the query is doing a full-table scan to find your data. This is okay, in certain situations, otherwise an indicator of poor indexing / query writing.
Generally, with explain plans, you want to ensure your query is utilizing keys, thus Oracle can find the data you're looking for with accessing the least number of rows possible. Ultimately, you can sometime only get so far with the architecture of your tables. If the costs remain too high, you may have to think about adjusting the layout of your schema to be more performance based.
In recent Oracle versions the COST represent the amount of time that the optimiser expects the query to take, expressed in units of the amount of time required for a single block read.
So if a single block read takes 2ms and the cost is expressed as "250", the query could be expected to take 500ms to complete.
The optimiser calculates the cost based on the estimated number of single block and multiblock reads, and the CPU consumption of the plan. the latter can be very useful in minimising the cost by performing certain operations before others to try and avoid high CPU cost operations.
This raises the question of how the optimiser knows how long operations take. recent Oracle versions allow the collections of "system statistics", which are definitely not to be confused with statistics on tables or indexes. The system statistics are measurements of the performance of the hardware, mostly importantly:
How long a single block read takes
How long a multiblock read takes
How large a multiblock read is (often different to the maximum possible due to table extents being smaller than the maximum, and other reasons).
CPU performance
These numbers can vary greatly according to the operating environment of the system, and different sets of statistics can be stored for "daytime OLTP" operations and "nighttime batch reporting" operations, and for "end of month reporting" if you wish.
Given these sets of statistics, a given query execution plan can be evaluated for cost in different operating environments, which might promote use of full table scans at some times or index scans at others.
The cost is not perfect, but the optimiser gets better at self-monitoring with every release, and can feedback the actual cost in comparison to the estimated cost in order to make better decisions for the future. this also makes it rather more difficult to predict.
Note that the cost is not necessarily wall clock time, as parallel query operations consume a total amount of time across multiple threads.
In older versions of Oracle the cost of CPU operations was ignored, and the relative costs of single and multiblock reads were effectively fixed according to init parameters.
FULL is probably referring to a full table scan, which means that no indexes are in use. This is usually indicating that something is wrong, unless the query is supposed to use all the rows in a table.
Cost is a number that signals the sum of the different loads, processor, memory, disk, IO, and high numbers are typically bad. The numbers are added up when moving to the root of the plan, and each branch should be examined to locate the bottlenecks.
You may also want to query v$sql and v$session to get statistics about SQL statements, and this will have detailed metrics for all kind of resources, timings and executions.

Resources