How does Oracle calculate the cost in an explain plan? - performance

Can anyone explain how the cost is evaluated in an Oracle explain plan?
Is there any specific algorithm to determine the cost of the query?
For example: full table scans have higher cost, index scan lower... How does Oracle evaluate the cases for full table scan, index range scan, etc.?
This link asks the same thing I am asking: Question about Cost in Oracle Explain Plan
But can anyone explain it with an example? We can find the cost by executing an explain plan, but how does it work internally?

There are many, many specific algorithms for computing the cost, far more than could realistically be discussed here. Jonathan Lewis has done an admirable job of walking through how the cost-based optimizer decides on the cost of a query in his book Cost-Based Oracle Fundamentals. If you're really interested, that's going to be the best place to start.
It is a fallacy to assume that full table scans will have a higher cost than, say, an index scan. It depends on the optimizer's estimates of the number of rows in the table and the optimizer's estimates of the number of rows the query will return (which, in turn, depends on the optimizer's estimates of the selectivity of the various predicates), the relative cost of a single-block read vs. a multiblock read, the speed of the processor, the speed of the disk, the probability that blocks will be available in the buffer cache, your database's optimizer settings, your session's optimizer settings, the PARALLEL attribute of your tables and indexes, and a whole bunch of other factors (this is why it takes a book to really start to dive into this sort of thing). In general, Oracle will prefer a full table scan if your query is going to return a large fraction of the rows in your table and an index access if your query is going to return a small fraction of the rows in your table. And "small fraction" is generally much smaller than people initially estimate: if you're returning 20-25% of the rows in a table, for example, you're almost always better off using a full table scan.
If you are trying to use the COST column in a query plan to determine whether the plan is "good" or "bad", you're probably going down the wrong path. The COST is only valid if the optimizer's estimates are accurate. But the most common reason that query plans would be incorrect is that the optimizer's estimates are incorrect (statistics are incorrect, Oracle's estimates of selectivity are incorrect, etc.). That means that if you see one plan for a query that has a cost of 6 and a plan for a different version of that query that has a cost of 6 million, it is entirely possible that the plan that has a cost of 6 million is more efficient because the plan with the low cost is incorrectly assuming that some step is going to return 1 row rather than 1 million rows.
You are much better served ignoring the COST column and focusing on the CARDINALITY column. CARDINALITY is the optimizer's estimate of the number of rows that are going to be returned at each step of the plan. CARDINALITY is something you can directly test and compare against. If, for example, you see a step in the plan that involves a full scan of table A with no predicates and you know that A has roughly 100,000 rows, it would be concerning if the optimizer's CARDINALITY estimate was either way too high or way too low. If it was estimating the cardinality to be 100 or 10,000,000, then the optimizer would almost certainly be either picking the table scan in error or feeding that data into a later step where its cost estimate would be grossly incorrect, leading it to pick a poor join order or a poor join method. And it would probably indicate that the statistics on table A were incorrect. On the other hand, if you see that the cardinality estimates at each step are reasonably close to reality, there is a very good chance that Oracle has picked a reasonably good plan for the query.
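The comparison described above can be sketched in a few lines. This is only an illustration: the step names, the row counts and the 10x threshold are assumptions made for the example, not anything Oracle reports or recommends.

```python
def cardinality_ratio(estimated_rows, actual_rows):
    """Return how many times off the estimate is (always >= 1.0)."""
    # Treat 0 rows as 1 so the ratio is always defined.
    est = max(estimated_rows, 1)
    act = max(actual_rows, 1)
    return max(est, act) / min(est, act)


def suspicious_steps(steps, threshold=10.0):
    """Return the names of plan steps whose estimate is off by `threshold`x or more.

    `steps` is an iterable of (step_name, estimated_rows, actual_rows) tuples.
    The threshold is an arbitrary illustrative value, not an Oracle setting.
    """
    return [name for name, est, act in steps
            if cardinality_ratio(est, act) >= threshold]


# Hypothetical plan steps: estimated rows taken from the plan,
# actual rows taken from counting the data yourself.
plan_steps = [
    ("TABLE ACCESS FULL A", 100, 100_000),     # estimate is 1000x too low
    ("INDEX RANGE SCAN B_IDX", 950, 1_000),    # estimate is close to reality
]

flagged = suspicious_steps(plan_steps)
```

Steps that end up in `flagged` are the ones where stale statistics or a bad selectivity estimate would be worth checking first.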

Another place to get started on understanding the CBO's algorithms is this paper by Wolfgang Breitling. Jonathan Lewis's book is more detailed and more recent, but the paper is a good introduction.

In the 9i documentation Oracle produced an authoritative-looking mathematical model for cost:
Cost = (#SRds * sreadtim +
        #MRds * mreadtim +
        #CPUCycles / cpuspeed) / sreadtim
where:
#SRds is the number of single block reads
#MRds is the number of multi block reads
#CPUCycles is the number of CPU cycles
sreadtim is the single block read time
mreadtim is the multi block read time
cpuspeed is the CPU cycles per second
So it gives a good idea of the factors that go into calculating cost. This was why Oracle introduced the capability to gather system statistics: to provide accurate values for CPU speed, etc.
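As a rough illustration, the 9i formula can be written as a small function. The parameter names mirror the symbols in the formula; the input numbers are made up, and the units of the time and CPU terms are assumed to be consistent with each other.

```python
def estimated_cost(single_block_reads, multi_block_reads, cpu_cycles,
                   sreadtim, mreadtim, cpuspeed):
    """Evaluate the 9i cost formula.

    Dividing by sreadtim expresses the total estimated time as a number
    of equivalent single block reads.
    """
    single_block_time = single_block_reads * sreadtim
    multi_block_time = multi_block_reads * mreadtim
    cpu_time = cpu_cycles / cpuspeed
    return (single_block_time + multi_block_time + cpu_time) / sreadtim


# Made-up inputs: 100 single block reads, 50 multi block reads, no CPU work,
# with a multi block read taking twice as long as a single block read.
cost = estimated_cost(
    single_block_reads=100,
    multi_block_reads=50,
    cpu_cycles=0,
    sreadtim=10.0,
    mreadtim=20.0,
    cpuspeed=1.0,
)
# (100 * 10 + 50 * 20 + 0) / 10 = 200.0
```

Because the result is normalized by sreadtim, halving mreadtim in this example makes the multi block reads count for 50 cost units instead of 100, which is how system statistics can shift the balance between full scans and index access.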
Now we fast forward to the equivalent 11g documentation and we find the maths has been replaced with a cursory explanation:
"Cost of the operation as estimated by the optimizer's query approach.
Cost is not determined for table access operations. The value of this
column does not have any particular unit of measurement; it is merely
a weighted value used to compare costs of execution plans. The value
of this column is a function of the CPU_COST and IO_COST columns."
I think this reflects the fact that cost just isn't a very reliable indicator of execution time. Jonathan Lewis recently posted a pertinent blog piece. He shows two similar looking queries; their explain plans are different but they have identical costs. Nevertheless when it comes to runtime one query performs considerably slower than the other. Read it here.

Related

ORA-01795 - why is the maximum number of expressions limited to 1000

SO is full of work-arounds, but I'm wondering about the historical reasons behind the 1000 limit for "maximum number of expressions" in an IN clause?
It might be because the feature could otherwise be abused with huge numbers of values, and every value in the list is transformed into an equivalent OR condition.
For example NAME IN ('JOHN', 'CHARLES'..) would be transformed into NAME = 'JOHN' OR NAME = 'CHARLES'
So a very long list might hurt performance.
But note Oracle still supports
SELECT ID FROM EMP WHERE NAME IN (SELECT NAME FROM ATTENDEES)
In this case, the optimizer doesn't convert the list into multiple OR conditions, but performs a join instead.
This restriction applies not only to IN lists but to any expression list. The documentation says:
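To make the IN-to-OR rewrite described above concrete, here is a small sketch that builds the equivalent predicate text. The helper name is made up for illustration, it only mimics the shape of the rewrite (it says nothing about Oracle's internals), and it is not a safe way to build real SQL.

```python
def in_list_as_or(column, values):
    """Render `column IN (v1, v2, ...)` as the equivalent chain of OR conditions.

    Illustration only: values are interpolated without any escaping.
    """
    return " OR ".join(f"{column} = '{value}'" for value in values)


small = in_list_as_or("NAME", ["JOHN", "CHARLES"])
# small == "NAME = 'JOHN' OR NAME = 'CHARLES'"

# The predicate text grows linearly with the list, which is the text
# the parser then has to process.
large = in_list_as_or("NAME", [f"USER{i}" for i in range(1000)])
or_count = large.count(" OR ")
```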
A comma-delimited list of expressions can contain no more than 1000 expressions.
Your question is WHY the limit is 1000. Why not 100 or 10000 or a million? I guess it relates to the limit on the number of columns in a table, which is 1000. Perhaps Oracle keeps the two limits equal internally so that an expression list can always be matched against the columns in a DML statement.
But for a good design, the limit of 1000 is already large. In practice, you shouldn't reach it.
And, a quote from the famous AskTom site on similar topic,
We'll spend more time parsing queries then actually executing them!
Update: my own thoughts
I think Oracle is old enough as a database technology that these limits were set once and never had to be revisited. All expression lists have the 1000 limit, and a robust design never gives users a reason to ask Oracle for an explanation. Tom's answer about parsing always makes me think that the purpose of this limit, back in the 70s or 80s, was mostly computational. The algorithms, written in C, might have needed some limit, and Oracle came up with 1000.
Update 2: from the point of view of the application and its framework
As a DBA, I have seen many developers approach me with performance issues that are actually caused by the application framework generating the queries that fetch data from the database. The application lets users add filters, which eventually form the AND/OR logic within the IN list of the query. Internally, Oracle expands the IN list into OR logic as a query rewrite in the optimization stage. The query becomes huge, which increases the time needed to PARSE it, and most of the time it suppresses index usage. So this is one of the cases where a query with a huge IN list is generated by an application framework.

Estimating number of results in Google App Engine Query

I'm attempting to estimate the total amount of results for app engine queries that will return large amounts of results.
In order to do this, I assigned a random floating point number between 0 and 1 to every entity. Then I executed the query for which I wanted to estimate the total results with the following 3 settings:
* I ordered by the random numbers that I had assigned in ascending order
* I set the offset to 1000
* I fetched only one entity
I then plugged the entity's random value (the one I had assigned for this purpose) into the following equation to estimate the total results (since I used 1000 as the offset above, the value of OFFSET would be 1000 in this case):
1 / RANDOM * OFFSET
The idea is that since each entity has a random number assigned to it, and I am sorting by that random number, the entity's random number assignment should be proportionate to the beginning and end of the results with respect to its offset (in this case, 1000).
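The idea can be tried offline with a short simulation. This is a sketch under stated assumptions: an in-memory list stands in for the datastore query, the total of 60,000 entities and the seed are arbitrary, and sorting the list plays the role of ordering by the random property.

```python
import random


def estimate_total(random_values, offset):
    """Estimate the number of entities from the random value found at `offset`.

    Mirrors the procedure in the question: order by the random value,
    skip `offset` entities, fetch one, then compute 1 / RANDOM * OFFSET.
    """
    ordered = sorted(random_values)
    sample = ordered[offset]      # the single entity fetched after the offset
    return offset / sample


rng = random.Random(42)           # fixed seed so the run is repeatable
actual_total = 60_000             # assumed size of the result set
values = [rng.random() for _ in range(actual_total)]

estimate = estimate_total(values, offset=1000)
relative_error = abs(estimate - actual_total) / actual_total
```

With uniformly distributed values, the estimate scatters on both sides of the true total across different seeds. A consistently low estimate points to a skew in how the random values were assigned rather than to the random number generator.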
The problem I am having is that the results I am getting are giving me low estimates. The lower the offset, the lower the estimate. I had anticipated that the lower the offset that I used, the less accurate the estimate should be, but I thought that the margin of error would be both above and below the actual number of results.
Below is a chart demonstrating what I am talking about. As you can see, the predictions get more consistent (accurate) as the offset increases from 1000 to 5000. But then the predictions predictably follow a fourth-degree polynomial (y = -5E-15x^4 + 7E-10x^3 - 3E-05x^2 + 0.3781x + 51608).
Am I making a mistake here, or does the standard python random number generator not distribute numbers evenly enough for this purpose?
Thanks!
Edit:
It turns out that this problem is due to my mistake. In another part of the program, I was grabbing entities from the beginning of the series, doing an operation, then re-assigning the random number. This resulted in a denser distribution of random numbers towards the end.
I did a little more digging into this concept, fixed the problem, and tried it again on a different query (so the number of results is different from above). I found that this idea can be used to estimate the total results for a query. One thing of note is that the "error" is very similar for offsets that are close by. When I did a scatter chart in Excel, I expected the accuracy of the predictions at each offset to form a "cloud", meaning that offsets at the very beginning would produce a larger, less dense cloud that would converge to a very tiny, dense cloud around the actual value as the offsets got larger. This is not what happened, as you can see below in the chart of how far off the predictions were at each offset. Where I thought there would be a cloud of dots, there is a line instead.
This is a chart of the maximum error after each offset. For example, the maximum error for any offset after 10000 was less than 1%:
When using GAE it makes a lot more sense not to try to do large amounts of work on reads: it's built and optimized for very fast request turnarounds. In this case it's actually more efficient to maintain a count of your results as and when you create the entities.
If you have a standard query, this is fairly easy - just use a sharded counter when creating the entities. You can seed this using a map reduce job to get the initial count.
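The sharded counter idea can be sketched in plain Python. This is an in-memory stand-in, not App Engine code: in the datastore each shard would be its own entity updated in a transaction, and the class name and the shard count of 20 are assumptions made for the example.

```python
import random


class ShardedCounter:
    """In-memory illustration of a sharded counter.

    Writes are spread over several shards so that no single record
    becomes a write hotspot; reads sum all shards.
    """

    def __init__(self, num_shards=20, rng=None):
        self._shards = [0] * num_shards
        self._rng = rng or random.Random()

    def increment(self, amount=1):
        # Pick a shard at random so concurrent writers rarely collide.
        shard = self._rng.randrange(len(self._shards))
        self._shards[shard] += amount

    def total(self):
        # Reading the count means summing every shard.
        return sum(self._shards)


counter = ShardedCounter(num_shards=20, rng=random.Random(0))
for _ in range(1000):             # e.g. one increment per entity created
    counter.increment()
entity_count = counter.total()
```

The trade-off is that reads touch every shard, so the shard count should be just large enough to absorb the expected write rate.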
If you have queries that might be dynamic, this is more difficult. If you know the range of possible queries that you might perform, you'd want to create a counter for each query that might run.
If the range of possible queries is infinite, you might want to think of aggregating counters or using them in more creative ways.
If you tell us the query you're trying to run, there might be someone who has a better idea.
Some quick thoughts:
Have you tried the Datastore Statistics API? It may provide fast and accurate results if you don't update your set of entities very frequently.
http://code.google.com/appengine/docs/python/datastore/stats.html
[EDIT1.]
I did some math, and I think the estimation method you proposed here could be rephrased as an "order statistic" problem.
http://en.wikipedia.org/wiki/Order_statistic#The_order_statistics_of_the_uniform_distribution
For example:
If the actual number of entities is 60000, the question becomes: "what is the probability that your 1000th [2000th, 3000th, ...] sample falls in the interval [l, u], so that the total estimated from this sample is within an acceptable error of 60000?"
If the acceptable error is 5%, the interval [l, u] will be [0.015873015873015872, 0.017543859649122806]
I think the probability won't be very large.
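That probability can be checked numerically. The sketch below relies on the standard result that the k-th smallest of N independent uniform(0, 1) values follows a Beta(k, N - k + 1) distribution, and estimates the probability by Monte Carlo sampling with Python's `random.betavariate`. The function names, the trial count and the seed are arbitrary choices for the example.

```python
import random


def acceptance_interval(rank, total, tolerance):
    """Interval [l, u] of sample values whose estimate rank / value
    lands within `tolerance` (e.g. 0.05) of the true total."""
    lower = rank / (total * (1 + tolerance))
    upper = rank / (total * (1 - tolerance))
    return lower, upper


def probability_within(rank, total, tolerance, trials=20_000, seed=0):
    """Monte Carlo estimate of P(l <= U_(rank) <= u) for `total` uniform values."""
    rng = random.Random(seed)
    lower, upper = acceptance_interval(rank, total, tolerance)
    hits = sum(
        lower <= rng.betavariate(rank, total - rank + 1) <= upper
        for _ in range(trials)
    )
    return hits / trials


lower, upper = acceptance_interval(rank=1000, total=60_000, tolerance=0.05)
probability = probability_within(rank=1000, total=60_000, tolerance=0.05)
```

Running it for larger ranks (2000, 3000, ...) shows how quickly the chance of landing inside the 5% band improves as the offset grows.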
This doesn't directly deal with the calculations aspect of your question, but would using the count attribute of a query object work for you? Or have you tried that out and it's not suitable? As per the docs, it's only slightly faster than retrieving all of the data, but on the plus side it would give you the actual number of results.
http://code.google.com/appengine/docs/python/datastore/queryclass.html#Query_count

Oracle Date Range Inconsistency

I am running a fairly large query on a specific range of dates. The query takes about 30 seconds EXCEPT when I do a range of 10/01/2011-10/31/2011. For some reason that range never finishes. For example 01/01/2011-01/31/2011, and pretty much every other range, finishes in the expected time.
Also, I noticed that doing smaller ranges, like a week, takes longer than larger ranges.
When Oracle gathers statistics on a table, it will record the low value and the high value in a date column and use that to estimate the cardinality of a predicate. If you create a histogram on the column, it will gather more detailed information about the distribution of data within the column. Otherwise, Oracle's cost based optimizer (CBO) will assume a uniform distribution.
For example, if you have a table with 1 million rows and a DATE column with a low value of January 1, 2001 and a high value of January 1, 2011, it will assume that approximately 10% of the data is in the range January 1, 2001 - January 1, 2002 and that roughly 0.027% of the data would come from some time on March 3, 2008 (1/(10 years * 365 days per year + leap days)).
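The arithmetic behind those two percentages can be reproduced with a few lines. This is a deliberately simplified model of the uniform-distribution assumption (no histograms, and no handling of predicates that fall outside the known range), and the function name is made up for the example.

```python
from datetime import date


def uniform_selectivity(range_start, range_end, low_value, high_value):
    """Fraction of rows expected in [range_start, range_end), assuming the
    values are spread evenly between the column's low and high values."""
    span_days = (high_value - low_value).days
    return (range_end - range_start).days / span_days


low, high = date(2001, 1, 1), date(2011, 1, 1)   # column low/high values
table_rows = 1_000_000

# One year out of ten: roughly 10% of the rows.
one_year = uniform_selectivity(date(2001, 1, 1), date(2002, 1, 1), low, high)

# A single day (March 3, 2008): roughly 0.027% of the rows.
one_day = uniform_selectivity(date(2008, 3, 3), date(2008, 3, 4), low, high)

estimated_rows_for_one_day = round(table_rows * one_day)
```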
So long as your queries use dates from within the known range, the optimizer's cardinality estimates are generally pretty good so its decisions about what plan to use are pretty good. If you go a bit beyond the upper or lower bound, the estimates are still pretty good because the optimizer assumes that there probably is data that is larger or smaller than it saw when it sampled the data to gather the statistics. But when you get too far from the range that the optimizer statistics expect to see, the optimizer's cardinality estimates get too far out of line and it eventually chooses a bad plan. In your case, prior to refreshing the statistics, the maximum value the optimizer was expecting was probably September 25 or 26, 2011. When your query looked for data for the month of October, 2011, the optimizer most likely expected that the query would return very few rows and chose a plan that was optimized for that scenario rather than for the larger number of rows that were actually returned. That caused the plan to be much worse given the actual volume of data that was returned.
In Oracle 10.2, when Oracle does a hard parse and generates a plan for a query that is loaded into the shared pool, it peeks at the bind variable values and uses those values to estimate the number of rows a query will return and thus the most efficient query plan. Once a query plan has been created and until the plan is aged out of the shared pool, subsequent executions of the same query will use the same query plan regardless of the values of the bind variables. Of course, the next time the query has to be hard parsed because the plan was aged out, Oracle will peek and will likely see new bind variable values.
Bind variable peeking is not a particularly well-loved feature (Adaptive Cursor Sharing in 11g is much better) because it makes it very difficult for a DBA or a developer to predict what plan is going to be used at any particular instant because you're never sure if the bind variable values that the optimizer saw during the hard parse are representative of the bind variable values you generally see. For example, if you are searching over a 1 day range, an index scan would almost certainly be more efficient. If you're searching over a 5 year range, a table scan would almost certainly be more efficient. But you end up using whatever plan was chosen during the hard parse.
Most likely, you can resolve the problem simply by ensuring that statistics are gathered more frequently on tables that are frequently queried based on ranges of monotonically increasing values (date columns being by far the most common such column). In your case, it had been roughly 6 weeks since statistics had been gathered before the problem arose so it would probably be safe to ensure that statistics are gathered on this table every month or every couple weeks depending on how costly it is to gather statistics.
You could also use the DBMS_STATS.SET_COLUMN_STATS procedure to explicitly set the statistics for this column on a more regular basis. That requires more coding and work but saves you the time of gathering statistics. That can be hugely beneficial in a data warehouse environment but it's probably overkill in a more normal OLTP environment.

What is the best way to analyse a large dataset with similar records?

Currently I am looking for a way to develop an algorithm which is supposed to analyse a large dataset (about 600M records). The records have parameters "calling party", "called party", "call duration" and I would like to create a graph of weighted connections among phone users.
The whole dataset consists of similar records: people mostly talk to their friends and don't dial random numbers, but occasionally a person calls "random" numbers as well. For analysing the records I was thinking about the following logic:
1. Create an array of numbers to indicate which records (row numbers) have already been scanned.
2. Start scanning from the first line and, for its combination of "calling party" and "called party", check for the same combination in the rest of the database.
3. Sum the call durations and divide the result by the sum of all call durations.
4. Add the numbers of the summed lines to the array created at the beginning.
5. Check the array to see whether the next record number has already been summed.
6. If it has already been summed, skip the record; otherwise perform step 2.
I would appreciate it if any of you could suggest improvements to the logic described above.
p.s. the edges are directed, therefore (calling party, called party) is not equal to (called party, calling party)
Although it is not programming related, I would like to emphasize that, due to the law and out of respect for user privacy, all the information that could possibly reveal a user's identity was hashed before the analysis.
As always with large datasets the more information you have about the distribution of values in them the better you can tailor an algorithm. For example, if you knew that there were only, say, 1000 different telephone numbers to consider you could create a 1000x1000 array into which to write your statistics.
Your first step should be to analyse the distribution(s) of data in your dataset.
In the absence of any further information about your data I'm inclined to suggest that you create a hash table. Read each record in your 600M dataset and calculate a hash address from the concatenation of calling and called numbers. Into the table at that address write the calling and called numbers (you'll need them later, and bear in mind that the hash is probably irreversible), add 1 to the number of calls and add the duration to the total duration. Repeat 600M times.
Now you have a hash table which contains the data you want.
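A minimal sketch of that hash-table pass, using a Python dict (itself a hash table) keyed on the (calling, called) pair. The record layout and the sample data are assumptions made for the example. Keeping the pair itself as the key covers the point about needing the numbers later.

```python
from collections import defaultdict


def aggregate_calls(records):
    """One pass over the records, accumulating per directed pair:
    [number_of_calls, total_duration].

    `records` is any iterable of (calling, called, duration) tuples,
    so it can be a generator that streams the 600M rows from disk.
    """
    stats = defaultdict(lambda: [0, 0])
    for calling, called, duration in records:
        entry = stats[(calling, called)]
        entry[0] += 1             # one more call for this pair
        entry[1] += duration      # add to the pair's total duration
    return dict(stats)


# Tiny made-up sample; edges are directed, so (a, b) and (b, a) stay separate.
sample = [("a", "b", 60), ("a", "b", 30), ("b", "a", 10), ("a", "c", 5)]
call_stats = aggregate_calls(sample)
```

Memory use grows with the number of distinct pairs rather than the number of records, which is why knowing the distribution of the data matters before committing to this approach.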
Since there are 600M records, the dataset is large enough to justify a database (and not so large that it requires a distributed database). So, you could simply load this into a DB (MySQL, SQL Server, Oracle, etc.) and run a query such as the following:
select calling_party, called_party,
       sum(call_duration), avg(call_duration),
       min(call_duration), max(call_duration), count(*)
from call_log
group by calling_party, called_party
order by 7 desc
That would be a start.
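To try the query without setting up a server, it can be run against an in-memory SQLite database from Python. SQLite is swapped in here purely for convenience (the answer names MySQL, SQL Server and Oracle), and the column types and sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE call_log ("
    "calling_party TEXT, called_party TEXT, call_duration INTEGER)"
)
conn.executemany(
    "INSERT INTO call_log VALUES (?, ?, ?)",
    [("a", "b", 60), ("a", "b", 30), ("b", "a", 10), ("a", "c", 5)],
)

# Same aggregate query as above; ORDER BY 7 sorts on the 7th output
# column, i.e. the call count, so the busiest pairs come first.
rows = conn.execute(
    """
    SELECT calling_party, called_party,
           SUM(call_duration), AVG(call_duration),
           MIN(call_duration), MAX(call_duration), COUNT(*)
    FROM call_log
    GROUP BY calling_party, called_party
    ORDER BY 7 DESC
    """
).fetchall()

busiest_pair = rows[0]
conn.close()
```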
Next, you would want to run some Association analysis (possibly using Weka), or perhaps you would want to analyze this information as cubes (possibly using Mondrian/OLAP). If you tell us more, we can help you more.
Algorithmically, what the DB is doing internally is similar to what you would do yourself programmatically:
Scan each record
Find the record for each (calling_party, called_party) combination, and update its stats.
A good way to store and find records for (calling_party, called_party) would be to use a hash function and to find the matching record in the bucket.
Although it may be tempting to create a two-dimensional array for (calling_party, called_party), that would be a very sparse array (very wasteful).
How often will you need to perform this analysis? If this is a large, unique dataset and thus only once or twice - don't worry too much about the performance, just get it done, e.g. as Amrinder Arora says by using simple, existing tooling you happen to know.
You really want more information about the distribution, as High Performance Mark says. For starters, it'd be nice to know the count of unique phone numbers, the count of unique phone number pairs, and the mean, variance and maximum of the count of calling/called phone numbers per unique phone number.
You really want more information about the analysis you want to perform on the result. For instance, are you more interested in holistic statistics or in identifying individual clusters? Do you care more about following the links forward (determining who X frequently called) or following the links backward (determining who X was frequently called by)? Do you want to project overviews of this graph into low-dimensional spaces, i.e. 2d? Should it be easy to identify indirect links, e.g. X is near {A, B, C}, all of whom are near Y, so X is sorta near Y?
If you want fast and frequently adapted results, then be aware that a dense representation with good memory & temporal locality can easily make a huge difference in performance. In particular, that can easily outweigh a factor ln N in big-O notation; you may benefit from a dense, sorted representation over a hashtable. And databases? Those are really slow. Don't touch those if you can avoid it at all; they are likely to be a factor 10000 slower - or more, the more complex the queries are you want to perform on the result.
Just sort records by "calling party" and then by "called party". That way each unique pair will have all its occurrences in consecutive positions. Hence, you can calculate the weight of each pair (calling party, called party) in one pass with little extra memory.
For sorting, you can sort small chunks separately, and then do an N-way merge sort. That's memory efficient and can be easily parallelized.
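The sort-then-scan approach can be sketched with the standard library: `heapq.merge` performs the N-way merge of already sorted chunks, and `itertools.groupby` sums each run of identical pairs in a single pass. In this sketch the chunks live in memory and the chunk size is tiny; a real run over 600M records would write each sorted chunk to disk and merge the files, which is an assumption about how you would scale it, not something shown here.

```python
import heapq
from itertools import groupby


def sorted_chunks(records, chunk_size):
    """Split the records into chunks and sort each chunk independently.

    Tuples sort by calling party first, then by called party.
    """
    for start in range(0, len(records), chunk_size):
        yield sorted(records[start:start + chunk_size])


def pair_weights(records, chunk_size=2):
    """Yield ((calling, called), total_duration) in sorted pair order."""
    merged = heapq.merge(*sorted_chunks(records, chunk_size))  # N-way merge
    # After merging, all occurrences of a pair are consecutive.
    for pair, group in groupby(merged, key=lambda record: (record[0], record[1])):
        yield pair, sum(record[2] for record in group)


sample = [("a", "b", 60), ("b", "a", 10), ("a", "b", 30), ("a", "c", 5)]
weights = list(pair_weights(sample))
```

Dividing each total by the sum of all durations gives the normalized edge weight described in the question.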

Is a globally partitioned index better (faster) than a non-partitioned index?

I'm interested to find out if there is a performance benefit to partitioning a numeric column that is often the target of a query. Currently I have a materialized view that contains ~50 million records. When using a regular b-tree index and searching by this numeric column I get a cost of 7 and query results in about 0.8 seconds (with non-primed cache). After adding a global hash partition (with 64 partitions) for that column I get a cost of 6 and query results in about 0.2 seconds (again with non-primed cache).
My first reaction is that the partitioned index has improved the performance of my query. However, I realize that this may just be a coincidence and could be totally dependent on the values being searched on, or on other factors I'm not aware of. So my question is: is there a performance benefit to adding a global hash partition to a numeric column on a large table, or does the cost of determining which index partitions to scan outweigh the cost of just doing a range scan on a non-partitioned index?
I'm sure this, like many Oracle questions, can be answered with an "it depends." :) I'm interested in learning what factors I should consider to determine the benefits of each approach.
Thanks!
I'm pretty sure you have already found this reference in your research: Partitioned Tables and Indexes. I'm linking it anyway in case somebody else is interested; it is very good material about partitioning.
Straight to the point: a partitioned index just decomposes the index into pieces (64 in your situation) and spreads the data across them according to the hash of the partitioning key. When you want to use it, Oracle "calculates" the hash of the key and determines in which partition to continue searching.
Knowing how index searching works, on really huge data I think it is better to choose the partitioned index, because it reduces the size of the index tree you have to traverse compared with a regular index. It really depends on the data in the table (how the regular index tree is composed) and on whether hashing and jumping directly to a lower node is faster than a regular tree traversal from the root node.
Finally, you should trust your test results. If one technique gives better results than another on your exact data, don't hesitate to implement it.
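As a toy model of that lookup path: hash the key to pick a partition, then search only that partition's smaller sorted structure. Everything here is a stand-in made up for illustration. Python's `hash` replaces Oracle's internal hash function, a sorted list with binary search replaces a B-tree, and the class name is invented.

```python
import bisect


class HashPartitionedIndex:
    """Toy hash-partitioned index: one small sorted list per partition."""

    def __init__(self, keys, num_partitions=64):
        self.num_partitions = num_partitions
        self.partitions = [[] for _ in range(num_partitions)]
        for key in keys:
            self.partitions[self._partition_of(key)].append(key)
        for partition in self.partitions:
            partition.sort()

    def _partition_of(self, key):
        # Stand-in for the database's hash function.
        return hash(key) % self.num_partitions

    def contains(self, key):
        # Step 1: hash the key to find the single partition to search.
        partition = self.partitions[self._partition_of(key)]
        # Step 2: binary search inside that (much smaller) partition.
        position = bisect.bisect_left(partition, key)
        return position < len(partition) and partition[position] == key


index = HashPartitionedIndex(range(0, 1000, 3), num_partitions=64)
found = index.contains(300)       # 300 is a multiple of 3, so it was indexed
missing = index.contains(301)     # 301 was never inserted
largest_partition = max(len(p) for p in index.partitions)
```

Each lookup searches a structure roughly 1/64 the size of the full key set, at the price of one hash computation. Because tree depth grows only logarithmically, the saving per lookup is modest, which fits the advice to measure on your own data.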

Resources