I have a high level question. Say I have a SQL query that takes 30ms to complete, it runs against an indexed column on a table with 1million records. Now if the table size is increased to 5million records should I expect the query to take 5 times as long (as 5 times the indexes have to be searched), so 150ms. I apologise if this is too simplistic, I have a program that is running 10 SQL (indexed) against a table that is going to be increased by this factor, the queries currently take 300ms and I am concerned this would increase to 1.5s. Any help would be appreciated!
You can think of an index lookup as doing a search through a binary tree followed by a fetch of the page with the appropriate data. Typically, the index would fit in memory and the search through the index would be quite fast. Multiplying the data size of 10 would increase the depth of the tree by 3 or 4. With in-memory comparison operations this would not be noticeable for most queries. (There are other types of indexes besides binary b-trees, but this is a convenient model for thinking about performance.)
The data fetch then could incur the overhead of reading a page from disk. That should still be quite fast.
So, the easy answer to your question is: no. However, this assumes that the query is something like:
select t.*
from table t
where t.indexcol = CONSTANTVALUE
And, it assumes that the query only returns one row. Things that might affect the performance as the table size increases include:
The size of the returned data set increases with the size of the table. Returning more values necessarily takes longer. For some queries, the performance is more dependent on the mechanism for returning values than calculating/fetching the data.
The query contains join or group by.
The statistics of the table are up-to-date, so the optimizer doesn't accidentally choose a full table scan rather than an index lookup.
You are in a memory-constrained environment where the index doesn't fit in memory. Or, the entire table is in memory when smaller but incurs the overhead of a cache-miss as it gets larger.
Related
From time to time our Oracle response times decrease significally for a minute or two, without having extra load.
we were able to identify an insert statement, which produces a lot of buffer busy waits.
From the ADDM report, we got the following hint:
Consider partitioning the INDEX "IDX1" with object
ID 4711 in a manner that will evenly distribute concurrent DML across
multiple partitions.
To be honest: I am not sure what that means. I don't know what a partitioned index is. I only can Image that it means to create a Partition with a local index.
Can you help me out here?
There is a very high frequency of reading and writing to that table. no updates or deletes are used.
Thanks,
E.
I am not sure what that means.
Oracle is telling you that there is a lot of concurrent ("at the same time") activity on a very small part of your index. This happens a lot.
Consider an index column TAB1_PK on table TAB1 whose values are inserted from a sequence TAB1_S. Suppose you have 5 database sessions all inserting into TAB1 at the same time.
Because TAB1_PK is indexed, and because the sequence is generating values in numeric order, what happens is that all those sessions have to read and update the same blocks of the index at the same time.
This can cause a lot of contention -- way more than you would expect, due to the way indexes work with multi-version read consistency. I mean, in some rare situations (depending on how the transaction logic is written and the number of concurrent sessions), it can really be crippling.
The (really) old way to avoid this problem was to use a reverse key index. That way, the sequential column values did not all go to the same index blocks.
However, that is a two-edged sword. On the one hand, you get less contention because you're inserting all over the index (good). On the other hand, your rows are going all over the index, meaning you cannot cache them all. You've just turned a big logical I/O problem into a physical I/O problem!
Nowadays, we have a better solution -- a GLOBAL HASH PARTITION on the index.
With a GHP, you can specify the number of hash buckets and use that to trade-off between how much contention you need to handle vs how compact you want the index updates (for better buffer caching). The more index hash partitions you use, the better your concurrency but the worse your index block buffer caching will be.
I find a number (of global hash partitions) around 16 is pretty good.
I have a large-ish sqlite3 (3.6.22) database (about 1 GB, 5 million rows) with a single table indexed on one column. The problem is that the time to do a typical INSERT transaction fluctuates widely. I insert about 10000 rows at a time (wrapped in a transaction of course). Often it takes about 1.5 seconds, but about every fifth transaction it suddenly takes several minutes for the very same transaction to complete. I've done a lot of experimentation, and I've discovered that the phenomena only occurs if there is an index, which makes me think it is updating the index which takes a lot of time.
I need more consistent performance. A bit higher average insertion times times would be ok, if I can only avoid that some transactions suddenly takes 200x as long as the previous one... What should I do?
Here's the schema. The strings in blocks.md5 are always exactly 32 bytes long and likely unique. The rolling.value column will contain very large 64-bit integers.
CREATE TABLE blocks (blob char(32) NOT NULL,
offset long NOT NULL,
md5 char(32) NOT NULL,
row_md5 char(32));
CREATE TABLE rolling (value INT NOT NULL);
CREATE INDEX index_md5 ON blocks (md5);
CREATE UNIQUE INDEX index_rolling ON rolling (value);
I don't know exactly how sqlite indexes are implemented, but I'd expect the behavior you describe if they were storing the index on disk or reordering the data.
Imagine a scenario where when they are allocating blocks for the index, they start some page with N slots for data. When the page fills up, they have to allocate another and split the data between them.
When you're inserting your data, the ordering of the MD5 will be as random as it gets, so every page will fill up independently. There isn't any reasonable way for the indexing strategy to know that.
Other databases will even recommend using different indexing strategies than normal for strings, especially in the case of something like random MD5s.
Trying to do this in an all memory database would tell you whether its algorithmic or disk access.
I've only really tried to avoid this in an offline system where I could sort data before inserting. After it was all inserted I would index it and that was as fast as I could find. If you're doing 10k at a time, that might be your use case, though I don't know.
I have a table in my oracle db with 100 columns. 50 columns in this table are not used by the program accessing this table. (i.e. the select queries only select the relevant columns and NOT using '*')
My question is this :
If I recreate the same table with only the columns I need will it improve queries performance using the same query I used with the original table (remember that only the relevant columns are selected)
It is well worth mentioning the the program makes these queries a reasonable amount of times per second!
P.S. :
This is an existing project I am working on and the table design was made a long time ago for other products as well (thats why we have unused columns now)
So the effect of this will be that the average row will be smaller, if the extra columns have got data that will no longer be in the table. Therefore the table can be smaller, and not only will it use less space on disk it will use less memory space in the SGA, and caching will be more efficient.
Therefore, if you access the table via a full table scan then it will be faster to read the segment, but if you use index-based access mechanisms then the only performance improvement is likely to be through an improved chance of fetching the block from cache.
[Edited]
This SO thread suggests "it always pulls a tuple...". Hence, you are likely to see some performance improvement, not sure major or minor, as already mentioned.
I'm trying to optimize a query but don't quite understand some of the information returned from Explain Plan. Can anyone tell me the significance of the OPTIONS and COST columns? In the OPTIONS column, I only see the word FULL. In the COST column, I can deduce that a lower cost means a faster query. But what exactly does the cost value represent and what is an acceptable threshold?
The output of EXPLAIN PLAN is a debug output from Oracle's query optimiser. The COST is the final output of the Cost-based optimiser (CBO), the purpose of which is to select which of the many different possible plans should be used to run the query. The CBO calculates a relative Cost for each plan, then picks the plan with the lowest cost.
(Note: in some cases the CBO does not have enough time to evaluate every possible plan; in these cases it just picks the plan with the lowest cost found so far)
In general, one of the biggest contributors to a slow query is the number of rows read to service the query (blocks, to be more precise), so the cost will be based in part on the number of rows the optimiser estimates will need to be read.
For example, lets say you have the following query:
SELECT emp_id FROM employees WHERE months_of_service = 6;
(The months_of_service column has a NOT NULL constraint on it and an ordinary index on it.)
There are two basic plans the optimiser might choose here:
Plan 1: Read all the rows from the "employees" table, for each, check if the predicate is true (months_of_service=6).
Plan 2: Read the index where months_of_service=6 (this results in a set of ROWIDs), then access the table based on the ROWIDs returned.
Let's imagine the "employees" table has 1,000,000 (1 million) rows. Let's further imagine that the values for months_of_service range from 1 to 12 and are fairly evenly distributed for some reason.
The cost of Plan 1, which involves a FULL SCAN, will be the cost of reading all the rows in the employees table, which is approximately equal to 1,000,000; but since Oracle will often be able to read the blocks using multi-block reads, the actual cost will be lower (depending on how your database is set up) - e.g. let's imagine the multi-block read count is 10 - the calculated cost of the full scan will be 1,000,000 / 10; Overal cost = 100,000.
The cost of Plan 2, which involves an INDEX RANGE SCAN and a table lookup by ROWID, will be the cost of scanning the index, plus the cost of accessing the table by ROWID. I won't go into how index range scans are costed but let's imagine the cost of the index range scan is 1 per row; we expect to find a match in 1 out of 12 cases, so the cost of the index scan is 1,000,000 / 12 = 83,333; plus the cost of accessing the table (assume 1 block read per access, we can't use multi-block reads here) = 83,333; Overall cost = 166,666.
As you can see, the cost of Plan 1 (full scan) is LESS than the cost of Plan 2 (index scan + access by rowid) - which means the CBO would choose the FULL scan.
If the assumptions made here by the optimiser are true, then in fact Plan 1 will be preferable and much more efficient than Plan 2 - which disproves the myth that FULL scans are "always bad".
The results would be quite different if the optimiser goal was FIRST_ROWS(n) instead of ALL_ROWS - in which case the optimiser would favour Plan 2 because it will often return the first few rows quicker, at the cost of being less efficient for the entire query.
The CBO builds a decision tree, estimating the costs of each possible execution path available per query. The costs are set by the CPU_cost or I/O_cost parameter set on the instance. And the CBO estimates the costs, as best it can with the existing statistics of the tables and indexes that the query will use. You should not tune your query based on cost alone. Cost allows you to understand WHY the optimizer is doing what it does. Without cost you could figure out why the optimizer chose the plan it did. Lower cost does not mean a faster query. There are cases where this is true and there will be cases where this is wrong. Cost is based on your table stats and if they are wrong the cost is going to be wrong.
When tuning your query, you should take a look at the cardinality and the number of rows of each step. Do they make sense? Is the cardinality the optimizer is assuming correct? Is the rows being return reasonable. If the information present is wrong then its very likely the optimizer doesn't have the proper information it needs to make the right decision. This could be due to stale or missing statistics on the table and index as well as cpu-stats. Its best to have stats updated when tuning a query to get the most out of the optimizer. Knowing your schema is also of great help when tuning. Knowing when the optimizer chose a really bad decision and pointing it in the correct path with a small hint can save a load of time.
Here is a reference for using EXPLAIN PLAN with Oracle: http://download.oracle.com/docs/cd/B19306_01/server.102/b14211/ex_plan.htm), with specific information about the columns found here: http://download.oracle.com/docs/cd/B19306_01/server.102/b14211/ex_plan.htm#i18300
Your mention of 'FULL' indicates to me that the query is doing a full-table scan to find your data. This is okay, in certain situations, otherwise an indicator of poor indexing / query writing.
Generally, with explain plans, you want to ensure your query is utilizing keys, thus Oracle can find the data you're looking for with accessing the least number of rows possible. Ultimately, you can sometime only get so far with the architecture of your tables. If the costs remain too high, you may have to think about adjusting the layout of your schema to be more performance based.
In recent Oracle versions the COST represent the amount of time that the optimiser expects the query to take, expressed in units of the amount of time required for a single block read.
So if a single block read takes 2ms and the cost is expressed as "250", the query could be expected to take 500ms to complete.
The optimiser calculates the cost based on the estimated number of single block and multiblock reads, and the CPU consumption of the plan. the latter can be very useful in minimising the cost by performing certain operations before others to try and avoid high CPU cost operations.
This raises the question of how the optimiser knows how long operations take. recent Oracle versions allow the collections of "system statistics", which are definitely not to be confused with statistics on tables or indexes. The system statistics are measurements of the performance of the hardware, mostly importantly:
How long a single block read takes
How long a multiblock read takes
How large a multiblock read is (often different to the maximum possible due to table extents being smaller than the maximum, and other reasons).
CPU performance
These numbers can vary greatly according to the operating environment of the system, and different sets of statistics can be stored for "daytime OLTP" operations and "nighttime batch reporting" operations, and for "end of month reporting" if you wish.
Given these sets of statistics, a given query execution plan can be evaluated for cost in different operating environments, which might promote use of full table scans at some times or index scans at others.
The cost is not perfect, but the optimiser gets better at self-monitoring with every release, and can feedback the actual cost in comparison to the estimated cost in order to make better decisions for the future. this also makes it rather more difficult to predict.
Note that the cost is not necessarily wall clock time, as parallel query operations consume a total amount of time across multiple threads.
In older versions of Oracle the cost of CPU operations was ignored, and the relative costs of single and multiblock reads were effectively fixed according to init parameters.
FULL is probably referring to a full table scan, which means that no indexes are in use. This is usually indicating that something is wrong, unless the query is supposed to use all the rows in a table.
Cost is a number that signals the sum of the different loads, processor, memory, disk, IO, and high numbers are typically bad. The numbers are added up when moving to the root of the plan, and each branch should be examined to locate the bottlenecks.
You may also want to query v$sql and v$session to get statistics about SQL statements, and this will have detailed metrics for all kind of resources, timings and executions.
We have about 10K rows in a table. We want to have a form where we have a select drop down that contains distinct values of a given column in this table. We have an index on the column in question.
To increase performance I created a little cache table that contains the distinct values so we didn't need to do a select distinct field from table against 10K rows. Surprisingly it seems doing select * from cachetable (10 rows) is no faster than doing the select distinct against 10K rows. Why is this? Is the index doing all the work? At what number of rows in our main table will there be a performance improvement by querying the cache table?
For a DB, 10K rows is nothing. You're not seeing much difference because the actual calculation time is minimal, with most of it consumed by other, constant, overhead.
It's difficult to predict when you'd start noticing a difference, but it would probably be at around a million rows.
If you've already set up caching and it's not detrimental, you may as well leave it in.
10k rows is not much... start caring when you reach 500k ~ 1 million rows.
Indexes do a great job, specially if you just have 10 different values for that index.
This depends on numerous factors - the amount of memory your DB has, the size of the rows in the table, use of a parameterised query and so forth, but generally 10K is not a lot of rows and particularly if the table is well indexed then it's not going to cause any modern RDBMS any sweat at all.
As a rule of thumb I would generally only start paying close attention to performance issues on a table when it passes the 100K rows mark, and 500K doesn't usually cause much of a problem if indexed correctly and accessed by such. Performance usually tends to fall off catastrophically on large tables - you may be fine on 500K rows but crawling on 600K - but you have a long way to go before you are at all likely to hit such problems.
Is the index doing all the work?
You can tell how the query is being executed by viewing the execution plan.
For example, try this:
explain plan for select distinct field from table;
select * from table(dbms_xplan.display);
I notice that you didn't include an ORDER BY on that. If you do not include ORDER BY then the order of the result set may be random, particularly if oracle uses the HASH algorithm for making a distinct list. You ought to check that.
So I'd look at the execution plans for the original query that you think is using an index, and at the one based on the cache table. Maybe post them and we can comment on what's really going on.
Incidentaly, the cache table would usually be implemented as a materialised view, particularly if the master table is generally pretty static.
Serious premature optimization. Just let the database do its job, maybe with some tweaking to the configuration (especially if it's MySQL, which has several cache types and settings).
Your query in 10K rows most probably uses HASH SORT UNIQUE.
As 10K most probably fit into db_buffers and hash_area_size, all operations are performed in memory, and you won't note any difference.
But if the query will be used as a part of a more complex query, or will be swapped out by other data, you may need disk I/O to access the data, which will slow your query down.
Run your query in a loop in several sessions (as many sessions as there will be users connected), and see how it performs in that case.
For future plans and for scalability, you may want to look into an indexing service that uses pure memory or something faster than the TCP DB round-trip. A lot of people (including myself) use Lucene to achieve this by normalizing the data into flat files.
Lucene has a built-in Ram Drive directory indexer, which can build the index all in memory - removing the dependency on the file system, and greatly increasing speed.
Lately, I've architected systems that have a single Ram drive index wrapped by a Webservice. Then, I have my Ajax-like dropdowns query into that Webservice for high availability and high speed - no db layer, no file system, just pure memory and if remote tcp packet speed.
If you have an index on the column, then all the values are in the index and the dbms never has to look in the table. It just looks in the index which just has 10 entries. If this is mostly read only data, then cache it in memory. Caching helps scalability and a lot by relieving the database of work. A query that is quick on a database with no users, might perform poorly if a 30 queries are going on at the same time.