I have a table partitioned by year, and I have the two local non-unique indexes below created on it:
idx1 : (year, wk, pd, sku)
idx2 : (sku, str )
My understanding is that idx2 is redundant and you could just create
one index with (year, wk, pd, sku, str) to take advantage of skip scans.
Any thoughts or comments?
Indexing the partition key is redundant, as partition pruning will effectively "index" that field, and pruning occurs before any index reads. At that point, skip scans will act normally, assuming the optimizer doesn't score the skip scan higher than a full partition scan.
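To illustrate, a minimal sketch of what that implies (the table name sales is hypothetical; the columns come from the question): since pruning already narrows the work to one year's partition, the partition key can simply be dropped from the combined index.

-- Hypothetical combined LOCAL index without the partition key (year);
-- partition pruning already restricts the scan to the right partition.
CREATE INDEX idx_combined ON sales (wk, pd, sku, str) LOCAL;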
Assuming you have a query specifying only sku and str, then idx2 is not redundant.
Using idx1, you've still got at least 52 weeks x ?? (don't know what pd is...) = 52 or more index sub-trees for an INDEX SKIP SCAN to look through, per partition.
An INDEX SKIP SCAN on idx1 will do 52+ index probes. An INDEX RANGE SCAN on idx2 will do one probe per partition (which you could cut down to one probe overall by making it a global index).
So, idx2 offers benefits. Whether those benefits justify the cost of a second index on the table depends on your usage patterns (writes vs reads) and performance goals.
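If the per-partition probes matter, note that in Oracle simply omitting LOCAL when creating an index on a partitioned table gives you a global nonpartitioned index. A sketch, again with the hypothetical table name sales:

-- One global index structure: a (sku, str) lookup costs a single probe
-- instead of one probe per partition.
CREATE INDEX idx2_global ON sales (sku, str);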
Say you have a hash partitioned table with a local index like this:
CREATE TABLE foo (
  bar NUMBER,
  baz NUMBER,
  a NUMBER
)
PARTITION BY HASH (bar) (
  PARTITION H0,
  PARTITION H1
);
-- LOCAL index
CREATE INDEX foo_n1 ON foo (baz, a) LOCAL;
Assume you know in advance which partition some data is in, and would like to query the table with a known value for baz (which is locally indexed).
Is there any real benefit from querying a direct partition (FROM foo PARTITION (H0)):
SELECT *
FROM foo PARTITION (H0)
WHERE 1=1
AND baz = 1234
Versus just querying all partitions FROM foo:
SELECT *
FROM foo
WHERE 1=1
AND baz = 1234
?
I would think not.
Filtering on BAZ = 1234 will result in a range scan over the btree index, which will find the first entry of 1234 in O(log n) time; all other entries will be adjacent and it is just a matter of navigating through them. The main work is finding the first entry in the index.
Even if you had 64 partitions, each with a million rows, for a grand total of 64 million rows, the difference between log(64,000,000) and log(1,000,000) to find the first entry of 1234 appears pretty immaterial.
Edit: as @astentx mentioned, not using the PARTITION syntax will result in 64 * log(1,000,000), while using PARTITION should result in log(1,000,000). Assuming base 10, this is a difference of 384 vs 6 comparisons, which is a lot bigger than my previous estimate.
It seems to me that the direct partition query would only be significant for index fast full scans and index skip scans, because these scans take O(n) to find the first entry in the index, so minimizing the total row count would have a larger material effect. It would likewise help full table scans, for the same reasoning. However, I can't see how this would help range scans and unique scans.
Is this correct?
Additionally, I'm not sure a relatively large reduction in the number of comparisons while searching an index will have a significant effect most of the time. My guess is that the query will spend most of its time doing table access for each rowid that is returned from the index, and that the 384 vs 6 comparisons to look up the correct item in the index would take up a relatively small amount of time. I have not tested it, however.
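One way to check would be to compare the two plans with DBMS_XPLAN (a sketch using foo and H0 from the question; the Pstart/Pstop columns in the plan output show which partitions are touched):

EXPLAIN PLAN FOR SELECT * FROM foo PARTITION (H0) WHERE baz = 1234;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

EXPLAIN PLAN FOR SELECT * FROM foo WHERE baz = 1234;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);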
I run costly simulations and store the resulting matrix as a BLOB in a sqlite3 database. This matrix represents a quantity of a physics system simulated at a certain temperature, with a certain boundary_condition, and for a certain matrix size size. Furthermore, I always use either a convergence or a number_of_steps to terminate my simulation, which I need to store as well. When I require this matrix, I query my database as follows:
SELECT * FROM matrices
WHERE temperature = ? AND size = ? AND boundary_condition = ?
AND number_of_steps <= ?
ORDER BY number_of_steps DESC LIMIT 1
Or
SELECT * FROM matrices
WHERE temperature = ? AND size = ? AND boundary_condition = ?
AND convergence >= ?
ORDER BY convergence ASC LIMIT 1
Right now, I have no primary key or indices. The lookups started becoming a bit slow and, well, I found out I'm doing full table scans each time. I looked around and concluded that a composite primary key of (temperature, boundary_condition, size) could be appropriate, and that indices on number_of_steps and convergence would give k * log(N) complexity, where N is the size of the database and k the number of output rows.
Is this (close to) optimal? An additional complexity I have is that such a composite primary key will be liable to change. If I suddenly have to include another required parameter in my simulation, I'd have to change the primary key. Is this a problem?
These queries still have to read all table rows with matching (temperature, boundary_condition, size) values to find the one with the largest/smallest number_of_steps/convergence value.
You could speed up these particular queries by having two indexes on (temperature, boundary_condition, size, number_of_steps) and (temperature, boundary_condition, size, convergence).
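A sketch of those two indexes (index names are made up; table and columns come from the question). Each lets SQLite descend straight to the matching (temperature, boundary_condition, size) group and read a single row off its edge:

-- Supports the number_of_steps query:
CREATE INDEX idx_steps ON matrices (temperature, boundary_condition, size, number_of_steps);
-- Supports the convergence query:
CREATE INDEX idx_conv ON matrices (temperature, boundary_condition, size, convergence);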
Primary key constraints enforce uniqueness and document how table rows are to be identified.
If, at the moment, your only goal is performance, it would be easier to have a separate index, which can be easily dropped and recreated.
I've been having some difficulty scaling up the application and decided to ask a question here.
Consider a relational database (say mysql). Let's say it allows users to make posts and these are stored in the post table (has fields: postid, posterid, data, timestamp). So, when you go to retrieve all posts by you sorted by recency, you simply get all posts with posterid = you and order by date. Simple enough.
This process will use timestamp as the index since it has the highest cardinality and correctly so. So, beyond looking into the indexes, it'll take literally 1 row fetch from disk to complete this task. Awesome!
But let's say there have been 1 million more posts (in the system) by other users since you last posted. Then, in order to get your latest post, the database will hit the timestamp index again, and it's not like we know how many posts have happened since then (or should we at least manually estimate and set a preferred key)? Then we've wasted scanning a million and one rows just to fetch a single row.
Additionally, a set of posts from multiple arbitrary users would be one of the use cases, so I cannot make fields like userid_timestamp to create a sub-index.
Am I seeing this wrong? Or what must be changed fundamentally from the application to allow such operation to occur at least somewhat efficiently?
Indexing
If you have a query: ... WHERE posterid = you ORDER BY timestamp [DESC], then you need a composite index on {posterid, timestamp}.
Finding all posts of a given user is done by a range scan on the index's leading edge (posterid).
Finding a user's oldest/newest post can be done in a single index seek, whose cost is proportional to the B-Tree height, which is proportional to log(N), where N is the number of indexed rows.
To understand why, take a look at Anatomy of an SQL Index.
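A minimal sketch in MySQL (the index name is made up; table and columns come from the question):

-- Composite index matching the WHERE clause and the ORDER BY:
CREATE INDEX post_user_ts ON post (posterid, `timestamp`);

-- Newest post of one user: a single seek down the posterid subtree.
SELECT * FROM post
WHERE posterid = ?
ORDER BY `timestamp` DESC
LIMIT 1;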
Clustering
The leaves of a "normal" B-Tree index hold "pointers" (physical addresses) to the indexed rows, while the rows themselves reside in a separate data structure called the "table heap". The heap can be eliminated by storing rows directly in the leaves of the B-Tree, which is called clustering. This has its pros and cons, but if you have one predominant kind of query, eliminating the table heap access through clustering is definitely something to consider.
In this particular case, the table could be created like this:
CREATE TABLE T (
  posterid INT,
  `timestamp` DATETIME,
  data VARCHAR(50),
  PRIMARY KEY (posterid, `timestamp`)
);
MySQL/InnoDB clusters all its tables and uses the primary key as the clustering key. We haven't used the surrogate key (postid), since secondary indexes in clustered tables can be expensive and we already have a natural key. If you really need the surrogate key, consider making it an alternate key and keeping the clustering established through the natural key.
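A sketch of that variant, keeping postid as an alternate key (InnoDB still clusters on the primary key):

CREATE TABLE T (
  posterid INT,
  `timestamp` DATETIME,
  postid INT NOT NULL,
  data VARCHAR(50),
  PRIMARY KEY (posterid, `timestamp`), -- clustering key
  UNIQUE KEY (postid)                  -- alternate (surrogate) key
);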
For queries like
where posterid = 5
order by timestamp
or
where posterid in (4, 578, 222299, ...etc...)
order by timestamp
make an index on (posterid, timestamp) and the database should pick it all by itself.
Edit - I just tried this with MySQL:
CREATE TABLE `posts` (
  `id` INT(11) NOT NULL,
  `ts` INT NOT NULL,
  `data` VARCHAR(100) NULL DEFAULT NULL,
  INDEX `id_ts` (`id`, `ts`),
  INDEX `id` (`id`),
  INDEX `ts` (`ts`),
  INDEX `ts_id` (`ts`, `id`)
)
ENGINE=InnoDB;
I filled it with a lot of data, and
explain
select * from posts where id = 5 order by ts
picks the id_ts index
Assuming your database is implemented with hash tables - yes. Hash tables are not ordered, and you have no way but to iterate over all elements in order to find the maximal one.
However, if you use some ordered data structure, such as a B+ tree (which is actually pretty well optimized for disks and thus databases), it is a different story.
You can store elements in your B+ tree ordered by user (primary comparator) and date (secondary comparator, descending). Once you have this structure, finding the first element matching the primary criterion (user id) can be achieved in O(log(n)) disk seeks.
I am not familiar with the internals of database implementations, but AFAIK, some of them do allow you to create an index based on a B+ tree - and by doing so, you can find the last post of a user more efficiently.
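For example, MySQL's InnoDB stores its indexes as B+ tree variants, and MySQL 8+ even allows the date component to be stored descending, matching the ordering described above (a sketch; the index name is made up and the post table comes from the question):

CREATE INDEX post_user_recent ON post (posterid, `timestamp` DESC);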
P.S.
To be exact, the concept of a "greatest" element, or ordering, is not well defined in relational algebra: there is no max operator. To get the max element of a table R with a single column a, one actually has to build the Cartesian product of the table with itself and find the entry that way. There is no max or sort operator in strict relational algebra (though both exist in SQL).
(Assuming set, and not multiset semantics):
MAX = R \ Project(Select(R x R, R1.a < R2.a),R1.a)
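The same idea can be written in SQL under set semantics without using MAX() (standard SQL; EXCEPT is spelled MINUS in Oracle):

-- Every value of a that is smaller than some other value is removed;
-- what remains is the maximum.
SELECT a FROM R
EXCEPT
SELECT R1.a FROM R AS R1, R AS R2 WHERE R1.a < R2.a;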
In general, every index on a table slows down INSERTs into the table
by a factor of three; two indexes generally make the insert twice as
slow as one index. (Yet, a two-part single index is not much worse
than a single-part single index).
I got this from the book Oracle 9i Performance Tuning Tips and Techniques by Richard Niemiec (Osborne Oracle Press Series).
What do the following terms mean?
Two-part single index
Single part single index
Are there any more kinds of indexes?
By two-part index I presume Rich means a composite index, that is, an index built on multiple columns. Like this:
create index t23_t_idx on t23 (col4, col2);
Whereas a single part index indexes a single column:
create index t23_s_idx on t23(col1);
The indexes created above are b-tree indexes. Oracle has many other types of indexes. For starters, indexes can be unique, in which case they only allow one instance of a given value in the indexed column (or of a given combination of values for composite indexes).
There are also bit-mapped indexes, which impose a much higher performance penalty on DML but which speed up certain types of query; it is rare to come across bitmapped indexes outside of data warehouses.
We can create function-based indexes which allow us to index the results of a deterministic function (i.e. one that is guaranteed to produce the same result for a given input). This is how we can build an index on a date column which ignores the time element:
create index t23_fbi_idx on t23( trunc(col_34));
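A query whose predicate uses the same expression can then pick up that index (a sketch; the date literal is arbitrary):

SELECT *
FROM t23
WHERE trunc(col_34) = DATE '2020-01-01';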
We can also build domain indexes on text columns. And there are special indexes for partitioned tables.
All of these are covered in more detail in the documentation. Find out more.
I would assume that the author is referring to a composite index when he talks about a "two-part single index". The term "composite index" is a far more common way to refer to an index on multiple columns of a table.
If you have a single composite index on two columns, there is only one index structure that needs to be maintained during an insert so the overhead of index maintenance is not much different than the overhead of maintaining one single-column index.
CREATE TABLE t1 (
  col1 NUMBER,
  col2 NUMBER,
  col3 NUMBER
);

CREATE INDEX t1_composite_idx
  ON t1( col1, col2 );
On the other hand, if you create separate indexes on each column individually, Oracle has to maintain two separate index structures, which roughly doubles the amount of index maintenance that is needed:
CREATE TABLE t1 (
  col1 NUMBER,
  col2 NUMBER,
  col3 NUMBER
);

CREATE INDEX t1_idx1
  ON t1( col1 );

CREATE INDEX t1_idx2
  ON t1( col2 );
I would be rather leery, however, of the "factor of three" that the author quotes. There are a lot of variables that come into play that are not captured by that particular rule of thumb. It's useful to remember that adding indexes imposes potentially substantial costs on insert operations, but it's much more useful to measure the actual cost that you are imposing when you are weighing the trade-offs of creating another index.
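A crude way to measure it, assuming SQL*Plus and the t1 table above (the row count is arbitrary): time a bulk insert with no indexes, then recreate the indexes one at a time and repeat.

SET TIMING ON
INSERT INTO t1
SELECT level, level, level FROM dual CONNECT BY level <= 100000;
ROLLBACK;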
Are there any more kinds of indexes?
As for your last question-- Oracle has quite a few different types of indexes (particularly if we are counting composite indexes as a different type of index). This answer has been solely dealing with b*-tree indexes which are what people normally mean when they refer to "indexes" without qualifiers. Oracle, however, supports a number of different types of indexes-- b*-tree indexes, bitmap indexes, Text indexes, etc. It creates LOB indexes. It supports user-defined extensible indexes. And within each type of index, there are often dozens of different options. For example, you can create a function-based b*-tree index or a bitmap join index, you can specify custom lexers for an Oracle Text index, or you can define your own index structure for your own custom type.
Since the author does not seem to ever actually define the terms, I can only guess that a two-part single index means a composite index comprising two columns and a single-part single index means an index based on a single column.
I'm working with a new Oracle DB, with one table having the following indexes:
Index 1: ColA, ColB
Index 2: ColA
Is the second index redundant, and will it have a negative impact on performance?
Google is my best friend:
http://www.orafaq.com/node/926
The main point of this article is:
If 2 indexes (I1 and I2) exist for a table and
  the number of columns in index I1 is less than or equal to the number of columns in index I2 and
  index I1 has the same columns in the same order as the leading columns of index I2
Then
  If index I1 is UNIQUE then
    If index I2 is used to support a Foreign Key or for Index Overload then
      Do Nothing
    Else
      Index I2 can be DROPPED
    End If
  Else
    Index I1 can be DROPPED
  End If
End If
And I agree with that!
In fact, search for "duplicate indexes" in Google to find different kinds of answers.
The second index is different and is not redundant per se.
How about this query:
SELECT DISTINCT ColA FROM TABLE WHERE ColA IS NOT NULL;
Oracle can answer this question entirely from Index 2. Now, Index 2 would be expected to be smaller (fewer blocks) than Index 1. This means it is a better index for the above query.
If your application never runs a query that suits Index 2 better than Index 1, then it is redundant for your application.
Indexes are always a performance tradeoff. When an insert, update or delete is performed there is extra work to do in order to maintain each additional index.
Is this more than compensated for by the increased performance provided by the index? Depends on your application and data usage.
The second index is sort of redundant - any operation that uses Index 2 could use Index 1. Also, writes will be slightly slower, since there is another index to update.
That said, Index 2 is not entirely redundant: it could be a bit faster, since the index itself is probably going to be significantly smaller.
There is a chance that, if your statistics go out of date, the optimizer might choose Index 2 when Index 1 is needed. (A hint to the optimizer would solve that, of course.)
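For example, such a hint might look like this in Oracle (the table name and index name here are hypothetical):

SELECT /*+ INDEX(t idx_cola_colb) */ *
FROM some_table t
WHERE ColA = :a;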