Say you have a hash partitioned table with a local index like this:
CREATE TABLE foo (
  bar NUMBER,
  baz NUMBER,
  a   NUMBER
)
PARTITION BY HASH (bar) (
  PARTITION H0,
  PARTITION H1
);
-- LOCAL index
CREATE INDEX foo_n1 ON foo (baz, a) LOCAL;
Assume you know in advance which partition some data is in, and would like to query the table with a known value for baz (which is locally indexed).
Is there any real benefit from querying a direct partition (FROM foo PARTITION (H0)):
SELECT *
FROM foo PARTITION (H0)
WHERE 1=1
AND baz = 1234
Versus just querying all partitions FROM foo:
SELECT *
FROM foo
WHERE 1=1
AND baz = 1234
?
I would think not.
Filtering on BAZ = 1234 will result in a range scan over the B-tree index, which will find the first entry for 1234 in O(log n) time; all other matching entries are adjacent, so it is just a matter of navigating through them. The main work is finding the first entry in the index.
Even if you had 64 partitions, each with a million rows, for a grand total of 64 million rows, the difference between log(64,000,000) and log(1,000,000) to find the first entry of 1234 appears pretty immaterial.
Edit: as @astentx mentioned, not using the PARTITION syntax will result in 64 * log(1,000,000) comparisons, while using PARTITION should result in log(1,000,000). Assuming base 10, this is a difference of 384 vs 6 comparisons, which is a lot bigger than my previous estimate.
It seems to me that the direct partition query would only be significant for index fast full scans and index skip scans, because those scans take O(n) to find the first entry in the index, so minimizing the total row count would have a larger material effect. It would likewise help full table scans, for the same reason. However, I can't see how it would help range scans and unique scans.
Is this correct?
Additionally, I'm not sure a relatively large reduction in the number of comparisons while searching an index will have a significant effect most of the time. My guess is that the query will spend most of its time doing table access for each rowid returned from the index, and that the 384 vs 6 comparisons to look up the correct item in the index would take up a relatively small amount of time. I have not tested it, however.
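A way to test it would be to compare the two execution plans; the Pstart/Pstop columns in the plan output show how many partitions each query touches (a sketch, using the table from the top of the question):

EXPLAIN PLAN FOR
SELECT * FROM foo WHERE baz = 1234;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
-- expect the partition iterator to range over all partitions here

EXPLAIN PLAN FOR
SELECT * FROM foo PARTITION (H0) WHERE baz = 1234;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
-- expect Pstart = Pstop = 1: a single partition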
I need some help on how to perform auto-partitioning on an integer column, similar to how we do it on a date column, like PARTITION BY RANGE (DIM_DT_ID) INTERVAL (NUMTODSINTERVAL(1,'DAY')).
I have 90 million rows and performance is poor; our SLA on the query is 2 seconds. I would like to partition the table. What is the best approach, and how do I enable auto-partitioning on an integer column?
Our query will always filter by these columns, like:
select * from <tbname>
where ObjectID = 1346785
and patentnumber = 23456;
"I'm just making an example here, as I can't paste the original query for legal reasons."
Fair enough, but the advice we give you will only be as good as the information you give us. So far, nothing you have posted suggests you need Partitioning.
The pasted query would perform well with a compound index, and would probably benefit from compression of the leading column:
create index your_table_lookup_index
on your_table(ObjectID, patentnumber) compress 1;
If that's a unique combination then make the index unique.
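For example (same hypothetical names as above):

create unique index your_table_lookup_index
on your_table (ObjectID, patentnumber);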
how do I enable auto-partitioning on an integer column
However, if you think you do have a genuine use case for Partitioning then we can use Interval Partitioning with integers as well as dates. This statement will create a table partitioned on objectid with a partition for every ten values.
create table your_table (
  objectid     number,
  patentnumber number,
  created_date date
)
partition by range (objectid)
interval (10)
(
  partition p_00010 values less than (10)
);
On your posted figures that would be about 400 partitions with around 225,000 rows per partition. Is that a good choice? Who can tell? You know your data and your use cases, we don't: perhaps a partition per objectid (i.e. with interval (1)) would be better.
You already have a table, so you need to split it into partitions. The standard way of doing this (sketched below) would be:
1. Create a new table with your partitioning strategy (like above) but with a single default partition ranged for values less than (MAXVALUE).
2. Use partition exchange to move the existing table data into the new structure.
3. Drop the old table and rename the new table to the old name; resolve foreign keys and other dependencies.
4. Iteratively split the default partition into the required ranges.
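A sketch of those steps (hypothetical names; the two tables' column lists must match exactly or the exchange will fail). Note the catch-all MAXVALUE partition forces plain RANGE partitioning here, since an INTERVAL-partitioned table cannot have a MAXVALUE partition:

create table your_table_new (
  objectid     number,
  patentnumber number,
  created_date date
)
partition by range (objectid)
(
  partition p_max values less than (MAXVALUE)
);

-- dictionary-level segment swap, no data movement:
alter table your_table_new
  exchange partition p_max with table your_table;

drop table your_table;
alter table your_table_new rename to your_table;

-- repeat for each boundary you need:
alter table your_table
  split partition p_max at (10)
  into (partition p_00010, partition p_max);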
This is a fairly time-consuming process. You have tagged your question [oracle12c]; if you're on Oracle 12c R2 you should definitely look at its online conversion mechanism, which turns the whole thing into a single ALTER TABLE command.
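A minimal sketch of that command (12.2+, same hypothetical table as above):

-- convert the existing table in place, while it stays available:
alter table your_table
  modify partition by range (objectid) interval (10)
  ( partition p_00010 values less than (10) )
  online;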
Remember that Partitioning for performance is a tricky game. While it can improve queries which return a large number of rows aligned with the partition key, it may make no difference to other queries, or even impair their performance. In particular, any query which does not include the partition key (objectid in your case) will likely perform worse after the table is partitioned.
Final aside: as you know but for the benefit of future Seekers, Partitioning is a chargeable extra to the Enterprise Edition license. We're not allowed to use it unless we've paid for it.
I have a table which is partitioned by year, and the two local non-unique indexes below created on it:
idx1 : (year, wk, pd, sku)
idx2 : (sku, str )
My understanding is that idx2 is redundant and you could just create one index on (year, wk, pd, sku, str) to take advantage of skip scans.
Any thoughts or comments?
Indexing the partition key is redundant, as partition pruning will effectively "index" that field, and pruning occurs before any index reads. At that point, skip scans will act normally, assuming the optimizer doesn't score the skip scan higher than a full partition scan.
Assuming you have a query specifying only sku and str, then idx2 is not redundant.
Using idx1, you've still got at least 52 weeks x ?? (don't know what pd is...) = 52 or more index sub-trees for an INDEX SKIP SCAN to look through, per partition.
An INDEX SKIP SCAN on idx1 will do 52+ index probes. An INDEX RANGE SCAN on idx2 will do one probe per partition (which you could cut down to one overall probe by making it a global index, as illustrated below).
So, idx2 offers benefits. Whether those benefits justify the cost of a second index on the table depends on your usage patterns (writes vs reads) and performance goals.
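To illustrate (the table name is hypothetical; the columns are from the question):

-- local: one index sub-tree per partition, so one probe per partition
create index idx2 on sales_history (sku, str) local;

-- global (the default when LOCAL is omitted): a single B-tree,
-- so a single probe regardless of the partition count
create index idx2_g on sales_history (sku, str);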
I have to aggregate a huge amount of data, with a where clause, using the query below.
What I am doing is this: I have three tables; one contains terms, the second contains user terms, and the third contains the correlation factor between a term and a user term.
I want to calculate the similarity between the sentence the user inserted and the already existing sentences, by summing the correlation factors between the sentences' terms, and to keep the results greater than 0.5.
The problem is that this query takes more than 15 minutes, because I have huge tables.
Any suggestions to improve performance, please?
INSERT INTO plag_sentence_similarity
SELECT t.sentence_id,
       u.sentence_id,
       LEAST(SUM(c.correlation_factor) / t.sentence_length,
             SUM(c.correlation_factor) / u.sentence_length),
       t.isn,
       u.isn
FROM   plag_term_correlations3 c,
       plag_terms t,
       plag_user_terms u
WHERE  t.term_root = c.term1
AND    u.term_root = c.term2
AND    u.isn = 123
GROUP BY u.sentence_id, t.sentence_id, t.isn,
         t.sentence_length, u.sentence_length, u.isn
HAVING LEAST(SUM(c.correlation_factor) / t.sentence_length,
             SUM(c.correlation_factor) / u.sentence_length) > 0.5;
plag_terms contains more than 50 million records and plag_term_correlations3 contains 500,000.
If you have a sufficient amount of free disk space, then create a materialized view:
- over the join of the three tables,
- fast-refreshable on commit (don't use the ANSI join syntax here, even if tempted to do so, or the mview won't be fast-refreshable ... a strange bug in Oracle),
- with query rewrite enabled,
- properly physically organized for quick calculations.
The query rewrite is optional. If you can modify the above insert-select, then you can just select from the materialized view instead of selecting from the join of the three tables.
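A sketch of such a materialized view (the column lists are assumptions based on the query; each base table needs a materialized view log, and the select must carry every table's ROWID to stay fast-refreshable):

create materialized view log on plag_terms with rowid;
create materialized view log on plag_user_terms with rowid;
create materialized view log on plag_term_correlations3 with rowid;

create materialized view plag_join_mv
  refresh fast on commit
  enable query rewrite
as
select t.rowid            t_rid,
       u.rowid            u_rid,
       c.rowid            c_rid,
       t.sentence_id      t_sentence_id,
       u.sentence_id      u_sentence_id,
       c.correlation_factor,
       t.sentence_length  t_sentence_length,
       u.sentence_length  u_sentence_length,
       t.isn              t_isn,
       u.isn              u_isn
from   plag_term_correlations3 c,
       plag_terms t,
       plag_user_terms u
where  t.term_root = c.term1
and    u.term_root = c.term2;

The insert-select can then aggregate over plag_join_mv directly (or, with query rewrite, stay exactly as it is).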
As for the physical organization, consider:
- hash partitioning by Plag_User_Terms.ISN (with a sufficiently high number of partitions; don't hesitate to partition your table into e.g. 1024 partitions, if it seems reasonable) if you want to do a bulk calculation over all values of ISN,
- single-table hash clustering by Plag_User_Terms.ISN (sketched below) if you want to run your calculation over a single ISN.
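A sketch of the single-table hash cluster (the SIZE and HASHKEYS values are placeholders to be derived from your real data volumes, and the column types are assumptions):

create cluster plag_user_terms_cl (isn number)
  size 8192 single table hashkeys 1024;

create table plag_user_terms_clustered (
  isn             number,
  term_root       varchar2(100),
  sentence_id     number,
  sentence_length number
)
cluster plag_user_terms_cl (isn);

With this organization all rows for one ISN live in the same few blocks, so a lookup by ISN needs no index at all.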
If you don't have the spare disk space, then just hint your query to:
- either use nested loop joins, since the number of rows processed seems to be quite low (judging by the estimates in the execution plan),
- or full-scan the plag_term_correlations3 table in parallel.
Bottom line: constrain your tables with foreign keys, check constraints, not-null constraints, unique constraints, everything! The Oracle optimizer is capable of using most of this information to its advantage, as are the people who tune SQL queries.
I've been having some difficulty scaling up the application and decided to ask a question here.
Consider a relational database (say MySQL). Let's say it allows users to make posts, and these are stored in the post table (fields: postid, posterid, data, timestamp). So when you go to retrieve all posts by you, sorted by recency, you simply get all posts with posterid = you and order by date. Simple enough.
This process will use the timestamp index, since it has the highest cardinality, and correctly so. So, beyond looking into the index, it'll take literally one row fetch from disk to complete this task. Awesome!
But let's say there have been 1 million more posts (in the system) by other users since you last posted. Then, in order to get your latest post, the database will peg the timestamp index again, and it's not like it knows how many posts have happened since then (or should we at least manually estimate and set a preferred key?). We will then have wasted looking through a million and one rows just to fetch a single row.
Additionally, a set of posts from multiple arbitrary users would be one of the use cases, so I cannot make fields like userid_timestamp to create a sub-index.
Am I seeing this wrong? Or what must be changed fundamentally from the application to allow such operation to occur at least somewhat efficiently?
Indexing
If you have a query: ... WHERE posterid = you ORDER BY timestamp [DESC], then you need a composite index on {posterid, timestamp}.
Finding all posts of a given user is done by a range scan on the index's leading edge (posterid).
Finding a user's oldest/newest post can be done in a single index seek, whose cost is proportional to the B-Tree height, which is proportional to log(N), where N is the number of indexed rows.
To understand why, take a look at Anatomy of an SQL Index.
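For example, getting the newest post of one user is a single descending traversal of that index, regardless of how many posts other users have added since (a sketch; the table and user id are hypothetical):

-- served entirely by the {posterid, timestamp} index
SELECT *
FROM posts
WHERE posterid = 123
ORDER BY `timestamp` DESC
LIMIT 1;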
Clustering
The leaves of a "normal" B-Tree index hold "pointers" (physical addresses) to the indexed rows, while the rows themselves reside in a separate data structure called the "table heap". The heap can be eliminated by storing rows directly in the leaves of the B-Tree, which is called clustering. This has its pros and cons, but if you have one predominant kind of query, eliminating the table heap access through clustering is definitely something to consider.
In this particular case, the table could be created like this:
CREATE TABLE T (
posterid int,
`timestamp` DATETIME,
data VARCHAR(50),
PRIMARY KEY (posterid, `timestamp`)
);
MySQL/InnoDB clusters all its tables and uses the primary key as the clustering key. We haven't used a surrogate key (postid), since secondary indexes in clustered tables can be expensive and we already have a natural key. If you really need the surrogate key, consider making it an alternate key and keeping the clustering established through the natural key.
For queries like
where posterid = 5
order by timestamp
or
where posterid in (4, 578, 222299, ...etc...)
order by timestamp
make an index on (posterid, timestamp) and the database should pick it all by itself.
Edit: I just tried this with MySQL.
CREATE TABLE `posts` (
`id` INT(11) NOT NULL,
`ts` INT NOT NULL,
`data` VARCHAR(100) NULL DEFAULT NULL,
INDEX `id_ts` (`id`, `ts`),
INDEX `id` (`id`),
INDEX `ts` (`ts`),
INDEX `ts_id` (`ts`, `id`)
)
ENGINE=InnoDB
I filled it with a lot of data, and
explain
select * from posts where id = 5 order by ts
picks the id_ts index
Assuming you use hash tables to implement your database: yes. Hash tables are not ordered, and you have no way other than iterating over all elements to find the maximum.
However, if you use some ordered data structure, such as a B+ tree (which is actually pretty well optimized for disks and thus databases), it is a different story.
You can store elements in your B+ tree ordered by user (primary order/comparator) and date (secondary comparator, descending). Once you have this structure, finding the first element matching the primary criterion (user id) takes O(log(n)) disk seeks, and that element is the user's latest post.
I am not familiar with the internals of specific databases, but AFAIK some of them do allow you to create an index based on a B+ tree, and by doing so you can find the last post of a user more efficiently.
P.S.
To be exact, the concept of a "greatest" element or an ordering is not well defined in relational algebra; there is no max operator. To get the max element of a table R with a single column a, one actually has to take the Cartesian product of the table with itself and subtract away every entry that is smaller than another. There is no max or sort operator in strict relational algebra (though both exist in SQL).
(Assuming set, and not multiset, semantics, and writing R1, R2 for two renamed copies of R):
MAX = R \ Project(Select(R1 x R2, R1.a < R2.a), R1.a)
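In SQL, the same set difference reads (set semantics assumed; EXCEPT is standard SQL, spelled MINUS in Oracle):

-- every value smaller than some other value is subtracted away;
-- only the maximum survives
SELECT a FROM R
EXCEPT
SELECT R1.a
FROM R AS R1, R AS R2
WHERE R1.a < R2.a;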
I'm Working with a new Oracle DB, with one table having the following indexes:
Index 1: ColA, ColB
Index 2: ColA
Is the second index redundant, and will this have a negative impact on performance?
Google is my best friend:
http://www.orafaq.com/node/926
The main point of this article is:
If 2 indexes (I1 and I2) exist for a table and
   the number of columns in index I1 is less than or equal to the number of columns in index I2 and
   index I1 has the same columns in the same order as the leading columns of index I2
Then
   If index I1 is UNIQUE then
      If index I2 is used to support a Foreign Key or for Index Overload then
         Do Nothing
      Else
         Index I2 can be DROPPED
      End If
   Else
      Index I1 can be DROPPED
   End If
End If
And I agree with that!
In fact, search for "duplicate indexes" on Google to see the different kinds of answers.
The second index is different and is not redundant per se.
How about this query:
SELECT DISTINCT ColA FROM the_table WHERE ColA IS NOT NULL;
Oracle can answer this query entirely from Index 2. Now, Index 2 would be expected to be smaller (fewer blocks) than Index 1. This makes it a better index for the above query.
If your application never does a query that suits Index 2 better than Index 1, then it is redundant for your application.
Indexes are always a performance tradeoff. When an insert, update or delete is performed there is extra work to do in order to maintain each additional index.
Is this more than compensated for by the increased performance provided by the index? Depends on your application and data usage.
The second index is sort of redundant: any operation that uses Index 2 could use Index 1. Also, writes will be slightly slower, since there is another index to update.
That said, Index 2 is not entirely redundant, as it could be a bit faster: the index itself is probably going to be significantly smaller.
There is a chance that if your statistics go out of date, the optimizer might choose Index 2 when Index 1 is needed. (A hint to the optimizer would solve that, of course; see below.)
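For example (table name, alias, and bind variables are hypothetical):

-- force the optimizer to use Index 1 even if stale statistics
-- make Index 2 look cheaper:
SELECT /*+ INDEX(t index1) */ *
FROM   the_table t
WHERE  ColA = :a
AND    ColB = :b;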