Full table scan when querying against an indexed column - performance

I created an index for one table, a simple index just like that:
CREATE INDEX IDX_TRANSACAO_NOVA_STATUS ON TRANSACAO_NOVA(STATUS) TABLESPACE COMVENIF;
This table has 1000K registers insinde and the status table just 5 or 6 possible values. After created the index i expected that the query bellow would have a better performance:
select * from transacao_nova tn where tn.status = 'XXX'
but, the explain plan still show me a full scan with 16.000 cost.
any help? i'm not a dba but i need to improve this performance.
thanks in advance.

If there are only 5 or 6 different status values and a million records the query optimizer may be deciding it is not worth using the index to do a range scan that would still return a substantial number of all the records in the table.
You might look into using an index-clustered table for this application.

If data in the status column es skewed (not uniform: some values appear very often and others appear very rarely), you can accelerate queries for the rare values by refreshing statistics (and verifying that you are calculating a histogram for the status column. This will make Oracle use the index in the cases in which it is more efficient.
http://docs.oracle.com/cd/E11882_01/server.112/e16638/stats.htm#autoId12
Be aware that automatically determining if a column needs a histogram is not a good idea as it may lead to inconsistent behaviour. It is better to manually specify histograms when needed. Also, histograms affect every query that uses those columns, so they should be collected with care.

You might need to generate new statistics on the table.
http://docs.oracle.com/cd/B19306_01/server.102/b14211/stats.htm

A common mistake is to assume that an index range scan will be better than a full scan because you only want some "small" fraction of the total rows in the table. But if the rows you want are scattered throughout the table's storage extents, locating them by an index lookup can be slower than just scanning the entire table. I can't say for sure that's the case in your situation, but it's a possibility.
For a more in-depth discussion of this topic I recommend this paper.

Related

Bind variables results in full table scan in Oracle

Checking the query cost on a table with 1 million records results in full table scan while the same query in oracle with actual values results in significant lesser cost.
Is this expected behaviour from Oracle ?
Is there a way to tell Oracle not to scan the full table ?
The query is scanning the full table when bind variables are used:
The query cost reduces significantly with actual variables:
This is a pagination query. You want to retrieve a handful of records from the table, filtering on their position in the filtered set. Your projection includes all the columns of the table, so you need to query the table to get the whole row. The question is, why do the two query variants have different plans?
Let's consider the second query. You are passing hard values for the offsets, so the optimizer knows that you want the eleven most recent rows in the sorted set. The set is sorted by an indexed column. The most important element is that the optimizer knows you want 11 rows. 11 is a very small sliver of one million, so using an indexed read to get the required rows is an efficient way of doing things. The path starts at the far end of the index, reads the last eleven entries and retrieves the rows.
Now, your first query has bind variables for the starting and finishing offsets and also for the number of rows to be returned. This is crucial: the optimizer doesn't know whether you want to return eleven rows or eleven thousand rows. So it opts for a very high cardinality. The reason for this is that index reads perform very badly for retrieving large numbers of rows. Full table scans are the best way of handling big slices of our tables.
Is this expected behaviour from Oracle ?
Now you understand this you will can see that the answer to this question is yes. The optimizer makes the best decision it can with the information we give it. When we provide hard values it can be very clever. When we provide vague data it has to guess; sometimes its guesses aren't the ones we expected.
Bind variables are very useful for running the same query with different values when the expected result set is similar. But using bind variables to specify ranges means the result sets can potentially vary tremendously in size.
Is there a way to tell Oracle not to scan the full table ?
If you can fix the pagesize, thus removing the :a2 parameter, that would allow the optimizer to produce a much more accurate plan. Alternatively, if you need to vary the pagesize within a small range (say 10 - 100) then you could try a /*+ cardinality (100) */ hint in the query; provided the cardinality value is within the right order of magnitude it doesn't have to be the precise value.
As with all performance questions, the devil is in the specifics. So you need to benchmark various performance changes and choose the best fit for your particular use case(s).

Slowly changing dimension join performance

General Overview: I have an Oracle table 'product' that contains approximately 80 million records and I would like to improve the performance of joins that use this table. In most cases we are interested in a very small subset of records from (table) 'product' with (column) 'valid_until' date (value) 'mm/dd/9999'.
Possible solutions:
Partition 'mm/dd/9999' and use partition exchange to quickly load new data.
Use an index on 'valid_until' date.
Do you guys have any other possible Oracle solutions or ideas?
Based on needing to find 1% of records, I would expect an index to be adequate. It might pay to include the PK of the table as well if the query is just to find that for the current products.
If there is not a need to identify records by other valid_until dates then it might be worth using Oracle's equivalent of a partial index by indexing on:
case value_until
when date '...whatever the date is...'
then valid_until
else null
end
... but that would mean changing the schema or the tool that generates the queries or both.
You might keep an eye on the table's statistics to make sure that the cardinality of the selected rows is subject to a reasonably accurate estimation.
I wouldn't go for a partition-based solution as a first choice, as the overhead of row-migration during the update of the valid_until values would be fairly high, but if an index cannot deliver the query performance then by all means try.

Oracle: Possible to replace a standard non-unique single column index with an unique combined index?

I´m currently working on optimzing my database schema in regards of index structures. As I´d like to increase my DDL performance I´m searching for potential drop candidates on my Oracle 12c system. Here´s the scenario in which I don´t know what the consequences for the query performance might be if I drop the index.
Given two indexes on the same table:
- non-unique, single column index IX_A (indexes column A)
- unique, combined index UQ_AB (indexes column A, then B)
Using index monitoring I found that the query optimizer didn´t choose UQ_AB, but only IX_A (probably because it´s smaller and thus faster to read). As UQ_AB contains column A and additionally column B I´d like to drop IX_A. Though I´m not sure if I get any performance penalties if I do so. Does the higher selectivity of the combined unique index have any influence on the execution plans?
It could do, though it's quite likely to be minor (usually). Of course it depends on various things, for example how large the values in column B are.
You can look at various columns in USER_INDEXES to compare the two indexes, such as:
BLEVEL: tells you the "height" of the index tree (well, height is BLEVEL+1)
LEAF_BLOCKS: how many data blocks are occupied by the index values
DISTINCT_KEYS: how "selective" the index is
(You need to have analyzed the table first for these to be accurate). That will give you an idea of how much work Oracle needs to do to find a row using the index.
Of course the only way to really be sure is to benchmark and compare timings or even trace output.

Poor Performance of Mondrian w/ Degenerate Dimensions

I have an application that collects performance metrics and stores them in a datamart. I then use Mondrian to enable analysis and ad-hoc exploration of the data. I'm collecting about 5e6 rows per day and total size of the METRIC table is about 300M rows.
We "color" our data based on the metrics comparison to an SLA. There are exactly 5 distinct values for color. When we do simple MDX queries to get, for example, a color distribution of the data for a specific date range, say 1 day, we see queries like below:
2014-06-11 23:17:08,042 DEBUG [sql] - 223: SqlTupleReader.readTuples
[[Color].[Color]]: executing sql [select "METRIC"."COLOR" as "c0"
from "METRIC" "METRIC" group by "METRIC"."COLOR" order by
"METRIC"."COLOR" ASC NULLS LAST] 2014-06-11 23:17:58,747 DEBUG [sql] -
223: , exec 50704 ms
In order to improve performance, the datamart includes aggregate tables at the hour and day levels, and both aggregate tables include the COLOR column.
I understand that Mondrian is very dependent on the underlying database performance, but there is really no way to tune this. I can create an index on COLOR (because a full scan of the index will be marginally faster than a full scan of the table), but it seems silly to create an index with 5 distinct value on a 300M row table. The day aggregate table has about 500K rows and would be significantly faster executing virtually the same query against this table, but Mondrian always seems to go to the base fact table for these dimension queries.
My question is, is there some way to avoid this query? If I can't avoid it, is it possible to get Mondrian to use the aggregate tables for this type of query? I have specified approxRowCount in the single level of this dimension/hierarchy and that eliminated the similar query to get the count of values. I haven't dug into the source of Mondrian yet to determine if there is a possibility of using the aggregate table or if there is some configuration on my part that is preventing it.
Edit for Clarification:
I probably didn't do a good job of asking my question-let me try and clarify. My MDX query looks something like:
select [Color].[Color].Members on columns,
{[Measures].[Metric Value], [Measures].[Count]} on rows
from [Metric]
where [Time].[2014].[June].[11]
I can look at this and hand write a SQL query that answers this query
select COLOR, avg(VALUE), sum(FACT_COUNT)
from AGG_DAY_METRIC
where YEAR = 2014
and MONTH = 6
and DAY_OF_MONTH = 11
group by COLOR
The database answers this query in about 100ms scanning approx 4K rows. It takes Mondrian several minutes to answer the
query because it does several queries that don't answer the MDX query directly, but rather get information about the
dimension. In the case above, the database has to scan 300M rows, taking 50 seconds, to return that there are 5 possible
colors. If color was in a normal dimension table there would only be 5 rows, but in a degenerate dimension there can be 100s
of millions of rows.
So my questions are:
a) Is there a way to tell Mondrian the values of a degenerate dimension and avoid these queries?
b) Is there a way to have Mondrian answer these queries from aggregate tables?
This problem was solved, not by modifying anything in the Mondrian schema or the application, but the database. The database in this case was Oracle and we were able to create a materialized view with query rewrite enabled.
The materialized view is created from the exact query issued by Mondrian. Since the color values don't change very frequently (almost never in our case), the materialized view does a full refresh once a day.
In this case the queries went from taking minute(s) to milliseconds. If your facing an issue like this and your database is Oracle this is a good approach to speeding up the tuples resolution for degenerate dimensions with low cardinality.
It's hard to give any specific directions without knowing more about your schema, but it looks to me you have to make sure that the number of rows with certain colours (count) has to be marked defined as an aggregate measure (Count or Max Number).
Please note that these aggregates are not calculated continuously (I think it would be to heavy for the backing data-store, and Mondrian won't keep a flowing set in memory for incoming facts).
The aggregation can be specified to be ran/rebuilt at specific times (nightly, hourly...). This would make Mondrian a bit unsuitable for real-time analysis, but you should be able to do almost instant queries on historical data.
If your dimension has 5 distinct values in a 300M fact table it should not be a degenerate dimension. It should be in a separate dimension table. A degenerate dimension should ONLY be used if its cardinality is close to the full fact table row count, making a separate table pointless, as there would be no significant storage savings and joining the dimension results in a lot of data being read;
If you put the colors on a separate dim table, any "Read Tuples" query will return results in a few ms, and your problem is solved.
However, more to the point of your question, Mondrian should be able to pick the dim values from the agg tables. Unless you have distinct-count aggregators in the cube, in which case you're in a tricky situation (unless there's an agg table that exactly matches the level of detail you need, Mondrian will very likely scan the fact table).
You should also set the highCardinality attribute of this degenerate dimension to True. Even with only 5 distinct values, having highCardinality=false tells Mondrian it's safe to scan the whole dimension to populate the list of members. Setting it to true stops this scan.
You should also add an index to this column. It's always a good idea to add indexes to every key and degenerate dimension column in a fact table. With an index the DB should answer much faster that SQL query.
Finally, you have a 300M row fact table. What DBMS are you using? Is it a Column oriented DB? If not, you should try them as a possible alternative to your data store. Column oriented DB have a significant performance increase over Row oriented DBs for Mondrian-like queries. There are a few good options out there, you should test drive them.

TSql, building indexes before or after data input

Performance question about indexing large amounts of data. I have a large table (~30 million rows), with 4 of the columns indexed to allow for fast searching. Currently I set the indexs (indices?) up, then import my data. This takes roughly 4 hours, depending on the speed of the db server. Would it be quicker/more efficient to import the data first, and then perform index building?
I'd temper af's answer by saying that it would probably be the case that "index first, insert after" would be slower than "insert first, index after" where you are inserting records into a table with a clustered index, but not inserting records in the natural order of that index. The reason being that for each insert, the data rows themselves would be have to be ordered on disk.
As an example, consider a table with a clustered primary key on a uniqueidentifier field. The (nearly) random nature of a guid would mean that it is possible for one row to be added at the top of the data, causing all data in the current page to be shuffled along (and maybe data in lower pages too), but the next row added at the bottom. If the clustering was on, say, a datetime column, and you happened to be adding rows in date order, then the records would naturally be inserted in the correct order on disk and expensive data sorting/shuffling operations would not be needed.
I'd back up Winston Smith's answer of "it depends", but suggest that your clustered index may be a significant factor in determining which strategy is faster for your current circumstances. You could even try not having a clustered index at all, and see what happens. Let me know?
Inserting data while indices are in place causes DBMS to update them after every row. Because of this, it's usually faster to insert the data first and create indices afterwards. Especially if there is that much data.
(However, it's always possible there are special circumstances which may cause different performance characteristics. Trying it is the only way to know for sure.)
It will depend entirely on your particular data and indexing strategy. Any answer you get here is really a guess.
The only way to know for sure, is to try both and take appropriate measurements, which won't be difficult to do.

Resources