Index needed for max(col)? - performance

I'm currently doing some data loading for a kind of warehouse solution. I get a data export from production each night, which then must be loaded. There are no other updates on the warehouse tables. To load only the new items for a given table I'm currently doing the following steps:
get the current max value y for a specific column (id for journal tables and time for event tables)
load the data via a query like where x > y
To avoid performance issues (I load around 1 million rows per day) I removed most indices from the tables (they are only needed for production, not in the warehouse). But that way the retrieval of the max value takes some time... so my question is:
What is the best way to get the current max value for a column without an index on that column? I just read about using the stats, but I don't know how to handle columns of type timestamp with time zone. Disabling the index before the load and recreating it afterwards takes much too long...

The minimum and maximum values that are computed as part of column-level statistics are estimates. The optimizer only needs them to be reasonably close, not completely accurate. I certainly wouldn't trust them as part of a load process.
Loading a million rows per day isn't terribly much. Do you have an extremely small load window? I'm a bit hard-pressed to believe that you can't afford the cost of the one index you'd need for a min/max index scan.
If you want to avoid indexes, however, you probably want to store the last max value in a separate table that you maintain as part of the load process. After you load rows 1-1000 into table A, you'd update the row in this summary table for table A to indicate that the last row you've processed is row 1000. The next time in, you would read the value from the summary table and start at 1001.
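A minimal sketch of that bookkeeping; all table and column names here are hypothetical:

-- high-water-mark table maintained by the load process
create table load_progress (
  table_name  varchar2(30) primary key,
  last_max_id number
);

-- each load: read the stored high-water mark...
select last_max_id from load_progress where table_name = 'JOURNAL_A';

-- ...load only the rows above it...
insert into warehouse_journal_a
  select * from staging_journal_a where id > :last_max_id;

-- ...then record the new high-water mark in the same transaction,
-- taking the max from the small daily staging set, not the big table
update load_progress
   set last_max_id = (select max(id) from staging_journal_a)
 where table_name = 'JOURNAL_A';
commit;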

If there is no index on the column, the only way for the DBMS to find the maximum value in the column is a complete table scan, which takes a long time for large tables.
I suppose a DBMS could try to keep track of the minimum and maximum values in the column (storing the values in the system catalog) as it does inserts, updates and deletes - but deletes are why no DBMS I know of tries to keep statistics up to date with per-row operations. If you delete the maximum value, finding the new maximum requires a table scan if the column is not indexed (and if it is indexed, the index makes it trivial to find the maximum value, so the information does not have to be stored in the system catalog). This is why they're called 'statistics'; they're an approximation to the values that apply. But when you request 'SELECT MAX(somecol) FROM sometable', you aren't asking for statistical maximum; you're asking for the actual current maximum.

Have the process that creates the extract file also extract a single-row file with the min/max you want. I assume that piece is scripted on some cron or scheduler, so it shouldn't be too much to ask to add the min/max calculations to that script ;)
If not, just do a full scan. A million rows isn't much really, especially in a data warehouse environment.
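For the scripted route, a minimal sketch of the extra extract step; SQL*Plus is assumed, and the table, column, and file names are illustrative:

-- have the nightly extract also spool a one-row file with the current max
set heading off feedback off pagesize 0
spool journal_max.txt
select max(id) from journal_table;
spool off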

This code was written for Oracle, but the approach should carry over to most SQL dialects:
This gets the key of the max(high_val) in the table within the selected range.
select high_val, my_key
  from (select high_val, my_key
          from mytable
         where something = 'avalue'
         order by high_val desc)
 where rownum <= 1;
What this says is: sort mytable by high_val descending for rows where something = 'avalue', then grab only the top row, which gives you the max(high_val) in the selected range and the my_key for that row.

Related

Oracle Auto Partition strategy on Integer column

I need some help on how to perform auto-partitioning on an integer column, similar to what we do on a date column, like PARTITION BY RANGE (DIM_DT_ID) INTERVAL (NUMTODSINTERVAL(1,'DAY')).
I have 90 million rows, performance is poor, and our SLA on the query is 2 seconds, so I would like to partition the table. What is the best approach, and how do I enable auto-partitioning on an integer column?
Our queries will always filter on these columns, like:
select * from <tbname>
where ObjectID = 1346785
and patentnumber = 23456;
"i'm just making an example here, as i cant paste the original query for legality sake"
Fair enough, but the advice we give you will only be as good as the information you give us. So far, nothing you have posted suggests you need Partitioning.
The pasted query would perform well with a compound index, and would probably benefit from compression of the leading column:
create index your_table_lookup_index
on your_table(ObjectID, patentnumber) compress 1;
If that's a unique combination then make the index unique.
how do I enable auto-partitioning on an integer column
However, if you think you do have a genuine use case for Partitioning then we can use Interval Partitioning with integers as well as dates. This statement will create a table partitioned on objectid with a partition for every ten values.
create table your_table (
objectid number,
patentnumber number,
created_date date
)
partition by range (objectid)
interval (10)
(
partition p_00010 values less than (10)
);
On your posted figures that would be about 400 partitions with around 225000 rows per partition. Is that a good choice? Who can tell? You know your data and your use cases, we don't: perhaps a partition per objectid (i.e. with interval (1)) would be better.
You already have a table, so you need to split it into Partitions. The standard way of doing this would be:
1) create a new table with your partitioning strategy (like above) but with the default partition ranged for values less than (MAXVALUE)
2) use partition exchange to move the existing table data into the new structure
3) drop the old table and rename the new table to the old table's name; resolve foreign keys and other dependencies
4) iteratively split the partition into the required ranges
This is a fairly time-consuming process. You have tagged your question [oracle12c]; if you're using Oracle 12c R2 you should definitely look at its online conversion mechanism, which is a single command (sketched below).
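Something like this, reusing the illustrative names from above (12.2 syntax, not tested against your schema):

-- convert the existing table to interval partitioning in place, online
alter table your_table modify
  partition by range (objectid) interval (10)
  (
    partition p_00010 values less than (10)
  ) online;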
Remember that Partitioning for performance is a tricky game. While it can improve queries which return a large number of rows aligned with the partition key, it can make no difference to other queries, or even impair their performance. In particular, any query which does not include the partition key (objectid in your case) will likely perform worse after partitioning the table.
Final aside: as you know but for the benefit of future Seekers, Partitioning is a chargeable extra to the Enterprise Edition license. We're not allowed to use it unless we've paid for it.

Bind variables results in full table scan in Oracle

Checking the query cost on a table with 1 million records results in a full table scan, while the same query in Oracle with actual values results in a significantly lower cost.
Is this expected behaviour from Oracle?
Is there a way to tell Oracle not to scan the full table?
The query scans the full table when bind variables are used. The query cost reduces significantly with actual values.
This is a pagination query. You want to retrieve a handful of records from the table, filtering on their position in the filtered set. Your projection includes all the columns of the table, so you need to query the table to get the whole row. The question is, why do the two query variants have different plans?
Let's consider the second query. You are passing hard values for the offsets, so the optimizer knows that you want the eleven most recent rows in the sorted set. The set is sorted by an indexed column. The most important element is that the optimizer knows you want 11 rows. 11 is a very small sliver of one million, so using an indexed read to get the required rows is an efficient way of doing things. The path starts at the far end of the index, reads the last eleven entries and retrieves the rows.
Now, your first query has bind variables for the starting and finishing offsets and also for the number of rows to be returned. This is crucial: the optimizer doesn't know whether you want to return eleven rows or eleven thousand rows. So it opts for a very high cardinality. The reason for this is that index reads perform very badly for retrieving large numbers of rows. Full table scans are the best way of handling big slices of our tables.
Is this expected behaviour from Oracle?
Now that you understand this, you can see that the answer to the question is yes. The optimizer makes the best decision it can with the information we give it. When we provide hard values it can be very clever. When we provide vague data it has to guess; sometimes its guesses aren't the ones we expected.
Bind variables are very useful for running the same query with different values when the expected result set is similar. But using bind variables to specify ranges means the result sets can potentially vary tremendously in size.
Is there a way to tell Oracle not to scan the full table?
If you can fix the pagesize, thus removing the :a2 parameter, that would allow the optimizer to produce a much more accurate plan. Alternatively, if you need to vary the pagesize within a small range (say 10 - 100) then you could try a /*+ cardinality (100) */ hint in the query; provided the cardinality value is within the right order of magnitude it doesn't have to be the precise value.
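For illustration, a hedged sketch of where such a hint could sit. The original query isn't shown, so the pagination shape and the bind names (:a1 = offset, :a2 = page size) are assumptions:

-- classic Oracle pagination with a cardinality hint on the inline view;
-- 100 is the assumed upper bound of the page size
select *
  from (select /*+ cardinality(q 100) */ q.*, rownum rn
          from (select * from my_table order by created_date desc) q
         where rownum <= :a1 + :a2)
 where rn > :a1;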
As with all performance questions, the devil is in the specifics. So you need to benchmark various performance changes and choose the best fit for your particular use case(s).

Poor Performance of Mondrian w/ Degenerate Dimensions

I have an application that collects performance metrics and stores them in a datamart. I then use Mondrian to enable analysis and ad-hoc exploration of the data. I'm collecting about 5e6 rows per day and total size of the METRIC table is about 300M rows.
We "color" our data based on the metrics comparison to an SLA. There are exactly 5 distinct values for color. When we do simple MDX queries to get, for example, a color distribution of the data for a specific date range, say 1 day, we see queries like below:
2014-06-11 23:17:08,042 DEBUG [sql] - 223: SqlTupleReader.readTuples [[Color].[Color]]: executing sql [select "METRIC"."COLOR" as "c0" from "METRIC" "METRIC" group by "METRIC"."COLOR" order by "METRIC"."COLOR" ASC NULLS LAST]
2014-06-11 23:17:58,747 DEBUG [sql] - 223: , exec 50704 ms
In order to improve performance, the datamart includes aggregate tables at the hour and day levels, and both aggregate tables include the COLOR column.
I understand that Mondrian is very dependent on the underlying database performance, but there is really no way to tune this query. I could create an index on COLOR (a full scan of the index would be marginally faster than a full scan of the table), but it seems silly to create an index on a column with 5 distinct values in a 300M-row table. Virtually the same query against the day aggregate table (about 500K rows) would be significantly faster, but Mondrian always seems to go to the base fact table for these dimension queries.
My question is, is there some way to avoid this query? If I can't avoid it, is it possible to get Mondrian to use the aggregate tables for this type of query? I have specified approxRowCount in the single level of this dimension/hierarchy and that eliminated the similar query to get the count of values. I haven't dug into the source of Mondrian yet to determine if there is a possibility of using the aggregate table or if there is some configuration on my part that is preventing it.
Edit for Clarification:
I probably didn't do a good job of asking my question, so let me try to clarify. My MDX query looks something like:
select [Color].[Color].Members on columns,
{[Measures].[Metric Value], [Measures].[Count]} on rows
from [Metric]
where [Time].[2014].[June].[11]
I can look at this and hand-write a SQL query that answers it:
select COLOR, avg(VALUE), sum(FACT_COUNT)
from AGG_DAY_METRIC
where YEAR = 2014
and MONTH = 6
and DAY_OF_MONTH = 11
group by COLOR
The database answers this query in about 100ms, scanning approx 4K rows. It takes Mondrian several minutes to answer the query because it does several queries that don't answer the MDX query directly, but rather get information about the dimension. In the case above, the database has to scan 300M rows, taking 50 seconds, to return that there are 5 possible colors. If color were in a normal dimension table there would only be 5 rows, but in a degenerate dimension there can be 100s of millions of rows.
So my questions are:
a) Is there a way to tell Mondrian the values of a degenerate dimension and avoid these queries?
b) Is there a way to have Mondrian answer these queries from aggregate tables?
This problem was solved not by modifying anything in the Mondrian schema or the application, but in the database. The database in this case was Oracle, and we were able to create a materialized view with query rewrite enabled.
The materialized view is created from the exact query issued by Mondrian. Since the color values don't change very frequently (almost never in our case), the materialized view does a full refresh once a day.
In this case the queries went from taking minutes to milliseconds. If you're facing an issue like this and your database is Oracle, this is a good approach to speeding up tuple resolution for degenerate dimensions with low cardinality.
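A sketch of the approach; the defining query is the Mondrian-issued one shown above, while the view name and refresh schedule are illustrative:

-- materialized view over the exact Mondrian query, with query rewrite
-- enabled and a full refresh once a day
create materialized view mv_metric_color
  build immediate
  refresh complete start with sysdate next sysdate + 1
  enable query rewrite
as
select "METRIC"."COLOR" as "c0"
  from "METRIC" "METRIC"
 group by "METRIC"."COLOR";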
It's hard to give any specific directions without knowing more about your schema, but it looks to me like you have to make sure that the number of rows with a certain colour (the count) is defined as an aggregate measure (Count or Max Number).
Please note that these aggregates are not calculated continuously (I think it would be too heavy for the backing data store, and Mondrian won't keep a flowing set in memory for incoming facts).
The aggregation can be specified to be run/rebuilt at specific times (nightly, hourly...). This would make Mondrian a bit unsuitable for real-time analysis, but you should be able to do almost instant queries on historical data.
If your dimension has 5 distinct values in a 300M-row fact table it should not be a degenerate dimension. It should be in a separate dimension table. A degenerate dimension should ONLY be used if its cardinality is close to the full fact table row count, making a separate table pointless, as there would be no significant storage savings and joining the dimension would result in a lot of data being read.
If you put the colors in a separate dim table, any "Read Tuples" query will return results in a few ms, and your problem is solved.
However, more to the point of your question, Mondrian should be able to pick the dim values from the agg tables. Unless you have distinct-count aggregators in the cube, in which case you're in a tricky situation (unless there's an agg table that exactly matches the level of detail you need, Mondrian will very likely scan the fact table).
You should also set the highCardinality attribute of this degenerate dimension to True. Even with only 5 distinct values, having highCardinality=false tells Mondrian it's safe to scan the whole dimension to populate the list of members. Setting it to true stops this scan.
You should also add an index to this column. It's always a good idea to add indexes to every key and degenerate dimension column in a fact table. With an index the DB should answer that SQL query much faster.
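For example (the index name is illustrative):

create index idx_metric_color on "METRIC"("COLOR");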
Finally, you have a 300M-row fact table. What DBMS are you using? Is it a column-oriented DB? If not, you should try one as a possible alternative to your data store. Column-oriented DBs have a significant performance advantage over row-oriented DBs for Mondrian-like queries. There are a few good options out there; you should test-drive them.

Full table scan when querying against an indexed column

I created an index for one table, a simple index, like this:
CREATE INDEX IDX_TRANSACAO_NOVA_STATUS ON TRANSACAO_NOVA(STATUS) TABLESPACE COMVENIF;
This table has 1000K records inside and the status column has just 5 or 6 possible values. After creating the index I expected that the query below would have better performance:
select * from transacao_nova tn where tn.status = 'XXX'
but the explain plan still shows me a full scan with a cost of 16,000.
Any help? I'm not a DBA but I need to improve this performance.
Thanks in advance.
If there are only 5 or 6 different status values and a million records the query optimizer may be deciding it is not worth using the index to do a range scan that would still return a substantial number of all the records in the table.
You might look into using an index-clustered table for this application.
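A hedged sketch of that idea, read as an Oracle indexed table cluster; the column size is an assumption and the existing table would have to be rebuilt into the cluster:

-- cluster keyed on STATUS, with a cluster index, and a table built into it
create cluster status_cluster (status varchar2(3)) size 8192;
create index idx_status_cluster on cluster status_cluster;
create table transacao_nova_clustered (
  id     number,
  status varchar2(3)
  -- remaining columns as in TRANSACAO_NOVA
) cluster status_cluster (status);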
If data in the status column is skewed (not uniform: some values appear very often and others appear very rarely), you can accelerate queries for the rare values by refreshing statistics (and verifying that you are calculating a histogram for the status column). This will make Oracle use the index in the cases in which it is more efficient.
http://docs.oracle.com/cd/E11882_01/server.112/e16638/stats.htm#autoId12
Be aware that automatically determining if a column needs a histogram is not a good idea as it may lead to inconsistent behaviour. It is better to manually specify histograms when needed. Also, histograms affect every query that uses those columns, so they should be collected with care.
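For example, a sketch of manually requesting a histogram on STATUS with DBMS_STATS; 254 buckets is the classic maximum, plenty for 5 or 6 distinct values:

begin
  dbms_stats.gather_table_stats(
    ownname    => user,
    tabname    => 'TRANSACAO_NOVA',
    method_opt => 'for columns STATUS size 254'  -- explicit histogram request
  );
end;
/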
You might need to generate new statistics on the table.
http://docs.oracle.com/cd/B19306_01/server.102/b14211/stats.htm
A common mistake is to assume that an index range scan will be better than a full scan because you only want some "small" fraction of the total rows in the table. But if the rows you want are scattered throughout the table's storage extents, locating them by an index lookup can be slower than just scanning the entire table. I can't say for sure that's the case in your situation, but it's a possibility.
For a more in-depth discussion of this topic I recommend this paper.

How to predict table sizes Oracle?

I'm trying to do a growth prediction on some tables I have, and for that I've got to do some calculations on my row sizes, how many rows I generate per day and, well... the maths.
I'm calculating the average size of each row in my table as the sum of the average sizes of each field. So basically:
SELECT 'COL1', avg(vsize(COL1)) FROM TABLE union
SELECT 'COL2', avg(vsize(COL2)) FROM TABLE
Sum that up, multiply by the number of entries per day and work the predictions from there.
It turns out that for one of the tables I've looked at, the resulting size is a lot smaller than I thought it would be, and that got me wondering if my method was right.
Also, I did not consider index sizes in my predictions - and of course I should.
My questions are:
Is this method I'm using reliable?
Any tips on how I could work the predictions for the indexes?
I've done my googling, but the methods I find are all about segments and extents, or else calculations based on the whole table. I will need the step with the actual rows of my table to do the predictions (I have to analyse the data in the table in order to figure out how many records per day).
And finally, this is an approximation. I know I'm missing some bytes here and there with overheads and stuff. I just want to make sure I'm only missing bytes and not gigas :)
1) Your method is sound for calculating the average size of a row. (Though be aware that if your column contains nulls, you should use avg(nvl(vsize(COL1), 0)) instead of avg(vsize(COL1)).) However, it doesn't take into account the physical arrangement of rows.
First of all, it doesn't take into account the header info (from both blocks and rows): you can't fit 8k data into 8k blocks. See the documentation on data block format for more information.
Then, rows are not always stored neatly packed. Oracle leaves some space in each block so that rows can grow when they are updated (governed by the pctfree parameter). Also, when rows are deleted the empty space is not reclaimed right away (if you're not using ASSM with locally managed tablespaces, the amount of free space required for a block to return to the list of available blocks depends on pctused).
If you already have some representative data in your table, you can estimate the amount of extra space you will need by comparing the space physically used (all_tables.blocks*block_size after having gathered statistics) to the average row length.
By the way Oracle can easily give you a good estimate of the average row length: gather statistics on the table and query all_tables.avg_row_len.
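For example, a sketch of that comparison; it assumes an 8K block size, freshly gathered statistics, and an illustrative table name:

-- compare physical space to raw row bytes to estimate the overhead factor
select table_name,
       blocks * 8192          as physical_bytes,
       num_rows * avg_row_len as raw_row_bytes,
       round((blocks * 8192) / nullif(num_rows * avg_row_len, 0), 2)
                              as overhead_factor
  from all_tables
 where table_name = 'MY_TABLE';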
2) Most of the time (read: unless there is a bug or you fall into an atypical use of the index), the index will grow proportionally to the number of rows.
If you have representative data, you can have a good estimation of its future size by multiplying its actual size by the relative growth of the number of rows.
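A sketch of reading the current index sizes as the baseline to scale by the expected row growth; the name pattern is illustrative:

-- current index segment sizes for the table's indexes
select segment_name,
       round(bytes / 1024 / 1024, 1) as size_mb
  from user_segments
 where segment_type = 'INDEX'
   and segment_name like 'MY_TABLE%';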
The last time Oracle published their formulae for estimating the size of schema objects was in Oracle 8.0, which means the linked document is ten years out of date. However, I don't expect very much has changed in how Oracle reserves segment header, block header, or row header information.
