How to predict table sizes Oracle? - oracle

I'm trying to do a growth prediction on some tables I have and for that I've got to do some calculations on my row sizes, how many rows I generate by day and well.. the maths.
I'm calculating the average size of each row in my table as the sum of the average size of each field. So basicaly:
SELECT 'COL1' , avg(vsize(COL1)) FROM TABLE union
SELECT 'COL2' , avg(vsize(COL2)) FROM TABLE
Sum that up, multiply by the number of entries of a day and work the predictions from there.
Turns out that for one of the tables I've looked the resulting size is a lot smaller than I thought it would be and got me wondering if my method was right.
Also, I did not consider indexes sizes for my predictions - and of course I should.
My questions are:
Is this method I'm using reliable?
Tips on how could I work the predictions for the Indexes?
I've done my googling, but the methods I find are all about the segments and extends or else calculations based in the whole table. I will need the step with the actual row of my table to do the predictions (I have to analyse the data in the table in order to figure how many records a day).
And finally, this is an approximation. I know I'm missing some bytes here and there with overheads and stuff. I just want to make sure I'm only missing bytes and not gigas :)

1) Your method is sound to calculate the average size of a row. (Though be aware that if your column contains null, you should use avg(nvl(vsize(col1), 0)) instead of avg(vsize(COL1))). However, it doesn't take into account the physical arrangement of rows.
First of all, it doesn't take into account the header info (from both blocks and rows): you can't fit 8k data into 8k blocks. See the documentation on data block format for more information.
Then, rows are not always stored neatly packed. Oracle lets some space in each blocks so that the rows can grow when they are updated (governed by the pctfree parameter). Also when the rows are deleted the empty space is not reclaimed right away (if you're not using ASSM with locally managed tablespaces, the amount of free space required for a block to return to the list of available blocks depends on pctused).
If you already have some representative data in your table, you can estimate the amount of extra space you will need by comparing the space physically used (all_tables.blocks*block_size after having gathered statistics) to the average row length.
By the way Oracle can easily give you a good estimate of the average row length: gather statistics on the table and query all_tables.avg_row_len.
2) Most of the time (read: unless there is a bug or you fall into an atypical use of the index), the index will grow proportionaly to the number of rows.
If you have representative data, you can have a good estimation of its future size by multiplying its actual size by the relative growth of the number of rows.

The last time Oracle published their formulae for estimating the size of schema objects was in Oracle 8.0, which means the linked document is ten years out of date. However, I don't expect very much has changed in how Oracle reserves segment header, block header, or row header information.

Related

Bind variables results in full table scan in Oracle

Checking the query cost on a table with 1 million records results in full table scan while the same query in oracle with actual values results in significant lesser cost.
Is this expected behaviour from Oracle ?
Is there a way to tell Oracle not to scan the full table ?
The query is scanning the full table when bind variables are used:
The query cost reduces significantly with actual variables:
This is a pagination query. You want to retrieve a handful of records from the table, filtering on their position in the filtered set. Your projection includes all the columns of the table, so you need to query the table to get the whole row. The question is, why do the two query variants have different plans?
Let's consider the second query. You are passing hard values for the offsets, so the optimizer knows that you want the eleven most recent rows in the sorted set. The set is sorted by an indexed column. The most important element is that the optimizer knows you want 11 rows. 11 is a very small sliver of one million, so using an indexed read to get the required rows is an efficient way of doing things. The path starts at the far end of the index, reads the last eleven entries and retrieves the rows.
Now, your first query has bind variables for the starting and finishing offsets and also for the number of rows to be returned. This is crucial: the optimizer doesn't know whether you want to return eleven rows or eleven thousand rows. So it opts for a very high cardinality. The reason for this is that index reads perform very badly for retrieving large numbers of rows. Full table scans are the best way of handling big slices of our tables.
Is this expected behaviour from Oracle ?
Now you understand this you will can see that the answer to this question is yes. The optimizer makes the best decision it can with the information we give it. When we provide hard values it can be very clever. When we provide vague data it has to guess; sometimes its guesses aren't the ones we expected.
Bind variables are very useful for running the same query with different values when the expected result set is similar. But using bind variables to specify ranges means the result sets can potentially vary tremendously in size.
Is there a way to tell Oracle not to scan the full table ?
If you can fix the pagesize, thus removing the :a2 parameter, that would allow the optimizer to produce a much more accurate plan. Alternatively, if you need to vary the pagesize within a small range (say 10 - 100) then you could try a /*+ cardinality (100) */ hint in the query; provided the cardinality value is within the right order of magnitude it doesn't have to be the precise value.
As with all performance questions, the devil is in the specifics. So you need to benchmark various performance changes and choose the best fit for your particular use case(s).

Full table scan when querying against an indexed column

I created an index for one table, a simple index just like that:
CREATE INDEX IDX_TRANSACAO_NOVA_STATUS ON TRANSACAO_NOVA(STATUS) TABLESPACE COMVENIF;
This table has 1000K registers insinde and the status table just 5 or 6 possible values. After created the index i expected that the query bellow would have a better performance:
select * from transacao_nova tn where tn.status = 'XXX'
but, the explain plan still show me a full scan with 16.000 cost.
any help? i'm not a dba but i need to improve this performance.
thanks in advance.
If there are only 5 or 6 different status values and a million records the query optimizer may be deciding it is not worth using the index to do a range scan that would still return a substantial number of all the records in the table.
You might look into using an index-clustered table for this application.
If data in the status column es skewed (not uniform: some values appear very often and others appear very rarely), you can accelerate queries for the rare values by refreshing statistics (and verifying that you are calculating a histogram for the status column. This will make Oracle use the index in the cases in which it is more efficient.
http://docs.oracle.com/cd/E11882_01/server.112/e16638/stats.htm#autoId12
Be aware that automatically determining if a column needs a histogram is not a good idea as it may lead to inconsistent behaviour. It is better to manually specify histograms when needed. Also, histograms affect every query that uses those columns, so they should be collected with care.
You might need to generate new statistics on the table.
http://docs.oracle.com/cd/B19306_01/server.102/b14211/stats.htm
A common mistake is to assume that an index range scan will be better than a full scan because you only want some "small" fraction of the total rows in the table. But if the rows you want are scattered throughout the table's storage extents, locating them by an index lookup can be slower than just scanning the entire table. I can't say for sure that's the case in your situation, but it's a possibility.
For a more in-depth discussion of this topic I recommend this paper.

Index needed for max(col)?

I'm currently doing some data loading for a kind of warehouse solution. I get an data export from the production each night, which then must be loaded. There are no other updates on the warehouse tables. To only load new items for a certain table I'm currently doing the following steps:
get the current max value y for a specific column (id for journal tables and time for event tables)
load the data via a query like where x > y
To avoid performance issues (I load around 1 million rows per day) I removed most indices from the tables (there are only needed for production, not in the warehouse). But that way the retrieval of the max value takes some time...so my question is:
What is the best way to get the current max value for a column without an index on that column? I just read about using the stats but I don't know how to handle columns with 'timestamp with timezone'. Disabling the index before load, and recreate it afterwards takes much too long...
The minimum and maximum values that are computed as part of column-level statistics are estimates. The optimizer only needs them to be reasonably close, not completely accurate. I certainly wouldn't trust them as part of a load process.
Loading a million rows per day isn't terribly much. Do you have an extremely small load window? I'm a bit hard-pressed to believe that you can't afford the cost of indexing the row(s) you need to do a min/ max index scan.
If you want to avoid indexes, however, you probably want to store the last max value in a separate table that you maintain as part of the load process. After you load rows 1-1000 in table A, you'd update the row in this summary table for table A to indicate that the last row you've processed is row 1000. The next time in, you would read the value from the summary table and start at 1001.
If there is no index on the column, the only way for the DBMS to find the maximum value in the column is a complete table scan, which takes a long time for large tables.
I suppose a DBMS could try to keep track of the minimum and maximum values in the column (storing the values in the system catalog) as it does inserts, updates and deletes - but deletes are why no DBMS I know of tries to keep statistics up to date with per-row operations. If you delete the maximum value, finding the new maximum requires a table scan if the column is not indexed (and if it is indexed, the index makes it trivial to find the maximum value, so the information does not have to be stored in the system catalog). This is why they're called 'statistics'; they're an approximation to the values that apply. But when you request 'SELECT MAX(somecol) FROM sometable', you aren't asking for statistical maximum; you're asking for the actual current maximum.
Have the process that creates the extract file also extract a single row file with the min/max you want. I assume that piece is scripted on some cron or scheduler, so shouldn't be too much to ask to add min/max calcs to that script ;)
If not, just do a full scan. Million rows isn't much really, esp in a data warehouse environment.
This code was written with oracle, but should be compatible with most SQL versions:
This gets the key of the max(high_val) in the table according to the range.
select high_val, my_key
from (select high_val, my_key
from mytable
where something = 'avalue'
order by high_val desc)
where rownum <= 1
What this says is: Sort mytable by high_val descending for values where something = 'avalue'. Only grab the top row, which will provide you with the max(high_val) in the selected range and the my_key to that table.

How can I determine the maximum row size just from the column datatype sizes?

How can I work out the maximum row size in a table, if I'm only given the datatype lengths (from all_tab_cols.data_length column) of the columns in the table (i.e. no statistics or ANALYZE)? There's extra complications in that this is an IOT, so there's index tree size to consider as well.
The formulas sizing indexes are involved -- they're generally focused on calculating rows per block to determine actual disk space required for N rows -- and the last time I can find that Oracle included this in their documentation was in Oracle 8.0. See here.
I'd recreate the table, generate some data with a Data Generator and actually measure it.
I generally wouldn't go further than 10s of MB of generated data as sufficient for a guide.
In some places, they tend to generate over-generous data length (eg standardizing on string sizes of 20, 50, 100 bytes/characters). Even on names you may allow 30 characters when most people are in the 5-10 length. As such estimates derived from field sizes, rather than actual lengths, will be VERY vague and you'll have a very large error margin.

Amortizing the calculation of distribution (and percentile), applicable on App Engine?

This is applicable to Google App Engine, but not necessarily constrained for it.
On Google App Engine, the database isn't relational, so no aggregate functions (such as sum, average etc) can be implemented. Each row is independent of each other. To calculate sum and average, the app simply has to amortize its calculation by recalculating for each individual new write to the database so that it's always up to date.
How would one go about calculating percentile and frequency distribution (i.e. density)? I'd like to make a graph of the density of a field of values, and this set of values is probably on the order of millions. It may be feasible to loop through the whole dataset (the limit for each query is 1000 rows returned), and calculate based on that, but I'd rather do some smart approach.
Is there some algorithm to calculate or approximate density/frequency/percentile distribution that can be calculated over a period of time?
By the way, the data is indeterminate in that the maximum and minimum may be all over the place. So the distribution would have to take approximately 95% of the data and only do a density based on that.
Getting the whole row (with that limit of 1000 at a time...) over and over again in order to get a single number per row is sure unappealing. So denormalize the data by recording that single number in a separate entity that holds a list of numbers (to a limit of I believe 1 MB per query, so with 4-byte numbers no more than 250,000 numbers per list).
So when adding a number also fetch the latest "added data values list" entity, if full make a new one instead, append the new number, save it. Probably no need to be transactional if a tiny error in the statistics is no killer, as you appear to imply.
If the data for an item can be changed have separate entities of the same kind recording the "deleted" data values; to change one item's value from 23 to 45, add 23 to the latest "deleted values" list, and 45 to the latest "added values" one -- this covers item deletion as well.
It may be feasible to loop through the whole dataset (the limit for each query is 1000 rows returned), and calculate based on that, but I'd rather do some smart approach.
This is the most obvious approach to me, why are you are you trying to avoid it?

Resources