Does the multiplication factor of a column's length somehow influence the database performance?
In other words, what is the difference between the performance of the following two tables:
TBL1:
- CLMN1 VARCHAR2(63)
- CLMN2 VARCHAR2(129)
- CLMN3 VARCHAR2(250)
and
TBL2:
- CLMN1 VARCHAR2(64)
- CLMN2 VARCHAR2(128)
- CLMN3 VARCHAR2(256)
Should we always attempt to make a column's length to some power of 2 or does only the maximum size matter?
Some of the developers claim that there is a link between the multiplication factor of column lengths and performance, because it influences how Oracle distributes and saves the data on disk and shares its cache in memory. Can someone prove or disprove this?
There is no difference in performance, and no hidden optimizations are done for powers of 2.
The only thing that does make a difference in how things are stored is the actual data. 100 characters stored in a VARCHAR2(2000) column are stored exactly the same way as 100 characters stored in a VARCHAR2(500) column.
Think of the length as a business constraint, not as part of the data type. The only thing that should drive your decision about the length is the business constraints on the data that is put in there.
Edit: the only situation where the length does make a difference, is when you need an index on that column. Older Oracle versions (< 10) did have a limit on the key length and that was checked when creating the index.
Even though it's possible in Oracle 11, it might not be the wisest choice to have an index on a value with 4000 characters.
Edit 2:
So I was curious and set up a simple test:
create table narrow (id varchar(40));
create table wide (id varchar(4000));
Then filled both tables with strings composed of 40 'X'. If there was indeed a (substantial) difference between the storage, this should show up somehow when retrieving the data, right?
Both tables have exactly 1048576 rows.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
SQL> set autotrace traceonly statistics
SQL> select count(*) from wide;
Statistics
----------------------------------------------------------
0 recursive calls
1 db block gets
6833 consistent gets
0 physical reads
0 redo size
349 bytes sent via SQL*Net to client
472 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
SQL> select count(*) from narrow;
Statistics
----------------------------------------------------------
0 recursive calls
1 db block gets
6833 consistent gets
0 physical reads
0 redo size
349 bytes sent via SQL*Net to client
472 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
SQL>
So the full table scan did exactly the same amount of work for both tables. So what happens when we actually select the data?
SQL> select * from wide;
1048576 rows selected.
Statistics
----------------------------------------------------------
4 recursive calls
2 db block gets
76497 consistent gets
0 physical reads
0 redo size
54386472 bytes sent via SQL*Net to client
769427 bytes received via SQL*Net from client
69907 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1048576 rows processed
SQL> select * from narrow;
1048576 rows selected.
Statistics
----------------------------------------------------------
4 recursive calls
2 db block gets
76485 consistent gets
0 physical reads
0 redo size
54386472 bytes sent via SQL*Net to client
769427 bytes received via SQL*Net from client
69907 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1048576 rows processed
SQL>
There is a slight difference in consistent gets, but that could be due to caching.
I have autotrace statistics from before and after modifying my query.
Do these statistics imply a significant performance improvement?
The before/after statistics are below.
Statistic                               BEFORE   AFTER
---------                               ------   -----
recursive calls                              5       3
db block gets                               16       8
consistent gets                             45      44
physical reads                               2       1
redo size                                 1156     600
bytes sent via SQL*Net to client           624     624
bytes received via SQL*Net from client     519     519
SQL*Net roundtrips to/from client            2       2
sorts (memory)                               0       1
sorts (disk)                                 0       0
rows processed                               1       1
I wouldn't read too much into the autotrace information alone. You might also want to check the explain plan and the actual run time of the query to see whether performance has improved. Also ensure that you have gathered up-to-date statistics on all of the tables used in the query.
I have a multi-threaded application that uses 10 threads, each of which on average results in an insert of 40K rows into a table. These inserts will occur 24/7 around the clock with no pauses.
I ran some performance tests and noted the following:
With CACHE 20 and single threaded, each insert took about 3.5 seconds on average
With CACHE 20 and 10 threads, each insert took 100 seconds on average
After removing the primary key and the sequence, each insert, regardless of the number of threads used, took 3.1 seconds.
With CACHE 400000 and 10 threads, each insert took 5.6 seconds on average. (Incidentally, the average originally was 8, then dropped down to 5.6 over time)
I'm performing an INSERT like this:
INSERT INTO foo (id, bar, baz)
SELECT foo_id_seq.nextval, bar, baz
FROM (
SELECT bar, baz
FROM ...
)
Given my constraints of 10 threads processing 40K records each on average, how can I calculate the optimal cache size of a sequence?
I'm tempted to set the cache size = (10 threads * 40K records) == 400,000, but I would be worried about any trade offs that I haven't read about in the docs.
Moreover, the insert with 400K cache size is still 100% worse than the insert with no sequence/pk. Granted this is an acceptable time.
The docs say:
The CACHE clause preallocates a set of sequence numbers and keeps them in memory so that sequence numbers can be accessed faster. When the last of the sequence numbers in the cache has been used, the database reads another set of numbers into the cache.
Sequence numbers can be kept in the sequence cache in the System Global Area (SGA). Sequence numbers can be accessed more quickly in the sequence cache than they can be read from disk.
Follow these guidelines for fast access to all sequence numbers: Be sure the sequence cache can hold all the sequences used concurrently by your applications. Increase the number of values for each sequence held in the sequence cache.
I think that with a 3.5 second insert time, your cache size is largely irrelevant! I would look at where the time is being spent, and I would start with an execution plan (or preferably a SQL Monitor report) for the query.
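To put the cache settings from the question in perspective, here is a rough Python sketch that counts how often the sequence cache would have to be refilled for one round of inserts. It assumes all 10 threads draw 40,000 values each from one shared sequence and that every refill hands out exactly CACHE values. The function name is hypothetical, and the model ignores everything else Oracle does during the insert.

```python
import math


def cache_refills(values_needed: int, cache_size: int) -> int:
    """Refills needed to hand out `values_needed` sequence numbers
    when each refill provides `cache_size` numbers (simplified model)."""
    return math.ceil(values_needed / cache_size)


threads = 10
rows_per_thread = 40_000
total_values = threads * rows_per_thread  # NEXTVAL calls per round of inserts

# Compare the cache sizes tested in the question, plus one value in between.
for cache_size in (20, 1_000, 400_000):
    print(cache_size, cache_refills(total_values, cache_size))
```

Under this model the refill count falls from 20,000 to 1 between the two tested settings. Whether that explains the measured timings can only be confirmed by checking where the time is actually spent.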
oracle version:10.2.0.4.0
table: va_edges_detail_temp
The fields are the following:
source_label: varchar2
target_label: varchar2
edge_weight: number
The following query:
select v.*, level
from va_edges_detail_temp v
start with v.source_label = 'smith'
connect by nocycle prior v.target_label = v.source_label
order by level;
When there are 552 rows in the table it only takes 0.005 seconds.
When there are 6600 rows in the table, execution never finishes. I waited for hours, but it does not finish, returns no result but shows no error either.
What's the matter?
Well, that is too broad a question.
In general, it depends on your data, and on the number of rows produced by connecting the rows in va_edges_detail_temp to each other. That may be n^2 or n^4 or even n!.
In any case, it may increase dramatically.
Another part of performance is memory size. If the resulting row set fits into RAM, Oracle does the work in memory. If not, Oracle will try to spill the data to disk, which is generally a time-expensive operation.
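As a rough illustration of that growth, the Python sketch below counts the non-cyclic paths that start at one node of a directed graph. This is a simplified stand-in for the rows a CONNECT BY NOCYCLE query would have to produce. The fully connected graph and the function names are assumptions for illustration, not the asker's data, and Oracle's own cycle detection is not modelled exactly.

```python
def count_paths(graph, start):
    """Count every non-empty path from `start` that never revisits a node."""
    def walk(node, visited):
        total = 0
        for nxt in graph.get(node, ()):
            if nxt not in visited:
                # One result row for the path ending at `nxt`,
                # plus all longer paths that extend it.
                total += 1 + walk(nxt, visited | {nxt})
        return total

    return walk(start, {start})


def complete_graph(n):
    """Hypothetical worst case: every node has an edge to every other node."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}


for n in (3, 4, 5, 6, 8):
    print(n, "nodes ->", count_paths(complete_graph(n), 0), "paths")
```

In this worst case the path count grows faster than exponentially with the number of nodes, so a table that is only twelve times larger can yield a result set that is far more than twelve times larger.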
I need to insert 60GB of data into cassandra per day.
This breaks down into
100 sets of keys
150,000 keys per set
4KB of data per key
In terms of write performance am I better off using
1 row per set with 150,000 keys per row
10 rows per set with 15,000 keys per row
100 rows per set with 1,500 keys per row
1000 rows per set with 150 keys per row
Another variable to consider, my data expires after 24 hours so I am using TTL=86400 to automate expiration
More specific details about my configuration:
CREATE TABLE stuff (
stuff_id text,
stuff_column text,
value blob,
PRIMARY KEY (stuff_id, stuff_column)
) WITH COMPACT STORAGE AND
bloom_filter_fp_chance=0.100000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=39600 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
compaction={'tombstone_compaction_interval': '43200', 'class': 'LeveledCompactionStrategy'} AND
compression={'sstable_compression': 'SnappyCompressor'};
Access pattern details:
The 4KB value is a set of 1000 4 byte floats packed into a string.
A typical request is going to need a random selection of 20 - 60 of those floats.
Initially, those floats are all stored in the same logical row and column. A logical row here represents a set of data at a given time if it were all written to one row with 150,000 columns.
As time passes, some of the data is updated: within a logical row, across its set of columns, a random set of levels within the packed string will be updated. Instead of updating in place, the new levels are written to a new logical row, combined with other new data, to avoid rewriting all of the data which is still valid. This leads to fragmentation, as multiple rows now need to be accessed to retrieve that set of 20 - 60 values. A request will now typically read from the same column across 1 - 5 different rows.
Test Method
I wrote 5 samples of random data for each configuration and averaged the results. Rates were calculated as (Bytes_written / (time * 10^6)). Time was measured in seconds with millisecond precision. Pycassa was used as the Cassandra interface. The Pycassa batch insert operator was used. Each insert inserts multiple columns to a single row, insert sizes are limited to 12 MB. The queue is flushed at 12MB or less. Sizes do not account for row and column overhead, just data. The data source and data sink are on the same network on different systems.
Write results
Keep in mind there are a number of other variables in play due to the complexity of the Cassandra configuration.
1 row 150,000 keys per row: 14 MBps
10 rows 15,000 keys per row: 15 MBps
100 rows 1,500 keys per row: 18 MBps
1000 rows 150 keys per row: 11 MBps
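For scale, the daily volume stated above can be converted into a required average write rate and compared with the measured rates. The Python sketch below does that arithmetic. It assumes "4KB" means 4096 bytes, that the load is spread evenly over 24 hours, and that MBps means 10^6 bytes per second as in the test method. The variable names are illustrative only.

```python
SETS = 100
KEYS_PER_SET = 150_000
BYTES_PER_KEY = 4 * 1024   # assumption: "4KB" means 4096 bytes
SECONDS_PER_DAY = 86_400

daily_bytes = SETS * KEYS_PER_SET * BYTES_PER_KEY
required_mb_per_s = daily_bytes / (SECONDS_PER_DAY * 10**6)

# Rows per set -> measured write rate in MBps, taken from the results above.
measured_mb_per_s = {1: 14, 10: 15, 100: 18, 1000: 11}

print(f"daily volume: {daily_bytes / 10**9:.2f} GB")
print(f"required average rate: {required_mb_per_s:.2f} MB/s")
for rows_per_set, rate in measured_mb_per_s.items():
    headroom = rate / required_mb_per_s
    print(f"{rows_per_set} rows per set: {headroom:.0f}x the required rate")
```

Under these assumptions every tested layout writes well above the average rate the daily volume requires, so the choice between them may matter more for reads and compaction than for raw write throughput.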
The answer depends on what your data retrieval pattern is, and how your data is logically grouped. Broadly, here is what I think:
Wide row (1 row per set): This could be the best solution as it prevents the request from hitting several nodes at once, and with secondary indexing or composite column names, you can quickly filter data to your needs. This is best if you need to access one set of data per request. However, doing too many multigets on wide rows can increase memory pressure on nodes, and degrade performance.
Skinny row (1000 rows per set): On the other hand, a wide row can give rise to read hotspots in the cluster. This is especially true if you need to make a high volume of requests for a subset of data that exists entirely in one wide row. In such a case, a skinny row will distribute your requests more uniformly throughout the cluster, and avoid hotspots. Also, in my experience, "skinnier" rows tend to behave better with multigets.
I would suggest analyzing your data access pattern and finalizing your data model based on that, rather than the other way around.
You'd be better off using 1 row per set with 150,000 columns per row. Using TTL is a good idea to have an auto-cleaning process.
I have a question regarding the columns LOW_VALUE and HIGH_VALUE in the view USER_TAB_COLUMNS (or equivalent).
I was just wondering if these values are always correct, as in, if you have a column with 500k rows with value 1, 500k rows with value of 5 and 1 row with a value of 1000, the LOW_VALUE should be 1 (after you convert the raw figure) and HIGH_VALUE should be 1000 (after you convert the raw figure). However, are there any circumstances where Oracle would 'miss' this outlier value and instead have 5 for HIGH_VALUE?
Also, what is the purpose of these 2 values?
Thanks
As with all optimizer-related statistics, these values are estimates with varying degrees of accuracy from whenever statistics were gathered on the table. As such, it is entirely expected that they would be close but not completely accurate and entirely possible that they would be wildly incorrect.
When you gather statistics, you specify a percentage of the rows (or blocks) that should be sampled. It is possible to specify a 100% sample size, in which case Oracle would examine every row, but it is relatively rare to ask for a sample size nearly that large. It is much more efficient to ask for a much smaller sample size (either explicitly or by letting Oracle automatically determine the sample size). If your sample of rows happens not to include the one row with a value of 1000, the HIGH_VALUE would not be 1000; it would be 5, assuming that is the largest value that the sample saw.
Statistics are also a snapshot in time. By default, 11g will gather statistics every night on objects that have undergone enough change since the last time that statistics were gathered on that object to warrant refreshing the statistics though you can disable that job or change the parameters. So if you gather statistics today with a 100% sample size in order to get a HIGH_VALUE of 1000 and then insert one row with a value of 3000 and never modify the table again, it's likely that Oracle would never gather statistics on that table again (unless you explicitly requested it to) and that the HIGH_VALUE would remain 1000 forever.
Assuming that there is no histogram on the column (which is another whole discussion), Oracle uses the LOW_VALUE and HIGH_VALUE to estimate how selective a particular predicate would be. If the LOW_VALUE is 1, the HIGH_VALUE is 1000, there are 1,000,000 rows in the table, there is no histogram on the column, and you run a query like
SELECT *
FROM some_table
WHERE column_name BETWEEN 100 and 101
Oracle will guess that the data is uniformly distributed between 1 and 1000 so that this query would return 1,000 rows (multiplying the number of rows in the table (1 million) by the fraction of the range the query covers (1/1000)). This selectivity estimate, in turn, would drive the optimizer's determination of whether it would be more efficient to use an index or to do a table scan, what join methods to use, what order to evaluate the various predicates, etc. If you have a non-uniform distribution of data, however, you'll likely end up with a histogram on the column which gives Oracle more detailed information about the distribution of data in the column than the LOW_VALUE and HIGH_VALUE provide.
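The uniform-distribution guess described above can be sketched in a few lines of Python. This is a simplified model of the arithmetic only; the function name and parameters are hypothetical, and the real optimizer applies further adjustments that are not modelled here.

```python
def estimated_rows(num_rows, low_value, high_value, pred_low, pred_high):
    """Uniform-distribution estimate for `col BETWEEN pred_low AND pred_high`.

    Simplified model: rows are assumed to be spread evenly between
    low_value and high_value, with no histogram on the column.
    """
    if high_value <= low_value:
        return float(num_rows)  # degenerate range: no basis for narrowing
    # Clip the predicate to the range the statistics know about.
    lo = max(pred_low, low_value)
    hi = min(pred_high, high_value)
    if hi < lo:
        return 0.0
    fraction = (hi - lo) / (high_value - low_value)
    return num_rows * fraction


# LOW_VALUE = 1, HIGH_VALUE = 1000, 1,000,000 rows, BETWEEN 100 AND 101
print(round(estimated_rows(1_000_000, 1, 1000, 100, 101)))
```

With the figures from the example the model lands at roughly 1,000 rows, about 0.1% of the table, which is the kind of estimate that would tend to favor an index over a full table scan.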