Given: Suppose an 8K (8,192 byte) block holds six rows of exactly 1,000 bytes each, for a total of 6,000 bytes. The six rows are laid out end to end from row 1 to row 6 with no space between them. Assume the block header is no more than 192 bytes, so we have (at least) 2,000 logically contiguous bytes free in the block. Assume each row has fewer than 255 columns. Assume it is not a clustered table, so this block only has rows from a single table. Assume no compression.
When:
Row 3 is updated from 1,000 bytes to 1,100 bytes by changing a single column's value (e.g. a VARCHAR2 column). Assume there are subsequent columns with non-null values in this row.
Then:
Does Oracle intra-block migrate the entire row 3 from its current location between rows 2 and 4 to a new position after row 6, leaving roughly a 1,000 byte gap between row 2 and row 4 (except for row 3's forwarding address, which stays in place and points to its new location within the block)?
or
Does Oracle move only a row piece of row 3 to after row 6, leaving a gap between the end of one of row 3's row pieces and the start of row 4? If so, is the row split based upon the location of the column that was updated? That is, do all columns up to the changed column remain in the same location in the block, while all subsequent columns, including the column that was changed, move into the new row piece located after the end of row 6?
or
Does Oracle split the row into three row pieces and move only the updated column to after the end of row 6 leaving a gap between two of row 3's row pieces where the updated column used to be?
or
Does Oracle do something else?
Note: This question is for gaining an understanding of the concepts involved rather than trying to solve an actual problem.
Using Oracle 19c Enterprise Edition.
Thank you in advance.
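For anyone who wants to see the answer for themselves, here is a minimal sketch of how the scenario could be reproduced and inspected with a block dump. The table and column names are made up, and it assumes the six single-row inserts all land in the same block.

-- Six rows of roughly 1,000 bytes each, packed into one block.
CREATE TABLE row_piece_test (
    id  NUMBER,
    pad VARCHAR2(2000)
) PCTFREE 0;

INSERT INTO row_piece_test
SELECT level, RPAD('x', 990, 'x')
FROM   dual
CONNECT BY level <= 6;
COMMIT;

-- Grow row 3 by roughly 100 bytes, as in the question.
UPDATE row_piece_test SET pad = RPAD('x', 1090, 'x') WHERE id = 3;
COMMIT;

-- Find the file and block holding row 3.
SELECT DBMS_ROWID.ROWID_RELATIVE_FNO(rowid) AS file_no,
       DBMS_ROWID.ROWID_BLOCK_NUMBER(rowid) AS block_no
FROM   row_piece_test
WHERE  id = 3;

-- Dump the block (substitute the file/block numbers returned above) and read the
-- resulting trace file: the row directory, flag bytes and any "nrid" forwarding
-- pointers show how Oracle actually laid out the row pieces after the update.
ALTER SYSTEM CHECKPOINT;
ALTER SYSTEM DUMP DATAFILE 4 BLOCK 131;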
Related
I was asked in an interview: "Can you create indexes on all the columns of a table?" Suppose a table has 20 columns: 1. Can we have one index covering all 20 columns? 2. Can we have a separate index on each of the 20 columns?
In Oracle an index can use at most 32 columns:
Columns per index (or clustered index): 32 columns maximum
So yes, a single composite index on all 20 columns is allowed, and you can also create a separate single-column index on each of the 20 columns (bearing in mind that every extra index has to be maintained on DML). See the logical database limits for reference:
https://docs.oracle.com/cd/B28359_01/server.111/b28320/limits003.htm
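Both parts of the interview question can be demonstrated directly; a quick sketch with made-up table and index names:

CREATE TABLE t20 (
    c1  NUMBER, c2  NUMBER, c3  NUMBER, c4  NUMBER, c5  NUMBER,
    c6  NUMBER, c7  NUMBER, c8  NUMBER, c9  NUMBER, c10 NUMBER,
    c11 NUMBER, c12 NUMBER, c13 NUMBER, c14 NUMBER, c15 NUMBER,
    c16 NUMBER, c17 NUMBER, c18 NUMBER, c19 NUMBER, c20 NUMBER
);

-- 1. One composite index on all 20 columns (well under the 32-column limit).
CREATE INDEX t20_all_ix ON t20 (c1, c2, c3, c4, c5, c6, c7, c8, c9, c10,
                                c11, c12, c13, c14, c15, c16, c17, c18, c19, c20);

-- 2. A separate single-column index on each of the 20 columns.
CREATE INDEX t20_c1_ix ON t20 (c1);
CREATE INDEX t20_c2_ix ON t20 (c2);
-- ...and so on through c20; the remaining statements could be generated with:
-- SELECT 'CREATE INDEX t20_' || LOWER(column_name) || '_ix ON t20 (' || column_name || ');'
-- FROM   user_tab_columns WHERE table_name = 'T20';

Whether 20 single-column indexes are a good idea is a separate question, since each one adds maintenance cost to inserts, updates and deletes.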
In Oracle, the space allocated for data by INSERT INTO operations is not cleaned up when deleting rows from the table. Instead, after a DELETE FROM operation some "waste space" is left. So what happens when I do INSERT INTO after DELETE FROM - does it reuse this "waste space" or allocate new space again?
See the Oracle Concepts manual:
By default, a table is organized as a heap, which means that the database places rows where they fit best rather than in a user-specified order. Thus, a heap-organized table is an unordered collection of rows. As users add rows, the database places the rows in the first available free space in the data segment. Rows are not guaranteed to be retrieved in the order in which they were inserted.
This is the high water mark (HWM) concept:
high water mark (HWM)
The boundary between used and unused space in a segment.
Very well explained in detail by Thomas Kyte here.
Standard Oracle table is a heap-organized table. It is a table with rows stored in no particular order.
If you want to reclaim the free space, then you need to reorganize the table.
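If you do need to release the space (rather than simply letting future inserts reuse it), here is a rough sketch of the usual reorganization options, assuming a table named BIG_TABLE in an ASSM tablespace:

-- Option 1: shrink the segment in place (requires row movement; lowers the HWM).
ALTER TABLE big_table ENABLE ROW MOVEMENT;
ALTER TABLE big_table SHRINK SPACE;

-- Option 2: rebuild the segment; its indexes become UNUSABLE and must be rebuilt.
ALTER TABLE big_table MOVE;
-- ALTER INDEX big_table_pk REBUILD;   -- repeat for each index on the table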
If I have an Oracle table with a VARCHAR2 column with length 4000, and I have inserted a 4000 character string into the table, and then update the row with a 1 character string, do I need to do anything to make the other 3999 characters of space available for reuse, or is it automatically available for reuse?
After the update, 3999 bytes of space (assuming that 1 character = 1 byte in your database character set) is freed up in the block in which the row resides. That space will be immediately available if other rows in that block need to expand in size or if other columns in that row need to expand in size. Of course, since most databases use 8k blocks and the largest block size is 32k, it is likely that there are relatively few rows in this particular block since the original row was so large.
Oracle also tracks how full blocks are and uses that information to make them available for subsequent insert operations. The mechanics of this depend on the type of tablespace (locally or dictionary managed), the segment space management policy (automatic or manual), and table-level parameters like pctused. At a high level, though, freeing up 4k of space within a single data block will almost certainly cause the data block to be made available for future insert operations (or for update operations that cause rows in other blocks to need to be migrated to a new block or chained across multiple blocks). So the space will almost certainly be available to be reused.
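If you want to watch this happen, one way (assuming an ASSM tablespace, SERVEROUTPUT enabled, and a table called MY_TABLE owned by SCOTT - all placeholders) is DBMS_SPACE.SPACE_USAGE, which reports how many blocks fall into each "freeness" bucket; run it before and after the update and compare:

DECLARE
    l_unformatted_blocks NUMBER;  l_unformatted_bytes NUMBER;
    l_fs1_blocks NUMBER;  l_fs1_bytes NUMBER;   -- blocks with 0-25%  free space
    l_fs2_blocks NUMBER;  l_fs2_bytes NUMBER;   -- blocks with 25-50% free space
    l_fs3_blocks NUMBER;  l_fs3_bytes NUMBER;   -- blocks with 50-75% free space
    l_fs4_blocks NUMBER;  l_fs4_bytes NUMBER;   -- blocks with 75-100% free space
    l_full_blocks NUMBER; l_full_bytes NUMBER;
BEGIN
    DBMS_SPACE.SPACE_USAGE(
        segment_owner      => 'SCOTT',
        segment_name       => 'MY_TABLE',
        segment_type       => 'TABLE',
        unformatted_blocks => l_unformatted_blocks,
        unformatted_bytes  => l_unformatted_bytes,
        fs1_blocks         => l_fs1_blocks,  fs1_bytes  => l_fs1_bytes,
        fs2_blocks         => l_fs2_blocks,  fs2_bytes  => l_fs2_bytes,
        fs3_blocks         => l_fs3_blocks,  fs3_bytes  => l_fs3_bytes,
        fs4_blocks         => l_fs4_blocks,  fs4_bytes  => l_fs4_bytes,
        full_blocks        => l_full_blocks, full_bytes => l_full_bytes);
    DBMS_OUTPUT.PUT_LINE('Mostly empty (75-100% free) blocks: ' || l_fs4_blocks);
    DBMS_OUTPUT.PUT_LINE('Full blocks: ' || l_full_blocks);
END;
/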
I am adding a column with datatype VARCHAR2(1000). This column will be used to store a large message (approximately 600 characters). Does having a large declared length affect query performance, and if so, how? I will occasionally have a query selecting that column. Does the table consume extra space if the value in that field is only 100 characters in some rows?
Does it affect performance? It depends.
If "adding a column" implies that you have an existing table with existing data that you're adding a new column to, are you going to populate the new column for old data? If so, depending on your PCTFREE settings and the existing size of the rows, increasing the size of every row by an average of 600 bytes could well lead to row migration which could potentially increase the amount of I/O that queries need to perform to fetch a row. You may want to create a new table with the new column and move the old data to the new table while simultaneously populating the new column if this is a concern.
If you have queries that involve full table scans on the table, anything that you do that increases the size of the table will negatively impact the speed of those queries since they now have to read more data.
When you increase the size of a row, you decrease the number of rows per block. That would tend to increase the pressure on your buffer cache so you'd either be caching fewer rows from this table or you'd be aging out some other blocks faster. Either of those could lead to individual queries doing more physical I/O rather than logical I/O and thus running longer.
A VARCHAR2(1000) will only use whatever space is actually required to store a particular value. If some rows only need 100 bytes, Oracle would only allocate 100 bytes within the block. If other rows need 900 bytes, Oracle would allocate 900 bytes within the block.
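If you do decide to populate the new column in place and want to check whether it caused row migration or chaining, one way is the classic ANALYZE approach (the CHAINED_ROWS table is created by $ORACLE_HOME/rdbms/admin/utlchain.sql; MY_TABLE is a placeholder name):

-- Create the CHAINED_ROWS table first, e.g. @?/rdbms/admin/utlchain.sql

-- List the rowids of migrated/chained rows.
ANALYZE TABLE my_table LIST CHAINED ROWS INTO chained_rows;
SELECT head_rowid FROM chained_rows WHERE table_name = 'MY_TABLE';

-- Alternatively, ANALYZE populates CHAIN_CNT in the data dictionary.
ANALYZE TABLE my_table COMPUTE STATISTICS;
SELECT chain_cnt, num_rows FROM user_tables WHERE table_name = 'MY_TABLE';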
I need to insert 60GB of data into cassandra per day.
This breaks down into
100 sets of keys
150,000 keys per set
4KB of data per key
In terms of write performance am I better off using
1 row per set with 150,000 keys per row
10 rows per set with 15,000 keys per row
100 rows per set with 1,500 keys per row
1000 rows per set with 150 keys per row
Another variable to consider: my data expires after 24 hours, so I am using TTL=86400 to automate expiration.
More specific details about my configuration:
CREATE TABLE stuff (
    stuff_id text,
    stuff_column text,
    value blob,
    PRIMARY KEY (stuff_id, stuff_column)
) WITH COMPACT STORAGE AND
    bloom_filter_fp_chance=0.100000 AND
    caching='KEYS_ONLY' AND
    comment='' AND
    dclocal_read_repair_chance=0.000000 AND
    gc_grace_seconds=39600 AND
    read_repair_chance=0.100000 AND
    replicate_on_write='true' AND
    populate_io_cache_on_flush='false' AND
    compaction={'tombstone_compaction_interval': '43200', 'class': 'LeveledCompactionStrategy'} AND
    compression={'sstable_compression': 'SnappyCompressor'};
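For reference, a write against this table with the 24-hour expiry mentioned above would look something like the following CQL; the key, column name and blob value are placeholders:

-- One column of one logical row: partition key, column name, and the 4KB packed-float blob.
INSERT INTO stuff (stuff_id, stuff_column, value)
VALUES ('set_042_t1700000000', 'key_000123', 0xcafebabe)  -- blob shown truncated
USING TTL 86400;                                           -- expire after 24 hours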
Access pattern details:
The 4KB value is a set of 1000 4 byte floats packed into a string.
A typical request is going to need a random selection of 20 - 60 of those floats.
Initially, those floats are all stored in the same logical row and column. A logical row here represents the set of data at a given time, as if it were all written to one row with 150,000 columns.
As time passes, some of the data is updated: within a logical row, a random set of levels within the packed strings of some columns will be updated. Instead of updating in place, the new levels are written to a new logical row, combined with other new data, to avoid rewriting all of the data that is still valid. This leads to fragmentation, as multiple rows now need to be accessed to retrieve that set of 20 - 60 values. A request will now typically read from the same column across 1 - 5 different rows.
Test Method
I wrote 5 samples of random data for each configuration and averaged the results. Rates were calculated as bytes_written / (time * 10^6), i.e. MB per second, with time measured in seconds at millisecond precision. Pycassa was used as the Cassandra interface, with its batch insert operator. Each insert writes multiple columns to a single row; insert sizes are limited to 12 MB, and the queue is flushed at 12 MB or less. Sizes do not account for row and column overhead, just data. The data source and data sink are on the same network, on different systems.
Write results
Keep in mind there are a number of other variables in play due to the complexity of the Cassandra configuration.
1 row 150,000 keys per row: 14 MBps
10 rows 15,000 keys per row: 15 MBps
100 rows 1,500 keys per row: 18 MBps
1000 rows 150 keys per row: 11 MBps
The answer depends on what your data retrieval pattern is, and how your data is logically grouped. Broadly, here is what I think:
Wide row (1 row per set): This could be the best solution as it prevents the request from hitting several nodes at once, and with secondary indexing or composite column names, you can quickly filter data to your needs. This is best if you need to access one set of data per request. However, doing too many multigets on wide rows can increase memory pressure on nodes, and degrade performance.
Skinny row (1000 rows per set): On the other hand, a wide row can give rise to read hotspots in the cluster. This is especially true if you need to make a high volume of requests for a subset of data that exists entirely in one wide row. In such a case, a skinny row will distribute your requests more uniformly throughout the cluster, and avoid hotspots. Also, in my experience, "skinnier" rows tend to behave better with multigets.
I would suggest analyzing your data access pattern and finalizing your data model based on that, rather than the other way around.
You'd be better off using 1 row per set with 150,000 columns per row. Using a TTL is a good idea to get an automatic clean-up process.