Large Datatype length performance impact in Oracle? - oracle

I am adding a column with datatype VARCHAR2(1000). This column will be used to store a large message (approximately 600 characters). Does having a large datatype length affect query performance, and if so, how? I will only occasionally run a query selecting that column. Does the table consume extra space even if the value in that field is, in some places, only 100 characters?

Does it affect performance? It depends.
If "adding a column" implies that you have an existing table with existing data that you're adding a new column to, are you going to populate the new column for old data? If so, depending on your PCTFREE settings and the existing size of the rows, increasing the size of every row by an average of 600 bytes could well lead to row migration which could potentially increase the amount of I/O that queries need to perform to fetch a row. You may want to create a new table with the new column and move the old data to the new table while simultaneously populating the new column if this is a concern.
If you have queries that involve full table scans on the table, anything that you do that increases the size of the table will negatively impact the speed of those queries since they now have to read more data.
When you increase the size of a row, you decrease the number of rows per block. That would tend to increase the pressure on your buffer cache so you'd either be caching fewer rows from this table or you'd be aging out some other blocks faster. Either of those could lead to individual queries doing more physical I/O rather than logical I/O and thus running longer.
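If you want a rough feel for how many rows fit in a block before and after the change, the dictionary views expose the numbers once statistics are gathered (MESSAGES is a placeholder table name):
SELECT table_name,
       num_rows,
       blocks,
       ROUND(num_rows / NULLIF(blocks, 0)) AS approx_rows_per_block
FROM   user_tables
WHERE  table_name = 'MESSAGES';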
A VARCHAR2(1000) will only use whatever space is actually required to store a particular value. If some rows only need 100 bytes, Oracle would only allocate 100 bytes within the block. If other rows need 900 bytes, Oracle would allocate 900 bytes within the block.
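As a quick sanity check on that, VSIZE reports the number of bytes Oracle actually stores for a value, so a 100-character value in a VARCHAR2(1000) column shows up as roughly 100 bytes in a single-byte character set (table and column names here are placeholders):
SELECT message_text,
       VSIZE(message_text) AS bytes_stored
FROM   messages
WHERE  ROWNUM <= 10;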

Related

Heroku Row Limit

Why does Heroku have a row limit on their Hobby plan if there is already an overall database size limit? I'm confused because I've reached my row limit, but I'm nowhere near the size limit. Does the amount of rows you store affect what it costs for them to manage it or is that cost only affected by the amount of bytes in your data?
Edit: Also, what constitutes a row? I added 50 items to a table but it only counted as one row against my row limit. I thought each item you add to a table is a "row" in the table.
It is to stop people from using custom data types to store more than one row's worth of info in a single row. They want to limit the amount of data people can store, so they limit the number of rows; but to do that without limiting row size, they also need an overall size limit.
The Heroku Postgres dev plan is limited to 10,000 rows; previously, no limit had been enforced. This is a global limit across the whole database.
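If you want to see which tables are eating into the row limit, here is a sketch against the standard Postgres statistics view (this gives an estimate of live rows, which may differ slightly from the exact figure Heroku counts):
SELECT schemaname,
       relname,
       n_live_tup
FROM   pg_stat_user_tables
ORDER  BY n_live_tup DESC;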

Oracle performance database size increases

I have a high-level question. Say I have a SQL query that takes 30ms to complete, and it runs against an indexed column on a table with 1 million records. Now, if the table size is increased to 5 million records, should I expect the query to take 5 times as long (since 5 times as many index entries have to be searched), i.e. 150ms? I apologise if this is too simplistic. I have a program that runs 10 indexed SQL queries against a table that is going to be increased by this factor; the queries currently take 300ms and I am concerned this would increase to 1.5s. Any help would be appreciated!
You can think of an index lookup as doing a search through a binary tree followed by a fetch of the page with the appropriate data. Typically, the index would fit in memory and the search through the index would be quite fast. Multiplying the data size by 10 would increase the depth of the tree by 3 or 4. With in-memory comparison operations this would not be noticeable for most queries. (There are other types of indexes besides B-trees, but this is a convenient model for thinking about performance.)
The data fetch then could incur the overhead of reading a page from disk. That should still be quite fast.
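If you want to confirm how shallow the index really is, Oracle exposes the B-tree height in the dictionary; a sketch with a placeholder table name (statistics must be current):
SELECT index_name,
       blevel,
       leaf_blocks,
       num_rows
FROM   user_indexes
WHERE  table_name = 'MY_TABLE';
BLEVEL is the number of branch levels between the root and the leaf blocks, so the total depth is roughly BLEVEL + 1, and it grows very slowly as the row count increases.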
So, the easy answer to your question is: no. However, this assumes that the query is something like:
select t.*
from table t
where t.indexcol = CONSTANTVALUE
And, it assumes that the query only returns one row. Things that might affect the performance as the table size increases include:
The size of the returned data set increases with the size of the table. Returning more values necessarily takes longer. For some queries, the performance is more dependent on the mechanism for returning values than calculating/fetching the data.
The query contains a join or group by.
The statistics of the table are out of date, so the optimizer accidentally chooses a full table scan rather than an index lookup (see the sketch after this list).
You are in a memory-constrained environment where the index doesn't fit in memory. Or, the entire table fits in memory when it is smaller but incurs the overhead of cache misses as it gets larger.
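For the statistics point above, a sketch of how you might check and refresh them (the table name is a placeholder):
SELECT table_name, num_rows, last_analyzed
FROM   user_tables
WHERE  table_name = 'MY_TABLE';

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'MY_TABLE');
END;
/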

Cassandra Wide Vs Skinny Rows for large columns

I need to insert 60GB of data into cassandra per day.
This breaks down into
100 sets of keys
150,000 keys per set
4KB of data per key
In terms of write performance am I better off using
1 row per set with 150,000 keys per row
10 rows per set with 15,000 keys per row
100 rows per set with 1,500 keys per row
1000 rows per set with 150 keys per row
Another variable to consider: my data expires after 24 hours, so I am using TTL=86400 to automate expiration.
More specific details about my configuration:
CREATE TABLE stuff (
    stuff_id text,
    stuff_column text,
    value blob,
    PRIMARY KEY (stuff_id, stuff_column)
) WITH COMPACT STORAGE AND
    bloom_filter_fp_chance=0.100000 AND
    caching='KEYS_ONLY' AND
    comment='' AND
    dclocal_read_repair_chance=0.000000 AND
    gc_grace_seconds=39600 AND
    read_repair_chance=0.100000 AND
    replicate_on_write='true' AND
    populate_io_cache_on_flush='false' AND
    compaction={'tombstone_compaction_interval': '43200', 'class': 'LeveledCompactionStrategy'} AND
    compression={'sstable_compression': 'SnappyCompressor'};
Access pattern details:
The 4KB value is a set of 1000 4-byte floats packed into a string.
A typical request is going to need a random selection of 20 - 60 of those floats.
Initially, those floats are all stored in the same logical row and column. A logical row here represents a set of data at a given time if it were all written to one row with 150,000 columns.
As time passes, some of the data is updated: within a logical row, within the set of columns, a random set of levels within the packed string will be updated. Instead of updating in place, the new levels are written to a new logical row combined with other new data, to avoid rewriting all of the data which is still valid. This leads to fragmentation, as multiple rows now need to be accessed to retrieve that set of 20 - 60 values. A request will now typically read from the same column across 1 - 5 different rows.
Test Method
I wrote 5 samples of random data for each configuration and averaged the results. Rates were calculated as (Bytes_written / (time * 10^6)). Time was measured in seconds with millisecond precision. Pycassa was used as the Cassandra interface. The Pycassa batch insert operator was used. Each insert inserts multiple columns to a single row, insert sizes are limited to 12 MB. The queue is flushed at 12MB or less. Sizes do not account for row and column overhead, just data. The data source and data sink are on the same network on different systems.
Write results
Keep in mind there are a number of other variables in play due to the complexity of the Cassandra configuration.
1 row 150,000 keys per row: 14 MBps
10 rows 15,000 keys per row: 15 MBps
100 rows 1,500 keys per row: 18 MBps
1000 rows 150 keys per row: 11 MBps
The answer depends on what your data retrieval pattern is, and how your data is logically grouped. Broadly, here is what I think:
Wide row (1 row per set): This could be the best solution as it prevents the request from hitting several nodes at once, and with secondary indexing or composite column names, you can quickly filter data to your needs. This is best if you need to access one set of data per request. However, doing too many multigets on wide rows can increase memory pressure on nodes, and degrade performance.
Skinny row (1000 rows per set): On the other hand, a wide row can give rise to read hotspots in the cluster. This is especially true if you need to make a high volume of requests for a subset of data that exists entirely in one wide row. In such a case, a skinny row will distribute your requests more uniformly throughout the cluster, and avoid hotspots. Also, in my experience, "skinnier" rows tend to behave better with multigets.
I would suggest analyzing your data access pattern and finalizing your data model based on that, rather than the other way around.
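One way to get "skinnier" rows without changing the access pattern much is to add a bucket component to the partition key; a sketch in CQL (the bucket column and the choice of 100 buckets are assumptions, not part of the original schema):
-- each logical set (stuff_id) is split across N partitions; the writer picks
-- bucket = hash(stuff_column) % 100 so a reader can recompute it for a known column
CREATE TABLE stuff_bucketed (
    stuff_id     text,
    bucket       int,
    stuff_column text,
    value        blob,
    PRIMARY KEY ((stuff_id, bucket), stuff_column)
) WITH gc_grace_seconds = 39600
  AND compaction = {'class': 'LeveledCompactionStrategy'};
A read for a known stuff_column computes the same bucket and still hits a single partition, while the set as a whole is spread over the cluster.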
You'd be better off using 1 row per set with 150,000 columns per row. Using TTL is a good idea to get an automatic clean-up process.

Time for updating a sqlite3 index fluctuates too much

I have a large-ish sqlite3 (3.6.22) database (about 1 GB, 5 million rows) with a single table indexed on one column. The problem is that the time to do a typical INSERT transaction fluctuates widely. I insert about 10,000 rows at a time (wrapped in a transaction, of course). Often it takes about 1.5 seconds, but about every fifth transaction it suddenly takes several minutes for the very same transaction to complete. I've done a lot of experimentation, and I've discovered that the phenomenon only occurs if there is an index, which makes me think it is updating the index that takes a lot of time.
I need more consistent performance. Somewhat higher average insertion times would be OK, if I can only avoid some transactions suddenly taking 200x as long as the previous one... What should I do?
Here's the schema. The strings in blocks.md5 are always exactly 32 bytes long and likely unique. The rolling.value column will contain very large 64-bit integers.
CREATE TABLE blocks (blob char(32) NOT NULL,
                     offset long NOT NULL,
                     md5 char(32) NOT NULL,
                     row_md5 char(32));
CREATE TABLE rolling (value INT NOT NULL);
CREATE INDEX index_md5 ON blocks (md5);
CREATE UNIQUE INDEX index_rolling ON rolling (value);
I don't know exactly how sqlite indexes are implemented, but I'd expect the behavior you describe if they were storing the index on disk or reordering the data.
Imagine a scenario where, when allocating blocks for the index, they start some page with N slots for data. When the page fills up, they have to allocate another and split the data between them.
When you're inserting your data, the ordering of the MD5 values will be as random as it gets, so every page will fill up independently. There isn't any reasonable way for the indexing strategy to know that in advance.
Other databases will even recommend using different indexing strategies than normal for strings, especially in the case of something like random MD5s.
Trying to do this in an all-memory database would tell you whether it's algorithmic or disk access.
I've only really tried to avoid this in an offline system where I could sort the data before inserting. After it was all inserted, I would build the index, and that was as fast as I could find. If you're inserting 10k rows at a time, that might fit your use case, though I don't know.
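A sketch of that "load first, index later" approach for the schema above, assuming you can afford to defer the index until after the bulk load:
DROP INDEX IF EXISTS index_md5;

BEGIN TRANSACTION;
-- bulk load, e.g. INSERT INTO blocks (blob, offset, md5, row_md5) VALUES (...);
COMMIT;

CREATE INDEX index_md5 ON blocks (md5);
Building the index once at the end is a single sorted pass, which avoids the random page splits you get when inserting unordered MD5 values into a live index.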

Oracle : table with unused columns impact performance?

I have a table in my Oracle DB with 100 columns. 50 columns in this table are not used by the program accessing it. (i.e. the SELECT queries only select the relevant columns and do NOT use '*'.)
My question is this :
If I recreate the same table with only the columns I need, will it improve the performance of the same queries I used with the original table (remember that only the relevant columns are selected)?
It is well worth mentioning that the program runs these queries a fair number of times per second!
P.S. :
This is an existing project I am working on, and the table design was made a long time ago for other products as well (that's why we have unused columns now).
The effect of this will be that the average row is smaller, provided the extra columns contain data that will no longer be in the table. Therefore the table can be smaller, and not only will it use less space on disk, it will also use less memory in the SGA, and caching will be more efficient.
Therefore, if you access the table via a full table scan then it will be faster to read the segment, but if you use index-based access mechanisms then the only performance improvement is likely to be through an improved chance of fetching the block from cache.
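If you want to quantify the difference before committing to a rebuild, you could compare the segment sizes of the current table and a trimmed copy; a sketch with placeholder table and column names:
CREATE TABLE mytable_slim AS
SELECT col1, col2 /* ...only the columns the program actually uses... */
FROM   mytable;

SELECT segment_name,
       ROUND(bytes / 1024 / 1024) AS size_mb
FROM   user_segments
WHERE  segment_name IN ('MYTABLE', 'MYTABLE_SLIM');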
[Edited]
This SO thread suggests "it always pulls a tuple...". Hence, you are likely to see some performance improvement, though whether major or minor is unclear, as already mentioned.