I have a total of 50,000 records, and each row has 20 columns made up of a combination of character, date, and numeric fields. I need to estimate how many megabytes of database space Oracle will require for the table, indexes, and other considerations such as block size.
Could you please help?
Thanks so much!
Got my answers. Thank you for your time.
Check the view USER_SEGMENTS; there you can see the data size (column BYTES) for tables and indexes. Take these values and divide by 50,000 to get the average size per row, which you can then use to estimate the total size.
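For example, a minimal sketch of that lookup, assuming the table and its primary key index are named MY_TABLE and MY_TABLE_PK (substitute your own segment names):

-- Rough size per row from USER_SEGMENTS; segment names are placeholders
SELECT segment_name,
       segment_type,
       bytes,
       ROUND(bytes / 50000) AS approx_bytes_per_row
FROM   user_segments
WHERE  segment_name IN ('MY_TABLE', 'MY_TABLE_PK');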
I think 50,000 rows as a sample size should be fine. If the sample were too small, the estimate would be poor.
Nowadays you typically use Locally Managed Tablespaces, where PCTINCREASE does not apply. For block_size, use the default of 8K. Don't spend too much effort on such a small amount of data.
I am trying to understand the metrics around the mark cache on an AggregatingMergeTree on 21.8-altinitystable.
What is the difference between the primary_key_bytes_in_memory and primary_key_bytes_in_memory_allocated columns in the system.parts table? Do they represent the portion of mark_bytes that is in memory in the mark cache?
Are they related in any way with the MarkCacheBytes metric in the system.asynchronous_metrics table?
I have a 4 GB mark cache; MarkCacheBytes shows it being completely used, but the sum of primary_key_bytes_in_memory and primary_key_bytes_in_memory_allocated across all tables and parts is much lower (roughly 1 GB and 2 GB respectively).
Thanks
Filippo
Sorry for the previous answer.
Let me try to explain in more detail:
What is the difference between these columns on the system.parts table? primary_key_bytes_in_memory and primary_key_bytes_in_memory_allocated?
According to the source
https://github.com/ClickHouse/ClickHouse/blob/229d35408b61a814dc1cb5a4cefcfa852efa13fe/src/Storages/System/StorageSystemParts.cpp#L181-L184
primary_key_bytes_in_memory - the size of primary.idx loaded into memory
primary_key_bytes_in_memory_allocated - when primary.idx is loaded into memory it is split by columns, and the memory allocated during that split is a little bigger than the raw size
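To see those two numbers per part, you can query system.parts directly; 'my_table' is a placeholder table name here:

SELECT
    table,
    name AS part,
    primary_key_bytes_in_memory,
    primary_key_bytes_in_memory_allocated
FROM system.parts
WHERE active AND table = 'my_table';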
Do they represent the portion of mark_bytes that is in memory in the mark cache?
No, they represent only the in-memory representation of primary.idx for the selected part.
Are they related in any way with the MarkCacheBytes metric in the system.asynchronous_metrics table?
No, the fields above are not related to the mark cache. The MarkCache-related metrics show only the <column_name>.mrk2 files loaded into memory, plus the cache hits and misses for that mark cache.
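A quick sketch to check the mark cache itself, using the asynchronous metrics and the hit/miss event counters (metric and event names as they appear in recent ClickHouse versions):

SELECT metric, value
FROM system.asynchronous_metrics
WHERE metric IN ('MarkCacheBytes', 'MarkCacheFiles');

SELECT event, value
FROM system.events
WHERE event IN ('MarkCacheHits', 'MarkCacheMisses');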
Every record in primary.idx contains the values of the primary key fields and the granule number, with one record per granule; a granule is 8192 rows of raw data.
Every record in <column_name>.mrk2 contains the offset of the beginning of the granule in the compressed file <column_name>.bin, the offset within the decompressed block, and the number of rows of <column_name> contained in the granule.
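To relate that back to concrete parts, system.parts also exposes the row count, the number of marks (roughly one per granule), and the on-disk size of the marks; 'my_table' is again a placeholder:

SELECT name AS part,
       rows,
       marks,
       marks_bytes
FROM system.parts
WHERE active AND table = 'my_table';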
I hope this helps you figure it out:
primary_key_bytes_in_memory* is for primary.idx
MarkCache is for the *.mrk files
See https://clickhouse.com/docs/en/operations/server-configuration-parameters/settings#server-mark-cache-size
and
https://clickhouse.com/docs/en/guides/improving-query-performance/sparse-primary-indexes#a-table-with-a-primary-key
for details
Why does Heroku have a row limit on their Hobby plan if there is already an overall database size limit? I'm confused because I've reached my row limit, but I'm nowhere near the size limit. Does the number of rows you store affect what it costs for them to manage it, or is that cost only affected by the number of bytes in your data?
Edit: Also, what constitutes a row? I added 50 items to a table but it only added one row to my row count. I thought each item you add to a table is a "row" in the table.
The size limit is there to stop people from using custom data types to store more than one row's worth of info in a single row. They want to limit the amount of data people can store, so they limit the number of rows; but to do that without limiting row size, they also need an overall size limit.
The Heroku Postgres dev plan is limited to 10,000 rows. Previously the limit had not been enforced. It is a global limit across the whole database.
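If you want a rough idea of where your rows are going, the Postgres planner statistics give an approximate per-table count; this is only an estimate, and I'm assuming it's close to (but not necessarily identical to) what Heroku reports:

SELECT relname AS table_name,
       n_live_tup AS approx_rows
FROM   pg_stat_user_tables
ORDER  BY n_live_tup DESC;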
I have an H2 database that has ballooned to several gigabytes in size, causing all sorts of operational problems. The database size didn't seem right, so I took one little slice of it, just one table, to try to figure out what's going on.
I brought this table into a test environment:
The columns add up to 80 bytes per row, per my calculations.
The table has 280,000 rows.
For this test, all indexes were removed.
The table should occupy approximately:
80 bytes per row * 280,000 rows = 22.4 MB on disk.
However, it is physically taking up 157 MB.
I would expect to see some overhead here and there, but why is this database a full 7x larger than can be reasonably estimated?
UPDATE
Output from CALL DISK_SPACE_USED
There are always indexes, etc. to be taken into account.
Can you try:
CALL DISK_SPACE_USED('my_table');
I would also recommend running SHUTDOWN DEFRAG and calculating the size again.
Setting MV_STORE=FALSE on database creation solves the problem. The whole database (not just the test slice from the example) is now approximately 10x smaller.
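For reference, the setting goes into the connection URL when the database is created; this is a sketch assuming H2 1.4.x and a placeholder database path:

jdbc:h2:~/my_database;MV_STORE=FALSE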
Update
I had to revisit this topic recently and ran a comparison against MySQL. On my test dataset, with MV_STORE=FALSE, the H2 database takes up 360 MB of disk space, while the same data on MySQL 5.7 InnoDB with default-ish configuration takes up 432 MB. YMMV.
I have a huge table in a data warehouse (Vertica). I am accessing this table in chunks for optimization purposes. The way I decide my chunks is pretty straightforward. I have a primary key column, say A, and I take MAX(A). I use a chunk size of 20,000, which gives me (MAX(A)/20000)+1 chunks. I then frame a query for each chunk and retrieve the data.
The problem with this approach is as follows:
My number of chunks depends on MAX(A), and since MAX(A) is growing very fast, the number of chunks grows with it.
I decided on 20,000 because that is what gives me optimal performance, but the distribution of primary keys within each chunk of 20,000 is very scattered. For example, the range 0-20000 might contain only 3 elements, the range 20000-40000 might contain 500 elements, and no range comes close to 20,000.
I am trying to figure out whether there is a good approximation algorithm for this problem that minimizes the number of chunks while packing close to 20,000 primary keys into each chunk.
Any pointers towards a solution are appreciated.
I'm not sure what "optimization purposes" means, but I think the best approach would be to create a timestamp column, or use an eligible existing timestamp column, and partition on it. You could then partition on a larger frame of reference so there isn't a wide range between partitions.
If the table is partitioned, it will be able to benefit from partition pruning. This means that, during query execution, Vertica can eliminate the storage containers that do not match the timestamp predicate.
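A minimal sketch of that, assuming a table my_table with a non-null timestamp column created_at (both placeholder names; the exact partition-expression rules vary by Vertica version), partitioning by month:

ALTER TABLE my_table
  PARTITION BY DATE_TRUNC('month', created_at)::DATE REORGANIZE;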
Otherwise, you can look at the segmentation clause and use the max/min from the storage containers. This could be slightly more complicated.
I am adding a column with datatype VARCHAR2(1000). This column will be used to store a large message (approximately 600 characters). Does having a large datatype length affect query performance, and if so, how? I will only occasionally run a query that selects this column. Does the table consume extra space even if the value in that field is sometimes only 100 characters?
Does it affect performance? It depends.
If "adding a column" implies that you have an existing table with existing data that you're adding a new column to, are you going to populate the new column for old data? If so, depending on your PCTFREE settings and the existing size of the rows, increasing the size of every row by an average of 600 bytes could well lead to row migration which could potentially increase the amount of I/O that queries need to perform to fetch a row. You may want to create a new table with the new column and move the old data to the new table while simultaneously populating the new column if this is a concern.
If you have queries that involve full table scans on the table, anything that you do that increases the size of the table will negatively impact the speed of those queries since they now have to read more data.
When you increase the size of a row, you decrease the number of rows per block. That would tend to increase the pressure on your buffer cache so you'd either be caching fewer rows from this table or you'd be aging out some other blocks faster. Either of those could lead to individual queries doing more physical I/O rather than logical I/O and thus running longer.
A VARCHAR2(1000) will only use whatever space is actually required to store a particular value. If some rows only need 100 bytes, Oracle would only allocate 100 bytes within the block. If other rows need 900 bytes, Oracle would allocate 900 bytes within the block.
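You can verify this for yourself with VSIZE, which reports the number of bytes Oracle actually stores for a value (my_table and message are placeholder names):

SELECT message,
       LENGTH(message) AS char_length,
       VSIZE(message)  AS stored_bytes
FROM   my_table
WHERE  ROWNUM <= 10;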