Why does Heroku have a row limit on their Hobby plan if there is already an overall database size limit? I'm confused because I've reached my row limit, but I'm nowhere near the size limit. Does the number of rows you store affect what it costs for them to manage it, or is that cost only affected by the number of bytes in your data?
Edit: Also, what constitutes a row? I added 50 items to a table, but it only added one row to my row count. I thought each item you add to a table is a "row" in the table.
The size limit is there to stop people from using custom data types to pack more than one row's worth of information into a single row. Heroku wants to limit the amount of data people can store, so they limit the number of rows; but since they don't limit the size of an individual row, they also need an overall size limit.
The Heroku Postgres dev plan will be limited to 10,000 rows; no limit had been enforced before. This is a global limit.
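To make the packing point concrete, here is a hedged illustration (table and column names are made up, not from Heroku's docs): with a Postgres array column, many items can land in a single row, which is exactly why a size limit is needed on top of the row limit, and it may also explain how 50 items can show up as one row.

-- Hypothetical Postgres table that packs many items into one row
CREATE TABLE todo_lists (
    id    serial PRIMARY KEY,
    items text[]          -- arbitrarily many elements, still one row
);

-- 50 items inserted, but only 1 row counted against the limit
INSERT INTO todo_lists (items)
VALUES (ARRAY['item 1', 'item 2', 'item 3' /* ... up to item 50 */]);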
I want to ask about the number of Hive partitions and how they impact performance.
Let me illustrate this with a real example.
I have an external table that is expected to receive around 500M rows per day from multiple sources, and it will have 5 partition columns.
For one day, that resulted in 250 partitions, and with 1-year retention it will reach around 75K partitions, which I suppose is a huge number: from what I checked, Hive can go up to 10K, but beyond that performance degrades (and someone told me partitions should not exceed 1K per table).
Mainly, the queries that will select from this table:
50% of them will use the exact order of the partition columns.
25% will use only partition columns 1-3 and not the other 2.
25% will use only the 1st partition column.
So do you think this may work well even with 1-month retention? Or would only the start date be enough as a partition column, assuming a normal distribution across the other 4 columns (say 500M rows / 250 partitions, i.e. about 2M rows per partition)?
I would go with 3 partition columns, since that will a) exactly match ~50% of your query profiles, and b) substantially reduce (prune) the number of scanned partitions for the other 50%. At the same time, you won't be pressured to increase your Hive MetaStore (HMS) heap memory and beef up the HMS backend database to work efficiently with 250 x 364 = 91,000 partitions.
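As a rough sketch of that recommendation (column names and storage details are made up, not from the question), the DDL for a 3-partition-column layout could look like this, with the remaining two attributes demoted to ordinary data columns:

-- Hypothetical external table partitioned by only 3 of the original 5 columns
CREATE EXTERNAL TABLE events (
    event_id  STRING,
    payload   STRING,
    attr4     STRING,   -- former 4th partition column, now a regular column
    attr5     STRING    -- former 5th partition column, now a regular column
)
PARTITIONED BY (
    event_date STRING,  -- 1st partition column, used by every query profile
    source     STRING,
    category   STRING
)
STORED AS ORC
LOCATION '/data/events';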
Since the time a 10K limit was introduced, significant efforts have been made to improve partition-related operations in HMS. See, for example, JIRA HIVE-13884, which provides the motivation for keeping that number low and describes how high partition counts are being addressed:
The PartitionPruner requests either all partitions or partitions based on filter expression. In either scenarios, if the number of partitions accessed is large there can be significant memory pressure at the HMS server end.
... PartitionPruner [can] first fetch the partition names (instead of partition specs) and throw an exception if number of partitions exceeds the configured value. Otherwise, fetch the partition specs.
Note that partition specs (mentioned above) and the statistics gathered per partition (always recommended for efficient querying) are what constitute the bulk of the data HMS has to store and cache for good performance.
I have a total of 50,000 records, where each row has 20 columns with a combination of character, date, and numeric fields. I need to estimate how many megabytes of database space Oracle will require for the table, indexes, and other considerations such as block size.
Could you please help?
Thanks so much!
Got my answers. Thank you for your time.
Check the view USER_SEGMENTS; there you can see the data size (column BYTES) for tables and indexes. Take these values and divide by 50,000 to get the average size per row, which you can then use to estimate the total size.
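For example (segment names here are hypothetical placeholders, not from the question), something along these lines:

-- Size of the sample table and its indexes
SELECT segment_name, segment_type, bytes
FROM   user_segments
WHERE  segment_name IN ('MY_TABLE', 'MY_TABLE_PK', 'MY_TABLE_IDX1');

-- Average bytes per row, based on the 50,000-row sample
SELECT bytes / 50000 AS avg_bytes_per_row
FROM   user_segments
WHERE  segment_name = 'MY_TABLE';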
I think 50,000 rows as a sample size should be fine. If the sample were too small, your estimate would be poor.
Nowadays you typically use Locally Managed Tablespaces, where PCTINCREASE does not apply. For the block size, use the default of 8K. Don't waste so much effort on such a small amount of data.
I am adding a column with datatype VARCHAR2(1000). This column will be used to store a large message (approximately 600 characters). Does having a large declared length affect query performance, and if so, how? I will have a query selecting that column occasionally. Does the table consume extra space even if the value in that field is only 100 characters in some rows?
Does it affect performance? It depends.
If "adding a column" implies that you have an existing table with existing data that you're adding a new column to, are you going to populate the new column for old data? If so, depending on your PCTFREE settings and the existing size of the rows, increasing the size of every row by an average of 600 bytes could well lead to row migration which could potentially increase the amount of I/O that queries need to perform to fetch a row. You may want to create a new table with the new column and move the old data to the new table while simultaneously populating the new column if this is a concern.
If you have queries that involve full table scans on the table, anything that you do that increases the size of the table will negatively impact the speed of those queries since they now have to read more data.
When you increase the size of a row, you decrease the number of rows per block. That would tend to increase the pressure on your buffer cache so you'd either be caching fewer rows from this table or you'd be aging out some other blocks faster. Either of those could lead to individual queries doing more physical I/O rather than logical I/O and thus running longer.
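As a purely illustrative calculation (the numbers are assumptions, not taken from your table): with an 8K block and the default PCTFREE of 10, there are roughly 7,300 usable bytes per block, so rows averaging 200 bytes fit about 36 to a block, while rows averaging 800 bytes fit only about 9, meaning the same number of cached blocks holds roughly a quarter as many rows.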
A VARCHAR2(1000) will only use whatever space is actually required to store a particular value. If some rows only need 100 bytes, Oracle would only allocate 100 bytes within the block. If other rows need 900 bytes, Oracle would allocate 900 bytes within the block.
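You can check this yourself; for instance (table and column names hypothetical, assuming a single-byte character set), VSIZE shows the bytes actually stored for each value:

SELECT message, VSIZE(message) AS stored_bytes
FROM   my_table
WHERE  ROWNUM <= 5;   -- a 100-character value reports ~100 bytes, not 1000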
I need to insert 60GB of data into cassandra per day.
This breaks down into
100 sets of keys
150,000 keys per set
4KB of data per key
In terms of write performance am I better off using
1 row per set with 150,000 keys per row
10 rows per set with 15,000 keys per row
100 rows per set with 1,500 keys per row
1000 rows per set with 150 keys per row
Another variable to consider: my data expires after 24 hours, so I am using TTL=86400 to automate expiration.
More specific details about my configuration:
CREATE TABLE stuff (
  stuff_id text,
  stuff_column text,
  value blob,
  PRIMARY KEY (stuff_id, stuff_column)
) WITH COMPACT STORAGE AND
  bloom_filter_fp_chance=0.100000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=39600 AND
  read_repair_chance=0.100000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  compaction={'tombstone_compaction_interval': '43200', 'class': 'LeveledCompactionStrategy'} AND
  compression={'sstable_compression': 'SnappyCompressor'};
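Given that schema, a write with the 24-hour TTL might look like the following CQL sketch (the key and value below are placeholders):

INSERT INTO stuff (stuff_id, stuff_column, value)
VALUES ('set_042', 'key_000123', 0x0123abcd)
USING TTL 86400;   -- expires after 24 hours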
Access pattern details:
The 4KB value is a set of 1000 4 byte floats packed into a string.
A typical request is going to need a random selection of 20 - 60 of those floats.
Initially, those floats are all stored in the same logical row and column. A logical row here represents a set of data at a given time, as if it were all written to one row with 150,000 columns.
As time passes, some of the data is updated: within a logical row, within the set of columns, a random set of levels inside the packed string will be updated. Instead of updating in place, the new levels are written to a new logical row, combined with other new data, to avoid rewriting all of the data that is still valid. This leads to fragmentation, as multiple rows now need to be accessed to retrieve that set of 20 - 60 values. A request will now typically read the same column across 1 - 5 different rows.
Test Method
I wrote 5 samples of random data for each configuration and averaged the results. Rates were calculated as (Bytes_written / (time * 10^6)), with time measured in seconds at millisecond precision. Pycassa was used as the Cassandra interface, using its batch insert operator. Each insert writes multiple columns to a single row; insert sizes are limited to 12 MB, and the queue is flushed at 12 MB or less. Sizes do not account for row and column overhead, just data. The data source and data sink are on the same network, on different systems.
Write results
Keep in mind there are a number of other variables in play due to the complexity of the Cassandra configuration.
1 row 150,000 keys per row: 14 MBps
10 rows 15,000 keys per row: 15 MBps
100 rows 1,500 keys per row: 18 MBps
1000 rows 150 keys per row: 11 MBps
The answer depends on what your data retrieval pattern is, and how your data is logically grouped. Broadly, here is what I think:
Wide row (1 row per set): This could be the best solution as it prevents the request from hitting several nodes at once, and with secondary indexing or composite column names, you can quickly filter data to your needs. This is best if you need to access one set of data per request. However, doing too many multigets on wide rows can increase memory pressure on nodes, and degrade performance.
Skinny row (1000 rows per set): On the other hand, a wide row can give rise to read hotspots in the cluster. This is especially true if you need to make a high volume of requests for a subset of data that exists entirely in one wide row. In such a case, a skinny row will distribute your requests more uniformly throughout the cluster, and avoid hotspots. Also, in my experience, "skinnier" rows tend to behave better with multigets.
I would suggest analyzing your data access pattern and finalizing your data model based on that, rather than the other way around.
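As a hedged sketch of the two extremes discussed above (table and column names are placeholders, not your actual schema), the difference comes down to how the partition key is chosen:

-- Wide row: one partition per set, up to 150,000 columns in it
CREATE TABLE stuff_wide (
    set_id   text,
    item_key text,
    value    blob,
    PRIMARY KEY (set_id, item_key)
);

-- Skinnier rows: split each set into buckets, e.g. 1,000 buckets of ~150 keys
CREATE TABLE stuff_bucketed (
    set_id   text,
    bucket   int,       -- e.g. derived from hash(item_key) % 1000 on the client
    item_key text,
    value    blob,
    PRIMARY KEY ((set_id, bucket), item_key)
);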
You'd be better off using 1 row per set with 150,000 columns per row. Using TTL is a good idea to have an auto-cleaning process.
A brand-new application with Oracle as the data store is going to be pushed to production. The database uses the CBO, and I have identified some columns for indexing. I expect the total number of records in a particular table to be 4 million after 6 months. After that, very few records will be added, and there will not be any updates to the indexed columns; most of the updates will be on non-indexed columns.
Is it advisable to create the indexes now, or should I wait a couple of months?
If the table requires indexes, you will incur a lot of poor performance (full table scans plus actual I/O) once the number of rows in the table grows beyond what might reasonably be kept in the cache. Assume that is 20,000 rows; we'll call it the magic number. You will hit 20,000 rows within a week of production. After that, queries and updates on the table will grow progressively slower, on average, as more rows are added.
You are probably worried about the overhead of inserting new rows with indexed fields. That is a one-time hit. You are trading that against dozens of queries and updates that will be slower for as long as you delay adding indexes.
The trade-off is largely in favor of adding indexes right now, especially since we do not know what that magic number (20,000?) really is. It could be larger or smaller.
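If you do add them now, it is just the usual DDL up front, plus gathering statistics for the CBO once data starts arriving (index, table, and column names here are hypothetical):

CREATE INDEX orders_customer_id_idx ON orders (customer_id);

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'ORDERS', cascade => TRUE);
END;
/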