Cassandra Wide Vs Skinny Rows for large columns - performance

I need to insert 60GB of data into cassandra per day.
This breaks down into
100 sets of keys
150,000 keys per set
4KB of data per key
In terms of write performance, am I better off using:
1 row per set with 150,000 keys per row
10 rows per set with 15,000 keys per row
100 rows per set with 1,500 keys per row
1000 rows per set with 150 keys per row
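The arithmetic behind these options can be checked with a short sketch. It assumes "4KB" means 4,000 bytes (1000 floats of 4 bytes each) and that 1 GB is 10^9 bytes; the names used are illustrative only.

```python
# Assumptions: 4KB = 4,000 bytes (1000 floats x 4 bytes), 1 GB = 10^9 bytes.
SETS = 100
KEYS_PER_SET = 150_000
BYTES_PER_KEY = 4_000

# Raw payload written per day, ignoring row and column overhead.
total_bytes = SETS * KEYS_PER_SET * BYTES_PER_KEY


def row_size_bytes(rows_per_set):
    """Raw payload per row when one set is split evenly across rows_per_set rows."""
    keys_per_row = KEYS_PER_SET // rows_per_set
    return keys_per_row * BYTES_PER_KEY


# Payload per row for each of the four layouts in the question.
layouts = {n: row_size_bytes(n) for n in (1, 10, 100, 1000)}

if __name__ == "__main__":
    print(f"total per day: {total_bytes / 1e9:.0f} GB")
    for n, size in layouts.items():
        print(f"{n:>4} rows per set -> {size / 1e6:.1f} MB per row")
```

Under these assumptions the single-row layout puts about 600 MB of payload in one row, while the 1000-row layout puts about 0.6 MB in each.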
Another variable to consider: my data expires after 24 hours, so I am using TTL=86400 to automate expiration.
More specific details about my configuration:
CREATE TABLE stuff (
    stuff_id text,
    stuff_column text,
    value blob,
    PRIMARY KEY (stuff_id, stuff_column)
) WITH COMPACT STORAGE AND
    bloom_filter_fp_chance=0.100000 AND
    caching='KEYS_ONLY' AND
    comment='' AND
    dclocal_read_repair_chance=0.000000 AND
    gc_grace_seconds=39600 AND
    read_repair_chance=0.100000 AND
    replicate_on_write='true' AND
    populate_io_cache_on_flush='false' AND
    compaction={'tombstone_compaction_interval': '43200', 'class': 'LeveledCompactionStrategy'} AND
    compression={'sstable_compression': 'SnappyCompressor'};
Access pattern details:
The 4KB value is a set of 1000 4 byte floats packed into a string.
A typical request is going to need a random selection of 20 - 60 of those floats.
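As a rough illustration of that value format, the sketch below packs 1000 floats into a 4,000-byte string and decodes only the positions a request needs. The little-endian byte order and the function names are assumptions; the original post does not say how the floats are packed.

```python
import random
import struct

FLOATS_PER_VALUE = 1000
FLOAT_SIZE = 4  # bytes per 32-bit float


def pack_levels(levels):
    """Pack 1000 floats into one 4,000-byte string (little-endian is an assumption)."""
    return struct.pack(f"<{FLOATS_PER_VALUE}f", *levels)


def read_levels(blob, indices):
    """Decode only the requested positions instead of unpacking the whole blob."""
    return [struct.unpack_from("<f", blob, i * FLOAT_SIZE)[0] for i in indices]


levels = [float(i) for i in range(FLOATS_PER_VALUE)]
blob = pack_levels(levels)

# A request needs a random selection of 20 - 60 floats; 40 is used here.
wanted = sorted(random.sample(range(FLOATS_PER_VALUE), 40))
subset = read_levels(blob, wanted)
```

Note that this only makes the client-side decoding selective: the whole 4KB column value still has to be fetched from Cassandra.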
Initially, those floats are all stored in the same logical row and column. A logical row here represents a set of data at a given time if it were all written to one row with 150,000 columns.
As time passes, some of the data is updated: within a logical row, for some subset of its columns, a random set of levels within the packed string changes. Instead of updating in place, the new levels are written to a new logical row, combined with other new data, to avoid rewriting all of the data that is still valid. This leads to fragmentation, because multiple rows now need to be accessed to retrieve that set of 20 - 60 values. A request will now typically read the same column from 1 - 5 different rows.
Test Method
I wrote 5 samples of random data for each configuration and averaged the results. Rates were calculated as (Bytes_written / (time * 10^6)). Time was measured in seconds with millisecond precision. Pycassa was used as the Cassandra interface, with its batch insert operator. Each insert writes multiple columns to a single row, and insert sizes are limited to 12 MB: the queue is flushed at 12 MB or less. Sizes account only for the data, not for row and column overhead. The data source and data sink are on the same network but on different systems.
Write results
Keep in mind there are a number of other variables in play due to the complexity of the Cassandra configuration.
1 row 150,000 keys per row: 14 MBps
10 rows 15,000 keys per row: 15 MBps
100 rows 1,500 keys per row: 18 MBps
1000 rows 150 keys per row: 11 MBps
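To put these rates in context, here is a small sketch that applies the rate formula from the test method and estimates how long the daily 60GB would take at each measured rate. It assumes 1 GB is 10^9 bytes and that a single client sustains the measured rate for the whole load, which the test did not verify.

```python
def rate_mbps(bytes_written, seconds):
    """Rate as defined in the test method: bytes / (seconds * 10^6)."""
    return bytes_written / (seconds * 10**6)


DAILY_BYTES = 60 * 10**9  # 60 GB per day, taking 1 GB as 10^9 bytes (assumption)

# Measured write rates in MBps, keyed by rows per set.
measured = {1: 14, 10: 15, 100: 18, 1000: 11}


def hours_to_write(daily_bytes, mbps):
    """Hours needed to write daily_bytes at a sustained rate of mbps."""
    return daily_bytes / (mbps * 10**6) / 3600


hours = {rows: hours_to_write(DAILY_BYTES, mbps) for rows, mbps in measured.items()}

if __name__ == "__main__":
    for rows, h in hours.items():
        print(f"{rows:>4} rows per set: {h:.2f} hours")
```

At these rates the daily load takes roughly 0.9 to 1.5 hours from one client, so every layout fits comfortably inside the 24-hour window.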

The answer depends on what your data retrieval pattern is, and how your data is logically grouped. Broadly, here is what I think:
Wide row (1 row per set): This could be the best solution as it prevents the request from hitting several nodes at once, and with secondary indexing or composite column names, you can quickly filter data to your needs. This is best if you need to access one set of data per request. However, doing too many multigets on wide rows can increase memory pressure on nodes, and degrade performance.
Skinny row (1000 rows per set): On the other hand, a wide row can give rise to read hotspots in the cluster. This is especially true if you need to make a high volume of requests for a subset of data that exists entirely in one wide row. In such a case, a skinny row will distribute your requests more uniformly throughout the cluster, and avoid hotspots. Also, in my experience, "skinnier" rows tend to behave better with multigets.
I would suggest analyzing your data access pattern first and finalizing your data model based on that, rather than the other way around.
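One way to move between the wide and skinny layouts is to derive the row key from the set and a hash bucket of the column name. The sketch below is a hypothetical scheme, not something taken from the question: the key format, the use of CRC32, and the sample names are all assumptions.

```python
import zlib
from collections import Counter


def row_key(set_id, column_name, rows_per_set):
    """Map a column to one of rows_per_set rows using a stable hash (CRC32)."""
    bucket = zlib.crc32(column_name.encode("utf-8")) % rows_per_set
    return f"{set_id}:{bucket}"


# Hypothetical column names for one set of 150,000 keys.
columns = [f"col{i}" for i in range(150_000)]

# Number of columns that land in each row when a set is split into 100 rows.
counts = Counter(row_key("set42", c, 100) for c in columns)

if __name__ == "__main__":
    print(f"rows used: {len(counts)}")
    print(f"smallest row: {min(counts.values())}, largest row: {max(counts.values())}")
```

With rows_per_set=1 this degenerates to the wide-row layout, so the same code path can be used to benchmark each option. Because the bucket is computed from the column name, a reader can find the right row without a lookup table.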

You'd be better off using 1 row per set with 150,000 columns per row. Using a TTL is a good way to get automatic cleanup.

Related

Heroku Row Limit

Why does Heroku have a row limit on their Hobby plan if there is already an overall database size limit? I'm confused because I've reached my row limit, but I'm nowhere near the size limit. Does the amount of rows you store affect what it costs for them to manage it or is that cost only affected by the amount of bytes in your data?
Edit: Also, what constitutes a row, because I added 50 items to a table but it only added one row to my row limit? I thought each item you add to a table is a "row" on the table.
It is to stop people using custom data types to store more than 1 row worth of info in a single row. They want to limit the amount of data people can store so they limit the number of rows, but to do this without limiting row size they also need an overall size limit.
The Heroku Postgres dev plan will be limited to 10,000 rows. No limit had been enforced before. This is a global limit.

Hive partition scenario and how it impacts performance

I want to ask about the number of Hive partitions and how it impacts performance.
Let me illustrate with a real example.
I have an external table that is expected to receive around 500M rows per day from multiple sources, and it will have 5 partition columns.
For one day, that resulted in 250 partitions, and with 1 year of retention I expect around 75K, which I suppose is a huge number: when I checked, Hive can go to 10K, but beyond that performance is going to be bad (and someone told me that partitions should not exceed 1K per table).
The queries that will select from this table break down mainly as follows:
50% of them will use all the partitions, in their exact order.
25% will use only partitions 1-3 and not the other 2.
25% will use only the 1st partition.
So do you think this may work well, even with 1 month of retention? Or would partitioning by start date only be enough? Assume a normal distribution across the other 4 columns (say 500M rows / 250 partitions, which gives 2M rows per partition).
I would go with 3 partition columns, since that will a) exactly match ~50% of your query profiles, and b) substantially reduce (prune) the number of scanned partitions for the other 50%. At the same time, you won't be pressured to increase your Hive MetaStore (HMS) heap memory and beef up the HMS backend database to work efficiently with 250 x 364 = 91,000 partitions.
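The partition counts in this discussion follow from simple multiplication, sketched below. The 250 partitions per day and 500M rows per day come from the question; treating one month as 30 days is an assumption.

```python
PARTITIONS_PER_DAY = 250      # observed by the asker with 5 partition columns
ROWS_PER_DAY = 500_000_000    # expected daily volume


def total_partitions(retention_days, per_day=PARTITIONS_PER_DAY):
    """Partitions the metastore must track for a given retention period."""
    return retention_days * per_day


rows_per_partition = ROWS_PER_DAY // PARTITIONS_PER_DAY
one_year = total_partitions(364)   # the 250 x 364 figure used in the answer
one_month = total_partitions(30)   # assuming a 30-day month

if __name__ == "__main__":
    print(f"rows per partition: {rows_per_partition:,}")
    print(f"1 year retention:   {one_year:,} partitions")
    print(f"1 month retention:  {one_month:,} partitions")
```

Even one month of retention gives 7,500 partitions with the 5-column scheme, which is already close to the 10K figure mentioned in the question.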
Since the time a 10K limit was introduced, significant efforts have been made to improve partition-related operations in HMS. See for example JIRA HIVE-13884, which provides the motivation to keep that number low and describes the way high numbers are being addressed:
The PartitionPruner requests either all partitions or partitions based
on filter expression. In either scenarios, if the number of partitions
accessed is large there can be significant memory pressure at the HMS
server end.
... PartitionPruner [can] first fetch the partition names (instead of
partition specs) and throw an exception if number of partitions
exceeds the configured value. Otherwise, fetch the partition specs.
Note that partition specs (mentioned above) and the statistics gathered per partition (always recommended for efficient querying) make up the bulk of the data that HMS must store and cache for good performance.

Large Datatype length performance impact in Oracle?

I am adding a column with datatype varchar2(1000). This column will be used to store a large message (approximately 600 characters). Does having a large datatype length affect query performance, and if so, how? I will occasionally run a query that selects that column. Does the table consume extra memory even if the value in that field is only 100 characters in some rows?
Does it affect performance? It depends.
If "adding a column" implies that you have an existing table with existing data that you're adding a new column to, are you going to populate the new column for old data? If so, depending on your PCTFREE settings and the existing size of the rows, increasing the size of every row by an average of 600 bytes could well lead to row migration which could potentially increase the amount of I/O that queries need to perform to fetch a row. You may want to create a new table with the new column and move the old data to the new table while simultaneously populating the new column if this is a concern.
If you have queries that involve full table scans on the table, anything that you do that increases the size of the table will negatively impact the speed of those queries since they now have to read more data.
When you increase the size of a row, you decrease the number of rows per block. That would tend to increase the pressure on your buffer cache so you'd either be caching fewer rows from this table or you'd be aging out some other blocks faster. Either of those could lead to individual queries doing more physical I/O rather than logical I/O and thus running longer.
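The effect on rows per block can be estimated with a rough sketch. It assumes an 8 KB block size and PCTFREE of 10 (both common defaults), ignores block and row header overhead, and uses a hypothetical 200-byte existing row, so the numbers are only indicative.

```python
BLOCK_SIZE = 8192   # bytes; a common default block size (assumption for this table)
PCTFREE = 10        # percent of each block reserved for updates (assumption)


def rows_per_block(avg_row_bytes, block_size=BLOCK_SIZE, pctfree=PCTFREE):
    """Rough estimate that ignores block and row header overhead."""
    usable = block_size * (100 - pctfree) // 100
    return usable // avg_row_bytes


before = rows_per_block(200)        # hypothetical 200-byte rows
after = rows_per_block(200 + 600)   # same rows plus a 600-byte message

if __name__ == "__main__":
    print(f"rows per block before: {before}, after: {after}")
```

Under these assumptions a full scan has to read about four times as many blocks for the same number of rows, which is the cost the paragraphs above describe.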
A VARCHAR2(1000) will only use whatever space is actually required to store a particular value. If some rows only need 100 bytes, Oracle would only allocate 100 bytes within the block. If other rows need 900 bytes, Oracle would allocate 900 bytes within the block.

Pytables time performance

I'm working on a project related to text detection in natural images. I have to train a classifier, and for that I'm using PyTables to store information. I have:
62 classes (a-z,A-Z,0-9)
Each class has between 100 and 600 tables
Each table has 1 single column to store a 32bit Float
Each column has between 2^2 and 2^8 rows (depending on parameters)
My problem is that after I train the classifier, it takes a lot of time to read the information during the test. For example, one database has 27,900 tables (62 classes * 450 tables per class) with 4 rows per table, and it took approximately 4 hours to read and retrieve all the information I need. The test program reads each table 390 times (for classes A-Z, a-z) and 150 times for classes 0-9 to get all the info I need. Is that normal?
I tried to use the index option for the only column, but I don't see any performance improvement. I work on a virtual machine with 2GB RAM on an HP Pavilion dv6 (4GB DDR3 RAM, Core 2 Duo).
This is likely because column lookup on tables is one of the slower operations you can do and this is where ALL of your information lives. You have two basic options to increase performance for Tables with many columns and few rows:
Pivot this structure such that you have a Table with many rows and few columns.
Move to a more efficient data structure like a CArray or EArray for every row / column.
Additionally, you can try using compression to speed things up. This is sort of generic advice, because you haven't included any code.

How HBase partitions table across regionservers?

Please tell me how HBase partitions table across regionservers.
For example, let's say my row keys are integers from 0 to 10M and I have 10 regionservers.
Does this mean that the first regionserver will store all rows with keys 0 - 1M, the second 1M - 2M, the third 2M - 3M, ... and the tenth 9M - 10M?
I would like my row key to be a timestamp, but in that case, since most queries would apply to the latest dates, all queries would be processed by only one regionserver. Is that true?
Or maybe this data would be spread differently?
Or maybe I can somehow create more regions than I have region servers, so that (following the example above) server 1 would have keys 0 - 0.5M and 3M - 3.5M? That way my data would be spread more evenly. Is this possible?
update
I just found that there's option hbase.hregion.max.filesize, do you think this will solve my problem?
With regard to partitioning, you can read Lars' blog post on HBase's architecture or Google's Bigtable paper, which HBase "clones".
If your row key is only a timestamp, then yes the region with the biggest keys will always be hit with new requests (since a region is only served by a single region server).
Do you want to use timestamps in order to do short scans? If so, consider salting your keys (search Google for how Mozilla did it with Socorro).
Can you prefix the timestamp with any ID? For example, if you only request data for specific users, then prefix the timestamp with that user ID and it will give you a much better load distribution.
If not, then use UUIDs or anything else that will randomly distribute your keys.
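The two key designs suggested above can be sketched as follows. This is a minimal illustration, not Mozilla's actual scheme: the bucket count, the CRC32 hash, the key formats, and the function names are all assumptions.

```python
import zlib


def salted_key(timestamp_ms, buckets=10):
    """Prefix a timestamp with a hash-derived salt so writes spread over buckets."""
    raw = str(timestamp_ms)
    salt = zlib.crc32(raw.encode("ascii")) % buckets
    return f"{salt:02d}-{raw}"


def prefixed_key(user_id, timestamp_ms):
    """Prefix a zero-padded timestamp with a user ID so each user's rows sort together."""
    return f"{user_id}-{timestamp_ms:013d}"


# 1000 consecutive millisecond timestamps (hypothetical values).
keys = [salted_key(1_400_000_000_000 + i) for i in range(1000)]
prefixes = {k.split("-")[0] for k in keys}

if __name__ == "__main__":
    print(f"distinct salt prefixes: {len(prefixes)}")
    print(prefixed_key("user17", 1_400_000_000_000))
```

The trade-off with salting is that a time-range scan has to be issued once per bucket and the results merged, whereas the user-ID prefix keeps one user's data contiguous and scannable in time order.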
About hbase.hregion.max.filesize
Setting the maxfilesize on that table (which you can do with the shell), doesn't make it that each region is exactly X MB (where X is the value you set) big. So let's say your row keys are all timestamps, which means that each new row key is bigger than the previous one. This means that it will always be inserted in the region with the empty end key (the last one). At some point, one of the files will grow bigger than maxfilesize (through compactions), and that region will be split around the middle. The lower keys will be in their own region, the higher keys in another one. But since your new row key is always bigger than the previous, this means that you will only write to that new region (and so on).
tl;dr even though you have more than 1,000 regions, with this schema the region with the biggest row keys will always get the writes, which means that the hosting region server will become a bottleneck.
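The splitting behaviour described above can be mimicked with a toy simulation. This is not HBase code: it uses a row count per region as a stand-in for maxfilesize, splits at the middle key, and the names are illustrative.

```python
import bisect


def simulate(num_rows, max_rows_per_region):
    """Insert increasing keys; split a region in the middle when it grows too large."""
    regions = [[]]        # each region is a sorted list of row keys
    start_keys = [0]      # start_keys[i] is the lowest key region i may hold
    writes_to_last = 0
    for key in range(num_rows):   # monotonically increasing, like timestamps
        # Find the region whose key range contains this key.
        idx = bisect.bisect_right(start_keys, key) - 1
        if idx == len(regions) - 1:
            writes_to_last += 1
        region = regions[idx]
        bisect.insort(region, key)
        if len(region) > max_rows_per_region:
            # Split around the middle: lower keys stay, higher keys form a new region.
            mid = len(region) // 2
            regions[idx:idx + 1] = [region[:mid], region[mid:]]
            start_keys.insert(idx + 1, region[mid])
    return len(regions), writes_to_last


region_count, writes_to_last = simulate(10_000, 1_000)

if __name__ == "__main__":
    print(f"regions: {region_count}, writes that hit the last region: {writes_to_last}")
```

However many regions the splits produce, every write in this simulation lands in whichever region is last at that moment, which is the bottleneck the tl;dr describes.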
The option hbase.hregion.max.filesize, which is 256MB by default, sets the maximum region size; once a region reaches this limit, it is split. This means that my data will be stored in multiple regions of 256MB and possibly one smaller one.
So
I would like my row key to be a timestamp, but in that case, since most queries would apply to the latest dates, all queries would be processed by only one regionserver. Is that true?
This is not true, because the latest data will also be split into regions of 256MB and stored on different regionservers.
