Vertica GET_COMPLIANCE_STATUS() discrepancy

We use Vertica Community Edition, which allows us to store up to 1 TB of data. Vertica is hosted on-premises, and we have allocated 153 GB of disk for it, of which 57 GB (39%) is used so far.
When I run SELECT GET_COMPLIANCE_STATUS(), it shows I have used 0.91 TB (91%) of the allowed disk space. I have executed SELECT AUDIT_LICENSE_SIZE() to make sure we get the latest compliance data.
I am wondering why these numbers do not match.

The disk usage figure reflects the compressed/encoded ROS files on disk.
For the license calculation, Vertica sums the uncompressed size of the data; the 1 TB Community Edition limit refers to that uncompressed size.
Disk usage can vary depending on how many projections you create.
Note: additional projections do not count toward the license size. Only additional tables and external tables do.

As @minatverma says, the audit size is the uncompressed size of the data.
It is effectively the answer to the question: how many terabytes would the export files occupy if you exported all data tables to CSV, not counting delimiters and counting 0 bytes for NULL values?
This has only a very theoretical correlation with the size of the ROS files on disk. Vertica is a columnar database; each column, roughly, is one file.
So if you have, for example, a gender column that can only assume 'M' or 'F', have the projection ordered by this column first, and the column encoded with Run-Length Encoding (RLE), that file will not occupy more than some twenty bytes, whether the table has 100 rows or 1 million: the value 'F' followed by the integer 500002 (the value occurs that many times), then the value 'M' followed by the integer 499998.
So, you see, the two numbers have little to do with each other: in the CSV export, that same column contributes 1 byte per row, one million bytes in total.
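To see the two figures side by side, here is a minimal sketch, assuming the vertica-python client and hypothetical connection details: AUDIT('') returns the raw, license-relevant size, while v_monitor.projection_storage reflects the encoded footprint on disk.

import vertica_python

conn_info = {'host': 'localhost', 'port': 5433, 'user': 'dbadmin',
             'password': '', 'database': 'mydb'}  # hypothetical credentials

conn = vertica_python.connect(**conn_info)
cur = conn.cursor()

# License-relevant size: raw, uncompressed bytes, as if exported to CSV.
cur.execute("SELECT AUDIT('')")
raw_bytes = cur.fetchone()[0]

# On-disk size: compressed/encoded ROS containers across all projections.
cur.execute("SELECT SUM(used_bytes) FROM v_monitor.projection_storage")
disk_bytes = cur.fetchone()[0]

print('license (raw):  %.1f GB' % (raw_bytes / 1024.0**3))
print('disk (encoded): %.1f GB' % (disk_bytes / 1024.0**3))
conn.close()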

Related

Relationship between primary_key_bytes_in_memory and mark cache size

I am trying to understand the metrics around the mark cache on an AggregatingMergeTree on 21.8-altinitystable.
What is the difference between these columns on the system.parts table? primary_key_bytes_in_memory and primary_key_bytes_in_memory_allocated? Do they represent the portion of mark_bytes that are in memory in the mark cache?
Are they related in any way with the MarkCacheBytes metric in the system.asynchronous_metrics table?
I have a 4 GB mark cache size; MarkCacheBytes shows it being completely used, but the sum of both primary_key_bytes_in_memory and primary_key_bytes_in_memory_allocated across tables and parts is much lower (roughly 1 GB and 2 GB respectively).
Thanks
Filippo
Sorry for the previous answer.
Let me explain in more detail:
What is the difference between these columns on the system.parts table? primary_key_bytes_in_memory and primary_key_bytes_in_memory_allocated?
According to the source
https://github.com/ClickHouse/ClickHouse/blob/229d35408b61a814dc1cb5a4cefcfa852efa13fe/src/Storages/System/StorageSystemParts.cpp#L181-L184
primary_key_bytes_in_memory is the size of primary.idx as loaded in memory.
primary_key_bytes_in_memory_allocated is slightly larger: while loading, primary.idx is split by columns, and the memory allocated during that split is a little bigger than the raw size.
Do they represent the portion of mark_bytes that are in memory in the mark cache?
No, they represent only the in-memory form of primary.idx for the selected part.
Are they related in any way with the MarkCacheBytes metric in the system.asynchronous_metrics table?
No, the fields above are not related to the mark cache. MarkCache-related metrics cover only <column_name>.mrk2 files loaded into memory, plus the CacheHit and CacheMiss counters for that cache.
Every record in primary.idx contains the primary key field values for one granule (by default, a granule is 8192 rows of raw data).
Every record in <column_name>.mrk2 contains the offset of the granule's beginning in the compressed file <column_name>.bin, the offset within the decompressed block, and the number of rows of <column_name> contained in the granule.
I hope this helps you figure it out:
the primary_key_bytes_in_memory* columns are for primary.idx;
the mark cache is for *.mrk files.
See https://clickhouse.com/docs/en/operations/server-configuration-parameters/settings#server-mark-cache-size
and
https://clickhouse.com/docs/en/guides/improving-query-performance/sparse-primary-indexes#a-table-with-a-primary-key
for details.
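To see the distinction in practice, here is a minimal sketch, assuming the clickhouse-driver Python package and a local server: the first query sums the primary.idx figures per table from system.parts, the second reads MarkCacheBytes, and the two measure different structures.

from clickhouse_driver import Client

client = Client('localhost')  # assumes the native protocol on the default port

# Per-part primary index footprint: this accounts for primary.idx only.
for table, pk_raw, pk_alloc in client.execute("""
    SELECT table,
           sum(primary_key_bytes_in_memory)           AS pk_raw,
           sum(primary_key_bytes_in_memory_allocated) AS pk_alloc
    FROM system.parts
    WHERE active
    GROUP BY table
"""):
    print(table, pk_raw, pk_alloc)

# Global mark cache usage: loaded <column_name>.mrk2 files, unrelated to primary.idx.
print(client.execute(
    "SELECT value FROM system.asynchronous_metrics WHERE metric = 'MarkCacheBytes'"))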

Hive partition scenario and how it impacts performance

I want to ask about the number of Hive partitions and how they impact performance.
Let me illustrate with a real example:
I have an external table expected to receive around 500M rows per day from multiple sources, and it has 5 partition columns.
For one day, that resulted in 250 partitions; with a 1-year retention that will reach around 75K partitions, which I suppose is a huge number. From what I checked, Hive can go up to 10K partitions, but after that performance degrades (and someone told me partitions should not exceed 1K per table).
Mainly, for the queries that will select from this table:
50% of them shall use the exact order of partitions;
25% shall use only 1-3 partition columns, not the other 2;
25% shall use only the 1st partition column.
So do you think this may work well even with 1-month retention? Or would only the start date be enough, assuming a normal distribution across the other 4 columns (say 500M rows / 250 partitions, giving 2M rows per partition)?
I would go with 3 partition columns, since that will a) exactly match ~50% of your query profiles, and b) substantially reduce (prune) the number of scanned partitions for the other 50%. At the same time, you won't be pressured to increase your Hive MetaStore (HMS) heap memory and beef up the HMS backend database to work efficiently with 250 x 364 = 91,000 partitions.
Since the 10K limit was introduced, significant efforts have been made to improve partition-related operations in HMS. See for example JIRA HIVE-13884, which provides the motivation to keep that number low and describes the way high numbers are being addressed:
The PartitionPruner requests either all partitions or partitions based on filter expression. In either scenarios, if the number of partitions accessed is large there can be significant memory pressure at the HMS server end.
... PartitionPruner [can] first fetch the partition names (instead of partition specs) and throw an exception if number of partitions exceeds the configured value. Otherwise, fetch the partition specs.
Note that partition specs (mentioned above) and the statistics gathered per partition (always recommended for efficient querying) are what constitute the bulk of the data HMS has to store and cache for good performance.
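As a sketch of the suggested layout, assuming the PyHive client and hypothetical table and column names: the table keeps the three partition columns that match the dominant query profile, and the remaining two attributes become ordinary columns.

from pyhive import hive

conn = hive.Connection(host='localhost', port=10000)  # hypothetical HiveServer2
cur = conn.cursor()

# Three partition columns instead of five; the other two become plain columns,
# so a year of data yields far fewer partitions for the HMS to track.
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS events (
        source_id STRING,
        sub_type  STRING,
        payload   STRING
    )
    PARTITIONED BY (event_date STRING, region STRING, category STRING)
    STORED AS ORC
    LOCATION '/data/events'
""")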

Is there record number limit in LevelDB?

Is there a maximum number of keys in LevelDB, or a practical key-count limit beyond which performance suffers (like in Kyoto Cabinet, where the number of records determines the number of buckets that must be set before the database is created; if the number of records exceeds that limit, the DB keeps working but loses performance)?
No limit that I know of. However, if the database gets very large, most of the data is stored in the last level (about 89%), and merging into it may become expensive (a lot of files in the last level will overlap with the to-be-merged data from the previous level).
Another thing: at, say, 40 GB you'll have 20,480 2 MB files in one folder, and your file system's performance may degrade with that many files.
To know for sure you need to experiment.
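As a starting point for such an experiment, here is a minimal sketch assuming the plyvel Python bindings and a hypothetical local path: it bulk-loads keys in batches and prints throughput, so you can watch whether write rates degrade as the database grows.

import time
import plyvel

db = plyvel.DB('/tmp/leveldb-bench', create_if_missing=True)  # hypothetical path
value = b'x' * 100

for batch_no in range(100):
    start = time.time()
    # Atomic batched writes; keys are zero-padded so they sort lexicographically.
    with db.write_batch() as wb:
        for i in range(100000):
            wb.put(('%03d-%08d' % (batch_no, i)).encode(), value)
    print('batch %d: %.0f puts/s' % (batch_no, 100000 / (time.time() - start)))

db.close()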

Cassandra Wide Vs Skinny Rows for large columns

I need to insert 60 GB of data into Cassandra per day.
This breaks down into
100 sets of keys
150,000 keys per set
4KB of data per key
In terms of write performance am I better off using
1 row per set with 150,000 keys per row
10 rows per set with 15,000 keys per row
100 rows per set with 1,500 keys per row
1000 rows per set with 150 keys per row
Another variable to consider: my data expires after 24 hours, so I am using TTL=86400 to automate expiration.
More specific details about my configuration:
CREATE TABLE stuff (
  stuff_id text,
  stuff_column text,
  value blob,
  PRIMARY KEY (stuff_id, stuff_column)
) WITH COMPACT STORAGE AND
  bloom_filter_fp_chance=0.100000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=39600 AND
  read_repair_chance=0.100000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  compaction={'tombstone_compaction_interval': '43200', 'class': 'LeveledCompactionStrategy'} AND
  compression={'sstable_compression': 'SnappyCompressor'};
Access pattern details:
The 4 KB value is a set of 1000 4-byte floats packed into a string.
A typical request is going to need a random selection of 20 - 60 of those floats.
Initially, those floats are all stored in the same logical row and column. A logical row here represents a set of data at a given time if it were all written to one row with 150,000 columns.
As time passes some of the data is updated, within a logical row within the set of columns, a random set of levels within the packed string will be updated. Instead of updating in place, the new levels are written to a new logical row combined with other new data to avoid rewriting all of the data which is still valid. This leads to fragmentation as multiple rows now need to be accessed to retrieve that set of 20 - 60 values. A request will now typically read from the same column across 1 - 5 different rows.
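For illustration, here is a minimal sketch of that fragmented read, assuming Pycassa and a hypothetical keyspace, row keys, and column names: the same column is fetched across the handful of logical rows that may now hold pieces of the value.

import pycassa

pool = pycassa.ConnectionPool('mykeyspace', ['localhost:9160'])  # hypothetical keyspace
cf = pycassa.ColumnFamily(pool, 'stuff')

# One request: the same column read across the 1-5 rows written at different times.
rows = cf.multiget(['set0-t0', 'set0-t1', 'set0-t2'], columns=['col000042'])
for key, cols in rows.items():
    print(key, len(cols['col000042']))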
Test Method
I wrote 5 samples of random data for each configuration and averaged the results. Rates were calculated as (bytes_written / (time * 10^6)). Time was measured in seconds with millisecond precision. Pycassa was used as the Cassandra interface, with the Pycassa batch insert operator. Each insert writes multiple columns to a single row; insert sizes are limited to 12 MB, and the queue is flushed at 12 MB or less. Sizes do not account for row and column overhead, just data. The data source and data sink are on the same network, on different systems.
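A minimal sketch of that write path, assuming Pycassa, a hypothetical keyspace, and the stuff table above (the 100-rows-per-set configuration is shown):

import pycassa

pool = pycassa.ConnectionPool('mykeyspace', ['localhost:9160'])  # hypothetical keyspace
cf = pycassa.ColumnFamily(pool, 'stuff')
value = '\x00' * 4096  # stand-in for the 4 KB packed-float payload

# Inserts are queued per batch and flushed in chunks; TTL=86400 expires the
# data after 24 hours, matching the retention requirement.
with cf.batch(queue_size=500) as b:
    for row in range(100):           # 100 rows per set
        for col in range(1500):      # 1,500 columns per row
            b.insert('set0-row%04d' % row, {'col%06d' % col: value}, ttl=86400)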
Write results
Keep in mind there are a number of other variables in play due to the complexity of the Cassandra configuration.
1 row 150,000 keys per row: 14 MBps
10 rows 15,000 keys per row: 15 MBps
100 rows 1,500 keys per row: 18 MBps
1000 rows 150 keys per row: 11 MBps
The answer depends on what your data retrieval pattern is, and how your data is logically grouped. Broadly, here is what I think:
Wide row (1 row per set): This could be the best solution as it prevents the request from hitting several nodes at once, and with secondary indexing or composite column names, you can quickly filter data to your needs. This is best if you need to access one set of data per request. However, doing too many multigets on wide rows can increase memory pressure on nodes, and degrade performance.
Skinny row (1000 rows per set): On the other hand, a wide row can give rise to read hotspots in the cluster. This is especially true if you need to make a high volume of requests for a subset of data that exists entirely in one wide row. In such a case, a skinny row will distribute your requests more uniformly throughout the cluster, and avoid hotspots. Also, in my experience, "skinnier" rows tend to behave better with multigets.
I would suggest, analyze your data access pattern, and finalize your data model based on that, rather than the other way around.
You'd be better off using 1 row per set with 150,000 columns per row. Using a TTL is a good idea for an automatic cleanup process.

Pytables time performance

I'm working on a project related to text detection in natural images. I have to train a classifier, and I'm using PyTables to store the information. I have:
62 classes (a-z,A-Z,0-9)
Each class has between 100 and 600 tables
Each table has a single column storing a 32-bit float
Each column has between 2^2 and 2^8 rows (depending on parameters)
My problem is that after I train the classifier, it takes a long time to read the information back during testing. For example: one database has 27,900 tables (62 classes * 450 tables per class) with 4 rows per table, and it took approximately 4 hours to read and retrieve all the information I need. The test program reads each table 390 times (for classes A-Z, a-z) and 150 times (for classes 0-9) to get all the info I need. Is that normal?
I tried using the index option on the single column, but I don't see any performance gain. I work in a virtual machine with 2 GB RAM on an HP Pavilion dv6 (4 GB DDR3 RAM, Core 2 Duo).
This is likely because column lookup on tables is one of the slower operations you can do, and this is where ALL of your information lives. You have two basic options to increase performance for tables with many columns and few rows:
Pivot this structure such that you have a Table with many rows and few columns.
Move to a more efficient data structure like a CArray or EArray for every row / column.
Additionally, you can try using compression to speed things up. This is sort of generic advice, because you haven't included any code.
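As a sketch of both suggestions combined, assuming PyTables 3.x and hypothetical file and array names: one compressed, extendable array per class replaces the hundreds of tiny one-column tables, so a class is read back in a single contiguous operation.

import numpy as np
import tables

filters = tables.Filters(complevel=5, complib='blosc')  # compression, as suggested

with tables.open_file('features.h5', mode='w') as f:
    for cls in ('A', 'B', 'C'):  # stand-in for the 62 classes
        # One extendable array per class instead of ~450 one-column tables.
        arr = f.create_earray(f.root, 'class_' + cls,
                              atom=tables.Float32Atom(),
                              shape=(0,), filters=filters)
        arr.append(np.random.rand(450 * 4).astype(np.float32))  # all samples at once

with tables.open_file('features.h5', mode='r') as f:
    feats = f.root.class_A[:]  # one contiguous read instead of 450 table scans
    print(feats.shape)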
