How do I extract or calculate the partition key size?
nodetool cfhistograms reports the Max partition size along with the cell count, but doesn't say anything about which key(s) are large.
Is there a way to extract the largest partition key, or at least get a listing of all partition keys with their sizes?
How to calculate partition key size: https://shermandigital.com/blog/calculate-the-size-of-a-cassandra-table/
Cassandra stores at least 15 bytes worth of metadata for each column. Counter columns require an additional eight bytes of overhead, as do expiring columns (columns with the time-to-live value set). In addition to metadata, we need space for the name of each column and the value stored within it.
And then:
Every partition key requires 23 bytes of metadata.
So you calculate the size of each column (including partition key columns) with:
column_size = column_metadata + column_name_value + column_value
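As a rough illustration, the formula above can be turned into a few lines of Python. This is only a sketch: the constants come from the quoted article, while the function names, the example column, and the way the 23 bytes of key metadata are added on top of the key columns are assumptions made for the example.

```python
COLUMN_METADATA = 15         # bytes of metadata per column (from the quoted article)
EXTRA_OVERHEAD = 8           # extra bytes for counter or expiring (TTL) columns
PARTITION_KEY_METADATA = 23  # bytes of metadata per partition key


def column_size(name, value_bytes, counter_or_ttl=False):
    """column_size = column_metadata + column_name_value + column_value"""
    metadata = COLUMN_METADATA + (EXTRA_OVERHEAD if counter_or_ttl else 0)
    return metadata + len(name.encode("utf-8")) + value_bytes


def partition_key_size(key_columns):
    """key_columns: list of (name, value_bytes) pairs for the partition key columns.

    Assumption: the key metadata is simply added to the sizes of the key columns.
    """
    return PARTITION_KEY_METADATA + sum(column_size(n, v) for n, v in key_columns)


# Hypothetical key: a single column "user_id" holding a 16-byte UUID.
example = partition_key_size([("user_id", 16)])
print(example)
```

This only estimates the size from the schema; it does not tell you which keys in a live table are the large ones.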
I have billions of rows in a table
CREATE TABLE sample ( PN String,
CHROM String,
POS UInt32)
ENGINE = MergeTree
PARTITION BY PN
ORDER BY (CHROM, POS)
SETTINGS index_granularity = 8192;
Each PN has about 5 million rows.
I want to return all rows in order of CHROM, POS:
select * from sample order by CHROM, POS
This runs out of memory.
Since the data is already stored in CHROM, POS order, albeit partitioned by PN, is there a way to 'stream' the data from all partitions and merge them in order without needing much memory?
Data is sorted only within the scope of each partition. Therefore, to sort ALL ROWS from all partitions, the whole content has to be loaded into memory and then sorted.
Partitioning with PARTITION BY CHROM or PARTITION BY (CHROM, POS) instead will work much better.
As an alternative, you can enable external sorting (see max_bytes_before_external_sort) so that pre-sorted data is collected on disk instead of in memory.
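For reference, the "stream and merge" idea from the question is a k-way merge, which can be sketched on the client side in Python. This is not how ClickHouse executes the query; it only assumes that each partition can be read as a separate stream already sorted by (CHROM, POS), and the sample rows and function name are made up for the example.

```python
import heapq


def merge_sorted_partitions(partitions):
    """Lazily merge iterables that are each already sorted by (CHROM, POS)."""
    # Rows are (PN, CHROM, POS) tuples; the merge key is (CHROM, POS).
    return heapq.merge(*partitions, key=lambda row: (row[1], row[2]))


# Hypothetical sample data: two PN partitions, each sorted by (CHROM, POS).
pn_a = [("A", "chr1", 10), ("A", "chr1", 50), ("A", "chr2", 5)]
pn_b = [("B", "chr1", 20), ("B", "chr2", 1), ("B", "chr2", 9)]

merged = list(merge_sorted_partitions([iter(pn_a), iter(pn_b)]))
for row in merged:
    print(row)
```

Because heapq.merge only holds one pending row per input stream, memory use grows with the number of partitions rather than with the number of rows.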
I have multiple partitions on my table as below.
Partition- Day_20190509 with high value of 20190510
Partition- Day_20190520 with high value of 20190521
Partition- Day_99999999 with MAXVALUE as high.
I want to create three new partitions, Day_20190510, Day_20190513 and Day_20190514, with high values of 20190513, 20190514 and 20190520 respectively.
I believe this can be done using SPLIT PARTITION, but I could not understand how to create partitions in between. Can someone assist with the query for this?
I tried using the partition split option but could not work out what my range part and new partitions should be:
ALTER TABLE table_name SPLIT PARTITION partition_name
AT (range_part_value)
INTO
(
PARTITION new_part1
[TABLESPACE tablespace_name],
PARTITION new_part2
[TABLESPACE tablespace_name]
);
The high values you listed for the new partitions (20190513, 20190514, 20190520) all fall inside the current partition Day_20190520, which holds values from 20190510 up to, but not including, 20190521.
So the current partition Day_20190520 must be split as follows. Each VALUES LESS THAN bound is that partition's high value, and the last partition takes no bound because it keeps the original high value of 20190521:
ALTER TABLE table_name SPLIT PARTITION Day_20190520 INTO
(PARTITION Day_20190510 VALUES LESS THAN (20190513),
PARTITION Day_20190513 VALUES LESS THAN (20190514),
PARTITION Day_20190514 VALUES LESS THAN (20190520),
PARTITION Day_20190520_1);
Hope this will solve your problem.
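To sanity-check where a given value would land for a set of high values, here is a small Python sketch. It models the layout the question asks for, treating each high value as an exclusive upper bound, which is how VALUES LESS THAN works; the function name and the list structure are just for illustration.

```python
from bisect import bisect_right

# (high value, partition name) in ascending order, as requested in the question.
PARTITIONS = [
    (20190510, "Day_20190509"),
    (20190513, "Day_20190510"),
    (20190514, "Day_20190513"),
    (20190520, "Day_20190514"),
    (20190521, "Day_20190520"),
]
HIGH_VALUES = [high for high, _ in PARTITIONS]


def partition_for(value):
    """Return the partition whose range contains value (high value is exclusive)."""
    idx = bisect_right(HIGH_VALUES, value)
    if idx == len(PARTITIONS):
        return "Day_99999999"  # the MAXVALUE partition catches everything else
    return PARTITIONS[idx][1]


for v in (20190512, 20190513, 20190519, 20190520, 20190601):
    print(v, "->", partition_for(v))
```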
I have a table with data for 2017 and 2018, and I need to create monthly partitions on it.
So I created a non-partitioned table and loaded all the data from the original table. Now I am converting the new table to a monthly partitioned table.
When I run the ALTER, I get this error:
ORA-14300: partitioning key maps to a partition outside maximum permitted number of partitions
My script is:
ALTER TABLE ORDERHDR_PART MODIFY
PARTITION BY RANGE (LASTUPDATE) INTERVAL(NUMTOYMINTERVAL(1, 'MONTH'))
(
PARTITION ORDERHDR_PART_JAN VALUES less than (TO_DATE('01-02-2018','DD-MM-YYYY')),
PARTITION ORDERHDR_PART_FEB VALUES less than (TO_DATE('01-03-2018','DD-MM-YYYY')),
PARTITION ORDERHDR_PART_MAR VALUES less than (TO_DATE('01-04-2018','DD-MM-YYYY')),
PARTITION ORDERHDR_PART_APR VALUES less than (TO_DATE('01-05-2018','DD-MM-YYYY')),
PARTITION ORDERHDR_PART_MAY VALUES less than (TO_DATE('01-06-2018','DD-MM-YYYY')),
PARTITION ORDERHDR_PART_JUN VALUES less than (TO_DATE('01-07-2018','DD-MM-YYYY')),
PARTITION ORDERHDR_PART_JUL VALUES less than (TO_DATE('01-08-2018','DD-MM-YYYY')),
PARTITION ORDERHDR_PART_AUG VALUES less than (TO_DATE('01-09-2018','DD-MM-YYYY')),
PARTITION ORDERHDR_PART_SEP VALUES less than (TO_DATE('01-10-2018','DD-MM-YYYY')),
PARTITION ORDERHDR_PART_OCT VALUES less than (TO_DATE('01-11-2018','DD-MM-YYYY')),
PARTITION ORDERHDR_PART_NOV VALUES less than (TO_DATE('01-12-2018','DD-MM-YYYY')),
PARTITION ORDERHDR_PART_DEC VALUES less than (TO_DATE('01-01-2019','DD-MM-YYYY'))
)ONLINE;
I think your approach is wrong.
First create a partitioned table, e.g.
CREATE TABLE ORDERHDR_PART (....)
PARTITION BY RANGE (LASTUPDATE) INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
(
PARTITION ORDERHDR_INITIAL VALUES less than (DATE '2000-01-01')
);
Then transfer the existing data to the new table.
One option is a simple INSERT INTO ORDERHDR_PART SELECT * FROM ORDERHDR_2017;
Oracle will create the monthly partitions automatically based on the LASTUPDATE value.
With this method you temporarily duplicate your data, and you may face a performance issue.
The other method is to use Exchanging Partitions; it should look like this:
ALTER TABLE ORDERHDR_PART
EXCHANGE PARTITION FOR (DATE '2017-01-01')
WITH TABLE ORDERHDR_2017
INCLUDING INDEXES;
I don't know whether "PARTITION FOR (DATE '2017-01-01')" is created automatically; perhaps you have to run INSERT INTO ORDERHDR_PART (LASTUPDATE) VALUES (DATE '2017-01-01'); ROLLBACK; first in order to create it.
You will get one partition for all months; afterwards you can split that partition with Splitting into Multiple Partitions. It should look like this:
ALTER TABLE ORDERHDR_PART SPLIT PARTITION FOR (DATE '2017-01-01') INTO (
PARTITION ORDERHDR_PART_JAN VALUES less than (TO_DATE('01-02-2018','DD-MM-YYYY')),
PARTITION ORDERHDR_PART_FEB VALUES less than (TO_DATE('01-03-2018','DD-MM-YYYY')),
PARTITION ORDERHDR_PART_MAR VALUES less than (TO_DATE('01-04-2018','DD-MM-YYYY')),
PARTITION ORDERHDR_PART_APR VALUES less than (TO_DATE('01-05-2018','DD-MM-YYYY')),
PARTITION ORDERHDR_PART_MAY VALUES less than (TO_DATE('01-06-2018','DD-MM-YYYY')),
PARTITION ORDERHDR_PART_JUN VALUES less than (TO_DATE('01-07-2018','DD-MM-YYYY')),
PARTITION ORDERHDR_PART_JUL VALUES less than (TO_DATE('01-08-2018','DD-MM-YYYY')),
PARTITION ORDERHDR_PART_AUG VALUES less than (TO_DATE('01-09-2018','DD-MM-YYYY')),
PARTITION ORDERHDR_PART_SEP VALUES less than (TO_DATE('01-10-2018','DD-MM-YYYY')),
PARTITION ORDERHDR_PART_OCT VALUES less than (TO_DATE('01-11-2018','DD-MM-YYYY')),
PARTITION ORDERHDR_PART_NOV VALUES less than (TO_DATE('01-12-2018','DD-MM-YYYY')),
PARTITION ORDERHDR_PART_DEC VALUES less than (TO_DATE('01-01-2019','DD-MM-YYYY'))
);
Note that by default you cannot drop the initial range partition of an interval-partitioned table. If you face this problem, execute:
ALTER TABLE ORDERHDR_PART SET INTERVAL ();
ALTER TABLE ORDERHDR_PART DROP PARTITION ORDERHDR_INITIAL;
ALTER TABLE ORDERHDR_PART SET INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'));
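To picture how a monthly interval maps LASTUPDATE values to partitions, here is a minimal Python sketch. It assumes the interval starts on the first day of a month (as with the DATE '2000-01-01' bound above) and only computes the month range a date falls into; it does not talk to Oracle, and the function name and sample dates are made up.

```python
from datetime import date


def monthly_partition_bounds(d):
    """Return (lower_inclusive, upper_exclusive) of the month containing d."""
    lower = date(d.year, d.month, 1)
    if d.month == 12:
        upper = date(d.year + 1, 1, 1)
    else:
        upper = date(d.year, d.month + 1, 1)
    return lower, upper


# Hypothetical LASTUPDATE values.
for d in (date(2017, 1, 15), date(2017, 12, 31), date(2018, 6, 1)):
    lower, upper = monthly_partition_bounds(d)
    print(d, "->", lower, "<= LASTUPDATE <", upper)
```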
I would like to run a procedure that merges table partitions that match certain criteria.
As an example, table1 is range partitioned by date and has 5 partitions.
Partitions = empire1, empire2, rebels1, rebels2, yoda1.
Table DESC:
INVOICE_NO    NOT NULL NUMBER
INVOICE_DATE  NOT NULL DATE
COMMENTS               VARCHAR2(500)
It is partitioned by INVOICE_DATE as follows:
PARTITION REBELS1 VALUES LESS THAN (TO_DATE('01-JAN-2014','DD-MON-YYYY')),
PARTITION REBELS2 VALUES LESS THAN (TO_DATE('01-JAN-2015','DD-MON-YYYY')),
PARTITION EMPIRE1 VALUES LESS THAN (TO_DATE('01-JAN-2016','DD-MON-YYYY')),
PARTITION EMPIRE2 VALUES LESS THAN (TO_DATE('01-JAN-2017','DD-MON-YYYY')),
PARTITION YODA VALUES LESS THAN (TO_DATE('01-JAN-2018','DD-MON-YYYY')),
I need to grab all partitions named rebel% and yoda% and merge them into one new partition called 'jawa'.
In the end only 3 partitions would exist, empire1, empire2 and jawa.
So I have a table A and a table B, where the data in table A was inserted from table B.
Essentially table A is the same as table B; the only difference is that table A has a date_partition column, which table B does not have.
The table B schema is:
ID int
school_bg_dt string
log_on_count int
active_count int
The table A schema is:
ID int
school_bg_dt bigint
log_on_count int
active_count int
date_partition string
Here is my query for inserting table B into table A, which fails with an error I couldn't figure out:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE A PARTITION(date_partition=school_bg_dt)
SELECT ID, cast(school_bg_dt as BIGINT), log_on_count, active_count FROM B;
However, I got an error saying the input is not recognized near date_partition.
I'm not sure what to do here, please help.
The design is to make each school_bg_dt key a partition, since each key has a lot of data of its own.
From the Hive documentation on dynamic partition inserts:
In the dynamic partition inserts, users can give partial partition specifications, which means just specifying the list of partition column names in the PARTITION clause. The column values are optional. If a partition column value is given, we call this a static partition, otherwise it is a dynamic partition. Each dynamic partition column has a corresponding input column from the select statement. This means that the dynamic partition creation is determined by the value of the input column. The dynamic partition columns must be specified last among the columns in the SELECT statement and in the same order in which they appear in the PARTITION() clause.
So, try:
FROM B
INSERT OVERWRITE TABLE A PARTITION(date_partition)
SELECT ID, cast(school_bg_dt as BIGINT), log_on_count, active_count, school_bg_dt as date_partition;
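The rule quoted above (dynamic partition columns come last in the SELECT, and their values decide the partition) can be mimicked in a few lines of Python. This is only a toy model of the routing, not Hive code; the function name and the sample rows are invented for the example.

```python
from collections import defaultdict


def dynamic_partition_insert(rows, num_partition_cols=1):
    """Group rows by their trailing partition column(s).

    The last num_partition_cols values of each row pick the partition;
    the remaining values are the data stored in that partition.
    """
    partitions = defaultdict(list)
    for row in rows:
        data, part = row[:-num_partition_cols], row[-num_partition_cols:]
        partitions[part].append(data)
    return dict(partitions)


# Hypothetical rows shaped like the SELECT above:
# (ID, school_bg_dt as BIGINT, log_on_count, active_count, date_partition)
rows = [
    (1, 20190901, 3, 1, "20190901"),
    (2, 20190901, 5, 2, "20190901"),
    (3, 20200115, 7, 4, "20200115"),
]
result = dynamic_partition_insert(rows)
for part, data in result.items():
    print(part, data)
```

Each distinct value in the last column becomes its own partition, which is why the number of distinct school_bg_dt values matters for the limits mentioned in this answer.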
Also, note that if you're creating many partitions, you may need to raise the following conf settings:
hive.exec.max.dynamic.partitions.pernode - Maximum number of dynamic partitions allowed to be created in each mapper/reducer node (default = 100)
hive.exec.max.dynamic.partitions - Maximum number of dynamic partitions allowed to be created in total (default = 1000)
hive.exec.max.created.files - Maximum number of HDFS files created by all mappers/reducers in a MapReduce job (default = 100000)