How sorting works on a partitioned, skewed data set in Hive

I have a dataset in Hive with the hierarchy below; the dataset size is in the TBs.
-Country
-Year
-In_stock
-Zone
-trans_dt
I need to sort trans_dt in ascending order within Zone (one of the zones contributes 80% of the dataset). What is the most efficient way to sort this dataset? I have 10 countries in total and each country holds around 100 zones.
Once trans_dt is sorted I need to perform operations on collect_set, hence the sort order is very important in my case.
I tried partitioning the dataset by year and then applying CLUSTER BY on trans_dt, but the sorting does not work as expected.
CREATE TABLE TEST.TABLENAME (
    COUNTRY INT,
    ZONE STRING,
    TRANS_DT STRING
    -- more columns
    -- YEAR and IN_STOCK are partition columns, so they are not repeated in the column list
)
PARTITIONED BY (YEAR STRING, IN_STOCK INT)
STORED AS ORC;

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE TEST.TABLENAME PARTITION (YEAR, IN_STOCK)
SELECT
    -- regular columns first, then YEAR and IN_STOCK last (dynamic partition columns)
FROM SOURCETABLE
CLUSTER BY trans_dt;
SELECT
    a.country,
    a.zone,
    collect_set(a.prod_id) AS prod_seq,
    count(*) AS cnt
FROM SOURCETABLE a
WHERE a.in_stock = 1
GROUP BY
    a.country,
    a.zone;
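No answer is included for this question above, so the following is only a sketch. Since the requirement is an ascending trans_dt order within each zone rather than a global order, one common Hive pattern is DISTRIBUTE BY plus SORT BY; the column names are the ones from the question, and the skewed zone will still end up on a single reducer:
-- Sketch, not from the original post: keep all rows of a zone on one reducer
-- and sort them by trans_dt there, instead of CLUSTER BY trans_dt alone.
INSERT OVERWRITE TABLE TEST.TABLENAME PARTITION (YEAR, IN_STOCK)
SELECT
    country,
    zone,
    trans_dt,        -- other non-partition columns go here
    year,            -- dynamic partition columns must come last
    in_stock
FROM SOURCETABLE
DISTRIBUTE BY zone          -- rows of the same zone go to the same reducer
SORT BY zone, trans_dt;     -- ascending trans_dt within each zone
Because one zone holds roughly 80% of the data, that reducer remains the bottleneck; building the ordered sequence inside the query itself (for example sort_array over collect_list of structs) is another commonly used workaround, but it is not shown in the question.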

Related

ClickHouse: Materialized view is not optimized in time to merge the partitions

I created a table and two materialized views, chained one on top of the other.
Table:
CREATE TABLE `log_details` (
date String,
event_time DateTime,
username String,
city String)
ENGINE = MergeTree()
ORDER BY (date, event_time)
PARTITION BY date TTL event_time + INTERVAL 1 MONTH
Materialized views:
CREATE MATERIALIZED VIEW `log_u_c_day_mv`
ENGINE = SummingMergeTree()
PARTITION BY date
ORDER BY (date, username, city)
AS
SELECT date, username, city, count() as times
FROM `log_details`
GROUP BY date, username, city
CREATE MATERIALIZED VIEW `log_u_day_mv`
ENGINE = SummingMergeTree()
PARTITION BY date
ORDER BY (date, username)
AS
SELECT date, username, SUM(times) as total_times
FROM `.inner.log_u_c_day_mv`
GROUP BY date, username
Insert into log_details → Insert into log_u_c_day_mv → Insert into log_u_day_mv.
log_u_day_mv is not optimized within 15 minutes of inserting into log_u_c_day_mv, even after more than a day.
I tried to optimize log_u_day_mv manually and it works.
OPTIMIZE TABLE `.inner.log_u_day_mv` PARTITION 20210110
But ClickHouse does not optimize it on its own in a timely manner.
How to solve it?
Data in a MergeTree table is never guaranteed to be fully aggregated/collapsed.
If you do OPTIMIZE ... FINAL, the next INSERT creates a new part anyway.
ClickHouse does not merge parts on a time schedule. The merge scheduler selects parts with its own algorithm, based on the current node workload, the number of parts, and the size of parts.
A SummingMergeTree table must therefore always be queried with sum() / GROUP BY:
select sum(times), username
from log_u_day_mv
group by username
Do not use FROM log_u_day_mv FINAL; it reads excessive columns.
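Not part of the original answer: to see what the merge scheduler has or has not done, one option is to inspect the active parts of the inner table (the `.inner.` name below is the one used in the question):
-- Sketch: list the active (not yet merged away) parts of the MV's inner table.
SELECT partition, name, rows
FROM system.parts
WHERE table = '.inner.log_u_day_mv'
  AND active
ORDER BY partition, name;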

How to decide the partition key for clickhouse

I want to know what's the best practice for the partition key.
In my project, we have a table with event_date, app_id and other columns. The number of distinct app_id values will keep growing and could reach the thousands.
The select query is based on event_date and app_id.
The simple data schema is as below:
CREATE TABLE test.test_custom_partition (
    company_id UInt64,
    app_id String,
    event_date DateTime,
    event_name String
)
ENGINE = MergeTree()
PARTITION BY (toYYYYMMDD(event_date), app_id)
ORDER BY (app_id, company_id, event_date)
SETTINGS index_granularity = 8192;
The SELECT query is like the one below:
select event_name from test_custom_partition
where event_date >= '2020-07-01 00:00:00' AND event_date <= '2020-07-15 00:00:00'
AND app_id = 'test';
I want to use (toYYYYMMDD(event_date), app_id) as the partition key, so the query reads the minimal number of data parts. But it could result in more than 1000 partitions; from the documentation I see:
A merge only works for data parts that have the same value for the partitioning expression. This means you shouldn't make overly granular partitions (more than about a thousand partitions). Otherwise, the SELECT query performs poorly because of an unreasonably large number of files in the file system and open file descriptors.
Or should I use only toYYYYMMDD(event_date) as the partition key?
Also, could anyone explain why there shouldn't be more than about 1000 partitions? Even if a query only touches a small set of the data parts, could it still cause a performance issue?
Thanks
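No answer is included above. As a sketch only, the layout most often recommended for this access pattern keeps the partition key coarse and date-based and lets the ORDER BY key do the per-app pruning; the names are the ones from the question:
-- Sketch: monthly partitions keep the total partition count small;
-- app_id first in the sorting key lets the primary index skip other apps' data.
CREATE TABLE test.test_custom_partition (
    company_id UInt64,
    app_id String,
    event_date DateTime,
    event_name String
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (app_id, event_date, company_id)
SETTINGS index_granularity = 8192;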

get latest data from hive table with multiple partition columns

I have a hive table with below structure
ID string,
Value string,
year int,
month int,
day int,
hour int,
minute int
This table is refreshed every 15 minutes and is partitioned by the year/month/day/hour/minute columns. Sample partitions are shown below.
year=2019/month=12/day=29/hour=19/minute=15
year=2019/month=12/day=30/hour=00/minute=45
year=2019/month=12/day=30/hour=08/minute=45
year=2019/month=12/day=30/hour=09/minute=30
year=2019/month=12/day=30/hour=09/minute=45
I want to select only the latest partition's data from the table. I tried using max() on those partition columns, but it is not very efficient as the data size is huge.
Please let me know how I can get the data in a convenient way using Hive SQL.
If the latest partition always falls on the current date, then you can filter the current-date partition and use rank() to find the records with the latest hour and minute:
select * --list columns here
from
(
select s.*, rank() over(order by hour desc, minute desc) rnk
from your_table s
where s.year=year(current_date) --filter current day (better pass variables calculated if possible)
and s.month=lpad(month(current_date),2,'0')
and s.day=lpad(day(current_date),2,'0')
-- and s.hour=lpad(hour(current_timestamp),2,'0') --consider also adding this
) s
where rnk=1 --latest hour, minute
And if the latest partition does not necessarily fall on current_date, then you can use rank() over (order by s.year desc, s.month desc, s.day desc, s.hour desc, s.minute desc); without a filter on the date this will scan the whole table and is not efficient.
It will perform best if you can calculate the partition filters in the shell and pass them as parameters. See the comments in the code; a sketch of the unfiltered variant follows.
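As code, the unfiltered variant described above looks roughly like this (it ranks across all partition columns and scans the whole table):
select * --list columns here
from
(
select s.*, rank() over(order by s.year desc, s.month desc, s.day desc, s.hour desc, s.minute desc) rnk
from your_table s
) s
where rnk=1 --rows from the latest year/month/day/hour/minute partition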

Oracle Merge Partition (based on partition values)

I would like to run a procedure that merges table partitions that match certain criteria.
As an example, table1 is range-partitioned by date and has 5 partitions.
Partitions: empire1, empire2, rebels1, rebels2, yoda1.
Table DESC:
INVOICE_NO NOT NULL NUMBER
INVOICE_DATE NOT NULL DATE
COMMENTS VARCHAR2(500)
It is partitioned by INVOICE_DATE as follows:
PARTITION REBELS1 VALUES LESS THAN (TO_DATE('01-JAN-2014','DD-MON-YYYY')),
PARTITION REBELS2 VALUES LESS THAN (TO_DATE('01-JAN-2015','DD-MON-YYYY')),
PARTITION EMPIRE1 VALUES LESS THAN (TO_DATE('01-JAN-2016','DD-MON-YYYY')),
PARTITION EMPIRE2 VALUES LESS THAN (TO_DATE('01-JAN-2017','DD-MON-YYYY')),
PARTITION YODA VALUES LESS THAN (TO_DATE('01-JAN-2018','DD-MON-YYYY')),
I need to grab all partitions named rebel% and yoda% and merge them into one new partition called 'jawa'.
In the end only 3 partitions would exist: empire1, empire2, and jawa.
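No answer is included above. As a sketch only: for a range-partitioned table, Oracle's ALTER TABLE ... MERGE PARTITIONS can only merge adjacent partitions, so REBELS1 and REBELS2 can be merged directly, while folding YODA in as well would also pull in the EMPIRE partitions that sit between them. The partition names below are the ones from the question; JAWA is the new partition the question asks for:
-- Sketch: merge the two adjacent range partitions into a new partition JAWA.
ALTER TABLE table1
  MERGE PARTITIONS rebels1, rebels2
  INTO PARTITION jawa;
Selecting the partitions to merge by name pattern (rebel%, yoda%) would typically be done by querying USER_TAB_PARTITIONS and building such statements dynamically in PL/SQL.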

Partitioning a table with one value in one partition and the rest in another partition

For example, I have a table named Emp and it has empname, designation, and salary as columns. I would like this table to have 2 partitions: employees who are managers in one partition and the rest (engineer, peon, clerk) in the other.
Can someone help with how to create it?
In this case you will have to use LIST-based partitioning. Create a partition where ROLE = 'MANAGER' and create another partition which is the default. Here is an example which will help you.
Exclude values from oracle partition
Example
CREATE TABLE EMPLOYEE (EMP_ID VARCHAR2(25),
EMP_NAME VARCHAR2(250),
ROLE VARCHAR2(100)
)
PARTITION BY LIST (ROLE)
(
PARTITION part_managers
VALUES ('MANAGER'),
PARTITION part_others
VALUES (DEFAULT)
);
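Mapped onto the question's own table (Emp with empname, designation, salary), a sketch along the same lines would be:
-- Sketch: list-partition the question's Emp table on designation.
CREATE TABLE EMP (
  EMPNAME     VARCHAR2(250),
  DESIGNATION VARCHAR2(100),
  SALARY      NUMBER
)
PARTITION BY LIST (DESIGNATION)
(
  PARTITION part_managers VALUES ('MANAGER'),
  PARTITION part_others   VALUES (DEFAULT)
);
Rows whose designation is not 'MANAGER' (engineer, peon, clerk) fall into the DEFAULT partition.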
Please refer to the following URL and example:
For example, the following SQL statement splits the sales_Q4_2007 partition of the range-partitioned table sales into five partitions corresponding to the quarters of the next year. In this example, the partition sales_Q4_2008 implicitly becomes the high bound of the split partition.
ALTER TABLE sales SPLIT PARTITION sales_Q4_2007 INTO
( PARTITION sales_Q4_2007 VALUES LESS THAN (TO_DATE('01-JAN-2008','dd-MON-yyyy')),
PARTITION sales_Q1_2008 VALUES LESS THAN (TO_DATE('01-APR-2008','dd-MON-yyyy')),
PARTITION sales_Q2_2008 VALUES LESS THAN (TO_DATE('01-JUL-2008','dd-MON-yyyy')),
PARTITION sales_Q3_2008 VALUES LESS THAN (TO_DATE('01-OCT-2008','dd-MON-yyyy')),
PARTITION sales_Q4_2008);
For the sample table customers partitioned by list, the following statement splits the partition Europe into three partitions.
ALTER TABLE list_customers SPLIT PARTITION Europe INTO
(PARTITION western_europe VALUES ('GERMANY', 'FRANCE'),
PARTITION southern_europe VALUES ('ITALY'),
PARTITION rest_europe);
https://docs.oracle.com/database/121/VLDBG/GUID-01C14320-0D7B-48BE-A5AD-003DDA761277.htm
You will get some idea about this.
