get latest data from hive table with multiple partition columns - performance

I have a Hive table with the below structure:
ID string,
Value string,
year int,
month int,
day int,
hour int,
minute int
This table is refreshed every 15 minutes and is partitioned by the year/month/day/hour/minute columns. Sample partitions are shown below:
year=2019/month=12/day=29/hour=19/minute=15
year=2019/month=12/day=30/hour=00/minute=45
year=2019/month=12/day=30/hour=08/minute=45
year=2019/month=12/day=30/hour=09/minute=30
year=2019/month=12/day=30/hour=09/minute=45
I want to select only the latest partition's data from the table. I tried using max() on those partition columns, but it's not very efficient because the data size is huge.
How can I get the data efficiently using Hive SQL?

If the latest partition is always in the current date, then you can filter on the current-date partition and use rank() to find the records with the latest hour and minute:
select * --list columns here
from
(
    select s.*,
           rank() over (order by s.hour desc, s.minute desc) rnk
    from your_table s
    where s.year = year(current_date)          --filter current day (better to pass pre-calculated variables if possible)
      and s.month = lpad(month(current_date), 2, '0')
      and s.day   = lpad(day(current_date), 2, '0')
      -- and s.hour = lpad(hour(current_timestamp), 2, '0')  --consider also adding this
) s
where rnk = 1 --latest hour, minute
And if the latest partition is not necessarily in the current date, then you can use rank() over (order by s.year desc, s.month desc, s.day desc, s.hour desc, s.minute desc); without a filter on date this will scan the whole table and is not efficient.
It will perform best if you can calculate the partition filters in the shell and pass them as parameters; see the comments in the code and the sketch below.
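A minimal sketch of that shell-variable approach, using Hive's --hiveconf variable substitution (the yr/mon/dy variable names are illustrative):
-- invoke as, for example:
--   hive --hiveconf yr=$(date +%Y) --hiveconf mon=$(date +%m) --hiveconf dy=$(date +%d) -f latest_partition.hql
select * --list columns here
from your_table s
where s.year  = ${hiveconf:yr}
  and s.month = '${hiveconf:mon}'
  and s.day   = '${hiveconf:dy}'
Because the substituted values are plain literals at parse time, Hive can prune partitions directly instead of evaluating date functions against every partition.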


The ambiguity w.r.t date field in Dim_time

I have come across a fact table fact_trips, composed of columns like:
driver_id,
vehicle_id,
date (int, in the form 'YYYYMMDD'),
timestamp (in milliseconds, bigint),
miles,
time_of_trip
I have another table, dim_time, composed of columns like:
date (int, in the form 'YYYYMMDD'),
timestamp (in milliseconds, bigint),
month,
year,
day_of_week,
day
Now when I want to see trips grouped by year, I have to join the two tables on timestamp (bigint) and then group by year from dim_time.
Why the hell do we keep date in int form then? Because ultimately I have to join on timestamp. What needs to be changed?
Also, dim_time does not have a primary key, so there are multiple entries for the same date. So when I join the tables, I get more rows back than expected.
You should have 2 Dim tables:
DIM_DATE: PK = YYYYMMDD
DIM_TIME: PK = number. It will hold as many records as there are milliseconds in a day (assuming you are holding time at the millisecond grain rather than second, minute, etc.).
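A minimal sketch of that design (table and column names are illustrative, not from the original schema):
CREATE TABLE dim_date (
    date_key    INT PRIMARY KEY,  -- YYYYMMDD, e.g. 20191230
    year        INT,
    month       INT,
    day         INT,
    day_of_week INT
);
CREATE TABLE dim_time (
    time_key INT PRIMARY KEY,  -- milliseconds since midnight, 0 .. 86399999
    hour     INT,
    minute   INT,
    second   INT
);
-- trips per year now needs only the compact date key, not the bigint timestamp:
SELECT d.year, COUNT(*) AS trips
FROM fact_trips f
JOIN dim_date d ON d.date_key = f.date
GROUP BY d.year;
This is also why keeping the int YYYYMMDD date makes sense: it joins directly to DIM_DATE for calendar rollups, and the millisecond timestamp only matters when you need time-of-day detail from DIM_TIME.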

How to decide the partition key for ClickHouse

I want to know the best practice for choosing the partition key.
In my project, we have a table with event_date, app_id and other columns. The number of app_id values keeps growing and could reach thousands.
The select query filters on event_date and app_id.
The simple data schema is as below:
CREATE TABLE test.test_custom_partition (
    company_id UInt64,
    app_id     String,
    event_date DateTime,
    event_name String
) ENGINE = MergeTree()
PARTITION BY (toYYYYMMDD(event_date), app_id)
ORDER BY (app_id, company_id, event_date)
SETTINGS index_granularity = 8192;
The select query looks like this:
SELECT event_name FROM test_custom_partition
WHERE event_date >= '2020-07-01 00:00:00' AND event_date <= '2020-07-15 00:00:00'
  AND app_id = 'test';
I want to use (toYYYYMMDD(event_date), app_id) as the partition key, so queries read the minimal set of data parts. But it could produce more than 1000 partitions; from the documentation I see:
A merge only works for data parts that have the same value for the
partitioning expression. This means you shouldn't make overly granular
partitions (more than about a thousand partitions). Otherwise, the
SELECT query performs poorly because of an unreasonably large number
of files in the file system and open file descriptors.
Or should I use only toYYYYMMDD(event_date) as the partition key?
Also, could anyone explain why there shouldn't be more than about 1000 partitions? Even if the query only touches a small set of the data parts, can it still cause performance issues?
Thanks
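For reference, the coarser, date-only variant the question considers would look like this; only PARTITION BY changes, and app_id stays first in ORDER BY, so the primary index can still narrow reads within each daily partition:
CREATE TABLE test.test_custom_partition (
    company_id UInt64,
    app_id     String,
    event_date DateTime,
    event_name String
) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(event_date)
ORDER BY (app_id, company_id, event_date)
SETTINGS index_granularity = 8192;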

Delete data based on the count & timestamp using PL/SQL

I'm new to PL/SQL programming and come from a DBA background. I have a requirement to delete data from both a main table and a reference table, following the logic below. We need to delete about 30M rows from the tables, reducing the data based on the State_ID column.
The following conditions need to be considered:
1. As per the sample data given below (main table), sort the data by timestamp in descending order, keep the first 2 rows for each state_id, and delete the rest of the data from both tables based on the state_id column.
2. select state_id, count(*) from maintable group by state_id having count(*) > 2;
So if state_id=1 has 5 rows, then 3 rows have to be deleted, leaving the first 2 rows for state_id=1; repeat for the other state_id values.
The same matching data should be deleted from the reference table as well.
Could someone please help me with this? Thanks.
(Sample data screenshot: main table)
You should be able to do each table's delete as a single SQL statement. Anything else would essentially force row-by-row processing, which is the last thing you want for that much data. Something like this:
delete from main_table m
 where m.row_id not in (
       -- rows to keep: the 2 most recent per state_id
       -- (the row_number alias can only be filtered one query level up)
       select row_id
         from (select row_id,
                      row_number() over (partition by state_id
                                         order by time_stamp desc) id_row_number
                 from main_table)
        where id_row_number < 3)
or
delete from main_table m
 where m.row_id in (
       -- rows to delete: everything beyond the 2 most recent per state_id
       select row_id
         from (select row_id,
                      row_number() over (partition by state_id
                                         order by time_stamp desc) id_row_number
                 from main_table)
        where id_row_number > 2)
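The reference table can be cleaned the same way. A hedged sketch, assuming reference_table carries a main_row_id foreign key to main_table.row_id (both names hypothetical); run it before the main-table delete so the ranking subquery still sees all rows:
delete from reference_table r
 where r.main_row_id in (
       -- same ranking as above: identify main-table rows beyond the 2 most recent per state_id
       select row_id
         from (select row_id,
                      row_number() over (partition by state_id
                                         order by time_stamp desc) id_row_number
                 from main_table)
        where id_row_number > 2);
If the tables instead match only on state_id plus timestamp, join on those columns in the subquery rather than on row_id.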

FILTER WHERE at count in ClickHouse

I'm trying to migrate one of my Postgres tables to ClickHouse. Here is what I came up with in ClickHouse:
CREATE TABLE loads (
    country_id UInt16,
    partner_id UInt32,
    is_unique  UInt8,
    ip         String,
    created_at DateTime
) ENGINE = MergeTree PARTITION BY toYYYYMM(created_at) ORDER BY (created_at);
is_unique here is a boolean stored as 0 or 1. I want to know the count for the aggregates country_id, partner_id and created_at, but I also want to know how many of these loads are unique. In Postgres it looks like:
SELECT
count(*) AS loads,
count(*) FILTER (WHERE is_unique) AS uniq,
country_id,
partner_id,
created_at::date AS ts
FROM loads
GROUP BY ts, country_id, partner_id
Is this possible in ClickHouse, or should I rethink how to aggregate the data? I didn't find any clues in the manual except that count can take an expression instead of an asterisk, but count(is_unique = 1) doesn't work and just returns the same amount as count(*).
I found an answer minutes after posting:
SELECT count(*), countIf(is_unique = 1) /* .. */
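To spell it out, the full ClickHouse equivalent of the Postgres query would look something like this (using toDate to truncate created_at, the analogue of the ::date cast):
SELECT
    count() AS loads,
    countIf(is_unique = 1) AS uniq,
    country_id,
    partner_id,
    toDate(created_at) AS ts
FROM loads
GROUP BY ts, country_id, partner_id;
countIf is ClickHouse's -If combinator applied to count; a bare count(expr) counts rows where the expression is non-NULL, and is_unique = 1 is never NULL, which is why it returned the same value as count(*).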
Good luck.

How sorting works on Partitioned skewed data set in Hive

I have a dataset with the below hierarchy in Hive; the dataset size is in TBs.
-Country
-Year
-In_stock
-Zone
-trans_dt
I need to sort trans_dt in ascending order within each Zone (one of the zones contributes 80% of the dataset). What is the most efficient way to sort this dataset? I have 10 countries in total and each country holds around 100 zones.
Once trans_dt is sorted I need to run collect_set operations, hence the sorting is very important in my case.
I tried partitioning the dataset by Year and then applied CLUSTER BY on trans_dt, but it seems the sorting is not working as expected.
CREATE TABLE TEST.TABLENAME (
    COUNTRY INT,
    ZONE STRING,
    TRANS_DT STRING
    -- ... more columns ...
)
PARTITIONED BY (YEAR STRING, IN_STOCK INT)
STORED AS ORC;

INSERT OVERWRITE TABLE TEST.TABLENAME PARTITION (YEAR, IN_STOCK)
SELECT
    -- ... column list, with YEAR and IN_STOCK as the last two columns ...
FROM SOURCETABLE
CLUSTER BY trans_dt;
SELECT
    a.country,
    a.zone,
    collect_set(a.prod_id) AS prod_seq,
    count(*) AS cnt
FROM SOURCETABLE a
WHERE a.IN_STOCK = 1
GROUP BY
    a.country,
    a.zone;
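Worth noting as a hedged aside: Hive's collect_set gives no ordering guarantee for its elements, so relying on the table's physical sort order is fragile. A common workaround is to collect (trans_dt, prod_id) structs with collect_list and sort the array explicitly; sort_array compares struct fields left to right, so trans_dt drives the ordering:
SELECT
    a.country,
    a.zone,
    -- array of structs sorted by trans_dt, giving an ordered product sequence
    sort_array(collect_list(named_struct('dt', a.trans_dt, 'prod', a.prod_id))) AS prod_seq,
    count(*) AS cnt
FROM SOURCETABLE a
WHERE a.IN_STOCK = 1
GROUP BY
    a.country,
    a.zone;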
