I have two tables with the following schemas:
table: run1
id string
week_date string
metric double
table: run2
id string
metric double
week_date string
statistic int
Partition Information
col_name data_type comment
week_date string
statistic int
I would like to group the data into equal-sized buckets for each week_date and then write the contents to a new table that is partitioned by week_date as well as by statistic (the statistic is simply the bucket ID).
I find that the query partitions the results correctly; however, the contents within a partition are not sorted.
Below are the query I am using, the input data, and the output from one of the partitions.
Query:
insert overwrite table run2 partition(week_date, statistic)
select id, metric, week_date,
       ntile(3) over (partition by week_date order by metric) as statistic
from run1
distribute by week_date
sort by metric desc;
Input:
B0001 2015-01-08 200.0
B0002 2015-01-08 200.0
B0003 2015-01-08 800.0
B0004 2015-01-08 600.0
B0005 2015-01-08 5400.0
B0006 2015-01-08 1100.0
B0007 2015-01-08 100.0
B0008 2015-01-08 300.0
Output of Partition: week_date=2015-01-08/statistic=2
B0003^A800.0
B0008^A300.0
B0004^A600.0
I was expecting the contents to be sorted by the metric value; however, they are not. If I do not insert the results into another table and just run a simple SELECT, I do see that the contents are indeed sorted. Is there something special that needs to be done when performing inserts?
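One approach that is sometimes suggested (a sketch only, under the assumption that dynamic partitioning is enabled and an extra shuffle is acceptable): compute the bucket in a subquery, then distribute by both output partition columns and sort so that each reducer writes its partition's rows already ordered by metric. If hive.optimize.sort.dynamic.partition is enabled, Hive can re-sort rows by the partition columns before writing, which would discard a trailing SORT BY, so that setting may also be worth checking.
-- Sketch: assign buckets first, then shuffle/sort on the final partition columns plus metric.
insert overwrite table run2 partition(week_date, statistic)
select id, metric, week_date, statistic
from (
  select id, metric, week_date,
         ntile(3) over (partition by week_date order by metric) as statistic
  from run1
) t
distribute by week_date, statistic
sort by week_date, statistic, metric desc;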
I created a table and two chained materialized views (the second reads from the first).
Table:
CREATE TABLE `log_details` (
date String,
event_time DateTime,
username String,
city String)
ENGINE = MergeTree()
ORDER BY (date, event_time)
PARTITION BY date TTL event_time + INTERVAL 1 MONTH
Materialized views:
CREATE MATERIALIZED VIEW `log_u_c_day_mv`
ENGINE = SummingMergeTree()
PARTITION BY date
ORDER BY (date, username, city)
AS
SELECT date, username, city, count() as times
FROM `log_details`
GROUP BY date, username, city
CREATE MATERIALIZED VIEW `log_u_day_mv`
ENGINE = SummingMergeTree()
PARTITION BY date
ORDER BY (date, username)
AS
SELECT date, username, SUM(times) as total_times
FROM `.inner.log_u_c_day_mv`
GROUP BY date, username
Insert into log_details → Insert into log_u_c_day_mv → Insert into log_u_day_mv.
log_u_day_mv is not optimized 15 minutes after inserting into log_u_c_day_mv, nor even after more than a day.
I tried to optimize log_u_day_mv manually and it works.
OPTIMIZE TABLE `.inner.log_u_day_mv` PARTITION 20210110
But ClickHouse does not optimize it in a timely manner.
How to solve it?
Data in a MergeTree table is never guaranteed to be fully aggregated/collapsed.
If you run OPTIMIZE ... FINAL, the next INSERT creates a new part anyway.
ClickHouse does not merge parts on a time schedule. The merge scheduler selects parts using its own algorithm, based on the current node workload, the number of parts, and the size of parts.
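To see what the merge scheduler has (or has not) done yet, you can inspect the active parts per partition in system.parts; the sketch below assumes the default .inner. naming for the view's storage table, as used in the OPTIMIZE statement above.
-- Count active parts per partition for the MV's inner table (system.parts is a standard system table).
SELECT partition, count() AS part_count
FROM system.parts
WHERE database = currentDatabase()
  AND table = '.inner.log_u_day_mv'
  AND active
GROUP BY partition
ORDER BY partition;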
A SummingMergeTree table MUST ALWAYS be queried with sum() / GROUP BY, for example:
select sum(total_times), username
from log_u_day_mv
group by username
Do NOT use FROM log_u_day_mv FINAL; it reads excessive columns.
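A common convenience, not part of the answer above and with a purely illustrative view name, is to wrap that aggregation in a plain view so the sum()/GROUP BY cannot be forgotten:
-- Hypothetical wrapper view; it always re-aggregates the SummingMergeTree data on read.
CREATE VIEW log_u_day AS
SELECT date, username, sum(total_times) AS total_times_sum
FROM log_u_day_mv
GROUP BY date, username;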
I want to know what's the best practice for the partition key.
In my project, we have a table with event_date, app_id and other columns. The number of distinct app_id values keeps growing and could reach thousands.
The select query is based on event_date and app_id.
The simple data schema is as below:
CREATE TABLE test.test_custom_partition (
    company_id UInt64,
    app_id String,
    event_date DateTime,
    event_name String
) ENGINE = MergeTree()
PARTITION BY (toYYYYMMDD(event_date), app_id)
ORDER BY (app_id, company_id, event_date)
SETTINGS index_granularity = 8192;
The SELECT query looks like this:
select event_name from test_custom_partition
where event_date >= '2020-07-01 00:00:00' AND event_date <= '2020-07-15 00:00:00'
AND app_id = 'test';
I want to use (toYYYYMMDD(event_date), app_id) as the partition key, so that the query reads the minimal number of data parts. But it could result in more than 1000 partitions; in the documentation I see:
A merge only works for data parts that have the same value for the
partitioning expression. This means you shouldn't make overly granular
partitions (more than about a thousand partitions). Otherwise, the
SELECT query performs poorly because of an unreasonably large number
of files in the file system and open file descriptors.
Or should I use the partition key only toYYYYMMDD(event_date)?
Also, could anyone explain why there shouldn't be more than about 1000 partitions? Even if the query only uses a small set of the data parts, could it still cause a performance issue?
Thanks
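For comparison, here is a minimal sketch of the date-only alternative asked about above (the table name is illustrative; the columns are the same, and app_id stays first in ORDER BY so the primary key index can still narrow the scan to one app within each daily partition):
-- Sketch: partition only by day and rely on the ORDER BY prefix for app_id pruning.
CREATE TABLE test.test_custom_partition_by_day (
    company_id UInt64,
    app_id String,
    event_date DateTime,
    event_name String
) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(event_date)
ORDER BY (app_id, company_id, event_date)
SETTINGS index_granularity = 8192;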
I have a dataset with the below hierarchy in Hive; the size of the dataset is in TBs.
-Country
-Year
-In_stock
-Zone
-trans_dt
I need to sort trans_dt in ascending order within each Zone (one of the zones contributes 80% of the dataset). What is the most efficient way to sort this dataset? I have 10 countries in total, and each country holds around 100 zones.
Once trans_dt is sorted I need to perform operations on collect_sets, hence the sorting is very important in my case.
I tried partitioning the dataset by Year and then applied CLUSTER BY on trans_dt, but it seems the sorting is not working as expected.
CREATE TABLE TEST.TABLENAME (
  COUNTRY INT,
  ZONE STRING,
  TRANS_DT STRING,
  -- ... more columns ...
)
PARTITIONED BY (YEAR STRING, IN_STOCK INT)
STORED AS ORC;

INSERT OVERWRITE TABLE TEST.TABLENAME PARTITION (YEAR, IN_STOCK)
SELECT
  COUNTRY, ZONE, TRANS_DT,
  -- ... more columns, with the dynamic partition columns last ...
  YEAR, IN_STOCK
FROM SOURCETABLE
CLUSTER BY trans_dt;
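A variant that is sometimes suggested for per-group ordering (a sketch, not tested on this data): distribute by zone and sort by zone plus trans_dt, so each reducer writes its zones' rows contiguously and in trans_dt order.
-- Sketch: co-locate each zone on one reducer and sort its rows by trans_dt (ascending).
INSERT OVERWRITE TABLE TEST.TABLENAME PARTITION (YEAR, IN_STOCK)
SELECT
  COUNTRY, ZONE, TRANS_DT,
  -- ... other columns, with the dynamic partition columns last ...
  YEAR, IN_STOCK
FROM SOURCETABLE
DISTRIBUTE BY ZONE
SORT BY ZONE, TRANS_DT;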
SELECT
  a.country,
  a.zone,
  collect_set(a.prod_id) AS prod_seq,
  count(*) AS cnt
FROM SOURCETABLE a
WHERE a.in_stock = 1
GROUP BY
  a.country,
  a.zone;
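If the end goal is a per-zone sequence ordered by trans_dt, a pattern often used instead of relying on table order (a sketch; it assumes a Hive version whose sort_array can sort arrays of structs) is to collect structs and sort the array, since struct comparison starts with the first field:
-- Sketch: collect (trans_dt, prod_id) pairs and sort each group's array by trans_dt.
SELECT
  a.country,
  a.zone,
  sort_array(collect_list(struct(a.trans_dt, a.prod_id))) AS prod_seq_by_dt,
  count(*) AS cnt
FROM SOURCETABLE a
WHERE a.in_stock = 1
GROUP BY
  a.country,
  a.zone;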
So I have a table A and a table B, where table A's data was inserted from table B.
Essentially table A is the same as table B; the only difference is that table A has a date_partition column that table B does not have.
the table A schema is as such:
ID int
school_bg_dt string
log_on_count int
active_count int
table B schema is:
ID int
school_bg_dt bigint
log_on_count int
active_count int
date_partition string
Here is my query for inserting from table B into table A, which has an error I couldn't figure out:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE A PARTITION(date_partition=school_bg_dt)
SELECT ID, cast(school_bg_dt as BIGINT), log_on_count, active_count FROM table B;
However, I got an error saying the input does not recognize the operation near date_partition.
Not sure what to do here; please help.
The design is to make each school_bg_dt value its own partition, as there is a lot of unique data for each key.
From here:
In the dynamic partition inserts, users can give partial partition specifications, which means just specifying the list of partition column names in the PARTITION clause. The column values are optional. If a partition column value is given, we call this a static partition, otherwise it is a dynamic partition. Each dynamic partition column has a corresponding input column from the select statement. This means that the dynamic partition creation is determined by the value of the input column. The dynamic partition columns must be specified last among the columns in the SELECT statement and in the same order in which they appear in the PARTITION() clause.
So, try:
FROM B
INSERT OVERWRITE TABLE A PARTITION(date_partition)
SELECT ID, cast(school_bg_dt as BIGINT), log_on_count, active_count, school_bg_dt as date_partition;
Also, note that if you're creating many partitions, you should update the following conf settings:
hive.exec.max.dynamic.partitions.pernode - Maximum number of dynamic
partitions allowed to be created in each mapper/reducer node (default = 100)
hive.exec.max.dynamic.partitions - Maximum number of dynamic
partitions allowed to be created in total (default = 1000)
hive.exec.max.created.files - Maximum number of HDFS files created by all mappers/reducers in a MapReduce job (default = 100000)
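For example, they can be raised in the session before running the insert; the values below are purely illustrative and should be sized to the real number of partitions and files:
-- Illustrative values only; adjust to your actual partition and file counts.
set hive.exec.max.dynamic.partitions.pernode=1000;
set hive.exec.max.dynamic.partitions=10000;
set hive.exec.max.created.files=500000;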
I'm currently learning SQLite (called by Python).
According to my previous question (Reorganising Data in SQLLIte), I want to store multiple time series (Training data) in my database.
I have defined the following tables:
CREATE TABLE VARLIST
(
VarID INTEGER PRIMARY KEY AUTOINCREMENT,
name TEXT UNIQUE NOT NULL
)
CREATE TABLE DATAPOINTS
(
DataID INTEGER PRIMARY KEY,
timeID INTEGER,
VarID INTEGER,
value REAL
)
CREATE TABLE TIMESTAMPS
(
timeID INTEGER PRIMARY KEY AUTOINCREMENT,
TRAININGS_ID INT,
TRAINING_TIME_SECONDS FLOAT
)
VARLIST has 8 entries, TIMESTAMPS 1e5 entries and DATAPOINTS around 5e6.
When I now want to extract data for a given TRAININGS_ID and VarID, I try the following:
SELECT
(SELECT TIMESTAMPS.TRAINING_TIME_SECONDS
FROM TIMESTAMPS
WHERE t.timeID = timeID) AS TRAINING_TIME_SECONDS,
(SELECT value
FROM DATAPOINTS
WHERE DATAPOINTS.timeID = t.timeID and DATAPOINTS.VarID = 2) as value
FROM
(SELECT timeID
FROM TIMESTAMPS
WHERE TRAININGS_ID = 96) as t;
The command EXPLAIN QUERY PLAN delivers:
0|0|0|SCAN TABLE TIMESTAMPS
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 1
1|0|0|SEARCH TABLE TIMESTAMPS USING INTEGER PRIMARY KEY (rowid=?)
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 2
2|0|0|SCAN TABLE DATAPOINTS
This basically works.
But there are two problems:
Minor problem: if there is a timeID for which no data for the requested VarID is available, I get a line with the value None.
I would prefer this line to be skipped.
Big problem: the search is incredibly slow (approx 5 minutes using http://sqlitebrowser.org/).
How do I best improve the performance?
Are there better ways to formulate the SELECT command, or should I modify the database structure itself?
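One possible reformulation (a sketch, not benchmarked here) expresses the same lookup as a join; the inner join on DATAPOINTS also drops the rows for which no value exists for the requested VarID, which addresses the minor problem:
-- Sketch: same result via a join; timestamps without a matching DATAPOINTS row are skipped.
SELECT ts.TRAINING_TIME_SECONDS, dp.value
FROM TIMESTAMPS AS ts
JOIN DATAPOINTS AS dp
  ON dp.timeID = ts.timeID
 AND dp.VarID = 2
WHERE ts.TRAININGS_ID = 96;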
OK, based on the hints I got, I could greatly accelerate the search by applying indexes:
CREATE INDEX IF NOT EXISTS DP_Index on DATAPOINTS (VarID,timeID,DataID);
CREATE INDEX IF NOT EXISTS TS_Index on TIMESTAMPS(TRAININGS_ID,timeID);
The EXPLAIN QUERY PLAN output now reads as:
0|0|0|SEARCH TABLE TIMESTAMPS USING COVERING INDEX TS_Index (TRAININGS_ID=?)
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 1
1|0|0|SEARCH TABLE TIMESTAMPS USING INTEGER PRIMARY KEY (rowid=?)
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 2
2|0|0|SEARCH TABLE DATAPOINTS USING INDEX DP_Index (VarID=? AND timeID=?)
Thanks for your comments.