Table TTL on SummingMergeTree - ClickHouse

I have a table:
CREATE TABLE metric (
    cid UInt32,
    sid UInt32,
    sub String,
    cc UInt32,
    ic UInt32,
    cmc UInt32,
    acc UInt32,
    ts_update DateTime DEFAULT now()
) ENGINE = SummingMergeTree((cc, ic, cmc, acc))
PARTITION BY (cid, sid, sub)
ORDER BY tuple()
TTL ts_update + INTERVAL 5 MINUTE;
I am calling
INSERT INTO metric (cid, sid, sub, cc, ic, cmc, acc, ts_update)
VALUES (1000, 2000, 'test', 10, 1, 30, 40, now())
every 10 seconds for 5 minutes (the TTL interval).
At the end of the 5 minutes the entire row gets deleted, since the ts_update field is not updated on every insert into the SummingMergeTree.
All I want is this: if no row is inserted into the partition (cid, sid, sub) for 5 minutes, delete the row; but if any insertion is made, extend the TTL to the new ts_update + 5 minutes.
How can I achieve this?

CREATE TABLE metric (
    cid UInt32,
    sid UInt32,
    sub String,
    cc SimpleAggregateFunction(sum, UInt64),
    ic SimpleAggregateFunction(sum, UInt64),
    cmc SimpleAggregateFunction(sum, UInt64),
    acc SimpleAggregateFunction(sum, UInt64),
    ts_update SimpleAggregateFunction(max, DateTime) DEFAULT now()
) ENGINE = AggregatingMergeTree()
ORDER BY (cid, sid, sub)
TTL ts_update + INTERVAL 5 MINUTE;
Though it will not work: TTL removes old rows before the sums have been calculated. There is no solution.
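A possible workaround (my own sketch, not part of the answer above, so treat it as an assumption): drop the table-level TTL, keep ts_update as the max aggregate, enforce the 5-minute window at query time, and purge stale groups with a periodic mutation that tests the whole group rather than individual rows:
-- Read side: only groups touched within the last 5 minutes survive the filter.
SELECT cid, sid, sub,
    sum(cc) AS cc, sum(ic) AS ic, sum(cmc) AS cmc, sum(acc) AS acc
FROM metric
GROUP BY cid, sid, sub
HAVING max(ts_update) > now() - INTERVAL 5 MINUTE;
-- Cleanup, run periodically (scheduling is up to you; cron is just an example).
-- Deleting whole stale groups avoids the TTL problem: fresh rows of an
-- active group are never removed before they are merged.
ALTER TABLE metric DELETE WHERE (cid, sid, sub) IN
(
    SELECT cid, sid, sub
    FROM metric
    GROUP BY cid, sid, sub
    HAVING max(ts_update) < now() - INTERVAL 5 MINUTE
);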

Related

Is it possible to partition by aggregation column with engine AggregatingMergeTree()

I created a materialized view with ENGINE = AggregatingMergeTree(); inserting into `default`.tbl crashes with an exception.
CREATE MATERIALIZED VIEW default.tbl_view ON CLUSTER test
(
    id_key String,
    uid String,
    dt Date,
    data_timeStamp AggregateFunction(min, Date)
)
ENGINE = AggregatingMergeTree()
PARTITION BY dt
ORDER BY (id_key, uid)
AS SELECT id_key AS id_key,
    toDate(data_timeStamp) AS dt,
    uid AS uid,
    minState(toDate(data_timeStamp)) AS data_timeStamp
FROM `default`.tbl pe
GROUP BY id_key, uid
DB::Exception: Illegal type AggregateFunction(min, Date) of argument of function toDate: while pushing to view default.tbl_view (be181a81-ea4d-4118-9b0d-6fb31b48d93e). (ILLEGAL_TYPE_OF_ARGUMENT)
How can I create a view that aggregates data_timeStamp as AggregateFunction(min, Date), grouped by id_key, uid, and partitioned by data_timeStamp? (ClickHouse partitioning)
I tried to do it using SimpleAggregateFunction.
I created a table with the AggregatingMergeTree engine; data will be inserted into it through the materialized view:
CREATE TABLE `default`.tbl ON CLUSTER test
(
    key_id String,
    uid String,
    data_timeStamp AggregateFunction(min, Date),
    dt Date
)
ENGINE = AggregatingMergeTree
PARTITION BY dt
ORDER BY (key_id, uid);
CREATE MATERIALIZED VIEW `default`.tbl_view ON CLUSTER test
TO `default`.tbl
AS SELECT key_id AS key_id,
    toDate(data_timeStamp) AS dt,
    uid AS uid,
    minState(toDate(data_timeStamp)) AS data_timeStamp
FROM `default`.tb2 pe
GROUP BY key_id, uid
The error message you got:
DB::Exception: Illegal type AggregateFunction(min, Date) of argument of function toDate: while pushing to view default.tbl_view (be181a81-ea4d-4118-9b0d-6fb31b48d93e). (ILLEGAL_TYPE_OF_ARGUMENT)
is because your MV query is wrong. You are applying toDate to an AggregateFunction column in your FROM table; you must first merge the states (with -Merge) and only then use a -State combinator to push the data into another AggregateFunction column. You can use SimpleAggregateFunction to create the partitioning key.
CREATE TABLE `default`.tbl
(
    id_key String,
    uid String,
    data_timeStamp AggregateFunction(min, Date),
    dt Date
)
ENGINE = AggregatingMergeTree
PARTITION BY dt
ORDER BY (id_key, uid);
CREATE MATERIALIZED VIEW default.tbl_view
(
    id_key String,
    uid String,
    dt Date,
    partition SimpleAggregateFunction(min, Date),
    data_timeStamp_ AggregateFunction(min, Date)
)
ENGINE = AggregatingMergeTree()
PARTITION BY (dt, partition)
ORDER BY (id_key, uid)
AS
SELECT id_key, uid, dt, min(partition) AS partition, minState(data_timeStamp_) AS data_timeStamp_
FROM (
    SELECT id_key,
        toDate(minMerge(data_timeStamp)) AS dt,
        uid AS uid,
        minMerge(data_timeStamp)::date AS partition,
        minMerge(data_timeStamp)::date AS data_timeStamp_
    FROM `default`.tbl pe
    GROUP BY id_key, uid
) GROUP BY id_key, uid, dt;
INSERT INTO default.tbl SELECT
    CAST(number, 'String'),
    '1',
    minState(CAST(now(), 'date') - number),
    CAST(now(), 'date') - number
FROM numbers(5)
GROUP BY 1, 2, 4;
OPTIMIZE TABLE tbl_view FINAL;
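To read the aggregate column back out of the view, the -Merge combinator is needed; a quick sanity check (my addition, reusing the names from the DDL above):
SELECT id_key, uid, dt, minMerge(data_timeStamp_) AS min_date
FROM default.tbl_view
GROUP BY id_key, uid, dt;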

ClickHouse query takes a long time to execute with array joins and group by

I have a table student which has over 90 million records. The create table query is as follows:
CREATE TABLE student (
    id integer,
    student_id FixedString(15) NOT NULL,
    teacher_array Nested(
        teacher_id String,
        teacher_name String,
        teacher_role_id smallint
    ),
    subject_array Nested(
        subject_id String,
        subject_name String,
        subject_category_id smallint
    ),
    year integer NOT NULL
)
ENGINE = MergeTree()
PRIMARY KEY id
PARTITION BY year
ORDER BY id
SETTINGS index_granularity = 8192
The following query takes 5 seconds to execute:
SELECT count(distinct id) AS student_count,
    (
        SELECT count(distinct id)
        FROM student
        ARRAY JOIN teacher_array
        WHERE hasAny(subject_array.subject_category_id, [1, 2]) AND (teacher_array.teacher_role_id NOT IN (1))
    ) AS total_student_count,
    count(*) OVER () AS total_result_count,
    teacher_array.teacher_role_id AS teacher_id
FROM
(
    SELECT *
    FROM student
    ARRAY JOIN subject_array
)
ARRAY JOIN teacher_array
WHERE (subject_array.subject_category_id IN (1, 2)) AND (teacher_array.teacher_role_id NOT IN (1))
GROUP BY teacher_array.teacher_role_id
ORDER BY student_count DESC
LIMIT 0, 10
I expect the query to run within 500 milliseconds; is there any workaround for this? I tried using uniq and groupBitmap, but the execution time still comes out around 2 seconds.
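There is no accepted answer here, but one direction worth sketching (an assumption on my part, not a verified fix): the inner ARRAY JOIN on subject_array serves only as a filter, and the row duplication it causes cannot change count(distinct id), so it can be replaced with a hasAny predicate, leaving a single ARRAY JOIN for the main aggregation:
SELECT uniqExact(id) AS student_count,
    teacher_array.teacher_role_id AS teacher_role_id
FROM student
ARRAY JOIN teacher_array
WHERE hasAny(subject_array.subject_category_id, [1, 2])
    AND teacher_array.teacher_role_id NOT IN (1)
GROUP BY teacher_array.teacher_role_id
ORDER BY student_count DESC
LIMIT 10;
Here uniqExact is equivalent to count(distinct ...); the total_student_count and total_result_count columns are left out to keep the sketch focused on the expensive part.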

Create materialized view based on aggregate materialized view

The base table
CREATE TABLE IF NOT EXISTS test_sessions
(
    session_id UInt64,
    session_name String,
    created_at DateTime
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(created_at)
ORDER BY (session_id);
With the following data
INSERT INTO test_sessions (session_id, session_name, created_at) VALUES
(1, 'start', '2021-01-31 00:00:00'),
(1, 'stop', '2021-01-31 01:00:00'),
(2, 'start', '2021-01-31 01:00:00')
;
Created 2 materialized views to get closed sessions
CREATE MATERIALIZED VIEW IF NOT EXISTS test_session_aggregate_states
(
    session_id UInt64,
    started_at AggregateFunction(minIf, DateTime, UInt8),
    stopped_at AggregateFunction(maxIf, DateTime, UInt8)
)
ENGINE = AggregatingMergeTree
PARTITION BY tuple()
ORDER BY (session_id)
POPULATE AS
SELECT session_id,
    minIfState(created_at, session_name = 'start') AS started_at,
    maxIfState(created_at, session_name = 'stop') AS stopped_at
FROM test_sessions
GROUP BY session_id;
CREATE VIEW IF NOT EXISTS test_session_completed
(
    session_id UInt64,
    started_at DateTime,
    stopped_at DateTime
)
AS
SELECT session_id,
    minIfMerge(started_at) AS started_at,
    maxIfMerge(stopped_at) AS stopped_at
FROM test_session_aggregate_states
GROUP BY session_id
HAVING (started_at != '0000-00-00 00:00:00') AND
    (stopped_at != '0000-00-00 00:00:00');
It works normally and returns one row with the existing "start" and "stop":
SELECT * FROM test_session_completed;
-- 1,2021-01-31 00:00:00,2021-01-31 01:00:00
Trying to create a materialized view based on test_session_completed with joins to other tables (there are no joins in the example)
CREATE MATERIALIZED VIEW IF NOT EXISTS test_mv
(
    session_id UInt64
)
ENGINE = MergeTree
PARTITION BY tuple()
ORDER BY (session_id)
POPULATE AS
SELECT session_id
FROM test_session_completed;
Writing test queries to test test_mv:
INSERT INTO test_sessions (session_id, session_name, created_at) VALUES
(3, 'start', '2021-01-31 02:00:00'),
(3, 'stop', '2021-01-31 03:00:00');
SELECT * FROM test_session_completed;
-- SUCCESS
-- 3,2021-01-31 02:00:00,2021-01-31 03:00:00
-- 1,2021-01-31 00:00:00,2021-01-31 01:00:00
SELECT * FROM test_mv;
-- FAILURE
-- 1
-- EXPECTED RESULT
-- 3
-- 1
How to fill test_mv based on test_session_completed ?
ClickHouse version: 20.11.4.13
It is impossible to create an MV over a view.
An MV is an insert trigger, and it's impossible to reach the "completed" state without having the "started" state in the same table. If you don't need to check that "start" happened before "stop", you can make a simpler MV and just check for "stop".
You don't need minIfState; you can use min (SimpleAggregateFunction). It will reduce the stored data and improve performance.
I think the second MV is excessive.
Check this:
https://den-crane.github.io/Everything_you_should_know_about_materialized_views_commented.pdf
https://youtu.be/ckChUkC3Pns?list=PLO3lfQbpDVI-hyw4MyqxEk3rDHw95SzxJ&t=9371
I would do this:
CREATE TABLE IF NOT EXISTS test_sessions
(
    session_id UInt64,
    session_name String,
    created_at DateTime
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(created_at)
ORDER BY (session_id);
CREATE MATERIALIZED VIEW IF NOT EXISTS test_session_aggregate_states
(
    session_id UInt64,
    started_at SimpleAggregateFunction(min, DateTime),
    stopped_at SimpleAggregateFunction(max, DateTime)
)
ENGINE = AggregatingMergeTree
PARTITION BY tuple()
ORDER BY (session_id)
POPULATE AS
SELECT session_id,
    minIf(created_at, session_name = 'start') AS started_at,
    maxIf(created_at, session_name = 'stop') AS stopped_at
FROM test_sessions
GROUP BY session_id;
INSERT INTO test_sessions (session_id, session_name, created_at) VALUES
(3, 'start', '2021-01-31 02:00:00'),
(3, 'stop', '2021-01-31 03:00:00');
Completed sessions:
SELECT session_id,
    min(started_at) AS started_at,
    max(stopped_at) AS stopped_at
FROM test_session_aggregate_states
GROUP BY session_id
HAVING (started_at != '0000-00-00 00:00:00') AND
    (stopped_at != '0000-00-00 00:00:00');
┌─session_id─┬──────────started_at─┬──────────stopped_at─┐
│ 1 │ 2021-01-31 00:00:00 │ 2021-01-31 01:00:00 │
└────────────┴─────────────────────┴─────────────────────┘
And using argMaxState you can aggregate multiple start/stop pairs within one session_id.
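A sketch of that idea (my reading of the hint, not the author's code): argMax over the event time yields the latest event per session, so completed sessions are exactly those whose last event is 'stop':
SELECT session_id,
    argMax(session_name, created_at) AS last_event,
    min(created_at) AS started_at,
    max(created_at) AS stopped_at
FROM test_sessions
GROUP BY session_id
HAVING last_event = 'stop';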

Unable to use sub-partitioned column in query plan (oracle)

I have a 20 GB table which, for some requirement, has to be range partitioned on the DATE1 field and list sub-partitioned on the DATE2 field.
I created a virtual column (VC) on that table to extract the numeric month value from the DATE2 field, and use this VC as the sub-partition key.
Per the requirement, we'll have 30 partitions on DATE1, each of which will have 12 sub-partitions on VC.
The max size of any sub-partition can be up to 5 GB.
N.B. I could not implement multi-column partitioning because our in-built partition manager does not support it.
Also, I could not implement RANGE-RANGE partitioning/sub-partitioning because the two date fields (DATE1 and DATE2) are not in sync with each other, which caused INSERT operations to fail.
Next, I have a simple view created on top of this table. All date fields, including VC, are exposed in this view. While querying SELECT * FROM vw; the plan shows PARTITION RANGE SINGLE, as expected.
Now, I have a web front end through which I can click on DATE2 to open some more details. It basically passes DATE2 as a filter to query another table and displays a huge number of records (approx. 3 million).
The PROBLEM is that, on clicking the DATE2 field, I'm not able to hit the sub-partition, since it is based on the MONTH value (VC) and not the date.
Thus, I want PARTITION LIST SINGLE instead of the current PARTITION LIST ALL in the plan.
The QUESTION is how to write a SELECT query that targets the sub-partition. I know I have to use the VC in the filter to achieve it, but the irony is that the web application cannot pass VC to the backend, especially when we display only DATE values (not VC).
Also, if we CANNOT hit the sub-partition, is there any way to improve performance using INDEXES or PARALLELISM?
Please help.
--***********************************************
--TABLE CREATION STATEMENT
--***********************************************
CREATE TABLE M_DTX
(
    R_ID NUMBER(3),
    R_AMT NUMBER(5),
    DATE1 DATE,
    DATE2 DATE,
    VC NUMBER(2) GENERATED ALWAYS AS (EXTRACT(MONTH FROM DATE2))
)
PARTITION BY RANGE (DATE1)
SUBPARTITION BY LIST (VC)
SUBPARTITION TEMPLATE (
    SUBPARTITION M1 VALUES (1),
    SUBPARTITION M2 VALUES (2),
    SUBPARTITION M3 VALUES (3),
    SUBPARTITION M4 VALUES (4),
    SUBPARTITION M5 VALUES (5),
    SUBPARTITION M6 VALUES (6),
    SUBPARTITION M7 VALUES (7),
    SUBPARTITION M8 VALUES (8),
    SUBPARTITION M9 VALUES (9),
    SUBPARTITION M10 VALUES (10),
    SUBPARTITION M11 VALUES (11),
    SUBPARTITION M12 VALUES (12)
    TABLESPACE M_DATA
)
(
    PARTITION M_DTX_2015060100
    VALUES LESS THAN (
        TO_DATE(' 2015-06-01 00:00:01', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN')
    ) SEGMENT CREATION DEFERRED
    PCTFREE 10 PCTUSED 40 INITRANS 1 MAXTRANS 255 NOCOMPRESS NOLOGGING
    STORAGE( INITIAL 1048576 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
        PCTINCREASE 0 BUFFER_POOL DEFAULT FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT
    ) TABLESPACE M_DATA);
--******************************************
--VIEW ON TOP OF M_DTX:
--******************************************
CREATE OR REPLACE VIEW v_dtx AS
SELECT r_id, TRUNC(date2) date2_dd, vc, SUM(r_amt) amt
FROM m_dtx WHERE date1 = TRUNC(sysdate)
GROUP BY r_id, TRUNC(date2), vc;
--******************************************
--QUERY FIRED FROM WEB-APPLICATION (AFTER CLICKING ON date2_DD):
--******************************************
SELECT * FROM m_dtx WHERE date1 = trunc(sysdate) AND date2 = ''date2_dd'';
--this is where it bypasses the sub-partition, as I could not substitute the month or VC ...
The only way I found to hit the subpartition is to write the query this way:
SELECT *
FROM m_dtx
WHERE date1 = trunc(sysdate)
AND date2 = ''date2_dd''
and vc = EXTRACT(MONTH FROM ''date2_dd'');
So, you don't need to get the vc from the previous screen, because you have the date2.
I've tested with:
SELECT *
FROM m_dtx
WHERE date1 = trunc(sysdate)
AND date2 = to_date('10-dec-2014','dd-mon-yyyy')
and vc = EXTRACT(MONTH FROM to_date('10-dec-2014','dd-mon-yyyy'));
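As for the fallback question about indexes (my own suggestion, not part of the original answer): when sub-partition pruning is impossible, a LOCAL index on the two date columns lets Oracle probe each scanned sub-partition instead of fully reading it:
-- Hypothetical index name; LOCAL creates one index segment per (sub)partition,
-- so partition maintenance on M_DTX does not invalidate the whole index.
CREATE INDEX m_dtx_d1_d2_ix ON m_dtx (date1, date2) LOCAL;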

Oracle Partition Interval by day - wrong high value?

I created a table with the following partition interval:
CREATE TABLE pos_data_two (
    start_date TIMESTAMP,
    store_id NUMBER,
    inventory_id NUMBER(6),
    qty_sold NUMBER(3)
)
PARTITION BY RANGE (start_date)
INTERVAL (NUMTODSINTERVAL(1, 'DAY'))
(
    PARTITION pos_data_p2 VALUES LESS THAN (TO_DATE('30.10.2013', 'DD.MM.YYYY'))
);
When I insert a row with the timestamp value
'31.10.2013 00:00:00'
the high value of the newly created partition is:
TIMESTAMP' 2013-11-01 00:00:00'
Is that correct? Shouldn't it be 2013-10-31 00:00:00?
(Disclaimer: I'm just guessing here)
You're partitioning by days, so values for a given date fall into the same partition.
The row you're inserting has a start_date that's exactly at midnight, so Oracle has to decide whether to put it onto the previous day or onto the next day.
Apparently, Oracle is using the rule
lower_bound <= value < upper_bound
to decide which interval a value should go into, so your value
2013-10-31 00:00:00
goes into the interval
[2013-10-31 00:00:00, 2013-11-01 00:00:00)
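You can confirm this in the data dictionary: the stored HIGH_VALUE of each partition is the exclusive upper bound, so the partition holding 2013-10-31 rows shows 2013-11-01 as its high value:
SELECT partition_name, high_value
FROM user_tab_partitions
WHERE table_name = 'POS_DATA_TWO'
ORDER BY partition_position;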
