Is it possible to partition by aggregation column with engine AggregatingMergeTree() - clickhouse

I created a materialized view with ENGINE = AggregatingMergeTree(). Inserting into `default`.tbl fails with an exception.
CREATE MATERIALIZED VIEW default.tbl_view ON CLUSTER test
(
id_key String,
uid String,
dt Date,
data_timeStamp AggregateFunction(min, Date)
)
ENGINE = AggregatingMergeTree()
PARTITION BY dt
ORDER BY (id_key , uid)
AS SELECT id_key as id_key,
toDate(data_timeStamp) as dt,
uid as uid,
minState(toDate(data_timeStamp)) as data_timeStamp
FROM `default`.tbl pe
GROUP BY id_key, uid
DB::Exception: Illegal type AggregateFunction(min, Date) of argument of function toDate: while pushing to view default.tbl_view (be181a81-ea4d-4118-9b0d-6fb31b48d93e). (ILLEGAL_TYPE_OF_ARGUMENT)
How can I create a view that aggregates data_timeStamp as AggregateFunction(min, Date), grouped by id_key and uid, and partitioned by data_timeStamp? (ClickHouse partitioning)
I tried to do it using SimpleAggregateFunction.
I created a table with the AggregatingMergeTree engine and then insert data into it through a materialized view:
CREATE TABLE `default`.tbl ON CLUSTER test
(
key_id String,
uid String,
data_timeStamp AggregateFunction(min, Date),
dt Date
)
Engine = AggregatingMergeTree
PARTITION BY dt
ORDER BY (key_id, uid)
CREATE MATERIALIZED VIEW `default`.tbl_view ON CLUSTER test
TO `default`.tbl
AS SELECT key_id as key_id,
toDate(data_timeStamp) as dt,
uid as uid,
minState(toDate(data_timeStamp)) as data_timeStamp
FROM `default`.tb2 pe
GROUP BY key_id, uid

The error message you got,
DB::Exception: Illegal type AggregateFunction(min, Date) of argument of function toDate: while pushing to view default.tbl_view (be181a81-ea4d-4118-9b0d-6fb31b48d93e). (ILLEGAL_TYPE_OF_ARGUMENT)
is raised because your MV query is wrong. The FROM table already stores an AggregateFunction column, so you should first merge the states (as minMerge does below) and then use the -State combinator to push the data into another AggregateFunction column. You can use SimpleAggregateFunction to create the partitioning key.
CREATE TABLE `default`.tbl
(
id_key String,
uid String,
data_timeStamp AggregateFunction(min, Date),
dt Date
)
Engine = AggregatingMergeTree
PARTITION BY dt
ORDER BY (id_key, uid);
CREATE MATERIALIZED VIEW default.tbl_view
(
id_key String,
uid String,
dt Date,
partition SimpleAggregateFunction(min, Date),
data_timeStamp_ AggregateFunction(min, Date)
)
ENGINE = AggregatingMergeTree()
PARTITION BY (dt, partition)
ORDER BY (id_key, uid)
AS
SELECT id_key, uid, dt, min(partition) as partition, minState(data_timeStamp_) as data_timeStamp_
FROM (SELECT id_key,
toDate(minMerge(data_timeStamp)) as dt,
uid as uid,
minMerge(data_timeStamp)::date partition,
minMerge(data_timeStamp)::date as data_timeStamp_
FROM `default`.tbl pe
GROUP BY id_key, uid
) GROUP BY id_key, uid, dt;
INSERT INTO default.tbl SELECT
CAST(number, 'String'),
'1',
minState(CAST(now(), 'date') - number),
CAST(now(), 'date') - number
FROM numbers(5)
GROUP BY
1,
2,
4;
OPTIMIZE TABLE tbl_view FINAL;
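The merge-then-State step in this answer can be pictured with a toy model. The sketch below is Python, not ClickHouse: `MinState`, `min_state`, `min_merge` and `to_date` are hypothetical stand-ins for an AggregateFunction(min, Date) value, minState, minMerge and toDate, and the dates are made up. It only illustrates why a state has to be merged into a plain date before a date function (or a partition key) can use it.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class MinState:
    """Toy stand-in for an AggregateFunction(min, Date) value (hypothetical)."""
    value: Optional[date]


def min_state(values: Iterable[date]) -> MinState:
    """Like minState: fold plain dates into an opaque intermediate state."""
    values = list(values)
    return MinState(min(values) if values else None)


def min_merge(states: Iterable[MinState]) -> Optional[date]:
    """Like minMerge: combine several states and return a plain date."""
    present = [s.value for s in states if s.value is not None]
    return min(present) if present else None


def to_date(value) -> date:
    """Like toDate: accepts plain values only, never an aggregate state."""
    if isinstance(value, MinState):
        raise TypeError("Illegal type MinState of argument of function to_date")
    return value


# Two partial states, as two data parts of the source table might hold them.
parts = [min_state([date(2023, 1, 5), date(2023, 1, 3)]),
         min_state([date(2023, 1, 4)])]

merged = min_merge(parts)   # plain date, usable as a partition key
dt = to_date(merged)        # works, because merged is no longer a state
restated = min_state([dt])  # back to a state for the target column
```

Calling `to_date(parts[0])` directly raises a `TypeError` in this model, which mirrors the ILLEGAL_TYPE_OF_ARGUMENT error from the question.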

Related

Clickhouse query takes long time to execute with array joins and group by

I have a table student which has over 90 million records. The create table query is as follows:
CREATE TABLE student(
id integer,
student_id FixedString(15) NOT NULL,
teacher_array Nested(
teacher_id String,
teacher_name String,
teacher_role_id smallint
),
subject_array Nested(
subject_id String,
subject_name String,
subject_category_id smallint
),
year integer NOT NULL
)
ENGINE=MergeTree()
PRIMARY KEY id
PARTITION BY year
ORDER BY id
SETTINGS index_granularity = 8192
The following query takes 5 seconds to execute:
SELECT count(distinct id) as student_count,
(
SELECT count(distinct id)
FROM student
ARRAY JOIN teacher_array
WHERE hasAny(subject_array.subject_category_id, [1, 2]) AND (teacher_array.teacher_role_id NOT IN (1))
) AS total_student_count,
count(*) OVER () AS total_result_count,
teacher_array.teacher_role_id AS teacher_id
FROM
(
SELECT *
FROM student
ARRAY JOIN subject_array
)
ARRAY JOIN teacher_array
WHERE (subject_array.subject_category_id IN (1, 2)) AND (teacher_array.teacher_role_id NOT IN (1))
GROUP BY teacher_array.teacher_role_id
ORDER BY student_count DESC
LIMIT 0, 10
I expect the query to run within 500 milliseconds. Is there any workaround for this? I tried using uniq and groupBitmap, but the execution time is still around 2 seconds.

Create materialized view based on aggregate materialized view

The base table
CREATE TABLE IF NOT EXISTS test_sessions
(
session_id UInt64,
session_name String,
created_at DateTime
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(created_at)
ORDER BY (session_id);
With the following data
INSERT INTO test_sessions (session_id, session_name, created_at) VALUES
(1, 'start', '2021-01-31 00:00:00'),
(1, 'stop', '2021-01-31 01:00:00'),
(2, 'start', '2021-01-31 01:00:00')
;
I created a materialized view and a view to get closed sessions:
CREATE MATERIALIZED VIEW IF NOT EXISTS test_session_aggregate_states
(
session_id UInt64,
started_at AggregateFunction(minIf, DateTime, UInt8),
stopped_at AggregateFunction(maxIf, DateTime, UInt8)
)
ENGINE = AggregatingMergeTree
PARTITION BY tuple()
ORDER BY (session_id)
POPULATE AS
SELECT session_id,
minIfState(created_at, session_name = 'start') AS started_at,
maxIfState(created_at, session_name = 'stop') AS stopped_at
FROM test_sessions
GROUP BY session_id;
CREATE VIEW IF NOT EXISTS test_session_completed
(
session_id UInt64,
started_at DateTime,
stopped_at DateTime
)
AS
SELECT session_id,
minIfMerge(started_at) AS started_at,
maxIfMerge(stopped_at) AS stopped_at
FROM test_session_aggregate_states
GROUP BY session_id
HAVING (started_at != '0000-00-00 00:00:00') AND
(stopped_at != '0000-00-00 00:00:00')
;
This works as expected: it returns the one row that has both a "start" and a "stop".
SELECT * FROM test_session_completed;
-- 1,2021-01-31 00:00:00,2021-01-31 01:00:00
I am trying to create a materialized view based on test_session_completed, with joins to other tables (the joins are omitted from this example):
CREATE MATERIALIZED VIEW IF NOT EXISTS test_mv
(
session_id UInt64
)
ENGINE = MergeTree
PARTITION BY tuple()
ORDER BY (session_id)
POPULATE AS
SELECT session_id
FROM test_session_completed
;
Test queries for test_mv:
INSERT INTO test_sessions (session_id, session_name, created_at) VALUES
(3, 'start', '2021-01-31 02:00:00'),
(3, 'stop', '2021-01-31 03:00:00');
SELECT * FROM test_session_completed;
-- SUCCESS
-- 3,2021-01-31 02:00:00,2021-01-31 03:00:00
-- 1,2021-01-31 00:00:00,2021-01-31 01:00:00
SELECT * FROM test_mv;
-- FAILURE
-- 1
-- EXPECTED RESULT
-- 3
-- 1
How can I fill test_mv based on test_session_completed?
ClickHouse version: 20.11.4.13
It is impossible to create an MV over a view.
An MV is an insert trigger, and it is impossible to get the completed state without having the started state in the same table. If you don't need to check that the start happened before the completion, you can make a simpler MV and just filter on completed.
You don't need minIfState; you can use min (SimpleAggregateFunction). It reduces the stored data and improves performance.
I think the second MV is excessive.
Check this:
https://den-crane.github.io/Everything_you_should_know_about_materialized_views_commented.pdf
https://youtu.be/ckChUkC3Pns?list=PLO3lfQbpDVI-hyw4MyqxEk3rDHw95SzxJ&t=9371
I would do this:
CREATE TABLE IF NOT EXISTS test_sessions
(
session_id UInt64,
session_name String,
created_at DateTime
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(created_at)
ORDER BY (session_id);
CREATE MATERIALIZED VIEW IF NOT EXISTS test_session_aggregate_states
(
session_id UInt64,
started_at SimpleAggregateFunction(min, DateTime),
stopped_at SimpleAggregateFunction(max, DateTime)
)
ENGINE = AggregatingMergeTree
PARTITION BY tuple()
ORDER BY (session_id)
POPULATE AS
SELECT session_id,
minIf(created_at, session_name = 'start') AS started_at,
maxIf(created_at, session_name = 'stop') AS stopped_at
FROM test_sessions
GROUP BY session_id;
INSERT INTO test_sessions (session_id, session_name, created_at) VALUES
(3, 'start', '2021-01-31 02:00:00'),
(3, 'stop', '2021-01-31 03:00:00');
Completed sessions:
SELECT session_id,
min(started_at) AS started_at,
max(stopped_at) AS stopped_at
FROM test_session_aggregate_states
GROUP BY session_id
HAVING (started_at != '0000-00-00 00:00:00') AND
(stopped_at != '0000-00-00 00:00:00');
┌─session_id─┬──────────started_at─┬──────────stopped_at─┐
│          1 │ 2021-01-31 00:00:00 │ 2021-01-31 01:00:00 │
└────────────┴─────────────────────┴─────────────────────┘
And with argMaxState you can aggregate multiple start/stop pairs within one session_id.
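To illustrate the "MV is an insert trigger" point from this answer, here is a minimal Python sketch. It is a toy model rather than ClickHouse behaviour in detail: `Table`, `aggregate_block` and `completed` are hypothetical names, with `aggregate_block` playing the role of the aggregating MV and `completed` the role of the plain view.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Row = Tuple[int, str, str]  # (session_id, session_name, created_at)


class Table:
    """Toy table: stores rows and fires triggers with each inserted block."""

    def __init__(self) -> None:
        self.rows: List[Row] = []
        self.triggers: List[Callable[[List[Row]], None]] = []

    def insert(self, block: List[Row]) -> None:
        self.rows.extend(block)
        for trigger in self.triggers:
            trigger(block)  # the trigger sees only the newly inserted block


sessions = Table()
# session_id -> [started_at, stopped_at]; None means "not seen yet"
states: Dict[int, List] = defaultdict(lambda: [None, None])


def aggregate_block(block: List[Row]) -> None:
    """Role of the aggregating MV: fold one inserted block into the states."""
    for session_id, name, created_at in block:
        slot = states[session_id]
        if name == "start":
            slot[0] = created_at if slot[0] is None else min(slot[0], created_at)
        elif name == "stop":
            slot[1] = created_at if slot[1] is None else max(slot[1], created_at)


sessions.triggers.append(aggregate_block)


def completed() -> List[int]:
    """Role of the plain view: computed at read time, stores nothing."""
    return sorted(sid for sid, (start, stop) in states.items() if start and stop)


sessions.insert([(1, "start", "2021-01-31 00:00:00"),
                 (1, "stop", "2021-01-31 01:00:00"),
                 (2, "start", "2021-01-31 01:00:00")])
sessions.insert([(3, "start", "2021-01-31 02:00:00"),
                 (3, "stop", "2021-01-31 03:00:00")])

print(completed())  # [1, 3]
```

In this model nothing is ever inserted into `completed`, so there is no insert event a second trigger could attach to. That is the reason a materialized view defined on top of a plain view is not refreshed by later inserts into the base table.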

Oracle sql select data from table where attribute is nested table

I have these object types:
create type t_history_rec is object
(
date_from date,
current float
);
create type t_history is table of t_history_rec;
and a table defined as:
create table person
(
id integer primary key,
name varchar2(30),
history t_history
);
and I want to select name, history.date_from, history.current so that the result looks like this:
name1 date1 current1
name1 date2 current2
name2 date3 current3
...
How to do this?
Cannot verify this, but you could try something like this:
select p.name, pp.date_from, pp.current
from person p, table(p.history) pp;
You have some errors. CURRENT is a reserved word, so the attribute has to be renamed:
create or replace type t_history_rec is object
(
date_from date,
curr float
);
/
create type t_history is table of t_history_rec;
/
The table definition needs a NESTED TABLE ... STORE AS clause:
create table person
(
id integer primary key,
name varchar2(30),
history t_history
) NESTED TABLE history STORE AS col1_tab;
insert into person (id, name, history) values (1, 'aa', t_history(t_history_rec(sysdate, 1)));
insert into person (id, name, history) values (2, 'aa', t_history(t_history_rec(sysdate, 1), t_history_rec(sysdate, 1)));
Then the select is:
SELECT t1.name, t2.date_from, t2.curr FROM person t1, TABLE(t1.history) t2;
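What the TABLE(t1.history) join produces can be sketched outside the database as a plain flattening loop. The Python below is only an analogy, assuming one output row per (person, history record) pair; the class names, `unnest`, and the sample dates and values are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Iterator, List, Tuple


@dataclass
class HistoryRec:
    date_from: date
    curr: float


@dataclass
class Person:
    id: int
    name: str
    history: List[HistoryRec] = field(default_factory=list)


def unnest(people: List[Person]) -> Iterator[Tuple[str, date, float]]:
    """Yield one (name, date_from, curr) row per nested history record."""
    for p in people:
        for h in p.history:
            yield (p.name, h.date_from, h.curr)


# Hypothetical sample data shaped like the desired output in the question.
people = [
    Person(1, "name1", [HistoryRec(date(2020, 1, 1), 1.0),
                        HistoryRec(date(2020, 2, 1), 2.0)]),
    Person(2, "name2", [HistoryRec(date(2020, 3, 1), 3.0)]),
    Person(3, "name3"),  # empty history
]

rows = list(unnest(people))
for row in rows:
    print(row)
```

In this sketch a person with an empty history contributes no rows at all, which is how an inner join against the unnested collection behaves.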

Why does full outer join in HIVE give weird result when one of the join fields is missing?

I'm comparing the behavior between SQL engines. Oracle has the behavior I would expect from a SQL engine for full outer joins:
Oracle
CREATE TABLE sql_test_a
(
ID VARCHAR2(4000 BYTE),
FIRST_NAME VARCHAR2(200 BYTE),
LAST_NAME VARCHAR2(200 BYTE)
);
CREATE TABLE sql_test_b
(
NUM VARCHAR2(4000 BYTE),
FIRST_NAME VARCHAR2(200 BYTE),
LAST_NAME VARCHAR2(200 BYTE)
);
INSERT INTO sql_test_a (ID, FIRST_NAME, LAST_NAME) VALUES ('1', 'John', 'Snow');
INSERT INTO sql_test_a (ID, FIRST_NAME, LAST_NAME) VALUES ('2', 'Mike', 'Tyson');
INSERT INTO sql_test_b (NUM, FIRST_NAME, LAST_NAME) VALUES ('20', 'Mike', 'Tyson');
When I execute the following, it gives me the expected result. The resulting table contains two rows, one of which has NULL for the NUM field, because there is no John Snow in the table sql_test_b.
SELECT A.FIRST_NAME, A.LAST_NAME, A.ID, B.NUM
FROM
SQL_TEST_A A
FULL OUTER JOIN
SQL_TEST_B B
ON
A.FIRST_NAME = B.FIRST_NAME
AND
A.LAST_NAME = B.LAST_NAME;
You can test the sql script here: http://sqltest.net/
HIVE
In HIVE, however, if you were to try the same thing, the full outer join results in a table with two rows. The row that should be the "John Snow" row contains NULL for the fields FIRST_NAME, LAST_NAME, and NUM. The 1 is filled in for ID, but that's it.
Why such weird behavior in HIVE? Is this a bug? Or am I missing something...because Oracle 11g seems to handle this much better. Thanks.
I could not reproduce the result reported by @Candic3.
I used the below statements along with the same "select" query as in the question.
CREATE TABLE IF NOT EXISTS sql_test_a (ID String, FIRST_NAME String, LAST_NAME String) COMMENT 'sql_test_a'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
CREATE TABLE IF NOT EXISTS sql_test_b (NUM String, FIRST_NAME String, LAST_NAME String) COMMENT 'sql_test_b'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
INSERT INTO sql_test_a VALUES ('1', 'John', 'Snow');
INSERT INTO sql_test_a VALUES ('2', 'Mike', 'Tyson');
INSERT INTO sql_test_b VALUES ('20', 'Mike', 'Tyson');
SELECT A.FIRST_NAME, A.LAST_NAME, A.ID, B.NUM
FROM
SQL_TEST_A A
FULL OUTER JOIN
SQL_TEST_B B
ON
A.FIRST_NAME = B.FIRST_NAME
AND
A.LAST_NAME = B.LAST_NAME;
Please find the result attached.
However, a select query can return NULL because of easily overlooked mistakes, such as a data-type mismatch between the DDL and the actual data (say, from flat files) or a mismatch between the delimiter declared in the DDL and the one used in the actual data.
I think the issue is with the "(" after ON: here the whole join condition is wrapped in parentheses, which is slightly different from traditional SQL.
SELECT A.FIRST_NAME, A.LAST_NAME, A.ID, B.NUM
FROM
SQL_TEST_A A
FULL OUTER JOIN
SQL_TEST_B B ON
(A.FIRST_NAME = B.FIRST_NAME AND A.LAST_NAME = B.LAST_NAME);
In the select statement you used A.FIRST_NAME and A.LAST_NAME, which are not present for a row that comes only from table B. That is why you get the null values. Instead, use COALESCE to pick the non-null value between A.FIRST_NAME and B.FIRST_NAME:
SELECT COALESCE(A.FIRST_NAME, B.FIRST_NAME) as FIRST_NAME, COALESCE(A.LAST_NAME, B.LAST_NAME) as LAST_NAME, A.ID, B.NUM
FROM
SQL_TEST_A A
FULL OUTER JOIN
SQL_TEST_B B
ON
A.FIRST_NAME = B.FIRST_NAME
AND
A.LAST_NAME = B.LAST_NAME;
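The effect of COALESCE on a full outer join can be checked with a small pure-Python model. This is a sketch of the join semantics only, not of Hive or Oracle internals; `full_outer_join`, `coalesce` and `project` are hypothetical helpers, and the extra 'Ned Stark' row is made-up data added to show a row that exists only in table B.

```python
table_a = [{"id": "1", "first_name": "John", "last_name": "Snow"},
           {"id": "2", "first_name": "Mike", "last_name": "Tyson"}]
table_b = [{"num": "20", "first_name": "Mike", "last_name": "Tyson"}]


def full_outer_join(a_rows, b_rows):
    """Return (a_row_or_None, b_row_or_None) pairs matched on first and last name."""
    pairs = []
    matched_b = set()
    for a in a_rows:
        hits = [i for i, b in enumerate(b_rows)
                if (a["first_name"], a["last_name"]) == (b["first_name"], b["last_name"])]
        if hits:
            for i in hits:
                matched_b.add(i)
                pairs.append((a, b_rows[i]))
        else:
            pairs.append((a, None))   # A-only row: B side is NULL
    for i, b in enumerate(b_rows):
        if i not in matched_b:
            pairs.append((None, b))   # B-only row: A side is NULL
    return pairs


def coalesce(*values):
    """Return the first value that is not None, like SQL COALESCE."""
    return next((v for v in values if v is not None), None)


def project(pairs):
    """SELECT COALESCE(A.first, B.first), COALESCE(A.last, B.last), A.id, B.num."""
    out = []
    for a, b in pairs:
        a = a or {}
        b = b or {}
        out.append((coalesce(a.get("first_name"), b.get("first_name")),
                    coalesce(a.get("last_name"), b.get("last_name")),
                    a.get("id"), b.get("num")))
    return out


result = project(full_outer_join(table_a, table_b))

# Hypothetical B-only row, to show where COALESCE makes a difference.
extra_b = {"num": "30", "first_name": "Ned", "last_name": "Stark"}
result_with_extra = project(full_outer_join(table_a, table_b + [extra_b]))
```

With the data from the question every row has an A side, so the names are never NULL in this model. Only the hypothetical B-only row depends on COALESCE to show its name.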

Return all null values between two dates

I have the following query and I need it to return all the null values between those two dates.
select cust_first_name
from customers
join orders using(customer_id)
where order_date between (to_date('01-01-2007','DD-MM-YYYY'))
and (to_date('31-12-2008','DD-MM-YYYY'));
Sounds like what you want is customers with no orders within the given date range. The join you are using finds the opposite of that.
You could do this with an outer join, in which case you need to apply the date filter prior to the join. It's probably easier and more readable to use a NOT IN or NOT EXISTS subquery:
select cust_first_name
from customers
WHERE customers.customer_id NOT IN (
SELECT orders.customer_id from orders
where order_date between (to_date('01-01-2007','DD-MM-YYYY'))
and (to_date('31-12-2008','DD-MM-YYYY'))
)
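The anti-join idea can be tried quickly with Python's built-in sqlite3 module. This is a sketch under assumptions: SQLite stands in for Oracle, dates are stored as ISO text so BETWEEN compares strings instead of using to_date, and the sample rows are hypothetical. It uses NOT EXISTS, the other option mentioned above, which also avoids the case where NOT IN returns no rows because the subquery yields a NULL customer_id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, cust_first_name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT);
    -- Hypothetical sample data
    INSERT INTO customers VALUES (1, 'bob'), (2, 'cyril'), (3, 'harry'), (4, 'dana');
    INSERT INTO orders VALUES
        (1, 1, '2007-02-01'),
        (2, 2, '2015-02-15'),
        (3, 3, '2008-02-15'),
        (4, 3, '2015-03-01');
""")

# Customers with no order inside the date range, including customers
# who have no orders at all.
rows = conn.execute("""
    SELECT c.cust_first_name
    FROM customers c
    WHERE NOT EXISTS (
        SELECT 1
        FROM orders o
        WHERE o.customer_id = c.customer_id
          AND o.order_date BETWEEN '2007-01-01' AND '2008-12-31'
    )
    ORDER BY c.customer_id
""").fetchall()

names = [r[0] for r in rows]
print(names)  # ['cyril', 'dana']
```

Note that 'harry' is excluded even though one of his orders falls outside the range, because he also has an order inside it.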
Here is an example of how to do what you want.
The key part is doing a left join on your orders table, and then simply doing a not between date1 and date2
declare @customers table (
id int identity(1,1),
first_name nvarchar(50),
last_name nvarchar(50)
)
declare @orders table (
id int identity(1,1),
customer_id int,
order_date datetime
)
insert into @customers(first_name, last_name) values ('bob', 'gates')
insert into @customers(first_name, last_name) values ('cyril', 'smith')
insert into @customers(first_name, last_name) values ('harry', 'potter')
insert into @orders(customer_id, order_date) values (1, '2007-02-01')
insert into @orders(customer_id, order_date) values (2, '2015-02-15')
insert into @orders(customer_id, order_date) values (3, '2008-02-15')
select
customers.id
,customers.first_name
,customers.last_name
from @customers customers
left join @orders orders on orders.customer_id = customers.id
where orders.id is null
or orders.order_date not between ('2007-01-01') and ('2008-12-31')
group by
customers.id
,customers.first_name
,customers.last_name;
