I'm trying to get aggregates values for each att1, and att2 column, and also for each value of the arrays in att3 column.
As far I tried:
create table test(value Float32, att1 String, att2 String, att3 Array(String))
ENGINE=MergeTree() ORDER BY ();
INSERT INTO test VALUES (2.0, 'a', 'Z', ['sports', 'office', 'anothertag'])
INSERT INTO test VALUES (4.0, 'b', 'X', ['sports', 'office', 'tag'])
INSERT INTO test VALUES (6.0, 'b', 'X', ['sports', 'internet', 'planes'])
SELECT * from test;
┌─value─┬─att1─┬─att2─┬─att3───────────────────────────┐
│ 6 │ b │ X │ ['sports','internet','planes'] │
└───────┴──────┴──────┴────────────────────────────────┘
┌─value─┬─att1─┬─att2─┬─att3─────────────────────────────┐
│ 2 │ a │ Z │ ['sports','office','anothertag'] │
└───────┴──────┴──────┴──────────────────────────────────┘
┌─value─┬─att1─┬─att2─┬─att3──────────────────────┐
│ 4 │ b │ X │ ['sports','office','tag'] │
└───────┴──────┴──────┴───────────────────────────┘
I want to get the aggregate -sum(value)- for each different attribute.
I have it working for att1 and att2 columns with:
SELECT
att1,
att2,
sum(value)
FROM test
GROUP BY
att1,
att2
WITH CUBE
Result:
┌─att1─┬─att2─┬─sum(value)─┐
│ b │ X │ 10 │
│ a │ Z │ 2 │
└──────┴──────┴────────────┘
┌─att1─┬─att2─┬─sum(value)─┐
│ a │ │ 2 │
│ b │ │ 10 │
└──────┴──────┴────────────┘
┌─att1─┬─att2─┬─sum(value)─┐
│ │ Z │ 2 │
│ │ X │ 10 │
└──────┴──────┴────────────┘
┌─att1─┬─att2─┬─sum(value)─┐
│ │ │ 12 │
└──────┴──────┴────────────┘
Which gives me more than needed, but results two and three give correct results.
But I also need the value for each value on att3, I have it working in another query, but when trying to make a single query:
SELECT
att1,
att2,
arrayJoin(att3) AS tags,
sum(value)
FROM test
GROUP BY
att1,
att2,
tags
WITH CUBE
Which gives (among other things):
┌─att1─┬─att2─┬─tags─┬─sum(value)─┐
│ a │ │ │ 6 │
│ b │ │ │ 30 │
└──────┴──────┴──────┴────────────┘
┌─att1─┬─att2─┬─tags───────┬─sum(value)─┐
│ │ │ tag │ 4 │
│ │ │ anothertag │ 2 │
│ │ │ planes │ 6 │
│ │ │ sports │ 12 │
│ │ │ internet │ 6 │
│ │ │ office │ 6 │
└──────┴──────┴────────────┴────────────┘
Since arrayJoin 'unfolds' array into rows, now values of sum(value) in att1 are not accurate.
I've also tried the LEFT ARRAY JOIN syntax with same results.
Updated:
The ideal result would be something like:
┌─'att1'─┬─'att2'─┬─'tags'─┬─'sum(value)'─┐
│ a │ │ │ 2 │
│ b │ │ │ 10 │
│ │ X │ │ 10 │
│ │ Z │ │ 2 │
│ │ │ sports │ 12 │
│ │ │ office │ 6 │
│ │ │ anot.. │ 2 │
│ │ │ tag │ 4 │
│ │ │internet│ 6 │
│ │ │planes │ 6 │
└────────┴────────┴────────┴──────────────┘
Could be in different rows (results), but ideally in one single query.
SELECT
sumMap(([att1], [value])) AS r1,
sumMap(([att2], [value])) AS r2,
sumMap((att3, replicate(value, att3))) AS r3
FROM test
┌─r1─────────────────┬─r2─────────────────┬─r3──────────────────────────────────────────────────────────────────────────┐
│ (['a','b'],[2,10]) │ (['X','Z'],[10,2]) │ (['anothertag','internet','office','planes','sports','tag'],[2,6,6,6,12,4]) │
└────────────────────┴────────────────────┴─────────────────────────────────────────────────────────────────────────────┘
SELECT
(arrayJoin(arrayZip((arrayJoin([sumMap(([att1], [value])), sumMap(([att2], [value])), sumMap((att3, replicate(value, att3)))]) AS r).1, r.2)) AS x).1 AS y,
x.2 AS z
FROM test
┌─y──────────┬──z─┐
│ a │ 2 │
│ b │ 10 │
│ X │ 10 │
│ Z │ 2 │
│ anothertag │ 2 │
│ internet │ 6 │
│ office │ 6 │
│ planes │ 6 │
│ sports │ 12 │
│ tag │ 4 │
└────────────┴────┘
I think the more straightforward way is to combine two queries:
SELECT
att1,
att2,
'' AS tags,
sum(value)
FROM test
GROUP BY
att1,
att2
WITH CUBE
UNION ALL
SELECT
'' AS att1,
'' AS att2,
arrayJoin(att3) AS tags,
sum(value)
FROM test
GROUP BY tags
/*
┌─att1─┬─att2─┬─tags───────┬─sum(value)─┐
│ │ │ internet │ 6 │
│ │ │ sports │ 12 │
│ │ │ office │ 6 │
│ │ │ tag │ 4 │
│ │ │ planes │ 6 │
│ │ │ anothertag │ 2 │
└──────┴──────┴────────────┴────────────┘
┌─att1─┬─att2─┬─tags─┬─sum(value)─┐
│ b │ X │ │ 10 │
│ a │ Z │ │ 2 │
└──────┴──────┴──────┴────────────┘
┌─att1─┬─att2─┬─tags─┬─sum(value)─┐
│ a │ │ │ 2 │
│ b │ │ │ 10 │
└──────┴──────┴──────┴────────────┘
┌─att1─┬─att2─┬─tags─┬─sum(value)─┐
│ │ Z │ │ 2 │
│ │ X │ │ 10 │
└──────┴──────┴──────┴────────────┘
┌─att1─┬─att2─┬─tags─┬─sum(value)─┐
│ │ │ │ 12 │
└──────┴──────┴──────┴────────────┘
*/
I'm looking for an efficient way on how to query the n past values as array in ClickHouse for each row ordered by one column (i.e. Time), where the values should be retrieved as array.
Window functions are still not supported in ClickHouse (see #1469), so I was hoping for a work-around using aggregation functions like groupArray()?
Table:
Time | Value
12:11 | 1
12:12 | 2
12:13 | 3
12:14 | 4
12:15 | 5
12:16 | 6
Expected result with a window of size n=3:
Time | Value
12:13 | [1,2,3]
12:14 | [2,3,4]
12:15 | [3,4,5]
12:16 | [4,5,6]
What are the ways/functions currently used in ClickHouse to efficiently query a sliding/moving window and how can I achieve my desired result?
EDIT:
My solution based on response of #vladimir:
select max(Time) as Time, groupArray(Value) as Values
from (
select
*,
rowNumberInAllBlocks() as row_number,
arrayJoin(range(row_number, row_number + 3)) as window_id
from (
/* BEGIN emulate origin dataset */
select toDateTime(a) as Time, rowNumberInAllBlocks()+1 as Value
from (
select arrayJoin([
'2020-01-01 12:11:00',
'2020-01-01 12:12:00',
'2020-01-01 12:13:00',
'2020-01-01 12:14:00',
'2020-01-01 12:15:00',
'2020-01-01 12:16:00']) a
)
order by Time
/* END emulate origin dataset */
)
order by Time
) s
group by window_id
having length(Values) = 3
order by Time
Note that 3 appears twice in the query and represents the window size n.
Output:
┌────────────────Time─┬─Values──┐
│ 2020-01-01 12:13:00 │ [1,2,3] │
│ 2020-01-01 12:14:00 │ [2,3,4] │
│ 2020-01-01 12:15:00 │ [3,4,5] │
│ 2020-01-01 12:16:00 │ [4,5,6] │
└─────────────────────┴─────────┘
Starting from version 21.4 added the full support of window-functions. At this moment it was marked as an experimental feature.
SELECT
Time,
groupArray(any(Value)) OVER (ORDER BY Time ASC ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS Values
FROM
(
/* Emulate the test dataset, */
select toDateTime(a) as Time, rowNumberInAllBlocks()+1 as Value
from (
select arrayJoin([
'2020-01-01 12:11:00',
'2020-01-01 12:12:00',
'2020-01-01 12:13:00',
'2020-01-01 12:14:00',
'2020-01-01 12:15:00',
'2020-01-01 12:16:00']) a
)
order by Time
)
GROUP BY Time
SETTINGS allow_experimental_window_functions = 1
/*
┌────────────────Time─┬─Values──┐
│ 2020-01-01 12:11:00 │ [1] │
│ 2020-01-01 12:12:00 │ [1,2] │
│ 2020-01-01 12:13:00 │ [1,2,3] │
│ 2020-01-01 12:14:00 │ [2,3,4] │
│ 2020-01-01 12:15:00 │ [3,4,5] │
│ 2020-01-01 12:16:00 │ [4,5,6] │
└─────────────────────┴─────────┘
*/
See https://altinity.com/blog/clickhouse-window-functions-current-state-of-the-art.
ClickHouse has several datablock-scoped window functions, let's take neighbor:
SELECT Time, [neighbor(Value, -2), neighbor(Value, -1), neighbor(Value, 0)] Values
FROM (
/* emulate origin data */
SELECT toDateTime(data.1) as Time, data.2 as Value
FROM (
SELECT arrayJoin([('2020-01-01 12:11:00', 1),
('2020-01-01 12:12:00', 2),
('2020-01-01 12:13:00', 3),
('2020-01-01 12:14:00', 4),
('2020-01-01 12:15:00', 5),
('2020-01-01 12:16:00', 6)]) as data)
)
/*
┌────────────────Time─┬─Values──┐
│ 2020-01-01 12:11:00 │ [0,0,1] │
│ 2020-01-01 12:12:00 │ [0,1,2] │
│ 2020-01-01 12:13:00 │ [1,2,3] │
│ 2020-01-01 12:14:00 │ [2,3,4] │
│ 2020-01-01 12:15:00 │ [3,4,5] │
│ 2020-01-01 12:16:00 │ [4,5,6] │
└─────────────────────┴─────────┘
*/
An alternate way based on the duplication of source rows by window_size times:
SELECT
arrayReduce('max', arrayMap(x -> x.1, raw_result)) Time,
arrayMap(x -> x.2, raw_result) Values
FROM (
SELECT groupArray((Time, Value)) raw_result, max(row_number) max_row_number
FROM (
SELECT
3 AS window_size,
*,
rowNumberInAllBlocks() row_number,
arrayJoin(arrayMap(x -> x + row_number, range(window_size))) window_id
FROM (
/* emulate origin dataset */
SELECT toDateTime(data.1) as Time, data.2 as Value
FROM (
SELECT arrayJoin([('2020-01-01 12:11:00', 1),
('2020-01-01 12:12:00', 2),
('2020-01-01 12:13:00', 3),
('2020-01-01 12:14:00', 4),
('2020-01-01 12:15:00', 5),
('2020-01-01 12:16:00', 6)]) as data)
ORDER BY Value
)
)
GROUP BY window_id
HAVING max_row_number = window_id
ORDER BY window_id
)
/*
┌────────────────Time─┬─Values──┐
│ 2020-01-01 12:11:00 │ [1] │
│ 2020-01-01 12:12:00 │ [1,2] │
│ 2020-01-01 12:13:00 │ [1,2,3] │
│ 2020-01-01 12:14:00 │ [2,3,4] │
│ 2020-01-01 12:15:00 │ [3,4,5] │
│ 2020-01-01 12:16:00 │ [4,5,6] │
└─────────────────────┴─────────┘
*/
Extra example:
SELECT
arrayReduce('max', arrayMap(x -> x.1, raw_result)) id,
arrayMap(x -> x.2, raw_result) values
FROM (
SELECT groupArray((id, value)) raw_result, max(row_number) max_row_number
FROM (
SELECT
48 AS window_size,
*,
rowNumberInAllBlocks() row_number,
arrayJoin(arrayMap(x -> x + row_number, range(window_size))) window_id
FROM (
/* the origin dataset */
SELECT number AS id, number AS value
FROM numbers(4096)
)
)
GROUP BY window_id
HAVING max_row_number = window_id
ORDER BY window_id
)
/*
┌─id─┬─values────────────────┐
│ 0 │ [0] │
│ 1 │ [0,1] │
│ 2 │ [0,1,2] │
│ 3 │ [0,1,2,3] │
│ 4 │ [0,1,2,3,4] │
│ 5 │ [0,1,2,3,4,5] │
│ 6 │ [0,1,2,3,4,5,6] │
│ 7 │ [0,1,2,3,4,5,6,7] │
..
│ 56 │ [9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56] │
│ 57 │ [10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57] │
│ 58 │ [11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58] │
│ 59 │ [12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59] │
│ 60 │ [13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60] │
..
│ 4093 │ [4046,4047,4048,4049,4050,4051,4052,4053,4054,4055,4056,4057,4058,4059,4060,4061,4062,4063,4064,4065,4066,4067,4068,4069,4070,4071,4072,4073,4074,4075,4076,4077,4078,4079,4080,4081,4082,4083,4084,4085,4086,4087,4088,4089,4090,4091,4092,4093] │
│ 4094 │ [4047,4048,4049,4050,4051,4052,4053,4054,4055,4056,4057,4058,4059,4060,4061,4062,4063,4064,4065,4066,4067,4068,4069,4070,4071,4072,4073,4074,4075,4076,4077,4078,4079,4080,4081,4082,4083,4084,4085,4086,4087,4088,4089,4090,4091,4092,4093,4094] │
│ 4095 │ [4048,4049,4050,4051,4052,4053,4054,4055,4056,4057,4058,4059,4060,4061,4062,4063,4064,4065,4066,4067,4068,4069,4070,4071,4072,4073,4074,4075,4076,4077,4078,4079,4080,4081,4082,4083,4084,4085,4086,4087,4088,4089,4090,4091,4092,4093,4094,4095] │
└──────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
*/
For Clickhouse 19, where range function takes only single input, you can use following query
select max(Time) as Time, groupArray(Value) as Values
from (
select
*,
rowNumberInAllBlocks() as row_number,
arrayJoin( arrayMap(x -> x + row_number, range(3)) ) as window_id
from (
/* BEGIN emulate origin dataset */
select toDateTime(a) as Time, rowNumberInAllBlocks()+1 as Value
from (
select arrayJoin([
'2020-01-01 12:11:00',
'2020-01-01 12:12:00',
'2020-01-01 12:13:00',
'2020-01-01 12:14:00',
'2020-01-01 12:15:00',
'2020-01-01 12:16:00']) a
)
order by Time
/* END emulate origin dataset */
)
order by Time
) s
group by window_id
having length(Values) = 3
order by Time