ClickHouse: Sliding / moving window
I'm looking for an efficient way to query the previous n values in ClickHouse for each row, ordered by one column (e.g. Time), with the values returned as an array.
Window functions are still not supported in ClickHouse (see #1469), so I was hoping for a workaround using aggregate functions like groupArray().
Time | Value
12:11 | 1
12:12 | 2
12:13 | 3
12:14 | 4
12:15 | 5
12:16 | 6
Expected result with a window of size n=3:
Time | Value
12:13 | [1,2,3]
12:14 | [2,3,4]
12:15 | [3,4,5]
12:16 | [4,5,6]
What are the ways/functions currently used in ClickHouse to efficiently query a sliding/moving window and how can I achieve my desired result?
My solution, based on the response of @vladimir:
select max(Time) as Time, groupArray(Value) as Values
from (
    select
        *,
        rowNumberInAllBlocks() as row_number,
        /* each row is copied into the 3 windows that contain it */
        arrayJoin(range(row_number, row_number + 3)) as window_id
    from (
        /* BEGIN emulate origin dataset */
        select toDateTime(a) as Time, rowNumberInAllBlocks()+1 as Value
        from (
            select arrayJoin([
                '2020-01-01 12:11:00',
                '2020-01-01 12:12:00',
                '2020-01-01 12:13:00',
                '2020-01-01 12:14:00',
                '2020-01-01 12:15:00',
                '2020-01-01 12:16:00']) a
        )
        order by Time
        /* END emulate origin dataset */
    )
    order by Time
) s
group by window_id
having length(Values) = 3
order by Time
Note that 3 appears twice in the query and represents the window size n.
│ 2020-01-01 12:13:00 │ [1,2,3] │
│ 2020-01-01 12:14:00 │ [2,3,4] │
│ 2020-01-01 12:15:00 │ [3,4,5] │
│ 2020-01-01 12:16:00 │ [4,5,6] │
Starting with version 21.4, ClickHouse added full support for window functions. At that time it was marked as an experimental feature.
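SELECT
    Time,
    /* frame: the current row and the 2 rows before it, i.e. window size n = 3 */
    groupArray(Value) OVER (ORDER BY Time ASC ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS Values
FROM (
    /* Emulate the test dataset, */
    select toDateTime(a) as Time, rowNumberInAllBlocks()+1 as Value
    from (
        select arrayJoin([
            '2020-01-01 12:11:00',
            '2020-01-01 12:12:00',
            '2020-01-01 12:13:00',
            '2020-01-01 12:14:00',
            '2020-01-01 12:15:00',
            '2020-01-01 12:16:00']) a
    )
    order by Time
)
SETTINGS allow_experimental_window_functions = 1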
│ 2020-01-01 12:11:00 │ [1] │
│ 2020-01-01 12:12:00 │ [1,2] │
│ 2020-01-01 12:13:00 │ [1,2,3] │
│ 2020-01-01 12:14:00 │ [2,3,4] │
│ 2020-01-01 12:15:00 │ [3,4,5] │
│ 2020-01-01 12:16:00 │ [4,5,6] │
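The window-function result above also contains the partial windows [1] and [1,2], while the question asks only for full windows of size n=3. A minimal sketch of one way to drop them: compute the array in a subquery and filter on its length outside. It assumes a version with window functions enabled (where they are still experimental, the allow_experimental_window_functions setting shown above is also needed). The alias window_values is made up for this sketch, and the tuple-based dataset emulation is borrowed from the other answers.

```sql
SELECT Time, window_values
FROM
(
    SELECT
        Time,
        /* frame: the current row plus the 2 rows before it (n = 3) */
        groupArray(Value) OVER (ORDER BY Time ASC ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS window_values
    FROM
    (
        /* emulate the origin dataset */
        SELECT toDateTime(data.1) AS Time, data.2 AS Value
        FROM
        (
            SELECT arrayJoin([('2020-01-01 12:11:00', 1),
                              ('2020-01-01 12:12:00', 2),
                              ('2020-01-01 12:13:00', 3),
                              ('2020-01-01 12:14:00', 4),
                              ('2020-01-01 12:15:00', 5),
                              ('2020-01-01 12:16:00', 6)]) AS data
        )
    )
)
WHERE length(window_values) = 3 /* keep only full windows */
ORDER BY Time ASC
```

On the sample data this should leave only the rows for 12:13 through 12:16, matching the expected result in the question.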
ClickHouse has several window-like functions that operate within a single data block; take neighbor, for example:
SELECT Time, [neighbor(Value, -2), neighbor(Value, -1), neighbor(Value, 0)] Values
FROM (
    /* emulate origin data */
    SELECT toDateTime(data.1) as Time, data.2 as Value
    FROM (
        SELECT arrayJoin([('2020-01-01 12:11:00', 1),
            ('2020-01-01 12:12:00', 2),
            ('2020-01-01 12:13:00', 3),
            ('2020-01-01 12:14:00', 4),
            ('2020-01-01 12:15:00', 5),
            ('2020-01-01 12:16:00', 6)]) as data)
)
│ 2020-01-01 12:11:00 │ [0,0,1] │
│ 2020-01-01 12:12:00 │ [0,1,2] │
│ 2020-01-01 12:13:00 │ [1,2,3] │
│ 2020-01-01 12:14:00 │ [2,3,4] │
│ 2020-01-01 12:15:00 │ [3,4,5] │
│ 2020-01-01 12:16:00 │ [4,5,6] │
An alternative based on duplicating each source row window_size times:
SELECT
    arrayReduce('max', arrayMap(x -> x.1, raw_result)) Time,
    arrayMap(x -> x.2, raw_result) Values
FROM (
    SELECT groupArray((Time, Value)) raw_result, max(row_number) max_row_number
    FROM (
        SELECT
            3 AS window_size,
            *,
            rowNumberInAllBlocks() row_number,
            arrayJoin(arrayMap(x -> x + row_number, range(window_size))) window_id
        FROM (
            /* emulate origin dataset */
            SELECT toDateTime(data.1) as Time, data.2 as Value
            FROM (
                SELECT arrayJoin([('2020-01-01 12:11:00', 1),
                    ('2020-01-01 12:12:00', 2),
                    ('2020-01-01 12:13:00', 3),
                    ('2020-01-01 12:14:00', 4),
                    ('2020-01-01 12:15:00', 5),
                    ('2020-01-01 12:16:00', 6)]) as data)
        )
    )
    GROUP BY window_id
    /* keep only windows that end on an existing row */
    HAVING max_row_number = window_id
    ORDER BY window_id
)
│ 2020-01-01 12:11:00 │ [1] │
│ 2020-01-01 12:12:00 │ [1,2] │
│ 2020-01-01 12:13:00 │ [1,2,3] │
│ 2020-01-01 12:14:00 │ [2,3,4] │
│ 2020-01-01 12:15:00 │ [3,4,5] │
│ 2020-01-01 12:16:00 │ [4,5,6] │
Extra example:
SELECT
    arrayReduce('max', arrayMap(x -> x.1, raw_result)) id,
    arrayMap(x -> x.2, raw_result) values
FROM (
    SELECT groupArray((id, value)) raw_result, max(row_number) max_row_number
    FROM (
        SELECT
            48 AS window_size,
            *,
            rowNumberInAllBlocks() row_number,
            arrayJoin(arrayMap(x -> x + row_number, range(window_size))) window_id
        FROM (
            /* the origin dataset */
            SELECT number AS id, number AS value
            FROM numbers(4096)
        )
    )
    GROUP BY window_id
    HAVING max_row_number = window_id
    ORDER BY window_id
)
│ 0 │ [0] │
│ 1 │ [0,1] │
│ 2 │ [0,1,2] │
│ 3 │ [0,1,2,3] │
│ 4 │ [0,1,2,3,4] │
│ 5 │ [0,1,2,3,4,5] │
│ 6 │ [0,1,2,3,4,5,6] │
│ 7 │ [0,1,2,3,4,5,6,7] │
│ 56 │ [9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56] │
│ 57 │ [10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57] │
│ 58 │ [11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58] │
│ 59 │ [12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59] │
│ 60 │ [13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60] │
│ 4093 │ [4046,4047,4048,4049,4050,4051,4052,4053,4054,4055,4056,4057,4058,4059,4060,4061,4062,4063,4064,4065,4066,4067,4068,4069,4070,4071,4072,4073,4074,4075,4076,4077,4078,4079,4080,4081,4082,4083,4084,4085,4086,4087,4088,4089,4090,4091,4092,4093] │
│ 4094 │ [4047,4048,4049,4050,4051,4052,4053,4054,4055,4056,4057,4058,4059,4060,4061,4062,4063,4064,4065,4066,4067,4068,4069,4070,4071,4072,4073,4074,4075,4076,4077,4078,4079,4080,4081,4082,4083,4084,4085,4086,4087,4088,4089,4090,4091,4092,4093,4094] │
│ 4095 │ [4048,4049,4050,4051,4052,4053,4054,4055,4056,4057,4058,4059,4060,4061,4062,4063,4064,4065,4066,4067,4068,4069,4070,4071,4072,4073,4074,4075,4076,4077,4078,4079,4080,4081,4082,4083,4084,4085,4086,4087,4088,4089,4090,4091,4092,4093,4094,4095] │
For ClickHouse 19, where the range function takes only a single argument, you can use the following query:
select max(Time) as Time, groupArray(Value) as Values
from (
    select
        *,
        rowNumberInAllBlocks() as row_number,
        arrayJoin( arrayMap(x -> x + row_number, range(3)) ) as window_id
    from (
        /* BEGIN emulate origin dataset */
        select toDateTime(a) as Time, rowNumberInAllBlocks()+1 as Value
        from (
            select arrayJoin([
                '2020-01-01 12:11:00',
                '2020-01-01 12:12:00',
                '2020-01-01 12:13:00',
                '2020-01-01 12:14:00',
                '2020-01-01 12:15:00',
                '2020-01-01 12:16:00']) a
        )
        order by Time
        /* END emulate origin dataset */
    )
    order by Time
) s
group by window_id
having length(Values) = 3
order by Time
Compare two query result set in Clickhouse
I need to compare row values from yesterday with those from the current day (today). Sample table values are below:
table_1
hour | loads
2022-12-16 00:00:00 | 30000
2022-12-16 01:00:00 | 40000
table_2
hour | loads
2022-12-15 00:00:00 | 25000
2022-12-15 01:00:00 | 25000
I then want to compare the table_1 values with the table_2 values and return the difference for each row, like this:
result_table
hour | diff_loads_from_yesterday
2022-12-16 00:00:00 | 5000
2022-12-16 01:00:00 | 15000
I tried to use UNION ALL but it did not work. I would appreciate it if someone could help me with this problem.
You can use a window function, roughly like this (I've used an offset of 10 rows to keep the INSERT short):
CREATE TABLE test
(
    `loads` Int32,
    `time` DateTime
)
ENGINE = MergeTree
ORDER BY time
INSERT INTO test VALUES
    (10, '2022-12-15 00:00:00'), (20, '2022-12-15 01:00:00'), (30, '2022-12-15 02:00:00'),
    (40, '2022-12-15 03:00:00'), (50, '2022-12-15 04:00:00'), (60, '2022-12-15 05:00:00'),
    (70, '2022-12-15 06:00:00'), (80, '2022-12-15 07:00:00'), (90, '2022-12-15 08:00:00'),
    (10, '2022-12-15 09:00:00'), (20, '2022-12-15 10:00:00'), (30, '2022-12-15 11:00:00'),
    (40, '2022-12-15 12:00:00')
SELECT
    time,
    loads,
    first_value(loads) OVER (ORDER BY time ASC ROWS BETWEEN 10 PRECEDING AND CURRENT ROW) AS previous,
    loads - previous AS diff
FROM test
ORDER BY time ASC
┌────────────────time─┬─loads─┬─previous─┬─diff─┐
│ 2022-12-15 00:00:00 │    10 │       10 │    0 │
│ 2022-12-15 01:00:00 │    20 │       10 │   10 │
│ 2022-12-15 02:00:00 │    30 │       10 │   20 │
│ 2022-12-15 03:00:00 │    40 │       10 │   30 │
│ 2022-12-15 04:00:00 │    50 │       10 │   40 │
│ 2022-12-15 05:00:00 │    60 │       10 │   50 │
│ 2022-12-15 06:00:00 │    70 │       10 │   60 │
│ 2022-12-15 07:00:00 │    80 │       10 │   70 │
│ 2022-12-15 08:00:00 │    90 │       10 │   80 │
│ 2022-12-15 09:00:00 │    10 │       10 │    0 │
│ 2022-12-15 10:00:00 │    20 │       10 │   10 │
│ 2022-12-15 11:00:00 │    30 │       20 │   10 │
│ 2022-12-15 12:00:00 │    40 │       30 │   10 │
└─────────────────────┴───────┴──────────┴──────┘
Note that the first N values (here N = 10) are invalid, because for those rows the frame does not yet reach N rows back and first_value falls back to the first row of the table.
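One way to make those leading rows explicit, rather than comparing them against the first row, might be lagInFrame with an offset on a Nullable column. This is a sketch against the test table from the answer above, assuming a ClickHouse version where lagInFrame is available. The subtraction is done in an outer query so that no arithmetic is applied directly to the window expression.

```sql
SELECT
    time,
    loads,
    previous,
    loads - previous AS diff
FROM
(
    SELECT
        time,
        loads,
        /* value 10 rows earlier; NULL when there is no such row */
        lagInFrame(toNullable(loads), 10) OVER (ORDER BY time ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS previous
    FROM test
)
ORDER BY time ASC
```

Rows that have no row 10 positions earlier should then show NULL for both previous and diff, instead of a misleading number.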
Clickhouse leadInFrame inside COALESCE return Unknown identifier
I'm following Altinity's examples on how to implement lag/lead functions, but I can't find a way to replace NULLs with other values. Using that example and adding toNullable(a), you can see that many values are going to be NULL.
SELECT
    g,
    a,
    lagInFrame(toNullable(a)) OVER (PARTITION BY g ORDER BY a ASC Rows BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS prev,
    leadInFrame(a) OVER (PARTITION BY g ORDER BY a ASC Rows BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS next
FROM llexample
ORDER BY g ASC, a ASC
Query id: 65c75108-520f-4115-8996-328e8e62aa25
┌─g─┬──────────a─┬───────prev─┬───────next─┐
│ 0 │ 2020-01-01 │       ᴺᵁᴸᴸ │ 2020-01-04 │
│ 0 │ 2020-01-04 │ 2020-01-01 │ 2020-01-07 │
│ 0 │ 2020-01-07 │ 2020-01-04 │ 2020-01-10 │
│ 0 │ 2020-01-10 │ 2020-01-07 │ 1970-01-01 │
│ 1 │ 2020-01-02 │       ᴺᵁᴸᴸ │ 2020-01-05 │
│ 1 │ 2020-01-05 │ 2020-01-02 │ 2020-01-08 │
│ 1 │ 2020-01-08 │ 2020-01-05 │ 1970-01-01 │
│ 2 │ 2020-01-03 │       ᴺᵁᴸᴸ │ 2020-01-06 │
│ 2 │ 2020-01-06 │ 2020-01-03 │ 2020-01-09 │
│ 2 │ 2020-01-09 │ 2020-01-06 │ 1970-01-01 │
└───┴────────────┴────────────┴────────────┘
I tried to add leadInFrame inside a COALESCE, but when I do that I get an error:
SELECT
    g,
    a,
    COALESCE(
        lagInFrame(toNullable(a)) OVER (PARTITION BY g ORDER BY a ASC Rows BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),
        today()
    ) AS prev,
    leadInFrame(a) OVER (PARTITION BY g ORDER BY a ASC Rows BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS next
FROM llexample
ORDER BY g ASC, a ASC
Query id: 9685b822-7f31-45d3-9103-89f06b373876
0 rows in set. Elapsed: 0.002 sec.
Received exception from server (version 22.1.2):
Code: 47. DB::Exception: Received from localhost:9000. DB::Exception: Unknown identifier: lagInFrame(toNullable(a)) OVER (PARTITION BY g ORDER BY a ASC Rows BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING); there are columns: g, a, toNullable(a): While processing g, a, coalesce(lagInFrame(toNullable(a)) OVER (PARTITION BY g ORDER BY a ASC Rows BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), today()) AS prev, leadInFrame(a) OVER (PARTITION BY g ORDER BY a ASC Rows BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS next. (UNKNOWN_IDENTIFIER)
I also tried other conditionals and got the same error.
You simply need to use a subquery, because window functions are not fully functional yet: as the error shows, the window expression is not resolved when it is nested inside another function such as COALESCE.
select g, a, COALESCE(prev, today()) prev, next
from (
    SELECT
        g,
        a,
        lagInFrame(toNullable(a)) OVER (PARTITION BY g ORDER BY a ASC Rows BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) prev,
        leadInFrame(a) OVER (PARTITION BY g ORDER BY a ASC Rows BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS next
    FROM llexample
    ORDER BY g ASC, a ASC
)
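If the goal is only to replace the missing neighbours with a fixed value, another option may be the optional third argument of lagInFrame/leadInFrame, which supplies a default when the offset falls outside the frame. This is only a sketch and was not part of the original answer: it assumes your ClickHouse version accepts the three-argument form, and that a is a Date column (as the output above suggests) so that today() is a compatible default.

```sql
SELECT
    g,
    a,
    /* offset 1; today() is returned when the partition has no previous row */
    lagInFrame(a, 1, today()) OVER (PARTITION BY g ORDER BY a ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS prev,
    /* offset 1; today() is returned when the partition has no next row */
    leadInFrame(a, 1, today()) OVER (PARTITION BY g ORDER BY a ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS next
FROM llexample
ORDER BY g ASC, a ASC
```

Since no function wraps the window expression here, this form should not hit the UNKNOWN_IDENTIFIER error from the question, and it also avoids the toNullable conversion.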
Clickhouse pass data for inner join
In PostgreSQL we can join tables with custom data. For example:
select *
from points p
inner join (VALUES (5, '2000-1-1'::date, 1, 1)) as x(id, create_date, store_id, supplier_id)
on =
Does this kind of join exist in ClickHouse? If so, how should I write it?
SELECT *
FROM VALUES('a UInt64, s String', (1, 'one'), (2, 'two'), (3, 'three'))
┌─a─┬─s─────┐
│ 1 │ one   │
│ 2 │ two   │
│ 3 │ three │
└───┴───────┘
WITH
    [(toUInt64(1), 'one'), (2, 'two'), (3, 'three')] AS rows_array,
    arrayJoin(rows_array) AS row_tuple
SELECT
    row_tuple.1 AS number_decimal,
    row_tuple.2 AS number_string
┌─number_decimal─┬─number_string─┐
│              1 │ one           │
│              2 │ two           │
│              3 │ three         │
└────────────────┴───────────────┘
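Combining the VALUES table function with the join from the question, a sketch could wrap it in a subquery and join on that. The join key p.id = x.id and the column types are assumptions, because the question does not show the join condition or the schema of points.

```sql
SELECT *
FROM points AS p
INNER JOIN
(
    /* inline row set, playing the role of PostgreSQL's (VALUES ...) AS x(...) */
    SELECT *
    FROM VALUES('id UInt64, create_date Date, store_id UInt64, supplier_id UInt64',
                (5, toDate('2000-01-01'), 1, 1))
) AS x ON p.id = x.id /* hypothetical join key */
```

The first argument of VALUES declares the column names and types, which takes the place of the x(id, create_date, store_id, supplier_id) alias list in the PostgreSQL version.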
Clickhouse: runningAccumulate() does not work as I expect
Say we have a table testint.
SELECT *
FROM testint
┌─f1─┬─f2─┐
│  2 │  3 │
│  2 │  3 │
│  4 │  5 │
│  4 │  5 │
│  6 │  7 │
│  6 │  7 │
└────┴────┘
We try to query runningAccumulate() with sumState().
SELECT runningAccumulate(col)
FROM
(
    SELECT sumState(f1) AS col
    FROM testint
    GROUP BY f1
)
┌─runningAccumulate(col)─┐
│                      8 │
│                     12 │
│                     24 │
└────────────────────────┘
Why is the first row in the response 8, and not 4? Since we group by f1, I would expect the first row to be 4 (the sum of the two 2s in column f1).
For accumulating functions the order of the input rows matters, so just add ORDER BY to fix it:
SELECT runningAccumulate(col)
FROM
(
    SELECT sumState(f1) AS col
    FROM testint
    GROUP BY f1
    ORDER BY f1 ASC /* <-- */
)
You got the result [8, 12, 24] because the grouped sums arrived in the order [8, 4, 12]. With ordered input [4, 8, 12], the running totals are [4, 12, 24].
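To see what runningAccumulate is adding up, it can help to print each group's own sum next to the running total. A small sketch, assuming the testint table from the question; finalizeAggregation turns the intermediate sumState value into a plain number, and the aliases group_sum and running_sum are made up for this example.

```sql
SELECT
    f1,
    finalizeAggregation(col) AS group_sum,  /* sum of f1 within this group only */
    runningAccumulate(col) AS running_sum   /* sum over this and all earlier groups */
FROM
(
    SELECT f1, sumState(f1) AS col
    FROM testint
    GROUP BY f1
    ORDER BY f1 ASC /* fixes the order in which the states are accumulated */
)
```

With the data from the question this should give group sums 4, 8, 12 and running totals 4, 12, 24.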
How to use results of [with totals] modifier
We have the WITH TOTALS modifier, which can summarize values across all rows and return the total with a key value of 0, null, or something similar.
The problem is that I don't understand how I can use these values in subsequent calculations. Maybe I'm using the wrong format.
select processing_date, count(*)
from `telegram.message`
where processing_date >= '2019-05-01'
group by processing_date with totals
The documentation says that you can use WITH TOTALS in subqueries, including subqueries in the JOIN clause (in this case, the respective total values are combined).
Example with subqueries in the JOIN (from the ClickHouse test scripts on GitHub):
SELECT k, s1, s2
FROM
(
    SELECT intDiv(number, 3) AS k, sum(number) AS s1
    FROM
    (
        SELECT * FROM system.numbers LIMIT 10
    )
    GROUP BY k WITH TOTALS
)
ANY LEFT JOIN
(
    SELECT intDiv(number, 4) AS k, sum(number) AS s2
    FROM
    (
        SELECT * FROM system.numbers LIMIT 10
    )
    GROUP BY k WITH TOTALS
) USING (k)
ORDER BY k ASC
/* Result:
┌─k─┬─s1─┬─s2─┐
│ 0 │  3 │  6 │
│ 1 │ 12 │ 22 │
│ 2 │ 21 │ 17 │
│ 3 │  9 │  0 │
└───┴────┴────┘
Totals:
┌─k─┬─s1─┬─s2─┐
│ 0 │ 45 │ 45 │
└───┴────┴────┘
*/
As a workaround, you can combine the results of several totals using client libraries.
Using "with rollup" instead of "with totals" solves the problem with the format.
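A sketch of the WITH ROLLUP variant of the query from the question. With a single grouping key, the grand total comes back as an ordinary extra row in the main result set, with processing_date set to the column type's default value, so an outer query can read it like any other row. The alias cnt is an assumption added for readability.

```sql
SELECT
    processing_date,
    count(*) AS cnt
FROM `telegram.message`
WHERE processing_date >= '2019-05-01'
/* adds one extra row holding the total over all dates */
GROUP BY processing_date WITH ROLLUP
ORDER BY processing_date ASC
```

Because the total is part of the regular result rather than a separate totals block, the output format no longer matters for reading it.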