Problem:
Count distinct values in an array filtered by another array on same row (and agg higher).
Explanation:
Using this data:
In the Size D70, there are 5 pcs available (hqsize), but shops requests 15. By using the column accumulatedNeed, the 5 first stores in the column shops should receive items (since every store request 1 pcs). That is [4098,4101,4109,4076,4080].
It could also be that the values in accumulatedNeed would be [1,4,5,5,5,...,15], where shop 1 request 1 pcs, shop2 3 pcs, etc. Then only 3 stores would get.
In the size E75 there is enough stock, so every shop will receive (10 shops):
Now i want the distinct list of shops from D70 & E75, which would be be final result:
[4098,4101,4109,4076,4080,4062,4063,4067,4072,4075,4056,4058,4059,4061] (14 unique stores) (4109 is only counted once)
Wanted result:
[4098,4101,4109,4076,4080,4062,4063,4067,4072,4075,4056,4058,4059,4061]. (14 unique stores)
I'm totally open to structure the data otherwise if better.
The reason why it can't be precalculated is that the result depends on which shops that are filtered on.
Additional issue
The answer below from Vdimir is good and I've used it as basics for the final solution, but the solution does not cover (partial fullfillment).
If the stock number is in the runningNeed array we are all goodt, but remainers are not handled.
If you got:
select 5 as stock,[2,2,3,3] as need, [1,2,3,4] as shops, arrayCumSum(need) as runningNeed,arrayMap(x -> (x <= stock), runningNeed) as mask
You will get:
This is not correct since the 3rd shop should have 1 from stock (5-2-2 = 1)
I can't seem to get my head around how to make an array with "stock given", which in this case would be [2,2,1,0]
I use this query to create table with data similar to your screenshot:
CREATE TABLE t
(
Size String,
hqsize Int,
accumulatedNeed Array(Int),
shops Array(Int)
) engine = Memory;
INSERT INTO t VALUES ('D70', 5, [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15], [4098,4101,4109,4076,4080,4083,4062,4063,4067,4072,4075,4056,4057,4058,4059]),('E75', 43, [1,2,3,4,5,6,7,8,9,10], [4109,4062,4063,4067,4072,4075,4056,4058,4059,4061]);
Find which shops that can receive enough items:
SELECT arrayMap(x -> (x <= hqsize), accumulatedNeed) as mask FROM t;
┌─mask────────────────────────────┐
│ [1,1,1,1,1,0,0,0,0,0,0,0,0,0,0] │
│ [1,1,1,1,1,1,1,1,1,1] │
└─────────────────────────────────┘
Filter not fulfilled shops according to this mask:
Note that shops and accumulatedNeed have to have equals sizes.
SELECT arrayFilter((x,y) -> y, shops, mask) as fulfilled_shops, arrayMap(x -> (x <= hqsize), accumulatedNeed) as mask FROM t;
┌─fulfilled_shops─────────────────────────────────────┬─mask────────────────────────────┐
│ [4098,4101,4109,4076,4080] │ [1,1,1,1,1,0,0,0,0,0,0,0,0,0,0] │
│ [4109,4062,4063,4067,4072,4075,4056,4058,4059,4061] │ [1,1,1,1,1,1,1,1,1,1] │
└─────────────────────────────────────────────────────┴─────────────────────────────────┘
Then you can create table with all distinct shops:
SELECT DISTINCT arrayJoin(fulfilled_shops) as shops FROM (
SELECT arrayMap(x -> (x <= hqsize), accumulatedNeed) as mask, arrayFilter((x,y) -> y, shops, mask) as fulfilled_shops FROM t
);
┌─shops─┐
│ 4098 │
│ 4101 │
│ 4109 │
│ 4076 │
│ 4080 │
│ 4062 │
│ 4063 │
│ 4067 │
│ 4072 │
│ 4075 │
│ 4056 │
│ 4058 │
│ 4059 │
│ 4061 │
└───────┘
14 rows in set. Elapsed: 0.049 sec.
Or if you need single array group it back:
SELECT groupArrayDistinct(arrayJoin(fulfilled_shops)) as shops FROM (
SELECT arrayMap(x -> (x <= hqsize), accumulatedNeed) as mask, arrayFilter((x,y) -> y, shops, mask) as fulfilled_shops FROM t
);
┌─shops───────────────────────────────────────────────────────────────────┐
│ [4080,4076,4101,4075,4056,4061,4062,4063,4109,4058,4067,4059,4072,4098] │
└─────────────────────────────────────────────────────────────────────────┘
If you need data only from D70 & E75 you can filter extra rows from table with WHERE before.
I'm looking for an efficient way on how to query the n past values as array in ClickHouse for each row ordered by one column (i.e. Time), where the values should be retrieved as array.
Window functions are still not supported in ClickHouse (see #1469), so I was hoping for a work-around using aggregation functions like groupArray()?
Table:
Time | Value
12:11 | 1
12:12 | 2
12:13 | 3
12:14 | 4
12:15 | 5
12:16 | 6
Expected result with a window of size n=3:
Time | Value
12:13 | [1,2,3]
12:14 | [2,3,4]
12:15 | [3,4,5]
12:16 | [4,5,6]
What are the ways/functions currently used in ClickHouse to efficiently query a sliding/moving window and how can I achieve my desired result?
EDIT:
My solution based on response of #vladimir:
select max(Time) as Time, groupArray(Value) as Values
from (
select
*,
rowNumberInAllBlocks() as row_number,
arrayJoin(range(row_number, row_number + 3)) as window_id
from (
/* BEGIN emulate origin dataset */
select toDateTime(a) as Time, rowNumberInAllBlocks()+1 as Value
from (
select arrayJoin([
'2020-01-01 12:11:00',
'2020-01-01 12:12:00',
'2020-01-01 12:13:00',
'2020-01-01 12:14:00',
'2020-01-01 12:15:00',
'2020-01-01 12:16:00']) a
)
order by Time
/* END emulate origin dataset */
)
order by Time
) s
group by window_id
having length(Values) = 3
order by Time
Note that 3 appears twice in the query and represents the window size n.
Output:
┌────────────────Time─┬─Values──┐
│ 2020-01-01 12:13:00 │ [1,2,3] │
│ 2020-01-01 12:14:00 │ [2,3,4] │
│ 2020-01-01 12:15:00 │ [3,4,5] │
│ 2020-01-01 12:16:00 │ [4,5,6] │
└─────────────────────┴─────────┘
Starting from version 21.4 added the full support of window-functions. At this moment it was marked as an experimental feature.
SELECT
Time,
groupArray(any(Value)) OVER (ORDER BY Time ASC ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS Values
FROM
(
/* Emulate the test dataset, */
select toDateTime(a) as Time, rowNumberInAllBlocks()+1 as Value
from (
select arrayJoin([
'2020-01-01 12:11:00',
'2020-01-01 12:12:00',
'2020-01-01 12:13:00',
'2020-01-01 12:14:00',
'2020-01-01 12:15:00',
'2020-01-01 12:16:00']) a
)
order by Time
)
GROUP BY Time
SETTINGS allow_experimental_window_functions = 1
/*
┌────────────────Time─┬─Values──┐
│ 2020-01-01 12:11:00 │ [1] │
│ 2020-01-01 12:12:00 │ [1,2] │
│ 2020-01-01 12:13:00 │ [1,2,3] │
│ 2020-01-01 12:14:00 │ [2,3,4] │
│ 2020-01-01 12:15:00 │ [3,4,5] │
│ 2020-01-01 12:16:00 │ [4,5,6] │
└─────────────────────┴─────────┘
*/
See https://altinity.com/blog/clickhouse-window-functions-current-state-of-the-art.
ClickHouse has several datablock-scoped window functions, let's take neighbor:
SELECT Time, [neighbor(Value, -2), neighbor(Value, -1), neighbor(Value, 0)] Values
FROM (
/* emulate origin data */
SELECT toDateTime(data.1) as Time, data.2 as Value
FROM (
SELECT arrayJoin([('2020-01-01 12:11:00', 1),
('2020-01-01 12:12:00', 2),
('2020-01-01 12:13:00', 3),
('2020-01-01 12:14:00', 4),
('2020-01-01 12:15:00', 5),
('2020-01-01 12:16:00', 6)]) as data)
)
/*
┌────────────────Time─┬─Values──┐
│ 2020-01-01 12:11:00 │ [0,0,1] │
│ 2020-01-01 12:12:00 │ [0,1,2] │
│ 2020-01-01 12:13:00 │ [1,2,3] │
│ 2020-01-01 12:14:00 │ [2,3,4] │
│ 2020-01-01 12:15:00 │ [3,4,5] │
│ 2020-01-01 12:16:00 │ [4,5,6] │
└─────────────────────┴─────────┘
*/
An alternate way based on the duplication of source rows by window_size times:
SELECT
arrayReduce('max', arrayMap(x -> x.1, raw_result)) Time,
arrayMap(x -> x.2, raw_result) Values
FROM (
SELECT groupArray((Time, Value)) raw_result, max(row_number) max_row_number
FROM (
SELECT
3 AS window_size,
*,
rowNumberInAllBlocks() row_number,
arrayJoin(arrayMap(x -> x + row_number, range(window_size))) window_id
FROM (
/* emulate origin dataset */
SELECT toDateTime(data.1) as Time, data.2 as Value
FROM (
SELECT arrayJoin([('2020-01-01 12:11:00', 1),
('2020-01-01 12:12:00', 2),
('2020-01-01 12:13:00', 3),
('2020-01-01 12:14:00', 4),
('2020-01-01 12:15:00', 5),
('2020-01-01 12:16:00', 6)]) as data)
ORDER BY Value
)
)
GROUP BY window_id
HAVING max_row_number = window_id
ORDER BY window_id
)
/*
┌────────────────Time─┬─Values──┐
│ 2020-01-01 12:11:00 │ [1] │
│ 2020-01-01 12:12:00 │ [1,2] │
│ 2020-01-01 12:13:00 │ [1,2,3] │
│ 2020-01-01 12:14:00 │ [2,3,4] │
│ 2020-01-01 12:15:00 │ [3,4,5] │
│ 2020-01-01 12:16:00 │ [4,5,6] │
└─────────────────────┴─────────┘
*/
Extra example:
SELECT
arrayReduce('max', arrayMap(x -> x.1, raw_result)) id,
arrayMap(x -> x.2, raw_result) values
FROM (
SELECT groupArray((id, value)) raw_result, max(row_number) max_row_number
FROM (
SELECT
48 AS window_size,
*,
rowNumberInAllBlocks() row_number,
arrayJoin(arrayMap(x -> x + row_number, range(window_size))) window_id
FROM (
/* the origin dataset */
SELECT number AS id, number AS value
FROM numbers(4096)
)
)
GROUP BY window_id
HAVING max_row_number = window_id
ORDER BY window_id
)
/*
┌─id─┬─values────────────────┐
│ 0 │ [0] │
│ 1 │ [0,1] │
│ 2 │ [0,1,2] │
│ 3 │ [0,1,2,3] │
│ 4 │ [0,1,2,3,4] │
│ 5 │ [0,1,2,3,4,5] │
│ 6 │ [0,1,2,3,4,5,6] │
│ 7 │ [0,1,2,3,4,5,6,7] │
..
│ 56 │ [9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56] │
│ 57 │ [10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57] │
│ 58 │ [11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58] │
│ 59 │ [12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59] │
│ 60 │ [13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60] │
..
│ 4093 │ [4046,4047,4048,4049,4050,4051,4052,4053,4054,4055,4056,4057,4058,4059,4060,4061,4062,4063,4064,4065,4066,4067,4068,4069,4070,4071,4072,4073,4074,4075,4076,4077,4078,4079,4080,4081,4082,4083,4084,4085,4086,4087,4088,4089,4090,4091,4092,4093] │
│ 4094 │ [4047,4048,4049,4050,4051,4052,4053,4054,4055,4056,4057,4058,4059,4060,4061,4062,4063,4064,4065,4066,4067,4068,4069,4070,4071,4072,4073,4074,4075,4076,4077,4078,4079,4080,4081,4082,4083,4084,4085,4086,4087,4088,4089,4090,4091,4092,4093,4094] │
│ 4095 │ [4048,4049,4050,4051,4052,4053,4054,4055,4056,4057,4058,4059,4060,4061,4062,4063,4064,4065,4066,4067,4068,4069,4070,4071,4072,4073,4074,4075,4076,4077,4078,4079,4080,4081,4082,4083,4084,4085,4086,4087,4088,4089,4090,4091,4092,4093,4094,4095] │
└──────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
*/
For Clickhouse 19, where range function takes only single input, you can use following query
select max(Time) as Time, groupArray(Value) as Values
from (
select
*,
rowNumberInAllBlocks() as row_number,
arrayJoin( arrayMap(x -> x + row_number, range(3)) ) as window_id
from (
/* BEGIN emulate origin dataset */
select toDateTime(a) as Time, rowNumberInAllBlocks()+1 as Value
from (
select arrayJoin([
'2020-01-01 12:11:00',
'2020-01-01 12:12:00',
'2020-01-01 12:13:00',
'2020-01-01 12:14:00',
'2020-01-01 12:15:00',
'2020-01-01 12:16:00']) a
)
order by Time
/* END emulate origin dataset */
)
order by Time
) s
group by window_id
having length(Values) = 3
order by Time