ClickHouse: runningAccumulate() does not work as I expect

Say, we have a table testint.
SELECT *
FROM testint
┌─f1─┬─f2─┐
│ 2 │ 3 │
│ 2 │ 3 │
│ 4 │ 5 │
│ 4 │ 5 │
│ 6 │ 7 │
│ 6 │ 7 │
└────┴────┘
We try to query runningAccumulate() with sumState().
SELECT runningAccumulate(col)
FROM
(
SELECT sumState(f1) AS col
FROM testint
GROUP BY f1
)
┌─runningAccumulate(col)─┐
│ 8 │
│ 12 │
│ 24 │
└────────────────────────┘
Why is the first row in the response 8, and not 4? If we are grouping by f1, the first row seems like it should be 4 (we sum the first 2 and the second 2 in column f1).

For accumulation functions the order of the rows is important, so just add an ORDER BY to fix it:
SELECT runningAccumulate(col)
FROM
(
SELECT sumState(f1) AS col
FROM testint
GROUP BY f1
ORDER BY f1 ASC /* <-- */
)
You got the result [8, 12, 24] because the inner query produced the rows in the order [8, 4, 12], when the ordered input, [4, 8, 12], should have been used.
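The behaviour is easy to reproduce outside ClickHouse: runningAccumulate is essentially a cumulative sum over the rows in whatever order they arrive, so feeding it the unordered group sums gives the surprising output. A minimal Python sketch of that semantics:

```python
from itertools import accumulate

# runningAccumulate behaves like a cumulative sum over the rows
# in the order the block delivers them.
unordered = [8, 4, 12]   # the order GROUP BY happened to emit
ordered   = [4, 8, 12]   # the order after ORDER BY f1

print(list(accumulate(unordered)))  # [8, 12, 24] -- the surprising result
print(list(accumulate(ordered)))    # [4, 12, 24] -- the expected result
```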

Idea for algorithm to arrange balls of different weights into boxes

The problem
Suppose I have some balls and box types:
Each ball has a different weight.
Each type of box has its min and max capacity, and a penalty when used.
There is an unlimited number of boxes of each type.
How can I arrange the balls into the fewest boxes such that:
The total weight of the balls in each box is within its min and max capacity.
The total penalty of the used boxes is minimized.
There may be multiple solutions. However, the accepted solution is where the total weight of the balls in each box is nearest to its max capacity.
Example
For example, there are 5 balls of weight 31, 14, 13, 12, 7 respectively, and 3 box types:
type│ min │ max │penalty
────┼─────┼─────┼───────
A │ 11 │ 20 │ 1
B │ 21 │ 30 │ 1
C │ 31 │ 40 │ 5
The possible combinations are:
boxTypes│ 31 │ 14 │ 13 │ 12 │ 7 │ penalty
────────┼────┼────┼────┼────┼────┼─────────
ABC │ C │ B │ B │ A │ C │ 7
BBC │ C │ B1 │ B2 │ B2 │ B1 │ 7
CC │ C1 │ C2 │ C2 │ C2 │ C1 │ 10
ACC │ C1 │ C2 │ C2 │ A │ C2 │ 11
and many other unlisted possibilities where the set of box types is the same or the penalty is simply too high.
Notice that there are 2 solutions with the same penalty. However, considering the third condition:
boxTypes │ box1 │ box2 │ box3 │ shortfall
─────────┼──────┼──────┼──────┼──────────────────────────────────
ABC │ 12 │ 27 │ 38 │ (20-12) + (30-27) + (40-38) = 13
BBC │ 21 │ 25 │ 31 │ (30-21) + (30-25) + (40-31) = 23
The ABC combination is chosen because it leaves the least unused capacity in the boxes.
My code
I am currently generating all groupings of the balls recursively, and checking whether there is a set of boxes that fits the ball groups.
I am able to improve the performance by:
Halting early when a group's weight exceeds the maximum capacity (40 in this example)
Limiting the number of boxes (2-3 instead of 5, i.e. one box per ball)
However, my solution still cannot handle more than 15 balls.
Is there a better algorithm than brute force to solve this problem?
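For reference, the brute-force search described above can be sketched as follows. This is an illustrative sketch, not a definitive solver: the names (BOX_TYPES, best_box, solve) and the exact tie-breaking rule, ranking solutions by (total penalty, total shortfall) as in the example, are my assumptions.

```python
# Enumerate all set partitions of the balls, assign each group the
# cheapest feasible box type, rank by (total penalty, total shortfall).
BOX_TYPES = {'A': (11, 20, 1), 'B': (21, 30, 1), 'C': (31, 40, 5)}

def partitions(items):
    """Yield every set partition of items as a list of groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        for i in range(len(part)):          # put first into an existing group
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part              # or open a new group for it

def best_box(weight):
    """Cheapest feasible (penalty, shortfall, type) for a group weight."""
    fits = [(pen, mx - weight, name)
            for name, (mn, mx, pen) in BOX_TYPES.items()
            if mn <= weight <= mx]
    return min(fits) if fits else None      # None: this group fits no box

def solve(balls):
    best = None
    for part in partitions(balls):
        boxes = [best_box(sum(g)) for g in part]
        if None in boxes:
            continue                        # infeasible partition, prune it
        key = (sum(b[0] for b in boxes), sum(b[1] for b in boxes))
        if best is None or key < best[0]:
            best = (key, sorted(b[2] for b in boxes))
    return best

(penalty, shortfall), box_types = solve([31, 14, 13, 12, 7])
print(penalty, shortfall, box_types)   # 7 13 ['A', 'B', 'C']
```

The number of set partitions grows as the Bell numbers (Bell(15) is already over a billion), which is why plain enumeration stalls past roughly 15 balls; branch and bound or dynamic programming over subsets is the usual next step.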

ClickHouse: pass data for inner join

In PostgreSQL we can join tables with custom data.
For example:
select *
from points p
inner join (VALUES (5, '2000-1-1'::date, 1, 1)) as x(id, create_date, store_id, supplier_id)
on p.id = x.id
Does this kind of join exist in ClickHouse? If so, how should I write it?
https://github.com/ClickHouse/ClickHouse/issues/5984
SELECT *
FROM VALUES('a UInt64, s String', (1, 'one'), (2, 'two'), (3, 'three'))
┌─a─┬─s─────┐
│ 1 │ one │
│ 2 │ two │
│ 3 │ three │
└───┴───────┘
WITH
[(toUInt64(1), 'one'), (2, 'two'), (3, 'three')] AS rows_array,
arrayJoin(rows_array) AS row_tuple
SELECT
row_tuple.1 AS number_decimal,
row_tuple.2 AS number_string
┌─number_decimal─┬─number_string─┐
│ 1 │ one │
│ 2 │ two │
│ 3 │ three │
└────────────────┴───────────────┘
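For clarity, the join the question asks about is just an inner join against an inline row set; a plain Python sketch (with made-up points data, purely to pin down the expected semantics) looks like this:

```python
# Hypothetical rows standing in for the `points` table.
points = [
    {'id': 5, 'name': 'p5'},
    {'id': 6, 'name': 'p6'},
]

# Inline rows, like VALUES (5, '2000-1-1', 1, 1) in the question.
inline = [
    {'id': 5, 'create_date': '2000-01-01', 'store_id': 1, 'supplier_id': 1},
]

# inner join on p.id = x.id: keep only matching pairs, merge their columns
joined = [{**p, **x} for p in points for x in inline if p['id'] == x['id']]
print(joined)
```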

Aggregate query over multiple columns (one is an array) in clickhouse

I'm trying to get aggregated values for each of the att1 and att2 columns, and also for each value of the arrays in the att3 column.
So far I have tried:
create table test(value Float32, att1 String, att2 String, att3 Array(String))
ENGINE=MergeTree() ORDER BY tuple();
INSERT INTO test VALUES (2.0, 'a', 'Z', ['sports', 'office', 'anothertag'])
INSERT INTO test VALUES (4.0, 'b', 'X', ['sports', 'office', 'tag'])
INSERT INTO test VALUES (6.0, 'b', 'X', ['sports', 'internet', 'planes'])
SELECT * from test;
┌─value─┬─att1─┬─att2─┬─att3───────────────────────────┐
│ 6 │ b │ X │ ['sports','internet','planes'] │
└───────┴──────┴──────┴────────────────────────────────┘
┌─value─┬─att1─┬─att2─┬─att3─────────────────────────────┐
│ 2 │ a │ Z │ ['sports','office','anothertag'] │
└───────┴──────┴──────┴──────────────────────────────────┘
┌─value─┬─att1─┬─att2─┬─att3──────────────────────┐
│ 4 │ b │ X │ ['sports','office','tag'] │
└───────┴──────┴──────┴───────────────────────────┘
I want to get the aggregate, sum(value), for each distinct attribute value.
I have it working for the att1 and att2 columns with:
SELECT
att1,
att2,
sum(value)
FROM test
GROUP BY
att1,
att2
WITH CUBE
Result:
┌─att1─┬─att2─┬─sum(value)─┐
│ b │ X │ 10 │
│ a │ Z │ 2 │
└──────┴──────┴────────────┘
┌─att1─┬─att2─┬─sum(value)─┐
│ a │ │ 2 │
│ b │ │ 10 │
└──────┴──────┴────────────┘
┌─att1─┬─att2─┬─sum(value)─┐
│ │ Z │ 2 │
│ │ X │ 10 │
└──────┴──────┴────────────┘
┌─att1─┬─att2─┬─sum(value)─┐
│ │ │ 12 │
└──────┴──────┴────────────┘
This gives me more than I need, but the second and third blocks contain the correct results.
But I also need the sum for each value in att3. I have it working in a separate query, but when I try to make it a single query:
SELECT
att1,
att2,
arrayJoin(att3) AS tags,
sum(value)
FROM test
GROUP BY
att1,
att2,
tags
WITH CUBE
Which gives (among other things):
┌─att1─┬─att2─┬─tags─┬─sum(value)─┐
│ a │ │ │ 6 │
│ b │ │ │ 30 │
└──────┴──────┴──────┴────────────┘
┌─att1─┬─att2─┬─tags───────┬─sum(value)─┐
│ │ │ tag │ 4 │
│ │ │ anothertag │ 2 │
│ │ │ planes │ 6 │
│ │ │ sports │ 12 │
│ │ │ internet │ 6 │
│ │ │ office │ 6 │
└──────┴──────┴────────────┴────────────┘
Since arrayJoin unfolds the array into rows, the sum(value) figures for att1 are no longer accurate.
I've also tried the LEFT ARRAY JOIN syntax, with the same results.
Updated:
The ideal result would be something like:
┌─'att1'─┬─'att2'─┬─'tags'─┬─'sum(value)'─┐
│ a │ │ │ 2 │
│ b │ │ │ 10 │
│ │ X │ │ 10 │
│ │ Z │ │ 2 │
│ │ │ sports │ 12 │
│ │ │ office │ 6 │
│ │ │ anot.. │ 2 │
│ │ │ tag │ 4 │
│ │ │internet│ 6 │
│ │ │planes │ 6 │
└────────┴────────┴────────┴──────────────┘
They could be in different rows (result blocks), but ideally produced by one single query.
SELECT
sumMap(([att1], [value])) AS r1,
sumMap(([att2], [value])) AS r2,
sumMap((att3, replicate(value, att3))) AS r3
FROM test
┌─r1─────────────────┬─r2─────────────────┬─r3──────────────────────────────────────────────────────────────────────────┐
│ (['a','b'],[2,10]) │ (['X','Z'],[10,2]) │ (['anothertag','internet','office','planes','sports','tag'],[2,6,6,6,12,4]) │
└────────────────────┴────────────────────┴─────────────────────────────────────────────────────────────────────────────┘
SELECT
(arrayJoin(arrayZip((arrayJoin([sumMap(([att1], [value])), sumMap(([att2], [value])), sumMap((att3, replicate(value, att3)))]) AS r).1, r.2)) AS x).1 AS y,
x.2 AS z
FROM test
┌─y──────────┬──z─┐
│ a │ 2 │
│ b │ 10 │
│ X │ 10 │
│ Z │ 2 │
│ anothertag │ 2 │
│ internet │ 6 │
│ office │ 6 │
│ planes │ 6 │
│ sports │ 12 │
│ tag │ 4 │
└────────────┴────┘
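The sumMap calls above build, for each column, a map from attribute value to the sum of value. The same aggregation is easy to mirror with a Counter, which may help when cross-checking the ClickHouse output (the rows below are the three INSERTs from the question):

```python
from collections import Counter

# The three inserted rows: (value, att1, att2, att3)
rows = [
    (2.0, 'a', 'Z', ['sports', 'office', 'anothertag']),
    (4.0, 'b', 'X', ['sports', 'office', 'tag']),
    (6.0, 'b', 'X', ['sports', 'internet', 'planes']),
]

r1, r2, r3 = Counter(), Counter(), Counter()
for value, att1, att2, att3 in rows:
    r1[att1] += value                 # sumMap(([att1], [value]))
    r2[att2] += value                 # sumMap(([att2], [value]))
    for tag in att3:                  # sumMap((att3, replicate(value, att3)))
        r3[tag] += value

print(dict(r1))  # {'a': 2.0, 'b': 10.0}
print(dict(r2))  # {'Z': 2.0, 'X': 10.0}
```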
I think the most straightforward way is to combine two queries:
SELECT
att1,
att2,
'' AS tags,
sum(value)
FROM test
GROUP BY
att1,
att2
WITH CUBE
UNION ALL
SELECT
'' AS att1,
'' AS att2,
arrayJoin(att3) AS tags,
sum(value)
FROM test
GROUP BY tags
/*
┌─att1─┬─att2─┬─tags───────┬─sum(value)─┐
│ │ │ internet │ 6 │
│ │ │ sports │ 12 │
│ │ │ office │ 6 │
│ │ │ tag │ 4 │
│ │ │ planes │ 6 │
│ │ │ anothertag │ 2 │
└──────┴──────┴────────────┴────────────┘
┌─att1─┬─att2─┬─tags─┬─sum(value)─┐
│ b │ X │ │ 10 │
│ a │ Z │ │ 2 │
└──────┴──────┴──────┴────────────┘
┌─att1─┬─att2─┬─tags─┬─sum(value)─┐
│ a │ │ │ 2 │
│ b │ │ │ 10 │
└──────┴──────┴──────┴────────────┘
┌─att1─┬─att2─┬─tags─┬─sum(value)─┐
│ │ Z │ │ 2 │
│ │ X │ │ 10 │
└──────┴──────┴──────┴────────────┘
┌─att1─┬─att2─┬─tags─┬─sum(value)─┐
│ │ │ │ 12 │
└──────┴──────┴──────┴────────────┘
*/

Clickhouse: Sliding / moving window

I'm looking for an efficient way to query, for each row in ClickHouse, the n past values as an array, ordered by one column (e.g. Time).
Window functions are still not supported in ClickHouse (see #1469), so I was hoping for a workaround using aggregate functions like groupArray().
Table:
Time | Value
12:11 | 1
12:12 | 2
12:13 | 3
12:14 | 4
12:15 | 5
12:16 | 6
Expected result with a window of size n=3:
Time | Value
12:13 | [1,2,3]
12:14 | [2,3,4]
12:15 | [3,4,5]
12:16 | [4,5,6]
What are the ways/functions currently used in ClickHouse to efficiently query a sliding/moving window and how can I achieve my desired result?
EDIT:
My solution, based on vladimir's response:
select max(Time) as Time, groupArray(Value) as Values
from (
select
*,
rowNumberInAllBlocks() as row_number,
arrayJoin(range(row_number, row_number + 3)) as window_id
from (
/* BEGIN emulate origin dataset */
select toDateTime(a) as Time, rowNumberInAllBlocks()+1 as Value
from (
select arrayJoin([
'2020-01-01 12:11:00',
'2020-01-01 12:12:00',
'2020-01-01 12:13:00',
'2020-01-01 12:14:00',
'2020-01-01 12:15:00',
'2020-01-01 12:16:00']) a
)
order by Time
/* END emulate origin dataset */
)
order by Time
) s
group by window_id
having length(Values) = 3
order by Time
Note that 3 appears twice in the query and represents the window size n.
Output:
┌────────────────Time─┬─Values──┐
│ 2020-01-01 12:13:00 │ [1,2,3] │
│ 2020-01-01 12:14:00 │ [2,3,4] │
│ 2020-01-01 12:15:00 │ [3,4,5] │
│ 2020-01-01 12:16:00 │ [4,5,6] │
└─────────────────────┴─────────┘
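The query above emulates a classic fixed-size sliding window. In ordinary code the same result comes from a bounded deque, which makes a handy cross-check for the SQL (a sketch, with the HAVING clause mirrored by the completeness test):

```python
from collections import deque

def sliding_windows(values, n):
    """Yield the list of the last n values for each complete window."""
    window = deque(maxlen=n)          # old values fall out automatically
    for v in values:
        window.append(v)
        if len(window) == n:          # mirrors HAVING length(Values) = 3
            yield list(window)

print(list(sliding_windows([1, 2, 3, 4, 5, 6], 3)))
# [[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]
```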
Starting from version 21.4, ClickHouse has full support for window functions. At the moment they are marked as an experimental feature.
SELECT
Time,
groupArray(any(Value)) OVER (ORDER BY Time ASC ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS Values
FROM
(
/* Emulate the test dataset, */
select toDateTime(a) as Time, rowNumberInAllBlocks()+1 as Value
from (
select arrayJoin([
'2020-01-01 12:11:00',
'2020-01-01 12:12:00',
'2020-01-01 12:13:00',
'2020-01-01 12:14:00',
'2020-01-01 12:15:00',
'2020-01-01 12:16:00']) a
)
order by Time
)
GROUP BY Time
SETTINGS allow_experimental_window_functions = 1
/*
┌────────────────Time─┬─Values──┐
│ 2020-01-01 12:11:00 │ [1] │
│ 2020-01-01 12:12:00 │ [1,2] │
│ 2020-01-01 12:13:00 │ [1,2,3] │
│ 2020-01-01 12:14:00 │ [2,3,4] │
│ 2020-01-01 12:15:00 │ [3,4,5] │
│ 2020-01-01 12:16:00 │ [4,5,6] │
└─────────────────────┴─────────┘
*/
See https://altinity.com/blog/clickhouse-window-functions-current-state-of-the-art.
ClickHouse has several window-like functions scoped to a data block; let's take neighbor:
SELECT Time, [neighbor(Value, -2), neighbor(Value, -1), neighbor(Value, 0)] Values
FROM (
/* emulate origin data */
SELECT toDateTime(data.1) as Time, data.2 as Value
FROM (
SELECT arrayJoin([('2020-01-01 12:11:00', 1),
('2020-01-01 12:12:00', 2),
('2020-01-01 12:13:00', 3),
('2020-01-01 12:14:00', 4),
('2020-01-01 12:15:00', 5),
('2020-01-01 12:16:00', 6)]) as data)
)
/*
┌────────────────Time─┬─Values──┐
│ 2020-01-01 12:11:00 │ [0,0,1] │
│ 2020-01-01 12:12:00 │ [0,1,2] │
│ 2020-01-01 12:13:00 │ [1,2,3] │
│ 2020-01-01 12:14:00 │ [2,3,4] │
│ 2020-01-01 12:15:00 │ [3,4,5] │
│ 2020-01-01 12:16:00 │ [4,5,6] │
└─────────────────────┴─────────┘
*/
An alternative approach, based on duplicating each source row window_size times:
SELECT
arrayReduce('max', arrayMap(x -> x.1, raw_result)) Time,
arrayMap(x -> x.2, raw_result) Values
FROM (
SELECT groupArray((Time, Value)) raw_result, max(row_number) max_row_number
FROM (
SELECT
3 AS window_size,
*,
rowNumberInAllBlocks() row_number,
arrayJoin(arrayMap(x -> x + row_number, range(window_size))) window_id
FROM (
/* emulate origin dataset */
SELECT toDateTime(data.1) as Time, data.2 as Value
FROM (
SELECT arrayJoin([('2020-01-01 12:11:00', 1),
('2020-01-01 12:12:00', 2),
('2020-01-01 12:13:00', 3),
('2020-01-01 12:14:00', 4),
('2020-01-01 12:15:00', 5),
('2020-01-01 12:16:00', 6)]) as data)
ORDER BY Value
)
)
GROUP BY window_id
HAVING max_row_number = window_id
ORDER BY window_id
)
/*
┌────────────────Time─┬─Values──┐
│ 2020-01-01 12:11:00 │ [1] │
│ 2020-01-01 12:12:00 │ [1,2] │
│ 2020-01-01 12:13:00 │ [1,2,3] │
│ 2020-01-01 12:14:00 │ [2,3,4] │
│ 2020-01-01 12:15:00 │ [3,4,5] │
│ 2020-01-01 12:16:00 │ [4,5,6] │
└─────────────────────┴─────────┘
*/
Extra example:
SELECT
arrayReduce('max', arrayMap(x -> x.1, raw_result)) id,
arrayMap(x -> x.2, raw_result) values
FROM (
SELECT groupArray((id, value)) raw_result, max(row_number) max_row_number
FROM (
SELECT
48 AS window_size,
*,
rowNumberInAllBlocks() row_number,
arrayJoin(arrayMap(x -> x + row_number, range(window_size))) window_id
FROM (
/* the origin dataset */
SELECT number AS id, number AS value
FROM numbers(4096)
)
)
GROUP BY window_id
HAVING max_row_number = window_id
ORDER BY window_id
)
/*
┌─id─┬─values────────────────┐
│ 0 │ [0] │
│ 1 │ [0,1] │
│ 2 │ [0,1,2] │
│ 3 │ [0,1,2,3] │
│ 4 │ [0,1,2,3,4] │
│ 5 │ [0,1,2,3,4,5] │
│ 6 │ [0,1,2,3,4,5,6] │
│ 7 │ [0,1,2,3,4,5,6,7] │
..
│ 56 │ [9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56] │
│ 57 │ [10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57] │
│ 58 │ [11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58] │
│ 59 │ [12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59] │
│ 60 │ [13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60] │
..
│ 4093 │ [4046,4047,4048,4049,4050,4051,4052,4053,4054,4055,4056,4057,4058,4059,4060,4061,4062,4063,4064,4065,4066,4067,4068,4069,4070,4071,4072,4073,4074,4075,4076,4077,4078,4079,4080,4081,4082,4083,4084,4085,4086,4087,4088,4089,4090,4091,4092,4093] │
│ 4094 │ [4047,4048,4049,4050,4051,4052,4053,4054,4055,4056,4057,4058,4059,4060,4061,4062,4063,4064,4065,4066,4067,4068,4069,4070,4071,4072,4073,4074,4075,4076,4077,4078,4079,4080,4081,4082,4083,4084,4085,4086,4087,4088,4089,4090,4091,4092,4093,4094] │
│ 4095 │ [4048,4049,4050,4051,4052,4053,4054,4055,4056,4057,4058,4059,4060,4061,4062,4063,4064,4065,4066,4067,4068,4069,4070,4071,4072,4073,4074,4075,4076,4077,4078,4079,4080,4081,4082,4083,4084,4085,4086,4087,4088,4089,4090,4091,4092,4093,4094,4095] │
└──────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
*/
For ClickHouse 19, where the range function takes only a single argument, you can use the following query:
select max(Time) as Time, groupArray(Value) as Values
from (
select
*,
rowNumberInAllBlocks() as row_number,
arrayJoin( arrayMap(x -> x + row_number, range(3)) ) as window_id
from (
/* BEGIN emulate origin dataset */
select toDateTime(a) as Time, rowNumberInAllBlocks()+1 as Value
from (
select arrayJoin([
'2020-01-01 12:11:00',
'2020-01-01 12:12:00',
'2020-01-01 12:13:00',
'2020-01-01 12:14:00',
'2020-01-01 12:15:00',
'2020-01-01 12:16:00']) a
)
order by Time
/* END emulate origin dataset */
)
order by Time
) s
group by window_id
having length(Values) = 3
order by Time

How to use the results of the WITH TOTALS modifier

We have the WITH TOTALS modifier, which can summarize values across all rows and return the totals row with a key value of 0, null, or something similar.
The problem is that I don't understand how I can use these values in further calculations.
Maybe I'm using the wrong format:
select processing_date,count(*)
from `telegram.message`
where processing_date>='2019-05-01'
group by processing_date with totals
The documentation says that
You can use WITH TOTALS in subqueries, including subqueries in the
JOIN clause (in this case, the respective total values are combined).
An example of subqueries in a JOIN (from the ClickHouse test scripts on GitHub):
SELECT k, s1, s2
FROM
(
SELECT intDiv(number, 3) AS k, sum(number) AS s1
FROM
(
SELECT *
FROM system.numbers
LIMIT 10
)
GROUP BY k WITH TOTALS
)
ANY LEFT JOIN
(
SELECT intDiv(number, 4) AS k, sum(number) AS s2
FROM
(
SELECT *
FROM system.numbers
LIMIT 10
)
GROUP BY k WITH TOTALS
) USING (k)
ORDER BY k ASC
/* Result:
┌─k─┬─s1─┬─s2─┐
│ 0 │ 3 │ 6 │
│ 1 │ 12 │ 22 │
│ 2 │ 21 │ 17 │
│ 3 │ 9 │ 0 │
└───┴────┴────┘
Totals:
┌─k─┬─s1─┬─s2─┐
│ 0 │ 45 │ 45 │
└───┴────┴────┘
*/
As a workaround, you can combine the results of several totals rows using client libraries.
Using WITH ROLLUP instead of WITH TOTALS solves the formatting problems.