How can I calculate the points of each user per day, as the sum of all points from the beginning up to that day, in ClickHouse?

I have this data in ClickHouse:
The final point total of each user on a day is sum(point) from the beginning up to that day.
E.g. the points of user 1 on 2020-07-02 are 800, and on 2020-07-03 they are 200.
I need this result: the points of each user per day:

select uid, d, t
from
(
    select
        uid,
        groupArray(date) dg,
        arrayCumSum(groupArray(spt)) gt
    from
    (
        select uid, date, sum(pt) spt
        from
        (
            select 1 tid, '2020-07-01' date, 1 uid, 500 pt
            union all
            select 1 tid, '2020-07-02' date, 1 uid, 300 pt
            union all
            select 1 tid, '2020-07-03' date, 1 uid, -600 pt
        )
        group by uid, date
        order by uid, date
    )
    group by uid
)
array join dg as d, gt as t
┌─uid─┬─d──────────┬───t─┐
│ 1 │ 2020-07-01 │ 500 │
│ 1 │ 2020-07-02 │ 800 │
│ 1 │ 2020-07-03 │ 200 │
└─────┴────────────┴─────┘
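On recent ClickHouse versions the same running total can be written with a window function, which avoids the groupArray / ARRAY JOIN round-trip. A sketch over the same emulated dataset (window functions were experimental at first; older 21.x versions may need SET allow_experimental_window_functions = 1):

```sql
select
    uid,
    date,
    -- running total of daily points per user
    sum(spt) over (partition by uid order by date) as t
from
(
    select uid, date, sum(pt) spt
    from
    (
        select 1 tid, '2020-07-01' date, 1 uid, 500 pt
        union all
        select 1 tid, '2020-07-02' date, 1 uid, 300 pt
        union all
        select 1 tid, '2020-07-03' date, 1 uid, -600 pt
    )
    group by uid, date
)
order by uid, date
```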


Clickhouse SQL Query: Average in intervals

I have a table:
deviceId, valueDateTime, value, valueType
where valueType is temperature, pressure, etc.
I have several query parameters: begin, end (the period), and a time interval (for example, 20 minutes).
I want to get charts for the period for each deviceId and valueType, with a series of average values for each interval in the period.
EDIT:
The above is the final task; at the moment I am just experimenting with it on https://play.clickhouse.tech/?file=playground, where I am trying to solve a similar task: calculating the average Age in 20-minute intervals grouped by the Title field. My problem is how to add the grouping by Title.
-- 2013-07-15 00:00:00 - begin
-- 2013-07-16 00:00:00 - end
-- 1200 - 20-minute interval expressed in seconds
SELECT t, avg(Age) AS Age
FROM
(
    SELECT
        arrayJoin(
            arrayMap(x -> addSeconds(toDateTime('2013-07-15 00:00:00'), x * 1200),
                range(toUInt64(dateDiff('second', toDateTime('2013-07-15 00:00:00'), toDateTime('2013-07-16 00:00:00')) / 1200)))
        ) AS t,
        null AS Age
    UNION ALL
    SELECT
        addSeconds(
            toDateTime('2013-07-15 00:00:00'),
            1200 * intDivOrZero(dateDiff('second', toDateTime('2013-07-15 00:00:00'), EventTime), 1200)) AS t,
        avg(Age) AS Age
    FROM `hits_100m_obfuscated`
    WHERE EventTime BETWEEN toDateTime('2013-07-15 00:00:00') AND toDateTime('2013-07-16 00:00:00')
    GROUP BY t
)
GROUP BY t ORDER BY t;
EDIT 2:
The correct answer from vladimir, adapted and tested on https://play.clickhouse.tech/?file=playground:
SELECT
    Title,      -- as deviceId
    JavaEnable, -- as valueType
    groupArray((rounded_time, avg_value)) values
FROM
(
    WITH 60 * 20 AS interval
    SELECT
        Title,
        JavaEnable,
        toDateTime(intDiv(toUInt32(EventTime), interval) * interval)
            AS rounded_time, -- EventTime as valueDateTime
        avg(Age) avg_value   -- Age as value
    FROM `hits_100m_obfuscated`
    WHERE EventTime BETWEEN toDateTime('2013-07-15 00:00:00')
        AND toDateTime('2013-07-16 00:00:00')
    GROUP BY
        Title,
        JavaEnable,
        rounded_time
    ORDER BY rounded_time
)
GROUP BY
    Title,
    JavaEnable
ORDER BY
    Title,
    JavaEnable
Try this query:
SELECT
    deviceId,
    valueType,
    groupArray((rounded_time, avg_value)) values
FROM
(
    WITH 60 * 20 AS interval
    SELECT
        deviceId,
        valueType,
        toDateTime(intDiv(toUInt32(valueDateTime), interval) * interval) AS rounded_time,
        avg(value) avg_value
    FROM
    (
        /* emulate the test dataset */
        SELECT
            number % 4 AS deviceId,
            now() - (number * 60) AS valueDateTime,
            number % 10 AS value,
            if((number % 2) = 1, 'temp', 'pres') AS valueType
        FROM numbers(48)
    )
    /* WHERE valueDateTime >= begin AND valueDateTime < end */
    GROUP BY
        deviceId,
        valueType,
        rounded_time
    ORDER BY rounded_time
)
GROUP BY
    deviceId,
    valueType
ORDER BY
    deviceId,
    valueType
/*
┌─deviceId─┬─valueType─┬─values────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ 0 │ pres │ [('2021-02-12 06:00:00',4),('2021-02-12 06:20:00',4),('2021-02-12 06:40:00',4),('2021-02-12 07:00:00',0)] │
│ 1 │ temp │ [('2021-02-12 06:00:00',5),('2021-02-12 06:20:00',5),('2021-02-12 06:40:00',5),('2021-02-12 07:00:00',1)] │
│ 2 │ pres │ [('2021-02-12 06:00:00',4),('2021-02-12 06:20:00',4),('2021-02-12 06:40:00',4)] │
│ 3 │ temp │ [('2021-02-12 06:00:00',5),('2021-02-12 06:20:00',5),('2021-02-12 06:40:00',5)] │
└──────────┴───────────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────┘
*/
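The intDiv-based rounding in both queries can also be expressed with the built-in toStartOfInterval function, which says the same thing more directly. A sketch of the playground variant under the same table and period:

```sql
SELECT
    Title,
    JavaEnable,
    -- round each timestamp down to its 20-minute bucket
    toStartOfInterval(EventTime, INTERVAL 20 MINUTE) AS rounded_time,
    avg(Age) AS avg_value
FROM `hits_100m_obfuscated`
WHERE EventTime BETWEEN toDateTime('2013-07-15 00:00:00')
    AND toDateTime('2013-07-16 00:00:00')
GROUP BY
    Title,
    JavaEnable,
    rounded_time
ORDER BY
    Title,
    JavaEnable,
    rounded_time
```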
I would recommend using Grafana to visualize ClickHouse reports (see the Grafana ClickHouse data source).

Group by date with sparkline like data in the one query

I have time-series data from similar hosts stored in a ClickHouse table with the following structure:
event_type | event_date
------------|---------------------
type_1 | 2017-11-09 20:11:28
type_1 | 2017-11-09 20:11:25
type_2 | 2017-11-09 20:11:23
type_2 | 2017-11-09 20:11:21
Each row in the table means the presence of a value of 1 for event_type at that datetime. To quickly assess the situation I need to show the sum (total) plus the last seven daily values (pulse), like this:
event_type | day | total | pulse
------------|------------|-------|-----------------------------
type_1 | 2017-11-09 | 876 | 12,9,23,67,5,34,10
type_2 | 2017-11-09 | 11865 | 267,120,234,425,102,230,150
I tried to get it with one query in the following way, but it failed: the pulse consists of the same repeated value:
with
arrayMap(x -> today() - 7 + x, range(7)) as week_range,
arrayMap(x -> count(event_type), week_range) as pulse
select
event_type,
toDate(event_date) as day,
count() as total,
pulse
from database.table
group by day, event_type
event_type | day | total | pulse
------------|------------|-------|-------------------------------------------
type_1 | 2017-11-09 | 876 | 876,876,876,876,876,876,876
type_2 | 2017-11-09 | 11865 | 11865,11865,11865,11865,11865,11865,11865
Please point out where my mistake is and how to get the desired result.
select
    event_type,
    groupArray(1)(day)[1],   -- latest day per event_type
    arraySum(pulse) total7,  -- total over the last 7 days
    groupArray(7)(cnt) pulse -- last 7 daily counts (most recent first)
from
(
    select
        event_type,
        toDate(event_date) as day,
        count() as cnt
    from database.table
    where day >= today() - 30
    group by event_type, day
    order by event_type, day desc
)
group by event_type
order by event_type
I would consider calculating pulse on the application side; ClickHouse just provides the required data.
The neighbor function can also be used:
SELECT
number,
[neighbor(number, -7), neighbor(number, -6), neighbor(number, -5), neighbor(number, -4), neighbor(number, -3), neighbor(number, -2), neighbor(number, -1)] AS pulse
FROM
(
SELECT number
FROM numbers(10, 15)
ORDER BY number ASC
)
┌─number─┬─pulse──────────────────┐
│ 10 │ [0,0,0,0,0,0,0] │
│ 11 │ [0,0,0,0,0,0,10] │
│ 12 │ [0,0,0,0,0,10,11] │
│ 13 │ [0,0,0,0,10,11,12] │
│ 14 │ [0,0,0,10,11,12,13] │
│ 15 │ [0,0,10,11,12,13,14] │
│ 16 │ [0,10,11,12,13,14,15] │
│ 17 │ [10,11,12,13,14,15,16] │
│ 18 │ [11,12,13,14,15,16,17] │
│ 19 │ [12,13,14,15,16,17,18] │
│ 20 │ [13,14,15,16,17,18,19] │
│ 21 │ [14,15,16,17,18,19,20] │
│ 22 │ [15,16,17,18,19,20,21] │
│ 23 │ [16,17,18,19,20,21,22] │
│ 24 │ [17,18,19,20,21,22,23] │
└────────┴────────────────────────┘
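On versions with window function support, the same trailing window can be taken directly with groupArray used as a window function. A sketch over the same numbers; note that, unlike neighbor, the leading rows produce shorter arrays instead of zero-padded ones:

```sql
SELECT
    number,
    -- collect the previous 7 rows (excluding the current one)
    groupArray(number) OVER (ORDER BY number ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING) AS pulse
FROM numbers(10, 15)
ORDER BY number
```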

Materialized view for calculated results

I have a table like below, where State is a limited set of updates (e.g. Start, End):
CREATE TABLE event_updates (
event_id Int32,
timestamp DateTime,
state String
) ENGINE Log;
And I want to be able quickly run queries like:
SELECT count(*)
FROM (
SELECT event_id,
minOrNullIf(timestamp, state = 'Start') as start,
minOrNullIf(timestamp, state = 'End') as end,
end - start as duration,
duration < 10 as is_fast,
duration > 300 as is_slow
FROM event_updates
GROUP BY event_id)
WHERE start >= '2020-08-20 00:00:00'
AND start < '2020-09-20 00:00:00'
AND is_slow;
But those queries are slow when there is a lot of data, I'm guessing because the calculations are required for every row.
Example data:
┌─event_id─┬───────────timestamp─┬─state─┐
│ 1 │ 2020-08-21 09:58:00 │ Start │
│ 1 │ 2020-08-21 10:18:00 │ End │
│ 2 │ 2020-08-21 10:23:00 │ Start │
│ 2 │ 2020-08-21 10:23:05 │ End │
│ 3 │ 2020-08-21 10:23:00 │ Start │
│ 3 │ 2020-08-21 10:24:00 │ End │
│ 3 │ 2020-08-21 11:24:00 │ End │
│ 4 │ 2020-08-21 10:30:00 │ Start │
└──────────┴─────────────────────┴───────┘
And example query:
SELECT
event_id,
minOrNullIf(timestamp, state = 'Start') AS start,
minOrNullIf(timestamp, state = 'End') AS end,
end - start AS duration,
duration < 10 AS is_fast,
duration > 300 AS is_slow
FROM event_updates
GROUP BY event_id
ORDER BY event_id ASC
┌─event_id─┬───────────────start─┬─────────────────end─┬─duration─┬─is_fast─┬─is_slow─┐
│ 1 │ 2020-08-21 09:58:00 │ 2020-08-21 10:18:00 │ 1200 │ 0 │ 1 │
│ 2 │ 2020-08-21 10:23:00 │ 2020-08-21 10:23:05 │ 5 │ 1 │ 0 │
│ 3 │ 2020-08-21 10:23:00 │ 2020-08-21 10:24:00 │ 60 │ 0 │ 0 │
│ 4 │ 2020-08-21 10:30:00 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │
└──────────┴─────────────────────┴─────────────────────┴──────────┴─────────┴─────────┘
What I would like to produce is a pre-calculated table like:
CREATE TABLE event_stats (
event_id Int32,
start Nullable(DateTime),
end Nullable(DateTime),
duration Nullable(Int32),
is_fast Nullable(UInt8),
is_slow Nullable(UInt8)
);
But I can't work out how to create this table with a materialized view or find a better way.
First, I would:
use the MergeTree engine instead of Log to benefit from the sorting key
CREATE TABLE event_updates (
event_id Int32,
timestamp DateTime,
state String
) ENGINE MergeTree
PARTITION BY toYYYYMM(timestamp)
ORDER BY (timestamp, state);
constrain the original dataset by applying a WHERE clause on timestamp and state (your query processed the whole dataset)
SELECT count(*)
FROM (
SELECT event_id,
minOrNullIf(timestamp, state = 'Start') as start,
minOrNullIf(timestamp, state = 'End') as end,
end - start as duration,
duration < 10 as is_fast,
duration > 300 as is_slow
FROM event_updates
WHERE timestamp >= '2020-08-20 00:00:00' AND timestamp < '2020-09-20 00:00:00'
AND state IN ('Start', 'End')
GROUP BY event_id
HAVING start >= '2020-08-20 00:00:00' AND start < '2020-09-20 00:00:00'
AND is_slow);
If these don't help, consider using AggregatingMergeTree to work with precalculated aggregates instead of raw data.
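A sketch of that AggregatingMergeTree route, assuming the start/end timestamps are what needs precalculating (the table and view names are made up; duration and the is_fast / is_slow flags stay cheap to derive at read time):

```sql
-- aggregate states, not raw rows
CREATE TABLE event_stats
(
    event_id   Int32,
    start_time AggregateFunction(minOrNullIf, DateTime, UInt8),
    end_time   AggregateFunction(minOrNullIf, DateTime, UInt8)
)
ENGINE = AggregatingMergeTree
ORDER BY event_id;

-- keeps event_stats up to date as rows arrive in event_updates
CREATE MATERIALIZED VIEW event_stats_mv TO event_stats AS
SELECT
    event_id,
    minOrNullIfState(timestamp, state = 'Start') AS start_time,
    minOrNullIfState(timestamp, state = 'End')   AS end_time
FROM event_updates
GROUP BY event_id;

-- read side: the -Merge combinator finalizes the partial aggregates
SELECT
    event_id,
    minOrNullIfMerge(start_time) AS start,
    minOrNullIfMerge(end_time)   AS end,
    end - start                  AS duration
FROM event_stats
GROUP BY event_id;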

How to sum arrays element by element after group by in clickhouse

I am trying to add an array column element by element after a group by another column.
Having the table A below:
id units
1 [1,1,1]
2 [3,0,0]
1 [5,3,7]
3 [2,5,2]
2 [3,2,6]
I would like to query something like:
select id, sum(units) from A group by id
And get the following result:
id units
1 [6,4,8]
2 [6,2,6]
3 [2,5,2]
Where the units arrays in rows with the same id get added element by element.
Try this query:
SELECT id, sumForEach(units) units
FROM
(
    /* emulate dataset */
    SELECT data.1 id, data.2 units
    FROM
    (
        SELECT arrayJoin([(1, [1,1,1]), (2, [3,0,0]), (1, [5,3,7]), (3, [2,5,2]), (2, [3,2,6])]) data
    )
)
GROUP BY id
/* Result
┌─id─┬─units───┐
│ 1 │ [6,4,8] │
│ 2 │ [6,2,6] │
│ 3 │ [2,5,2] │
└────┴─────────┘
*/
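For reference, a sketch of the same query against a real table rather than the emulated dataset (the table name A matches the question; the MergeTree engine and ordering are assumptions):

```sql
CREATE TABLE A
(
    id    UInt32,
    units Array(UInt32)
)
ENGINE = MergeTree
ORDER BY id;

INSERT INTO A VALUES (1, [1,1,1]), (2, [3,0,0]), (1, [5,3,7]), (3, [2,5,2]), (2, [3,2,6]);

-- sumForEach adds the arrays element by element within each group
SELECT id, sumForEach(units) AS units
FROM A
GROUP BY id
ORDER BY id
```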
