Pivot in ClickHouse

I want to do a pivot in ClickHouse. I have data in the form of:
rule_name | result
'string_1', 'result_1'
'string_2', 'result_2'
'string_3', 'result_3'
'string_4', 'result_4'
I want to pivot it so that string_1, string_2, ... become columns, and the result has four columns and one row (result_1, result_2, result_3, result_4):
┌─string_1─┬─string_2─┬─string_3─┬─string_4─┐
│ result_1 │ result_2 │ result_3 │ result_4 │
└──────────┴──────────┴──────────┴──────────┘
How do I achieve this?

SELECT
    anyIf(result, rule_name = 'string_1') AS string_1,
    anyIf(result, rule_name = 'string_2') AS string_2,
    anyIf(result, rule_name = 'string_3') AS string_3,
    anyIf(result, rule_name = 'string_4') AS string_4
FROM (
    /* emulate the dataset */
    SELECT 'string_1' AS rule_name, 'result_1' AS result
    UNION ALL SELECT 'string_2', 'result_2'
    UNION ALL SELECT 'string_3', 'result_3'
    UNION ALL SELECT 'string_4', 'result_4')
┌─string_1─┬─string_2─┬─string_3─┬─string_4─┐
│ result_1 │ result_2 │ result_3 │ result_4 │
└──────────┴──────────┴──────────┴──────────┘
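The conditional-aggregation idea behind anyIf can be sketched in plain Python for illustration (anyIf returns some value from the rows matching the condition; here we simply take the first match):

```python
rows = [
    ("string_1", "result_1"),
    ("string_2", "result_2"),
    ("string_3", "result_3"),
    ("string_4", "result_4"),
]

def any_if(rows, rule):
    """Mimic anyIf(result, rule_name = rule): any matching result, else None."""
    return next((result for name, result in rows if name == rule), None)

# one dict key per rule_name = one output column, one row of values
pivoted = {rule: any_if(rows, rule)
           for rule in ("string_1", "string_2", "string_3", "string_4")}
print(pivoted)
```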


Sum arrays at the same index in ClickHouse

I am trying to add an array column element by element after grouping by another column. Given table A below:
id units
1 [1,1,1]
2 [3,0,0]
1 [5,3,7]
3 [2,5,2]
2 [3,2,6]
I would like to query something like:
select id, sum(units) from A group by id
And get the following result:
id units
1 [6,4,8]
2 [6,2,6]
3 [2,5,2]
Where the units arrays in rows with the same id get added element by element.
Try this query:
SELECT id, sumForEach(units) units
FROM (
/* emulate dataset */
SELECT data.1 id, data.2 units
FROM (
SELECT arrayJoin([(1, [1,1,1]), (2, [3,0,0]), (1, [5,3,7]), (3, [2,5,2]), (2, [3,2,6])]) data))
GROUP BY id
/* Result
┌─id─┬─units───┐
│ 1 │ [6,4,8] │
│ 2 │ [6,2,6] │
│ 3 │ [2,5,2] │
└────┴─────────┘
*/
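The effect of sumForEach after GROUP BY id is element-wise addition of the arrays within each group; the same logic, sketched in Python for illustration:

```python
from collections import defaultdict

table = [(1, [1, 1, 1]), (2, [3, 0, 0]), (1, [5, 3, 7]),
         (3, [2, 5, 2]), (2, [3, 2, 6])]

sums = defaultdict(lambda: [0, 0, 0])
for row_id, units in table:
    # element-wise accumulation, like sumForEach(units) per group
    sums[row_id] = [a + b for a, b in zip(sums[row_id], units)]

print(dict(sorted(sums.items())))
```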

How to configure ClickHouse to return NULL instead of 0?

Let's say I have a table created as such without any record:
create table metric (date Int32) Engine=MergeTree ORDER BY (date);
If I run this query
select max(date) from metric;
ClickHouse returns
+-----------+
| max(date) |
+-----------+
|         0 |
+-----------+
1 row in set (0.02 sec)
instead of
+-----------+
| max(date) |
+-----------+
|      NULL |
+-----------+
1 row in set (0.02 sec)
Is it possible to configure ClickHouse to return NULL without having to write the query like this:
select max(toNullable(date)) from metric;
Use setting aggregate_functions_null_for_empty:
SELECT max(date)
FROM metric
SETTINGS aggregate_functions_null_for_empty = 1
/*
┌─maxOrNull(date)─┐
│            ᴺᵁᴸᴸ │
└─────────────────┘
*/
or consider using OrNull-combinator:
SELECT maxOrNull(date)
FROM metric
/*
┌─maxOrNull(date)─┐
│            ᴺᵁᴸᴸ │
└─────────────────┘
*/
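The OrNull combinator's behavior is analogous to Python's max with a default value: an empty input yields None rather than a type's default zero (a loose analogy, for illustration only):

```python
dates = []  # no rows, like the empty metric table

# plain max(dates) would raise ValueError; default=None mirrors maxOrNull
result = max(dates, default=None)
print(result)
```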

Time series query based on another table

Initial data
CREATE TABLE a_table (
id UInt8,
created_at DateTime
)
ENGINE = MergeTree()
PARTITION BY tuple()
ORDER BY id;
CREATE TABLE b_table (
id UInt8,
started_at DateTime,
stopped_at DateTime
)
ENGINE = MergeTree()
PARTITION BY tuple()
ORDER BY id;
INSERT INTO a_table (id, created_at) VALUES
(1, '2020-01-01 00:00:00'),
(2, '2020-01-02 00:00:00'),
(3, '2020-01-03 00:00:00')
;
INSERT INTO b_table (id, started_at, stopped_at) VALUES
(1, '2020-01-01 00:00:00', '2020-01-01 23:59:59'),
(2, '2020-01-02 00:00:00', '2020-01-02 23:59:59'),
(3, '2020-01-04 00:00:00', '2020-01-04 23:59:59')
;
Expected result: The 'a_table' rows by condition
b_table.started_at >= a_table.created_at AND
b_table.stopped_at <= a_table.created_at
+----+---------------------+
| id | created_at |
+----+---------------------+
| 1 | 2020-01-01 00:00:00 |
+----+---------------------+
| 2 | 2020-01-02 00:00:00 |
+----+---------------------+
What have I tried:
-- No errors, empty result
SELECT a_table.*
FROM a_table
INNER JOIN b_table
ON b_table.id = a_table.id
WHERE b_table.started_at >= a_table.created_at
AND b_table.stopped_at <= a_table.created_at
;
SELECT a_table.*
FROM a_table
ASOF INNER JOIN (
SELECT * FROM b_table
) q
ON q.id = a_table.id
AND q.started_at >= a_table.created_at
-- Error:
-- Invalid expression for JOIN ON.
-- ASOF JOIN expects exactly one inequality in ON section,
-- unexpected stopped_at <= created_at.
-- AND q.stopped_at <= a_table.created_at
;
The condition is reversed: the comparisons >= and <= should be
WHERE b_table.started_at <= a_table.created_at
AND b_table.stopped_at >= a_table.created_at
Tested on ClickHouse 20.8.7.15:
SELECT
a_table.*,
b_table.*
FROM a_table
INNER JOIN b_table ON b_table.id = a_table.id
WHERE (b_table.started_at <= a_table.created_at) AND (b_table.stopped_at >= a_table.created_at)
┌─id─┬──────────created_at─┬─b_table.id─┬──────────started_at─┬──────────stopped_at─┐
│ 1 │ 2020-01-01 00:00:00 │ 1 │ 2020-01-01 00:00:00 │ 2020-01-01 23:59:59 │
│ 2 │ 2020-01-02 00:00:00 │ 2 │ 2020-01-02 00:00:00 │ 2020-01-02 23:59:59 │
└────┴─────────────────────┴────────────┴─────────────────────┴─────────────────────┘
In real production such queries would not work well, because this JOIN is very slow. It needs a re-design; it is hard to say how without knowing why you have the second table. I would probably use a rangeHashed external dictionary.
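The corrected condition is an interval-containment check, started_at <= created_at <= stopped_at; sketched in Python against the sample data:

```python
from datetime import datetime

a_table = {1: "2020-01-01 00:00:00", 2: "2020-01-02 00:00:00",
           3: "2020-01-03 00:00:00"}
b_table = {1: ("2020-01-01 00:00:00", "2020-01-01 23:59:59"),
           2: ("2020-01-02 00:00:00", "2020-01-02 23:59:59"),
           3: ("2020-01-04 00:00:00", "2020-01-04 23:59:59")}

def ts(s):
    return datetime.strptime(s, "%Y-%m-%d %H:%M:%S")

# keep a_table rows whose created_at falls inside the joined b_table interval
matched = [i for i, created in a_table.items()
           if ts(b_table[i][0]) <= ts(created) <= ts(b_table[i][1])]
print(matched)
```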

Group by date with sparkline like data in the one query

I have time-series data from similar hosts stored in a ClickHouse table with the following structure:
event_type  | event_date
------------|---------------------
type_1 | 2017-11-09 20:11:28
type_1 | 2017-11-09 20:11:25
type_2 | 2017-11-09 20:11:23
type_2 | 2017-11-09 20:11:21
Each row in the table records one occurrence of event_type at that datetime. To quickly assess the situation I need the overall sum (total) plus the last seven daily values (pulse), like this:
event_type | day | total | pulse
------------|------------|-------|-----------------------------
type_1 | 2017-11-09 | 876 | 12,9,23,67,5,34,10
type_2 | 2017-11-09 | 11865 | 267,120,234,425,102,230,150
I tried to get it with one query in the following way, but it failed: the pulse consists of identical values:
with
arrayMap(x -> today() - 7 + x, range(7)) as week_range,
arrayMap(x -> count(event_type), week_range) as pulse
select
event_type,
toDate(event_date) as day,
count() as total,
pulse
from database.table
group by day, event_type
event_type | day | total | pulse
------------|------------|-------|-------------------------------------------
type_1 | 2017-11-09 | 876 | 876,876,876,876,876,876,876
type_2 | 2017-11-09 | 11865 | 11865,11865,11865,11865,11865,11865,11865
Please point out where my mistake is and how to get the desired result.
SELECT event_type, groupArray(1)(day)[1] AS day, arraySum(pulse) AS total7, groupArray(7)(cnt) AS pulse
FROM (
    SELECT
        event_type,
        toDate(event_date) AS day,
        count() AS cnt
    FROM database.table
    WHERE day >= today() - 30
    GROUP BY event_type, day
    ORDER BY event_type, day DESC
)
GROUP BY event_type
ORDER BY event_type
I would consider calculating pulse on the application side, with ClickHouse just providing the required data.
Alternatively, the neighbor window function can be used:
SELECT
number,
[neighbor(number, -7), neighbor(number, -6), neighbor(number, -5), neighbor(number, -4), neighbor(number, -3), neighbor(number, -2), neighbor(number, -1)] AS pulse
FROM
(
SELECT number
FROM numbers(10, 15)
ORDER BY number ASC
)
┌─number─┬─pulse──────────────────┐
│ 10 │ [0,0,0,0,0,0,0] │
│ 11 │ [0,0,0,0,0,0,10] │
│ 12 │ [0,0,0,0,0,10,11] │
│ 13 │ [0,0,0,0,10,11,12] │
│ 14 │ [0,0,0,10,11,12,13] │
│ 15 │ [0,0,10,11,12,13,14] │
│ 16 │ [0,10,11,12,13,14,15] │
│ 17 │ [10,11,12,13,14,15,16] │
│ 18 │ [11,12,13,14,15,16,17] │
│ 19 │ [12,13,14,15,16,17,18] │
│ 20 │ [13,14,15,16,17,18,19] │
│ 21 │ [14,15,16,17,18,19,20] │
│ 22 │ [15,16,17,18,19,20,21] │
│ 23 │ [16,17,18,19,20,21,22] │
│ 24 │ [17,18,19,20,21,22,23] │
└────────┴────────────────────────┘
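neighbor(number, -k) looks k rows back in the result set and yields 0 when the offset falls outside it; the same trailing seven-value window, sketched in Python:

```python
numbers = list(range(10, 25))  # numbers(10, 15): 15 values starting at 10

pulse = []
for i, n in enumerate(numbers):
    # previous 7 values, zero-padded like an out-of-range neighbor()
    window = [numbers[i + d] if i + d >= 0 else 0 for d in range(-7, 0)]
    pulse.append((n, window))

print(pulse[0], pulse[-1])
```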

Exclude rows based on condition from two columns

My question is very similar to this one, except that I want to exclude all rows whose Name has only a single distinct Location.
If we assume that to be the input.
Name | Location
-------------------
Bob | Shasta
Bob | Leaves
Sean | Leaves
Sean | Leaves
Dylan | Shasta
Dylan | Redwood
Dylan | Leaves
I want the output to be
Name | Location
-------------------
Bob | Shasta
Bob | Leaves
Dylan | Shasta
Dylan | Redwood
Dylan | Leaves
In this case, Sean is being excluded because he always has the same location.
In standard SQL there is the WHERE EXISTS subquery. How do I do this in ClickHouse?
Try this query:
SELECT Name, Location
FROM (
/* emulate the origin dataset */
SELECT test_data.1 AS Name, test_data.2 AS Location
FROM (
SELECT arrayJoin([
('Bob', 'Shasta'),
('Bob', 'Leaves'),
('Sean', 'Leaves'),
('Sean', 'Leaves'),
('Dylan', 'Shasta'),
('Dylan', 'Redwood'),
('Dylan', 'Leaves')]) AS test_data))
WHERE Name IN (
SELECT Name
FROM (
/* emulate the origin dataset */
SELECT test_data.1 AS Name, test_data.2 AS Location
FROM (
SELECT arrayJoin([
('Bob', 'Shasta'),
('Bob', 'Leaves'),
('Sean', 'Leaves'),
('Sean', 'Leaves'),
('Dylan', 'Shasta'),
('Dylan', 'Redwood'),
('Dylan', 'Leaves')]) AS test_data))
GROUP BY Name
HAVING uniq(Location) > 1)
/* result
┌─Name──┬─Location─┐
│ Bob │ Shasta │
│ Bob │ Leaves │
│ Dylan │ Shasta │
│ Dylan │ Redwood │
│ Dylan │ Leaves │
└───────┴──────────┘
*/
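The HAVING uniq(Location) > 1 filter keeps only names with more than one distinct location; the same two-pass logic, sketched in Python:

```python
rows = [("Bob", "Shasta"), ("Bob", "Leaves"), ("Sean", "Leaves"),
        ("Sean", "Leaves"), ("Dylan", "Shasta"), ("Dylan", "Redwood"),
        ("Dylan", "Leaves")]

# pass 1: names with more than one distinct location (HAVING uniq(Location) > 1)
locations = {}
for name, loc in rows:
    locations.setdefault(name, set()).add(loc)
keep = {name for name, locs in locations.items() if len(locs) > 1}

# pass 2: keep only rows whose Name passed the filter (WHERE Name IN ...)
filtered = [(name, loc) for name, loc in rows if name in keep]
print(filtered)
```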
