Exclude rows based on condition from two columns - clickhouse

My question is very similar to this one, except that I want to exclude all rows whose Name has only a single distinct value in the Location column.
Assume this is the input:
Name | Location
-------------------
Bob | Shasta
Bob | Leaves
Sean | Leaves
Sean | Leaves
Dylan | Shasta
Dylan | Redwood
Dylan | Leaves
I want the output to be
Name | Location
-------------------
Bob | Shasta
Bob | Leaves
Dylan | Shasta
Dylan | Redwood
Dylan | Leaves
In this case, Sean is excluded because he always has the same location.
In SQL there is a WHERE EXISTS subquery. How do we do this in ClickHouse?

Try this query:
SELECT Name, Location
FROM (
    /* emulate the original dataset */
    SELECT test_data.1 AS Name, test_data.2 AS Location
    FROM (
        SELECT arrayJoin([
            ('Bob', 'Shasta'),
            ('Bob', 'Leaves'),
            ('Sean', 'Leaves'),
            ('Sean', 'Leaves'),
            ('Dylan', 'Shasta'),
            ('Dylan', 'Redwood'),
            ('Dylan', 'Leaves')]) AS test_data))
WHERE Name IN (
    SELECT Name
    FROM (
        /* emulate the original dataset */
        SELECT test_data.1 AS Name, test_data.2 AS Location
        FROM (
            SELECT arrayJoin([
                ('Bob', 'Shasta'),
                ('Bob', 'Leaves'),
                ('Sean', 'Leaves'),
                ('Sean', 'Leaves'),
                ('Dylan', 'Shasta'),
                ('Dylan', 'Redwood'),
                ('Dylan', 'Leaves')]) AS test_data))
    GROUP BY Name
    HAVING uniq(Location) > 1)
/* result
┌─Name──┬─Location─┐
│ Bob │ Shasta │
│ Bob │ Leaves │
│ Dylan │ Shasta │
│ Dylan │ Redwood │
│ Dylan │ Leaves │
└───────┴──────────┘
*/
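As a side note, on recent ClickHouse versions the emulated dataset does not have to be written twice: common table expressions are supported, so the same filter can be sketched once with a WITH clause. This is a sketch, not part of the original answer; uniqExact is used here because uniq is an approximate count:

```sql
WITH dataset AS (
    SELECT test_data.1 AS Name, test_data.2 AS Location
    FROM (
        SELECT arrayJoin([
            ('Bob', 'Shasta'), ('Bob', 'Leaves'),
            ('Sean', 'Leaves'), ('Sean', 'Leaves'),
            ('Dylan', 'Shasta'), ('Dylan', 'Redwood'), ('Dylan', 'Leaves')
        ]) AS test_data
    )
)
SELECT Name, Location
FROM dataset
WHERE Name IN (
    SELECT Name
    FROM dataset
    GROUP BY Name
    HAVING uniqExact(Location) > 1
)
```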


add inserted on column for clickhouse

With other databases, you would add an inserted_on column to the CREATE TABLE query with inserted_on DateTime DEFAULT now().
However, it seems that ClickHouse evaluates that column on every query; I am getting the current time every time.
Table structure:
create table main.contracts (
    bk_number UInt64,
    bk_timestamp DateTime,
    bk_hash String,
    address String,
    inserted_on DateTime DEFAULT now()
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(bk_timestamp)
ORDER BY (bk_number, bk_timestamp);
I re-created your scenario and it works for me with no problems.
Table Structure:
create table contracts (
    bk_number UInt64,
    bk_timestamp DateTime,
    inserted_on DateTime DEFAULT now()
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(bk_timestamp)
ORDER BY (bk_number, bk_timestamp);
Insert Queries:
INSERT INTO contracts (bk_number,bk_timestamp) VALUES (123,'2023-02-14 01:00:00');
INSERT INTO contracts (bk_number,bk_timestamp) VALUES (456,'2023-02-14 02:00:00');
INSERT INTO contracts (bk_number,bk_timestamp) VALUES (789,'2023-02-14 03:00:00');
INSERT INTO contracts (bk_number,bk_timestamp) VALUES (101,'2023-02-14 04:00:00');
INSERT INTO contracts (bk_number,bk_timestamp) VALUES (102,'2023-02-14 05:00:00');
INSERT INTO contracts (bk_number,bk_timestamp) VALUES (103,'2023-02-14 06:00:00');
INSERT INTO contracts (bk_number,bk_timestamp) VALUES (104,'2023-02-14 07:00:00');
The Result:
SELECT NOW();
┌───────────────now()─┐
│ 2023-02-15 04:17:51 │
└─────────────────────┘
SELECT * FROM contracts;
┌─bk_number─┬────────bk_timestamp─┬─────────inserted_on─┐
│ 101 │ 2023-02-14 04:00:00 │ 2023-02-15 04:08:39 │
│ 102 │ 2023-02-14 05:00:00 │ 2023-02-15 04:08:39 │
│ 103 │ 2023-02-14 06:00:00 │ 2023-02-15 04:08:39 │
│ 104 │ 2023-02-14 07:00:00 │ 2023-02-15 04:09:30 │
│ 123 │ 2023-02-14 01:00:00 │ 2023-02-15 04:07:33 │
│ 456 │ 2023-02-14 02:00:00 │ 2023-02-15 04:08:39 │
│ 789 │ 2023-02-14 03:00:00 │ 2023-02-15 04:08:39 │
└───────────┴─────────────────────┴─────────────────────┘
SELECT * FROM contracts WHERE bk_number = 123;
┌─bk_number─┬────────bk_timestamp─┬─────────inserted_on─┐
│ 123 │ 2023-02-14 01:00:00 │ 2023-02-15 04:07:33 │
└───────────┴─────────────────────┴─────────────────────┘
Suggestion:
If the table is not too big, I would recommend running OPTIMIZE TABLE once.
Check what OPTIMIZE does in the ClickHouse docs:
Clickhouse Docs/Optimize
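For completeness, a MATERIALIZED column is an alternative worth knowing about: like DEFAULT, it is computed once at insert time, but it is hidden from SELECT * and cannot be supplied in an INSERT, which suits audit columns. A minimal sketch (table name is illustrative):

```sql
CREATE TABLE contracts_m (
    bk_number UInt64,
    bk_timestamp DateTime,
    inserted_on DateTime MATERIALIZED now()
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(bk_timestamp)
ORDER BY (bk_number, bk_timestamp);

-- inserted_on is only returned when named explicitly:
SELECT bk_number, inserted_on FROM contracts_m;
```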

how to use array join in Clickhouse

I'm trying to split 2 arrays using arrayJoin()
My table:
create table test_array (
    col1 Array(INT),
    col2 Array(INT),
    col3 String
)
engine = TinyLog;
Then I insert these values:
insert into test_array values ([1,2],[11,22],'Text');
insert into test_array values ([5,6],[55,66],'Text');
Splitting the first array, col1, on its own works fine, but what I need is to split col1 and col2 together and add both to the select.
I tried this query but it didn't work:
select arrayJoin(col1), arrayJoin(col2), col1, col2, col3 from test_array;
How can I edit the query to remove the extra cross-product rows?
Thanks.
Sequential calls of arrayJoin produce the Cartesian product; to avoid it, use ARRAY JOIN:
SELECT
    c1,
    c2,
    col1,
    col2,
    col3
FROM test_array
ARRAY JOIN
    col1 AS c1,
    col2 AS c2
/*
┌─c1─┬─c2─┬─col1──┬─col2────┬─col3─┐
│ 1 │ 11 │ [1,2] │ [11,22] │ Text │
│ 2 │ 22 │ [1,2] │ [11,22] │ Text │
│ 5 │ 55 │ [5,6] │ [55,66] │ Text │
│ 6 │ 66 │ [5,6] │ [55,66] │ Text │
└────┴────┴───────┴─────────┴──────┘
*/
One more way: arrayZip() plus untuple():
SELECT
untuple(arrayJoin(arrayZip(col1, col2))),
col3
FROM test_array
┌─_ut_1─┬─_ut_2─┬─col3─┐
│ 1 │ 11 │ Text │
│ 2 │ 22 │ Text │
│ 5 │ 55 │ Text │
│ 6 │ 66 │ Text │
└───────┴───────┴──────┘
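Note that both approaches assume col1 and col2 have the same length in every row: ARRAY JOIN over several arrays and arrayZip both require equal sizes and will throw an error otherwise. If the lengths may differ, one option (a sketch, not from the original answer) is to pad the shorter array with default values first:

```sql
SELECT
    c1,
    c2,
    col3
FROM test_array
ARRAY JOIN
    arrayResize(col1, greatest(length(col1), length(col2))) AS c1,
    arrayResize(col2, greatest(length(col1), length(col2))) AS c2
```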

Q: How to configure ClickHouse to return NULL instead of 0?

Let's say I have a table created as such without any record:
create table metric (date Int32) Engine=MergeTree ORDER BY (date);
If I run this query
select max(date) from metric;
ClickHouse returns
+-----------+
| max(date) |
+-----------+
| 0 |
+-----------+
1 row in set (0.02 sec)
instead of
+-----------+
| max(date) |
+-----------+
| NULL |
+-----------+
1 row in set (0.02 sec)
Is it possible to configure ClickHouse to return NULL without having to write a query like this:
select max(toNullable(date)) from metric;
Use setting aggregate_functions_null_for_empty:
SELECT max(date)
FROM metric
SETTINGS aggregate_functions_null_for_empty = 1
/*
┌─maxOrNull(date)─┐
│ ᴺᵁᴸᴸ │
└─────────────────┘
*/
or consider using OrNull-combinator:
SELECT maxOrNull(date)
FROM metric
/*
┌─maxOrNull(date)─┐
│ ᴺᵁᴸᴸ │
└─────────────────┘
*/
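A third option, not mentioned above, is to declare the column itself as Nullable in the table definition; aggregates over an empty set then return NULL with no query-side changes. A sketch (note that Nullable columns carry some storage and performance overhead, and a Nullable column cannot be part of the sorting key by default, hence ORDER BY tuple()):

```sql
CREATE TABLE metric_n (date Nullable(Int32)) ENGINE = MergeTree ORDER BY tuple();

SELECT max(date) FROM metric_n;  -- NULL on an empty table
```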

Group by date with sparkline like data in the one query

I have time-series data from similar hosts stored in a ClickHouse table with the following structure:
event_type | event_day
------------|---------------------
type_1 | 2017-11-09 20:11:28
type_1 | 2017-11-09 20:11:25
type_2 | 2017-11-09 20:11:23
type_2 | 2017-11-09 20:11:21
Each row in the table represents one occurrence of event_type at that datetime. To assess the situation quickly, I need the sum (total) plus the last seven daily values (pulse), like this:
event_type | day | total | pulse
------------|------------|-------|-----------------------------
type_1 | 2017-11-09 | 876 | 12,9,23,67,5,34,10
type_2 | 2017-11-09 | 11865 | 267,120,234,425,102,230,150
I tried to get it in one query the following way, but it failed: the pulse consists of the same value repeated:
with
    arrayMap(x -> today() - 7 + x, range(7)) as week_range,
    arrayMap(x -> count(event_type), week_range) as pulse
select
    event_type,
    toDate(event_date) as day,
    count() as total,
    pulse
from database.table
group by day, event_type
event_type | day | total | pulse
------------|------------|-------|-------------------------------------------
type_1 | 2017-11-09 | 876 | 876,876,876,876,876,876,876
type_2 | 2017-11-09 | 11865 | 11865,11865,11865,11865,11865,11865,11865
Please point out where my mistake is and how to get the desired result.
select
    event_type,
    groupArray(1)(day)[1],
    arraySum(pulse) total7,
    groupArray(7)(cnt) pulse
from (
    select
        event_type,
        toDate(event_date) as day,
        count() as cnt
    from database.table
    where day >= today() - 30
    group by event_type, day
    order by event_type, day desc
)
group by event_type
order by event_type
I would consider calculating pulse on the application side; ClickHouse just provides the required data.
The neighbor window function can also be used:
SELECT
number,
[neighbor(number, -7), neighbor(number, -6), neighbor(number, -5), neighbor(number, -4), neighbor(number, -3), neighbor(number, -2), neighbor(number, -1)] AS pulse
FROM
(
SELECT number
FROM numbers(10, 15)
ORDER BY number ASC
)
┌─number─┬─pulse──────────────────┐
│ 10 │ [0,0,0,0,0,0,0] │
│ 11 │ [0,0,0,0,0,0,10] │
│ 12 │ [0,0,0,0,0,10,11] │
│ 13 │ [0,0,0,0,10,11,12] │
│ 14 │ [0,0,0,10,11,12,13] │
│ 15 │ [0,0,10,11,12,13,14] │
│ 16 │ [0,10,11,12,13,14,15] │
│ 17 │ [10,11,12,13,14,15,16] │
│ 18 │ [11,12,13,14,15,16,17] │
│ 19 │ [12,13,14,15,16,17,18] │
│ 20 │ [13,14,15,16,17,18,19] │
│ 21 │ [14,15,16,17,18,19,20] │
│ 22 │ [15,16,17,18,19,20,21] │
│ 23 │ [16,17,18,19,20,21,22] │
│ 24 │ [17,18,19,20,21,22,23] │
└────────┴────────────────────────┘
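On newer ClickHouse versions (21.3+, where window functions are available), the same sliding window can be expressed without neighbor. A sketch, not from the original answer; note the leading rows produce shorter arrays instead of zero-padded ones:

```sql
SELECT
    number,
    groupArray(number) OVER (
        ORDER BY number
        ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
    ) AS pulse
FROM numbers(10, 15)
```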

Pivot In clickhouse

I want to do a pivot in clickhouse
I have data in the form of
rule_name | result
'string_1', 'result_1'
'string_2', 'result_2'
'string_3', 'result_3'
'string_4', 'result_4'
I want to pivot it so that string_1, string_2, ... become columns,
and the result has 4 columns and one row (result_1, result_2, result_3, result_4):
string_1 | string_2 | string_3 | string_4
result_1 | result_2 | result_3 | result_4
How do I achieve this ?
select
    anyIf(result, rule_name = 'string_1') string_1,
    anyIf(result, rule_name = 'string_2') string_2,
    anyIf(result, rule_name = 'string_3') string_3,
    anyIf(result, rule_name = 'string_4') string_4
from (
    select 'string_1' rule_name, 'result_1' result
    union all select 'string_2', 'result_2'
    union all select 'string_3', 'result_3'
    union all select 'string_4', 'result_4')
┌─string_1─┬─string_2─┬─string_3─┬─string_4─┐
│ result_1 │ result_2 │ result_3 │ result_4 │
└──────────┴──────────┴──────────┴──────────┘
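The anyIf approach works when the set of rule_name values is known in advance. If it is not, one option (a sketch, assuming a ClickHouse version with the Map type) is to collapse the pairs into a single Map column and pick keys apart on the client side:

```sql
SELECT CAST((groupArray(rule_name), groupArray(result)), 'Map(String, String)') AS pivoted
FROM (
    SELECT 'string_1' AS rule_name, 'result_1' AS result
    UNION ALL SELECT 'string_2', 'result_2'
    UNION ALL SELECT 'string_3', 'result_3'
    UNION ALL SELECT 'string_4', 'result_4'
)
```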
