I have time-series data from similar hosts stored in a ClickHouse table with the following structure:
event_type | event_date
------------|---------------------
type_1 | 2017-11-09 20:11:28
type_1 | 2017-11-09 20:11:25
type_2 | 2017-11-09 20:11:23
type_2 | 2017-11-09 20:11:21
Each row in the table represents a single occurrence (a value of 1) of event_type at the given datetime. To quickly assess the situation I need to show the overall sum (total) plus the last seven daily values (pulse), like this:
event_type | day | total | pulse
------------|------------|-------|-----------------------------
type_1 | 2017-11-09 | 876 | 12,9,23,67,5,34,10
type_2 | 2017-11-09 | 11865 | 267,120,234,425,102,230,150
I tried to get it with one query in the following way, but it failed: the pulse consists of the same value repeated:
with
arrayMap(x -> today() - 7 + x, range(7)) as week_range,
arrayMap(x -> count(event_type), week_range) as pulse
select
event_type,
toDate(event_date) as day,
count() as total,
pulse
from database.table
group by day, event_type
event_type | day | total | pulse
------------|------------|-------|-------------------------------------------
type_1 | 2017-11-09 | 876 | 876,876,876,876,876,876,876
type_2 | 2017-11-09 | 11865 | 11865,11865,11865,11865,11865,11865,11865
Please point out where my mistake is and how to get the desired result.
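The pulse consists of identical values because count(event_type) inside arrayMap is evaluated as the group's aggregate, so every element of the array receives the same total. One way around this is to aggregate the counts per day in a subquery and then collect the last seven daily counts with groupArray: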
select event_type, groupArray(1)(day)[1], arraySum(pulse) total7, groupArray(7)(cnt) pulse
from (
select
event_type,
toDate(event_date) as day,
count() as cnt
from database.table
where day >= today()-30
group by event_type,day
order by event_type,day desc
)
group by event_type
order by event_type
I would also consider calculating pulse on the application server side, with ClickHouse just providing the required data.
The neighbor window function can be used:
SELECT
number,
[neighbor(number, -7), neighbor(number, -6), neighbor(number, -5), neighbor(number, -4), neighbor(number, -3), neighbor(number, -2), neighbor(number, -1)] AS pulse
FROM
(
SELECT number
FROM numbers(10, 15)
ORDER BY number ASC
)
┌─number─┬─pulse──────────────────┐
│ 10 │ [0,0,0,0,0,0,0] │
│ 11 │ [0,0,0,0,0,0,10] │
│ 12 │ [0,0,0,0,0,10,11] │
│ 13 │ [0,0,0,0,10,11,12] │
│ 14 │ [0,0,0,10,11,12,13] │
│ 15 │ [0,0,10,11,12,13,14] │
│ 16 │ [0,10,11,12,13,14,15] │
│ 17 │ [10,11,12,13,14,15,16] │
│ 18 │ [11,12,13,14,15,16,17] │
│ 19 │ [12,13,14,15,16,17,18] │
│ 20 │ [13,14,15,16,17,18,19] │
│ 21 │ [14,15,16,17,18,19,20] │
│ 22 │ [15,16,17,18,19,20,21] │
│ 23 │ [16,17,18,19,20,21,22] │
│ 24 │ [17,18,19,20,21,22,23] │
└────────┴────────────────────────┘
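Applied to the question's data, a rough sketch could look like this (assuming the database.table layout from the question; a single event_type is selected because neighbor does not reset between groups, the subquery's ORDER BY provides the day ordering, and days without events are simply absent rather than counted as zeros):
SELECT
    day,
    cnt,
    [neighbor(cnt, -7), neighbor(cnt, -6), neighbor(cnt, -5), neighbor(cnt, -4),
     neighbor(cnt, -3), neighbor(cnt, -2), neighbor(cnt, -1)] AS pulse
FROM
(
    SELECT toDate(event_date) AS day, count() AS cnt
    FROM database.table
    WHERE event_type = 'type_1'
    GROUP BY day
    ORDER BY day ASC
)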
With other databases, you would add an inserted_on column to a CREATE TABLE query with inserted_on DateTime DEFAULT now().
However, it seems that ClickHouse evaluates that column at every query; I am getting the current time every time.
Table structure:
create table main.contracts(
bk_number UInt64,
bk_timestamp DateTime,
bk_hash String,
address String,
inserted_on DateTime DEFAULT now()
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(bk_timestamp)
ORDER BY (bk_number, bk_timestamp);
I re-created your scenario and it works for me with no problems.
Table Structure:
create table contracts(
bk_number UInt64,
bk_timestamp DateTime,
inserted_on DateTime DEFAULT now()
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(bk_timestamp)
ORDER BY (bk_number, bk_timestamp);
Insert Queries:
INSERT INTO contracts (bk_number,bk_timestamp) VALUES (123,'2023-02-14 01:00:00');
INSERT INTO contracts (bk_number,bk_timestamp) VALUES (456,'2023-02-14 02:00:00');
INSERT INTO contracts (bk_number,bk_timestamp) VALUES (789,'2023-02-14 03:00:00');
INSERT INTO contracts (bk_number,bk_timestamp) VALUES (101,'2023-02-14 04:00:00');
INSERT INTO contracts (bk_number,bk_timestamp) VALUES (102,'2023-02-14 05:00:00');
INSERT INTO contracts (bk_number,bk_timestamp) VALUES (103,'2023-02-14 06:00:00');
INSERT INTO contracts (bk_number,bk_timestamp) VALUES (104,'2023-02-14 07:00:00');
The Result:
SELECT NOW();
┌───────────────now()─┐
│ 2023-02-15 04:17:51 │
└─────────────────────┘
SELECT * FROM contracts;
┌─bk_number─┬────────bk_timestamp─┬─────────inserted_on─┐
│ 101 │ 2023-02-14 04:00:00 │ 2023-02-15 04:08:39 │
│ 102 │ 2023-02-14 05:00:00 │ 2023-02-15 04:08:39 │
│ 103 │ 2023-02-14 06:00:00 │ 2023-02-15 04:08:39 │
│ 104 │ 2023-02-14 07:00:00 │ 2023-02-15 04:09:30 │
│ 123 │ 2023-02-14 01:00:00 │ 2023-02-15 04:07:33 │
│ 456 │ 2023-02-14 02:00:00 │ 2023-02-15 04:08:39 │
│ 789 │ 2023-02-14 03:00:00 │ 2023-02-15 04:08:39 │
└───────────┴─────────────────────┴─────────────────────┘
SELECT * FROM contracts WHERE bk_number = 123;
┌─bk_number─┬────────bk_timestamp─┬─────────inserted_on─┐
│ 123 │ 2023-02-14 01:00:00 │ 2023-02-15 04:07:33 │
└───────────┴─────────────────────┴─────────────────────┘
Suggestion:
If the table is not too big, I would recommend running OPTIMIZE TABLE once.
Check what OPTIMIZE does over here:
Clickhouse Docs/Optimize
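For example (OPTIMIZE forces an unscheduled merge of the table's data parts, so it can be heavy on large tables):
OPTIMIZE TABLE main.contracts;
-- or, to merge each partition down to a single part:
OPTIMIZE TABLE main.contracts FINAL;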
In ClickHouse version 22.1 it is possible to infer the schema.
e.g.: DESC file('nonexist', 'Protobuf') SETTINGS format_schema='file.proto:LogEntry'
But is it possible to create a table with the columns obtained from DESCRIBE?
Yes:
cat /var/lib/clickhouse/user_files/aa.csv
a, b, 4
create table t1 Engine=Log as select * from file('aa.csv');
DESCRIBE TABLE t1
┌─name─┬─type──────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1 │ Nullable(String) │ │ │ │ │ │
│ c2 │ Nullable(String) │ │ │ │ │ │
│ c3 │ Nullable(Float64) │ │ │ │ │ │
└──────┴───────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
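Alternatively, append where 0 if you only want the table created with the inferred structure but no rows copied: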
create table t1 Engine=Log as select * from file('aa.csv') where 0;
DESCRIBE TABLE t1
┌─name─┬─type──────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1 │ Nullable(String) │ │ │ │ │ │
│ c2 │ Nullable(String) │ │ │ │ │ │
│ c3 │ Nullable(Float64) │ │ │ │ │ │
└──────┴───────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
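The same pattern should carry over to other input formats. An untested sketch for the Protobuf case from the question (data.pb is a hypothetical file name; the format_schema value is the one from the question):
create table log_entries Engine=Log as
select * from file('data.pb', 'Protobuf')
where 0
settings format_schema = 'file.proto:LogEntry';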
I'm trying to split two arrays using arrayJoin().
My table:
create table test_array(
col1 Array(INT),
col2 Array(INT),
col3 String
)
engine = TinyLog;
Then I insert these values:
insert into test_array values ([1,2],[11,22],'Text');
insert into test_array values ([5,6],[55,66],'Text');
When I split the first array (col1), the result will be like this:
But what I need is to split both col1 and col2 and add them to the select.
I tried this query but it didn't work:
select arrayJoin(col1) ,arrayJoin(col2) ,col1 , col2 ,col3 from test_array;
How can I edit the query to remove the highlighted rows in the picture?
Thanks.
Serial calls to arrayJoin produce the Cartesian product; to avoid it, use ARRAY JOIN:
SELECT
c1,
c2,
col1,
col2,
col3
FROM test_array
ARRAY JOIN
col1 AS c1,
col2 AS c2
/*
┌─c1─┬─c2─┬─col1──┬─col2────┬─col3─┐
│ 1 │ 11 │ [1,2] │ [11,22] │ Text │
│ 2 │ 22 │ [1,2] │ [11,22] │ Text │
│ 5 │ 55 │ [5,6] │ [55,66] │ Text │
│ 6 │ 66 │ [5,6] │ [55,66] │ Text │
└────┴────┴───────┴─────────┴──────┘
*/
Another way: zip the arrays into tuples with arrayZip, then arrayJoin and untuple:
SELECT
untuple(arrayJoin(arrayZip(col1, col2))),
col3
FROM test_array
┌─_ut_1─┬─_ut_2─┬─col3─┐
│ 1 │ 11 │ Text │
│ 2 │ 22 │ Text │
│ 5 │ 55 │ Text │
│ 6 │ 66 │ Text │
└───────┴───────┴──────┘
I have a table like below, where state takes a limited set of values (e.g. Start, End):
CREATE TABLE event_updates (
event_id Int32,
timestamp DateTime,
state String
) ENGINE Log;
And I want to be able to quickly run queries like:
SELECT count(*)
FROM (
SELECT event_id,
minOrNullIf(timestamp, state = 'Start') as start,
minOrNullIf(timestamp, state = 'End') as end,
end - start as duration,
duration < 10 as is_fast,
duration > 300 as is_slow
FROM event_updates
GROUP BY event_id)
WHERE start >= '2020-08-20 00:00:00'
AND start < '2020-08-21 00:00:00'
AND is_slow;
But those queries are slow when there is a lot of data; I'm guessing it's because the calculations have to be done for every row.
Example data:
┌─event_id─┬───────────timestamp─┬─state─┐
│ 1 │ 2020-08-21 09:58:00 │ Start │
│ 1 │ 2020-08-21 10:18:00 │ End │
│ 2 │ 2020-08-21 10:23:00 │ Start │
│ 2 │ 2020-08-21 10:23:05 │ End │
│ 3 │ 2020-08-21 10:23:00 │ Start │
│ 3 │ 2020-08-21 10:24:00 │ End │
│ 3 │ 2020-08-21 11:24:00 │ End │
│ 4 │ 2020-08-21 10:30:00 │ Start │
└──────────┴─────────────────────┴───────┘
And example query:
SELECT
event_id,
minOrNullIf(timestamp, state = 'Start') AS start,
minOrNullIf(timestamp, state = 'End') AS end,
end - start AS duration,
duration < 10 AS is_fast,
duration > 300 AS is_slow
FROM event_updates
GROUP BY event_id
ORDER BY event_id ASC
┌─event_id─┬───────────────start─┬─────────────────end─┬─duration─┬─is_fast─┬─is_slow─┐
│ 1 │ 2020-08-21 09:58:00 │ 2020-08-21 10:18:00 │ 1200 │ 0 │ 1 │
│ 2 │ 2020-08-21 10:23:00 │ 2020-08-21 10:23:05 │ 5 │ 1 │ 0 │
│ 3 │ 2020-08-21 10:23:00 │ 2020-08-21 10:24:00 │ 60 │ 0 │ 0 │
│ 4 │ 2020-08-21 10:30:00 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │
└──────────┴─────────────────────┴─────────────────────┴──────────┴─────────┴─────────┘
What I would like to produce is a pre-calculated table like:
CREATE TABLE event_stats (
event_id Int32,
start Nullable(DateTime),
end Nullable(DateTime),
duration Nullable(Int32),
is_fast Nullable(UInt8),
is_slow Nullable(UInt8)
);
But I can't work out how to create this table with a materialized view or find a better way.
First, I would do the following:
Use the MergeTree engine instead of Log to get the benefit of the sorting key:
CREATE TABLE event_updates (
event_id Int32,
timestamp DateTime,
state String
) ENGINE MergeTree
PARTITION BY toYYYYMM(timestamp)
ORDER BY (timestamp, state);
Constrain the origin dataset by applying a WHERE clause to timestamp and state (your current query processes the whole dataset):
SELECT count(*)
FROM (
SELECT event_id,
minOrNullIf(timestamp, state = 'Start') as start,
minOrNullIf(timestamp, state = 'End') as end,
end - start as duration,
duration < 10 as is_fast,
duration > 300 as is_slow
FROM event_updates
WHERE timestamp >= '2020-08-20 00:00:00' AND timestamp < '2020-09-20 00:00:00'
AND state IN ('Start', 'End')
GROUP BY event_id
HAVING start >= '2020-08-20 00:00:00' AND start < '2020-09-20 00:00:00'
AND is_slow);
If these don't help, consider using AggregatingMergeTree to work with precalculated aggregates instead of raw data.
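For example, a rough sketch of that approach (names are illustrative, a materialized view only processes rows inserted after it is created, and duration / is_fast / is_slow are derived at query time from the merged aggregates):
CREATE TABLE event_stats_agg (
    event_id Int32,
    start_state AggregateFunction(minOrNullIf, DateTime, UInt8),
    end_state AggregateFunction(minOrNullIf, DateTime, UInt8)
) ENGINE = AggregatingMergeTree
ORDER BY event_id;

CREATE MATERIALIZED VIEW event_stats_mv TO event_stats_agg AS
SELECT
    event_id,
    minOrNullIfState(timestamp, state = 'Start') AS start_state,
    minOrNullIfState(timestamp, state = 'End') AS end_state
FROM event_updates
GROUP BY event_id;

SELECT
    event_id,
    minOrNullIfMerge(start_state) AS start,
    minOrNullIfMerge(end_state) AS end,
    end - start AS duration,
    duration < 10 AS is_fast,
    duration > 300 AS is_slow
FROM event_stats_agg
GROUP BY event_id;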
My question is very similar to this one, except that I want to exclude all names that have only one distinct value in the Location column.
Assume this is the input:
Name | Location
-------------------
Bob | Shasta
Bob | Leaves
Sean | Leaves
Sean | Leaves
Dylan | Shasta
Dylan | Redwood
Dylan | Leaves
I want the output to be
Name | Location
-------------------
Bob | Shasta
Bob | Leaves
Dylan | Shasta
Dylan | Redwood
Dylan | Leaves
In this case, Sean is being excluded because he always has the same location.
In SQL, there is the WHERE EXISTS subquery construct. How do we do this in ClickHouse?
Try this query:
SELECT Name, Location
FROM (
/* emulate the origin dataset */
SELECT test_data.1 AS Name, test_data.2 AS Location
FROM (
SELECT arrayJoin([
('Bob', 'Shasta'),
('Bob', 'Leaves'),
('Sean', 'Leaves'),
('Sean', 'Leaves'),
('Dylan', 'Shasta'),
('Dylan', 'Redwood'),
('Dylan', 'Leaves')]) AS test_data))
WHERE Name IN (
SELECT Name
FROM (
/* emulate the origin dataset */
SELECT test_data.1 AS Name, test_data.2 AS Location
FROM (
SELECT arrayJoin([
('Bob', 'Shasta'),
('Bob', 'Leaves'),
('Sean', 'Leaves'),
('Sean', 'Leaves'),
('Dylan', 'Shasta'),
('Dylan', 'Redwood'),
('Dylan', 'Leaves')]) AS test_data))
GROUP BY Name
HAVING uniq(Location) > 1)
/* result
┌─Name──┬─Location─┐
│ Bob │ Shasta │
│ Bob │ Leaves │
│ Dylan │ Shasta │
│ Dylan │ Redwood │
│ Dylan │ Leaves │
└───────┴──────────┘
*/
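Stripped of the test-data emulation, the same logic against a real table (the table name here is just a placeholder) is:
SELECT Name, Location
FROM some_table
WHERE Name IN
(
    SELECT Name
    FROM some_table
    GROUP BY Name
    HAVING uniq(Location) > 1
)
Note that uniq is an approximate distinct count; use uniqExact if exact counting matters.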