add inserted_on column for ClickHouse

With other databases, you would add an inserted_on column to the CREATE TABLE query with inserted_on DateTime DEFAULT now().
However, it seems that ClickHouse evaluates that column on every query; I am getting the current time every time.
Table structure:
create table main.contracts(
bk_number UInt64,
bk_timestamp DateTime,
bk_hash String,
address String,
inserted_on DateTime DEFAULT now()
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(bk_timestamp)
ORDER BY (bk_number, bk_timestamp);

I re-created your scenario and it works for me with no problems.
Table Structure:
create table contracts(
bk_number UInt64,
bk_timestamp DateTime,
inserted_on DateTime DEFAULT now()
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(bk_timestamp)
ORDER BY (bk_number, bk_timestamp);
Insert Queries:
INSERT INTO contracts (bk_number,bk_timestamp) VALUES (123,'2023-02-14 01:00:00');
INSERT INTO contracts (bk_number,bk_timestamp) VALUES (456,'2023-02-14 02:00:00');
INSERT INTO contracts (bk_number,bk_timestamp) VALUES (789,'2023-02-14 03:00:00');
INSERT INTO contracts (bk_number,bk_timestamp) VALUES (101,'2023-02-14 04:00:00');
INSERT INTO contracts (bk_number,bk_timestamp) VALUES (102,'2023-02-14 05:00:00');
INSERT INTO contracts (bk_number,bk_timestamp) VALUES (103,'2023-02-14 06:00:00');
INSERT INTO contracts (bk_number,bk_timestamp) VALUES (104,'2023-02-14 07:00:00');
The Result:
SELECT NOW();
┌───────────────now()─┐
│ 2023-02-15 04:17:51 │
└─────────────────────┘
SELECT * FROM contracts;
┌─bk_number─┬────────bk_timestamp─┬─────────inserted_on─┐
│ 101 │ 2023-02-14 04:00:00 │ 2023-02-15 04:08:39 │
│ 102 │ 2023-02-14 05:00:00 │ 2023-02-15 04:08:39 │
│ 103 │ 2023-02-14 06:00:00 │ 2023-02-15 04:08:39 │
│ 104 │ 2023-02-14 07:00:00 │ 2023-02-15 04:09:30 │
│ 123 │ 2023-02-14 01:00:00 │ 2023-02-15 04:07:33 │
│ 456 │ 2023-02-14 02:00:00 │ 2023-02-15 04:08:39 │
│ 789 │ 2023-02-14 03:00:00 │ 2023-02-15 04:08:39 │
└───────────┴─────────────────────┴─────────────────────┘
SELECT * FROM contracts WHERE bk_number = 123;
┌─bk_number─┬────────bk_timestamp─┬─────────inserted_on─┐
│ 123 │ 2023-02-14 01:00:00 │ 2023-02-15 04:07:33 │
└───────────┴─────────────────────┴─────────────────────┘
Suggestion:
If the table is not too big, I would recommend running OPTIMIZE TABLE once.
Check what OPTIMIZE does here: Clickhouse Docs/Optimize
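For example, a minimal sketch of that suggestion, assuming the main.contracts table from the question:
-- merge the data parts of the table; FINAL forces a merge even if everything is already in one part
OPTIMIZE TABLE main.contracts FINAL;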

Clickhouse in_order load balancing does not seem to work as expected

I have a single-shard ClickHouse setup with 2 replicas. I tried in_order and first_or_random load balancing. But when I check the query_log table in the system schema, it seems to show that queries against a distributed table are actually going to both replicas and not just to the first replica (assuming there are no ClickHouse errors at the moment).
I have the following tables:
test_table_buffer (buffer table to which data is written)
test_table (Replicated Table)
test_table_distributed (Distributed table).
Theoretically I don't need a distributed table because I have only one shard; I created it only for the sake of testing.
Query log on the first replica:
┌──────────event_time─┬─query─────────────────────────────────────────────────────┐
│ 2022-09-03 08:28:26 │ SELECT count(1) FROM test_table_distributed WHERE __created_at >= '2022-09-03 08:28:20' │
│ 2022-09-03 08:28:25 │ SELECT count(1) FROM test_table_distributed WHERE __created_at >= '2022-09-03 08:28:20' │
│ 2022-09-03 08:28:20 │ INSERT INTO test_table_distributed VALUES │
│ 2022-09-03 08:28:20 │ INSERT INTO test_table_distributed VALUES │
On replica 2:
┌──────────event_time─┬─query───────────────────────────────────────────────────────────┐
│ 2022-09-03 08:28:33 │ INSERT INTO test_table_distributed VALUES │
│ 2022-09-03 08:28:32 │ SELECT count(1) FROM test_table_distributed WHERE __created_at >= '2022-09-03 08:28:26' │
│ 2022-09-03 08:28:32 │ INSERT INTO test_table_distributed VALUES │
│ 2022-09-03 08:28:31 │ SELECT count(1) FROM test_table_distributed WHERE __created_at >= '2022-09-03 08:28:26' │
│ 2022-09-03 08:28:26 │ INSERT INTO test_table_distributed VALUES │
│ 2022-09-03 08:28:26 │ INSERT INTO test_table_distributed VALUES │
│ 2022-09-03 08:28:13 │ SELECT count(1) FROM test_table_distributed WHERE __created_at >= '2022-09-03 08:28:07' │
│ 2022-09-03 08:28:13 │ SELECT count(1) FROM test_table_distributed WHERE __created_at >= '2022-09-03 08:28:07' │
From the log above, it seems the select queries are going to both replicas and not just to the first one.
The use-case is that we need a consistent read immediately after a write for a few use-cases, not eventual consistency. So it's okay to write to the first available replica and read from that same replica. If the other replica becomes consistent eventually through replication, that's okay.
There is a setting prefer_localhost_replica; you need to disable it:
cat /etc/clickhouse-server/users.d/prefer_localhost_replica.xml
<?xml version="1.0" ?>
<yandex>
<profiles>
<default>
<prefer_localhost_replica>0</prefer_localhost_replica>
</default>
</profiles>
</yandex>
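To verify the effective value from a client session, a quick check (assuming your user picks up the default profile):
SELECT name, value
FROM system.settings
WHERE name IN ('prefer_localhost_replica', 'load_balancing');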

create table with schema inferred from data

In ClickHouse version 22.1 it is possible to infer the schema.
e.g.: DESC file('nonexist', 'Protobuf') SETTINGS format_schema='file.proto:LogEntry'
But is it possible to create a table with columns obtained from DESCRIBE?
Yes:
cat /var/lib/clickhouse/user_files/aa.csv
a, b, 4
create table t1 Engine=Log as select * from file('aa.csv');
DESCRIBE TABLE t1
┌─name─┬─type──────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1 │ Nullable(String) │ │ │ │ │ │
│ c2 │ Nullable(String) │ │ │ │ │ │
│ c3 │ Nullable(Float64) │ │ │ │ │ │
└──────┴───────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
-- recreate the table with the inferred structure only, without inserting any rows
drop table t1;
create table t1 Engine=Log as select * from file('aa.csv') where 0;
DESCRIBE TABLE t1
┌─name─┬─type──────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1 │ Nullable(String) │ │ │ │ │ │
│ c2 │ Nullable(String) │ │ │ │ │ │
│ c3 │ Nullable(Float64) │ │ │ │ │ │
└──────┴───────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
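If you use the where 0 variant, the table is created empty; a follow-up sketch for loading the data afterwards (assuming the same aa.csv in user_files):
-- populate the empty table created from the inferred schema
INSERT INTO t1 SELECT * FROM file('aa.csv');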

how to use array join in Clickhouse

I'm trying to split 2 arrays using arrayJoin().
My table:
create table test_array(
col1 Array(INT),
col2 Array(INT),
col3 String
)
engine = TinyLog;
Then I insert these values:
insert into test_array values ([1,2],[11,22],'Text');
insert into test_array values ([5,6],[55,66],'Text');
When I split the first array, col1, the result will be like this:
But what I need is to split both col1 and col2 and include them in the select.
I tried this query, but it didn't work:
select arrayJoin(col1), arrayJoin(col2), col1, col2, col3 from test_array;
How can I edit the query to remove the highlighted rows in the picture?
Thanks.
Serial calls of arrayJoin produce the Cartesian product; to avoid it, use ARRAY JOIN:
SELECT
c1,
c2,
col1,
col2,
col3
FROM test_array
ARRAY JOIN
col1 AS c1,
col2 AS c2
/*
┌─c1─┬─c2─┬─col1──┬─col2────┬─col3─┐
│ 1 │ 11 │ [1,2] │ [11,22] │ Text │
│ 2 │ 22 │ [1,2] │ [11,22] │ Text │
│ 5 │ 55 │ [5,6] │ [55,66] │ Text │
│ 6 │ 66 │ [5,6] │ [55,66] │ Text │
└────┴────┴───────┴─────────┴──────┘
*/
One more way -- tuple():
SELECT
untuple(arrayJoin(arrayZip(col1, col2))),
col3
FROM test_array
┌─_ut_1─┬─_ut_2─┬─col3─┐
│ 1 │ 11 │ Text │
│ 2 │ 22 │ Text │
│ 5 │ 55 │ Text │
│ 6 │ 66 │ Text │
└───────┴───────┴──────┘
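If the auto-generated _ut_1/_ut_2 names get in the way, a sketch of the same idea with explicit aliases (the names c1/c2 are just examples):
SELECT
tupleElement(t, 1) AS c1,
tupleElement(t, 2) AS c2,
col3
FROM test_array
ARRAY JOIN arrayZip(col1, col2) AS t
It returns the same four rows, with the columns named c1 and c2.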

Group by date with sparkline-like data in one query

I have time-series data from similar hosts stored in a ClickHouse table with the following structure:
event_type | event_day
------------|---------------------
type_1 | 2017-11-09 20:11:28
type_1 | 2017-11-09 20:11:25
type_2 | 2017-11-09 20:11:23
type_2 | 2017-11-09 20:11:21
Each row in the table represents the presence of a value 1 for event_type at that datetime. To quickly assess the situation, I need to show the sum (total) plus the last seven daily values (pulse), like this:
event_type | day | total | pulse
------------|------------|-------|-----------------------------
type_1 | 2017-11-09 | 876 | 12,9,23,67,5,34,10
type_2 | 2017-11-09 | 11865 | 267,120,234,425,102,230,150
I tried to get it with one query in the following way, but it failed: the pulse consists of the same repeated value:
with
arrayMap(x -> today() - 7 + x, range(7)) as week_range,
arrayMap(x -> count(event_type), week_range) as pulse
select
event_type,
toDate(event_date) as day,
count() as total,
pulse
from database.table
group by day, event_type
event_type | day | total | pulse
------------|------------|-------|-------------------------------------------
type_1 | 2017-11-09 | 876 | 876,876,876,876,876,876,876
type_2 | 2017-11-09 | 11865 | 11865,11865,11865,11865,11865,11865,11865
Please point out where my mistake is and how to get the desired result.
select event_type, groupArray(1)(day)[1], arraySum(pulse) total7, groupArray(7)(cnt) pulse
from (
select
event_type,
toDate(event_date) as day,
count() as cnt
from database.table
where day >= today()-30
group by event_type,day
order by event_type,day desc
)
group by event_type
order by event_type
I would consider calculating pulse on the server side; CH just provides the required data.
The neighbor window function can be used:
SELECT
number,
[neighbor(number, -7), neighbor(number, -6), neighbor(number, -5), neighbor(number, -4), neighbor(number, -3), neighbor(number, -2), neighbor(number, -1)] AS pulse
FROM
(
SELECT number
FROM numbers(10, 15)
ORDER BY number ASC
)
┌─number─┬─pulse──────────────────┐
│ 10 │ [0,0,0,0,0,0,0] │
│ 11 │ [0,0,0,0,0,0,10] │
│ 12 │ [0,0,0,0,0,10,11] │
│ 13 │ [0,0,0,0,10,11,12] │
│ 14 │ [0,0,0,10,11,12,13] │
│ 15 │ [0,0,10,11,12,13,14] │
│ 16 │ [0,10,11,12,13,14,15] │
│ 17 │ [10,11,12,13,14,15,16] │
│ 18 │ [11,12,13,14,15,16,17] │
│ 19 │ [12,13,14,15,16,17,18] │
│ 20 │ [13,14,15,16,17,18,19] │
│ 21 │ [14,15,16,17,18,19,20] │
│ 22 │ [15,16,17,18,19,20,21] │
│ 23 │ [16,17,18,19,20,21,22] │
│ 24 │ [17,18,19,20,21,22,23] │
└────────┴────────────────────────┘
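Applied to the original table, a rough sketch could look like the following (assuming the database.table and event_date column from the question; note that neighbor ignores GROUP BY boundaries, so the inner query is filtered to a single event_type and ordered by day):
SELECT
day,
cnt,
[neighbor(cnt, -6), neighbor(cnt, -5), neighbor(cnt, -4), neighbor(cnt, -3), neighbor(cnt, -2), neighbor(cnt, -1), cnt] AS pulse
FROM
(
SELECT toDate(event_date) AS day, count() AS cnt
FROM database.table
WHERE event_type = 'type_1' AND day >= today() - 30
GROUP BY day
ORDER BY day
)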

Materialized view for calculated results

I have a table like the one below, where state is a limited set of updates (e.g. Start, End):
CREATE TABLE event_updates (
event_id Int32,
timestamp DateTime,
state String
) ENGINE Log;
And I want to be able quickly run queries like:
SELECT count(*)
FROM (
SELECT event_id,
minOrNullIf(timestamp, state = 'Start') as start,
minOrNullIf(timestamp, state = 'End') as end,
end - start as duration,
duration < 10 as is_fast,
duration > 300 as is_slow
FROM event_updates
GROUP BY event_id)
WHERE start >= '2020-08-20 00:00:00'
AND start < '2020-09-20 00:00:00'
AND is_slow;
But those queries are slow when there is a lot of data, I'm guessing because the calculations are required for every row.
Example data:
┌─event_id─┬───────────timestamp─┬─state─┐
│ 1 │ 2020-08-21 09:58:00 │ Start │
│ 1 │ 2020-08-21 10:18:00 │ End │
│ 2 │ 2020-08-21 10:23:00 │ Start │
│ 2 │ 2020-08-21 10:23:05 │ End │
│ 3 │ 2020-08-21 10:23:00 │ Start │
│ 3 │ 2020-08-21 10:24:00 │ End │
│ 3 │ 2020-08-21 11:24:00 │ End │
│ 4 │ 2020-08-21 10:30:00 │ Start │
└──────────┴─────────────────────┴───────┘
And example query:
SELECT
event_id,
minOrNullIf(timestamp, state = 'Start') AS start,
minOrNullIf(timestamp, state = 'End') AS end,
end - start AS duration,
duration < 10 AS is_fast,
duration > 300 AS is_slow
FROM event_updates
GROUP BY event_id
ORDER BY event_id ASC
┌─event_id─┬───────────────start─┬─────────────────end─┬─duration─┬─is_fast─┬─is_slow─┐
│ 1 │ 2020-08-21 09:58:00 │ 2020-08-21 10:18:00 │ 1200 │ 0 │ 1 │
│ 2 │ 2020-08-21 10:23:00 │ 2020-08-21 10:23:05 │ 5 │ 1 │ 0 │
│ 3 │ 2020-08-21 10:23:00 │ 2020-08-21 10:24:00 │ 60 │ 0 │ 0 │
│ 4 │ 2020-08-21 10:30:00 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │
└──────────┴─────────────────────┴─────────────────────┴──────────┴─────────┴─────────┘
What I would like to produce is a pre-calculated table like:
CREATE TABLE event_stats (
event_id Int32,
start Nullable(DateTime),
end Nullable(DateTime),
duration Nullable(Int32),
is_fast Nullable(UInt8),
is_slow Nullable(UInt8)
);
But I can't work out how to create this table with a materialized view or find a better way.
First, I would use the MergeTree engine instead of Log to get the benefits of a sorting key:
CREATE TABLE event_updates (
event_id Int32,
timestamp DateTime,
state String
) ENGINE MergeTree
PARTITION BY toYYYYMM(timestamp)
ORDER BY (timestamp, state);
Then constrain the source dataset by applying a WHERE clause to timestamp and state (your original query processes the whole dataset):
SELECT count(*)
FROM (
SELECT event_id,
minOrNullIf(timestamp, state = 'Start') as start,
minOrNullIf(timestamp, state = 'End') as end,
end - start as duration,
duration < 10 as is_fast,
duration > 300 as is_slow
FROM event_updates
WHERE timestamp >= '2020-08-20 00:00:00' AND timestamp < '2020-09-20 00:00:00'
AND state IN ('Start', 'End')
GROUP BY event_id
HAVING start >= '2020-08-20 00:00:00' AND start < '2020-09-20 00:00:00'
AND is_slow);
If these don't help, consider using AggregatingMergeTree to work with precalculated aggregates rather than raw data.
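A rough, untested sketch of that last option using the -State/-Merge combinators (the event_stats_agg and event_stats_mv names are made up; note that minIf, unlike minOrNullIf, returns the epoch instead of NULL when no row matches):
CREATE TABLE event_stats_agg (
event_id Int32,
start_state AggregateFunction(minIf, DateTime, UInt8),
end_state AggregateFunction(minIf, DateTime, UInt8)
) ENGINE AggregatingMergeTree
ORDER BY event_id;

CREATE MATERIALIZED VIEW event_stats_mv TO event_stats_agg AS
SELECT
event_id,
minIfState(timestamp, state = 'Start') AS start_state,
minIfState(timestamp, state = 'End') AS end_state
FROM event_updates
GROUP BY event_id;

-- query side: merge the partial states and derive the same flags as before
SELECT
event_id,
minIfMerge(start_state) AS start,
minIfMerge(end_state) AS end,
end - start AS duration,
duration < 10 AS is_fast,
duration > 300 AS is_slow
FROM event_stats_agg
GROUP BY event_id;
Keep in mind that a materialized view only sees rows inserted after it is created, so existing data would need to be backfilled with an INSERT ... SELECT.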
