Alternative to window function in ClickHouse - clickhouse

I have a table in ClickHouse that stores date, tenant_id and some values. I want to check the status of tenant and the status is considered as active if tenant_id has values > 0 for three consecutive days
Table:
date |tenant_id|value
2021-12-28|1681 |2
2021-12-29|1681 |2
2021-12-30|1681 |0
2021-12-31|1681 |2
create table test( date Date, tenant_id UInt64, value Int64) Engine=Memory;
insert into test values
('2021-12-28',1681,2),('2021-12-29',1681,2),('2021-12-30',1681,0),('2021-12-31',1681,2),
('2021-12-28',1682,2),('2021-12-29',1682,2),('2021-12-30',1682,2),('2021-12-31',1682,2);
Expected result:
tenant_id|status
1681 |inactive
Is it possible to achieve it in ClickHouse without window function as it is restricted in my case?

select tenant_id,
if (
arrayExists(k -> length(k)>=3, -- return 1 if exists array with length >= 3
arraySplit( j -> j.2 <=0, -- split array if value <= 0
arraySort( i -> i.1, -- sort array by date
groupArray((date, value)) -- gather all rows of tenant into array
)
)
), 'active', 'inactive') status
from test
group by tenant_id
┌─tenant_id─┬─status───┐
│ 1682 │ active │
│ 1681 │ inactive │
└───────────┴──────────┘

Related

Data skipping index for Map or pair-wise arrays in Clickhouse?

I am migrating a table from Postgres to Clickhouse, and one of the columns is a jsonb column which includes custom attributes. These attributes can be different per tenant, hence we currently have 100k different custom attributes' keys stored in postgres.
I checked Clickhouse's semi-structured JSON data options, and it seems we can use either Map(String, String) or 2 Array(String) columns holding the keys and values.
However I cannot make a proper assessment which one is best, as I get pretty similar results.
To test performance I created the following table:
CREATE TABLE maptest
(
`k` Int64,
`keys` Array(String),
`values` Array(String),
`map` Map(String, String)
)
ENGINE = MergeTree
ORDER BY k
SETTINGS index_granularity = 8192;
insert into maptest
select
number,
mapKeys(map(concat('custom', toString(number%87000)), toString(number%87000))),
mapValues(map(concat('custom', toString(number%87000)), toString(number%87000))),
map(concat('custom', toString(number%87000)), toString(number%87000))
from numbers(200000000);
--- data look like these:
SELECT *
FROM maptest
LIMIT 1
Query id: 9afcb888-94d9-42ec-a4b3-1d73b8cadde0
┌─k─┬─keys────────┬─values─┬─map─────────────┐
│ 0 │ ['custom0'] │ ['0'] │ {'custom0':'0'} │
└───┴─────────────┴────────┴─────────────────┘
However, whichever method I choose to query for a specific key-value pair, I always get the whole table scanned. e.g.
SELECT count()
FROM maptest
WHERE length(arrayFilter((v, k) -> ((k = 'custom2') AND (v = '2')), values, keys)) > 0
┌─count()─┐
│ 2299 │
└─────────┘
1 row in set. Elapsed: 10.541 sec. Processed 200.00 million rows, 9.95 GB (18.97 million rows/s., 943.85 MB/s.)
SELECT count()
FROM maptest
WHERE (map['custom2']) = '2'
┌─count()─┐
│ 2299 │
└─────────┘
1 row in set. Elapsed: 11.142 sec. Processed 200.00 million rows, 8.35 GB (17.95 million rows/s., 749.32 MB/s.)
SELECT count()
FROM maptest
WHERE (values[indexOf(keys, 'custom2')]) = '2'
┌─count()─┐
│ 2299 │
└─────────┘
1 row in set. Elapsed: 3.458 sec. Processed 200.00 million rows, 9.95 GB (57.83 million rows/s., 2.88 GB/s.)
Any suggestions on data skipping indexes for any of the 2 options?
You can add data skipping index for a map field, although you will need to set lower index_granularity to get to the most optimal values between index size and how many granules will be skipped. You should build your index using mapValues (or mapKeys, depending on your needs) map function:
CREATE TABLE maptest
(
`k` Int64,
`keys` Array(String),
`values` Array(String),
`map` Map(String, String),
INDEX b mapValues(map) TYPE tokenbf_v1(2048, 16, 42) GRANULARITY 1
)
ENGINE = MergeTree
ORDER BY k
SETTINGS index_granularity = 2048; -- < lowered index_granularity!
insert into maptest
select
number,
mapKeys(map(concat('custom', toString(number%87000)), toString(number%87000))),
mapValues(map(concat('custom', toString(number%87000)), toString(number%87000))),
map(concat('custom', toString(number%87000)), toString(number%87000))
from numbers(20000000);
Now let's test it:
set send_logs_level='trace';
SELECT count()
FROM maptest
WHERE (map['custom2']) = '2';
(...)
[LAPTOP-ASLS2SOJ] 2023.02.01 11:44:52.171103 [ 96 ] {3638972e-baf3-4b48-bf10-7b944e46fc64} <Debug> default.maptest (11baab32-a7a8-4b0f-b879-ad1541cbe282) (SelectExecutor): Index `b` has dropped 9123/9767 granules.
(...)
┌─count()─┐
│ 230 │
└─────────┘
(...)
1 row in set. Elapsed: 0.107 sec. Processed 1.32 million rows, 54.52 MB (12.30 million rows/s., 508.62 MB/s.)

Would Clickhouse merge cause increase in selected marks?

If clickhouse is performing a background merge operation (lets say 10 parts into 1 part), would that cause the selected marks to go up? Or are selected marks only governed by read operations performed due to SELECT queries
It should not in general but it may because of partition pruning.
create table test( D date, K Int64, S String )
Engine=MergeTree partition by toYYYYMM(D) order by K;
system stop merges test;
insert into test select '2022-01-01', number, '' from numbers(1000000);
insert into test select '2022-01-31', number, '' from numbers(1000000);
select name, min_date, max_date, rows from system.parts where table = 'test' and active;
┌─name─────────┬───min_date─┬───max_date─┬────rows─┐
│ 202201_1_1_0 │ 2022-01-01 │ 2022-01-01 │ 1000000 │ two parts in a partition and min_date
│ 202201_2_2_0 │ 2022-01-31 │ 2022-01-31 │ 1000000 │ min_date & max_date are not intersecting
└──────────────┴────────────┴────────────┴─────────┘
explain estimate select count() from test where D between '2022-01-01' and '2022-01-15';
┌─database─┬─table─┬─parts─┬────rows─┬─marks─┐
│ dw │ test │ 1 │ 1000000 │ 123 │ -- 123 mark.
└──────────┴───────┴───────┴─────────┴───────┘
system start merges test;
optimize table test final;
select name, min_date, max_date, rows from system.parts where table = 'test' and active;
┌─name─────────┬───min_date─┬───max_date─┬────rows─┐
│ 202201_1_2_1 │ 2022-01-01 │ 2022-01-31 │ 2000000 │ one part covers the whole month
└──────────────┴────────────┴────────────┴─────────┘
explain estimate select count() from test where D between '2022-01-01' and '2022-01-15';
┌─database─┬─table─┬─parts─┬────rows─┬─marks─┐
│ dw │ test │ 1 │ 2000000 │ 245 │ -- 245 mark.
└──────────┴───────┴───────┴─────────┴───────┘
In real life you will never notice this because it's very synthetic case, no filters on primary key index, and partition column is not in primary key index.
And it does not mean that merges make query slower, it means that Clickhouse is able to leverage the fact that data is not merged yet and reads only a part of the data in a partition.

Clickhouse. Get value from json

I use Clickhouse database. There is a table with string column (data). All rows contains data like:
'[{"a":23, "b":1}]'
'[{"a":7, "b":15}]'
I wanna get all values of key "b".
1
15
Next query:
Select JSONExtractInt('data', 0, 'b') from table
return 0 all time. How i can get values of key "b"?
SELECT tupleElement(JSONExtract(j, 'Array(Tuple(a Int64, b Int64))'), 'b')[1] AS res
FROM
(
SELECT '[{"a":23, "b":1}]' AS j
UNION ALL
SELECT '[{"a":7, "b":15}]'
)
┌─res─┐
│ 1 │
└─────┘
┌─res─┐
│ 15 │
└─────┘

Clickhouse convert table with json content each row into table

I have tried to transform json in rows into a table with this json fields. After looking at Clickhouse documentation I cound't find some clickhouse FUNCTION that can handle this task
Here is the table with the
col_a
{"casa":2,"value":4}
{"casa":6,"value":47}
The proposal is to transform using only Clickhouse SQL (CREATE WITH SELECT) int this table
casa
value
2
4
6
47
SELECT
'{"casa":2,"value":4}' AS j,
JSONExtractKeysAndValuesRaw(j) AS t
┌─j────────────────────┬─t────────────────────────────┐
│ {"casa":2,"value":4} │ [('casa','2'),('value','4')] │
└──────────────────────┴──────────────────────────────┘
SELECT
'{"casa":2,"value":4}' AS j,
JSONExtract(j, 'Tuple(casa Int64, value Int64)') AS t,
tupleElement(t, 'casa') AS casa,
tupleElement(t, 'value') AS value
┌─j────────────────────┬─t─────┬─casa─┬─value─┐
│ {"casa":2,"value":4} │ (2,4) │ 2 │ 4 │
└──────────────────────┴───────┴──────┴───────┘

clickhouse: How do I find the least date in array that is above date in another column?

Basically I have the table with the following data-structure:
id_level1: Int32
id_level2: Int32
event_date: Date
arr_object_ids: Array of Int32 - sorted by next column
arr_object_dates: Array of Date - sorted ascending
What I need is to have the least object_date that is above event_date for each pair of (id_leve1, id_level2). How is that possible in Clickhouse?
Then I would use arrayElement(arr_object_ids, indexOf(arr_object_dates, solution) to get corresponding object_id
Try this query:
SELECT
id_level1,
id_level2,
/*arrayFirst(x -> x > event_date, arr_object_dates) least_date,*/
arrayFirstIndex(x -> x > event_date, arr_object_dates) least_date_index,
least_date_index = 0 ? -1 : arrayElement(arr_object_ids, least_date_index) object_id /* -1 if result not found */
FROM (
/* emulate original table */
SELECT 1 id_level1, 2 id_level2, '2020-01-03' event_date,
[4, 5, 6,7] arr_object_ids,
['2020-01-01', '2020-01-03', '2020-01-06', '2020-01-11'] arr_object_dates
UNION ALL
SELECT 3 id_level1, 4 id_level2, '2020-05-03' event_date,
[4, 5, 6,7] arr_object_ids,
['2020-01-01', '2020-01-03', '2020-01-06', '2020-01-11'] arr_object_dates)
ORDER BY event_date
/* result
┌─id_level1─┬─id_level2─┬─least_date_index─┬─object_id─┐
│ 1 │ 2 │ 3 │ 6 │
│ 3 │ 4 │ 0 │ -1 │
└───────────┴───────────┴──────────────────┴───────────┘
*/

Resources