Clickhouse bloom filter index seems too slow

I executed the following query, but it processed ~1B rows and took a total of 75 seconds for a simple count.
SELECT count(*)
FROM events_distributed
WHERE (orgId = '174a4727-1116-4c5c-8234-ab76f2406c4a') AND (timestamp >= '2022-12-05 00:00:00.000000000')
Query id: e4312ff5-6add-4757-8deb-d68e0f3e29d9
┌──count()─┐
│ 13071204 │
└──────────┘
1 row in set. Elapsed: 74.951 sec. Processed 979.00 million rows, 8.26 GB (13.06 million rows/s., 110.16 MB/s.)
I am wondering how I can speed this up. My events table has the following PARTITION BY and ORDER BY columns, plus a bloom filter index on orgid:
PARTITION BY toDate(timestamp)
ORDER BY (timestamp);
INDEX idx_orgid orgid TYPE bloom_filter(0.01) GRANULARITY 1,
Below is the execution plan
EXPLAIN indexes = 1
SELECT count(*)
FROM events_distributed
WHERE (orgid = '174a4727-1116-4c5c-8234-ab76f240fc4a') AND (timestamp >= '2022-12-05 00:00:00.000000000') AND (timestamp <= '2022-12-06 00:00:00.000000000')
Query id: 879c2ce5-c4c7-4efc-b0e2-25613848afad
┌─explain────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Expression ((Projection + Before ORDER BY)) │
│ MergingAggregated │
│ Union │
│ Aggregating │
│ Expression (Before GROUP BY) │
│ Filter (WHERE) │
│ ReadFromMergeTree (users.events) │
│ Indexes: │
│ MinMax │
│ Keys: │
│ timestamp │
│ Condition: and((timestamp in (-Inf, '1670284800']), (timestamp in ['1670198400', +Inf))) │
│ Parts: 12/342 │
│ Granules: 42122/407615 │
│ Partition │
│ Keys: │
│ toDate(timestamp) │
│ Condition: and((toDate(timestamp) in (-Inf, 19332]), (toDate(timestamp) in [19331, +Inf))) │
│ Parts: 12/12 │
│ Granules: 42122/42122 │
│ PrimaryKey │
│ Keys: │
│ timestamp │
│ Condition: and((timestamp in (-Inf, '1670284800']), (timestamp in ['1670198400', +Inf))) │
│ Parts: 12/12 │
│ Granules: 30696/42122 │
│ Skip │
│ Name: idx_orgid │
│ Description: bloom_filter GRANULARITY 1 │
│ Parts: 8/12 │
│ Granules: 20556/30696 │
│ ReadFromRemote (Read from remote replica) │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
32 rows in set. Elapsed: 0.129 sec.
How can I speed up this query? Processing 1B rows to produce a count of 13M suggests something is totally off. Would creating a set index on orgid be any better? I will have at most 10K orgs.
The queries I typically run are:
SELECT org_level, min(timestamp) as minTimeStamp,max(timestamp) as maxTimeStamp, toStartOfInterval(toDateTime(timestamp), INTERVAL <step> second) as roundedDownTs, count(*) as cnt, orgid
FROM events_distributed
WHERE orgid = 'foo' and timestamp BETWEEN <one week>
GROUP BY roundedDownTs, orgid, org_level
ORDER BY roundedDownTs DESC;
Please note that <step> here would be any of the following values: 0, 60, 240, 1440, 10080.
And another query over a one-week time slice (though it can be any time slice), always with the results in descending order because this is time-series data:
SELECT org_text
FROM events_distributed
WHERE (orgid = '174a4727-1116-4c5c-8234-ab76f2406c4a') AND (timestamp >= '2022-12-01 00:00:00.000000000' and timestamp <= '2022-12-07 00:00:00.000000000') order by timestamp DESC LIMIT 51;

You don't use the primary index.
I suggest using
PARTITION BY toDate(timestamp)
ORDER BY (orgId, timestamp)
https://kb.altinity.com/engines/mergetree-table-engine-family/pick-keys/
and removing the bloom_filter index.
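A minimal sketch of what that migration could look like. The new table name (events_v2), the column types, and the engine are assumptions based on the question, and changing the sorting key of an existing MergeTree table to start with a different column generally means creating a new table and backfilling it:
CREATE TABLE events_v2
(
    orgId String,
    timestamp DateTime64(9),
    org_level String,
    org_text String
    -- ...plus the remaining columns of the original events table
)
ENGINE = MergeTree
PARTITION BY toDate(timestamp)
ORDER BY (orgId, timestamp);

-- backfill, then point events_distributed at the new table
INSERT INTO events_v2 SELECT * FROM events;
With orgId leading the sorting key, the WHERE orgId = '...' filter is pruned by the primary key index itself rather than by a skip index, so only the granules belonging to that org are read.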

Related

Would Clickhouse merge cause increase in selected marks?

If ClickHouse is performing a background merge operation (let's say 10 parts into 1 part), would that cause the selected marks to go up? Or are selected marks governed only by read operations performed by SELECT queries?
In general it should not, but it may, because of partition pruning.
create table test( D date, K Int64, S String )
Engine=MergeTree partition by toYYYYMM(D) order by K;
system stop merges test;
insert into test select '2022-01-01', number, '' from numbers(1000000);
insert into test select '2022-01-31', number, '' from numbers(1000000);
select name, min_date, max_date, rows from system.parts where table = 'test' and active;
┌─name─────────┬───min_date─┬───max_date─┬────rows─┐
│ 202201_1_1_0 │ 2022-01-01 │ 2022-01-01 │ 1000000 │ -- two parts in one partition;
│ 202201_2_2_0 │ 2022-01-31 │ 2022-01-31 │ 1000000 │ -- their min_date/max_date ranges do not intersect
└──────────────┴────────────┴────────────┴─────────┘
explain estimate select count() from test where D between '2022-01-01' and '2022-01-15';
┌─database─┬─table─┬─parts─┬────rows─┬─marks─┐
│ dw │ test │ 1 │ 1000000 │ 123 │ -- 123 marks
└──────────┴───────┴───────┴─────────┴───────┘
system start merges test;
optimize table test final;
select name, min_date, max_date, rows from system.parts where table = 'test' and active;
┌─name─────────┬───min_date─┬───max_date─┬────rows─┐
│ 202201_1_2_1 │ 2022-01-01 │ 2022-01-31 │ 2000000 │ one part covers the whole month
└──────────────┴────────────┴────────────┴─────────┘
explain estimate select count() from test where D between '2022-01-01' and '2022-01-15';
┌─database─┬─table─┬─parts─┬────rows─┬─marks─┐
│ dw │ test │ 1 │ 2000000 │ 245 │ -- 245 marks
└──────────┴───────┴───────┴─────────┴───────┘
In real life you will never notice this, because it's a very synthetic case: there are no filters on the primary key, and the partition column is not in the primary key.
And it does not mean that merges make queries slower; it means that ClickHouse is able to leverage the fact that the data is not merged yet and reads only part of the data in a partition.
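As a side illustration (my own addition, not part of the original answer, with a hypothetical table name test_pk): if the partition column is also included in the sorting key, the primary key index can prune the date range inside a merged part, so the number of selected marks no longer depends on whether a merge has happened.
create table test_pk ( D date, K Int64, S String )
Engine=MergeTree partition by toYYYYMM(D) order by (D, K);

insert into test_pk select * from test;

-- the D filter is now evaluated against the primary key as well, so only the
-- marks covering 2022-01-01..2022-01-15 are selected, merged or not
explain estimate select count() from test_pk where D between '2022-01-01' and '2022-01-15';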

Way to achieve overlapping GROUP BY groups with correct subtotals in Clickhouse

Assuming following schema:
CREATE TABLE test
(
date Date,
user_id UInt32,
user_answer UInt8,
user_multi_choice_answer Array(UInt8),
events UInt32
)
ENGINE = MergeTree() ORDER BY date;
And contents:
INSERT INTO test VALUES
('2020-01-01', 1, 5, [2, 3], 15),
('2020-01-01', 2, 6, [1, 2], 7);
Let's say I want to make a query "give me # of users and # of their events grouped by date and user_answer, with subtotals". That's easy:
select date, user_answer, count(distinct user_id), sum(events) from test group by date, user_answer with rollup;
┌───────date─┬─user_answer─┬─uniqExact(user_id)─┬─sum(events)─┐
│ 2020-01-01 │ 5 │ 1 │ 15 │
│ 2020-01-01 │ 6 │ 1 │ 7 │
│ 2020-01-01 │ 0 │ 2 │ 22 │
│ 0000-00-00 │ 0 │ 2 │ 22 │
└────────────┴─────────────┴────────────────────┴─────────────┘
What I can't easily do is make queries with overlapping groups, like when grouping by individual options of a multiple-choice question. For example:
# of users and # of their events grouped by date and user_multi_choice_answer, with subtotals
# of users and # of their events grouped by arbitrary hand-written grouping conditions, like "compare users with user_answer=5 and has(user_multi_choice_answer, 1) to users with has(user_multi_choice_answer, 2)"
For example, with the first query, I would like to see the following:
┌───────date─┬─user_multi_choice_answer─┬─uniqExact(user_id)─┬─sum(events)─┐
│ 2020-01-01 │ 1 │ 1 │ 15 │
│ 2020-01-01 │ 2 │ 2 │ 22 │
│ 2020-01-01 │ 3 │ 1 │ 7 │
│ 2020-01-01 │ 0 │ 2 │ 22 │
│ 0000-00-00 │ 0 │ 2 │ 22 │
└────────────┴──────────────────────────┴────────────────────┴─────────────┘
And for the second:
┌─my_grouping_id─┬─uniqExact(user_id)─┬─sum(events)─┐
│ 1 │ 1 │ 15 │ # users fulfilling arbitrary condition #1
│ 2 │ 2 │ 22 │ # users fulfilling arbitrary condition #2
│ 0 │ 2 │ 22 │ # subtotal
└────────────────┴────────────────────┴─────────────┘
The closest I can get to that is by using arrayJoin():
select date, arrayJoin(user_multi_choice_answer) as multi_answer, count(distinct user_id), sum(events)
from test group by date, multi_answer with rollup;
select arrayJoin(
arrayConcat(
if(user_answer=5 and has(user_multi_choice_answer, 3), [1], []),
if(has(user_multi_choice_answer, 2), [2], [])
)
) as my_grouping_id, count(distinct user_id), sum(events)
from test group by my_grouping_id with rollup;
But that's not a good solution for two reasons:
While it calculates correct results for grouping, the result for sum(events) is not correct for subtotals (as duplicated rows count multiple times)
It doesn't seem efficient, as it makes a lot of data duplication (while I just want the same row to get aggregated into several groups)
So, again, I'm looking for a way to easily group answers to multiple-choice questions and to group by arbitrary conditions on some columns. I'm okay with changing the schema to make that possible, but I'm mostly hoping ClickHouse has a built-in way to achieve this.
While it calculates correct results for grouping, the result for sum(events) is not correct for subtotals (as duplicated rows count multiple times)
You can manually create my_grouping_id = 0 without using rollup. For example,
select arrayJoin(
arrayConcat(
[0],
if(user_answer=5 and has(user_multi_choice_answer, 3), [1], []),
if(has(user_multi_choice_answer, 2), [2], [])
)
) as my_grouping_id, count(distinct user_id), sum(events)
from test group by my_grouping_id
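With the two sample rows inserted above, this should return something along these lines (row order may differ); because every row contributes exactly once to the manual group 0, its sum(events) subtotal is now correct:
┌─my_grouping_id─┬─uniqExact(user_id)─┬─sum(events)─┐
│              0 │                  2 │          22 │
│              1 │                  1 │          15 │
│              2 │                  2 │          22 │
└────────────────┴────────────────────┴─────────────┘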
It doesn't seem efficient, as it makes a lot of data duplication (while I just want the same row to get aggregated into several groups)
Currently it's not possible. But I see possibilities. I'll try to make a POC of GROUP BY ARRAY. It seems to be a valid use case.

Use csv-data in clickhouse table

I have a problem when I want to use data from a CSV file in a table I have created. The database I created is called "test" and the table is created as follows:
CREATE TABLE testing
(
`year` Int16,
`amount` Int16,
`rate` Float32,
`number` Int16
)
ENGINE = Log
Ok.
0 rows in set. Elapsed: 0.033 sec.
I created all these columns to be able to cover all the data in the CSV file. I've read through the ClickHouse documentation but just can't figure out how to get the data into my database.
I tried this:
$ cat test.csv | clickhouse-client \
> -- database =test \
> --query='INSERT test FORMAT CSV'
Code: 62. DB::Exception: Syntax error: failed at position 1 (line 1, col 1): 2010,646,1.00,13
2010,2486,1.00,19
2010,8178,1.00,10
2010,15707,1.00,4
2010,15708,1.00,10
2010,15718,1.00,4
2010,16951,1.00,8
2010,17615,1.00,13
2010. Unrecognized token
Link: https://yadi.sk/d/ijJlmnBjsjBVc
cat test.csv |clickhouse-client -d test -q 'INSERT into testing FORMAT CSV'
SELECT *
FROM test.testing
┌─year─┬─amount─┬─rate─┬─number─┐
│ 2010 │ 646 │ 1 │ 13 │
│ 2010 │ 2486 │ 1 │ 19 │
│ 2010 │ 8178 │ 1 │ 10 │
│ 2010 │ 15707 │ 1 │ 4 │
│ 2010 │ 15708 │ 1 │ 10 │
│ 2010 │ 15718 │ 1 │ 4 │
│ 2010 │ 16951 │ 1 │ 8 │
│ 2010 │ 17615 │ 1 │ 13 │
│ 2010 │ 17616 │ 1 │ 4 │
│ 2010 │ 17617 │ 1 │ 8 │
│ 2010 │ 17618 │ 1 │ 9 │
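The error in the question shows the CSV data itself being parsed as SQL, which suggests the mangled options meant no query reached the client; the working command above also spells out INSERT INTO testing. An equivalent invocation without the pipe, shown as a sketch with the same file, database, and table names, is:
clickhouse-client --database=test \
    --query='INSERT INTO testing FORMAT CSV' < test.csv
If the file had a header row, FORMAT CSVWithNames could be used instead.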

memory used and execution order in sub-query

I'm playing with data in csv format from https://dev.maxmind.com/geoip/geoip2/geolite2/.
Generally, it's data that maps IP blocks to ASN and country.
I have 2 tables, both using the Memory engine; the first has 299727 records, the second 406685.
SELECT *
FROM __ip_block_to_country
LIMIT 5
┌─network────┬───────id─┬───min_ip─┬───max_ip─┬─geoname_id─┬─country_iso_code─┬─country_name─┐
│ 1.0.0.0/24 │ 16777216 │ 16777217 │ 16777472 │ 2077456 │ AU │ Australia │
│ 1.0.1.0/24 │ 16777472 │ 16777473 │ 16777728 │ 1814991 │ CN │ China │
│ 1.0.2.0/23 │ 16777728 │ 16777729 │ 16778240 │ 1814991 │ CN │ China │
│ 1.0.4.0/22 │ 16778240 │ 16778241 │ 16779264 │ 2077456 │ AU │ Australia │
│ 1.0.8.0/21 │ 16779264 │ 16779265 │ 16781312 │ 1814991 │ CN │ China │
└────────────┴──────────┴──────────┴──────────┴────────────┴──────────────────┴──────────────┘
SELECT *
FROM __ip_block_to_asn
LIMIT 5
┌─network──────┬─autonomous_system_number─┬─autonomous_system_organization─┬───────id─┬─subnet_count─┬───min_ip─┬───max_ip─┐
│ 1.0.0.0/24 │ 13335 │ Cloudflare Inc │ 16777216 │ 255 │ 16777217 │ 16777472 │
│ 1.0.4.0/22 │ 56203 │ Gtelecom-AUSTRALIA │ 16778240 │ 1023 │ 16778241 │ 16779264 │
│ 1.0.16.0/24 │ 2519 │ ARTERIA Networks Corporation │ 16781312 │ 255 │ 16781313 │ 16781568 │
│ 1.0.64.0/18 │ 18144 │ Energia Communications,Inc. │ 16793600 │ 16383 │ 16793601 │ 16809984 │
│ 1.0.128.0/17 │ 23969 │ TOT Public Company Limited │ 16809984 │ 32767 │ 16809985 │ 16842752 │
└──────────────┴──────────────────────────┴────────────────────────────────┴──────────┴──────────────┴──────────┴──────────┘
Now, I want to examine which country covers the entire IP pool of a given ASN. The query below just obtains the indexes of the satisfying countries.
SELECT idx from(
SELECT
(
SELECT groupArray(min_ip),groupArray(max_ip),groupArray(country_iso_code),groupArray(country_name)
FROM __ip_block_to_country
) t,
arrayFilter((i,mii, mai) -> min_ip >= mii and max_ip <= mai, arrayEnumerate(t.1), t.1, t.2) as idx
FROM __ip_block_to_asn
);
I got the following exception:
Received exception from server (version 1.1.54394):
Code: 241. DB::Exception: Received from localhost:9000, ::1. DB::Exception: Memory limit (for query) exceeded: would use 512.02 GiB (attempt to allocate chunk of 549755813888 bytes), maximum: 37.25 GiB.
My question is:
It seems like the statement SELECT groupArray(min_ip), groupArray(max_ip), groupArray(country_iso_code), groupArray(country_name) is executed for every record of __ip_block_to_asn, which is why the query needs so much memory. Is that what happens in my query?
The scalar subquery is executed only once.
But to execute arrayFilter, the arrays are multiplied by the number of rows in each processed block from the __ip_block_to_asn table. It is something like a cross join of the two tables.
To overcome this, you can use a smaller block size for the SELECT from __ip_block_to_asn.
It is controlled by the max_block_size setting. But for Memory tables, blocks always keep the size they had when they were inserted into the table, regardless of the max_block_size setting during SELECT. To get a flexible block size, you can reload this table into a TinyLog engine:
CREATE TABLE __ip_block_to_asn2 ENGINE = TinyLog AS SELECT * FROM __ip_block_to_asn
Then execute:
SET max_block_size = 10;
SELECT idx from(
SELECT
(
SELECT groupArray(min_ip),groupArray(max_ip),groupArray(country_iso_code),groupArray(country_name)
FROM __ip_block_to_country
) t,
arrayFilter((i,mii, mai) -> min_ip >= mii and max_ip <= mai, arrayEnumerate(t.1), t.1, t.2) as idx
FROM __ip_block_to_asn2
);
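A rough back-of-envelope check of the memory estimate in the error (my own sketch, using the row counts from the question and assuming 8 bytes per UInt64 array element): if each row of the block gets its own copy of a ~300K-element array, one numeric array column alone already lands in the same ballpark as the 512 GiB the server refused to allocate.
-- order-of-magnitude estimate only: 299727 elements * 8 bytes, replicated
-- for each of the 406685 rows if they arrive as a single block
SELECT formatReadableSize(299727 * 8 * 406685) AS one_array_column_estimate;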

How to filter rows from a Julia Array based on the value in a specified column?

I have data like this in a text file:
CLASS col2 col3 ...
1 ... ... ...
1 ... ... ...
2 ... ... ...
2 ... ... ...
2 ... ... ...
I load them using the following code:
data = readdlm("file.txt")[2:end, :] # without header line
And now I would like to get an array with only the rows from class 1.
(The data could be loaded using some other function if that would help.)
Logical indexing is the straightforward way to do filtering on an array:
data[data[:,1] .== 1, :]
If, though, you read your file in as a data frame, you'll have more options available to you, and it'll keep track of your headers:
julia> using DataFrames
julia> df = readtable("file.txt", separator=' ')
5×4 DataFrames.DataFrame
│ Row │ CLASS │ col2 │ col3 │ _ │
├─────┼───────┼───────┼───────┼───────┤
│ 1 │ 1 │ "..." │ "..." │ "..." │
│ 2 │ 1 │ "..." │ "..." │ "..." │
│ 3 │ 2 │ "..." │ "..." │ "..." │
│ 4 │ 2 │ "..." │ "..." │ "..." │
│ 5 │ 2 │ "..." │ "..." │ "..." │
julia> df[df[:CLASS] .== 1, :] # Refer to the column by its header name
2×4 DataFrames.DataFrame
│ Row │ CLASS │ col2 │ col3 │ _ │
├─────┼───────┼───────┼───────┼───────┤
│ 1 │ 1 │ "..." │ "..." │ "..." │
│ 2 │ 1 │ "..." │ "..." │ "..." │
There are even more tools available with the DataFramesMeta package that aim to make this simpler (and other packages actively under development). You can use its @where macro to do SQL-style filtering:
julia> using DataFramesMeta
julia> @where(df, :CLASS .== 1)
2×4 DataFrames.DataFrame
│ Row │ CLASS │ col2 │ col3 │ _ │
├─────┼───────┼───────┼───────┼───────┤
│ 1 │ 1 │ "..." │ "..." │ "..." │
│ 2 │ 1 │ "..." │ "..." │ "..." │
Another approach is index-based filtering with find:
data[find(x -> data[x, 1] == 1, 1:size(data)[1]), :]
