I compared groupBitmap and uniqExact: uniqExact is roughly twice as fast as groupBitmap:
ClickHouseqlwp0001.mrs-41io.com :) select groupBitmap(LO_ORDERKEY) from lineorder
SELECT groupBitmap(LO_ORDERKEY)
FROM lineorder
Query id: a48055ca-bbfa-4937-b551-07c2c321a77c
┌─groupBitmap(LO_ORDERKEY)─┐
│ 414044233 │
└──────────────────────────┘
1 rows in set. Elapsed: 49.457 sec. Processed 1.66 billion rows, 6.62 GB (33.49 million rows/s., 133.95 MB/s.)
ClickHouseqlwp0001.mrs-41io.com :) select uniqExact(LO_ORDERKEY) from lineorder
SELECT uniqExact(LO_ORDERKEY)
FROM lineorder
Query id: bef46061-59d6-4121-aba4-68a74660efac
┌─uniqExact(LO_ORDERKEY)─┐
│ 414044233 │
└────────────────────────┘
1 rows in set. Elapsed: 29.201 sec. Processed 1.66 billion rows, 6.62 GB (56.72 million rows/s., 226.87 MB/s.)
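If only the distinct count itself is needed (rather than a reusable bitmap), an approximate aggregate such as uniq or uniqCombined is usually faster still and uses much less memory; a sketch against the same table (the exact speedup and error depend on your data):
SELECT uniqCombined(LO_ORDERKEY) -- approximate distinct count, small relative error
FROM lineorder;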
I am migrating a table from Postgres to ClickHouse, and one of the columns is a jsonb column that holds custom attributes. These attributes can differ per tenant, so we currently have about 100k distinct custom attribute keys stored in Postgres.
I checked ClickHouse's options for semi-structured JSON data, and it seems we can use either a Map(String, String) column or two Array(String) columns holding the keys and values.
However, I cannot make a proper assessment of which one is best, as I get pretty similar results.
To test performance I created the following table:
CREATE TABLE maptest
(
`k` Int64,
`keys` Array(String),
`values` Array(String),
`map` Map(String, String)
)
ENGINE = MergeTree
ORDER BY k
SETTINGS index_granularity = 8192;
insert into maptest
select
number,
mapKeys(map(concat('custom', toString(number%87000)), toString(number%87000))),
mapValues(map(concat('custom', toString(number%87000)), toString(number%87000))),
map(concat('custom', toString(number%87000)), toString(number%87000))
from numbers(200000000);
-- the data looks like this:
SELECT *
FROM maptest
LIMIT 1
Query id: 9afcb888-94d9-42ec-a4b3-1d73b8cadde0
┌─k─┬─keys────────┬─values─┬─map─────────────┐
│ 0 │ ['custom0'] │ ['0'] │ {'custom0':'0'} │
└───┴─────────────┴────────┴─────────────────┘
However, whichever method I use to query for a specific key-value pair, the whole table is always scanned, e.g.:
SELECT count()
FROM maptest
WHERE length(arrayFilter((v, k) -> ((k = 'custom2') AND (v = '2')), values, keys)) > 0
┌─count()─┐
│ 2299 │
└─────────┘
1 row in set. Elapsed: 10.541 sec. Processed 200.00 million rows, 9.95 GB (18.97 million rows/s., 943.85 MB/s.)
SELECT count()
FROM maptest
WHERE (map['custom2']) = '2'
┌─count()─┐
│ 2299 │
└─────────┘
1 row in set. Elapsed: 11.142 sec. Processed 200.00 million rows, 8.35 GB (17.95 million rows/s., 749.32 MB/s.)
SELECT count()
FROM maptest
WHERE (values[indexOf(keys, 'custom2')]) = '2'
┌─count()─┐
│ 2299 │
└─────────┘
1 row in set. Elapsed: 3.458 sec. Processed 200.00 million rows, 9.95 GB (57.83 million rows/s., 2.88 GB/s.)
Any suggestions on data skipping indexes for either of the two options?
You can add a data skipping index for a Map field, although you will need to lower index_granularity to find a good balance between index size and how many granules get skipped. Build the index using the mapValues (or mapKeys, depending on your needs) map function:
CREATE TABLE maptest
(
`k` Int64,
`keys` Array(String),
`values` Array(String),
`map` Map(String, String),
INDEX b mapValues(map) TYPE tokenbf_v1(2048, 16, 42) GRANULARITY 1
)
ENGINE = MergeTree
ORDER BY k
SETTINGS index_granularity = 2048; -- < lowered index_granularity!
insert into maptest
select
number,
mapKeys(map(concat('custom', toString(number%87000)), toString(number%87000))),
mapValues(map(concat('custom', toString(number%87000)), toString(number%87000))),
map(concat('custom', toString(number%87000)), toString(number%87000))
from numbers(20000000);
Now let's test it:
set send_logs_level='trace';
SELECT count()
FROM maptest
WHERE (map['custom2']) = '2';
(...)
[LAPTOP-ASLS2SOJ] 2023.02.01 11:44:52.171103 [ 96 ] {3638972e-baf3-4b48-bf10-7b944e46fc64} <Debug> default.maptest (11baab32-a7a8-4b0f-b879-ad1541cbe282) (SelectExecutor): Index `b` has dropped 9123/9767 granules.
(...)
┌─count()─┐
│ 230 │
└─────────┘
(...)
1 row in set. Elapsed: 0.107 sec. Processed 1.32 million rows, 54.52 MB (12.30 million rows/s., 508.62 MB/s.)
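If your queries filter on key presence rather than on values, a similar skipping index can be added over mapKeys instead. A sketch, reusing the same tokenbf_v1 parameters (the index name kb is hypothetical; tune the parameters for your own data):
ALTER TABLE maptest ADD INDEX kb mapKeys(map) TYPE tokenbf_v1(2048, 16, 42) GRANULARITY 1;
ALTER TABLE maptest MATERIALIZE INDEX kb; -- build the index for existing parts

-- filters such as mapContains(map, 'custom2') should then be able to skip granules
SELECT count()
FROM maptest
WHERE mapContains(map, 'custom2');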
I have a table with product names in a PostgreSQL database, with a total of ~30M rows. I also have a history of prices in ClickHouse, and I want to join the names to the prices.
DDL to create dictionary:
CREATE DICTIONARY products_dict
(
product_id String,
name String
)
PRIMARY KEY product_id
SOURCE(POSTGRESQL(
...
query 'SELECT product_id, name FROM products'
))
LAYOUT(COMPLEX_KEY_HASHED())
LIFETIME(3600);
Then I have the dictionary:
database: wdm
name: products_dict
uuid: 1464ba09-990c-4e69-9464-ba09990c0e69
status: LOADED
origin: 1464ba09-990c-4e69-9464-ba09990c0e69
type: ComplexKeyHashed
key.names: ['product_id']
key.types: ['String']
attribute.names: ['name']
attribute.types: ['String']
bytes_allocated: 4831830312
query_count: 57912282
hit_rate: 1
found_rate: 1
element_count: 28956140
load_factor: 0.4314801096916199
source: PostgreSQL: ...
lifetime_min: 0
lifetime_max: 3600
loading_start_time: 2022-01-17 03:53:21
last_successful_update_time: 2022-01-17 03:54:46
loading_duration: 84.79
last_exception:
comment:
I also have a table for this dictionary:
-- auto-generated definition
create table products_dict
(
product_id String,
name String
)
engine = Dictionary;
When I query this dictionary, it takes ~3 sec.
One id with WHERE IN
SELECT name FROM products_dict WHERE product_id IN ('97646221')
1 row retrieved starting from 1 in 2 s 891 ms (execution: 2 s 841 ms, fetching: 50 ms)
501 products without conditions and sorting
SELECT t.*
FROM products_dict t
LIMIT 501
500 rows retrieved starting from 1 in 2 s 616 ms (execution: 2 s 601 ms, fetching: 15 ms)
JOIN
SELECT ppd.*, p.name
FROM
(
SELECT
product_id,
price
FROM product_prices_daily
WHERE
product_id IN ('97646221','97646318','976464823','97647223','976472425','976474961','976476908')
AND day between '2022-01-13' and '2022-01-14'
) as ppd
LEFT JOIN products_dict as p ON p.product_id = ppd.product_id
4 rows retrieved starting from 1 in 6 s 984 ms (execution: 6 s 959 ms, fetching: 25 ms)
DBMS: ClickHouse (ver. 21.12.3.32)
Client: DataGrip 2021.3.2
Server: 128 GB RAM, dozens of cores, 3TB SSD without any load.
Reading from a 16-billion-row MergeTree table by product_id takes ~100 ms.
I've tested a manually created table with ENGINE = Dictionary and got the same results.
I cannot use the FLAT layout because product_id is a String.
Another test with clickhouse-client:
ch01 :) SELECT name FROM products_dict WHERE product_id IN ('97646239');
SELECT name
FROM products_dict
WHERE product_id IN ('97646239')
Query id: d4f467c9-be0e-4619-841b-a76251d3e714
┌─name──┐
│ ...│
└───────┘
1 rows in set. Elapsed: 2.859 sec. Processed 28.96 million rows, 2.30 GB (10.13 million rows/s., 803.25 MB/s.)
What's wrong?
Such an optimization is not implemented yet.
Initially, dictionaries were supposed to be used only through the dictGet functions; the table representation was introduced much later.
Internally, dictionaries are a set of hash tables -- if your dictionary has 50 attributes, there will be 50 hash tables. These hash tables are very fast for seeking by key, but very slow if you need to find the next element.
Right now the query SELECT name FROM products_dict WHERE product_id IN ('97646239') is executed in a very straightforward way (a full scan of the dictionary), though it could be converted into dictGet under the hood.
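As a workaround, point lookups can be expressed with dictGet directly, which goes straight to the hash table instead of scanning the dictionary's table representation. A sketch, assuming the complex key is passed as a tuple (you may need to qualify the dictionary name with its database, e.g. wdm.products_dict):
-- single lookup via the hash table instead of a scan
SELECT dictGet('products_dict', 'name', tuple('97646239')) AS name;

-- enriching the price rows with dictGet instead of a LEFT JOIN
SELECT
    product_id,
    price,
    dictGet('products_dict', 'name', tuple(product_id)) AS name
FROM product_prices_daily
WHERE product_id IN ('97646221', '97646318')
  AND day BETWEEN '2022-01-13' AND '2022-01-14';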
I am testing out ClickHouse with a wine dataset, which has a column, VINTAGE, with dates ranging from 1925-01-01 to 2017-01-01. I've run into some strange behaviour with the toStartOfYear function (see below).
I can work around the issue trivially in this dataset, but that won't be the case for other datasets I need to handle that contain datetimes prior to 1970.
Would the behaviour below be considered a bug or a known limitation?
select min(VINTAGE) from 0gzKBy;
SELECT min(VINTAGE)
FROM `0gzKBy`
Query id: 172b0560-f5dd-492d-8536-d2dbe3270caf
┌─min(VINTAGE)─┐
│ 1925-01-01 │
└──────────────┘
1 rows in set. Elapsed: 0.006 sec. Processed 129.97 thousand rows, 649.84 KB (21.30 million rows/s., 106.51 MB/s.)
select toStartOfYear(min(VINTAGE)) from 0gzKBy;
SELECT toStartOfYear(min(VINTAGE))
FROM `0gzKBy`
Query id: d65572dd-9e0d-4ef9-8c3e-4b357f2bb71a
┌─toStartOfYear(min(VINTAGE))─┐
│ 2104-06-07 │
└─────────────────────────────┘
1 rows in set. Elapsed: 0.009 sec. Processed 129.97 thousand rows, 649.84 KB (14.13 million rows/s., 70.66 MB/s.)
select min(toStartOfYear(VINTAGE)) from 0gzKBy;
SELECT min(toStartOfYear(VINTAGE))
FROM `0gzKBy`
Query id: eaf8f56c-e448-42d6-aa9b-2a02657ca774
┌─min(toStartOfYear(VINTAGE))─┐
│ 1973-01-01 │
└─────────────────────────────┘
1 rows in set. Elapsed: 0.006 sec. Processed 129.97 thousand rows, 649.84 KB (22.48 million rows/s., 112.39 MB/s.)
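For what it's worth, this matches the documented limitation of the toStartOf* family: they return a plain Date, and arguments outside the supported Date range give incorrect results rather than an error. One workaround sketch, assuming VINTAGE is stored as Date32 or DateTime64 (the column type isn't shown above), is to extract the year as an integer instead:
-- assumes VINTAGE is Date32/DateTime64 so pre-1970 values are representable
SELECT toYear(min(VINTAGE)) AS vintage_year
FROM `0gzKBy`;

-- on recent servers that have makeDate32, this can be turned back into a
-- wide-range date: makeDate32(toYear(min(VINTAGE)), 1, 1)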
I have a table with approximately 9 million records. When I try to select records with a big offset (for pagination), the execution time increases to extreme values, or the query even exceeds memory limits and fails.
Here are the logs for the query with two different offset values.
SELECT * WHERE set_date >= '2019-10-11 11:05:00' AND set_date <= '2019-10-19 18:09:59' ORDER BY id ASC LIMIT 1 OFFSET 30
Elapsed: 0.729 sec. Processed 9.92 million rows, 3.06 GB (13.61 million rows/s., 4.19 GB/s.)
MemoryTracker: Peak memory usage (for query): 181.65 MiB.
SELECT * WHERE set_date >= '2019-10-11 11:05:00' AND set_date <= '2019-10-19 18:09:59' ORDER BY id ASC LIMIT 1 OFFSET 3000000
Elapsed: 6.301 sec. Processed 9.92 million rows, 3.06 GB (1.57 million rows/s., 485.35 MB/s.)
MemoryTracker: Peak memory usage (for query): 5.89 GiB.
All databases, including ClickHouse, implement OFFSET the same way: they read all the rows and skip OFFSET of them in the result set. There is no optimization that ascends straight to OFFSET 3000000.
https://www.eversql.com/faster-pagination-in-mysql-why-order-by-with-limit-and-offset-is-slow/
Try disabling optimize_read_in_order to fix the memory usage:
SELECT *
WHERE set_date >= '2019-10-11 11:05:00'
AND set_date <= '2019-10-19 18:09:59'
ORDER BY id ASC LIMIT 1 OFFSET 3000000
SETTINGS optimize_read_in_order = 0
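Beyond that, the usual fix for deep pagination (as the linked article explains) is keyset pagination: remember the last id of the previous page and seek past it instead of skipping rows with OFFSET. A sketch, with a hypothetical table name events (the original query omits its FROM clause) and 123456789 standing in for the last id returned by the previous page:
SELECT *
FROM events
WHERE set_date >= '2019-10-11 11:05:00'
  AND set_date <= '2019-10-19 18:09:59'
  AND id > 123456789
ORDER BY id ASC
LIMIT 100;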
centos7-231 :) select round(123.454, 2), round(123.445, 2);
SELECT
round(123.454, 2),
round(123.445, 2)
┌─round(123.454, 2)─┬─round(123.445, 2)─┐
│ 123.45 │ 123.44 │
└───────────────────┴───────────────────┘
1 rows in set. Elapsed: 0.002 sec.
centos7-231 :) select version();
SELECT version()
┌─version()─┐
│ 18.10.3 │
└───────────┘
1 rows in set. Elapsed: 0.005 sec.
Why does round(123.445, 2) return 123.44 instead of 123.45 in ClickHouse? Can somebody help?
In an older ClickHouse version:
Connected to ClickHouse server version 1.1.54318.
:) select round(123.455, 2), round(123.445, 2);
SELECT
round(123.455, 2),
round(123.445, 2)
┌─round(123.455, 2)─┬─round(123.445, 2)─┐
│ 123.46 │ 123.45 │
└───────────────────┴───────────────────┘
Thank you!
ClickHouse uses banker's rounding, which rounds half to even.
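A minimal illustration with halfway values that are exactly representable as Float64, plus a hedged workaround: per the documentation, round over Decimal arguments rounds half away from zero, so converting first should give the "schoolbook" result (assuming a server version where that Decimal behaviour applies):
SELECT round(0.5), round(1.5), round(2.5); -- 0, 2, 2 under banker's rounding

SELECT round(toDecimal64('123.445', 3), 2); -- expected: 123.45 (half away from zero)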