I am testing out ClickHouse with a wine dataset, which has a column, VINTAGE, with dates ranging from 1925-01-01 to 2017-01-01. I've run into some strange behaviour with the toStartOfYear function (see below).
I can work around the issue trivially in this dataset, but that won't be the case with other datasets that have datetimes prior to 1970 that I need to handle.
Would the behaviour below be considered a bug or known limitation?
select min(VINTAGE) from 0gzKBy;
SELECT min(VINTAGE)
FROM `0gzKBy`
Query id: 172b0560-f5dd-492d-8536-d2dbe3270caf
┌─min(VINTAGE)─┐
│ 1925-01-01 │
└──────────────┘
1 rows in set. Elapsed: 0.006 sec. Processed 129.97 thousand rows, 649.84 KB (21.30 million rows/s., 106.51 MB/s.)
select toStartOfYear(min(VINTAGE)) from 0gzKBy;
SELECT toStartOfYear(min(VINTAGE))
FROM `0gzKBy`
Query id: d65572dd-9e0d-4ef9-8c3e-4b357f2bb71a
┌─toStartOfYear(min(VINTAGE))─┐
│ 2104-06-07 │
└─────────────────────────────┘
1 rows in set. Elapsed: 0.009 sec. Processed 129.97 thousand rows, 649.84 KB (14.13 million rows/s., 70.66 MB/s.)
select min(toStartOfYear(VINTAGE)) from 0gzKBy;
SELECT min(toStartOfYear(VINTAGE))
FROM `0gzKBy`
Query id: eaf8f56c-e448-42d6-aa9b-2a02657ca774
┌─min(toStartOfYear(VINTAGE))─┐
│ 1973-01-01 │
└─────────────────────────────┘
1 rows in set. Elapsed: 0.006 sec. Processed 129.97 thousand rows, 649.84 KB (22.48 million rows/s., 112.39 MB/s.)
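For what it's worth, a hedged workaround sketch for pre-1970 dates, assuming the column can be stored as (or cast to) Date32 and that makeDate32 is available in your ClickHouse version: derive the year with toYear(), which is not limited to the post-1970 Date range, and rebuild the boundary from it instead of calling toStartOfYear() on the raw value.
-- illustrative only: rebuild the start-of-year boundary from the year number
SELECT
    toYear(min(VINTAGE)) AS min_vintage_year,
    makeDate32(toYear(min(VINTAGE)), 1, 1) AS min_vintage_year_start
FROM `0gzKBy`;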
I am migrating a table from Postgres to ClickHouse, and one of the columns is a jsonb column that includes custom attributes. These attributes can differ per tenant, so we currently have 100k different custom attribute keys stored in Postgres.
I checked ClickHouse's semi-structured JSON data options, and it seems we can use either a Map(String, String) column or two Array(String) columns holding the keys and values.
However, I cannot make a proper assessment of which one is best, as I get pretty similar results.
To test performance I created the following table:
CREATE TABLE maptest
(
`k` Int64,
`keys` Array(String),
`values` Array(String),
`map` Map(String, String)
)
ENGINE = MergeTree
ORDER BY k
SETTINGS index_granularity = 8192;
insert into maptest
select
number,
mapKeys(map(concat('custom', toString(number%87000)), toString(number%87000))),
mapValues(map(concat('custom', toString(number%87000)), toString(number%87000))),
map(concat('custom', toString(number%87000)), toString(number%87000))
from numbers(200000000);
-- the data looks like this:
SELECT *
FROM maptest
LIMIT 1
Query id: 9afcb888-94d9-42ec-a4b3-1d73b8cadde0
┌─k─┬─keys────────┬─values─┬─map─────────────┐
│ 0 │ ['custom0'] │ ['0'] │ {'custom0':'0'} │
└───┴─────────────┴────────┴─────────────────┘
However, whichever method I use to query for a specific key-value pair, the whole table always gets scanned, e.g.:
SELECT count()
FROM maptest
WHERE length(arrayFilter((v, k) -> ((k = 'custom2') AND (v = '2')), values, keys)) > 0
┌─count()─┐
│ 2299 │
└─────────┘
1 row in set. Elapsed: 10.541 sec. Processed 200.00 million rows, 9.95 GB (18.97 million rows/s., 943.85 MB/s.)
SELECT count()
FROM maptest
WHERE (map['custom2']) = '2'
┌─count()─┐
│ 2299 │
└─────────┘
1 row in set. Elapsed: 11.142 sec. Processed 200.00 million rows, 8.35 GB (17.95 million rows/s., 749.32 MB/s.)
SELECT count()
FROM maptest
WHERE (values[indexOf(keys, 'custom2')]) = '2'
┌─count()─┐
│ 2299 │
└─────────┘
1 row in set. Elapsed: 3.458 sec. Processed 200.00 million rows, 9.95 GB (57.83 million rows/s., 2.88 GB/s.)
Any suggestions on data skipping indexes for either of the two options?
You can add a data skipping index for a Map field, although you will need to set a lower index_granularity to find the best trade-off between index size and how many granules will be skipped. You should build your index using the mapValues (or mapKeys, depending on your needs) map function:
CREATE TABLE maptest
(
`k` Int64,
`keys` Array(String),
`values` Array(String),
`map` Map(String, String),
INDEX b mapValues(map) TYPE tokenbf_v1(2048, 16, 42) GRANULARITY 1
)
ENGINE = MergeTree
ORDER BY k
SETTINGS index_granularity = 2048; -- < lowered index_granularity!
insert into maptest
select
number,
mapKeys(map(concat('custom', toString(number%87000)), toString(number%87000))),
mapValues(map(concat('custom', toString(number%87000)), toString(number%87000))),
map(concat('custom', toString(number%87000)), toString(number%87000))
from numbers(20000000);
Now let's test it:
set send_logs_level='trace';
SELECT count()
FROM maptest
WHERE (map['custom2']) = '2';
(...)
[LAPTOP-ASLS2SOJ] 2023.02.01 11:44:52.171103 [ 96 ] {3638972e-baf3-4b48-bf10-7b944e46fc64} <Debug> default.maptest (11baab32-a7a8-4b0f-b879-ad1541cbe282) (SelectExecutor): Index `b` has dropped 9123/9767 granules.
(...)
┌─count()─┐
│ 230 │
└─────────┘
(...)
1 row in set. Elapsed: 0.107 sec. Processed 1.32 million rows, 54.52 MB (12.30 million rows/s., 508.62 MB/s.)
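For the two-array layout, a similar skipping index can be declared directly on the keys array. A sketch under the same assumptions (the index name keys_idx and the bloom filter parameters are illustrative; whether granules are actually skipped depends on key cardinality and index_granularity, just as with the Map variant):
ALTER TABLE maptest
    ADD INDEX keys_idx keys TYPE bloom_filter(0.01) GRANULARITY 1;

-- build the index for parts that already exist
ALTER TABLE maptest MATERIALIZE INDEX keys_idx;

-- a predicate form the bloom filter can act on
SELECT count()
FROM maptest
WHERE has(keys, 'custom2') AND ((values[indexOf(keys, 'custom2')]) = '2');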
I compared the use of groupBitmap and uniqExact; uniqExact is twice as fast as groupBitmap:
ClickHouseqlwp0001.mrs-41io.com :) select groupBitmap(LO_ORDERKEY) from lineorder
SELECT groupBitmap(LO_ORDERKEY)
FROM lineorder
Query id: a48055ca-bbfa-4937-b551-07c2c321a77c
┌─groupBitmap(LO_ORDERKEY)─┐
│ 414044233 │
└──────────────────────────┘
1 rows in set. Elapsed: 49.457 sec. Processed 1.66 billion rows, 6.62 GB (33.49 million rows/s., 133.95 MB/s.)
ClickHouseqlwp0001.mrs-41io.com :) select uniqExact(LO_ORDERKEY) from lineorder
SELECT uniqExact(LO_ORDERKEY)
FROM lineorder
Query id: bef46061-59d6-4121-aba4-68a74660efac
┌─uniqExact(LO_ORDERKEY)─┐
│ 414044233 │
└────────────────────────┘
1 rows in set. Elapsed: 29.201 sec. Processed 1.66 billion rows, 6.62 GB (56.72 million rows/s., 226.87 MB/s.)
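If an exact count is not required, the approximate uniq family is usually much faster and lighter on memory than either of the above. A sketch (the results are estimates, not exact counts):
SELECT uniq(LO_ORDERKEY) FROM lineorder;          -- approximate distinct count
SELECT uniqCombined(LO_ORDERKEY) FROM lineorder;  -- approximate, better accuracy at high cardinality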
I have a String column uin in several tables; how can I join these tables on uin efficiently?
In Vertica we use hash(uin) to transform the string column into an Int hash, which significantly boosts join efficiency. Could you recommend something similar? I tried CRC32(s), but it seems to work incorrectly.
At the moment ClickHouse does not cope very well with multi-join queries (star-schema databases), and the query optimizer is not good enough to rely on completely.
So you need to tell it explicitly how to 'execute' a query by using subqueries instead of joins.
Let's emulate your query:
SELECT table_01.number AS r
FROM numbers(87654321) AS table_01
INNER JOIN numbers(7654321) AS table_02 ON (table_01.number = table_02.number)
INNER JOIN numbers(654321) AS table_03 ON (table_02.number = table_03.number)
INNER JOIN numbers(54321) AS table_04 ON (table_03.number = table_04.number)
ORDER BY r DESC
LIMIT 8;
/*
┌─────r─┐
│ 54320 │
│ 54319 │
│ 54318 │
│ 54317 │
│ 54316 │
│ 54315 │
│ 54314 │
│ 54313 │
└───────┘
8 rows in set. Elapsed: 4.244 sec. Processed 96.06 million rows, 768.52 MB (22.64 million rows/s., 181.10 MB/s.)
*/
On my PC it takes ~4 secs. Let's rewrite it using subqueries to significantly speed it up.
SELECT number AS r
FROM numbers(87654321)
WHERE number IN (
SELECT number
FROM numbers(7654321)
WHERE number IN (
SELECT number
FROM numbers(654321)
WHERE number IN (
SELECT number
FROM numbers(54321)
)
)
)
ORDER BY r DESC
LIMIT 8;
/*
┌─────r─┐
│ 54320 │
│ 54319 │
│ 54318 │
│ 54317 │
│ 54316 │
│ 54315 │
│ 54314 │
│ 54313 │
└───────┘
8 rows in set. Elapsed: 0.411 sec. Processed 96.06 million rows, 768.52 MB (233.50 million rows/s., 1.87 GB/s.)
*/
There are other ways to optimize JOIN:
use an external dictionary to get rid of the join on a 'small' table
use the Join table engine (see the sketch after this list)
use ANY strictness
use specific settings like join_algorithm, partial_merge_join_optimizations, etc.
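To illustrate the Join engine option, here is a hedged sketch (the tables dim_users and facts and the column name are hypothetical): the right-hand side is kept in memory, pre-indexed by uin, and looked up with joinGet instead of a runtime JOIN.
-- in-memory dimension table keyed by uin
CREATE TABLE dim_users_join
(
    `uin` String,
    `name` String
)
ENGINE = Join(ANY, LEFT, uin);

INSERT INTO dim_users_join SELECT uin, name FROM dim_users;

-- look up the dimension value without a runtime JOIN
SELECT
    uin,
    joinGet('dim_users_join', 'name', uin) AS user_name
FROM facts;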
Some useful refs:
Altinity webinar: Tips and tricks every ClickHouse user should know
Altinity webinar: Secrets of ClickHouse Query Performance
Answer update:
To reduce storage consumption for the String column, consider changing the column type to LowCardinality, which significantly decreases the size of a column with many duplicated elements.
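A minimal sketch of the change, assuming a table named events that contains the uin column (the type change is a mutation that rewrites the column):
ALTER TABLE events MODIFY COLUMN uin LowCardinality(String);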
Use this query to get the size of columns:
SELECT
name AS column_name,
formatReadableSize(data_compressed_bytes) AS data_size,
formatReadableSize(marks_bytes) AS index_size,
type,
compression_codec
FROM system.columns
WHERE database = 'db_name' AND table = 'table_name'
ORDER BY data_compressed_bytes DESC
To get a numeric representation of a string, use one of the hash functions.
SELECT 'jsfhuhsdf', xxHash32('jsfhuhsdf'), cityHash64('jsfhuhsdf');
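If you specifically want to join on a numeric hash, the way Vertica's hash(uin) is used, one hedged approach (the table names t1 and t2 are hypothetical) is to materialize a 64-bit hash column on each side and join on it. Unlike a dictionary encoding, a 64-bit hash can in principle collide, although the probability is negligible for most datasets.
-- materialize the hash once per table, then join on the UInt64 column
ALTER TABLE t1 ADD COLUMN uin_hash UInt64 MATERIALIZED cityHash64(uin);
ALTER TABLE t2 ADD COLUMN uin_hash UInt64 MATERIALIZED cityHash64(uin);

SELECT count()
FROM t1
INNER JOIN t2 ON t1.uin_hash = t2.uin_hash;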
I have a table with approximately 9 million records. When I try to select records with a big offset (for pagination), the execution time grows to extreme values, or the query even exceeds the memory limit and fails.
Here are logs for the query with two different offset values.
SELECT * WHERE set_date >= '2019-10-11 11:05:00' AND set_date <= '2019-10-19 18:09:59' ORDER BY id ASC LIMIT 1 OFFSET 30
Elapsed: 0.729 sec. Processed 9.92 million rows, 3.06 GB (13.61 million rows/s., 4.19 GB/s.)
MemoryTracker: Peak memory usage (for query): 181.65 MiB.
SELECT * WHERE set_date >= '2019-10-11 11:05:00' AND set_date <= '2019-10-19 18:09:59' ORDER BY id ASC LIMIT 1 OFFSET 3000000
Elapsed: 6.301 sec. Processed 9.92 million rows, 3.06 GB (1.57 million rows/s., 485.35 MB/s.)
MemoryTracker: Peak memory usage (for query): 5.89 GiB.
All databases, including ClickHouse, implement OFFSET the same way: they just read all rows and skip OFFSET rows in the result set. There is no optimization that jumps straight to OFFSET 3000000.
https://www.eversql.com/faster-pagination-in-mysql-why-order-by-with-limit-and-offset-is-slow/
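The usual alternative is keyset (cursor) pagination: remember the last id of the previous page and filter on it instead of skipping rows. A hedged sketch (the table name your_table, the page size and the id value are placeholders; the original query omits the table name):
SELECT *
FROM your_table
WHERE set_date >= '2019-10-11 11:05:00'
  AND set_date <= '2019-10-19 18:09:59'
  AND id > 2999999        -- last id seen on the previous page
ORDER BY id ASC
LIMIT 100;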
Try disabling optimize_read_in_order to fix the memory usage:
SELECT *
WHERE set_date >= '2019-10-11 11:05:00'
AND set_date <= '2019-10-19 18:09:59'
ORDER BY id ASC LIMIT 1 OFFSET 3000000
SETTINGS optimize_read_in_order = 0
centos7-231 :) select round(123.454, 2), round(123.445, 2);
SELECT
round(123.454, 2),
round(123.445, 2)
┌─round(123.454, 2)─┬─round(123.445, 2)─┐
│ 123.45 │ 123.44 │
└───────────────────┴───────────────────┘
1 rows in set. Elapsed: 0.002 sec.
centos7-231 :) select version();
SELECT version()
┌─version()─┐
│ 18.10.3 │
└───────────┘
1 rows in set. Elapsed: 0.005 sec.
round(123.445, 2) should return 123.45, so why does ClickHouse return 123.44? Can somebody help?
In an older ClickHouse version:
Connected to ClickHouse server version 1.1.54318.
:) select round(123.455, 2), round(123.445, 2);
SELECT
round(123.455, 2),
round(123.445, 2)
┌─round(123.455, 2)─┬─round(123.445, 2)─┐
│ 123.46 │ 123.45 │
└───────────────────┴───────────────────┘
Thank you!
ClickHouse uses banker's rounding, which rounds half to even.
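A minimal illustration of round-half-to-even, where ties go to the nearest even result (note also that 123.445 has no exact binary Float64 representation; the stored value is slightly below the decimal literal, which pushes the result towards 123.44 as well):
SELECT round(2.5), round(3.5), round(-2.5);
-- banker's rounding: 2.5 -> 2, 3.5 -> 4, -2.5 -> -2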