We know that a materialized view is triggered by insertion. A large amount of data from one insertion is divided into multiple blocks, and each block triggers one MV select, according to this doc.
Is the MV select triggered only after all the rows of a block have been received? Or does it depend on the select? Say, if the select counts the rows of the inserted data, will it not be triggered until all the data has been received? And if the select just does data replication / reverse indexing, will it be triggered as soon as one record of the block is received?
An insert triggers the MatView after a block is inserted into the main table. So the insert just passes a pointer to the block of rows (in memory) into the MatView.
-- by default clickhouse-client forms blocks of 1048545 rows
-- clickhouse-client itself parses the input stream and inserts into
-- the server in Native format
select value from system.settings where name = 'max_insert_block_size';
1048545
-- here is a setup which traces sizes of blocks in rows
create table test(a Int, b String) engine=Null;
create materialized view test_mv
Engine=MergeTree order by ts as
select now() ts, count() rows from test;
-- create file with 100mil rows in TSV format
clickhouse-client -q "select number, 'x' from numbers(100000000) format TSV" >test.tsv
clickhouse-client -q "insert into test format TSV" <test.tsv
select max(rows), min(rows), count() from test_mv;
┌─max(rows)─┬─min(rows)─┬─count()─┐
│ 1947598 │ 32 │ 65 │
└───────────┴───────────┴─────────┘
-- 1947598 <- several blocks were squashed into a single block
-- because of parallel parsing
-- 65 blocks were passed
truncate test_mv;
-- let's disable parallel parsing
clickhouse-client --input_format_parallel_parsing=0 -q "insert into test format TSV" <test.tsv
select max(rows), min(rows), count() from test_mv;
┌─max(rows)─┬─min(rows)─┬─count()─┐
│ 1048545 │ 388225 │ 96 │
└───────────┴───────────┴─────────┘
-- got 1048545 = max_insert_block_size
-- 96 blocks were passed
-- 100000000 - 95 * 1048545 = 388225
-- (95 blocks of max_insert_block_size rows and 1 remaining block of 388225 rows)
truncate test_mv;
-- let's grow max_insert_block_size
clickhouse-client --max_insert_block_size=10000000000 --input_format_parallel_parsing=0 -q "insert into test format TSV" <test.tsv
select max(rows), min(rows), count() from test_mv;
┌─max(rows)─┬─min(rows)─┬─count()─┐
│ 100000000 │ 100000000 │ 1 │
└───────────┴───────────┴─────────┘
-- 1 block == 100 mil rows
More: https://kb.altinity.com/altinity-kb-queries-and-syntax/atomic-insert/
I am migrating a table from Postgres to ClickHouse, and one of the columns is a jsonb column that holds custom attributes. These attributes can differ per tenant, so we currently have 100k different custom attribute keys stored in Postgres.
I checked ClickHouse's semi-structured JSON data options, and it seems we can use either a Map(String, String) column or two Array(String) columns holding the keys and values.
However, I cannot make a proper assessment of which one is best, as I get pretty similar results.
To test performance I created the following table:
CREATE TABLE maptest
(
`k` Int64,
`keys` Array(String),
`values` Array(String),
`map` Map(String, String)
)
ENGINE = MergeTree
ORDER BY k
SETTINGS index_granularity = 8192;
insert into maptest
select
number,
mapKeys(map(concat('custom', toString(number%87000)), toString(number%87000))),
mapValues(map(concat('custom', toString(number%87000)), toString(number%87000))),
map(concat('custom', toString(number%87000)), toString(number%87000))
from numbers(200000000);
-- the data looks like this:
SELECT *
FROM maptest
LIMIT 1
Query id: 9afcb888-94d9-42ec-a4b3-1d73b8cadde0
┌─k─┬─keys────────┬─values─┬─map─────────────┐
│ 0 │ ['custom0'] │ ['0'] │ {'custom0':'0'} │
└───┴─────────────┴────────┴─────────────────┘
However, whichever method I choose to query for a specific key-value pair, I always get a full table scan, e.g.:
SELECT count()
FROM maptest
WHERE length(arrayFilter((v, k) -> ((k = 'custom2') AND (v = '2')), values, keys)) > 0
┌─count()─┐
│ 2299 │
└─────────┘
1 row in set. Elapsed: 10.541 sec. Processed 200.00 million rows, 9.95 GB (18.97 million rows/s., 943.85 MB/s.)
SELECT count()
FROM maptest
WHERE (map['custom2']) = '2'
┌─count()─┐
│ 2299 │
└─────────┘
1 row in set. Elapsed: 11.142 sec. Processed 200.00 million rows, 8.35 GB (17.95 million rows/s., 749.32 MB/s.)
SELECT count()
FROM maptest
WHERE (values[indexOf(keys, 'custom2')]) = '2'
┌─count()─┐
│ 2299 │
└─────────┘
1 row in set. Elapsed: 3.458 sec. Processed 200.00 million rows, 9.95 GB (57.83 million rows/s., 2.88 GB/s.)
Any suggestions on data skipping indexes for either of the two options?
You can add a data skipping index for a Map field, although you will need to set a lower index_granularity to find the best trade-off between index size and how many granules get skipped. Build the index using the mapValues (or mapKeys, depending on your needs) map function:
CREATE TABLE maptest
(
`k` Int64,
`keys` Array(String),
`values` Array(String),
`map` Map(String, String),
INDEX b mapValues(map) TYPE tokenbf_v1(2048, 16, 42) GRANULARITY 1
)
ENGINE = MergeTree
ORDER BY k
SETTINGS index_granularity = 2048; -- < lowered index_granularity!
insert into maptest
select
number,
mapKeys(map(concat('custom', toString(number%87000)), toString(number%87000))),
mapValues(map(concat('custom', toString(number%87000)), toString(number%87000))),
map(concat('custom', toString(number%87000)), toString(number%87000))
from numbers(20000000);
Now let's test it:
set send_logs_level='trace';
SELECT count()
FROM maptest
WHERE (map['custom2']) = '2';
(...)
[LAPTOP-ASLS2SOJ] 2023.02.01 11:44:52.171103 [ 96 ] {3638972e-baf3-4b48-bf10-7b944e46fc64} <Debug> default.maptest (11baab32-a7a8-4b0f-b879-ad1541cbe282) (SelectExecutor): Index `b` has dropped 9123/9767 granules.
(...)
┌─count()─┐
│ 230 │
└─────────┘
(...)
1 row in set. Elapsed: 0.107 sec. Processed 1.32 million rows, 54.52 MB (12.30 million rows/s., 508.62 MB/s.)
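If you prefer not to dig through trace logs, here is a sketch of another way to see the skip index effect on the same maptest table, using EXPLAIN with indexes = 1:
EXPLAIN indexes = 1
SELECT count()
FROM maptest
WHERE (map['custom2']) = '2';
-- the plan lists the skip index `b` together with how many parts/granules remain after it is applied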
I use the ClickHouse database. There is a table with a String column (data). All rows contain data like:
'[{"a":23, "b":1}]'
'[{"a":7, "b":15}]'
I want to get all the values of the key "b":
1
15
The following query:
Select JSONExtractInt('data', 0, 'b') from table
always returns 0. How can I get the values of the key "b"?
SELECT tupleElement(JSONExtract(j, 'Array(Tuple(a Int64, b Int64))'), 'b')[1] AS res
FROM
(
SELECT '[{"a":23, "b":1}]' AS j
UNION ALL
SELECT '[{"a":7, "b":15}]'
)
┌─res─┐
│ 1 │
└─────┘
┌─res─┐
│ 15 │
└─────┘
According to https://clickhouse.tech/docs/en/sql-reference/functions/hash-functions/,
I can get a checksum of the entire table this way:
SELECT groupBitXor(cityHash64(*)) FROM table
What is the most accurate way to get a checksum of the first N rows of a table?
As an example, I'm using a table with the GenerateRandom engine as stated here.
CREATE TABLE test (name String, value UInt32) ENGINE = GenerateRandom(1, 5, 3)
I tried using the LIMIT clause, but with no luck yet.
Consider using a sub-query:
SELECT groupBitXor(cityHash64(*))
FROM (
SELECT *
FROM table
LIMIT x)
SELECT groupBitXor(cityHash64(*))
FROM
(
SELECT *
FROM system.numbers
LIMIT 10
)
/*
┌─groupBitXor(cityHash64(number))─┐
│ 9791317254842948406 │
└─────────────────────────────────┘
*/
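Keep in mind that without an ORDER BY the "first N rows" are not deterministic, so the checksum may differ between runs. A sketch that pins the order (sort_key is a hypothetical column; use your table's sorting key):
SELECT groupBitXor(cityHash64(*))
FROM
(
    SELECT *
    FROM table
    ORDER BY sort_key -- hypothetical ordering column
    LIMIT 10
)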
I have a String column uin in several tables; how can I effectively join these tables on uin?
In Vertica we use hash(uin) to transform the string column into a hash with an Int data type, which significantly boosts join efficiency. Could you recommend something like this? I tried CRC32(s), but it seems to work incorrectly.
At the moment ClickHouse does not cope very well with multi-join queries (star-schema databases), and the query optimizer is not good enough to rely on completely.
So you need to explicitly tell it how to 'execute' a query, using subqueries instead of joins.
Let's emulate your query:
SELECT table_01.number AS r
FROM numbers(87654321) AS table_01
INNER JOIN numbers(7654321) AS table_02 ON (table_01.number = table_02.number)
INNER JOIN numbers(654321) AS table_03 ON (table_02.number = table_03.number)
INNER JOIN numbers(54321) AS table_04 ON (table_03.number = table_04.number)
ORDER BY r DESC
LIMIT 8;
/*
┌─────r─┐
│ 54320 │
│ 54319 │
│ 54318 │
│ 54317 │
│ 54316 │
│ 54315 │
│ 54314 │
│ 54313 │
└───────┘
8 rows in set. Elapsed: 4.244 sec. Processed 96.06 million rows, 768.52 MB (22.64 million rows/s., 181.10 MB/s.)
*/
On my PC it takes ~4 secs. Let's rewrite it using subqueries to significantly speed it up.
SELECT number AS r
FROM numbers(87654321)
WHERE number IN (
SELECT number
FROM numbers(7654321)
WHERE number IN (
SELECT number
FROM numbers(654321)
WHERE number IN (
SELECT number
FROM numbers(54321)
)
)
)
ORDER BY r DESC
LIMIT 8;
/*
┌─────r─┐
│ 54320 │
│ 54319 │
│ 54318 │
│ 54317 │
│ 54316 │
│ 54315 │
│ 54314 │
│ 54313 │
└───────┘
8 rows in set. Elapsed: 0.411 sec. Processed 96.06 million rows, 768.52 MB (233.50 million rows/s., 1.87 GB/s.)
*/
There are other ways to optimize JOINs (a sketch of two of them follows this list):
use an external dictionary to get rid of the join on a 'small' table
use the Join table engine
use ANY strictness
use specific settings like join_algorithm, partial_merge_join_optimizations, etc.
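A minimal sketch of the ANY strictness and join_algorithm options (left_table, right_table and the value column are assumed names):
SELECT l.uin, r.value
FROM left_table AS l
ANY LEFT JOIN right_table AS r ON l.uin = r.uin
SETTINGS join_algorithm = 'partial_merge';
-- ANY keeps at most one matching row from the right table;
-- join_algorithm switches from the default hash join to a partial-merge join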
Some useful refs:
Altinity webinar: Tips and tricks every ClickHouse user should know
Altinity webinar: Secrets of ClickHouse Query Performance
Answer update:
To reduce storage consumption for a String column, consider changing the column type to LowCardinality (link 2), which significantly decreases the size of a column with many duplicated elements.
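A sketch of that change, assuming a hypothetical table some_table with the uin column:
ALTER TABLE some_table MODIFY COLUMN uin LowCardinality(String);
-- runs as a mutation that rewrites the column; values become dictionary-encoded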
Use this query to get the size of columns:
SELECT
name AS column_name,
formatReadableSize(data_compressed_bytes) AS data_size,
formatReadableSize(marks_bytes) AS index_size,
type,
compression_codec
FROM system.columns
WHERE database = 'db_name' AND table = 'table_name'
ORDER BY data_compressed_bytes DESC
To get a numeric representation of a string, use one of the hash functions:
SELECT 'jsfhuhsdf', xxHash32('jsfhuhsdf'), cityHash64('jsfhuhsdf');
I have 3 tables events_0, events_1, events_2 with Engine = MergeTree
and 1 table events with Engine = Merge
CREATE TABLE events as events_0 ENGINE=Merge(currentDatabase(), '^events');
When I run an SQL query like:
select uuid from events where uuid = 'XXXX-YYY-ZZZZ';
I get a duplicated response:
┌─uuid──────────┐
│ XXXX-YYY-ZZZZ │
└───────────────┘
┌─uuid──────────┐
│ XXXX-YYY-ZZZZ │
└───────────────┘
Try adding _table to the select clause to see which table is generating the data.
select _table, uuid from events where uuid = 'XXXX-YYY-ZZZZ';
It looks like self-recursion to me: the regex ^events also matches the Merge table events itself. You might need to rename the Merge table so it is no longer matched by the regex; an alternative sketch follows below.
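An alternative to renaming (a sketch): recreate the Merge table with a regex that matches only the underlying tables, so the Merge table itself is excluded:
DROP TABLE events;
CREATE TABLE events AS events_0 ENGINE = Merge(currentDatabase(), '^events_[0-9]+$');
-- '^events_[0-9]+$' matches events_0, events_1, events_2 but not events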