How to understand granularity and blocks in ClickHouse?

I am not clear about these two terms.
Does one block have a fixed number of rows?
Is one block the minimum unit read from disk?
Are different blocks stored in different files?
Is the range of one block bigger than a granule? That is, can one block contain several granules (skip-index granules)?

https://clickhouse.tech/docs/en/operations/table_engines/mergetree/#primary-keys-and-indexes-in-queries
The primary key is sparse. By default it contains one value for each 8192 rows (= 1 granule).
Let's disable adaptive granularity (for the test) -- index_granularity_bytes=0
create table X (A Int64)
Engine=MergeTree order by A
settings index_granularity=16,index_granularity_bytes=0;
insert into X select * from numbers(32);
index_granularity=16 -- 32 rows = 2 granules, the primary index has 2 values: 0 and 16
select marks, primary_key_bytes_in_memory from system.parts where table = 'X';
┌─marks─┬─primary_key_bytes_in_memory─┐
│ 2 │ 16 │
└───────┴─────────────────────────────┘
16 bytes == 2 values of Int64.
Adaptive index granularity means that granule sizes vary, because wide rows (many bytes per row) need (for performance) fewer than 8192 rows per granule.
index_granularity_bytes = 10MB ≈ 8192 rows * ~1 KB per row. So each granule holds about 10 MB. If the row size is 100 KB (long Strings), a granule will have about 100 rows (not 8192).
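A hypothetical illustration of this (the table name, row width and counts are mine, not from the original answer): with ~100 KB rows and the default index_granularity_bytes = 10 MiB, granules end up holding roughly 100 rows each.
create table W (S String)
Engine=MergeTree order by tuple()
settings index_granularity=8192, index_granularity_bytes=10485760;
-- each row is ~100 KB, so a granule fills its ~10 MB budget after roughly 100 rows
insert into W select repeat('a', 100000) from numbers(1000);
-- marks should come out around 10 (one per ~100-row granule) rather than 1 for 1000 rows
select marks, rows from system.parts where table = 'W' and active;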
Skip index GRANULARITY N means that the index stores one value for each N table granules.
create table X (A Int64, B Int64, INDEX IX1 (B) TYPE minmax GRANULARITY 4)
Engine=MergeTree order by A
settings index_granularity=16,index_granularity_bytes=0;
insert into X select number, number from numbers(128);
128/16 = 8, so the table has 8 granules, and INDEX IX1 stores 2 minmax values (8/4).
So the minmax index stores 2 values -- (0..63) and (64..127).
0..63 -- covers the first 4 table granules.
64..127 -- covers the second 4 table granules.
set send_logs_level='debug';
select * from X where B=77;
[ 84 ] <Debug> dw.X (SelectExecutor): **Index `IX1` has dropped 1 granules**
[ 84 ] <Debug> dw.X (SelectExecutor): Selected 1 parts by date, 1 parts by key, **4 marks** to read from 1 ranges
SelectExecutor checked the skip index: 4 table granules can be skipped because 77 is not in 0..63.
The other 4 granules must be read (4 marks) because 77 is in (64..127) -- some of those 4 granules may contain B=77.
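On recent ClickHouse versions (an assumption about the version in use) you can also see what the skip index prunes without turning on debug logs:
-- requires a version that supports EXPLAIN indexes
EXPLAIN indexes = 1
SELECT * FROM X WHERE B = 77;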

https://clickhouse.tech/docs/en/development/architecture/#block
A block can contain any number of rows.
For example, 1-row blocks:
set max_block_size=1;
SELECT * FROM numbers_mt(1000000000) LIMIT 3;
┌─number─┐
│ 0 │
└────────┘
┌─number─┐
│ 2 │
└────────┘
┌─number─┐
│ 3 │
└────────┘
set max_block_size=100000000000;
create table X (A Int64) Engine=Memory;
insert into X values(1);
insert into X values(2);
insert into X values(3);
SELECT * FROM X;
┌─A─┐
│ 1 │
└───┘
┌─A─┐
│ 3 │
└───┘
┌─A─┐
│ 2 │
└───┘
Each INSERT created its own block: 3 blocks with 1 row each.
drop table X;
create table X (A Int64) Engine=Memory;
insert into X values(1)(2)(3);
select * from X
┌─A─┐
│ 1 │
│ 2 │
│ 3 │
└───┘
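Here the single INSERT produced one block with 3 rows. To observe block sizes directly, the blockSize() function (my addition, not part of the original answer) reports how many rows are in the block currently being processed:
-- prints 3 for every row of the single-insert table above;
-- it would print 1 for each row of the three-insert variant
SELECT A, blockSize() FROM X;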

Related

how to generate random numbers in a specific range in Clickhouse

ClickHouse only has a few random functions like rand(). But how can I create random numbers within a specific range, say 0-50?
For example, something like:
select rand(0,50) as random_0_50
random_0_50
5
12
32
0
27
ClickHouse lets you use the modulo operator in a SELECT,
so you can constrain the result to a range of x values using select rand() % x.
In the case above (0-50 inclusive, i.e. 51 values) the code will be:
select rand() % 51 AS random_0_50
ClickHouse has a built-in generateRandom() function that helps quickly generate data and populate tables.
Here is an article about this function: https://medium.com/p/45e92c2645c5
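For completeness, a minimal sketch of generateRandom (the column structure here is my own example, not taken from the article):
-- produces random rows matching the declared structure; LIMIT controls how many
SELECT * FROM generateRandom('id UInt32, score Float64') LIMIT 3;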
The rand function returns a number in the range [0, 4294967295]. You can take it modulo 51 to constrain it to the range you want, like this:
SELECT rand() % 51 AS random_0_50
Query id: e4addce1-37b2-44a7-ab51-f37b6ef4ff58
┌─random_0_50─┐
│ 13 │
└─────────────┘
Or you can create a function for it:
CREATE FUNCTION rand_range AS (lower_bound, upper_bound) -> (lower_bound + (rand() % (upper_bound - lower_bound + 1)))
SELECT rand_range(0, 50) AS random_0_50
Query id: 9efa566d-e825-4e34-b228-e6ba6e210b16
┌─random_0_50─┐
│ 50 │
└─────────────┘
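With the modulus written as (upper_bound - lower_bound + 1), a nonzero lower bound also stays inside the requested range, for example:
SELECT rand_range(10, 20) AS random_10_20; -- always between 10 and 20 inclusive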

Multiple arrays Clickhouse

Problem:
Count distinct values in an array filtered by another array on the same row (and aggregate at a higher level).
Explanation:
Using this data:
In size D70 there are 5 pcs available (hqsize), but the shops request 15. Using the column accumulatedNeed, the first 5 stores in the column shops should receive items (since every store requests 1 pc). That is [4098,4101,4109,4076,4080].
It could also be that the values in accumulatedNeed were [1,4,5,5,5,...,15], where shop 1 requests 1 pc, shop 2 requests 3 pcs, etc. Then only 3 stores would get items.
In size E75 there is enough stock, so every shop will receive items (10 shops):
Now I want the distinct list of shops from D70 & E75, which would be the final result:
[4098,4101,4109,4076,4080,4062,4063,4067,4072,4075,4056,4058,4059,4061] (14 unique stores) (4109 is only counted once)
Wanted result:
[4098,4101,4109,4076,4080,4062,4063,4067,4072,4075,4056,4058,4059,4061]. (14 unique stores)
I'm totally open to structuring the data differently if that works better.
The reason it can't be precalculated is that the result depends on which shops are filtered on.
Additional issue
The answer below from Vdimir is good and I've used it as the basis for the final solution, but it does not cover partial fulfillment.
If the stock number lands exactly on a value in the runningNeed array we are all good, but remainders are not handled.
If you run:
select 5 as stock, [2,2,3,3] as need, [1,2,3,4] as shops, arrayCumSum(need) as runningNeed, arrayMap(x -> (x <= stock), runningNeed) as mask
you will get mask = [1,1,0,0].
This is not correct, since the 3rd shop should get 1 from stock (5-2-2 = 1).
I can't seem to get my head around how to make an array with the "stock given", which in this case would be [2,2,1,0].
I used this query to create a table with data similar to your screenshot:
CREATE TABLE t
(
Size String,
hqsize Int,
accumulatedNeed Array(Int),
shops Array(Int)
) engine = Memory;
INSERT INTO t VALUES ('D70', 5, [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15], [4098,4101,4109,4076,4080,4083,4062,4063,4067,4072,4075,4056,4057,4058,4059]),('E75', 43, [1,2,3,4,5,6,7,8,9,10], [4109,4062,4063,4067,4072,4075,4056,4058,4059,4061]);
Find which shops can receive enough items:
SELECT arrayMap(x -> (x <= hqsize), accumulatedNeed) as mask FROM t;
┌─mask────────────────────────────┐
│ [1,1,1,1,1,0,0,0,0,0,0,0,0,0,0] │
│ [1,1,1,1,1,1,1,1,1,1] │
└─────────────────────────────────┘
Filter out unfulfilled shops according to this mask:
Note that shops and accumulatedNeed have to have equal sizes.
SELECT arrayFilter((x,y) -> y, shops, mask) as fulfilled_shops, arrayMap(x -> (x <= hqsize), accumulatedNeed) as mask FROM t;
┌─fulfilled_shops─────────────────────────────────────┬─mask────────────────────────────┐
│ [4098,4101,4109,4076,4080] │ [1,1,1,1,1,0,0,0,0,0,0,0,0,0,0] │
│ [4109,4062,4063,4067,4072,4075,4056,4058,4059,4061] │ [1,1,1,1,1,1,1,1,1,1] │
└─────────────────────────────────────────────────────┴─────────────────────────────────┘
Then you can get all distinct shops:
SELECT DISTINCT arrayJoin(fulfilled_shops) as shops FROM (
SELECT arrayMap(x -> (x <= hqsize), accumulatedNeed) as mask, arrayFilter((x,y) -> y, shops, mask) as fulfilled_shops FROM t
);
┌─shops─┐
│ 4098 │
│ 4101 │
│ 4109 │
│ 4076 │
│ 4080 │
│ 4062 │
│ 4063 │
│ 4067 │
│ 4072 │
│ 4075 │
│ 4056 │
│ 4058 │
│ 4059 │
│ 4061 │
└───────┘
14 rows in set. Elapsed: 0.049 sec.
Or if you need a single array, group it back:
SELECT groupArrayDistinct(arrayJoin(fulfilled_shops)) as shops FROM (
SELECT arrayMap(x -> (x <= hqsize), accumulatedNeed) as mask, arrayFilter((x,y) -> y, shops, mask) as fulfilled_shops FROM t
);
┌─shops───────────────────────────────────────────────────────────────────┐
│ [4080,4076,4101,4075,4056,4061,4062,4063,4109,4058,4067,4059,4072,4098] │
└─────────────────────────────────────────────────────────────────────────┘
If you need data only from D70 & E75, you can filter out the extra rows from the table with WHERE first.
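For the partial-fulfillment issue raised in the question (the [2,2,1,0] "stock given" array), here is a possible sketch that is not part of the answer above: clamp the cumulative need to the stock and take per-shop differences.
select
    5 as stock,
    [2,2,3,3] as need,
    arrayCumSum(need) as runningNeed,
    -- given_i = min(runningNeed_i, stock) - min(runningNeed_i - need_i, stock)
    arrayMap((cum, n) -> least(cum, stock) - least(cum - n, stock), runningNeed, need) as given
-- given should come out as [2,2,1,0]: two shops fully served, one partially, one not at all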

Select data in range from first bad value to last bad value

I have this table and data:
create table sensor_values(
dt DateTime default now(),
value UInt32
)
engine MergeTree()
partition by toYYYYMM(dt)
order by tuple();
insert into sensor_values(value) values (1), (2), (11), (13), (4), (17), (5), (8);
Data:
value
-----
1
2
11
13
4
17
5
8
I would like to select the data in the range from the first bad value (11) to the last bad value (17). Bad values are those greater than 10.
Desired range after select:
value
-----
11
13
4
17
My first thought was to flag whether each value is bad or not and then to calculate (somehow) a cumulative sum:
value isBad cumSum
--------------------
1 0 0
2 0 0
11 1 1
13 1 2
4 0 2
17 1 3
5 0 3
8 0 3
Then I would select from min(cumSum) to max(cumSum) - 1, but then I miss the last bad value.
How can I get the last bad value included in the select result?
You can try to use either window-style functions (see: runningDifference, neighbor) or array functions:
SELECT arrayJoin(slice) as result
FROM (
SELECT
groupArray(data) AS arr,
arrayFirstIndex(x -> (x > 10), arr) AS first_index,
(length(arr) - arrayFirstIndex(x -> (x > 10), arrayReverse(arr)) + 1) AS last_index,
arraySlice(arr, first_index, last_index - first_index + 1) AS slice
FROM
(
/* test dataset */
SELECT arrayJoin([1, 2, 11, 13, 4, 17, 5, 8]) AS data
)
)
/*
┌─result─┐
│ 11 │
│ 13 │
│ 4 │
│ 17 │
└────────┘
*/
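A simpler alternative (my own sketch, not from the answer; it assumes the rows carry distinct, increasing dt timestamps, which the single multi-row INSERT above would not actually give you) is to bracket the range with scalar subqueries:
-- select everything between the timestamp of the first bad value and the last bad value
SELECT value
FROM sensor_values
WHERE dt BETWEEN
        (SELECT min(dt) FROM sensor_values WHERE value > 10)
    AND (SELECT max(dt) FROM sensor_values WHERE value > 10)
ORDER BY dt;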

Can I use 0 and 1 values in quantilesExact

From the quantile function documentation:
We recommend using a level value in the range of [0.01, 0.99]. Don't use a level value equal to 0 or 1 – use the min and max functions for these cases.
Does this also apply to the quantileExact and quantilesExact functions?
In my experiments, I've found that quantileExact(0) = min and quantileExact(1) = max, but I cannot be sure about it.
That recommendation is not about accuracy but about the complexity of quantile*.
quantileExact is much, much heavier than max/min.
See the time difference: min/max is 8 times faster even on a small dataset.
create table Speed Engine=MergeTree order by X
as select number X from numbers(1000000000);
SELECT min(X), max(X) FROM Speed;
┌─min(X)─┬────max(X)─┐
│ 0 │ 999999999 │
└────────┴───────────┘
1 rows in set. Elapsed: 1.040 sec. Processed 1.00 billion rows, 8.00 GB (961.32 million rows/s., 7.69 GB/s.)
SELECT quantileExact(0)(X), quantileExact(1)(X) FROM Speed;
┌─quantileExact(0)(X)─┬─quantileExact(1)(X)─┐
│ 0 │ 999999999 │
└─────────────────────┴─────────────────────┘
1 rows in set. Elapsed: 8.561 sec. Processed 1.00 billion rows, 8.00 GB (116.80 million rows/s., 934.43 MB/s.)
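The plural form behaves the same way; it should return [min, max] in a single pass, though still with the heavier exact algorithm:
SELECT quantilesExact(0, 1)(X) FROM Speed;
-- should return [0, 999999999]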
It turns out it is safe to use 0 and 1 values for quantileExact and quantilesExact functions.

How to use results of [with totals] modifier

We have the [with totals] modifier that can summarize values across all rows and return the total result with a key value of 0, null, or something like that.
The problem is that I don't understand how I can use these values in subsequent calculations.
Maybe I'm using the wrong format:
select processing_date,count(*)
from `telegram.message`
where processing_date>='2019-05-01'
group by processing_date with totals
The documentation says that
You can use WITH TOTALS in subqueries, including subqueries in the
JOIN clause (in this case, the respective total values are combined).
An example of subqueries in a JOIN (from the CH test scripts on GitHub):
SELECT k, s1, s2
FROM
(
SELECT intDiv(number, 3) AS k, sum(number) AS s1
FROM
(
SELECT *
FROM system.numbers
LIMIT 10
)
GROUP BY k WITH TOTALS
)
ANY LEFT JOIN
(
SELECT intDiv(number, 4) AS k, sum(number) AS s2
FROM
(
SELECT *
FROM system.numbers
LIMIT 10
)
GROUP BY k WITH TOTALS
) USING (k)
ORDER BY k ASC
/* Result:
┌─k─┬─s1─┬─s2─┐
│ 0 │ 3 │ 6 │
│ 1 │ 12 │ 22 │
│ 2 │ 21 │ 17 │
│ 3 │ 9 │ 0 │
└───┴────┴────┘
Totals:
┌─k─┬─s1─┬─s2─┐
│ 0 │ 45 │ 45 │
└───┴────┴────┘
*/
As a workaround, you can combine the results of several totals using client libraries.
Using "with rollup" instead of "with totals" solves the formatting problem.
