centos7-231 :) select round(123.454, 2), round(123.445, 2);
SELECT
round(123.454, 2),
round(123.445, 2)
┌─round(123.454, 2)─┬─round(123.445, 2)─┐
│ 123.45 │ 123.44 │
└───────────────────┴───────────────────┘
1 rows in set. Elapsed: 0.002 sec.
centos7-231 :) select version();
SELECT version()
┌─version()─┐
│ 18.10.3 │
└───────────┘
1 rows in set. Elapsed: 0.005 sec.
round(123.445, 2) should return 123.45, so why does ClickHouse return 123.44? Can somebody help?
In an older version of ClickHouse:
Connected to ClickHouse server version 1.1.54318.
:) select round(123.455, 2), round(123.445, 2);
SELECT
round(123.455, 2),
round(123.445, 2)
┌─round(123.455, 2)─┬─round(123.445, 2)─┐
│ 123.46 │ 123.45 │
└───────────────────┴───────────────────┘
Thank you!
ClickHouse uses banker's rounding, which rounds half to even.
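For example (a minimal check, assuming a recent server version):
select round(0.5), round(1.5), round(2.5), round(3.5);
-- expected: 0, 2, 2, 4 -- each half is rounded to the nearest even value
Recent versions also provide roundBankers() to request this rounding mode explicitly.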
I want to check if a Map contains all the entries I need.
> create table map_test (my_map Map(String, String)) engine = Memory;
> insert into map_test values ({'k1': 'v1', 'k2': 'v2'}), ({'k1': 'v1', 'k2': 'v2', 'k3': 'v3'}), ({'k1': 'v1', 'k4': 'v4'});
> select * from map_test;
┌─my_map──────────────────────────┐
│ {'k1':'v1','k2':'v2'} │
│ {'k1':'v1','k2':'v2','k3':'v3'} │
│ {'k1':'v1','k4':'v4'} │
└─────────────────────────────────┘
3 rows in set. Elapsed: 0.001 sec.
-- get the rows that "my_map" contains all entries I need.
> select * from map_test where my_map['k1'] = 'v1' and my_map['k2'] = 'v2'; -- The SQL will be very long.
┌─my_map──────────────────────────┐
│ {'k1':'v1','k2':'v2'} │
│ {'k1':'v1','k2':'v2','k3':'v3'} │
└─────────────────────────────────┘
2 rows in set. Elapsed: 0.001 sec.
If I have a lot of entries in the WHERE clause, the SQL will be very long.
Is there a way to do it with something like the following SQL?
select * from map_test where mapContainsAll(my_map, {'k1': 'v1', 'k2': 'v2'});
I've read the documentation about Map, but I can't find a function like mapContainsAll.
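Since there seems to be no built-in mapContainsAll, one possible workaround (a sketch only, reusing the map_test table above) is to pass the required keys and values as parallel arrays and check them with a higher-order function:
-- keep rows where every required (key, value) pair is present in my_map
select * from map_test
where arrayAll((k, v) -> my_map[k] = v, ['k1', 'k2'], ['v1', 'v2']);
Note that my_map[k] returns an empty string for a missing key, so requiring an empty-string value would need an extra mapContains(my_map, k) check.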
If ClickHouse is performing a background merge operation (let's say 10 parts into 1 part), would that cause the selected marks to go up? Or are selected marks governed only by read operations performed by SELECT queries?
It should not in general, but it may because of partition pruning.
create table test( D date, K Int64, S String )
Engine=MergeTree partition by toYYYYMM(D) order by K;
system stop merges test;
insert into test select '2022-01-01', number, '' from numbers(1000000);
insert into test select '2022-01-31', number, '' from numbers(1000000);
select name, min_date, max_date, rows from system.parts where table = 'test' and active;
┌─name─────────┬───min_date─┬───max_date─┬────rows─┐
│ 202201_1_1_0 │ 2022-01-01 │ 2022-01-01 │ 1000000 │ -- two parts in one partition;
│ 202201_2_2_0 │ 2022-01-31 │ 2022-01-31 │ 1000000 │ -- their min_date & max_date do not intersect
└──────────────┴────────────┴────────────┴─────────┘
explain estimate select count() from test where D between '2022-01-01' and '2022-01-15';
┌─database─┬─table─┬─parts─┬────rows─┬─marks─┐
│ dw │ test │ 1 │ 1000000 │ 123 │ -- 123 marks.
└──────────┴───────┴───────┴─────────┴───────┘
system start merges test;
optimize table test final;
select name, min_date, max_date, rows from system.parts where table = 'test' and active;
┌─name─────────┬───min_date─┬───max_date─┬────rows─┐
│ 202201_1_2_1 │ 2022-01-01 │ 2022-01-31 │ 2000000 │ one part covers the whole month
└──────────────┴────────────┴────────────┴─────────┘
explain estimate select count() from test where D between '2022-01-01' and '2022-01-15';
┌─database─┬─table─┬─parts─┬────rows─┬─marks─┐
│ dw │ test │ 1 │ 2000000 │ 245 │ -- 245 marks.
└──────────┴───────┴───────┴─────────┴───────┘
In real life you will never notice this, because it's a very synthetic case: there are no filters on the primary key index, and the partition column is not in the primary key index.
And it does not mean that merges make queries slower; it means that ClickHouse is able to leverage the fact that the data is not merged yet and read only part of the data in a partition.
Let's say I have a table defined as
CREATE TABLE orders (
sqlId Int64, -- orders.id from PSQL
isApproved UInt8, -- Boolean
comment String,
price Decimal(10, 2),
createdAt DateTime64(9, 'UTC'),
updatedAt DateTime64(9, 'UTC') DEFAULT NOW()
)
ENGINE = MergeTree
ORDER BY (createdAt, sqlId)
So two fields might be changed in the source PSQL database: isApproved and comment.
Naturally, if I sink some records from an MQ topic into it, I will end up with something like this:
SELECT *
FROM orders
Query id: 50cd95a4-e581-41b5-82a4-7ec86771e4e5
┌─sqlId─┬─isApproved─┬──price─┬─comment───┬─────────────────────createdAt─┬─────────────────────updatedAt─┐
│ 1 │ 1 │ 100.00 │ some note │ 2021-11-08 16:24:07.000000000 │ 2021-11-08 16:27:29.000000000 │
└───────┴────────────┴────────┴───────────┴───────────────────────────────┴───────────────────────────────┘
┌─sqlId─┬─isApproved─┬──price─┬─comment─┬─────────────────────createdAt─┬─────────────────────updatedAt─┐
│ 1 │ 1 │ 100.00 │ │ 2021-11-08 16:24:07.000000000 │ 2021-11-08 16:27:22.000000000 │
└───────┴────────────┴────────┴─────────┴───────────────────────────────┴───────────────────────────────┘
┌─sqlId─┬─isApproved─┬──price─┬─comment─┬─────────────────────createdAt─┬─────────────────────updatedAt─┐
│ 1 │ 0 │ 100.00 │ │ 2021-11-08 16:24:07.000000000 │ 2021-11-08 16:27:17.000000000 │
└───────┴────────────┴────────┴─────────┴───────────────────────────────┴───────────────────────────────┘
In other words, a particular order was first created as non-approved, then it was approved, and then a comment was added.
Let's say I want to create a view that represents the total volume of orders per day.
A naive approach might be:
CREATE MATERIALIZED VIEW orders_volume_per_day (
day DateTime64(9, 'UTC'),
volume SimpleAggregateFunction(sum, Decimal(38, 2))
)
ENGINE = AggregatingMergeTree
PARTITION BY toYYYYMM(day)
ORDER BY day
AS SELECT
toStartOfDay(createdAt) as day,
sum(price) as volume
FROM orders
GROUP BY day
ORDER BY day ASC
However, it will use all three redundant records, while I only need the latest one.
In my particular example, the view will return 300 (3x100) instead of just 100.
Is there any way to achieve the desired behavior in ClickHouse? I know that I can use VersionedCollapsingMergeTree somehow, with sign or version columns, but it seems that tools like clickhouse_sinker or Snuba will not support it.
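One query-time workaround (a sketch that reuses the orders columns above, not the materialized view asked for) is to collapse the duplicates with argMax before aggregating, taking each value from the row with the latest updatedAt:
SELECT
    toStartOfDay(createdAt) AS day,
    sum(price) AS volume
FROM
(
    -- keep only the latest version of every order
    SELECT
        sqlId,
        argMax(price, updatedAt) AS price,
        any(createdAt) AS createdAt
    FROM orders
    GROUP BY sqlId
)
GROUP BY day
ORDER BY day;
Whether this can be pushed into a materialized view that clickhouse_sinker or Snuba would feed is a separate question.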
Assuming following schema:
CREATE TABLE test
(
date Date,
user_id UInt32,
user_answer UInt8,
user_multi_choice_answer Array(UInt8),
events UInt32
)
ENGINE = MergeTree() ORDER BY date;
And contents:
INSERT INTO test VALUES
('2020-01-01', 1, 5, [2, 3], 15),
('2020-01-01', 2, 6, [1, 2], 7);
Let's say I want to make a query "give me # of users and # of their events grouped by date and user_answer, with subtotals". That's easy:
select date, user_answer, count(distinct user_id), sum(events) from test group by date, user_answer with rollup;
┌───────date─┬─user_answer─┬─uniqExact(user_id)─┬─sum(events)─┐
│ 2020-01-01 │ 5 │ 1 │ 15 │
│ 2020-01-01 │ 6 │ 1 │ 7 │
│ 2020-01-01 │ 0 │ 2 │ 22 │
│ 0000-00-00 │ 0 │ 2 │ 22 │
└────────────┴─────────────┴────────────────────┴─────────────┘
What I can't easily do is make queries with overlapping groups, like when grouping by individual options of a multiple-choice question. For example:
# of users and # of their events grouped by date and user_multi_choice_answer, with subtotals
# of users and # of their events grouped by arbitrary hand-written grouping conditions, like "compare users with user_answer=5 and has(user_multi_choice_answer, 1) to users with has(user_multi_choice_answer, 2)"
For example, with the first query, I would like to see the following:
┌───────date─┬─user_multi_choice_answer─┬─uniqExact(user_id)─┬─sum(events)─┐
│ 2020-01-01 │ 1 │ 1 │ 15 │
│ 2020-01-01 │ 2 │ 2 │ 22 │
│ 2020-01-01 │ 3 │ 1 │ 7 │
│ 2020-01-01 │ 0 │ 2 │ 22 │
│ 0000-00-00 │ 0 │ 2 │ 22 │
└────────────┴──────────────────────────┴────────────────────┴─────────────┘
And for the second:
┌─my_grouping_id─┬─uniqExact(user_id)─┬─sum(events)─┐
│ 1 │ 1 │ 15 │ # users fulfilling arbitrary condition #1
│ 2 │ 2 │ 22 │ # users fulfilling arbitrary condition #2
│ 0 │ 2 │ 22 │ # subtotal
└────────────────┴────────────────────┴─────────────┘
The closest I can get to that is by using arrayJoin():
select date, arrayJoin(user_multi_choice_answer) as multi_answer, count(distinct user_id), sum(events)
from test group by date, multi_answer with rollup;
select arrayJoin(
arrayConcat(
if(user_answer=5 and has(user_multi_choice_answer, 3), [1], []),
if(has(user_multi_choice_answer, 2), [2], [])
)
) as my_grouping_id, count(distinct user_id), sum(events)
from test group by my_grouping_id with rollup;
But that's not a good solution for two reasons:
While it calculates correct results for grouping, the result for sum(events) is not correct for subtotals (as duplicated rows are counted multiple times)
It doesn't seem efficient, as it creates a lot of data duplication (while I just want the same row to be aggregated into several groups)
So, again, I'm looking for a way that would allow me to easily group answers to multiple-choice questions and group by arbitrary conditions on some columns. I'm okay with changing the schema to make that possible, but I'm mostly hoping ClickHouse has a built-in way to achieve this.
While it calculates correct results for grouping, the result for sum(events) is not correct for subtotals (as duplicated rows are counted multiple times)
You can manually create my_grouping_id = 0 without using rollup. For example,
select arrayJoin(
arrayConcat(
[0],
if(user_answer=5 and has(user_multi_choice_answer, 3), [1], []),
if(has(user_multi_choice_answer, 2), [2], [])
)
) as my_grouping_id, count(distinct user_id), sum(events)
from test group by my_grouping_id
It doesn't seem efficient, as it creates a lot of data duplication (while I just want the same row to be aggregated into several groups)
Currently it's not possible. But I see possibilities. I'll try to make a POC of GROUP BY ARRAY. It seems to be a valid use case.
I have a String column uin in several tables; how can I effectively join these tables on uin?
In Vertica we use hash(uin) to transform the string column into a hash with an Int data type, which significantly boosts join efficiency. Could you recommend something like this? I tried CRC32(s), but it seems to work incorrectly.
At the moment ClickHouse does not cope very well with multi-join queries (star-schema databases), and the query optimizer is not good enough to rely on completely.
So you need to explicitly tell it how to 'execute' a query by using subqueries instead of joins.
Let's emulate your query:
SELECT table_01.number AS r
FROM numbers(87654321) AS table_01
INNER JOIN numbers(7654321) AS table_02 ON (table_01.number = table_02.number)
INNER JOIN numbers(654321) AS table_03 ON (table_02.number = table_03.number)
INNER JOIN numbers(54321) AS table_04 ON (table_03.number = table_04.number)
ORDER BY r DESC
LIMIT 8;
/*
┌─────r─┐
│ 54320 │
│ 54319 │
│ 54318 │
│ 54317 │
│ 54316 │
│ 54315 │
│ 54314 │
│ 54313 │
└───────┘
8 rows in set. Elapsed: 4.244 sec. Processed 96.06 million rows, 768.52 MB (22.64 million rows/s., 181.10 MB/s.)
*/
On my PC it takes ~4 secs. Let's rewrite it using subqueries to significantly speed it up.
SELECT number AS r
FROM numbers(87654321)
WHERE number IN (
SELECT number
FROM numbers(7654321)
WHERE number IN (
SELECT number
FROM numbers(654321)
WHERE number IN (
SELECT number
FROM numbers(54321)
)
)
)
ORDER BY r DESC
LIMIT 8;
/*
┌─────r─┐
│ 54320 │
│ 54319 │
│ 54318 │
│ 54317 │
│ 54316 │
│ 54315 │
│ 54314 │
│ 54313 │
└───────┘
8 rows in set. Elapsed: 0.411 sec. Processed 96.06 million rows, 768.52 MB (233.50 million rows/s., 1.87 GB/s.)
*/
There are other ways to optimize JOIN:
use External dictionary to get rid of join on 'small'-table
use Join table engine
use ANY strictness (see the short sketch after this list)
use specific settings like join_algorithm, partial_merge_join_optimizations, etc.
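For instance, ANY strictness takes at most one right-side match per left-side row, so duplicated keys on the right do not multiply the result (a minimal sketch on synthetic data):
SELECT t1.number, t2.number
FROM numbers(5) AS t1
ANY LEFT JOIN
(
    -- the right side deliberately contains duplicate keys: 0, 0, 1, 1, 2, 2
    SELECT intDiv(number, 2) AS number FROM numbers(6)
) AS t2 ON t1.number = t2.number;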
Some useful refs:
Altinity webinar: Tips and tricks every ClickHouse user should know
Altinity webinar: Secrets of ClickHouse Query Performance
Answer update:
To reduce storage consumption for a String column, consider changing the column type to LowCardinality(String), which significantly decreases the size of a column with many duplicated values.
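For example (a sketch; the table name is a placeholder, and changing the type rewrites the column, so test on a copy first):
ALTER TABLE some_table MODIFY COLUMN uin LowCardinality(String);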
Use this query to get the size of columns:
SELECT
name AS column_name,
formatReadableSize(data_compressed_bytes) AS data_size,
formatReadableSize(marks_bytes) AS index_size,
type,
compression_codec
FROM system.columns
WHERE database = 'db_name' AND table = 'table_name'
ORDER BY data_compressed_bytes DESC
To get a numeric representation of a string, you need to use one of the hash functions.
SELECT 'jsfhuhsdf', xxHash32('jsfhuhsdf'), cityHash64('jsfhuhsdf');
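If the joins are frequent, one option is to materialize the hash once and join on the numeric column (a sketch with hypothetical table names; note that 64-bit hashes can collide, so this trades exactness for speed):
ALTER TABLE dim_users ADD COLUMN uin_hash UInt64 MATERIALIZED cityHash64(uin);
ALTER TABLE fact_events ADD COLUMN uin_hash UInt64 MATERIALIZED cityHash64(uin);

SELECT count()
FROM fact_events AS f
INNER JOIN dim_users AS d ON f.uin_hash = d.uin_hash;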