The docs for the groupArray function warn that:
Values can be added to the array in any (indeterminate) order.... In
some cases, you can still rely on the order of execution. This applies
to cases when SELECT comes from a subquery that uses ORDER BY.
Does this just mean that the array will not necessarily be in the order specified in the ORDER BY? Can I depend on the order of multiple groupArrays in the same query being consistent with each other?
For instance, given the records:
{commonField:"common", fieldA: "1a", fieldB:"1b"}
{commonField:"common", fieldA: "2a", fieldB:"2b"}
{commonField:"common", fieldA: "3a", fieldB:"3b"}
Can I depend on the query
SELECT commonField, groupArray(fieldA) AS groupedA, groupArray(fieldB) AS groupedB FROM myTable GROUP BY commonField
to return
{
commonField:"common",
groupedA:[
"2a", "3a", "1a"
],
groupedB:[
"2b", "3b", "1b"
]
}
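"...multiple groupArrays in the same query being consistent with each other?"
Yes. They will be consistent.
Alternatively, you can put the columns into a Tuple and use a single groupArray. A Tuple is also useful if you have NULLs, because aggregate functions, including groupArray, skip NULL values, so separate groupArrays can end up with different lengths.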
create table test (K Int64, A Nullable(String), B Nullable(String)) Engine=Memory;
insert into test values(1, '1A','1B')(2, '2A', Null);
select groupArray(A), groupArray(B) from test;
┌─groupArray(A)─┬─groupArray(B)─┐
│ ['1A','2A'] │ ['1B'] │
└───────────────┴───────────────┘
---- Tuple (A,B) one groupArray ----
select groupArray( (A,B) ) from test;
┌─groupArray(tuple(A, B))───┐
│ [('1A','1B'),('2A',NULL)] │
└───────────────────────────┘
select (groupArray( (A,B) ) as ga).1 _A, ga.2 _B from test;
┌─_A──────────┬─_B──────────┐
│ ['1A','2A'] │ ['1B',NULL] │
└─────────────┴─────────────┘
---- One more Tuple trick - Tuple(Null) is not Null ----
select groupArray(tuple(A)).1 _A , groupArray(tuple(B)).1 _B from test;
┌─_A──────────┬─_B──────────┐
│ ['1A','2A'] │ ['1B',NULL] │
└─────────────┴─────────────┘
---- One more Tuple trick tuple(*)
select groupArray( tuple(*) ) from test;
┌─groupArray(tuple(K, A, B))────┐
│ [(1,'1A','1B'),(2,'2A',NULL)] │
└───────────────────────────────┘
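If a specific element order is needed in addition to consistency, one option is to collect tuples and sort them after aggregation. This is a minimal sketch against the test table above; using K as the sort key is an assumption made for illustration, not part of the original answer.

```sql
-- Collect (K, A, B) tuples, sort them by K, then split them back
-- into parallel arrays. NULLs are kept because a Tuple is never NULL.
SELECT
    arraySort(t -> t.1, groupArray((K, A, B))) AS sorted,
    arrayMap(t -> t.2, sorted) AS _A,
    arrayMap(t -> t.3, sorted) AS _B
FROM test;
```

Because both output arrays are derived from the same sorted array of tuples, they stay aligned element by element.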
If ClickHouse is performing a background merge operation (let's say 10 parts into 1 part), would that cause the selected marks to go up? Or are selected marks only governed by read operations performed due to SELECT queries?
In general it should not, but it can because of partition pruning.
create table test( D date, K Int64, S String )
Engine=MergeTree partition by toYYYYMM(D) order by K;
system stop merges test;
insert into test select '2022-01-01', number, '' from numbers(1000000);
insert into test select '2022-01-31', number, '' from numbers(1000000);
select name, min_date, max_date, rows from system.parts where table = 'test' and active;
┌─name─────────┬───min_date─┬───max_date─┬────rows─┐
│ 202201_1_1_0 │ 2022-01-01 │ 2022-01-01 │ 1000000 │ -- two parts in one partition;
│ 202201_2_2_0 │ 2022-01-31 │ 2022-01-31 │ 1000000 │ -- their min_date/max_date ranges do not intersect
└──────────────┴────────────┴────────────┴─────────┘
explain estimate select count() from test where D between '2022-01-01' and '2022-01-15';
┌─database─┬─table─┬─parts─┬────rows─┬─marks─┐
│ dw │ test │ 1 │ 1000000 │ 123 │ -- 123 marks
└──────────┴───────┴───────┴─────────┴───────┘
system start merges test;
optimize table test final;
select name, min_date, max_date, rows from system.parts where table = 'test' and active;
┌─name─────────┬───min_date─┬───max_date─┬────rows─┐
│ 202201_1_2_1 │ 2022-01-01 │ 2022-01-31 │ 2000000 │ one part covers the whole month
└──────────────┴────────────┴────────────┴─────────┘
explain estimate select count() from test where D between '2022-01-01' and '2022-01-15';
┌─database─┬─table─┬─parts─┬────rows─┬─marks─┐
│ dw │ test │ 1 │ 2000000 │ 245 │ -- 245 marks
└──────────┴───────┴───────┴─────────┴───────┘
In real life you will never notice this, because it is a very synthetic case: there are no filters on the primary key index, and the partition column is not part of the primary key index.
It also does not mean that merges make queries slower. It means that ClickHouse is able to use the fact that the data is not merged yet and reads only some of the parts in the partition.
I use a ClickHouse database. There is a table with a String column (data). All rows contain data like:
'[{"a":23, "b":1}]'
'[{"a":7, "b":15}]'
I want to get all values of key "b":
1
15
The following query:
Select JSONExtractInt('data', 0, 'b') from table
returns 0 every time. How can I get the values of key "b"?
SELECT tupleElement(JSONExtract(j, 'Array(Tuple(a Int64, b Int64))'), 'b')[1] AS res
FROM
(
SELECT '[{"a":23, "b":1}]' AS j
UNION ALL
SELECT '[{"a":7, "b":15}]'
)
┌─res─┐
│ 1 │
└─────┘
┌─res─┐
│ 15 │
└─────┘
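The query from the question likely returns 0 for two separate reasons: 'data' in quotes is a string literal rather than the column, and array indexes in the JSONExtract* functions start at 1, not 0. A short sketch of the corrected call, using a literal in place of the table (the table name my_table in the comment is hypothetical):

```sql
-- Index 1 selects the first array element, 'b' selects the key inside it.
SELECT JSONExtractInt('[{"a":23, "b":1}]', 1, 'b') AS res;

-- Against a real table the column is referenced without quotes:
-- SELECT JSONExtractInt(data, 1, 'b') AS res FROM my_table;
```

This form is enough when only the first element of each array is needed; the Array(Tuple(...)) variant above extracts the key from every element.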
In ClickHouse, I want to create a table with a structure like the following:
create table (
time DateTime,
visits array(unit)
)
Engine=memory
the unit struct {
a string,
btime int64,
c string,
e string
}
How to create the table?
You need to use the Nested data structure:
CREATE TABLE visits (
time DateTime,
visits Nested
(
a String,
btime Int64,
c String,
e String
)
) ENGINE = Memory;
/* insert test data */
INSERT INTO visits
VALUES
(now(), ['a1', 'a2'], [1, 2], ['c1', 'c2'], ['e1', 'e2']),
(now(), ['a11', 'a12'], [11, 12], ['c11', 'c12'], ['e11', 'e12']);
SELECT *
FROM visits;
/* results
┌────────────────time─┬─visits.a──────┬─visits.btime─┬─visits.c──────┬─visits.e──────┐
│ 2020-06-12 08:14:07 │ ['a1','a2'] │ [1,2] │ ['c1','c2'] │ ['e1','e2'] │
│ 2020-06-12 08:14:07 │ ['a11','a12'] │ [11,12] │ ['c11','c12'] │ ['e11','e12'] │
└─────────────────────┴───────────────┴──────────────┴───────────────┴───────────────┘
*/
Additionally, see the article Nested Data Structures in ClickHouse.
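An Array of named Tuples may be a closer match to array(unit) from the question. The following is a sketch of that alternative, not the approach used in the answer above; the table name visits_tuple is made up for illustration.

```sql
CREATE TABLE visits_tuple (
    time DateTime,
    visits Array(Tuple(a String, btime Int64, c String, e String))
) ENGINE = Memory;

-- Each array element is one "unit" written as a tuple literal.
INSERT INTO visits_tuple VALUES
    (now(), [('a1', 1, 'c1', 'e1'), ('a2', 2, 'c2', 'e2')]);

-- Extract one field (btime, the second tuple element) from every unit.
SELECT time, arrayMap(v -> v.2, visits) AS btimes
FROM visits_tuple;
```

With Nested, each field is stored as its own array column; with Array(Tuple(...)) each unit is inserted and read as a whole, which can be more convenient when units are always handled together.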
I have a String column uin in several tables. How can I efficiently join these tables on uin?
In the Vertica database we use hash(uin) to transform a string column into a hash of an Int data type, which significantly boosts join efficiency. Could you recommend something like this? I tried CRC32(s), but it does not seem to work correctly.
At the moment ClickHouse does not cope very well with multi-join queries (star-schema databases), and the query optimizer is not good enough to rely on completely.
So you need to state explicitly how to 'execute' a query by using subqueries instead of joins.
Let's emulate your query:
SELECT table_01.number AS r
FROM numbers(87654321) AS table_01
INNER JOIN numbers(7654321) AS table_02 ON (table_01.number = table_02.number)
INNER JOIN numbers(654321) AS table_03 ON (table_02.number = table_03.number)
INNER JOIN numbers(54321) AS table_04 ON (table_03.number = table_04.number)
ORDER BY r DESC
LIMIT 8;
/*
┌─────r─┐
│ 54320 │
│ 54319 │
│ 54318 │
│ 54317 │
│ 54316 │
│ 54315 │
│ 54314 │
│ 54313 │
└───────┘
8 rows in set. Elapsed: 4.244 sec. Processed 96.06 million rows, 768.52 MB (22.64 million rows/s., 181.10 MB/s.)
*/
On my PC it takes ~4 secs. Let's rewrite it using subqueries to significantly speed it up.
SELECT number AS r
FROM numbers(87654321)
WHERE number IN (
SELECT number
FROM numbers(7654321)
WHERE number IN (
SELECT number
FROM numbers(654321)
WHERE number IN (
SELECT number
FROM numbers(54321)
)
)
)
ORDER BY r DESC
LIMIT 8;
/*
┌─────r─┐
│ 54320 │
│ 54319 │
│ 54318 │
│ 54317 │
│ 54316 │
│ 54315 │
│ 54314 │
│ 54313 │
└───────┘
8 rows in set. Elapsed: 0.411 sec. Processed 96.06 million rows, 768.52 MB (233.50 million rows/s., 1.87 GB/s.)
*/
There are other ways to optimize JOIN:
use External dictionary to get rid of join on 'small'-table
use Join table engine
use ANY-strictness
use specific settings like join_algorithm, partial_merge_join_optimizations etc
Some useful refs:
Altinity webinar: Tips and tricks every ClickHouse user should know
Altinity webinar: Secrets of ClickHouse Query Performance
Answer update:
To reduce the storage consumed by a String column, consider changing the column type to LowCardinality, which significantly decreases the size of a column with many duplicated values.
Use this query to get the size of columns:
SELECT
name AS column_name,
formatReadableSize(data_compressed_bytes) AS data_size,
formatReadableSize(marks_bytes) AS index_size,
type,
compression_codec
FROM system.columns
WHERE database = 'db_name' AND table = 'table_name'
ORDER BY data_compressed_bytes DESC
To get a numeric representation of a string, use one of the hash functions:
SELECT 'jsfhuhsdf', xxHash32('jsfhuhsdf'), cityHash64('jsfhuhsdf');
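One way to apply this to joins is to store the hash next to the string and join on the integer column. The sketch below is an assumption about how that could look: the table names t_left and t_right and the column uin_hash are hypothetical. Note that distinct strings can, in rare cases, hash to the same value, so a join on the hash alone may produce false matches.

```sql
-- The hash is computed once at insert time from uin.
CREATE TABLE t_left (
    uin String,
    uin_hash UInt64 MATERIALIZED cityHash64(uin)
) ENGINE = MergeTree ORDER BY uin_hash;

CREATE TABLE t_right (
    uin String,
    uin_hash UInt64 MATERIALIZED cityHash64(uin)
) ENGINE = MergeTree ORDER BY uin_hash;

-- Join on the UInt64 hash instead of the String column.
SELECT l.uin
FROM t_left AS l
INNER JOIN t_right AS r ON l.uin_hash = r.uin_hash;
```

MATERIALIZED columns are not returned by SELECT *, but they can be referenced by name as shown.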
I am wondering whether there is a faster way to do what I am trying to do below: basically, unnesting an array and creating a groupArray with different columns.
-- create table
CREATE TABLE default.t15 ( product String, indx Array(UInt8), col1 String, col2 Array(UInt8)) ENGINE = Memory ;
--insert values
INSERT into t15 values ('p',[1,2,3],'a',[10,20,30]),('p',[1,2,3],'b',[40,50,60]),('p',[1,2,3],'c',[70,80,90]);
-- select values
SELECT * from t15;
┌─product─┬─indx────┬─col1─┬─col2───────┐
│ p │ [1,2,3] │ a │ [10,20,30] │
│ p │ [1,2,3] │ b │ [40,50,60] │
│ p │ [1,2,3] │ c │ [70,80,90] │
└─────────┴─────────┴──────┴────────────┘
DESIRED OUTPUT
┌─product─┬─indx_list─┬─col1_arr──────┬─col2_arr───┐
│ p │ 1 │ ['a','b','c'] │ [10,40,70] │
│ p │ 2 │ ['a','b','c'] │ [20,50,80] │
│ p │ 3 │ ['a','b','c'] │ [30,60,90] │
└─────────┴───────────┴───────────────┴────────────┘
How I am doing it (a little slow for what I need this for):
SELECT product,
indx_list,
groupArray(col1) col1_arr,
groupArray(col2_list) col2_arr
FROM (
SELECT product,
indx_list,
col1,
col2_list
FROM t15
ARRAY JOIN
indx AS indx_list,
col2 AS col2_list
ORDER BY indx_list,
col1
)x
GROUP BY product,
indx_list;
Basically, I am unnesting the array and then grouping them back.
Is there a better and faster way to do this?
Thanks!
If you want to make it faster, it looks like you can avoid the subselect and the global ORDER BY in it. So something like:
SELECT
product,
indx_list,
groupArray(col1) AS col1_arr,
groupArray(col2_list) AS col2_arr
FROM t15
ARRAY JOIN
indx AS indx_list,
col2 AS col2_list
GROUP BY
product,
indx_list
If you need the arrays to be sorted, it's usually better to sort them inside each group separately, using arraySort.
I would simplify the query a little to reduce the number of array joins to one, which probably improves performance:
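A possible shape for the per-group sort, sketched under the assumption that col1 and col2 values must stay paired: collect them as tuples, sort the tuples, and split them afterwards. The alias pairs is introduced here for illustration.

```sql
SELECT
    product,
    indx_list,
    -- Tuples sort by their first element (col1), then by the second.
    arraySort(groupArray((col1, col2_list))) AS pairs,
    arrayMap(p -> p.1, pairs) AS col1_arr,
    arrayMap(p -> p.2, pairs) AS col2_arr
FROM t15
ARRAY JOIN
    indx AS indx_list,
    col2 AS col2_list
GROUP BY
    product,
    indx_list
ORDER BY indx_list;
```

Sorting small arrays inside each group avoids the global sort over all unnested rows, and the final ORDER BY only has to order the grouped result.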
SELECT
product,
index as indx_list,
groupArray(col1) as col1_arr,
groupArray(element) as col2_arr
FROM
(
SELECT
product,
arrayJoin(indx) AS index,
col1,
col2[index] AS element
FROM default.t15
)
GROUP BY
product,
index;
It may also make sense to change the table structure to get rid of the arrays altogether. I would suggest a flat schema:
CREATE TABLE default.t15 (
product String,
valueId UInt8, /* indx */
col1 String, /* col1 */
value UInt8) /* col2 */
ENGINE = Memory ;
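With the flat schema, the desired output would no longer need any array join. A sketch of the query, assuming the table has been created and populated with one row per (product, valueId, col1) as defined above:

```sql
SELECT
    product,
    valueId AS indx_list,
    groupArray(col1) AS col1_arr,
    groupArray(value) AS col2_arr
FROM default.t15
GROUP BY
    product,
    valueId
ORDER BY valueId;
```

As in the earlier answers, the two arrays are consistent with each other, but their internal order is not guaranteed unless they are sorted explicitly.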