Is there a way to combine (concatenate) Nested fields in a GROUP BY in ClickHouse, in an AggregatingMergeTree materialized view?
Imagine that I have a table with a schema (simplified) like this:
CREATE TABLE test
(
key1 String,
key2 String,
clicks Int32,
points Nested(x Int32, y Int32)
) Engine = Log
I would like to be able to use an AggregatingMergeTree to generate a materialized view that combines the nested fields by "concatenating" them (as if nested records were complex values that could simply be concatenated, the way some SQL dialects allow).
If I do it as a plain query, it's possible:
SELECT
key1,
key2,
arrayMap(p -> p.1, points) as x,
arrayMap(p -> p.2, points) as y
FROM
(
SELECT
key1,
key2,
groupArray(tuple(x, y)) as points
FROM
(
SELECT
key1, key2, points.x as x, points.y as y
FROM test
ARRAY JOIN points
)
GROUP BY key1, key2
)
Is there a way to express this in the query used in a materialized view based on the AggregatingMergeTree engine? The best I could come up with is something like this:
CREATE MATERIALIZED VIEW testagg1
engine = AggregatingMergeTree partition by key1 order by (key1, key2)
AS
SELECT
key1,
key2,
sumState(clicks) as clicks,
groupArrayState(points.x) as `points.x`,
groupArrayState(points.y) as `points.y`
FROM test
GROUP BY key1, key2
I can then get the flattened form using this query:
SELECT
arrayMap(p -> p.1, arrayZip(x, y)) as x1,
arrayMap(p -> p.2, arrayZip(x, y)) as y1
FROM
(
SELECT
key1,
key2,
groupArrayMerge(`points.x`) as x,
groupArrayMerge(`points.y`) as y
FROM testagg1
GROUP BY key1, key2
) as points
ARRAY JOIN x, y
It works but seems a bit complex.
Is there a simpler and better way to do this?
Are the groupArrayState and groupArrayMerge aggregations used above guaranteed to preserve the same ordering of the x/y fields in the parallel arrays?
Nested(x Int32, y Int32) is syntactic sugar in the CREATE TABLE command that reduces array boilerplate.
desc test
┌─name─────┬─type─────────┐
│ key1     │ String       │
│ key2     │ String       │
│ clicks   │ Int32        │
│ points.x │ Array(Int32) │
│ points.y │ Array(Int32) │
└──────────┴──────────────┘
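To make the sugar concrete, here is a sketch of the explicit declaration that the Nested column expands to (inferred from the DESCRIBE output above; the table name test_flat is hypothetical):
CREATE TABLE test_flat
(
    key1 String,
    key2 String,
    clicks Int32,
    `points.x` Array(Int32),
    `points.y` Array(Int32)
) Engine = Log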
The best I could come up with is something like this:
groupArrayState(points.x) as points.x,
groupArrayState(points.y) as points.y
That is the only way, and it is the official/right ClickHouse way.
aggregations used above guaranteed to preserve the same
ordering of the x/y fields in the parallel arrays?
yes, it's guaranteed.
SELECT
arrayMap(p -> p.1, arrayZip(x, y)) as x1,
arrayMap(p -> p.2, arrayZip(x, y)) as y1
It's the same as
SELECT x,y
isn't it?
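In other words, the outer arrayMap/arrayZip round trip can simply be dropped. A minimal sketch of the simplified read query, assuming the testagg1 view from the question:
SELECT
    key1,
    key2,
    groupArrayMerge(`points.x`) as x,
    groupArrayMerge(`points.y`) as y
FROM testagg1
GROUP BY key1, key2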
Related
table
CREATE TABLE test
(
uid UUID,
agc Int64,
stc Int8,
oci Int32,
sci Int32,
fcd String,
prc Float64
) engine = MergeTree()
ORDER BY (agc, oci);
base query
SELECT fcd, groupArray((agc, stc, oci, sci, (uid, prc))) as arr
FROM test
GROUP BY fcd;
Next, I want to group the groupArray result by the first 4 values, like this (I know that groupArray cannot nest groupArray):
SELECT fcd, groupArray((agc, stc, oci, sci, groupArray((uid, prc)))) as arr
example output (fcd, arr):
'str', [(1, 1, 1, 2, [(id1, 10), (id2, 15)]), (1, 1, 1, 2, [(id3, 13), (id3, 11)])]
Try this query:
SELECT arrayJoin(arr_result) AS result
FROM
(
SELECT
id,
groupArray((v2, v3)) AS arr1,
groupArray((v4, v5)) AS arr2,
arrayMap(x -> (untuple(x), arr2), arr1) AS arr_result
FROM
(
SELECT
number % 2 AS id,
number AS v2,
number AS v3,
number AS v4,
number AS v5
FROM numbers(4)
)
GROUP BY id
)
/*
┌─result──────────────┐
│ (0,0,[(0,0),(2,2)]) │
│ (2,2,[(0,0),(2,2)]) │
│ (1,1,[(1,1),(3,3)]) │
│ (3,3,[(1,1),(3,3)]) │
└─────────────────────┘
*/
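For reference, a hedged sketch of the same pattern mapped onto the question's columns (table and column names are taken from the question; whether this untuple trick produces exactly the nesting you want is an assumption, not something verified here):
SELECT fcd, arrayJoin(arr_result) AS result
FROM
(
    SELECT
        fcd,
        -- collect the key tuples and the (uid, prc) pairs per fcd,
        -- then attach the pair array to each key tuple
        groupArray((agc, stc, oci, sci)) AS arr1,
        groupArray((uid, prc)) AS arr2,
        arrayMap(x -> (untuple(x), arr2), arr1) AS arr_result
    FROM test
    GROUP BY fcd
)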
I use the ClickHouse database. There is a table with a String column (data). All rows contain data like:
'[{"a":23, "b":1}]'
'[{"a":7, "b":15}]'
I want to get all values of the key "b":
1
15
The following query:
Select JSONExtractInt('data', 0, 'b') from table
returns 0 every time. How can I get the values of key "b"?
SELECT tupleElement(JSONExtract(j, 'Array(Tuple(a Int64, b Int64))'), 'b')[1] AS res
FROM
(
SELECT '[{"a":23, "b":1}]' AS j
UNION ALL
SELECT '[{"a":7, "b":15}]'
)
┌─res─┐
│ 1 │
└─────┘
┌─res─┐
│ 15 │
└─────┘
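Applied to the question's own table this would look roughly like the sketch below (the table name table and column name data come from the question; note the column is referenced unquoted, not as the string literal 'data' used in the failing query):
-- extract the first "b" value from the JSON array in each row
SELECT tupleElement(JSONExtract(data, 'Array(Tuple(a Int64, b Int64))'), 'b')[1] AS b
FROM table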
The docs for the groupArray function warns that
Values can be added to the array in any (indeterminate) order.... In
some cases, you can still rely on the order of execution. This applies
to cases when SELECT comes from a subquery that uses ORDER BY.
Does this just mean that the array will not necessarily be in the order specified in the ORDER BY? Can I depend on the order of multiple groupArrays in the same query being consistent with each other?
For instance given the records:
{commonField:"common", fieldA: "1a", fieldB:"1b"}
{commonField:"common", fieldA: "2a", fieldB:"2b"}
{commonField:"common", fieldA: "3a", fieldB:"3b"}
Can I depend on the query
SELECT commonField, groupArray(fieldA), groupArray(fieldB) FROM myTable GROUP BY commonField
to return
{
commonField:"common",
groupedA:[
"2a", "3a", "1a"
],
groupedB:[
"2b", "3b", "1b"
]
}
multiple groupArrays in the same query being consistent with each other?
Yes. They will be consistent.
Anyway, you can use a Tuple and a single groupArray. The Tuple is useful if you have NULLs, because all aggregate functions skip Nulls.
create table test (K Int64, A Nullable(String), B Nullable(String)) Engine=Memory;
insert into test values (1, '1A', '1B'), (2, '2A', Null);
select groupArray(A), groupArray(B) from test;
┌─groupArray(A)─┬─groupArray(B)─┐
│ ['1A','2A'] │ ['1B'] │
└───────────────┴───────────────┘
---- Tuple (A,B) one groupArray ----
select groupArray( (A,B) ) from test;
┌─groupArray(tuple(A, B))───┐
│ [('1A','1B'),('2A',NULL)] │
└───────────────────────────┘
select (groupArray( (A,B) ) as ga).1 _A, ga.2 _B from test;
┌─_A──────────┬─_B──────────┐
│ ['1A','2A'] │ ['1B',NULL] │
└─────────────┴─────────────┘
---- One more Tuple trick - Tuple(Null) is not Null ----
select groupArray(tuple(A)).1 _A , groupArray(tuple(B)).1 _B from test;
┌─_A──────────┬─_B──────────┐
│ ['1A','2A'] │ ['1B',NULL] │
└─────────────┴─────────────┘
---- One more Tuple trick: tuple(*) ----
select groupArray( tuple(*) ) from test;
┌─groupArray(tuple(K, A, B))────┐
│ [(1,'1A','1B'),(2,'2A',NULL)] │
└───────────────────────────────┘
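To tie this back to the question, a sketch of the single-groupArray variant using the question's own fields (commonField, fieldA, fieldB and the table myTable come from the question; the aliases groupedA/groupedB are the ones it expected):
SELECT
    commonField,
    -- one groupArray of tuples keeps the A/B pairs aligned, then split back into two arrays
    (groupArray((fieldA, fieldB)) AS ga).1 AS groupedA,
    ga.2 AS groupedB
FROM myTable
GROUP BY commonField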
I've a ClickHouse query question; I'm pretty new to ClickHouse so maybe it's an easy one for the experts ;)! We have a single table with events in it, and each event is linked to a product, e.g. product_click, product_view. I want to extract the data grouped by product, but on a single line I need each event type in a separate column so I can sort on it.
I already wrote this query:
SELECT product_id,
arrayMap((x, y) -> (x, y),
(arrayReduce('sumMap', [(groupArrayArray([event_type]) as arr)],
[arrayResize(CAST([], 'Array(UInt64)'), length(arr), toUInt64(1))]) as s).1, s.2) events
FROM events
GROUP BY product_id
Result:
┌─────────────────────────product_id───┬─events─────────────────────────────────────────────────────────────────────────────────────┐
│ 0071f1e4-a484-448e-8355-64e2fea98fd5 │ [('PRODUCT_CLICK',1341),('PRODUCT_VIEW',11)] │
│ 406f4707-6bad-4d3f-9544-c74fdeb1e09d │ [('PRODUCT_CLICK',1),('PRODUCT_VIEW',122),('PRODUCT_BUY',37)] │
│ 94566b6d-6e23-4264-ad76-697ffcfe60c4 │ [('PRODUCT_CLICK',1027),('PRODUCT_VIEW',7)] │
...
Is there any way to convert the arrayMap result into columns with a sort key?
So we can filter on the most clicked products first, or the most viewed?
Another question, is having this kind of queries a good idea to always execute, or should we create a MATERIALIZED view for it?
Thanks!
SQL does not allow a variable number of columns. The only way for you is:
SELECT product_id,
countIf(event_type = 'PRODUCT_CLICK') PRODUCT_CLICK,
countIf(event_type = 'PRODUCT_VIEW') PRODUCT_VIEW,
countIf(event_type = 'PRODUCT_BUY') PRODUCT_BUY
FROM events
GROUP BY product_id
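To cover the sorting part of the question, a small sketch adding an ORDER BY on one of the resulting columns (sorting by most-clicked first is just an example choice):
SELECT product_id,
       countIf(event_type = 'PRODUCT_CLICK') PRODUCT_CLICK,
       countIf(event_type = 'PRODUCT_VIEW') PRODUCT_VIEW,
       countIf(event_type = 'PRODUCT_BUY') PRODUCT_BUY
FROM events
GROUP BY product_id
ORDER BY PRODUCT_CLICK DESC -- most clicked products first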
I am wondering whether there is a faster way to do what I am trying to do below: basically, unnesting an array and creating a groupArray with different columns.
-- create table
CREATE TABLE default.t15 ( product String, indx Array(UInt8), col1 String, col2 Array(UInt8)) ENGINE = Memory ;
--insert values
INSERT into t15 values ('p',[1,2,3],'a',[10,20,30]),('p',[1,2,3],'b',[40,50,60]),('p',[1,2,3],'c',[70,80,90]);
-- select values
SELECT * from t15;
┌─product─┬─indx────┬─col1─┬─col2───────┐
│ p │ [1,2,3] │ a │ [10,20,30] │
│ p │ [1,2,3] │ b │ [40,50,60] │
│ p │ [1,2,3] │ c │ [70,80,90] │
└─────────┴─────────┴──────┴────────────┘
DESIRED OUTPUT
┌─product─┬─indx_list─┬─col1_arr──────┬─col2_arr───┐
│ p │ 1 │ ['a','b','c'] │ [10,40,70] │
│ p │ 2 │ ['a','b','c'] │ [20,50,80] │
│ p │ 3 │ ['a','b','c'] │ [30,60,90] │
└─────────┴───────────┴───────────────┴────────────┘
How I am doing it currently (a little slow for what I need it for):
SELECT product,
indx_list,
groupArray(col1) col1_arr,
groupArray(col2_list) col2_arr
FROM (
SELECT product,
indx_list,
col1,
col2_list
FROM t15
ARRAY JOIN
indx AS indx_list,
col2 AS col2_list
ORDER BY indx_list,
col1
)x
GROUP BY product,
indx_list;
Basically, I am unnesting the array and then grouping them back.
Is there a better and faster way to do this?
Thanks!
If you want to make it faster, it looks like you can avoid the subselect and the global ORDER BY in it. So something like:
SELECT
product,
indx_list,
groupArray(col1) AS col1_arr,
groupArray(col2_list) AS col2_arr
FROM t15
ARRAY JOIN
indx AS indx_list,
col2 AS col2_list
GROUP BY
product,
indx_list
If you need the arrays to be sorted, it's usually better to sort them inside each group separately, using arraySort.
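A hedged sketch of that per-group sort, assuming you want both parallel arrays ordered by col1 (same table and columns as above):
SELECT
    product,
    indx_list,
    -- aggregate (col1, col2) pairs, sort them once per group, then split into two arrays
    arrayMap(t -> t.1, arraySort(groupArray((col1, col2_list)))) AS col1_arr,
    arrayMap(t -> t.2, arraySort(groupArray((col1, col2_list)))) AS col2_arr
FROM t15
ARRAY JOIN
    indx AS indx_list,
    col2 AS col2_list
GROUP BY
    product,
    indx_list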
I would make the query a little simpler to reduce the number of array joins to one, which probably improves performance:
SELECT
product,
index as indx_list,
groupArray(col1) as col1_arr,
groupArray(element) as col2_arr
FROM
(
SELECT
product,
arrayJoin(indx) AS index,
col1,
col2[index] AS element
FROM default.t15
)
GROUP BY
product,
index;
Maybe it makes sense to change the table structure to get rid of the arrays entirely. I would suggest this flat schema:
CREATE TABLE default.t15 (
product String,
valueId UInt8, /* indx */
col1 String, /* col1 */
value UInt8) /* col2 */
ENGINE = Memory ;
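With that flat schema the desired output becomes a plain GROUP BY, roughly like the sketch below (column names as in the suggested table; this query is an assumption, not part of the original answer):
SELECT
    product,
    valueId AS indx_list,
    groupArray(col1) AS col1_arr,
    groupArray(value) AS col2_arr
FROM default.t15
GROUP BY
    product,
    valueId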