Duplicated rows in response with Engine = Merge - ClickHouse

I have 3 tables events_0, events_1, events_2 with Engine = MergeTree
and 1 table events with Engine = Merge
CREATE TABLE events as events_0 ENGINE=Merge(currentDatabase(), '^events');
When I run an SQL query like
select uuid from events where uuid = 'XXXX-YYY-ZZZZ';
I get a duplicated response:
┌─uuid──────────┐
│ XXXX-YYY-ZZZZ │
└───────────────┘
┌─uuid──────────┐
│ XXXX-YYY-ZZZZ │
└───────────────┘

Try adding the virtual column _table to the SELECT clause to see which table is generating the data:
select _table, uuid from events where uuid = 'XXXX-YYY-ZZZZ';
It looks like self-recursion to me: the regex ^events also matches the Merge table itself. You might need to rename the Merge table (or tighten the regex) so that it is no longer matched by ^events.
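If that is the cause, one possible fix is to make the regex match only the underlying MergeTree tables; a minimal sketch, where events_all is a hypothetical new name for the Merge table:
-- match only the numbered MergeTree tables (events_0, events_1, events_2),
-- so the Merge table itself can never be picked up by its own regex
CREATE TABLE events_all AS events_0
ENGINE = Merge(currentDatabase(), '^events_[0-9]+$');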

Related

ClickHouse: convert a table with JSON content in each row into a table

I am trying to transform JSON stored in rows into a table with the JSON fields as columns. After looking at the ClickHouse documentation I couldn't find a ClickHouse function that can handle this task.
Here is the source table, with a single column col_a holding the JSON strings:
col_a
{"casa":2,"value":4}
{"casa":6,"value":47}
The goal is to transform it, using only ClickHouse SQL (CREATE ... AS SELECT), into this table:
casa  value
2     4
6     47
SELECT
'{"casa":2,"value":4}' AS j,
JSONExtractKeysAndValuesRaw(j) AS t
┌─j────────────────────┬─t────────────────────────────┐
│ {"casa":2,"value":4} │ [('casa','2'),('value','4')] │
└──────────────────────┴──────────────────────────────┘
SELECT
'{"casa":2,"value":4}' AS j,
JSONExtract(j, 'Tuple(casa Int64, value Int64)') AS t,
tupleElement(t, 'casa') AS casa,
tupleElement(t, 'value') AS value
┌─j────────────────────┬─t─────┬─casa─┬─value─┐
│ {"casa":2,"value":4} │ (2,4) │ 2 │ 4 │
└──────────────────────┴───────┴──────┴───────┘
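Putting that together as the requested CREATE ... AS SELECT; a sketch, assuming the source table is called src and holds the JSON strings in col_a:
CREATE TABLE casa_value
ENGINE = MergeTree
ORDER BY casa
AS
SELECT
    tupleElement(t, 'casa') AS casa,
    tupleElement(t, 'value') AS value
FROM
(
    -- parse each JSON row into a typed tuple first
    SELECT JSONExtract(col_a, 'Tuple(casa Int64, value Int64)') AS t
    FROM src
);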

How to use only the latest state of a particular record in a materialized view?

Let's say I have a table defined as
CREATE TABLE orders (
sqlId Int64, -- orders.id from PSQL
isApproved UInt8, -- Boolean
comment String,
price Decimal(10, 2),
createdAt DateTime64(9, 'UTC'),
updatedAt DateTime64(9, 'UTC') DEFAULT NOW()
)
ENGINE = MergeTree
ORDER BY (createdAt, sqlId)
So two fields might be changed in the source PSQL database: isApproved and comment.
Naturally, if I sink some records from an MQ topic into it, I will end up with something like this:
SELECT *
FROM orders
Query id: 50cd95a4-e581-41b5-82a4-7ec86771e4e5
┌─sqlId─┬─isApproved─┬──price─┬─comment───┬─────────────────────createdAt─┬─────────────────────updatedAt─┐
│ 1 │ 1 │ 100.00 │ some note │ 2021-11-08 16:24:07.000000000 │ 2021-11-08 16:27:29.000000000 │
└───────┴────────────┴────────┴───────────┴───────────────────────────────┴───────────────────────────────┘
┌─sqlId─┬─isApproved─┬──price─┬─comment─┬─────────────────────createdAt─┬─────────────────────updatedAt─┐
│ 1 │ 1 │ 100.00 │ │ 2021-11-08 16:24:07.000000000 │ 2021-11-08 16:27:22.000000000 │
└───────┴────────────┴────────┴─────────┴───────────────────────────────┴───────────────────────────────┘
┌─sqlId─┬─isApproved─┬──price─┬─comment─┬─────────────────────createdAt─┬─────────────────────updatedAt─┐
│ 1 │ 0 │ 100.00 │ │ 2021-11-08 16:24:07.000000000 │ 2021-11-08 16:27:17.000000000 │
└───────┴────────────┴────────┴─────────┴───────────────────────────────┴───────────────────────────────┘
In other words, a particular order was first created as non-approved, then it was approved, and then a comment was added.
Let's say I want to create a view that represents the total volume of orders per day.
A naive approach might be:
CREATE MATERIALIZED VIEW orders_volume_per_day (
day DateTime64(9, 'UTC'),
volume SimpleAggregateFunction(sum, Decimal(38, 2))
)
ENGINE = AggregatingMergeTree
PARTITION BY toYYYYMM(day)
ORDER BY day
AS SELECT
toStartOfDay(createdAt) as day,
sum(price) as volume
FROM orders
GROUP BY day
ORDER BY day ASC
However, it will use all three redundant records, while I only need the latest one.
In my particular example, the view will return 300 (3x100) instead of just 100.
Is there any way to achieve the desired behavior in ClickHouse? I know that I can utilize VersionedCollapsingMergeTree somehow with the sign or version columns, but it seems that tools like clickhouse_sinker or snuba will not support it.
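Not an answer to the materialized-view part, but a sketch of the "latest state only" logic in plain SQL, assuming updatedAt decides which version of a given sqlId is the most recent:
SELECT
    toStartOfDay(createdAt) AS day,
    sum(price) AS volume
FROM
(
    -- keep only the most recent version of every order
    SELECT
        sqlId,
        argMax(createdAt, updatedAt) AS createdAt,
        argMax(price, updatedAt) AS price
    FROM orders
    GROUP BY sqlId
)
GROUP BY day
ORDER BY day ASC;
With the three rows above this returns 100 for 2021-11-08 instead of 300.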

ClickHouse - split arrayMap into columns to sort on

I have a ClickHouse query question; I'm pretty new to ClickHouse, so maybe it's an easy one for the experts ;)! We have a single table with events in it, and each event is linked to a product, e.g. product_click, product_view. I want to extract the data grouped by product, but in a single row I need each event type in a separate column so I can sort on it.
I already wrote this query:
SELECT product_id,
arrayMap((x, y) -> (x, y),
(arrayReduce('sumMap', [(groupArrayArray([event_type]) as arr)],
[arrayResize(CAST([], 'Array(UInt64)'), length(arr), toUInt64(1))]) as s).1, s.2) events
FROM events
GROUP BY product_id
Result:
┌─────────────────────────product_id───┬─events─────────────────────────────────────────────────────────────────────────────────────┐
│ 0071f1e4-a484-448e-8355-64e2fea98fd5 │ [('PRODUCT_CLICK',1341),('PRODUCT_VIEW',11)] │
│ 406f4707-6bad-4d3f-9544-c74fdeb1e09d │ [('PRODUCT_CLICK',1),('PRODUCT_VIEW',122),('PRODUCT_BUY',37)] │
│ 94566b6d-6e23-4264-ad76-697ffcfe60c4 │ [('PRODUCT_CLICK',1027),('PRODUCT_VIEW',7)] │
...
Is there any way to convert the arrayMap result into columns with a sort key?
So we can filter on the most clicked products first, or the most viewed?
Another question: is it a good idea to always execute this kind of query, or should we create a MATERIALIZED VIEW for it?
Thanks!
SQL does not allow a variable number of columns. The only way is to enumerate the event types explicitly:
SELECT product_id,
countIf(event_type = 'PRODUCT_CLICK') PRODUCT_CLICK,
countIf(event_type = 'PRODUCT_VIEW') PRODUCT_VIEW,
countIf(event_type = 'PRODUCT_BUY') PRODUCT_BUY
FROM events
GROUP BY product_id
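Sorting then becomes ordinary column ordering, e.g. most-clicked products first:
SELECT product_id,
    countIf(event_type = 'PRODUCT_CLICK') PRODUCT_CLICK,
    countIf(event_type = 'PRODUCT_VIEW') PRODUCT_VIEW,
    countIf(event_type = 'PRODUCT_BUY') PRODUCT_BUY
FROM events
GROUP BY product_id
ORDER BY PRODUCT_CLICK DESC
LIMIT 10;
As for the second question: whether to wrap this in a materialized view depends on table size and query frequency; the plain GROUP BY may already be fast enough on its own.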

ClickHouse - join on string columns

I have a String column uin in several tables; how can I join these tables on uin efficiently?
In Vertica we use hash(uin) to transform the string column into a hash with an Int data type, which significantly boosts join efficiency. Could you recommend something like this? I tried CRC32(s) but it seems to work incorrectly.
At the moment ClickHouse does not cope very well with multi-join queries (star-schema databases), and the query optimizer is not yet good enough to rely on completely.
So you need to tell it explicitly how to 'execute' a query, by using subqueries instead of joins.
Let's emulate your query:
SELECT table_01.number AS r
FROM numbers(87654321) AS table_01
INNER JOIN numbers(7654321) AS table_02 ON (table_01.number = table_02.number)
INNER JOIN numbers(654321) AS table_03 ON (table_02.number = table_03.number)
INNER JOIN numbers(54321) AS table_04 ON (table_03.number = table_04.number)
ORDER BY r DESC
LIMIT 8;
/*
┌─────r─┐
│ 54320 │
│ 54319 │
│ 54318 │
│ 54317 │
│ 54316 │
│ 54315 │
│ 54314 │
│ 54313 │
└───────┘
8 rows in set. Elapsed: 4.244 sec. Processed 96.06 million rows, 768.52 MB (22.64 million rows/s., 181.10 MB/s.)
*/
On my PC it takes ~4 secs. Let's rewrite it using subqueries to significantly speed it up.
SELECT number AS r
FROM numbers(87654321)
WHERE number IN (
SELECT number
FROM numbers(7654321)
WHERE number IN (
SELECT number
FROM numbers(654321)
WHERE number IN (
SELECT number
FROM numbers(54321)
)
)
)
ORDER BY r DESC
LIMIT 8;
/*
┌─────r─┐
│ 54320 │
│ 54319 │
│ 54318 │
│ 54317 │
│ 54316 │
│ 54315 │
│ 54314 │
│ 54313 │
└───────┘
8 rows in set. Elapsed: 0.411 sec. Processed 96.06 million rows, 768.52 MB (233.50 million rows/s., 1.87 GB/s.)
*/
There are other ways to optimize a JOIN (a sketch of the last two items follows this list):
use an external dictionary to get rid of the join on a 'small' table
use the Join table engine
use ANY strictness
use specific settings such as join_algorithm, partial_merge_join_optimizations, etc.
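A minimal sketch of the last two items; fact, dim and name are placeholder identifiers, uin is the String key from the question:
-- pick a join algorithm explicitly and use ANY strictness
SET join_algorithm = 'partial_merge';

SELECT f.uin, d.name
FROM fact AS f
ANY LEFT JOIN dim AS d ON f.uin = d.uin;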
Some useful refs:
Altinity webinar: Tips and tricks every ClickHouse user should know
Altinity webinar: Secrets of ClickHouse Query Performance
Answer update:
To reduce storage consumption for a String column, consider changing the column type to LowCardinality, which significantly decreases the size of a column with many duplicated values.
Use this query to get the size of columns:
SELECT
name AS column_name,
formatReadableSize(data_compressed_bytes) AS data_size,
formatReadableSize(marks_bytes) AS index_size,
type,
compression_codec
FROM system.columns
WHERE database = 'db_name' AND table = 'table_name'
ORDER BY data_compressed_bytes DESC
To get a numeric representation of a string, use one of the hash functions:
SELECT 'jsfhuhsdf', xxHash32('jsfhuhsdf'), cityHash64('jsfhuhsdf');
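Two sketches tying this back to the original question; t1 and t2 are placeholder table names:
-- store the string key more compactly
ALTER TABLE t1 MODIFY COLUMN uin LowCardinality(String);

-- or join on a 64-bit hash of the key, similar to Vertica's hash(uin) trick
-- (cityHash64 is not collision-free, so treat this as an approximation)
SELECT a.uin
FROM
(
    SELECT uin, cityHash64(uin) AS h FROM t1
) AS a
INNER JOIN
(
    SELECT uin, cityHash64(uin) AS h FROM t2
) AS b ON a.h = b.h;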

Migrating virtual columns from Oracle to Postgres

What options does one have to deal with virtual columns when migrating from Oracle 11 to Postgres 9.5, without having to change database-related code in the application (which means functions and views are out of the picture, and triggers are way too expensive when dealing with large data sets)?
A similar question exists: Computed / calculated columns in PostgreSQL, but the solutions do not help with the migration scenario.
If you use a BEFORE INSERT trigger, you can modify the values before they are actually written. That shouldn't be very expensive. If cutting-edge performance is required, write the trigger function in C.
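A minimal sketch of that trigger approach, assuming a hypothetical table data_with_product that physically stores the computed column:
CREATE TABLE data_with_product (
    id      integer PRIMARY KEY,
    factor1 integer NOT NULL,
    factor2 integer NOT NULL,
    product integer            -- filled by the trigger, not by the application
);

CREATE FUNCTION fill_product() RETURNS trigger AS
$$
BEGIN
    NEW.product := NEW.factor1 * NEW.factor2;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER fill_product_trg
    BEFORE INSERT OR UPDATE ON data_with_product
    FOR EACH ROW EXECUTE PROCEDURE fill_product();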
But I think that a view is the best solution. You can use an updatable view; that way you wouldn't have to change the application code:
CREATE TABLE data(
id integer PRIMARY KEY,
factor1 integer NOT NULL,
factor2 integer NOT NULL
);
CREATE VIEW interface AS
SELECT id, factor1, factor2,
factor1 * factor2 AS product
FROM data;
test=> INSERT INTO interface VALUES (1, 6, 7), (2, 3, 14);
INSERT 0 2
test=> UPDATE interface SET factor1 = 7 WHERE id = 1;
UPDATE 1
test=> DELETE FROM interface WHERE id = 1;
DELETE 1
test=> SELECT * FROM interface;
┌────┬─────────┬─────────┬─────────┐
│ id │ factor1 │ factor2 │ product │
├────┼─────────┼─────────┼─────────┤
│ 2 │ 3 │ 14 │ 42 │
└────┴─────────┴─────────┴─────────┘
(1 row)
