ClickHouse - cross table TTL expressions

Is it possible to define the TTL for a table in ClickHouse so that it references another table? Let's say I have a chat application, and in my database I have two tables: chats and chat_messages. Chats have start and stop time information, and I want to delete old chats along with their messages entirely when they expire, based on the chat's stop_time. I tried to create those tables in the following way:
db43af298bb9 :) CREATE TABLE chats (id Int64, start_time DateTime, stop_time DateTime) ENGINE = MergeTree() ORDER BY (start_time, id) TTL stop_time + INTERVAL 1 MONTH;
CREATE TABLE chats
(
`id` Int64,
`start_time` DateTime,
`stop_time` DateTime
)
ENGINE = MergeTree()
ORDER BY (start_time, id)
TTL stop_time + toIntervalMonth(1)
Ok.
0 rows in set. Elapsed: 0.014 sec.
db43af298bb9 :) CREATE TABLE chat_messages (id Int64, text String, chat_id Int64) ENGINE = MergeTree() ORDER BY id TTL (SELECT stop_time from chats where chats.id = chat_id) + INTERVAL 1 MONTH;
CREATE TABLE chat_messages
(
`id` Int64,
`text` String,
`chat_id` Int64
)
ENGINE = MergeTree()
ORDER BY id
TTL
(
SELECT stop_time
FROM chats
WHERE chats.id = chat_id
) + toIntervalMonth(1)
Received exception from server (version 19.16.10):
Code: 47. DB::Exception: Received from localhost:9000. DB::Exception: Missing columns: 'chat_id' while processing query: 'SELECT stop_time FROM chats WHERE id = chat_id', required columns: 'id' 'chat_id' 'stop_time', source columns: 'stop_time' 'id' 'start_time'.
0 rows in set. Elapsed: 0.017 sec.
The TTL definition for the second table fails because it tries to find the 'chat_id' column in the 'chats' table instead of the source 'chat_messages' table. Is what I'm trying to achieve even possible, or am I forced to use the ALTER ... DELETE mechanism instead?
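A table TTL expression is evaluated against the columns of the table it is declared on, so the two workarounds hinted at in the question can be sketched roughly as below. This is only an illustration, not a confirmed answer. The `chat_stop_time` column is hypothetical and assumes the stop time is already known when a message row is written. The second variant assumes a plain (non-replicated) MergeTree table and a server version that accepts a subquery in a mutation's WHERE clause.

```sql
-- Variant 1: copy the chat's stop time into every message row, so the TTL
-- only references columns of chat_messages itself.
CREATE TABLE chat_messages
(
    id Int64,
    text String,
    chat_id Int64,
    chat_stop_time DateTime  -- hypothetical denormalized copy of chats.stop_time
)
ENGINE = MergeTree()
ORDER BY id
TTL chat_stop_time + INTERVAL 1 MONTH;

-- Variant 2: keep chat_messages without a TTL and run a mutation periodically.
-- Chat rows already removed by the TTL on chats will not be matched here,
-- so the mutation has to run before that happens (or chats needs a longer TTL).
ALTER TABLE chat_messages
    DELETE WHERE chat_id IN
    (
        SELECT id
        FROM chats
        WHERE stop_time + INTERVAL 1 MONTH < now()
    );
```

Mutations rewrite whole data parts, so the second variant is better suited to an occasional scheduled cleanup than to frequent calls.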

Related

Materialized view works for a few days and then stops

I have these three tables (I cleaned them up):
CREATE TABLE Record (
`visitId` String,
`visitorId` String,
`pageUrl` LowCardinality(String),
`createdAtDay` Date DEFAULT now()
) ENGINE = MergeTree PARTITION BY toYYYYMM(createdAtDay) PRIMARY KEY (
visitorId,
visitId,
pageUrl,
createdAtDay
)
ORDER BY
(visitorId, visitId, pageUrl)
CREATE MATERIALIZED VIEW DurationPerPage (
`visits` Int64 CODEC(DoubleDelta, LZ4),
`pageUrl` LowCardinality(String),
`visitors` Int64 CODEC(DoubleDelta, LZ4),
`duration` Int64 CODEC(DoubleDelta, LZ4),
`createdAtDay` Date,
) ENGINE = SummingMergeTree((visits, visitors, duration))
ORDER BY
(createdAtDay, pageUrl) AS
SELECT
countDistinct(visitId) AS visits,
cutQueryStringAndFragment(pageUrl) AS pageUrl,
countDistinct(visitorId) AS visitors,
sum(e.value) AS duration,
createdAtDay
FROM
Record AS r
LEFT JOIN Events AS e ON (r.visitId = e.visitId)
AND (e.eventType = 6)
WHERE
pageType LIKE '%single%'
GROUP BY
(createdAtDay, pageUrl);
CREATE TABLE Events (
`visitId` String,
`visitorId` String,
`value` Int64 CODEC(DoubleDelta, LZ4),
`eventType` Int16 CODEC(DoubleDelta, LZ4)
) ENGINE = MergeTree PARTITION BY (toYYYYMM(createdAtDay), eventType) PRIMARY KEY (visitId, eventType, createdAtDay)
ORDER BY
(visitId, eventType, createdAtDay)
As you can see, I'm using both the Record and Events tables to feed my materialized view. It works well for a few days, then it stops and starts saving weird data (mostly zeros in the duration field), and I then have to delete and recreate it.
Is there a known bug related to this, or is something wrong with the view?

clickhouse create table Exception: Aggregate function minState(origin_user) is found in wrong place in query

CREATE TABLE user_dwd.user_tag_bitmap_local
(
`tag` String,
`tag_item` String,
`p_day` Date,
`origin_user` UInt64,
`users` AggregateFunction(min, UInt64) MATERIALIZED minState(origin_user)
)
ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMMDD(p_day)
ORDER BY (tag, tag_item)
SETTINGS index_granularity = 8192;
When I run this SQL to create the table, I get the following error:
[2021-10-17 12:05:28] Code: 184, e.displayText() = DB::Exception: Aggregate function minState(origin_user) is found in wrong place in query: While processing minState(origin_user) AS users_tmp_alter9508717652815860223: default expression and column type are incompatible. (version 21.8.4.51 (official build))
How can I solve this error?
minState is an aggregate function; you cannot use it like this (it is meant for queries with a GROUP BY clause).
To solve it you can use MATERIALIZED initializeAggregation... or MATERIALIZED arrayReduce(minState...
But actually you don't need the second column.
You are looking for SimpleAggregateFunction:
https://clickhouse.com/docs/en/sql-reference/data-types/simpleaggregatefunction/
CREATE TABLE user_dwd.user_tag_bitmap_local
(
`tag` String,
`tag_item` String,
`p_day` Date,
`origin_user` SimpleAggregateFunction(min, UInt64) ---<<<-----
)
ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMMDD(p_day)
ORDER BY (tag, tag_item)
SETTINGS index_granularity = 8192;
https://clickhouse.com/docs/en/sql-reference/functions/other-functions/#initializeaggregation
CREATE TABLE user_tag_bitmap_local
(
`tag` String,
`tag_item` String,
`p_day` Date,
`origin_user` UInt64,
`users` AggregateFunction(min, UInt64) MATERIALIZED initializeAggregation('minState', origin_user)
)
ENGINE = AggregatingMergeTree
PARTITION BY toYYYYMMDD(p_day)
ORDER BY (tag, tag_item)
SETTINGS index_granularity = 8192
https://clickhouse.com/docs/en/sql-reference/functions/array-functions/#arrayreduce
CREATE TABLE user_tag_bitmap_local
(
`tag` String,
`tag_item` String,
`p_day` Date,
`origin_user` UInt64,
`users` AggregateFunction(min, UInt64) MATERIALIZED arrayReduce('minState', [origin_user])
)
ENGINE = AggregatingMergeTree
PARTITION BY toYYYYMMDD(p_day)
ORDER BY (tag, tag_item)
SETTINGS index_granularity = 8192
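For completeness, reading the value back differs between the two approaches. The queries below are a sketch against the table definitions above; the alias `min_user` is only illustrative. Rows with the same sorting key are combined only when parts merge, and never across partitions (here one partition per p_day), so the query still has to aggregate.

```sql
-- SimpleAggregateFunction(min, UInt64): the column stores a plain UInt64,
-- so the ordinary aggregate function is applied again at query time.
SELECT tag, tag_item, min(origin_user) AS min_user
FROM user_dwd.user_tag_bitmap_local
GROUP BY tag, tag_item;

-- AggregateFunction(min, UInt64): the column stores intermediate states,
-- which are combined with the -Merge combinator.
-- A MATERIALIZED column is not returned by SELECT *, so it is named explicitly.
SELECT tag, tag_item, minMerge(users) AS min_user
FROM user_tag_bitmap_local
GROUP BY tag, tag_item;
```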

clickhouse MATERIALIZED VIEW issues

I created a MATERIALIZED VIEW like this:
Create the target table:
CREATE TABLE user_deatils_daily (
day date,
hour UInt8 ,
appid UInt32,
isp String,
city String,
country String,
session_count UInt64,
avg_score AggregateFunction(avg, Float32),
min_revenue AggregateFunction(min, Float32),
max_load_time AggregateFunction(max, Int32)
)
ENGINE = SummingMergeTree()
PARTITION BY toRelativeWeekNum(day)
ORDER BY (day,hour)
Create the MV:
CREATE MATERIALIZED VIEW user_deatils_daily_mv
TO user_deatils_daily as
select toDate(session_ts) as day, toHour(toDateTime(session_ts)) as hour,appid,isp,city,country,
count(session_uuid) as session_count,avgState() as avg_score,
minState(revenue) as min_revenue,
maxState(perf_page_load_time) as max_load_time
from user_deatils where toDate(session_ts)>='2020-08-26' group by session_ts,appid,isp,city,country
The target table starts to fill with data.
After some time the target table contains only new data and the old data is no longer there.
Why is that?
SummingMergeTree() PARTITION BY toRelativeWeekNum(day) ORDER BY (day,hour)
means: calculate sums grouped by (toRelativeWeekNum(day), day, hour).
user_deatils_daily knows nothing about user_deatils_daily_mv. They are not related.
user_deatils_daily_mv just does inserts into user_deatils_daily.
SummingMergeTree knows nothing about group by session_ts,appid,isp,city,country, so during merges rows that share the same (day, hour) are collapsed into one row regardless of appid, isp, city and country.
I would expect to see ORDER BY (ts,appid,isp,city,country);
I would do:
CREATE TABLE user_details_daily
( ts DateTime,
appid UInt32,
isp String,
city String,
country String,
session_count SimpleAggregateFunction(sum,UInt64),
avg_score AggregateFunction(avg, Float32),
min_revenue SimpleAggregateFunction(min, Float32),
max_load_time SimpleAggregateFunction(max, Int32) )
ENGINE = AggregatingMergeTree()
PARTITION BY toStartOfWeek(ts)
ORDER BY (ts,appid,isp,city,country);
CREATE MATERIALIZED VIEW user_deatils_daily_mv TO user_details_daily
as select
toStartOfHour(toDateTime(session_ts)) ts,
appid,
isp,
city,
country,
count(session_uuid) as session_count ,
avgState() as avg_score,
min(revenue) as min_revenue,
max(perf_page_load_time) as max_load_time
from user_details
where toDate(session_ts)>='2020-08-26' group by ts,appid,isp,city,country;
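Because merges happen in the background at unpredictable times, the target table can hold several not-yet-merged rows for the same key at any moment. A read against the table suggested above would therefore usually aggregate again. The query below is a sketch that assumes the `user_details_daily` definition from this answer; the output aliases are illustrative.

```sql
SELECT
    ts,
    appid,
    isp,
    city,
    country,
    sum(session_count)  AS total_sessions,    -- SimpleAggregateFunction(sum, ...)
    avgMerge(avg_score) AS average_score,     -- AggregateFunction(avg, ...) state
    min(min_revenue)    AS lowest_revenue,    -- SimpleAggregateFunction(min, ...)
    max(max_load_time)  AS longest_load_time  -- SimpleAggregateFunction(max, ...)
FROM user_details_daily
GROUP BY ts, appid, isp, city, country
ORDER BY ts;
```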

How to decide the partition key for clickhouse

I want to know the best practice for choosing the partition key.
In my project, we have a table with event_date, app_id and other columns. The number of app_id values will keep growing and could reach the thousands.
The select query filters on event_date and app_id.
A simplified schema is shown below:
CREATE TABLE test.test_custom_partition (
company_id UInt64,
app_id String,
event_date DateTime,
event_name String ) ENGINE MergeTree()
PARTITION BY (toYYYYMMDD(event_date), app_id)
ORDER BY (app_id, company_id, event_date)
SETTINGS index_granularity = 8192;
The select query looks like this:
select event_name from test_custom_partition
where event_date >= '2020-07-01 00:00:00' AND event_date <= '2020-07-15 00:00:00'
AND app_id = 'test';
I want to use (toYYYYMMDD(event_date), app_id) as the partition key so that the query reads as few data parts as possible. But this could produce more than 1000 partitions, and in the documentation I see:
A merge only works for data parts that have the same value for the
partitioning expression. This means you shouldn't make overly granular
partitions (more than about a thousand partitions). Otherwise, the
SELECT query performs poorly because of an unreasonably large number
of files in the file system and open file descriptors.
Or should I use only toYYYYMMDD(event_date) as the partition key?
Also, could anyone explain why there shouldn't be more than about 1000 partitions? Even if a query only touches a small set of the data parts, can it still cause performance issues?
Thanks
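One way to weigh the options is to keep app_id out of the partition key and rely on the sorting key to narrow the read for `app_id = 'test'`. The sketch below is an assumption rather than a confirmed recommendation: monthly partitioning is just one coarser choice, and the table name `test_custom_partition_by_month` is hypothetical. The second query counts how many partitions and active parts an existing table has, using the `system.parts` table.

```sql
-- Coarser partitions; app_id leads the sorting key, so the primary index
-- can skip granules that belong to other apps inside each partition.
CREATE TABLE test.test_custom_partition_by_month
(
    company_id UInt64,
    app_id String,
    event_date DateTime,
    event_name String
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (app_id, event_date, company_id)
SETTINGS index_granularity = 8192;

-- Check how granular the partitioning of the existing table has become.
SELECT
    uniqExact(partition) AS partitions,
    count() AS active_parts
FROM system.parts
WHERE database = 'test'
  AND table = 'test_custom_partition'
  AND active;
```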

Is there a way to sum two columns into another column in Hive HQL?

I'm looking to get a running daily, weekly, and monthly sum of the number of messages I sent out. There are about 500 different message types.
I have the following tables:
Table name: messages
int message_type
BIGINT num_sent
string date
Table name: stats
int message_type
BIGINT num_sent_today
BIGINT num_sent_week
BIGINT num_sent_month
Table messages is updated daily with new rows for today's date. Is there a single hive query I can run daily to update the stats table? Note I can't get the running counts by querying the messages table directly using WHERE date >= 30 days ago because the table is too big. I have to add/subtract daily values from table stats instead. Something like this:
// pseudocode
// Get this table (call it table b) from table messages
int message_type
BIGINT num_sent_today
BIGINT num_sent_seven_days_ago
BIGINT num_sent_thirty_days_ago
// join b with table stats so that I can
// Set stats.num_sent_today = b.num_sent_today
// Set stats.num_sent_week = stats.num_sent_week + b.num_sent_today - b.num_sent_seven_days_ago
// Set stats.num_sent_month = stats.num_sent_month + b.num_sent_today - b.num_sent_thirty_days_ago
It looks like I can just add the columns directly with +.
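The pseudocode above could be written as a single daily HiveQL statement along these lines. It is a sketch under several assumptions: the `date` column holds 'yyyy-MM-dd' strings, stats already contains one row per message_type, the Hive version provides `current_date` and `date_sub`, and the table can be overwritten while it is being read (otherwise the result would go to a staging table first).

```sql
INSERT OVERWRITE TABLE stats
SELECT
    s.message_type,
    COALESCE(b.num_sent_today, 0) AS num_sent_today,
    -- slide the 7-day window: add today, drop the day that fell out
    s.num_sent_week + COALESCE(b.num_sent_today, 0)
        - COALESCE(b.num_sent_seven_days_ago, 0) AS num_sent_week,
    -- slide the 30-day window the same way
    s.num_sent_month + COALESCE(b.num_sent_today, 0)
        - COALESCE(b.num_sent_thirty_days_ago, 0) AS num_sent_month
FROM stats s
LEFT OUTER JOIN (
    -- "table b" from the pseudocode: one row per message_type
    SELECT
        message_type,
        SUM(CASE WHEN `date` = CAST(current_date AS STRING)
                 THEN num_sent ELSE 0 END) AS num_sent_today,
        SUM(CASE WHEN `date` = CAST(date_sub(current_date, 7) AS STRING)
                 THEN num_sent ELSE 0 END) AS num_sent_seven_days_ago,
        SUM(CASE WHEN `date` = CAST(date_sub(current_date, 30) AS STRING)
                 THEN num_sent ELSE 0 END) AS num_sent_thirty_days_ago
    FROM messages
    WHERE `date` IN (
        CAST(current_date AS STRING),
        CAST(date_sub(current_date, 7) AS STRING),
        CAST(date_sub(current_date, 30) AS STRING)
    )
    GROUP BY message_type
) b
ON s.message_type = b.message_type;
```

The inner query only needs three days of data, but whether Hive can avoid scanning the whole messages table depends on how that table is partitioned, which the question does not say.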
