ClickHouse MATERIALIZED VIEW issues

I created a MATERIALIZED VIEW like this.
Create the target table:
CREATE TABLE user_deatils_daily (
day date,
hour UInt8,
appid UInt32,
isp String,
city String,
country String,
session_count UInt64,
avg_score AggregateFunction(avg, Float32),
min_revenue AggregateFunction(min, Float32),
max_load_time AggregateFunction(max, Int32)
)
ENGINE = SummingMergeTree()
PARTITION BY toRelativeWeekNum(day)
ORDER BY (day,hour)
Create the MV:
CREATE MATERIALIZED VIEW user_deatils_daily_mv
TO user_deatils_daily AS
SELECT
    toDate(session_ts) AS day,
    toHour(toDateTime(session_ts)) AS hour,
    appid, isp, city, country,
    count(session_uuid) AS session_count,
    avgState() AS avg_score,
    minState(revenue) AS min_revenue,
    maxState(perf_page_load_time) AS max_load_time
FROM user_deatils
WHERE toDate(session_ts) >= '2020-08-26'
GROUP BY session_ts, appid, isp, city, country
The target table starts to fill with data, but after some time it only keeps the new data and doesn't save the old data.
Why is that?

SummingMergeTree() PARTITION BY toRelativeWeekNum(day) ORDER BY (day, hour)
means: calculate sums grouped by (toRelativeWeekNum(day), day, hour).
user_deatils_daily knows nothing about user_deatils_daily_mv. They are not related; user_deatils_daily_mv just does inserts into user_deatils_daily.
SummingMergeTree knows nothing about GROUP BY session_ts, appid, isp, city, country.
I would expect to see ORDER BY (ts, appid, isp, city, country).
I would do:
CREATE TABLE user_details_daily (
    ts DateTime,
    appid UInt32,
    isp String,
    city String,
    country String,
    session_count SimpleAggregateFunction(sum, UInt64),
    avg_score AggregateFunction(avg, Float32),
    min_revenue SimpleAggregateFunction(min, Float32),
    max_load_time SimpleAggregateFunction(max, Int32)
)
ENGINE = AggregatingMergeTree()
PARTITION BY toStartOfWeek(ts)
ORDER BY (ts, appid, isp, city, country);
CREATE MATERIALIZED VIEW user_deatils_daily_mv TO user_details_daily AS
SELECT
    toStartOfHour(toDateTime(session_ts)) AS ts,
    appid,
    isp,
    city,
    country,
    count(session_uuid) AS session_count,
    avgState() AS avg_score,
    min(revenue) AS min_revenue,
    max(perf_page_load_time) AS max_load_time
FROM user_details
WHERE toDate(session_ts) >= '2020-08-26'
GROUP BY ts, appid, isp, city, country;
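Since avg_score in the target table is an AggregateFunction column, reads should use the -Merge combinator. A minimal query sketch against the table above (the grouping columns here are just an example):
SELECT
    ts,
    appid,
    sum(session_count) AS session_count,
    avgMerge(avg_score) AS avg_score,
    min(min_revenue) AS min_revenue,
    max(max_load_time) AS max_load_time
FROM user_details_daily
GROUP BY ts, appid;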

Related

Materialized view works for a few days and then stops

I have these three tables (I cleaned them up):
CREATE TABLE Record (
    `visitId` String,
    `visitorId` String,
    `pageUrl` LowCardinality(String),
    `createdAtDay` Date DEFAULT now()
) ENGINE = MergeTree
PARTITION BY toYYYYMM(createdAtDay)
PRIMARY KEY (visitorId, visitId, pageUrl, createdAtDay)
ORDER BY (visitorId, visitId, pageUrl)
CREATE MATERIALIZED VIEW DurationPerPage (
    `visits` Int64 CODEC(DoubleDelta, LZ4),
    `pageUrl` LowCardinality(String),
    `visitors` Int64 CODEC(DoubleDelta, LZ4),
    `duration` Int64 CODEC(DoubleDelta, LZ4),
    `createdAtDay` Date
) ENGINE = SummingMergeTree((visits, visitors, duration))
ORDER BY (createdAtDay, pageUrl) AS
SELECT
    countDistinct(visitId) AS visits,
    cutQueryStringAndFragment(pageUrl) AS pageUrl,
    countDistinct(visitorId) AS visitors,
    sum(e.value) AS duration,
    createdAtDay
FROM Record AS r
LEFT JOIN Events AS e ON (r.visitId = e.visitId) AND (e.eventType = 6)
WHERE pageType LIKE '%single%'
GROUP BY (createdAtDay, pageUrl);
CREATE TABLE Events (
    `visitId` String,
    `visitorId` String,
    `value` Int64 CODEC(DoubleDelta, LZ4),
    `eventType` Int16 CODEC(DoubleDelta, LZ4),
    `createdAtDay` Date -- implied by the PARTITION BY / PRIMARY KEY below; presumably removed while cleaning the DDL
) ENGINE = MergeTree
PARTITION BY (toYYYYMM(createdAtDay), eventType)
PRIMARY KEY (visitId, eventType, createdAtDay)
ORDER BY (visitId, eventType, createdAtDay)
As you can see, I'm using both the Record and Events tables to feed my materialized view. It works well for a few days, then it stops and starts saving weird data (mostly zeros in the duration field), and I then have to delete and recreate it.
Is there a related bug, or is something wrong with the view?

ClickHouse: materialized view is not optimized in time to merge the partitions

I created a table and two chained materialized views.
Table:
CREATE TABLE `log_details` (
    date String,
    event_time DateTime,
    username String,
    city String
)
ENGINE = MergeTree()
ORDER BY (date, event_time)
PARTITION BY date
TTL event_time + INTERVAL 1 MONTH
Materialized views:
CREATE MATERIALIZED VIEW `log_u_c_day_mv`
ENGINE = SummingMergeTree()
PARTITION BY date
ORDER BY (date, username, city)
AS
SELECT date, username, city, count() as times
FROM `log_details`
GROUP BY date, username, city
CREATE MATERIALIZED VIEW `log_u_day_mv`
ENGINE = SummingMergeTree()
PARTITION BY date
ORDER BY (date, username)
AS
SELECT date, username, SUM(times) as total_times
FROM `.inner.log_u_c_day_mv`
GROUP BY date, username
Insert into log_details → Insert into log_u_c_day_mv → Insert into log_u_day_mv.
log_u_day_mv is not optimized 15 minutes after inserting into log_u_c_day_mv, or even after more than a day.
I tried optimizing log_u_day_mv manually, and it works:
OPTIMIZE TABLE `.inner.log_u_day_mv` PARTITION 20210110
But ClickHouse does not optimize it in a timely manner.
How can I solve this?
Data in an MT (MergeTree) table is never guaranteed to be fully aggregated/collapsed.
If you do OPTIMIZE ... FINAL, the next insert creates a new part again.
CH does not merge parts on a schedule. The merge scheduler selects parts by its own algorithm, based on the current node workload, the number of parts, and the size of parts.
A SummingMergeTree table MUST ALWAYS BE QUERIED with sum() / GROUP BY:
select sum(total_times), username
from log_u_day_mv
group by username
DO NOT USE from log_u_day_mv FINAL: it reads excessive columns!
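The same rule applies to the first-level view; a minimal sketch for log_u_c_day_mv, based on its definition above:
select date, username, city, sum(times) as times
from log_u_c_day_mv
group by date, username, city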

How to pass values to object columns from one table to another?

I have my main table like this:
create table final(
Province varchar2(50),
Country varchar2(100),
Latitude Number(10,0),
Longitude Number(10,0),
Cdate varchar2(20),
Confirmed int,
killed int,
Recover int
)
Then I created a table with a nested table, like this:
create type virus_Statistic_t as object(
vDate varchar2(20),
infection int,
dead int,
recovered int
)
/
create type virus_Statistic_tlb as table of virus_Statistic_t
/
create type countries_t as object(
Province_or_State varchar2(50),
Country_or_Region varchar2(100),
Lat Number(10,0),
Longt Number(10,0),
virus virus_Statistic_tlb
)
/
create table countries of countries_t (
Lat not null,
Longt not null
) nested table virus store as virus_ntb;
Now I am trying to pass all column values from final to the countries table.
This is what I have tried:
INSERT INTO countries(Province_or_State, Country_or_Region, Lat, Longt, vDate, infection, dead, recovered)
SELECT Province, Country, Latitude, Longitude, Cdate, Confirmed, killed, Recover
FROM final
/
It gives this error
ERROR at line 1:
ORA-00904: "RECOVERED": invalid identifier
How can I pass all values from final to countries table?
You need to use type constructors. The correct syntax is this:
INSERT INTO countries(Province_or_State, Country_or_Region, Lat, Longt, virus)
SELECT Province, Country, Latitude, Longitude,
virus_Statistic_tlb (virus_Statistic_t(Cdate, Confirmed, killed, Recover))
FROM final
/
Though note that this only inserts one virus row per country; is that what you meant? To insert multiple virus rows per country, do this:
INSERT INTO countries(Province_or_State, Country_or_Region, Lat, Longt, virus)
SELECT Province, Country, Latitude, Longitude,
CAST(MULTISET(SELECT virus_Statistic_t(Cdate, Confirmed, killed, Recover)
FROM final f2
WHERE f2.Province = f1.Province
AND ...etc.
) AS virus_Statistic_tlb
)
FROM final f1
GROUP BY Province, Country, Latitude, Longitude;
Opinion
I can never respond to a question about using nested tables without saying that the correct way to use them is not at all! In a real database you would have a separate virus_statistics table with a foreign key to the countries table. I realise you are probably doing this for educational purposes, but you should also be aware that no one should ever use nested tables in real life. No doubt you'll soon realise why when you try to use this data :-)
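For illustration, a minimal sketch of that relational design; the table and column names here are hypothetical, not part of the original schema:
create table countries_rel (
    country_id number primary key,
    province varchar2(50),
    country varchar2(100),
    lat number(10,0) not null,
    longt number(10,0) not null
);
create table virus_statistics (
    country_id number not null references countries_rel (country_id),
    vdate varchar2(20),
    infection int,
    dead int,
    recovered int
);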

Hive select from table as complex type

Consider a base table employee and a table derived from it, employee_salary_period, which contains a complex map datatype. How do I select and insert data from employee into employee_salary_period, where salary_period_map is a key-value pair, i.e. salary: period?
CREATE TABLE employee(
emp_id bigint,
name string,
address string,
salary double,
period string,
position string
)
PARTITIONED BY (
dept_id bigint)
STORED AS PARQUET
CREATE TABLE employee_salary_period(
emp_id bigint,
name string,
salary string,
period string,
salary_period_map Map<String,String>
)
PARTITIONED BY (
dept_id bigint)
STORED AS PARQUET
I'm stuck trying to figure out how to select the data as salary_period_map.
Consider using the str_to_map function provided by Hive. I hope you have only one key (salary) in your map:
select
emp_id,
name,
salary,
period,
str_to_map(concat(salary, ":", period), '&', ':') as salary_period_map
from employee
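As a quick illustration with made-up values: for salary = '50000' and period = 'monthly', concat produces '50000:monthly', and str_to_map splits it into a one-entry map:
select str_to_map(concat('50000', ':', 'monthly'), '&', ':');
-- returns {"50000":"monthly"}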

Hive partitions on tables

When we partition a table, the columns on which the table is partitioned are not listed in the CREATE statement's column list; they are specified separately in PARTITIONED BY. What is the reason behind this?
CREATE TABLE REGISTRATION_DATA (
userid BIGINT,
First_Name STRING,
Last_Name STRING,
address1 STRING,
address2 STRING,
city STRING,
zip_code STRING,
state STRING
)
PARTITIONED BY (
REGION STRING,
COUNTRY STRING
)
A partition column in Hive acts as a pseudocolumn: we can query on it directly without declaring it in the table's column list.
If we also include a partition column in the table's own column list (the CREATE query), we get an error like 'Error in semantic analysis: Columns repeated in partitioning columns'.
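For illustration (with made-up filter values), the partition columns can then be queried just like ordinary columns:
SELECT userid, city
FROM REGISTRATION_DATA
WHERE country = 'US'
  AND region = 'WEST';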
