AggregatingMergeTree not aggregating inserts properly - clickhouse

I have a table that aggregates the number of sales across various products by minute/hour/day and computes various metrics.
The table below holds 1-minute increment calculations computed from core_product_tbl. Once the computations are in product_agg_tbl, other tables compute hourly, daily, weekly, etc. aggregates from product_agg_tbl.
CREATE TABLE product_agg_tbl (
product String,
minute DateTime,
high Nullable(Float32),
low Nullable(Float32),
average AggregateFunction(avg, Nullable(Float32)),
first Nullable(Float32),
last Nullable(Float32),
total_sales Nullable(UInt64)
)
ENGINE = AggregatingMergeTree
PARTITION BY toYYYYMM(minute)
ORDER BY (product, minute);
CREATE MATERIALIZED VIEW product_agg_mv TO product_agg_tbl AS
SELECT
product,
minute,
max(price) AS high,
min(price) AS low,
avgState(price) AS average,
argMin(price, sales_timestamp) AS first,
argMax(price, sales_timestamp) AS last,
sum(batch_size) as total_sales
FROM core_product_tbl
WHERE minute >= today()
GROUP BY product, toStartOfMinute(sales_timestamp) AS minute;
CREATE VIEW product_agg_1w AS
SELECT
product,
toStartOfHour(minute) AS minute,
max(high) AS high,
min(low) AS low,
avgMerge(average) AS average_price,
argMin(first, minute) AS first,
argMax(last, minute) AS last,
sum(total_sales) as total_sales
FROM product_agg_tbl
WHERE minute >= date_sub(today(), interval 7 + 7 day)
GROUP BY product, minute;
The issue is that when I run the query below directly against core_product_tbl, I get very different numbers than from product_agg_1w. What could be going on?
SELECT
product,
toStartOfHour(minute) AS minute,
max(price) AS high,
min(price) AS low,
avgState(price) AS average,
argMin(price, sales_timestamp) AS first,
argMax(price, sales_timestamp) AS last,
sum(batch_size) as total_sales
FROM core_product_tbl
WHERE minute >= today()
GROUP BY product, toStartOfMinute(sales_timestamp) AS minute;

You should use SimpleAggregateFunction or AggregateFunction column types in the AggregatingMergeTree table.
AggregatingMergeTree knows nothing about the materialized view or about the SELECT inside it. https://den-crane.github.io/Everything_you_should_know_about_materialized_views_commented.pdf
CREATE TABLE product_agg_tbl (
product String,
minute DateTime,
high SimpleAggregateFunction(max, Nullable(Float32)),
low SimpleAggregateFunction(min, Nullable(Float32)),
average AggregateFunction(avg, Nullable(Float32)),
first AggregateFunction(argMin, Nullable(Float32), DateTime),
last AggregateFunction(argMax, Nullable(Float32),DateTime),
total_sales SimpleAggregateFunction(sum,Nullable(UInt64))
)
ENGINE = AggregatingMergeTree
PARTITION BY toYYYYMM(minute)
ORDER BY (product, minute);
CREATE MATERIALIZED VIEW product_agg_mv TO product_agg_tbl AS
SELECT
product,
minute,
max(price) AS high,
min(price) AS low,
avgState(price) AS average,
argMinState(price, sales_timestamp) AS first,
argMaxState(price, sales_timestamp) AS last,
sum(batch_size) as total_sales
FROM core_product_tbl
WHERE minute >= today()
GROUP BY product, toStartOfMinute(sales_timestamp) AS minute;
CREATE VIEW product_agg_1w AS
SELECT
product,
toStartOfHour(minute) AS minute,
max(high) AS high,
min(low) AS low,
avgMerge(average) AS average_price,
argMinMerge(first) AS first,
argMaxMerge(last) AS last,
sum(total_sales) as total_sales
FROM product_agg_tbl
WHERE minute >= date_sub(today(), interval 7 + 7 day)
GROUP BY product, minute;
Don't use the view (product_agg_1w); it's counterproductive for performance because it reads excessive data. Query product_agg_tbl directly instead.
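For example, a minimal sketch of such a direct query against product_agg_tbl, assuming the corrected schema above (the 7-day window and the output aliases are illustrative, not from the original question):
SELECT
    product,
    toStartOfHour(minute) AS hour,
    max(high) AS high_price,
    min(low) AS low_price,
    avgMerge(average) AS average_price,
    argMinMerge(first) AS first_price,
    argMaxMerge(last) AS last_price,
    sum(total_sales) AS total_sold
FROM product_agg_tbl
WHERE minute >= today() - 7 -- illustrative window; adjust as needed
GROUP BY product, hour
ORDER BY product, hour;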

Related

Can I query per hour increment of a accumulation column in clickhouse?

I want to save the event time and the total amount of generated electricity every 30 seconds. The total amount is not reset to zero each time; it's the running total since the meter first started, not the amount generated in those 30 seconds.
Is there any way to query daily, weekly or monthly aggregations on the total generated column (maybe not just sum or avg)?
Or should I design an AggregatingMergeTree table?
I don't need to keep every record, just the daily, weekly and monthly aggregations.
For example:
create table meter_record (
event_time Datetime,
generated_total Int64
)
UPDATE
Prefer SimpleAggregateFunction over AggregateFunction for the functions it supports (such as min, max and sum) to speed up aggregate calculation; functions like median and avg still require AggregateFunction.
Suppose you need to calculate median, average and dispersion aggregates for this table:
CREATE TABLE meter_record (
event_time Datetime,
generated_total Int64
)
ENGINE = MergeTree
PARTITION BY (toYYYYMM(event_time))
ORDER BY (event_time);
Use AggregatingMergeTree to calculate required aggregates:
CREATE MATERIALIZED VIEW meter_aggregates_mv
ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(day)
ORDER BY (day)
AS
SELECT
toDate(toStartOfDay(event_time)) AS day,
/* aggregates to calculate the day's section left and right endpoints */
minState(generated_total) min_generated_total,
maxState(generated_total) max_generated_total,
/* specific aggregates */
medianState(generated_total) AS totalMedian,
avgState(generated_total) AS totalAvg,
varPopState(generated_total) AS totalDispersion
/* ... */
FROM meter_record
GROUP BY day;
To get the required daily / weekly / monthly (and any day-based aggregation such as quarterly or yearly) aggregates, use these queries:
/* daily report */
SELECT
day,
minMerge(min_generated_total) min_generated_total,
maxMerge(max_generated_total) max_generated_total,
medianMerge(totalMedian) AS totalMedian,
avgMerge(totalAvg) AS totalAvg,
varPopMerge(totalDispersion) AS totalDispersion
FROM meter_aggregates_mv
/*WHERE day >= '2019-02-05' and day < '2019-07-01'*/
GROUP BY day;
/* weekly report */
SELECT
toStartOfWeek(day, 1) monday,
minMerge(min_generated_total) min_generated_total,
maxMerge(max_generated_total) max_generated_total,
medianMerge(totalMedian) AS totalMedian,
avgMerge(totalAvg) AS totalAvg,
varPopMerge(totalDispersion) AS totalDispersion
FROM meter_aggregates_mv
/*WHERE day >= '2019-02-05' and day < '2019-07-01'*/
GROUP BY monday;
/* monthly report */
SELECT
toStartOfMonth(day) month,
minMerge(min_generated_total) min_generated_total,
maxMerge(max_generated_total) max_generated_total,
medianMerge(totalMedian) AS totalMedian,
avgMerge(totalAvg) AS totalAvg,
varPopMerge(totalDispersion) AS totalDispersion
FROM meter_aggregates_mv
/*WHERE day >= '2019-02-05' and day < '2019-07-01'*/
GROUP BY month;
/* get daily / weekly / monthly reports in one query (thanks to Denis Zhuravlev for the advice) */
SELECT
day,
toStartOfWeek(day, 1) AS week,
toStartOfMonth(day) AS month,
minMerge(min_generated_total) min_generated_total,
maxMerge(max_generated_total) max_generated_total,
medianMerge(totalMedian) AS totalMedian,
avgMerge(totalAvg) AS totalAvg,
varPopMerge(totalDispersion) AS totalDispersion
FROM meter_aggregates_mv
/*WHERE (day >= '2019-05-01') AND (day < '2019-06-01')*/
GROUP BY month, week, day WITH ROLLUP
ORDER BY day, week, month;
Remarks:
you mention that you don't need the raw data, only the aggregates, so you can set the engine of the meter_record table to Null, manually clean meter_record (see DROP PARTITION), or define a TTL to do it automatically (see the sketch after these remarks)
that said, removing raw data is bad practice, because it makes it impossible to calculate new aggregates over historical data, restore existing aggregates, etc.
the materialized view meter_aggregates_mv will contain only data inserted into the table meter_record after the view is created; to change this behavior, use POPULATE in the view definition
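For example, a sketch of the raw table with such a TTL; the 30-day retention period is only an assumption for illustration:
CREATE TABLE meter_record (
    event_time DateTime,
    generated_total Int64
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_time)
TTL event_time + INTERVAL 30 DAY; -- raw rows expire after ~30 days, while the daily aggregates remain in meter_aggregates_mv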

Frequency Histogram in Clickhouse with unique and non unique data

I have an event table with created_at (DateTime), userid (String) and eventid (String) columns. Here userid can repeat, while eventid is always a unique UUID.
I am looking to build both unique and non-unique frequency histograms.
This is for both eventid and userid, based on three given inputs:
start_datetime
end_datetime
interval (1 min, 1 hr, 1 day, 7 days, 1 month)
Here, the number of buckets is decided by (end_datetime - start_datetime) / interval.
The output should come as start_datetime, end_datetime and frequency.
For any interval with no data, start_datetime and end_datetime should still be returned, with a frequency of 0.
How can I build a generic query for this?
I looked at the histogram function but could not find any documentation for it. While trying it, I could not understand the relation between its input and output.
count(distinct XXX) is deprecated.
uniq(XXX) or uniqExact(XXX) are more useful.
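For example, a sketch of the monthly bucket query below rewritten with uniq (table and column names are taken from the queries that follow):
select toStartOfMonth(`timestamp`) interval_data , uniq(uid) count_data
from g94157d29.event1
where `timestamp` >= toDateTime('2018-11-01 00:00:00') and `timestamp` <= toDateTime('2018-12-31 00:00:00')
GROUP BY interval_data;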
I got it to work using the following. Here, toStartOfMonth can be changed to other similar functions in ClickHouse.
select toStartOfMonth(`timestamp`) interval_data , count(distinct uid) count_data
from g94157d29.event1
where `timestamp` >= toDateTime('2018-11-01 00:00:00') and `timestamp` <= toDateTime('2018-12-31 00:00:00')
GROUP BY interval_data;
and
select toStartOfMonth(`timestamp`) interval_data , count(*) count_data
from g94157d29.event1
where `timestamp` >= toDateTime('2018-11-01 00:00:00') and `timestamp` <= toDateTime('2018-12-31 00:00:00')
GROUP BY interval_data;
But performance is very low for >2 billion records per month in the event table, where toYYYYMM(timestamp) is the partition key and toYYYYMMDD(timestamp) is the ORDER BY key.
The distinct-count query uses >30 GB of memory, ran for 30 seconds and still didn't complete, while the plain count query takes 10-20 seconds.

How to get only the records that changed from daily-level data?

I have a huge amount of stock data. We track the stock level amount every day. My goal is a query that returns records only when the amount changes:
Sample:
Desired result:
Try this query with a subquery:
Select date_id, product,warehouse,amount
From
(
Select
date_id, product, warehouse, amount,
lag(amount) over (partition by product, warehouse order by date_id) amount_prev
From TABLENAME
) x
Where amount <> amount_prev
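If the table lives in ClickHouse, where a plain lag() window function may not be available, a sketch of the same idea using lagInFrame with an explicit frame (TABLENAME and the column names are taken from the query above):
SELECT date_id, product, warehouse, amount
FROM
(
    SELECT
        date_id, product, warehouse, amount,
        lagInFrame(amount) OVER (PARTITION BY product, warehouse ORDER BY date_id
                                 ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS amount_prev
    FROM TABLENAME
) AS x
WHERE amount <> amount_prev
-- note: for the first row of each partition lagInFrame returns the type's default value (e.g. 0) rather than NULL,
-- so the very first date per product/warehouse may need separate handling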

clickhouse - how to get counts per 1 minute or 1 day

I have a table in ClickHouse for keeping statistics and metrics.
Its structure is:
datetime | metric_name | metric_value
I want to keep statistics and limit the number of accesses per 1 minute, 1 hour, 1 day and so on. So I need event counts over the last minute, hour or day for every metric_name, and I want to present the statistics in a chart.
I do not know how to write a query that gets the counts for exact buckets such as 1 minute, 1 hour, 1 day and so on.
I used to work with InfluxDB:
SELECT SUM(value) FROM `TABLE` WHERE `metric_name`=`metric_value` AND time >= now() - 1h GROUP BY time(5m) fill(0)
In fact, I want to get the number of each metric per 5 minutes in the previous 1 hour.
I do not know how to use aggregations for this problem.
ClickHouse has functions for generating Date/DateTime group buckets, such as toStartOfWeek, toStartOfHour and toStartOfFiveMinute. You can also use the intDiv function to manually divide value ranges. However, the fill feature is still on the roadmap.
For example, you can rewrite the Influx SQL (without the fill) in ClickHouse like this:
SELECT SUM(value) FROM `TABLE` WHERE `metric_name`=`metric_value` AND
time >= now() - 1h GROUP BY toStartOfFiveMinute(time)
You can also refer to this discussion https://github.com/yandex/ClickHouse/issues/379
Update
There is a timeSlots function that can help generate the empty buckets. Here is a working example:
SELECT
slot,
metric_value_sum
FROM
(
SELECT
toStartOfFiveMinute(datetime) AS slot,
SUM(metric_value) AS metric_value_sum
FROM metrics
WHERE (metric_name = 'k1') AND (datetime >= (now() - toIntervalHour(1)))
GROUP BY slot
)
ANY RIGHT JOIN
(
SELECT arrayJoin(timeSlots(now() - toIntervalHour(1), toUInt32(3600), 300)) AS slot
) USING (slot)
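Newer ClickHouse versions also have ORDER BY ... WITH FILL, which can generate the empty buckets without a join; a sketch under that assumption, using a numeric step in seconds:
SELECT
    toStartOfFiveMinute(datetime) AS slot,
    SUM(metric_value) AS metric_value_sum
FROM metrics
WHERE (metric_name = 'k1') AND (datetime >= (now() - toIntervalHour(1)))
GROUP BY slot
ORDER BY slot WITH FILL STEP 300 -- fills 5-minute gaps between the first and last non-empty slot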

How to avoid expensive Cartesian product using row generator

I'm working on a query (Oracle 11g) that does a lot of date manipulation. Using a row generator, I'm examining each date within a range of dates for each record in another table. Through another query, I know that my row generator needs to generate 8500 dates, and this amount will grow by 365 days each year. Also, the table that I'm examining has about 18000 records, and this table is expected to grow by several thousand records a year.
The problem comes when joining the row generator to the other table to get the range of dates for each record. SQL Tuning Advisor says that there's an expensive Cartesian product, which makes sense given that the query could currently generate up to 8500 x 18000 records. Here's the query in its stripped-down form, without all the date logic etc.:
with n as (
select level n
from dual
connect by level <= 8500
)
select t.id, t.origdate + n origdate
from (
select id, origdate, closeddate
from my_table
) t
join n on origdate + n - 1 <= closeddate -- here's the problem join
order by t.id, t.origdate;
Is there an alternate way to join these two tables without the Cartesian product?
I need to calculate the elapsed time for each of these records, excluding weekends and federal holidays, so that I can sort on the elapsed time. Also, pagination for the table is done server-side, so we can't just load everything into the table and sort client-side.
The maximum age of a record in the system right now is 3656 days, and the average is 560, so it's not quite as bad as 8500 x 18000; but it's still bad.
I've just about resigned myself to adding a field to store the opendays, computing it once and storing the elapsed time, and creating a scheduled task to update all open records every night.
I think that you would get better performance if you rewrite the join condition slightly:
with n as (
select level n
from dual
connect by level <= 8500
)
select t.id, t.origdate + n origdate
from (
select id, origdate, closeddate
from my_table
) t
join n on n <= closeddate - origdate + 1 -- you could even create a function-based index
order by t.id, t.origdate;
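For reference, the function-based index hinted at in the comment could look something like this (the index name is hypothetical):
create index my_table_duration_fbi on my_table (closeddate - origdate + 1);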
