Can I query the per-hour increment of an accumulation column in ClickHouse?

I want to save the event time and the total amount of electricity generated, every 30 seconds. The total is not reset to zero each time; it is the cumulative total since the meter first started, not the amount generated during that 30-second window.
Is there any way to query daily, weekly, or monthly aggregations on the cumulative generated-electricity column (maybe not just sum or avg)?
Or should I design an AggregatingMergeTree table instead?
I don't need to keep every record, just the daily, weekly, and monthly aggregations.
For example:
create table meter_record (
event_time DateTime,
generated_total Int64
)

UPDATE:
Prefer SimpleAggregateFunction over AggregateFunction for the functions that support it (such as min, max, and sum) to speed up aggregate calculation; note that functions like avg and median are not supported there and still require AggregateFunction states.
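A minimal sketch of that approach for the min/max aggregates, using an explicit target table (the meter_aggregates_simple names are illustrative, not from the original answer):
CREATE TABLE meter_aggregates_simple (
day Date,
min_generated_total SimpleAggregateFunction(min, Int64),
max_generated_total SimpleAggregateFunction(max, Int64)
)
ENGINE = AggregatingMergeTree
PARTITION BY toYYYYMM(day)
ORDER BY day;
CREATE MATERIALIZED VIEW meter_aggregates_simple_mv TO meter_aggregates_simple AS
SELECT
toDate(event_time) AS day,
/* plain min/max, no -State combinators needed */
min(generated_total) AS min_generated_total,
max(generated_total) AS max_generated_total
FROM meter_record
GROUP BY day;
/* reading back needs no -Merge combinators either (GROUP BY is still
   required because background merges are eventual) */
SELECT day, min(min_generated_total), max(max_generated_total)
FROM meter_aggregates_simple
GROUP BY day;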
Suppose you need to calculate median, average, and dispersion aggregates for this table:
CREATE TABLE meter_record (
event_time DateTime,
generated_total Int64
)
ENGINE = MergeTree
PARTITION BY (toYYYYMM(event_time))
ORDER BY (event_time);
Use AggregatingMergeTree to calculate required aggregates:
CREATE MATERIALIZED VIEW meter_aggregates_mv
ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(day)
ORDER BY (day)
AS
SELECT
toDate(toStartOfDay(event_time)) AS day,
/* aggregates to calculate the day's section left and right endpoints */
minState(generated_total) min_generated_total,
maxState(generated_total) max_generated_total,
/* specific aggregates */
medianState(generated_total) AS totalMedian,
avgState(generated_total) AS totalAvg,
varPopState(generated_total) AS totalDispersion
/* ... */
FROM meter_record
GROUP BY day;
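As a quick usage sketch (these cumulative readings are hypothetical, not from the original answer), feeding the pipeline is a plain insert into the raw table; the view keeps the daily states up to date:
INSERT INTO meter_record VALUES
(toDateTime('2019-05-01 00:00:00'), 1000),
(toDateTime('2019-05-01 00:00:30'), 1002),
(toDateTime('2019-05-02 00:00:00'), 1100);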
To get the required daily / weekly / monthly aggregates (or any day-based aggregation, such as quarterly or yearly), use these queries:
/* daily report */
SELECT
day,
minMerge(min_generated_total) min_generated_total,
maxMerge(max_generated_total) max_generated_total,
medianMerge(totalMedian) AS totalMedian,
avgMerge(totalAvg) AS totalAvg,
varPopMerge(totalDispersion) AS totalDispersion
FROM meter_aggregates_mv
/*WHERE day >= '2019-02-05' and day < '2019-07-01'*/
GROUP BY day;
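Because generated_total is a cumulative counter, the min/max states collected above (the day's left and right endpoints) also yield the amount actually produced per day. A hedged sketch, not part of the original answer; note it ignores whatever was generated between the last reading of one day and the first reading of the next:
/* approximate daily production from the day's endpoints */
SELECT
day,
maxMerge(max_generated_total) - minMerge(min_generated_total) AS generated_that_day
FROM meter_aggregates_mv
GROUP BY day
ORDER BY day;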
/* weekly report */
SELECT
toStartOfWeek(day, 1) monday,
minMerge(min_generated_total) min_generated_total,
maxMerge(max_generated_total) max_generated_total,
medianMerge(totalMedian) AS totalMedian,
avgMerge(totalAvg) AS totalAvg,
varPopMerge(totalDispersion) AS totalDispersion
FROM meter_aggregates_mv
/*WHERE day >= '2019-02-05' and day < '2019-07-01'*/
GROUP BY monday;
/* monthly report */
SELECT
toStartOfMonth(day) month,
minMerge(min_generated_total) min_generated_total,
maxMerge(max_generated_total) max_generated_total,
medianMerge(totalMedian) AS totalMedian,
avgMerge(totalAvg) AS totalAvg,
varPopMerge(totalDispersion) AS totalDispersion
FROM meter_aggregates_mv
/*WHERE day >= '2019-02-05' and day < '2019-07-01'*/
GROUP BY month;
/* get daily / weekly / monthly reports in one query (thanks to Denis Zhuravlev for the advice) */
SELECT
day,
toStartOfWeek(day, 1) AS week,
toStartOfMonth(day) AS month,
minMerge(min_generated_total) min_generated_total,
maxMerge(max_generated_total) max_generated_total,
medianMerge(totalMedian) AS totalMedian,
avgMerge(totalAvg) AS totalAvg,
varPopMerge(totalDispersion) AS totalDispersion
FROM meter_aggregates_mv
/*WHERE (day >= '2019-05-01') AND (day < '2019-06-01')*/
GROUP BY month, week, day WITH ROLLUP
ORDER BY day, week, month;
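Note that WITH ROLLUP also emits subtotal rows in which the finer-grained key columns are set to their default values (1970-01-01 for Date), so weekly and monthly totals appear alongside the daily rows; filter on those defaults to pick out a single level.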
Remarks:
you point out that you don't need the raw data, only the aggregates, so you can set the engine of the meter_record table to Null, clean meter_record manually (see DROP PARTITION), or define a TTL to do it automatically (see the sketch after these remarks)
that said, removing raw data is generally bad practice, because it makes it impossible to calculate new aggregates over historical data or to rebuild existing ones
the materialized view meter_aggregates_mv will contain only the data inserted into meter_record after the view was created; to change this behavior, use POPULATE in the view definition
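A hedged sketch of those two clean-up options (the retention period is an arbitrary assumption):
/* option 1: keep no raw rows at all; the materialized view still sees every insert */
CREATE TABLE meter_record (
event_time DateTime,
generated_total Int64
)
ENGINE = Null;
/* option 2: keep raw rows for a grace period, then let ClickHouse drop them */
CREATE TABLE meter_record (
event_time DateTime,
generated_total Int64
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_time)
ORDER BY event_time
TTL event_time + INTERVAL 3 MONTH;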

Related

AggregatingMergeTree not aggregating inserts properly

I have a table that aggregates the number of sales across various products by minute/hour/day and computes various metrics.
The table below holds 1-minute-increment calculations computed off core_product_tbl. Once the computations are in product_agg_tbl, other tables compute hourly, daily, weekly, etc. aggregates off product_agg_tbl.
CREATE TABLE product_agg_tbl (
product String,
minute DateTime,
high Nullable(Float32),
low Nullable(Float32),
average AggregateFunction(avg, Nullable(Float32)),
first Nullable(Float32),
last Nullable(Float32),
total_sales Nullable(UInt64)
)
ENGINE = AggregatingMergeTree
PARTITION BY toYYYYMM(minute)
ORDER BY (product, minute);
CREATE MATERIALIZED VIEW product_agg_mv TO product_agg_tbl AS
SELECT
product,
minute,
max(price) AS high,
min(price) AS low,
avgState(price) AS average,
argMin(price, sales_timestamp) AS first,
argMax(price, sales_timestamp) AS last,
sum(batch_size) as total_sales
FROM core_product_tbl
WHERE minute >= today()
GROUP BY product, toStartOfMinute(sales_timestamp) AS minute;
CREATE VIEW product_agg_1w AS
SELECT
product,
toStartOfHour(minute) AS minute,
max(high) AS high,
min(low) AS low,
avgMerge(average) AS average_price,
argMin(first, minute) AS first,
argMax(last, minute) AS last,
sum(total_sales) as total_sales
FROM product_agg_tbl
WHERE minute >= date_sub(today(), interval 7 + 7 day)
GROUP BY product, minute;
The issue I have is that when I run the query below directly against core_product_tbl, I get very different numbers than from product_agg_1w. What could be going on?
SELECT
product,
toStartOfHour(minute) AS minute,
max(price) AS high,
min(price) AS low,
avgState(price) AS average,
argMin(price, sales_timestamp) AS first,
argMax(price, sales_timestamp) AS last,
sum(batch_size) as total_sales
FROM core_product_tbl
WHERE minute >= today()
GROUP BY product, toStartOfMinute(sales_timestamp) AS minute;
You should use SimpleAggregateFunction or AggregateFunction columns in an AggregatingMergeTree table.
AggregatingMergeTree knows nothing about the materialized view or the SELECT inside it. https://den-crane.github.io/Everything_you_should_know_about_materialized_views_commented.pdf
CREATE TABLE product_agg_tbl (
product String,
minute DateTime,
high SimpleAggregateFunction(max, Nullable(Float32)),
low SimpleAggregateFunction(min, Nullable(Float32)),
average AggregateFunction(avg, Nullable(Float32)),
first AggregateFunction(argMin, Nullable(Float32), DateTime),
last AggregateFunction(argMax, Nullable(Float32),DateTime),
total_sales SimpleAggregateFunction(sum, Nullable(UInt64))
)
ENGINE = AggregatingMergeTree
PARTITION BY toYYYYMM(minute)
ORDER BY (product, minute);
CREATE MATERIALIZED VIEW product_agg_mv TO product_agg_tbl AS
SELECT
product,
minute,
max(price) AS high,
min(price) AS low,
avgState(price) AS average,
argMinState(price, sales_timestamp) AS first,
argMaxState(price, sales_timestamp) AS last,
sum(batch_size) as total_sales
FROM core_product_tbl
WHERE minute >= today()
GROUP BY product, toStartOfMinute(sales_timestamp) AS minute;
CREATE VIEW product_agg_1w AS
SELECT
product,
toStartOfHour(minute) AS minute,
max(high) AS high,
min(low) AS low,
avgMerge(average) AS average_price,
argMinMerge(first) AS first,
argMaxMerge(last) AS last,
sum(total_sales) as total_sales
FROM product_agg_tbl
WHERE minute >= date_sub(today(), interval 7 + 7 day)
GROUP BY product, minute;
Don't use the view (product_agg_1w); it's counterproductive for performance because it reads excessive data. Run the SELECT directly against product_agg_tbl instead.
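For instance, a hedged sketch of such a direct query against the corrected schema, rolling up to hours (the one-day time range is an assumption):
SELECT
product,
toStartOfHour(minute) AS hour,
max(high) AS high,
min(low) AS low,
avgMerge(average) AS average_price,
argMinMerge(first) AS first, /* price at the earliest timestamp in the hour */
argMaxMerge(last) AS last, /* price at the latest timestamp in the hour */
sum(total_sales) AS total_sales
FROM product_agg_tbl
WHERE minute >= now() - INTERVAL 1 DAY
GROUP BY product, hour;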

AVERAGEX Confusion

In my data model I have a table named 'Online Sales' and a Dates table (daily dates from 2005 to 2010). They are joined M:1.
I am attempting to use AVERAGEX in the following two ways. The first approach grossly inflates my daily average when placed in a matrix with a filter context of year and month. The second approach generates correct results. I don't understand why the two don't produce the same results.
1
Average Sales By Day =
AVERAGEX(
'Dates',
[Sales Amount Online]
)
2
Average Sales By Day =
AVERAGEX(
'Online Sales',
[Sales Amount Online]
)
[Sales Amount Online] is a measure as follows:
Sales Amount Online = SUMX(
'Online Sales',
'Online Sales'[Sales Quantity] * 'Online Sales'[Unit Price] - 'Online Sales'[Discount Amount]
)
In the first measure, you are iterating through each row of the 'Dates' table and calculating [Sales Amount Online] for each day (assuming daily-level granularity).
When you evaluate the [Sales Amount Online] measure with a day as your filter context, you get the sum of all sales that occur on that day (which could be many).
In the second measure, you are iterating through each row of the 'Online Sales' table and calculating [Sales Amount Online] for each transaction (assuming that's what each row represents).
When you evaluate the [Sales Amount Online] measure within an 'Online Sales' row context, the measure only sums the sales from that single row (assuming all rows are unique).
Basically, #1 is an average per day and #2 is an average per transaction (provided my assumptions are correct).
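A small worked example (hypothetical numbers, not from the original question): suppose a month contains two days with sales, day 1 with transactions of 40 and 60 and day 2 with a single transaction of 50. Measure #1 averages the daily totals, (100 + 50) / 2 = 75 per day, while measure #2 averages the per-row amounts, (40 + 60 + 50) / 3 = 50 per transaction.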

Frequency Histogram in Clickhouse with unique and non unique data

I have an event table with created_at (DateTime), userid (String), and eventid (String) columns. Here userid can repeat, while eventid is always a unique UUID.
I am looking to build both unique and non-unique frequency histograms.
This is for both eventid and userid, on the basis of three given inputs:
start_datetime,
end_datetime, and
interval (1 min, 1 hr, 1 day, 7 days, 1 month).
Here, the number of buckets is decided by (end_datetime - start_datetime) / interval.
The output comes as start_datetime, end_datetime, and frequency.
For any interval with no data, its start_datetime and end_datetime should still appear, but with a frequency of 0.
How can I build a generic query for this?
I looked at the histogram function but could not find any documentation for it. While trying it, I could not understand the relation between its input and output.
count(distinct XXX) is deprecated.
uniq(XXX) or uniqExact(XXX) are more useful.
I got it to work using the following. Here, toStartOfMonth can be swapped for the other similar functions in ClickHouse.
select toStartOfMonth(`timestamp`) interval_data , count(distinct uid) count_data
from g94157d29.event1
where `timestamp` >= toDateTime('2018-11-01 00:00:00') and `timestamp` <= toDateTime('2018-12-31 00:00:00')
GROUP BY interval_data;
and
select toStartOfMonth(`timestamp`) interval_data , count(*) count_data
from g94157d29.event1
where `timestamp` >= toDateTime('2018-11-01 00:00:00') and `timestamp` <= toDateTime('2018-12-31 00:00:00')
GROUP BY interval_data;
But performance is very low for the >2 billion records per month in the event table, where toYYYYMM(timestamp) is the partition key and toYYYYMMDD(timestamp) is the ORDER BY key.
The distinct-count query takes more than 30 GB and over 30 seconds, and still didn't complete, while the plain count query completes in 10-20 seconds.
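As a hedged sketch of the parts the self-answer leaves open (zero-frequency buckets and arbitrary intervals), toStartOfInterval can be combined with ORDER BY ... WITH FILL, assuming a ClickHouse version that supports WITH FILL; uniq() also avoids the cost of an exact distinct count at the price of being approximate:
/* hourly unique-user buckets, with empty buckets filled in as 0 */
SELECT
toStartOfInterval(`timestamp`, INTERVAL 1 HOUR) AS interval_data,
uniq(uid) AS count_data /* approximate; use uniqExact(uid) for exact counts */
FROM g94157d29.event1
WHERE `timestamp` >= toDateTime('2018-11-01 00:00:00')
AND `timestamp` < toDateTime('2018-12-01 00:00:00')
GROUP BY interval_data
ORDER BY interval_data WITH FILL
FROM toDateTime('2018-11-01 00:00:00')
TO toDateTime('2018-12-01 00:00:00')
STEP 3600; /* step in seconds, matching the 1-hour bucket */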

PL/SQL weekly Aggregation Logic with dynamic time range

I need to aggregate values at a weekly interval. My date range is dynamic, meaning I can give any start date and end date, and each week's values should be aggregated to its Sunday. Say I have two columns and my start and end dates are 07/11/2016 to 13/11/2016:
Column A      Column B
07/11/2016    23
08/11/2016    20
09/11/2016    10
10/11/2016    05
11/11/2016    10
12/11/2016    20
13/11/2016    10
My result should be the average of column B, reported against the Sunday of that week:
Column A      Column B
13/11/2016    14.00
That is, I should take the week's past values and aggregate them to the Sunday of that week. Also, if my start and end dates are, say, 07/11/2016 to 10/11/2016, then I should not aggregate at all, since the week is not complete. I am able to aggregate the values, but I am not able to restrict the aggregation when the week is incomplete.
Is there any way to do this in PL/SQL?
Thank you in advance.
select to_char(columnA, 'iw') as weeknumber, avg(columnB)
from table
group by to_char(columnA, 'iw');
This will aggregate by week number. If you need to show the last day of each week as a label, you can get it as max(columnA) over (partition by to_char(columnA, 'iw')).
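The query above does not yet restrict incomplete weeks; a hedged extension (assuming exactly one row per day, as in the example, and a placeholder table name my_table):
select max(columnA) as week_ending_sunday, -- with a full Mon-Sun ISO week, max() is the Sunday
       avg(columnB) as avg_b
from my_table
group by to_char(columnA, 'iyyy-iw') -- include the ISO year so weeks from different years don't mix
having count(*) = 7 -- keep only complete weeks
order by week_ending_sunday;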

Get records from database ordered by year > month > day

I have an Item model.
There are many records in the database with column created_at filled in.
I want to generate a view with such a hierarchy:
2014
December
31
items here
30
items here
29
items here
...
November
30
...
...
2013
...
What's the most elegant way to do that?
EDIT: Thank you so much for the queries. How do I get this working in Ruby on Rails?
To achieve this, we will order the records by the parts of the date. A sample query is below:
SELECT
ItemDescription,
Year(DateField) AS Year,
Datename(mm, DateField) AS Month,
Day(DateField) AS Day
FROM tblName
ORDER BY
Year(DateField) DESC,
Month(DateField) DESC,
Day(DateField) DESC
This will give you the data in the expected order. You can then create a stored procedure to reshape the output into the format you need.
SELECT DATEPART(Year, PaymentDate) Year, DATEPART(Month, PaymentDate) Month, DATEPART(day, PaymentDate) Day, item_name
FROM Payments
GROUP BY DATEPART(Year, PaymentDate), DATEPART(Month, PaymentDate), DATEPART(day, PaymentDate), item_name
ORDER BY Year DESC, Month DESC, Day DESC
