I have a table that aggregates the number of sales across various products by minute/hour/day and computes various metrics.
The table below holds 1-minute increment calculations computed off core_product_tbl. Once the computations are in product_agg_tbl, other tables compute hourly, daily, weekly, etc. aggregates off product_agg_tbl.
CREATE TABLE product_agg_tbl (
product String,
minute DateTime,
high Nullable(Float32),
low Nullable(Float32),
average AggregateFunction(avg, Nullable(Float32)),
first Nullable(Float32),
last Nullable(Float32),
total_sales Nullable(UInt64)
)
ENGINE = AggregatingMergeTree
PARTITION BY toYYYYMM(minute)
ORDER BY (product, minute);
CREATE MATERIALIZED VIEW product_agg_mv TO product_agg_tbl AS
SELECT
product,
minute,
max(price) AS high,
min(price) AS low,
avgState(price) AS average,
argMin(price, sales_timestamp) AS first,
argMax(price, sales_timestamp) AS last,
sum(batch_size) as total_sales
FROM core_product_tbl
WHERE minute >= today()
GROUP BY product, toStartOfMinute(sales_timestamp) AS minute;
CREATE VIEW product_agg_1w AS
SELECT
product,
toStartOfHour(minute) AS minute,
max(high) AS high,
min(low) AS low,
avgMerge(average) AS average_price,
argMin(first, minute) AS first,
argMax(last, minute) AS last,
sum(total_sales) as total_sales
FROM product_agg_tbl
WHERE minute >= date_sub(today(), interval 7 + 7 day)
GROUP BY product, minute;
The issue I have is that when I run the query below straight off core_product_tbl, I get very different numbers than from product_agg_1w. What could be going on?
SELECT
product,
toStartOfHour(minute) AS minute,
max(price) AS high,
min(price) AS low,
avgState(price) AS average,
argMin(price, sales_timestamp) AS first,
argMax(price, sales_timestamp) AS last,
sum(batch_size) as total_sales
FROM core_product_tbl
WHERE minute >= today()
GROUP BY product, toStartOfMinute(sales_timestamp) AS minute;
You should use SimpleAggregateFunction or AggregateFunction column types in an AggregatingMergeTree table.
AggregatingMergeTree knows nothing about the materialized view or about the SELECT inside it; it decides how to merge rows based solely on the column types. https://den-crane.github.io/Everything_you_should_know_about_materialized_views_commented.pdf
CREATE TABLE product_agg_tbl (
product String,
minute DateTime,
high SimpleAggregateFunction(max, Nullable(Float32)),
low SimpleAggregateFunction(min, Nullable(Float32)),
average AggregateFunction(avg, Nullable(Float32)),
first AggregateFunction(argMin, Nullable(Float32), DateTime),
last AggregateFunction(argMax, Nullable(Float32), DateTime),
total_sales SimpleAggregateFunction(sum, Nullable(UInt64))
)
ENGINE = AggregatingMergeTree
PARTITION BY toYYYYMM(minute)
ORDER BY (product, minute);
CREATE MATERIALIZED VIEW product_agg_mv TO product_agg_tbl AS
SELECT
product,
minute,
max(price) AS high,
min(price) AS low,
avgState(price) AS average,
argMinState(price, sales_timestamp) AS first,
argMaxState(price, sales_timestamp) AS last,
sum(batch_size) as total_sales
FROM core_product_tbl
WHERE minute >= today()
GROUP BY product, toStartOfMinute(sales_timestamp) AS minute;
CREATE VIEW product_agg_1w AS
SELECT
product,
toStartOfHour(minute) AS minute,
max(high) AS high,
min(low) AS low,
avgMerge(average) AS average_price,
argMinMerge(first) AS first,
argMaxMerge(last) AS last,
sum(total_sales) as total_sales
FROM product_agg_tbl
WHERE minute >= date_sub(today(), interval 7 + 7 day)
GROUP BY product, minute;
Don't use the view (product_agg_1w): it is counterproductive for performance because it reads excessive data. Instead, run the SELECT directly against product_agg_tbl.
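For example, a direct hourly rollup over the corrected schema could look like this sketch (the 7-day window is just an illustration):
SELECT
product,
toStartOfHour(minute) AS hour,
max(high) AS high,
min(low) AS low,
avgMerge(average) AS average_price,
argMinMerge(first) AS first,
argMaxMerge(last) AS last,
sum(total_sales) AS total_sales
FROM product_agg_tbl
WHERE minute >= today() - INTERVAL 7 DAY
GROUP BY product, hour;
The WHERE clause then prunes partitions to exactly the range you need, instead of the fixed two-week window baked into the view.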
In my data model I have a table named 'Online Sales' and a Dates table (daily dates from 2005 to 2010). They are joined M:1.
I am attempting to use AVERAGEX in the following two ways. The first approach grossly inflates my daily average when placed in a matrix with a filter context of year and month. The second approach generates correct results. I don't understand why they don't produce the same results.
1
Average Sales By Day =
AVERAGEX(
'Dates',
[Sales Amount Online]
)
2
Average Sales By Day =
AVERAGEX(
'Online Sales',
[Sales Amount Online]
)
[Sales Amount Online] is a measure as follows:
Sales Amount Online = SUMX(
'Online Sales',
'Online Sales'[Sales Quantity] * 'Online Sales'[Unit Price] - 'Online Sales'[Discount Amount]
)
In the first measure, you are iterating over each row of the 'Dates' table and evaluating [Sales Amount Online] for each day (assuming daily granularity).
When you evaluate [Sales Amount Online] with a single day as the filter context, you get the sum of all sales that occurred on that day (which could be many).
In the second measure, you are iterating over each row of the 'Online Sales' table and evaluating [Sales Amount Online] for each transaction (assuming that is what each row represents).
Because referencing a measure triggers context transition, the row context becomes a filter on that single row, so the measure only sums the sales from that one row (assuming all rows are unique).
Basically, #1 is the average per day and #2 is the average per transaction (provided my assumptions are correct).
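AVERAGEX skips rows where the expression returns BLANK, so #1 already averages over only the days that have sales; if you want to make that intent explicit, a common pattern is the sketch below ('Dates'[Date] is an assumed column name):
Average Sales By Day =
AVERAGEX(
CALCULATETABLE(VALUES('Dates'[Date]), 'Online Sales'),
[Sales Amount Online]
)
CALCULATETABLE(VALUES('Dates'[Date]), 'Online Sales') keeps only the dates that have at least one related row in 'Online Sales' under the current filter context.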
I have an event table with created_at (DateTime), userid (String), and eventid (String) columns. userid can repeat, while eventid is always a unique UUID.
I am looking to build both unique and non-unique frequency histograms.
This is for both eventid and userid, based on three given inputs:
start_datetime
end_datetime and
interval (1 min, 1 hr, 1 day, 7 day, 1 month).
The number of buckets is decided by (end_datetime - start_datetime) / interval.
Each output row has a start_datetime, an end_datetime, and a frequency.
For any bucket with no data, the start_datetime and end_datetime should still be returned, but with a frequency of 0.
How can I build a generic query for this?
I looked into the histogram function but could not find any documentation for it. When I tried it, I could not understand the relation between its input and output.
count(distinct XXX) is deprecated.
uniq(XXX) (approximate) or uniqExact(XXX) (exact) are more useful.
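For example (a sketch; the table and column names are taken from the asker's queries below):
select toStartOfMonth(`timestamp`) interval_data, uniq(uid) count_data
from g94157d29.event1
group by interval_data;
uniq trades a small approximation error for much lower memory use than an exact distinct count.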
I got it working using the following. Here, toStartOfMonth can be swapped for other similar functions in ClickHouse.
select toStartOfMonth(`timestamp`) interval_data , count(distinct uid) count_data
from g94157d29.event1
where `timestamp` >= toDateTime('2018-11-01 00:00:00') and `timestamp` <= toDateTime('2018-12-31 00:00:00')
GROUP BY interval_data;
and
select toStartOfMonth(`timestamp`) interval_data , count(*) count_data
from g94157d29.event1
where `timestamp` >= toDateTime('2018-11-01 00:00:00') and `timestamp` <= toDateTime('2018-12-31 00:00:00')
GROUP BY interval_data;
But performance is very low for the >2 billion records per month in the event table, where toYYYYMM(timestamp) is the partition key and toYYYYMMDD(timestamp) is the ORDER BY.
The distinct count query used >30 GB of memory and 30 seconds of time, yet didn't complete.
The plain count query completes in 10-20 seconds.
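A generic bucketed query can be sketched with toStartOfInterval plus ORDER BY ... WITH FILL (available from ClickHouse 19.14) so that empty buckets come back with frequency 0; the hourly interval and date range below are placeholders for the three inputs:
select
toStartOfInterval(`timestamp`, INTERVAL 1 HOUR) interval_data,
uniq(uid) unique_count,
count(*) total_count
from g94157d29.event1
where `timestamp` >= toDateTime('2018-11-01 00:00:00')
and `timestamp` < toDateTime('2018-12-01 00:00:00')
group by interval_data
order by interval_data with fill
from toDateTime('2018-11-01 00:00:00')
to toDateTime('2018-12-01 00:00:00')
step 3600;
The step is in seconds (3600 = 1 hour). Filled rows carry default values, so both counts are 0 for empty buckets, and each bucket's end_datetime is just interval_data plus the interval. Since eventid is unique per row, count(*) already gives the non-unique event frequency.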
I need to aggregate values at a weekly interval. My date range is dynamic, meaning I can give any start date and end date. Each week should be anchored to its Sunday. Say I have two columns and my start and end dates are 07/11/2016 to 13/11/2016:
column A column B
07/11/2016 23
08/11/2016 20
09/11/2016 10
10/11/2016 05
11/11/2016 10
12/11/2016 20
13/11/2016 10
My result should come out as the average of column B:
Column A Column B
13/11/2016 14.00
That means I should take the past days' values and aggregate them onto the Sunday of that week. Also, if my start and end dates are, say, 07/11/2016 to 10/11/2016, then I should not aggregate at all, as the week is not complete. I am able to aggregate the values, but when the week is not complete I am not able to suppress the aggregation.
Is there any way to do this in PL/SQL?
Thank you in advance.
select to_char(columnA, 'iw') as weeknumber, avg(columnB)
from your_table
group by to_char(columnA, 'iw');
This aggregates by week number. If you need to show the last day of the week as a label, you can get it as max(columnA) over (partition by to_char(columnA, 'iw')).
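To label each complete week by its Sunday and skip incomplete weeks, a sketch like the following should work, assuming exactly one row per day; trunc(columnA, 'IW') truncates to the Monday of the ISO week, so Monday-to-Sunday weeks match the example:
select trunc(columnA, 'IW') + 6 as week_ending_sunday,
avg(columnB) as avg_b
from your_table
where columnA between :start_date and :end_date
group by trunc(columnA, 'IW')
having count(*) = 7;
The having count(*) = 7 clause drops partial weeks, so a 07/11/2016 to 10/11/2016 range returns no rows, while the full 07/11/2016 to 13/11/2016 week returns 13/11/2016 with an average of 14.00.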
I have an Item model.
There are many records in the database with column created_at filled in.
I want to generate a view with such a hierarchy:
2014
December
31
items here
30
items here
29
items here
...
November
30
...
...
2013
...
What's the most elegant way to do that?
EDIT: Thank you so much for the queries. How do I get this working in Ruby on Rails?
To achieve this, we will order the records by the parts of the date. Sample query below:
SELECT
ItemDescription,
Year(DateField) AS Year,
Datename(mm, DateField) AS Month,
Day(DateField) AS Day
FROM tblName
ORDER BY
Year(DateField) DESC,
Month(DateField) DESC,
Day(DateField) DESC
This will give you the data in the expected order. You can then create a stored procedure to reshape the output into the format you need, or do the reshaping in application code.
SELECT DATEPART(Year, PaymentDate) Year, DATEPART(Month, PaymentDate) Month, DATEPART(Day, PaymentDate) Day, item_name
FROM Payments
GROUP BY DATEPART(Year, PaymentDate), DATEPART(Month, PaymentDate), DATEPART(Day, PaymentDate), item_name
ORDER BY Year DESC, Month DESC, Day DESC
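Since the question also asks how to do this in Ruby on Rails, here is a minimal in-memory sketch, assuming an Item model with a created_at timestamp (fine for modest row counts; push the grouping into SQL as above for large tables):
# Newest first, then nest by year -> month name -> day of month.
tree = Item.order(created_at: :desc)
           .group_by { |i| i.created_at.year }
           .transform_values do |year_items|
  year_items.group_by { |i| i.created_at.strftime('%B') }
            .transform_values { |month_items| month_items.group_by { |i| i.created_at.day } }
end
# tree[2014]["December"][31] => items created on 2014-12-31
A view can then render the hierarchy with three nested loops over tree.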