WHERE clause not working with a ClickHouse view

I am trying to create a view on a source table, then select the data for a specific timestamp range from the view and insert it into the target table.
We have a source table:
1 million rows are pushed to the source table, corresponding to data from 1 January to 31 January.
CREATE TABLE IF NOT EXISTS source(
CELL String,
TIMESTAMP DateTime,
COUNTER1 Float32,
COUNTER2 Float32,
COUNTER3 Float32,
COUNTER4 Float32,
COUNTER5 Float32,
COUNTER6 Float32,
InsertionTime DateTime DEFAULT now(), /* Insertion Time */
QHour DateTime DEFAULT toStartOfFifteenMinutes(TIMESTAMP)
) ENGINE=ReplacingMergeTree()
PARTITION BY toYYYYMM(TIMESTAMP)
ORDER BY (QHour, TIMESTAMP, CELL)
SETTINGS index_granularity = 31768
Now, we created a view on the source table.
CREATE VIEW IF NOT EXISTS myView
AS SELECT
CELL,
any(QHour) AS QHour,
toStartOfFifteenMinutes(TIMESTAMP) AS ViewQHour,
100 * sum(COUNTER1) / sum(COUNTER2) AS KPI1
FROM (
SELECT
CELL,
TIMESTAMP,
QHour,
argMax(COUNTER1, InsertionTime) AS COUNTER1,
argMax(COUNTER2, InsertionTime) AS COUNTER2
FROM source
GROUP BY TIMESTAMP, CELL, QHour)
GROUP BY ViewQHour, CELL
ORDER BY ViewQHour, CELL
Now I need to select data from the view for a specific time period: 1 January to 10 January.
SELECT *
FROM myView
WHERE QHour >= toDateTime('2020-01-01 00:00:00') AND QHour <= toDateTime('2020-01-10 00:00:00')
But the SELECT on the view processes all 1 million rows (the whole of January), while I am looking for data for the specific period 1 January to 10 January only.
I have the following questions:
Can we modify the query on the view so that it only processes the specific time period?
Can we generate the view on the fly over only the latest dataset pushed to the source table?
I mean: can we take only a filtered dataset from the source table and have the view work on that filtered data?
Can such filters be changed to use different time ranges?
For example, in the first run the view holds data from 1 January to 10 January.
Then, in the second run, the view holds data from 11 January to 20 January.

CREATE VIEW IF NOT EXISTS myView
....
ORDER BY ViewQHour, CELL
Most databases forbid creating a view with ORDER BY; unfortunately, ClickHouse does not. Allowing such views was a mistake in the initial design.
ORDER BY cancels predicate pushdown, because pushing the predicate down corrupts results in some cases (the optimizer is still very weak and doesn't understand cases involving runningDifference() and neighbor()).
https://github.com/ClickHouse/ClickHouse/issues/9425#issuecomment-592658368
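
Given that, a workaround (a sketch based on the view above, not taken from the linked issue) is to define the view without the ORDER BY, so the filter on QHour can be pushed down into the subquery, and to sort in the query that reads the view instead:

DROP VIEW IF EXISTS myView

CREATE VIEW myView
AS SELECT
CELL,
any(QHour) AS QHour,
toStartOfFifteenMinutes(TIMESTAMP) AS ViewQHour,
100 * sum(COUNTER1) / sum(COUNTER2) AS KPI1
FROM (
SELECT
CELL,
TIMESTAMP,
QHour,
argMax(COUNTER1, InsertionTime) AS COUNTER1,
argMax(COUNTER2, InsertionTime) AS COUNTER2
FROM source
GROUP BY TIMESTAMP, CELL, QHour)
GROUP BY ViewQHour, CELL

SELECT *
FROM myView
WHERE QHour >= toDateTime('2020-01-01 00:00:00') AND QHour <= toDateTime('2020-01-10 00:00:00')
ORDER BY ViewQHour, CELL

With the ORDER BY out of the view definition, the predicate on QHour can be pushed down to the source table, so only the requested ten days are read instead of the whole month.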

Related

Complex procedure to adjust data continuously

I solved this in SQL Server with a trigger. Now I face the same problem on Oracle.
I have a big set of data that periodically increases with new items.
Each item has these fundamental columns:
ID: string identifier (not null)
DATETIME (not null)
DATETIME_EMIS: optional emission datetime (possibly null, and always null for type 1), equal to the DATETIME of the corresponding emission item
type (0 or 1)
value (only if type 1)
It is basically a logbook.
For example: an item with ID='FIREALARM' and datetime='2023-02-12 12:02' has a closing item like this:
ID='FIREALARM' at datetime='2023-02-12 15:11', emission datetime='2023-02-12 12:02' (equal to the emission item).
What I need is to obtain a final item in the destination table like this:
ID='FIREALARM' in DATETIME_BEGIN ='2023-02-12 12:02', DATETIME_END ='2023-02-12 15:11'
Not all items have a closing item (those of type=1 instead of 0); in this case the next item should be used to close the previous one (with the problem of finding it). For example:
Item1:
ID='DEVICESTATUS', datetime='2023-02-12 22:11', Value='Broken' ;
Item2:
ID='DEVICESTATUS', datetime='2023-02-12 22:14', Value='Running'
Should result in
ID='DEVICESTATUS', DATETIME_BEGIN ='2023-02-12 22:11',DATETIME_END ='2023-02-12 22:14', Value='Broken'
The final data should be extracted by a select query as fast as possible.
The elaboration process should be independent of the order of insertion.
In SQL Server, I created a trigger with several operations involving a temporary table and some queries on the inserted set and the entire destination table; it is a complex procedure that is not worth showing here to understand the problem.
Now I have discovered that Oracle has some limitations, and porting the trigger to it is not easy. For example, it is not easy to use a temporary table in the same way, and the operations run once per row.
I am asking what a good strategy in Oracle could be to elaborate the data into the final form, considering that the set grows continuously and each pair of opening and closing items must be reduced to a single item. I am not asking for a solution to the problem; I am trying to understand which Oracle instruments could be useful to achieve a complex elaboration like this. Thanks.
From Oracle 12, you can use MATCH_RECOGNIZE to perform row-by-row pattern matching:
SELECT *
FROM destination
MATCH_RECOGNIZE(
PARTITION BY id
ORDER BY datetime
MEASURES
FIRST(datetime) AS datetime_begin,
LAST(datetime) AS datetime_end,
FIRST(value) AS value
PATTERN ( ^ any_row+ $ )
DEFINE
any_row AS 1 = 1
)
Which, for the sample data:
CREATE TABLE destination (id, datetime, value) AS
SELECT 'DEVICESTATUS', DATE '2023-02-12' + INTERVAL '22:11' HOUR TO MINUTE, 'Broken' FROM DUAL UNION ALL
SELECT 'DEVICESTATUS', DATE '2023-02-12' + INTERVAL '22:14' HOUR TO MINUTE, 'Running' FROM DUAL;
Outputs:
ID            DATETIME_BEGIN       DATETIME_END         VALUE
DEVICESTATUS  2023-02-12 22:11:00  2023-02-12 22:14:00  Broken
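
The PATTERN ( ^ any_row+ $ ) above collapses each whole partition into a single row. For the type-0 items, where each row should be closed by the next row with the same ID, a variation of the same idea (a sketch under that assumption, not part of the original answer) matches overlapping pairs of consecutive rows:

SELECT *
FROM destination
MATCH_RECOGNIZE(
PARTITION BY id
ORDER BY datetime
MEASURES
FIRST(datetime) AS datetime_begin,
LAST(datetime) AS datetime_end,
FIRST(value) AS value
AFTER MATCH SKIP TO nxt
PATTERN ( cur nxt )
DEFINE
nxt AS 1 = 1
)

AFTER MATCH SKIP TO nxt restarts the search at the closing row, so that same row also opens the following interval: for the sample data this yields the 'Broken' interval 22:11 to 22:14, and a third DEVICESTATUS row would produce a second match, a 'Running' interval starting at 22:14.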

Can I create a Materialized View from another Materialized View in ClickHouse?

The title pretty much says it. I want to create a Materialized View whose SELECT clause reads data from another Materialized View in ClickHouse. I have tried this: the SQL for the creation of the two views runs without an error, but at runtime the first view is populated while the second one isn't.
I need to know whether I am making a mistake in my SQL or this is simply not possible.
Here's my two views:
CREATE MATERIALIZED VIEW IF NOT EXISTS production_gross
ENGINE = ReplacingMergeTree
ORDER BY (profile_type, reservoir, case_tag, variable_name, profile_phase, well_name, case_name,
timestamp) POPULATE
AS
SELECT profile_type,
reservoir,
case_tag,
is_endorsed,
toDateTime64(endorsement_date / 1000.0, 0) AS endorsement_date,
endorsed_for_month,
variable_name,
profile_phase,
well_name,
case_name,
asset_id,
toDateTime64(eoh / 1000, 0) as end_of_history,
toDateTime64(ts / 1000, 0) as timestamp,
value, -- AS rate, -- cubic meters per second rate for this month
value * dateDiff('second',
toStartOfMonth(subtractMonths(now(), 1)),
toStartOfMonth(now())) AS volume -- cubic meters volume for this month
FROM (
SELECT pp.profile_type AS profile_type,
trimBoth(splitByChar('-', case_name)[1]) AS reservoir,
JSONExtractString(cd.data, 'case_data', 'Tags$$Tag') AS case_tag,
JSONExtractString(cd.data, 'case_data', 'Tags$$Endorsed') AS is_endorsed,
-- Endorsement Data, is the timestamp when the user "endorsed" the case
JSONExtract(cd.data, 'case_data', 'Tags$$EndorsementDate', 'time_stamp', 'Int64') AS endorsement_date,
-- Endorsement Month is the month of year for which the case was actually endorsed
JSONExtractString(cd.data, 'case_data', 'Tags$$MonthTags') AS endorsed_for_month,
pp.variable_name AS variable_name,
JSONExtractString(pp.data, 'profile_phase') AS profile_phase,
JSONExtractString(wd.data, 'name') AS well_name,
JSONExtractString(cd.data, 'header', 'name') AS case_name,
-- We might want to have asset id here to use in roll-up
JSONExtract(cd.data, 'header', 'reservoir_asset_id', 'Int64') AS asset_id, -- Asset Id in ARM
JSONExtract(pp.data, 'end_of_history', 'Int64') AS end_of_history,
JSONExtract(pp.data, 'values', 'Array(Float64)') AS values,
JSONExtract(pp.data, 'timestamps', 'Array(Int64)') AS timestamps,
JSONExtract(pp.data, 'end_of_history', 'Int64') AS eoh
FROM production_profile AS pp
INNER JOIN well_data AS wd ON wd.uuid = pp.well_id
INNER JOIN case_data AS cd ON cd.uuid = pp.case_id
)
ARRAY JOIN
values AS value,
timestamps AS ts
;
CREATE MATERIALIZED VIEW IF NOT EXISTS production_volume_actual
ENGINE = ReplacingMergeTree
ORDER BY (asset_id,
case_tag,
variable_name,
endorsement_date) POPULATE
AS
SELECT profile_type,
case_tag,
is_endorsed,
endorsement_date,
endorsed_for_month,
variable_name,
profile_phase,
asset_id,
sum(volume) AS total_actual_volume
FROM production_gross
WHERE timestamp < end_of_history
GROUP BY profile_type,
case_tag,
is_endorsed,
endorsement_date,
endorsed_for_month,
variable_name,
profile_phase,
asset_id
ORDER BY asset_id ASC,
case_tag ASC,
variable_name ASC,
endorsement_date ASC
;
As you can see, the second view is an aggregation over the first, and that is why I need it. If I had to do the aggregation from scratch, a lot of processing would have to be done twice.
Update:
I have tried to change the query to the following:
SELECT ...
FROM `.inner.production_gross`
...
Which did not help. This query resulted in the following error:
Code: 60. DB::Exception: Table default.`.inner.production_gross` doesn't exist.
Then, based on the comment by @DennyCrane and using this answer: https://stackoverflow.com/a/67709334/959156, I ran this query:
SELECT
uuid,
name
FROM system.tables
WHERE database = 'default' AND engine = 'MaterializedView'
Which gave me the uuid of the inner table:
ebab2dc5-2887-4e7d-998d-6acaff122fc7
So, I ran this query:
SELECT ...
FROM `.inner.ebab2dc5-2887-4e7d-998d-6acaff122fc7`
Which resulted in the following error:
Code: 60. DB::Exception: Table default.`.inner.ebab2dc5-2887-4e7d-998d-6acaff122fc7` doesn't exist.
Materialized views work as insert triggers on actual data tables, so your production_volume_actual view has to SELECT from a data table, not from a "view".
If you CREATE a materialized view using an ENGINE (and not as TO another data table), ClickHouse actually creates a data table named .inner.<mv_name> on older versions (databases not using the Atomic engine), or .inner_id.<some UUID> when using an Atomic or Replicated database engine. So change the SELECT in your second view to read from that inner table name, either:
select from `.inner.production_gross`
select from `.inner_id.<UUID>` -- note the extra '_id' on 'inner'
It should work.
This answer can point you to the right UUID.
At ClickHouse we actually recommend that you always create Materialized Views as TO <second_table>, to avoid this kind of confusion and to make operations on <second_table> simpler and more transparent.
(Thanks to OP Mostafa Zeinali and Denny Crane for the clarification for more recent ClickHouse versions)
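
Following that recommendation, here is a minimal self-contained sketch of the TO pattern and of chaining materialized views through data tables (all table and column names here are illustrative, not from the original post):

-- raw table that receives the inserts
CREATE TABLE events_raw (ts DateTime, key String, value Float64)
ENGINE = MergeTree ORDER BY (key, ts);

-- explicit target table; the materialized view is only the insert trigger
CREATE TABLE events_enriched (ts DateTime, key String, doubled Float64)
ENGINE = MergeTree ORDER BY (key, ts);

CREATE MATERIALIZED VIEW events_enriched_mv TO events_enriched
AS SELECT ts, key, value * 2 AS doubled FROM events_raw;

-- the second materialized view selects from the *data table*, not from the
-- first view, so it fires whenever events_enriched_mv inserts into events_enriched
CREATE TABLE events_daily (day Date, key String, total Float64)
ENGINE = SummingMergeTree ORDER BY (day, key);

CREATE MATERIALIZED VIEW events_daily_mv TO events_daily
AS SELECT toDate(ts) AS day, key, sum(doubled) AS total
FROM events_enriched
GROUP BY day, key;

Each insert into events_raw then cascades through both views, and both events_enriched and events_daily remain ordinary tables that can be queried, backed up, or dropped and rebuilt independently.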

How to iterate over a Hive table row by row and calculate a metric when a specific condition is met?

I have a requirement as below:
I am trying to convert a MS Access table macro loop to work for a hive table. The table called trip_details contains details about a specific trip taken by a truck. The truck can stop at multiple locations and the type of stop is indicated by a flag called type_of_trip. This column contains values like arrival, departure, loading etc.
The ultimate aim is to calculate the dwell time of each truck (how much time the truck takes before beginning another trip). To calculate this, we have to iterate over the table row by row and check the trip type.
A typical example look like this:
Do while not end of file:
    Store the first row in a variable.
    Move to the second row.
    If the type_of_trip = Arrival:
        Move to the third row.
        If the type_of_trip = End Trip:
            Store the third row.
            Take the difference of the timestamps to calculate the dwell time.
            Append the row to the output table.
End
What is the best approach to tackle this problem in hive?
I tried checking whether Hive has a loop keyword but could not find one. I was thinking of doing this with a shell script, but I need guidance on how to approach it.
I cannot disclose the entire data but feel free to shoot any questions in the comments section.
Input
Trip ID  type_of_trip  timestamp        location
1        Departure     28/5/2019 15:00  Warehouse
1        Arrival       28/5/2019 16:00  Store
1        Live Unload   28/5/2019 16:30  Store
1        End Trip      28/5/2019 17:00  Store
Expected Output
Trip ID  Origin_location  Destination_location  Dwell_time
1        Warehouse        Store                 2 hours
You do not need a loop for this; use the power of a SQL query.
Convert your timestamps to seconds (using your format, 'dd/MM/yyyy HH:mm'), calculate the min and max per trip_id taking the stop type into account, subtract the seconds, and convert the difference to 'HH:mm' format (or any other format you prefer):
with trip_details as (--use your table instead of this subquery
select stack (4,
1,'Departure' ,'28/5/2019 15:00','Warehouse',
1,'Arrival' ,'28/5/2019 16:00','Store',
1,'Live Unload' ,'28/5/2019 16:30','Store',
1,'End Trip' ,'28/5/2019 17:00','Store'
) as (trip_id, type_of_trip, `timestamp`, location)
)
select trip_id, origin_location, destination_location,
from_unixtime(destination_time-origin_time,'HH:mm') dwell_time
from
(
select trip_id,
min(case when type_of_trip='Departure' then unix_timestamp(`timestamp`,'dd/MM/yyyy HH:mm') end) origin_time,
max(case when type_of_trip='End Trip' then unix_timestamp(`timestamp`,'dd/MM/yyyy HH:mm') end) destination_time,
max(case when type_of_trip='Departure' then location end) origin_location,
max(case when type_of_trip='End Trip' then location end) destination_location
from trip_details
group by trip_id
)s;
Result:
trip_id origin_location destination_location dwell_time
1 Warehouse Store 02:00
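
One caveat (my note, not part of the original answer): from_unixtime() formats the difference in the session time zone and wraps around after 24 hours, so the 'HH:mm' trick is only reliable for dwell times under a day on a UTC-configured cluster. Plain integer arithmetic on the seconds difference avoids both limitations; only the outer query changes:

select trip_id, origin_location, destination_location,
       concat(cast(floor((destination_time-origin_time)/3600) as string), ':',
              lpad(cast(floor(((destination_time-origin_time)%3600)/60) as string), 2, '0')) dwell_time
from
( -- same aggregating subquery as above
...
)s;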

How to get the difference between the values selected by slicer?

I am new to Power BI, and I am currently working with table visualizations and slicers.
My data is as follows
Student table:
Date table:
Exam table:
The relationships within the table are as follows:
I want an output like the image shown below. I would like to create two table visuals that can be filtered on Student Name and Classroom, and also have a slicer on two dates. I need to compute the minimum score. The user must be able to select two dates at a time on the slicer: the first date selected should drive the 'Min Score at date1' column, the second date should drive the 'Min Score at date2' column, and the third column, 'Difference in Score', must calculate the difference between the two.
Similarly, I also want to calculate the average score in the same way.
Please let me know how to proceed, or what alternative formula, query, or method I should apply to get the desired result. Thanks!
Before I start, let me mention that this example was done in SSAS, so it may need some tweaking in Power BI, but the logic is identical nonetheless.
First, create a clone of the date table and call it something else, e.g. 'Compare Date'. Next, create an inactive one-to-many relationship between 'Compare Date' and your 'Fact' table, as in the image below; in this case I am joining on [Year Month], which you will need to adjust to fit your needs:
If you are unsure how to do this, just right click on the new table and select the create relationship option, ensure that the relationship is like the image below:
Once this has been done, right click on the 'relationship' and mark it as inactive.
Now that you have the new date table and the relationships set up, I want you to create a few DAX measures:
Min Date 1 = Min('Student Table'[Score])
Min Date 2 = CALCULATE(Min('Student Table'[Score]), ALL('Dates'), USERELATIONSHIP('Compare Date'[Date], 'Fact'[Date]))
Avg Date 1 = AVERAGE('Student Table'[Score])
Avg Date 2 = CALCULATE(AVERAGE('Student Table'[Score]), ALL('Dates'), USERELATIONSHIP('Compare Date'[Date], 'Fact'[Date]))
Delta Min = [Min Date 2] - [Min Date 1]
Delta Avg = [Avg Date 2] - [Avg Date 1]
These measures will calculate exactly what you need, and they can be filtered independently via two date slicers, one tied to each date table. The rest is just busy work.
I hope this helps.

How to pull data from Oracle quarterly based on EFF_DATE

I have data in the table POL_INFO (pol_num, pol_sym, pol_mod, eff_date). I need to pull the data from it on a quarterly basis using EFF_DATE.
I'm not sure what you want to query, so here's an example that will hopefully get you started; it counts rows by quarter based on eff_date:
SELECT TO_CHAR(eff_date, 'YYYYQ'), COUNT(*)
FROM my_table
GROUP BY TO_CHAR(eff_date, 'YYYYQ')
The query relies on the TO_CHAR date format code Q, which returns the calendar quarter (Jan-Mar = quarter 1, Apr-Jun = quarter 2, etc.).
Finally, be warned that a WHERE clause built on TO_CHAR(eff_date, ...) is not optimizable: it cannot use an index on eff_date. If you have millions of rows you'll want a different approach.
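For example (a sketch, my addition; :any_date is a bind variable holding any date inside the quarter you want), a plain range predicate on eff_date stays index-friendly:

SELECT *
FROM my_table
WHERE eff_date >= TRUNC(:any_date, 'Q')
AND eff_date < ADD_MONTHS(TRUNC(:any_date, 'Q'), 3)

TRUNC(date, 'Q') returns the first day of that date's quarter, so the two bounds cover exactly one quarter and an index on eff_date can still be used.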
