Can I create a Materialized View from another Materialized View in ClickHouse?

The title pretty much says it. I want to create a Materialized View whose SELECT clause selects data from another Materialized View in ClickHouse. I have tried this: the SQL for the creation of the two views runs without an error, but at runtime the first view is populated while the second one isn't.
I need to know if I am making a mistake in my SQL or if this is simply not possible.
Here are my two views:
CREATE MATERIALIZED VIEW IF NOT EXISTS production_gross
ENGINE = ReplacingMergeTree
ORDER BY (profile_type, reservoir, case_tag, variable_name, profile_phase, well_name, case_name,
timestamp) POPULATE
AS
SELECT profile_type,
reservoir,
case_tag,
is_endorsed,
toDateTime64(endorsement_date / 1000.0, 0) AS endorsement_date,
endorsed_for_month,
variable_name,
profile_phase,
well_name,
case_name,
asset_id,
toDateTime64(eoh / 1000, 0) as end_of_history,
toDateTime64(ts / 1000, 0) as timestamp,
value, -- rate in cubic meters per second for this month
value * dateDiff('second',
toStartOfMonth(subtractMonths(now(), 1)),
toStartOfMonth(now())) AS volume -- cubic meters volume for this month
FROM (
SELECT pp.profile_type AS profile_type,
trimBoth(splitByChar('-', case_name)[1]) AS reservoir,
JSONExtractString(cd.data, 'case_data', 'Tags$$Tag') AS case_tag,
JSONExtractString(cd.data, 'case_data', 'Tags$$Endorsed') AS is_endorsed,
-- Endorsement Date is the timestamp when the user "endorsed" the case
JSONExtract(cd.data, 'case_data', 'Tags$$EndorsementDate', 'time_stamp', 'Int64') AS endorsement_date,
-- Endorsement Month is the month of year for which the case was actually endorsed
JSONExtractString(cd.data, 'case_data', 'Tags$$MonthTags') AS endorsed_for_month,
pp.variable_name AS variable_name,
JSONExtractString(pp.data, 'profile_phase') AS profile_phase,
JSONExtractString(wd.data, 'name') AS well_name,
JSONExtractString(cd.data, 'header', 'name') AS case_name,
-- We might want to have asset id here to use in roll-up
JSONExtract(cd.data, 'header', 'reservoir_asset_id', 'Int64') AS asset_id, -- Asset Id in ARM
JSONExtract(pp.data, 'end_of_history', 'Int64') AS end_of_history,
JSONExtract(pp.data, 'values', 'Array(Float64)') AS values,
JSONExtract(pp.data, 'timestamps', 'Array(Int64)') AS timestamps,
JSONExtract(pp.data, 'end_of_history', 'Int64') AS eoh
FROM production_profile AS pp
INNER JOIN well_data AS wd ON wd.uuid = pp.well_id
INNER JOIN case_data AS cd ON cd.uuid = pp.case_id
)
ARRAY JOIN
values AS value,
timestamps AS ts
;
CREATE MATERIALIZED VIEW IF NOT EXISTS production_volume_actual
ENGINE = ReplacingMergeTree
ORDER BY (asset_id,
case_tag,
variable_name,
endorsement_date) POPULATE
AS
SELECT profile_type,
case_tag,
is_endorsed,
endorsement_date,
endorsed_for_month,
variable_name,
profile_phase,
asset_id,
sum(volume) AS total_actual_volume
FROM production_gross
WHERE timestamp < end_of_history
GROUP BY profile_type,
case_tag,
is_endorsed,
endorsement_date,
endorsed_for_month,
variable_name,
profile_phase,
asset_id
ORDER BY asset_id ASC,
case_tag ASC,
variable_name ASC,
endorsement_date ASC
;
As you can see, the second view is an "aggregation" on the first, and that is why I need it. If I had to do the aggregation from scratch, a lot of processing would have to be done twice.
Update:
I have tried changing the query to the following:
SELECT ...
FROM `.inner.production_gross`
...
Which did not help. This query resulted in the following error:
Code: 60. DB::Exception: Table default.`.inner.production_gross` doesn't exist.
Then, based on the comment by @DennyCrane and using this answer: https://stackoverflow.com/a/67709334/959156, I ran this query:
SELECT
uuid,
name
FROM system.tables
WHERE database = 'default' AND engine = 'MaterializedView'
Which gave me the uuid of the inner table:
ebab2dc5-2887-4e7d-998d-6acaff122fc7
So, I ran this query:
SELECT ...
FROM `.inner.ebab2dc5-2887-4e7d-998d-6acaff122fc7`
Which resulted in the following error:
Code: 60. DB::Exception: Table default.`.inner.ebab2dc5-2887-4e7d-998d-6acaff122fc7` doesn't exist.

Materialized views work as insert triggers on actual data tables, so your production_volume_actual view has to SELECT from a data table, not another "view".
If you CREATE a materialized view using an ENGINE (and not as TO another data table), ClickHouse actually creates a data table to hold the view's rows, named .inner.<mv_name> on older versions (databases not using the Atomic engine), or .inner_id.<some UUID> if using an Atomic or Replicated database engine. So if you change the SELECT in your second view to use that inner table name, either:
select from `.inner.production_gross`
select from `.inner_id.<UUID>` -- note the extra '_id' on 'inner'
It should work.
This answer can point you to the right UUID.
At ClickHouse we actually recommend that you always create Materialized Views as TO <second_table>, to avoid this kind of confusion and to make operations on <second_table> simpler and more transparent.
(Thanks to OP Mostafa Zeinali and Denny Crane for the clarification for more recent ClickHouse versions)
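For illustration, here is a minimal self-contained sketch of that pattern (hypothetical table and column names, not the question's full schema): each view writes TO an explicit target table, and the second view selects from the first view's target table rather than from the view itself.
-- Hypothetical source table.
CREATE TABLE IF NOT EXISTS source_events
(
    asset_id UInt64,
    ts       DateTime,
    value    Float64
)
ENGINE = MergeTree
ORDER BY (asset_id, ts);
-- Explicit target table for the first view.
CREATE TABLE IF NOT EXISTS gross_data
(
    asset_id UInt64,
    ts       DateTime,
    volume   Float64
)
ENGINE = ReplacingMergeTree
ORDER BY (asset_id, ts);
CREATE MATERIALIZED VIEW IF NOT EXISTS gross_mv
TO gross_data
AS SELECT asset_id, ts, value * 3600 AS volume
FROM source_events;
-- Explicit target table for the second view; SummingMergeTree folds the
-- partial sums produced by each insert block into a running total.
CREATE TABLE IF NOT EXISTS volume_actual_data
(
    asset_id     UInt64,
    total_volume Float64
)
ENGINE = SummingMergeTree
ORDER BY asset_id;
-- Chained view: it fires on inserts into gross_data (done by gross_mv),
-- not on inserts into the view gross_mv itself.
CREATE MATERIALIZED VIEW IF NOT EXISTS volume_actual_mv
TO volume_actual_data
AS SELECT asset_id, sum(volume) AS total_volume
FROM gross_data
GROUP BY asset_id;
Note that POPULATE cannot be combined with TO; backfill the target tables with an explicit INSERT ... SELECT instead.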

Related

Translate SQL's first_value and partition by into SAS

I have this code in SQL
SELECT acc_id,
time,
approved_amount,
balance,
coalesce(approved_amount,
first_value(balance) OVER (PARTITION BY acc_id
ORDER BY time)) orig_amount
FROM table;
Is it possible somehow to translate it into SAS? It is not working in a PROC SQL step.
I don't use nor know SAS; however, if it is something that does not support window functions, you can replace them with joins. I assume you want the second argument of coalesce to be the balance of the oldest record in each acc_id group, hence:
select acc_id,
time,
approved_amount,
balance,
coalesce(approved_amount, acc_id_to_balance.balance_fallback)
from table t
join (
select t.acc_id, t.balance as balance_fallback
from (
select acc_id, min(time) as min_time
from table
group by acc_id
) acc_id_to_min_time
join table t on acc_id_to_min_time.acc_id = t.acc_id and acc_id_to_min_time.min_time = t.time
) acc_id_to_balance on t.acc_id = acc_id_to_balance.acc_id
I just worked this out in my head and didn't try it. Problems might appear in case of a duplicate minimal time, which would require another level of grouping, as sketched below.
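A hedged sketch of that extra grouping level (untested, reusing the question's placeholder table name): collapse ties on the minimal time by picking one balance deterministically, e.g. with min:
select t.acc_id,
       t.time,
       t.approved_amount,
       t.balance,
       coalesce(t.approved_amount, acc_id_to_balance.balance_fallback) as orig_amount
from table t
join (
    -- one row per acc_id even when several rows share the minimal time
    select t.acc_id, min(t.balance) as balance_fallback
    from (
        select acc_id, min(time) as min_time
        from table
        group by acc_id
    ) acc_id_to_min_time
    join table t
      on acc_id_to_min_time.acc_id = t.acc_id
     and acc_id_to_min_time.min_time = t.time
    group by t.acc_id
) acc_id_to_balance on t.acc_id = acc_id_to_balance.acc_id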
This is how you would do it in SAS. Unlike SQL, a data step processes the data in the order it appears in the source dataset.
data want;
  set table;
  by acc_id time; /* assumes the input is sorted by acc_id and time */
  if first.acc_id then first_balance = balance; /* first row of each acc_id group */
  retain first_balance;
  orig_amount = coalesce(approved_amount, first_balance);
run;

Oracle tuning for a query with a nested query

I am trying to improve a query. I have a dataset of opened tickets. Every ticket has multiple rows, and every row represents an update of the ticket. There is a field (dt_update) that differs on every row.
I have these indexes on st_remedy_full_light:
IDX_ASSIGNMENT (ASSIGNMENT)
IDX_REMEDY_INC_ID (REMEDY_INC_ID)
IDX_REMDULL_LIGHT_DTUPD (DT_UPDATE)
Now, the query runs in 8 seconds, which is too slow for me.
WITH last_ticket AS
( SELECT *
FROM st_remedy_full_light a
WHERE a.dt_update IN
( SELECT MAX(dt_update)
FROM st_remedy_full_light
WHERE remedy_inc_id = a.remedy_inc_id
)
)
SELECT remedy_inc_id, ASSIGNMENT FROM last_ticket
This is the plan
How could I improve this query?
P.S. This is just a part of a bigger query.
Additional information:
- The table st_remedy_full_light contains 529,507 rows
You could try:
WITH last_ticket AS
( SELECT remedy_inc_id, ASSIGNMENT,
rank() over (partition by remedy_inc_id order by dt_update desc) rn
FROM st_remedy_full_light a
)
SELECT remedy_inc_id, ASSIGNMENT FROM last_ticket
where rn = 1;
The best alternative query, which is also much cheaper to execute, is this:
select remedy_inc_id
, max(assignment) keep (dense_rank last order by dt_update)
from st_remedy_full_light
group by remedy_inc_id
This will use only one full table scan and a (hash/sort) GROUP BY, with no self-joins.
Don't bother with indexed access; you'll probably find a full table scan is most appropriate here, unless the table is really wide and a composite index on all the columns used (remedy_inc_id, dt_update, assignment) would be significantly quicker to read than the table.
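If you do want to test the covering-index route, a minimal sketch (the index name is made up; whether it beats the full scan depends on row width and clustering):
create index idx_remdull_covering
    on st_remedy_full_light (remedy_inc_id, dt_update, assignment);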

Delete duplicate rows from a BigQuery table

I have a table with >1M rows of data and 20+ columns.
Within my table (tableX) I have identified duplicate records (~80k) in one particular column (troubleColumn).
If possible I would like to retain the original table name and remove the duplicate records from my problematic column; otherwise I could create a new table (tableXfinal) with the same schema but without the duplicates.
I am not proficient in SQL or any other programming language, so please excuse my ignorance.
-- Note: this deletes every row whose Fixed_Accident_Index occurs more than
-- once, including the "original" copy; the answers below keep one row per key.
delete from Accidents.CleanedFilledCombined
where Fixed_Accident_Index in (
    select Fixed_Accident_Index
    from Accidents.CleanedFilledCombined
    group by Fixed_Accident_Index
    having count(Fixed_Accident_Index) > 1
);
You can remove duplicates by running a query that rewrites your table (you can use the same table as the destination, or you can create a new table, verify that it has what you want, and then copy it over the old table).
A query that should work is here:
SELECT *
FROM (
SELECT
*,
ROW_NUMBER()
OVER (PARTITION BY Fixed_Accident_Index)
row_number
FROM Accidents.CleanedFilledCombined
)
WHERE row_number = 1
UPDATE 2019: To de-duplicate rows on a single partition with a MERGE, see:
https://stackoverflow.com/a/57900778/132438
An alternative to Jordan's answer - this one scales better when there are too many duplicates:
#standardSQL
SELECT event.* FROM (
SELECT ARRAY_AGG(
t ORDER BY t.created_at DESC LIMIT 1
)[OFFSET(0)] event
FROM `githubarchive.month.201706` t
# GROUP BY the id you are de-duplicating by
GROUP BY actor.id
)
Or a shorter version (takes any row, instead of the newest one):
SELECT k.*
FROM (
SELECT ARRAY_AGG(x LIMIT 1)[OFFSET(0)] k
FROM `fh-bigquery.reddit_comments.2017_01` x
GROUP BY id
)
To de-duplicate rows on an existing table:
CREATE OR REPLACE TABLE `deleting.deduplicating_table`
AS
# SELECT id FROM UNNEST([1,1,1,2,2]) id
SELECT k.*
FROM (
SELECT ARRAY_AGG(row LIMIT 1)[OFFSET(0)] k
FROM `deleting.deduplicating_table` row
GROUP BY id
)
Not sure why nobody has mentioned a DISTINCT query.
Here is a way to clean duplicate rows:
CREATE OR REPLACE TABLE project.dataset.table
AS
SELECT DISTINCT * FROM project.dataset.table
If your schema doesn't have any RECORD (nested) fields, the variation of Jordan's answer below will work well enough, writing over the same table or to a new one, etc.
SELECT <list of original fields>
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY Fixed_Accident_Index) AS pos
FROM Accidents.CleanedFilledCombined
)
WHERE pos = 1
In a more generic case - with a complex schema with records/nested fields, etc. - the above approach can be a challenge.
I would propose trying the tabledata.insertAll API with rows[].insertId set to the respective Fixed_Accident_Index for each row.
In this case duplicate rows will be eliminated by BigQuery.
Of course, this will involve some client-side coding - so it might not be relevant for this particular question.
I haven't tried this approach myself either, but I feel it might be interesting to try :o)
If you have a large partitioned table and only have duplicates in a certain partition range, you don't want to scan or process the whole table. Use the MERGE SQL below with predicates on the partition range:
-- WARNING: back up the table before this operation
-- FOR large size timestamp partitioned table
-- -------------------------------------------
-- -- To de-duplicate rows of a given range of a partition table, using surrogate_key as unique id
-- -------------------------------------------
DECLARE dt_start DEFAULT TIMESTAMP("2019-09-17T00:00:00", "America/Los_Angeles") ;
DECLARE dt_end DEFAULT TIMESTAMP("2019-09-22T00:00:00", "America/Los_Angeles");
MERGE INTO `gcp_project`.`data_set`.`the_table` AS INTERNAL_DEST
USING (
SELECT k.*
FROM (
SELECT ARRAY_AGG(original_data LIMIT 1)[OFFSET(0)] k
FROM `gcp_project`.`data_set`.`the_table` AS original_data
WHERE stamp BETWEEN dt_start AND dt_end
GROUP BY surrogate_key
)
) AS INTERNAL_SOURCE
ON FALSE
WHEN NOT MATCHED BY SOURCE
AND INTERNAL_DEST.stamp BETWEEN dt_start AND dt_end -- remove all data in partition range
THEN DELETE
WHEN NOT MATCHED THEN INSERT ROW
credit: https://gist.github.com/hui-zheng/f7e972bcbe9cde0c6cb6318f7270b67a
An easier answer, without a subselect:
SELECT
*,
ROW_NUMBER()
OVER (PARTITION BY Fixed_Accident_Index)
row_number
FROM Accidents.CleanedFilledCombined
WHERE TRUE
QUALIFY row_number = 1
The WHERE TRUE is necessary because QUALIFY needs a WHERE, GROUP BY, or HAVING clause.
Felipe's answer is the best approach for most cases. Here is a more elegant way to accomplish the same:
CREATE OR REPLACE TABLE Accidents.CleanedFilledCombined
AS
SELECT
Fixed_Accident_Index,
ARRAY_AGG(x LIMIT 1)[SAFE_OFFSET(0)].* EXCEPT(Fixed_Accident_Index)
FROM Accidents.CleanedFilledCombined AS x
GROUP BY Fixed_Accident_Index;
To be safe, make sure you back up the original table before you run this ^^
I don't recommend using the ROW_NUMBER() OVER() approach if possible, since you may run into BigQuery memory limits and get unexpected errors.
1. Update the BigQuery schema with a new column bq_uuid, NULLABLE and of type STRING.
2. Create duplicate rows by running the same command 5 times, for example:
insert into `beginner-290513.917834811114.messages` (id, type, flow, updated_at)
values (19999, "hello", "inbound", '2021-06-08T12:09:03.693646')
3. Check that duplicate entries exist:
select * from `beginner-290513.917834811114.messages` where id = 19999
4. Use the GENERATE_UUID function to assign a UUID to each message:
UPDATE `beginner-290513.917834811114.messages`
SET bq_uuid = GENERATE_UUID()
WHERE id > 0
5. Clean the duplicate entries:
DELETE FROM `beginner-290513.917834811114.messages`
WHERE bq_uuid IN (
    SELECT bq_uuid
    FROM (
        SELECT bq_uuid,
               ROW_NUMBER() OVER (PARTITION BY updated_at ORDER BY bq_uuid) AS row_num
        FROM `beginner-290513.917834811114.messages`
    ) t
    WHERE t.row_num > 1
);

Oracle - View to fetch data gives different results in different environments

In Oracle (PROD), we will be creating views on table(s), and the users will be querying the views to fetch data for each reporting period (a single month, e.g. between '01-DEC-2015' and '31-DEC-2015'). We created a view as:
CREATE OR REPLACE VIEW VW_TABLE1 AS
SELECT ACCNT_NBR, BIZ_DATE, MAX(COL1) COL1, MAX(COL2) COL2
FROM TABLE1_D
WHERE BIZ_DATE IN (SELECT BIZ_DATE FROM TABLE2_M GROUP BY BIZ_DATE)
GROUP BY ACCNT_NBR, BIZ_DATE;
The issue here is that TABLE1_D (a daily table, with data from Dec 2015 to Feb 2016) has records with multiple dates for a month; for Dec 2015, say, it has records with 01-DEC-2015, 02-DEC-2015, ..., 29-DEC-2015, 30-DEC-2015 (not necessarily continuous, but loaded on business dates), with each day having close to 2,500,000 records.
TABLE2_M is a monthly table with a single date per month (e.g. for Dec 2015, say, 30-DEC-2015) and around 4,000 records for each date.
When we query the view as
SELECT * FROM VW_TABLE1 WHERE BIZ_DATE BETWEEN '01-DEC-2015' AND '31-DEC-2015'
it returns the aggregated data in TABLE1_D for 30-DEC-2015, as expected. I thought the grouping on BIZ_DATE in TABLE1_D was unnecessary, as only one BIZ_DATE would come out of the inner query.
I checked by removing BIZ_DATE from the final GROUP BY, assuming there would be data for a single day from the inner query.
Hence I took 2 rows for the dates 30-DEC-2015 and 30-JAN-2016 from both tables, created them in SIT for testing, and created the view as:
CREATE VIEW VW_TABLE1 AS
SELECT ACCNT_NBR, MAX(BIZ_DATE) BIZ_DATE, MAX(COL1) COL1, MAX(COL2) COL2
FROM TABLE1_D
WHERE BIZ_DATE IN (SELECT BIZ_DATE FROM TABLE2_M GROUP BY BIZ_DATE)
GROUP BY ACCNT_NBR;
The select with BETWEEN for each month (or = the exact month date) gives correct data in SIT; i.e., when I used BETWEEN for a single month, it produced the respective month's data.
SELECT * FROM VW_TABLE1 WHERE BIZ_DATE BETWEEN '01-DEC-2015' AND '31-DEC-2015';
SELECT * FROM VW_TABLE1 WHERE BIZ_DATE = '30-DEC-2015';
With this, I modified the view DDL in PROD (to be the same as SIT). But surprisingly, the same select (the 2nd one with = '30-DEC-2015'; the 1st one was taking too long due to the volume of data, so it was aborted) returned no data. I suspect the inner query returns all dates from 30-DEC-2015 to 30-JAN-2016, so MAX(BIZ_DATE) is derived from 30-JAN-2016 (TABLE2_M doesn't have Feb 2016 data).
I verified whether there were any Oracle version differences between SIT and PROD and found them to be the same in v$version (11.2.0.4.0). Can you please explain this behavior: the same query on the same view DDL in different environments returning different results with the same data?

How to create select SQL statement that would produce "merged" dataset from two tables(Oracle DBMS)?

Here's my original question:
merging two data sets
Unfortunately I omitted some intricacies that I'd like to elaborate on here.
So I have two tables, events_source_1 and events_source_2. I have to produce a resultant dataset from those tables (one that I'd be able to insert into a third table, but that's irrelevant).
events_source_1 contains historic event data, and I have to get the most recent event; for that I'm doing the following:
select event_type,b,c,max(event_date),null next_event_date
from events_source_1
group by event_type,b,c,event_date,null
events_source_2 contains the future event data, and I have to do the following:
select event_type,b,c,null event_date, next_event_date
from events_source_2
where b>sysdate;
How do I write an outer join statement to fill the void (i.e., when the same event_type, b, c is found in events_source_2, next_event_date will be filled with the first date found)?
I greatly appreciate your help in advance.
Hope I got your question right. This should return the latest event_date of events_source_1 per event_type, b, c and add the lowest event_date of events_source_2 as next_event_date.
Select es1.event_type, es1.b, es1.c,
       Max(es1.event_date) As event_date,
       Min(es2.event_date) As next_event_date
From events_source_1 es1
Left Join events_source_2 es2 On ( es2.event_type = es1.event_type
                                   And es2.b = es1.b
                                   And es2.c = es1.c
                                 )
Group By es1.event_type, es1.b, es1.c
You could just turn the table where you need to select a max using a group by into a virtual table, and then do the full outer join as I provided in the answer to the prior question.
Add something like this to the top of the query:
with past_source as (
    select event_type, b, c, max(event_date) as event_date
    from event_source_1
    group by event_type, b, c
)
Then you can use past_source as if it were an actual table, and continue your select right after the closing parens of the with clause, as sketched below.
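For illustration, a hedged sketch (untested, assuming events_source_2 exposes a next_event_date column as in the question) of how the select could continue after the with clause:
select ps.event_type,
       ps.b,
       ps.c,
       ps.event_date,
       es2.next_event_date
from past_source ps
full outer join events_source_2 es2
  on es2.event_type = ps.event_type
 and es2.b = ps.b
 and es2.c = ps.c;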
I ended up doing a two-step process: the 1st step populates the data from event table 1; the 2nd step MERGEs the data between the target (the dataset from the 1st step) and the other source. Please forgive me, but I had to obfuscate table names and omit some columns in the code below for legal reasons. Here's the SQL:
INSERT INTO EVENTS_TARGET (VEHICLE_ID,EVENT_TYPE_ID,CLIENT_ID,EVENT_DATE,CREATED_DATE)
select VEHICLE_ID, EVENT_TYPE_ID, DEALER_ID,
max(EVENT_INITIATED_DATE) EVENT_DATE, sysdate CREATED_DATE
FROM events_source_1
GROUP BY VEHICLE_ID, EVENT_TYPE_ID, DEALER_ID, sysdate;
Here's the second step:
MERGE INTO EVENTS_TARGET tgt
USING (
    SELECT ee.VEHICLE_ID,
           ee.POTENTIAL_EVENT_TYPE_ID,
           ee.CLIENT_ID,
           ee.POTENTIAL_EVENT_DATE
    FROM EVENTS_SOURCE_2 ee
    WHERE ee.POTENTIAL_EVENT_DATE > SYSDATE
) src
ON (tgt.vehicle_id = src.VEHICLE_ID
    AND tgt.client_id = src.CLIENT_ID
    AND tgt.EVENT_TYPE_ID = src.POTENTIAL_EVENT_TYPE_ID)
WHEN MATCHED THEN
    UPDATE SET tgt.NEXT_EVENT_DATE = src.POTENTIAL_EVENT_DATE
WHEN NOT MATCHED THEN
    INSERT (tgt.VEHICLE_ID, tgt.EVENT_TYPE_ID, tgt.CLIENT_ID, tgt.NEXT_EVENT_DATE, tgt.CREATED_DATE)
    VALUES (src.VEHICLE_ID, src.POTENTIAL_EVENT_TYPE_ID, src.CLIENT_ID, src.POTENTIAL_EVENT_DATE, SYSDATE);
