ClickHouse - Latest Record

We have almost 1B records in a ReplicatedMergeTree table.
The primary key is (a, b, c).
Our app writes to this table on every user action (we accumulate almost a million records per hour).
We store the latest timestamp (updated_at) for each unique combination of (a, b).
The key requirement is to provide a roll-up against the latest timestamp for a given combination of a, b, c.
Currently, we run the query as:
select a,b,c, sum(x), sum(y)...etc
from table_1
where (a,b,updated_at) in (select a,b,max(updated_at) from table_1 group by a,b)
and c in (...)
group by a,b,c
Clarification on the sub-query:
(select a,b,max(updated_at) from table_1 group by a,b)
^ This part is for illustration only. Our app writes the latest updated_at for every (a, b), so the clause shown above is really more like
(select a,b,updated_at from tab_1_summary)
[where tab_1_summary holds the latest record for a given (a, b)]
Note: We have to keep the grouping criteria as-is.
The table is partitioned by (c) and ordered by (a, b, updated_at).
The question is: is there a way to write a better query, one that returns results faster? We need to shave a few seconds off the overall processing time.
FYI: We experimented with a materialized view on ReplicatedReplacingMergeTree. But given the size of this table and the constant inserts, the FINAL clause doesn't perform well compared to the query above.
Thanks in advance!

As a test, try using a join instead of the tuple IN (subquery) filter:
select t.a, t.b, t.c, sum(x), sum(y)...etc
from table_1 AS t inner join tab_1_summary using (a, b, updated_at)
where c in (...)
group by t.a, t.b, t.c
Consider using AggregatingMergeTree to pre-calculate result metrics:
CREATE MATERIALIZED VIEW table_1_mv
ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(updated_at)
ORDER BY (updated_at, a, b, c)
AS SELECT
updated_at,
a,b,c,
sum(x) AS x, /* see [SimpleAggregateFunction data type](https://clickhouse.tech/docs/en/sql-reference/data-types/simpleaggregatefunction/) */
sum(y) AS y,
/* For non-simple functions should be used [AggregateFunction data type](https://clickhouse.tech/docs/en/sql-reference/data-types/aggregatefunction/). */
/* etc. */
FROM table_1
GROUP BY updated_at, a, b, c;
Then query the view like this to get the result:
select a,b,c, sum(x), sum(y)...etc
from table_1_mv
where (updated_at,a,b) in (select updated_at,a,b from tab_1_summary)
and c in (...)
group by a,b,c
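To sanity-check that the join form returns the same roll-up as the tuple IN form before trying it on the real cluster, here is a minimal sketch. It uses Python's built-in sqlite3 as a stand-in for ClickHouse (so it says nothing about performance), and the sample rows and the latest_rollup helper are made up for illustration. Only table_1, tab_1_summary and the column names come from the question.

```python
import sqlite3

# In-memory SQLite database standing in for ClickHouse (assumption: logic check only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_1 (a TEXT, b TEXT, c TEXT, updated_at INTEGER, x INTEGER, y INTEGER);
CREATE TABLE tab_1_summary (a TEXT, b TEXT, updated_at INTEGER);
""")

# Hypothetical sample data: each (a, b) has one stale and one latest batch.
rows = [
    ("a1", "b1", "c1", 1, 10, 100),  # stale batch for (a1, b1)
    ("a1", "b1", "c1", 2, 20, 200),  # latest batch for (a1, b1)
    ("a1", "b1", "c2", 2, 5, 50),    # latest batch, different c
    ("a2", "b1", "c1", 6, 99, 99),   # stale batch for (a2, b1)
    ("a2", "b1", "c1", 7, 1, 2),     # latest batch for (a2, b1)
    ("a2", "b1", "c1", 7, 3, 4),     # latest batch for (a2, b1)
]
conn.executemany("INSERT INTO table_1 VALUES (?, ?, ?, ?, ?, ?)", rows)

# The app maintains this table in the question; here it is derived once.
conn.execute(
    "INSERT INTO tab_1_summary "
    "SELECT a, b, MAX(updated_at) FROM table_1 GROUP BY a, b"
)


def latest_rollup(conn, c_values):
    """Roll up x and y over the latest updated_at per (a, b), via a join."""
    placeholders = ",".join("?" for _ in c_values)
    sql = f"""
        SELECT t.a, t.b, t.c, SUM(t.x), SUM(t.y)
        FROM table_1 AS t
        INNER JOIN tab_1_summary AS s
            ON s.a = t.a AND s.b = t.b AND s.updated_at = t.updated_at
        WHERE t.c IN ({placeholders})
        GROUP BY t.a, t.b, t.c
        ORDER BY t.a, t.b, t.c
    """
    return conn.execute(sql, list(c_values)).fetchall()


result = latest_rollup(conn, ["c1", "c2"])
print(result)
```

Only rows carrying the latest updated_at for their (a, b) contribute to the sums; the stale batches are filtered out by the join.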

Related

Oracle maximum number of expression issue

I have this query:
@Query("SELECT b from BillInfoDetails b where b.masterAcctCode in :masterAccountList and b.msisdn in :msisdnList")
List<BillInfoDetails> findAllByMsisdnAndMasterAcctList(@Param("masterAccountList") List<String> masterAccountList, @Param("msisdnList") List<String> msisdnList);
and it fails with this error: ORA-01795: maximum number of expressions in a list is 1000.
There is a manual workaround: split the list into chunks of 1 to 999, 1000 to 1999, and so on. But that is not a good fit for me, because msisdnList could contain 1500, 18000, or even more values. I want a dynamic solution that works properly no matter how many values are passed.
One option is to store all those values in a table; then you'd be able to use it in a join (or a subquery). For example:
select b.*
from billinfodetails b join new_table n on n.masteracctcode = b.masteracctcode
or
select b.*
from billinfodetails b
where b.masteracctcode in (select n.masteracctcode
from new_table n)
or
select b.*
from billinfodetails b
where exists (select null
from new_table n
where n.masteracctcode = b.masteracctcode)
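A rough sketch of the "load the values into a table, then join" idea, for a list of any size. It runs against Python's built-in sqlite3 rather than Oracle, and the staging table msisdn_filter, the helper find_by_msisdns and the sample numbers are all hypothetical. On Oracle the staging table would more likely be something like a global temporary table.

```python
import sqlite3

# SQLite stands in for Oracle here (assumption: the pattern is what matters).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE billinfodetails (
    id INTEGER PRIMARY KEY,
    masteracctcode TEXT,
    msisdn TEXT
);
-- Hypothetical staging table that replaces the long IN (...) list.
CREATE TEMP TABLE msisdn_filter (msisdn TEXT PRIMARY KEY);
""")

# Hypothetical data: 3000 bill rows with distinct msisdn values.
conn.executemany(
    "INSERT INTO billinfodetails (masteracctcode, msisdn) VALUES (?, ?)",
    [("M1", f"88017{i:06d}") for i in range(3000)],
)


def find_by_msisdns(conn, msisdns):
    """Stage the values in a table and join, so no 1000-item limit applies."""
    conn.execute("DELETE FROM msisdn_filter")
    conn.executemany(
        "INSERT OR IGNORE INTO msisdn_filter (msisdn) VALUES (?)",
        ((m,) for m in msisdns),
    )
    return conn.execute("""
        SELECT b.id, b.masteracctcode, b.msisdn
        FROM billinfodetails b
        JOIN msisdn_filter f ON f.msisdn = b.msisdn
        ORDER BY b.id
    """).fetchall()


wanted = [f"88017{i:06d}" for i in range(0, 3000, 2)]  # 1500 values, over the limit
matches = find_by_msisdns(conn, wanted)
print(len(matches))
```

The same helper works unchanged for 1500 or 18000 values, since the list never appears as literal expressions in the SQL text.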

check for data completeness oracle etl

I am new to Oracle and would like to know how to check for a complete data load and validate boundary values as part of the ETL testing process. (The two tables could be T1 and T2.) Please share a sample query.
Thanks, Santosh
To verify data completeness, perform the following validations:
• Ensure that all expected data is loaded into the target table.
• Compare record counts between source and target.
• Check for any rejected records.
• Check that data is not truncated in the columns of the target tables.
You can write simple MINUS queries to check this:
(Select [column1], [column2]…,[column n] from t1
Minus
Select [column1], [column2]…,[column n] from t2)
Union
(Select [column1], [column2]…,[column n] from t2
Minus
Select [column1], [column2]…,[column n] from t1)
To check boundary values:
Select * FROM t1 WHERE id NOT BETWEEN x AND y;
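The checks above can be tried out end to end with a small sketch. It uses Python's built-in sqlite3, where EXCEPT plays the role of Oracle's MINUS. The columns (id, name), the sample rows and the boundary 1 to 100 are assumptions made up for the example.

```python
import sqlite3

# SQLite stand-in for Oracle: EXCEPT here corresponds to MINUS in Oracle.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (id INTEGER, name TEXT);  -- source
CREATE TABLE t2 (id INTEGER, name TEXT);  -- target
INSERT INTO t1 VALUES (1, 'alpha'), (2, 'beta'), (3, 'gamma'), (250, 'delta');
-- Target is missing id 250 and has a truncated name for id 2.
INSERT INTO t2 VALUES (1, 'alpha'), (2, 'bet'), (3, 'gamma');
""")

# 1. Compare record counts between source and target.
source_count = conn.execute("SELECT COUNT(*) FROM t1").fetchone()[0]
target_count = conn.execute("SELECT COUNT(*) FROM t2").fetchone()[0]

# 2. Symmetric difference: rows present on one side only.
differences = conn.execute("""
    SELECT 't1 only' AS side, id, name
    FROM (SELECT id, name FROM t1 EXCEPT SELECT id, name FROM t2)
    UNION ALL
    SELECT 't2 only', id, name
    FROM (SELECT id, name FROM t2 EXCEPT SELECT id, name FROM t1)
    ORDER BY side, id
""").fetchall()

# 3. Boundary check: ids outside the expected range (1..100 is an assumption).
out_of_range = conn.execute(
    "SELECT id FROM t1 WHERE id NOT BETWEEN ? AND ?", (1, 100)
).fetchall()

print(source_count, target_count)
print(differences)
print(out_of_range)
```

Tagging each half with a side label makes it clear whether a row is missing from the target or unexpected in it; an empty differences list means the two tables match on the compared columns.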

select statement from a table ONLY if some of the fields were updated ORACLE

Can anyone explain how I can write a select statement that fetches data from a table, but only if particular fields were updated? Let's say I have:
select a, b, c, d, e, f
from table1 t1
inner join table2 t2
on t1.a = t2.a
I'm interested in whether columns d, e, f were updated since, let's say, yesterday. If they were, I want to include the row in my select statement; if d, e, f were not updated since yesterday, the row should be ignored. In table1 I have a date field for when the data was inserted (date_created) and a date field for when it was updated (date_modified). The tricky bit is that data in table1 might be updated by users during the day, but not necessarily in fields d, e, f; a user might simply update columns a, b, c, and date_modified will still show that the row has been updated. So I cannot rely purely on the date_modified column. My question is: is there any other way to filter the data and get the correct rows in return? Triggers and stored procedures are not an option; ideally pure SQL. Any help?
It's unclear which columns belong to which table but one solution is to use a flashback query (provided you have sufficient undo retention to accommodate the 24 hour difference between queries).
An example of finding the differences on a table where columns d, e or f have changed from their value 24 hours ago is:
SELECT t.*
FROM table_name t
INNER JOIN
(
SELECT *
FROM table_name
AS OF TIMESTAMP SYSTIMESTAMP - INTERVAL '1' DAY
) p
ON ( t.a = p.a
AND ( t.d <> p.d OR t.e <> p.e OR t.f <> p.f ) );
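The comparison logic in that query can be tried without a flashback-capable database. Below is a minimal sketch in Python's built-in sqlite3, where a hypothetical snapshot table table_name_yesterday stands in for the AS OF TIMESTAMP subquery; the sample rows are made up. One deliberate difference: it uses SQLite's null-safe IS NOT instead of <>, because a plain <> comparison yields NULL (not true) when one side is NULL, so a change from NULL to a value would be skipped.

```python
import sqlite3

# SQLite stand-in: a snapshot table plays the role of the flashback subquery.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_name (
    a INTEGER PRIMARY KEY, b TEXT, d INTEGER, e INTEGER, f INTEGER
);
-- Hypothetical copy of the table as it looked 24 hours ago.
CREATE TABLE table_name_yesterday (
    a INTEGER PRIMARY KEY, b TEXT, d INTEGER, e INTEGER, f INTEGER
);
INSERT INTO table_name_yesterday VALUES
    (1, 'x', 10, 20, 30),
    (2, 'y', 10, 20, 30),
    (3, 'z', 10, 20, NULL);
INSERT INTO table_name VALUES
    (1, 'x2', 10, 20, 30),   -- only b changed: should be ignored
    (2, 'y',  11, 20, 30),   -- d changed: should be returned
    (3, 'z',  10, 20, 5);    -- f changed from NULL: should be returned
""")

# Rows whose d, e or f differ from the snapshot (IS NOT is null-safe in SQLite).
changed = conn.execute("""
    SELECT t.a
    FROM table_name t
    INNER JOIN table_name_yesterday p ON t.a = p.a
    WHERE t.d IS NOT p.d OR t.e IS NOT p.e OR t.f IS NOT p.f
    ORDER BY t.a
""").fetchall()

# Same filter with plain <>, to show the NULL case being missed.
changed_plain = conn.execute("""
    SELECT t.a
    FROM table_name t
    INNER JOIN table_name_yesterday p ON t.a = p.a
    WHERE t.d <> p.d OR t.e <> p.e OR t.f <> p.f
    ORDER BY t.a
""").fetchall()

print(changed, changed_plain)
```

Row 1 is excluded even though it was modified, because the change touched only column b.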
Solved! Solution: add an extra column (let's say Total) to the target table holding the sum of columns d, e, f, and populate it once up front. After that, if d, e, or f change during the day, their sum will differ from the Total column, and you can simply filter on that in the where clause.
Maybe it is not the most elegant solution, but it does the job.
Thanks for your ideas!

How to create select SQL statement that would produce "merged" dataset from two tables(Oracle DBMS)?

Here's my original question:
merging two data sets
Unfortunately I omitted some intricacies that I'd like to elaborate on here.
So I have two tables, events_source_1 and events_source_2. I have to combine the data from those tables into a resulting dataset (which I'll then insert into a third table, but that's irrelevant).
events_source_1 contains historic event data, and I have to get the most recent event. For that I'm doing the following:
select event_type, b, c, max(event_date) event_date, null next_event_date
from events_source_1
group by event_type, b, c
events_source_2 contains the future event data, and I have to do the following:
select event_type,b,c,null event_date, next_event_date
from events_source_2
where b>sysdate;
How do I write an outer join to fill the void, i.e. when the same event_type, b, c is found in events_source_2, next_event_date should be filled with the first date found?
Thanks a lot in advance for your help.
Hope I got your question right. This should return the latest event_date of events_source_1 per event_type, b, c and add the lowest event_date of events_source_2.
Select es1.event_type, es1.b, es1.c,
Max(es1.event_date) As event_date,
Min(es2.event_date) As next_event_date
From events_source_1 es1
Left Join events_source_2 es2 On ( es2.event_type = es1.event_type
And es2.b = es1.b
And es2.c = es1.c
)
Group By es1.event_type, es1.b, es1.c
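Here is a small runnable sketch of that left-join-plus-aggregate shape, using Python's built-in sqlite3 instead of Oracle. A few things are assumptions on my part: the date column in events_source_2 is called next_event_date (as in the question), dates are stored as ISO strings, "today" is passed in as a fixed parameter instead of sysdate so the result is reproducible, and the sample rows are invented.

```python
import sqlite3

# SQLite stand-in for Oracle; ISO date strings compare correctly as text.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events_source_1 (event_type TEXT, b TEXT, c TEXT, event_date TEXT);
CREATE TABLE events_source_2 (event_type TEXT, b TEXT, c TEXT, next_event_date TEXT);
INSERT INTO events_source_1 VALUES
    ('service', 'v1', 'd1', '2020-01-10'),
    ('service', 'v1', 'd1', '2020-03-05'),
    ('recall',  'v2', 'd1', '2020-02-01');
INSERT INTO events_source_2 VALUES
    ('service', 'v1', 'd1', '2030-06-01'),
    ('service', 'v1', 'd1', '2030-01-15'),
    ('service', 'v1', 'd1', '2019-12-31');  -- already in the past
""")

today = "2025-01-01"  # fixed stand-in for sysdate (assumption)

# Latest past event per (event_type, b, c), plus the earliest future event if any.
merged = conn.execute("""
    SELECT es1.event_type, es1.b, es1.c,
           MAX(es1.event_date)      AS event_date,
           MIN(es2.next_event_date) AS next_event_date
    FROM events_source_1 es1
    LEFT JOIN events_source_2 es2
           ON es2.event_type = es1.event_type
          AND es2.b = es1.b
          AND es2.c = es1.c
          AND es2.next_event_date > ?
    GROUP BY es1.event_type, es1.b, es1.c
    ORDER BY es1.event_type
""", (today,)).fetchall()

print(merged)
```

Putting the future-date filter in the ON clause rather than in WHERE keeps groups that have no upcoming event; they simply get NULL (None) for next_event_date.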
You could turn the query that selects the max with a group by into a virtual table (a common table expression), and then do the full outer join as I described in my answer to the prior question.
Add something like this to the top of the query:
with past_source as (
select event_type, b, c, max(event_date) as event_date
from events_source_1
group by event_type, b, c
)
Then you can use past_source as if it were an actual table, and continue your select right after the closing parenthesis of the with clause.
I ended up with a two-step process: the first step populates the data from event table 1, and the second step MERGEs the data between the target (the dataset from the first step) and the other source. Please forgive me, but I had to obfuscate table names and omit some columns in the code below for legal reasons. Here's the SQL:
INSERT INTO EVENTS_TARGET (VEHICLE_ID,EVENT_TYPE_ID,CLIENT_ID,EVENT_DATE,CREATED_DATE)
select VEHICLE_ID, EVENT_TYPE_ID, DEALER_ID,
max(EVENT_INITIATED_DATE) EVENT_DATE, sysdate CREATED_DATE
FROM events_source_1
GROUP BY VEHICLE_ID, EVENT_TYPE_ID, DEALER_ID, sysdate;
Here's the second step:
MERGE INTO EVENTS_TARGET tgt
USING (
SELECT ee.VEHICLE_ID VEHICLE_ID, ee.POTENTIAL_EVENT_TYPE_ID POTENTIAL_EVENT_TYPE_ID,
ee.CLIENT_ID CLIENT_ID, ee.POTENTIAL_EVENT_DATE POTENTIAL_EVENT_DATE
FROM EVENTS_SOURCE_2 ee
WHERE ee.POTENTIAL_EVENT_DATE > SYSDATE) src
ON (tgt.vehicle_id = src.VEHICLE_ID AND tgt.client_id=src.client_id AND tgt.EVENT_TYPE_ID=src.POTENTIAL_EVENT_TYPE_ID)
WHEN MATCHED THEN
UPDATE SET tgt.NEXT_EVENT_DATE=src.POTENTIAL_EVENT_DATE
WHEN NOT MATCHED THEN
insert (tgt.VEHICLE_ID,tgt.EVENT_TYPE_ID,tgt.CLIENT_ID,tgt.NEXT_EVENT_DATE,tgt.CREATED_DATE) VALUES (src.VEHICLE_ID, src.POTENTIAL_EVENT_TYPE_ID, src.CLIENT_ID, src.POTENTIAL_EVENT_DATE, SYSDATE)
;
