Add next unique value to SQL column - performance

I have two tables which I am trying to join based on two criteria. One of the criteria is that a date from t1 is between a date in t2 and the next date in t2. The other is that the name from t1 matches the name from t2.
I.e. if t2 looks like this:
Record  Name   Date
1       A1234  2016-01-03 04:58:00
2       A1234  2015-12-15 08:34:00
3       A5678  2016-01-04 03:14:00
4       A1234  2016-01-05 21:06:00
Then:
Any records from t1 for Name A1234 with dates between 2016-01-03 04:58:00 and 2016-01-05 21:06:00 would be joined to record 1.
Any records from t1 for Name A1234 with dates between 2015-12-15 08:34:00 and 2016-01-03 04:58:00 would be joined to record 2.
Any records from t1 for Name A1234 after the date of record 4 would be joined to record 4.
Any records from t1 for A5678 would be joined to record 3 because there's only one date.
My initial approach is to use a correlated subquery to find the next date. However, due to a large number of records, I determined this would take over a year to execute because it searches all of t2 for the next later date during each iteration. Original SQLite:
CREATE TABLE outputtable AS SELECT * FROM t1, t2 d
WHERE t1.Name = d.Name AND t1.Date BETWEEN d.Date AND (
SELECT * FROM (
SELECT Date from t2
WHERE t2.Name = d.Name
ORDER BY Date ASC )
WHERE Date > d.Date
LIMIT 1 )
Now, I would like to find the next date only once for all records in t2 and create a new column in t2 that contains the next date. This way, I only search for the next date about 400,000 times instead of 56 billion times, significantly improving my performance.
Thus the output of the query I'm looking for would make t2 look like this:
Record  Name   Date                 Next_Date
1       A1234  2016-01-03 04:58:00  2016-01-05 21:06:00
2       A1234  2015-12-15 08:34:00  2016-01-03 04:58:00
3       A5678  2016-01-04 03:14:00  2999-12-31 23:59:59
4       A1234  2016-01-05 21:06:00  2999-12-31 23:59:59
Then I would be able to simply query whether t1.Date is between t2.Date and t2.Next_Date.
How can I build a query that will add the next date to a new column in t2?

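Rather than adding the new column, you should be able to join the tables directly with a query like the one below:
SELECT
T1.*,
T2_1.*
FROM
T1
-- T2_1: a T2 row for the same name at or before the T1 date
INNER JOIN T2 T2_1 ON
T2_1.Name = T1.Name AND
T2_1.some_date <= T1.some_date
-- T2_2: a later T2 row for the same name
LEFT OUTER JOIN T2 T2_2 ON
T2_2.Name = T1.Name AND
T2_2.some_date > T2_1.some_date
-- T2_3: any T2 row strictly between T2_1 and T2_2
LEFT OUTER JOIN T2 T2_3 ON
T2_3.Name = T1.Name AND
T2_3.some_date > T2_1.some_date AND
T2_3.some_date < T2_2.some_date
WHERE
-- no row in between, so T2_2 is the row immediately after T2_1 (or NULL)
T2_3.Name IS NULL AND
-- keep only the T2_1 whose next row is after the T1 date; without this,
-- every earlier T2 row would match
(T2_2.Name IS NULL OR T2_2.some_date > T1.some_date)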
You can do the same with NOT EXISTS, but this method often has better performance.

You can speed up (sub)queries by using proper indexes.
To check which indexes are actually used, use EXPLAIN QUERY PLAN.
Your original query, without any indexes, would be executed by SQLite 3.10.0 like this:
0|0|0|SCAN TABLE t1
0|1|1|SEARCH TABLE t2 AS d USING AUTOMATIC COVERING INDEX (name=?)
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 1
1|0|0|SCAN TABLE t2
1|0|0|USE TEMP B-TREE FOR ORDER BY
(The "automatic" index is created temporarily just for this query; the optimizer has estimated that this would still be faster than not using any index.)
In this case, you get the best query plan by indexing all columns used for lookups:
create index i1nd on t1(name, date);
create index i2nd on t2(name, date);
With these indexes in place, the plan becomes:
0|0|1|SCAN TABLE t2 AS d
0|1|0|SEARCH TABLE t1 USING INDEX i1nd (name=? AND date>? AND date<?)
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 1
1|0|0|SEARCH TABLE t2 USING COVERING INDEX i2nd (name=? AND date>?)

I've used this method successfully on tables with around 1 million rows. Creating an index that covers this query will also help performance.
This approach uses RANK to create a value to join against. After computing the rank in a CTE (I use one for readability; adjust for style or personal preference), join the CTE to itself on rnk = rnk + 1, which gives the next date.
Here's an example of what the code looks like using your sample values.
IF OBJECT_ID('tempdb..#T2') IS NOT NULL
DROP TABLE #T2
CREATE TABLE #T2
(
Record INT NOT NULL PRIMARY KEY,
Name VARCHAR(10),
[DATE] DATETIME,
)
INSERT INTO #T2
VALUES (1, 'A1234', '2016-01-03 04:58:00'),
(2, 'A1234', '2015-12-15 08:34:00'),
(3, 'A5678', '2016-01-04 03:14:00'),
(4, 'A1234', '2016-01-05 21:06:00');
WITH Rank_Dates
AS (Select *
,rank() OVER(PARTITION BY #t2.name ORDER BY #t2.date DESC) AS rnk
FROM #T2)
select RD1.Record,
RD1.Name,
RD1.DATE,
COALESCE (RD2.DATE, '2999-12-31 23:59:59') AS NEXT_DATE
FROM Rank_Dates RD1
LEFT JOIN Rank_Dates RD2
ON RD1.rnk = RD2.rnk + 1
AND RD1.Name = RD2.Name
ORDER BY RD1.Record -- ORDER BY is optional
;
The code above produces the following output:
Record  Name   DATE                     NEXT_DATE
1       A1234  2016-01-03 04:58:00.000  2016-01-05 21:06:00.000
2       A1234  2015-12-15 08:34:00.000  2016-01-03 04:58:00.000
3       A5678  2016-01-04 03:14:00.000  2999-12-31 23:59:59.000
4       A1234  2016-01-05 21:06:00.000  2999-12-31 23:59:59.000
As an aside: would using CURRENT_TIMESTAMP in place of the hard-coded '2999-12-31 23:59:59.000' produce a similar result?
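Since the question targets SQLite, the same Next_Date column can also be sketched with the LEAD window function instead of a RANK self-join. This is a minimal sketch, not the approach from the answer above: it assumes SQLite 3.25 or later (window function support), runs through Python's built-in sqlite3 module against an in-memory database, and reuses the table and column names from the question.

```python
import sqlite3

# Assumption: the bundled SQLite is 3.25+ so that window functions are available.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t2 (Record INTEGER PRIMARY KEY, Name TEXT, Date TEXT);
    INSERT INTO t2 VALUES
        (1, 'A1234', '2016-01-03 04:58:00'),
        (2, 'A1234', '2015-12-15 08:34:00'),
        (3, 'A5678', '2016-01-04 03:14:00'),
        (4, 'A1234', '2016-01-05 21:06:00');
""")

# LEAD(Date, 1, default) returns the next Date within the same Name,
# or the sentinel value when the row is the last one for that Name.
rows = conn.execute("""
    SELECT Record, Name, Date,
           LEAD(Date, 1, '2999-12-31 23:59:59')
               OVER (PARTITION BY Name ORDER BY Date) AS Next_Date
    FROM t2
    ORDER BY Record
""").fetchall()

for row in rows:
    print(row)
```

The result could be stored with CREATE TABLE ... AS SELECT, so the next date is computed once per t2 row rather than once per joined pair.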

Related

ClickHouse correlated subquery

I have a table in which each date holds a data snapshot covering roughly 15 days back and 50 days ahead. If I query the table with WHERE date = '2022-01-27', I get rows with block_date from 2022-01-13 to 2022-03-18. I need to build reports per month, so at date = '2022-01-27' I need to see rows with block_date from 2022-01-01 to 2022-01-31. I can find this missing information if I go back a day.
I tried to run the query below to achieve this, but I got an error, and according to the docs (https://clickhouse.com/docs/en/sql-reference/operators/exists/) I can't use EXISTS here. How can I modify my query? I don't have any ideas.
SELECT
`date`,
block_date,
total_plan_volume_grp20
FROM schema.table
UNION ALL
SELECT
addDays(`date`, 1),
block_date,
total_plan_volume_grp20
FROM schema.table t1
where exists (
select true
from schema.table t2
where t2.`date` = addDays(t1.`date`, 1))
UNION ALL
SELECT
addDays(`date`, 2),
block_date,
total_plan_volume_grp20
FROM schema.table t1
where exists (
select true
from schema.table t2
where t2.`date` = addDays(t1.`date`, 2))

I/O issue with PowerCenter Informatica in Oracle

I have two tables in Oracle and I have to synchronize values (Field column) between the tables. I'm using Informatica PowerCenter for this synchronization operation. The source qualifier query causes high I/O usage and I need to solve it.
Table1
Table1 has about 20M rows. Field in Table1 is the actual (authoritative) value. The Timestamp field holds the create/update date, and it has daily partitions.
Id  Field  Timestamp
1   A      2017-05-12 03:13:40
2   B      2002-11-01 07:30:46
3   C      2008-03-03 03:26:29
Table2
Table2 has about 500M rows. Field in Table2 should be kept as closely in sync as possible with Field in Table1. The Timestamp field holds the create/update date, and it has daily partitions. Table2 is also the target in the mapping.
Id   Table1_Id  Field  Timestamp            Action
100  1          A      2005-09-30 03:20:41  Nothing
101  1          B      2015-06-29 09:41:44  Update Field as A
102  1          C      2016-01-10 23:35:49  Update Field as A
103  2          A      2019-05-08 07:42:46  Update Field as B
104  2          B      2003-06-02 11:23:57  Nothing
105  2          C      2021-09-21 12:04:24  Update Field as B
106  3          A      2022-01-23 01:17:18  Update Field as C
107  3          B      2008-04-24 15:17:25  Update Field as C
108  3          C      2010-01-15 07:20:13  Nothing
Mapping Queries
Source Qualifier Query
SELECT *
FROM Table1 t1, Table2 t2
WHERE t1.Id = t2.Table1_Id AND t1.Field <> t2.Field
Update Transformation Query
UPDATE Table2
SET
Field = :tu.Field,
Timestamp = SYSDATE
WHERE Id = :tu.Id
You can use the approach below.
Source qualifier: your SQL is correct, and you can keep it if it works for you, but add a <> clause on the partition date key column. You can also use this SQL to speed it up:
SELECT *
FROM Table2 t2
INNER JOIN Table1 t3 ON t3.Id = t2.Table1_Id
-- partition_date is an assumption: you did not mention it, but I assume there is a separate column used for partitioning.
LEFT OUTER JOIN Table1 t1 ON t1.Id = t2.Table1_Id AND t1.Field = t2.Field AND t1.partition_date = t2.partition_date
WHERE t1.Id IS NULL -- anti-join, because <> is inefficient.
Then, in your Informatica target definition for Table2, make sure partition_date is part of the key along with Id.
Then use an update strategy set to DD_UPDATE. You can set the session to update as well.
Also remove the target override. It applies the update query to the whole table, which can be inefficient and I/O intensive.
Informatica is good at updating data in batches through an update strategy. You can increase the commit interval to suit your performance needs.
You shouldn't try to update a 500M-row table in a single pass using SQL. You can, however, use PL/SQL to update in batches.

Oracle: Update values in table with aggregated values from same table

I am looking for a possibly better approach to this.
I have created a temp table in Oracle 11.2 that I'm using to pre calculate values that I will need in other selects instead of always generating them again with each select.
create global temporary table temp_foo (
DT timestamp(6), --only the date part will be used in this example but for later things I will need the time
Something varchar2(100),
Customer varchar2(100),
MinDate timestamp(6),
MaxDate timestamp(6),
Filecount int,
Errorcount int,
AvgFilecount int,
constraint PK_foo primary key (DT, Customer)
) on commit preserve rows;
I then first insert some fixed values for everything except AvgFilecount. AvgFilecount should contain the average for the Filecount for the 3 previous records (going by the date in DT). It doesn’t matter that the result will be converted to an int, I don’t need the decimal places
DT | Customer | Filecount | AvgFilecount
2019-04-30 | x | 10 | avg(2+3+9)
2019-04-29 | x | 2 | based on values before this
2019-04-28 | x | 3 | based on values before this
2019-04-27 | x | 9 | based on values before this
I thought about using a normal UPDATE statement, as this should be faster than looping through the values. I should mention that there are no gaps in the DT field, but obviously there is a first record for which I won’t find any previous records. If I looped through, I could easily calculate AvgFilecount with (the record before previous record/2 + previous record)/3, which I cannot do with UPDATE because I cannot guarantee the order in which the rows are processed. So I’m fine with just taking the last 3 records (going by DT) and calculating it from there.
What I thought would be an easy update is giving me headaches. I mostly work with SQL Server, where I would just join the 3 other records, but it seems to be a bit different in Oracle. I found https://stackoverflow.com/a/2446834/4040068 and wanted to use the second approach in the answer.
update
(select curr.DT, curr.temp_foo, curr.Filecount, curr.AvgFilecount as OLD, (coalesce(Minus1.Filecount, 0) + coalesce(Minus2.Filecount, 0) + coalesce(Minus3.Filecount, 0)) / 3 as NEW
from temp_foo curr
left join temp_foo Minus1 ON Minus1.Customer = curr.Customer and trunc(Minus1.DT) = trunc(curr.DT-1)
left join temp_foo Minus2 ON Minus2.Customer = curr.Customer and trunc(Minus2.DT) = trunc(curr.DT-2)
left join temp_foo Minus3 ON Minus3.Customer = curr.Customer and trunc(Minus3.DT) = curr.DT-3
order by 1, 2
)
set OLD = NEW;
Which gives me an error:
ORA-01779: cannot modify a column which maps to a non key-preserved table
01779. 00000 - "cannot modify a column which maps to a non key-preserved table"
*Cause: An attempt was made to insert or update columns of a join view which
map to a non-key-preserved table.
*Action: Modify the underlying base tables directly.
I thought this should work as both join conditions are in the primary key and thus unique. I am currently implementing the first approach in the above mentioned answer but it is getting quite big and it feels like there should be a better solution to this.
Other things I thought about trying:
using a nested subselect (nested because Oracle doesn’t know top(n) and I need to sort the subselect) to select the previous 3 records ordered by DT, then the outer select with rownum <= 3, and then I could just use AVG(). However, I was told subselects can be quite slow and that joins perform better in Oracle. I don't know if that is really the case; I haven’t done any testing.
Edit: My insert right now looks like this. I am already aggregating the Filecount for a day as there can be multiple records per DT per Customer per Something.
insert into temp_foo (DT, Something, Customer, Filecount)
select dates.DT, tbl1.Something, tbl1.Customer, coalesce(sum(tbl3.Filecount),0)
from table(Function_Returning_Daterange(NULL, NULL)) dates
cross join
(SELECT Something,
Code,
Value
FROM Table2 tbl2
WHERE (Something = 'Value')) tbl1
left outer join Table3 tbl3
on tbl3.Customer = tbl1.Customer
and trunc(tbl3.MinDate) = trunc(dates.DT)
group by dates.DT, tbl1.Something, tbl1.Customer;
You could use an analytic average with a window clause:
select dt, customer, filecount,
avg(filecount) over (partition by customer order by dt
rows between 3 preceding and 1 preceding) as avgfilecount
from tmp_foo
order by dt desc;
DT CUSTOMER FILECOUNT AVGFILECOUNT
---------- -------- ---------- ------------
2019-04-30 x 10 4.66666667
2019-04-29 x 2 6
2019-04-28 x 3 9
2019-04-27 x 9
and then do the update part with a merge statement:
merge into tmp_foo t
using (
select dt, customer,
avg(filecount) over (partition by customer order by dt
rows between 3 preceding and 1 preceding) as avgfilecount
from tmp_foo
) s
on (s.dt = t.dt and s.customer = t.customer)
when matched then update set t.avgfilecount = s.avgfilecount;
4 rows merged.
select dt, customer, filecount, avgfilecount
from tmp_foo
order by dt desc;
DT CUSTOMER FILECOUNT AVGFILECOUNT
---------- -------- ---------- ------------
2019-04-30 x 10 4.66666667
2019-04-29 x 2 6
2019-04-28 x 3 9
2019-04-27 x 9
You haven't shown your original insert statement; it might be possible to add the analytic calculation to that, and avoid the separate update step.
Also, if you want the first two date values to be calculated as if the 'missing' extra days before them had zero counts, you could use sum and division instead of avg:
select dt, customer, filecount,
sum(filecount) over (partition by customer order by dt
rows between 3 preceding and 1 preceding)/3 as avgfilecount
from tmp_foo
order by dt desc;
DT CUSTOMER FILECOUNT AVGFILECOUNT
---------- -------- ---------- ------------
2019-04-30 x 10 4.66666667
2019-04-29 x 2 4
2019-04-28 x 3 3
2019-04-27 x 9
It depends what you expect those last calculated values to be.
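The difference between the two window calculations above can be checked on the sample data with a small runnable sketch. It uses SQLite through Python's sqlite3 module rather than Oracle (an assumption made purely so the example is self-contained; it needs SQLite 3.25+ for window frames), and the column aliases avg_prev and sum_div_3 are illustrative names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tmp_foo (dt TEXT, customer TEXT, filecount INTEGER,
                          PRIMARY KEY (dt, customer));
    INSERT INTO tmp_foo VALUES
        ('2019-04-27', 'x', 9), ('2019-04-28', 'x', 3),
        ('2019-04-29', 'x', 2), ('2019-04-30', 'x', 10);
""")

# The frame covers the three rows before the current one, excluding it.
# AVG divides by the number of rows actually in the frame;
# SUM / 3.0 treats missing earlier days as zero counts.
avg_rows = conn.execute("""
    SELECT dt,
           AVG(filecount) OVER w AS avg_prev,
           SUM(filecount) OVER w / 3.0 AS sum_div_3
    FROM tmp_foo
    WINDOW w AS (PARTITION BY customer ORDER BY dt
                 ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING)
    ORDER BY dt DESC
""").fetchall()

for row in avg_rows:
    print(row)
```

For 2019-04-29 the frame holds only two rows (3 and 9), so AVG gives 6 while SUM / 3 gives 4; the earliest date has an empty frame and both columns are NULL.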

Aggregate only new rows from source table

I have a source table with a timestamp column (YYYY.MM.DD HH24:MI:SS) and a target table with rows aggregated on a daily basis (date column: YYYY.MM.DD).
My problem is: how do I bring new data from the source into the target and aggregate it?
I tried:
select
a.Sales,
trunc(a.timestamp,'DD') as TIMESTAMP,
count(1) as COUNT
from
tbl_Source a
where trunc(a.timestamp,'DD') > nvl((select MAX(b.TIME_TO_DAY) from tbl_target b), to_date('01.01.1975 00:00:00','dd.mm.yyyy hh24:mi:ss'))
group by a.sales,
trunc(a.Timestamp,'DD')
The problem with that is: when I have a row with timestamp '2013.11.15 00:01:32' and the max day in the target is the 14th of November, it will only aggregate the 15th. If I used >= instead of >, some rows would get loaded twice.
It looks like you are looking for a merge statement: If the day is already present in tbl_target then update the count else insert the record.
merge into tbl_target dest
using
(
select sales, trunc(timestamp) as theday , count(*) as sales_count
from tbl_Source
where trunc(timestamp) >= ( select nvl(max(time_to_day),to_date('01.01.1975','dd.mm.yyyy')) from tbl_target )
group by sales, trunc(timestamp)
) src
on (src.theday = dest.time_to_day)
when matched then update set
dest.sales_count = src.sales_count
when not matched then
insert (time_to_day, sales_count)
values (src.theday, src.sales_count)
;
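The idea behind the MERGE above (re-aggregate from the latest loaded day onward, then update existing days and insert new ones) can be tried out in a small runnable sketch. It is not Oracle: it swaps in SQLite's INSERT ... ON CONFLICT DO UPDATE (SQLite 3.24+) through Python's sqlite3 module, and it assumes the target holds one row per (day, sales) pair. The column name ts and the helper load are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl_source (sales TEXT, ts TEXT);
    -- Assumption: one target row per (day, sales) pair.
    CREATE TABLE tbl_target (time_to_day TEXT, sales TEXT, sales_count INTEGER,
                             PRIMARY KEY (time_to_day, sales));
""")

def load(conn):
    # Re-aggregate from the latest loaded day onward (>=) and upsert,
    # so a partially loaded day is corrected instead of being duplicated.
    conn.execute("""
        INSERT INTO tbl_target (time_to_day, sales, sales_count)
        SELECT date(ts), sales, COUNT(*)
        FROM tbl_source
        WHERE date(ts) >= COALESCE((SELECT MAX(time_to_day) FROM tbl_target),
                                   '1975-01-01')
        GROUP BY date(ts), sales
        ON CONFLICT (time_to_day, sales)
        DO UPDATE SET sales_count = excluded.sales_count
    """)

conn.executemany("INSERT INTO tbl_source VALUES (?, ?)",
                 [("a", "2013-11-14 10:00:00"), ("a", "2013-11-15 00:01:32")])
load(conn)

# A row that arrives later on an already loaded day.
conn.execute("INSERT INTO tbl_source VALUES ('a', '2013-11-15 09:30:00')")
load(conn)

result = conn.execute(
    "SELECT * FROM tbl_target ORDER BY time_to_day").fetchall()
print(result)
```

Running the load twice leaves one row per day, with the count for the 15th updated to include the late row rather than loaded a second time.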
As far as I understand your question, you need to get everything since the last reload into the target table.
The problem here: you need that date, but it is truncated during the update.
If my guesses are correct, you cannot do anything except store the reload date in an additional column, because there is no way to recover it from the data presented here.
About your query:
count(*) and count(1) perform the same (shown many times, at least in versions 10 and 11), so don't write count(1); it looks really ugly
don't use nvl; use coalesce instead. coalesce short-circuits, whereas nvl always evaluates both arguments, so coalesce can be faster when the fallback expression is expensive
I would write your query like this:
with t as (select max(b.time_to_day) mx from tbl_target b)
select a.sales,trunc(a.timestamp,'dd') as timestamp,count(*) as count
from tbl_source a,t
where trunc(a.timestamp,'dd') > t.mx or t.mx is null
group by a.sales,trunc(a.timestamp,'dd')
Does this fit your needs:
WHERE trunc(a.timestamp,'DD') > nvl((select MAX(b.TIME_TO_DAY) + 1 - 1/(24*60*60) from tbl_target b), to_date('01.01.1975 00:00:00','dd.mm.yyyy hh24:mi:ss'))
i.e. instead of 2013-11-15 00:00:00, compare to 2013-11-15 23:59:59
Update:
This one?
WHERE trunc(a.timestamp,'DD') BETWEEN nvl((select MAX(b.TIME_TO_DAY) from ...) AND nvl((select MAX(b.TIME_TO_DAY) + 1 - 1/(24*60*60) from ...)

Oracle SQL: How to SELECT N records for each "group" / "cluster"

I've got a table big_table with 4 million records, clustered into 40 groups through a column called "process_type_cod". The list of values that this column may take is in a second table. Let's call it small_table.
So, we have big_table with a NOT NULL FK called process_type_cod that points to small_table (assume the column name is the same in both tables).
I want N records (e.g. 10) from big_table for each record of small_table.
I.e.
10 records from big_table related to the first record of small_table
UNION
10 different records from big_table related to the second record of small_table, and so on.
Is it possible to obtain this with a single SQL statement?
I recommend an analytical function such as rank() or row_number(). You could do this with hard-coded unions, but the analytical function does all the hard work for you.
select *
from
(
select
bt.col_a,
bt.col_b,
bt.process_type_cod,
row_number() over ( partition by process_type_cod order by col_a nulls last ) rank
from small_table st
inner join big_table bt
on st.process_type_cod = bt.process_type_cod
)
where rank < 11
;
You may not even need the join, since big_table has all of the types you care about. In that case, just change the FROM clause to use big_table and drop the join.
This runs the query and then orders the records within each partition using the ORDER BY in the OVER clause. Within a given group (here, rows sharing the same process_type_cod, ordered by col_a), a consecutive row number (1, 2, 3, ...) is assigned to each record. In the outer WHERE clause, just keep the records whose number is at most N (rank < 11 for N = 10).
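The row_number approach can be tried on toy data with the sketch below. It uses SQLite (3.25+) through Python's sqlite3 module instead of Oracle, so that it is self-contained; the three process types, the 25 rows per type, and the alias rn are made-up illustration values, and the join to small_table is dropped as suggested above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big_table (col_a INTEGER, process_type_cod TEXT)")
# Toy data: 3 hypothetical process types with 25 rows each.
conn.executemany(
    "INSERT INTO big_table VALUES (?, ?)",
    [(i, cod) for cod in ("P1", "P2", "P3") for i in range(25)],
)

N = 10
# Number the rows inside each process_type_cod partition, ordered by col_a,
# then keep only the first N rows of every partition.
top_n = conn.execute("""
    SELECT col_a, process_type_cod
    FROM (
        SELECT col_a, process_type_cod,
               ROW_NUMBER() OVER (PARTITION BY process_type_cod
                                  ORDER BY col_a) AS rn
        FROM big_table
    )
    WHERE rn <= ?
    ORDER BY process_type_cod, col_a
""", (N,)).fetchall()

print(len(top_n))
```

Unlike rank(), row_number() never assigns the same number twice within a partition, so each group yields at most N rows even when the ordering column has ties.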
