ClickHouse running diff with grouping

General Task
A table consists of three columns (time, key, value). The task is to calculate a running difference for each key.
So, from input
---------------
| time | key | value |
---------------
| 1 | A | 4 |
| 2 | B | 1 |
| 3 | A | 6 |
| 4 | A | 7 |
| 5 | B | 3 |
| 6 | B | 7 |
it is desired to get
----------------------
| key | value | delta |
----------------------
| A | 4 | 0 |
| B | 1 | 0 |
| A | 6 | 2 |
| A | 7 | 1 |
| B | 3 | 2 |
| B | 7 | 4 |
Approaches
runningDifference function. This works if the key is fixed, so we can write
select *, runningDifference(value) from
(SELECT key, value from table where key = 'A' order by time)
Note that the subquery is necessary here. This solution breaks down when you want the result for several keys at once.
groupArray.
select key, groupArray(value) from
(SELECT key, value from table order by time)
group by key
So, now we get a key and an array of elements with this key. Good.
But how to calculate a sliding difference? If we could do that, then ARRAY JOIN would lead us to a result.
Or we can even zip the array with itself and then apply lambda (we have arrayMap for that) but... we don't have any zip alternative.
Any ideas?
Thanks in advance.

Solution with arrays:
WITH
groupArray(value) as time_sorted_vals,
arrayEnumerate(time_sorted_vals) as indexes,
arrayMap( i -> time_sorted_vals[i] - time_sorted_vals[i-1], indexes) as running_diffs
SELECT
key,
running_diffs
FROM
(SELECT key, value from table order by time)
GROUP by key
Other option (doing sort inside each group separately, which is more optimal in a lot of cases)
WITH
groupArray( tuple(value,time) ) as val_time_tuples,
arraySort( x -> x.2, val_time_tuples ) as val_time_tuples_sorted,
arrayMap( t -> t.1, val_time_tuples_sorted) as time_sorted_vals,
arrayEnumerate(time_sorted_vals) as indexes,
arrayMap( i -> time_sorted_vals[i] - time_sorted_vals[i-1], indexes) as running_diffs
SELECT
key,
running_diffs
FROM
table
GROUP by key
and you can apply ARRAY JOIN on the result afterward.

Lately I've also encountered the problem and Clickhouse offers function arrayDifference.
WITH
groupArray(value) as vals,
arrayDifference(vals) as running_diffs
SELECT
key,
running_diffs
FROM
(SELECT key, value from table order by time)
GROUP by key

This question was posted years ago. As of Sep 29th, 2021, we can use arrayDifference instead of arrayMap, and ARRAY JOIN to get a tabular result instead of a nested array.
SELECT key, sorted_time, time_sorted_vals, running_diffs
FROM (
WITH
groupArray( tuple(value,time) ) as val_time_tuples,
arraySort( x -> x.2, val_time_tuples ) as val_time_tuples_sorted,
arrayMap( t -> t.1, val_time_tuples_sorted) as time_sorted_vals,
arrayMap( t -> t.2, val_time_tuples_sorted) as sorted_time,
arrayDifference(time_sorted_vals) as running_diffs
SELECT
key,
sorted_time,
time_sorted_vals,
running_diffs
FROM
table_name
GROUP by key)
ARRAY JOIN sorted_time, time_sorted_vals, running_diffs
The only restriction is that the value column should not be of nullable types.
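To make the array pipeline above easier to follow, here is a small Python model of the same steps (group per key, sort by time, take adjacent differences, flatten back to rows). It is only a sketch of the logic, not ClickHouse code; the helper names array_difference and running_diff_per_key are made up for illustration, and the data is the sample table from the question.

```python
from collections import defaultdict

# Sample rows from the question: (time, key, value).
rows = [(1, "A", 4), (2, "B", 1), (3, "A", 6), (4, "A", 7), (5, "B", 3), (6, "B", 7)]


def array_difference(values):
    """Model of arrayDifference: 0 first, then each value minus its predecessor."""
    if not values:
        return []
    return [0] + [cur - prev for prev, cur in zip(values, values[1:])]


def running_diff_per_key(rows):
    # groupArray(tuple(value, time)) per key.
    groups = defaultdict(list)
    for time, key, value in rows:
        groups[key].append((time, value))

    out = []
    for key, pairs in groups.items():
        pairs.sort()  # arraySort by time inside each group
        times = [t for t, _ in pairs]
        values = [v for _, v in pairs]
        diffs = array_difference(values)
        # ARRAY JOIN: flatten the parallel arrays back into rows.
        out.extend((t, key, v, d) for t, v, d in zip(times, values, diffs))
    return sorted(out)  # order by time, only for display


result = running_diff_per_key(rows)
for row in result:
    print(row)  # (time, key, value, delta)
```

The printed rows match the desired table in the question: the first row of each key gets delta 0, and every later row gets the difference to the previous value of the same key.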

Related

How to realize cumulative sum without built-in function?

I need to realize cumulative summing per each day.
For example my data set is as follows:
buyer | bread | date |
---------------------------
b1 | 2 | 2018-01-01|
b1 | 3 | 2018-01-02|
b1 | 1 | 2018-01-04|
b2 | 2 | 2018-01-02|
I need to get selection as follows:
buyer | cum_sum_on_01_01 | cum_sum_on_01_02 | cum_sum_on_01_03 | cum_sum_on_01_04 | cum_sum_on_01_05 |...
----------------------------------------------------------------------------------------------------------
b1 | 2 | 5 | 5 | 6 | 6 |...
b2 | 0 | 2 | 2 | 2 | 2 |...
How to do it?
Why avoid the built-in function? The only way to compute cumulative sums in ClickHouse for now is arrayCumSum, so the answer is to build the candidate array and pass it to arrayCumSum. Here are the steps:
step 1: building the bread array for each buyer
SELECT
buyer,
groupArray(bread) AS breads
FROM
(
SELECT
buyer,
sum(bread) AS bread,
date
FROM bbd
ALL RIGHT JOIN
(
WITH
toDate('2018-01-01') AS min_date,
toDate('2018-01-31') AS max_date
SELECT
arrayJoin(buyers) AS buyer,
arrayJoin(arrayMap(i -> (min_date + toIntervalDay(i)), range(toUInt64((max_date - min_date) + 1)))) AS date
FROM
(
SELECT groupUniqArray(buyer) AS buyers
FROM bbd
)
) USING (buyer, date)
GROUP BY
buyer,
date
ORDER BY
buyer ASC,
date ASC
)
GROUP BY buyer
┌─buyer─┬─breads──────────────────────────────────────────────────────────┐
│ b1 │ [2,3,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0] │
│ b2 │ [0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0] │
└───────┴─────────────────────────────────────────────────────────────────┘
step 2: apply arrayCumSum for each buyer
Replace groupArray(bread) AS breads with arrayCumSum(groupArray(bread)) AS breads
┌─buyer─┬─breads──────────────────────────────────────────────────────────┐
│ b1 │ [2,5,5,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6] │
│ b2 │ [0,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2] │
└───────┴─────────────────────────────────────────────────────────────────┘
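The two steps can be modelled in a few lines of Python, which may help when checking the expected numbers by hand. This is a sketch of the logic only, not ClickHouse code: the function name cumulative_by_day is an assumption, and the date range is cut to five days to match the columns in the question.

```python
from datetime import date, timedelta
from itertools import accumulate

# Sample purchases from the question: (buyer, bread, date).
purchases = [
    ("b1", 2, date(2018, 1, 1)),
    ("b1", 3, date(2018, 1, 2)),
    ("b1", 1, date(2018, 1, 4)),
    ("b2", 2, date(2018, 1, 2)),
]


def cumulative_by_day(purchases, min_date, max_date):
    n_days = (max_date - min_date).days + 1
    days = [min_date + timedelta(days=i) for i in range(n_days)]
    buyers = sorted({buyer for buyer, _, _ in purchases})

    # Step 1: dense (buyer, day) grid, with 0 where nothing was bought.
    # This plays the role of the RIGHT JOIN against the generated dates.
    totals = {(buyer, day): 0 for buyer in buyers for day in days}
    for buyer, bread, day in purchases:
        if (buyer, day) in totals:
            totals[(buyer, day)] += bread

    # Step 2: running sum along the day axis (the arrayCumSum step).
    return {b: list(accumulate(totals[(b, d)] for d in days)) for b in buyers}


cum = cumulative_by_day(purchases, date(2018, 1, 1), date(2018, 1, 5))
print(cum)  # {'b1': [2, 5, 5, 6, 6], 'b2': [0, 2, 2, 2, 2]}
```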
The accepted answer is excellent, and you should indeed use the built-in arrayCumSum function for computing cumulative sums. However, if one of the motivations of the original question was to find out how to create accumulate/folding style functions in general when they are not natively supported by ClickHouse (e.g., CumMax, CumMin, etc.), here is an approach that will work with any aggregate function in ClickHouse.
The core piece of logic to achieve this is to use arrayReduceInRanges and generate all tuple ranges of the form (1, 1), (1, 2), ... (1, n) with arrayMap and arrayEnumerate. Then, whichever function you choose as the higher-order aggregate function for arrayReduceInRanges, e.g. 'sum' or 'max', will be turned into a cumulative array-based form of the function. Here is what that logic looks like:
WITH arr as (SELECT groupArray(some_col) AS arr_some_col FROM some_table)
SELECT
arrayReduceInRanges(
'sum',
arrayMap(x -> (1, x), arrayEnumerate(arr_some_col)),
arr_some_col
)
FROM arr
From here, you can arrayJoin the values back out from the array or keep them in array form for further calculations.
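The prefix-range idea is independent of ClickHouse, so here is a minimal Python sketch of it: apply an aggregate to every prefix of the array, which is what the ranges (1, 1), (1, 2), ... (1, n) select. The function name cumulative is made up for illustration, and this naive version recomputes each prefix from scratch (quadratic work), so treat it as a model of the semantics rather than a statement about how ClickHouse executes it.

```python
def cumulative(agg, values):
    """Apply agg to every prefix values[:1], values[:2], ..., values[:n]."""
    return [agg(values[:n]) for n in range(1, len(values) + 1)]


daily = [2, 3, 0, 1, 0]

cum_sum = cumulative(sum, daily)  # same result as a running sum
cum_max = cumulative(max, daily)  # running maximum
cum_min = cumulative(min, daily)  # running minimum

print(cum_sum)  # [2, 5, 5, 6, 6]
print(cum_max)  # [2, 3, 3, 3, 3]
print(cum_min)  # [2, 2, 0, 0, 0]
```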
For your specific application with bread, here's something that would work using the above core logic (assuming your table is named bread_data):
WITH ordered AS (SELECT * FROM bread_data ORDER BY date, buyer),
agg AS (
SELECT
buyer,
untuple(
arrayJoin(
arrayZip(
groupArray(date),
arrayReduceInRanges(
-- 'sum' or any ClickHouse aggregate function.
'sum',
arrayMap(x -> (1, x), arrayEnumerate(groupArray(bread))),
groupArray(bread)
)
)
)
)
FROM ordered
GROUP BY buyer
)
SELECT buyer, _ut_1 AS date, _ut_2 as cum_bread
FROM agg
ORDER BY date
Notice the first WITH clause which orders the table by date and buyer so that the subsequent groupArray calls will be guaranteed to construct their arrays in the same, consistent order (ClickHouse documentation notes that otherwise any call to groupArray can construct the elements in a random order).
It may seem complex, but when you break it down using the first core logic piece and the fact that a lot of the syntax here is around array grouping and ungrouping so that we can do our main work in array space, it should hopefully make some intuitive sense.
The output will look like this:
+-------+------------+-----------+
| buyer | date | cum_bread |
+-------+------------+-----------+
| b1 | 2018-01-01 | 2 |
| b2 | 2018-01-02 | 2 |
| b1 | 2018-01-02 | 5 |
| b1 | 2018-01-04 | 6 |
+-------+------------+-----------+

Left Outer Join via a link table, using min() to restrict join to one row

I am trying to write an Oracle SQL query to join two tables that are linked via a link table (by that I mean a table with 2 columns, each a foreign key to the primary tables). A min() function is to be used to limit the results from the left outer join to a single row.
My model consists of "parents" and "nephews". Parents can have 0 or more nephews. Parents can be enabled or disabled. Each nephew has a birthday date. The goal of my query is:
Print a single row for each enabled parent, listing that parent's oldest nephew (ie the one with the min(birthday)).
My problem is illustrated here at sqlfiddle: http://sqlfiddle.com/#!4/9a3be0d/1
I can form a query that lists all of the nephews for the enabled parents, but that is not good enough- I just want one row per parent which includes just the oldest nephew. Forming the where clause to the outer table seems to be my stumbling block.
My tables and sample data:
create table parent (parent_id number primary key, parent_name varchar2(50), enabled int);
create table nephew (nephew_id number primary key, birthday date, nephew_name varchar2(50));
create table parent_nephew_link (parent_id number not null, nephew_id number not null);
parent table:
+----+-------------+---------+
| id | parent_name | enabled |
+----+-------------+---------+
| 1 | Donald | 1 |
+----+-------------+---------+
| 2 | Minnie | 0 |
+----+-------------+---------+
| 3 | Mickey | 1 |
+----+-------------+---------+
nephew table:
+-----------+------------+-------------+
| nephew_id | birthday | nephew_name |
+-----------+------------+-------------+
| 100 | 01/01/2017 | Huey |
+-----------+------------+-------------+
| 101 | 01/01/2016 | Dewey |
+-----------+------------+-------------+
| 102 | 01/01/2015 | Louie |
+-----------+------------+-------------+
| 103 | 01/01/2014 | Morty |
+-----------+------------+-------------+
| 104 | 01/01/2013 | Ferdie |
+-----------+------------+-------------+
parent_nephew_link table:
+-----------+-----------+
| parent_id | nephew_id |
+-----------+-----------+
| 1 | 100 |
+-----------+-----------+
| 1 | 101 |
+-----------+-----------+
| 1 | 102 |
+-----------+-----------+
| 3 | 103 |
+-----------+-----------+
| 3 | 104 |
+-----------+-----------+
My (not correct) query:
-- This query is not right, it returns a row for each nephew
select parent_name, nephew_name
from parent p
left outer join parent_nephew_link pnl
on p.parent_id = pnl.parent_id
left outer join nephew n
on n.nephew_id = pnl.nephew_id
where enabled = 1
-- I wish I could add this clause to restrict the result to the oldest
-- nephew but p.parent_id is not available in sub-selects.
-- You get an ORA-00904 error if you try this:
-- and n.birthday = (select min(birthday) from nephew nested where nested.parent_id = p.parent_id)
My desired output would be:
+-------------+-------------+
| parent_name | nephew_name |
+-------------+-------------+
| Donald | Louie |
+-------------+-------------+
| Mickey | Ferdie |
+-------------+-------------+
Thanks for any advice!
John
markaaronky's suggestion
I tried using markaaronky's suggestion but this sql is also flawed.
-- This query is not right either, it returns the correct data but only for one parent
select * from (
select parent_name, n.nephew_name, n.birthday
from parent p
left outer join parent_nephew_link pnl
on p.parent_id = pnl.parent_id
left outer join nephew n
on n.nephew_id = pnl.nephew_id
where enabled = 1
order by parent_name, n.birthday asc
) where rownum <= 1
Why not:
(1) include the n.birthday from the nephews table in your SELECT statement
(2) add an ORDER BY n.birthday ASC to your query
(3) also modify your select so that it only takes the top row?
I tried to write this out in sqlfiddle for you but it doesn't seem to like table aliases (e.g. it throws an error when I write n.birthday), but I'm sure that's legal in Oracle, even though I'm a SQL Server guy.
Also, if I recall correctly, Oracle doesn't have a SELECT TOP like SQL Server does... you have to do something like "WHERE ROWNUM = 1" instead? Same concept... you're just ordering your results so the oldest nephew is the first row, and you're only taking the first row.
Perhaps an undesired side effect is you WOULD get the birthday along with the names in your results. If that's unacceptable, my apologies. It looked like your question has been sitting unanswered for a while and this solution should at least give you a start.
Lastly, since you don't have a NOT NULL constraint on your birthday column and are doing left outer joins, you might make the query safer by adding AND n.birthday IS NOT NULL
Use:
select parent_name, nephew_name
from parent p
left outer join
(
SELECT pnl.parent_id, n.nephew_name
FROM parent_nephew_link pnl
join nephew n
on n.nephew_id = pnl.nephew_id
AND n.BIRTHDAY = (
SELECT min( BIRTHDAY )
FROM nephew n1
JOIN parent_nephew_link pnl1
ON pnl1.NEPHEW_ID = n1.NEPHEW_ID
WHERE pnl1.PARENT_ID = pnl.PARENT_ID
)
) ppp
on p.parent_id = ppp.parent_id
where p.enabled = 1
Demo: http://sqlfiddle.com/#!4/98758/23
| PARENT_NAME | NEPHEW_NAME |
|-------------|-------------|
| Mickey | Ferdie |
| Donald | Louie |
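As a cross-check of what the query is meant to return, here is a small Python model of the same logic on the sample data from the question. It is a sketch, not Oracle code: the function name is hypothetical, and the birthdays are read as 1 January of each year. Unlike the SQL, which would return several rows if two nephews shared the earliest birthday, min() here picks a single one.

```python
from datetime import date

# Sample data from the question.
parents = [(1, "Donald", 1), (2, "Minnie", 0), (3, "Mickey", 1)]  # (id, name, enabled)
nephews = {
    100: (date(2017, 1, 1), "Huey"),
    101: (date(2016, 1, 1), "Dewey"),
    102: (date(2015, 1, 1), "Louie"),
    103: (date(2014, 1, 1), "Morty"),
    104: (date(2013, 1, 1), "Ferdie"),
}  # nephew_id -> (birthday, name)
links = [(1, 100), (1, 101), (1, 102), (3, 103), (3, 104)]  # (parent_id, nephew_id)


def oldest_nephew_per_enabled_parent(parents, nephews, links):
    result = []
    for parent_id, parent_name, enabled in parents:
        if not enabled:
            continue  # WHERE p.enabled = 1
        # Nephews linked to this parent, as (birthday, name) pairs.
        linked = [nephews[n_id] for p_id, n_id in links if p_id == parent_id]
        # Earliest birthday wins; None keeps the left-outer-join behaviour
        # for an enabled parent that has no nephews at all.
        oldest = min(linked)[1] if linked else None
        result.append((parent_name, oldest))
    return result


oldest = oldest_nephew_per_enabled_parent(parents, nephews, links)
print(oldest)  # [('Donald', 'Louie'), ('Mickey', 'Ferdie')]
```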

Insert value based on min value greater than value in another row

It's difficult to explain the question well in the title.
I am inserting 6 values from (or based on values in) one row.
I also need to insert a value from a second row where:
The values in one column (ID) must be equal
The values in column (CODE) in the main source row must be IN (100,200), whereas the other row must have value of 300 or 400
The value in another column (OBJID) in the secondary row must be the lowest value above that in the primary row.
Source Table looks like:
OBJID | CODE | ENTRY_TIME | INFO | ID | USER
---------------------------------------------
1 | 100 | x timestamp| .... | 10 | X
2 | 100 | y timestamp| .... | 11 | Y
3 | 300 | z timestamp| .... | 10 | F
4 | 100 | h timestamp| .... | 10 | X
5 | 300 | g timestamp| .... | 10 | G
So to provide an example..
In my second table I want to insert OBJID, OBJID2, CODE, ENTRY_TIME, substr(INFO(...)), ID, USER
i.e. from my example a line inserted in the second table would look like:
OBJID | OBJID2 | CODE | ENTRY_TIME | INFO | ID | USER
-----------------------------------------------------------
1 | 3 | 100 | x timestamp| substring | 10 | X
4 | 5 | 100 | h timestamp| substring2| 10 | X
My insert for everything that just comes from one row works fine.
INSERT INTO TABLE2
(ID, OBJID, INFO, USER, ENTRY_TIME)
SELECT ID, OBJID, DECODE(CODE, 100, (SUBSTR(INFO, 12,
LENGTH(INFO)-27)),
600,'CREATE') INFO, USER, ENTRY_TIME
FROM TABLE1
WHERE CODE IN (100,200);
I'm aware that I'll need to use an alias on TABLE1, but I don't know how to get the rest to work, particularly in an efficient way. There are 2 million rows right now, but there will be closer to 20 million once I start using production data.
You could try this:
select primary.* ,
(select min(objid)
from table1 secondary
where primary.objid < secondary.objid
and secondary.code in (300,400)
and primary.id = secondary.id
) objid2
from table1 primary
where primary.code in (100,200);
Ok, I've come up with:
select OBJID,
min(case when code in (300,400) then objid end)
over (partition by id order by objid
range between 1 following and unbounded following
) objid2,
CODE, ENTRY_TIME, INFO, ID, USER1
from table1;
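So you need an INSERT ... SELECT that uses the above query as a subquery, filtered with WHERE objid2 IS NOT NULL AND code IN (100,200). The filter has to sit outside the subquery, because the window function needs to see the 300/400 rows.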
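To check the matching rule against the sample rows, here is a brute-force Python model of it: for each 100/200 row, take the smallest OBJID among 300/400 rows with the same ID and a larger OBJID. It is only a sketch of the rule (the function name is made up, and the nested scan is quadratic, so it says nothing about performance on 20 million rows).

```python
# Sample rows from the question, reduced to (objid, code, id).
rows = [
    (1, 100, 10),
    (2, 100, 11),
    (3, 300, 10),
    (4, 100, 10),
    (5, 300, 10),
]


def pair_with_next_secondary(rows):
    out = []
    for objid, code, id_ in rows:
        if code not in (100, 200):
            continue  # only primary rows are inserted
        # Secondary rows: same ID, code 300/400, OBJID above the primary's.
        candidates = [
            o for o, c, i in rows
            if i == id_ and c in (300, 400) and o > objid
        ]
        if candidates:  # mirrors "objid2 is not null"
            out.append((objid, min(candidates), code, id_))
    return out


pairs = pair_with_next_secondary(rows)
print(pairs)  # [(1, 3, 100, 10), (4, 5, 100, 10)]
```

Row 2 drops out because ID 11 has no 300/400 row, which matches the two-line example output in the question.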

How to transpose/pivot data in hive?

I know there's no direct way to transpose data in hive. I followed this question: Is there a way to transpose data in Hive? , but as there is no final answer there, could not get all the way.
This is the table I have:
| ID | Code | Proc1 | Proc2 |
| 1 | A | p | e |
| 2 | B | q | f |
| 3 | B | p | f |
| 3 | B | q | h |
| 3 | B | r | j |
| 3 | C | t | k |
Here Proc1 can have any number of values. ID, Code & Proc1 together form a unique key for this table. I want to pivot/transpose this table so that each unique value in Proc1 becomes a new column, and the corresponding value from Proc2 is the value in that column for the corresponding row. In essence, I'm trying to get something like:
| ID | Code | p | q | r | t |
| 1 | A | e | | | |
| 2 | B | | f | | |
| 3 | B | f | h | j | |
| 3 | C | | | | k |
In the new transformed table, ID and code are the only primary key. From the ticket I mentioned above, I could get this far using the to_map UDAF. (Disclaimer - this may not be a step in the right direction, but just mentioning here, if it is)
| ID | Code | Map_Aggregation |
| 1 | A | {p:e} |
| 2 | B | {q:f} |
| 3 | B | {p:f, q:h, r:j } |
| 3 | C | {t:k} |
But don't know how to get from this step to the pivot/transposed table I want.
Any help on how to proceed will be great!
Thanks.
Here is the approach I used to solve this problem using Hive's built-in "map" UDF:
select
b.id,
b.code,
concat_ws('',b.p) as p,
concat_ws('',b.q) as q,
concat_ws('',b.r) as r,
concat_ws('',b.t) as t
from
(
select id, code,
collect_list(a.group_map['p']) as p,
collect_list(a.group_map['q']) as q,
collect_list(a.group_map['r']) as r,
collect_list(a.group_map['t']) as t
from (
select
id,
code,
map(proc1,proc2) as group_map
from
test_sample
) a
group by
a.id,
a.code
) b;
"concat_ws" and "map" are hive udf and "collect_list" is a hive udaf.
Here is the solution I ended up using:
add jar brickhouse-0.7.0-SNAPSHOT.jar;
CREATE TEMPORARY FUNCTION collect AS 'brickhouse.udf.collect.CollectUDAF';
select
id,
code,
group_map['p'] as p,
group_map['q'] as q,
group_map['r'] as r,
group_map['t'] as t
from ( select
id, code,
collect(proc1,proc2) as group_map
from test_sample
group by id, code
) gm;
The collect UDAF used here comes from the brickhouse repo: https://github.com/klout/brickhouse
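The map-based pivot boils down to two steps: build one key/value map per (id, code) group, then look up a fixed list of keys as columns. A small Python model of that, using the sample table from the question, might look like this. It is a sketch of the idea rather than Hive code; the function name pivot is an assumption, and missing keys are shown as empty strings here, whereas a Hive map lookup would give NULL.

```python
from collections import defaultdict

# Sample rows from the question: (id, code, proc1, proc2).
rows = [
    (1, "A", "p", "e"),
    (2, "B", "q", "f"),
    (3, "B", "p", "f"),
    (3, "B", "q", "h"),
    (3, "B", "r", "j"),
    (3, "C", "t", "k"),
]
columns = ["p", "q", "r", "t"]  # the output columns must be known up front


def pivot(rows, columns):
    # Step 1: one proc1 -> proc2 map per (id, code) group.
    maps = defaultdict(dict)
    for id_, code, proc1, proc2 in rows:
        maps[(id_, code)][proc1] = proc2
    # Step 2: look up each fixed column name in the group's map.
    return [
        (id_, code, *[m.get(c, "") for c in columns])
        for (id_, code), m in sorted(maps.items())
    ]


pivoted = pivot(rows, columns)
for row in pivoted:
    print(row)
```

As in the SQL versions, the list of pivot columns is hard-coded; a truly dynamic set of columns needs a separate step that generates the query.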
Yet another solution.
Pivot using Hivemall to_map function.
SELECT
uid,
kv['c1'] AS c1,
kv['c2'] AS c2,
kv['c3'] AS c3
FROM (
SELECT uid, to_map(key, value) kv
FROM vtable
GROUP BY uid
) t
uid c1 c2 c3
101 11 12 13
102 21 22 23
Unpivot
SELECT t1.uid, t2.key, t2.value
FROM htable t1
LATERAL VIEW explode (map(
'c1', c1,
'c2', c2,
'c3', c3
)) t2 as key, value
uid key value
101 c1 11
101 c2 12
101 c3 13
102 c1 21
102 c2 22
102 c3 23
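The unpivot direction is the mirror image: each wide row is exploded into one (uid, key, value) row per column. A minimal Python sketch of that, using the numbers shown above (the function name unpivot is made up for illustration):

```python
# Wide rows as shown in the pivot output above.
wide = [
    {"uid": 101, "c1": 11, "c2": 12, "c3": 13},
    {"uid": 102, "c1": 21, "c2": 22, "c3": 23},
]


def unpivot(rows, id_col, value_cols):
    # One output row per (input row, column), like LATERAL VIEW explode(map(...)).
    return [(row[id_col], col, row[col]) for row in rows for col in value_cols]


long_rows = unpivot(wide, "uid", ["c1", "c2", "c3"])
for row in long_rows:
    print(row)
```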
I have not written this code, but I think you can use some of the UDFs provided by klouts brickhouse: https://github.com/klout/brickhouse
Specifically, you could do something like use their collect as mentioned here: http://brickhouseconfessions.wordpress.com/2013/03/05/use-collect-to-avoid-the-self-join/
and then explode the arrays (they will be of differing length) using the methods detailed in this post http://brickhouseconfessions.wordpress.com/2013/03/07/exploding-multiple-arrays-at-the-same-time-with-numeric_ra
I have created one dummy table called hive using below query-
create table hive (id Int,Code String, Proc1 String, Proc2 String);
Loaded all the data in the table-
insert into hive values('1','A','p','e');
insert into hive values('2','B','q','f');
insert into hive values('3','B','p','f');
insert into hive values('3','B','q','h');
insert into hive values('3','B','r','j');
insert into hive values('3','C','t','k');
Now use the below query to achieve the output.
select id,code,
case when collect_list(p)[0] is null then '' else collect_list(p)[0] end as p,
case when collect_list(q)[0] is null then '' else collect_list(q)[0] end as q,
case when collect_list(r)[0] is null then '' else collect_list(r)[0] end as r,
case when collect_list(t)[0] is null then '' else collect_list(t)[0] end as t
from(
select id, code,
case when proc1 ='p' then proc2 end as p,
case when proc1 ='q' then proc2 end as q,
case when proc1 ='r' then proc2 end as r,
case when proc1 ='t' then proc2 end as t
from hive
) dummy group by id,code;
In case of numeric value you can use below hive query:
Sample data
ID cust_freq Var1 Var2 frequency
220444 1 16443 87128 72.10140547
312554 6 984 7339 0.342452643
220444 3 6201 87128 9.258396518
220444 6 47779 87128 2.831972441
312554 1 6055 7339 82.15209213
312554 3 12868 7339 4.478333954
220444 2 6705 87128 15.80822558
312554 2 37432 7339 13.02712127
select id, sum(a.group_map[1]) as One, sum(a.group_map[2]) as Two, sum(a.group_map[3]) as Three, sum(a.group_map[6]) as Six from
( select id,
map(cust_freq,frequency) as group_map
from table
) a group by a.id having id in
( '220444',
'312554');
ID one two three six
220444 72.10140547 15.80822558 9.258396518 2.831972441
312554 82.15209213 13.02712127 4.478333954 0.342452643
The example above doesn't use any custom UDF, only built-in Hive functions.
Note: if the map key is a string, write the lookup as sum(a.group_map['1']) as One.
For Pivot, we can simply use the logic below (a self-join that turns the 'Cost' and 'Sales' rows into columns).
SELECT Cost.Code, Cost.Product, Cost.Size
, Cost.State_code, Cost.Promo_date, Cost.Cost, Sales.Price
FROM
(Select Code, Product, Size, State_code, Promo_date, Price as Cost
FROM Product
Where Description = 'Cost') Cost
JOIN
(Select Code, Product, Size, State_code, Promo_date, Price as Price
FROM Product
Where Description = 'Sales') Sales
on (Cost.Code = Sales.Code
and Cost.Promo_date = Sales.Promo_date);
Below is a way to Unpivot (the month columns are exploded into rows)
SELECT TM1_Code, Product, Size, State_code, Description
, Promo_date
, Price
FROM (
SELECT TM1_Code, Product, Size, State_code, Description
, MAP('FY2018Jan', FY2018Jan, 'FY2018Feb', FY2018Feb, 'FY2018Mar', FY2018Mar, 'FY2018Apr', FY2018Apr
,'FY2018May', FY2018May, 'FY2018Jun', FY2018Jun, 'FY2018Jul', FY2018Jul, 'FY2018Aug', FY2018Aug
,'FY2018Sep', FY2018Sep, 'FY2018Oct', FY2018Oct, 'FY2018Nov', FY2018Nov, 'FY2018Dec', FY2018Dec) AS tmp_column
FROM CS_ME_Spirits_30012018) TmpTbl
LATERAL VIEW EXPLODE(tmp_column) exptbl AS Promo_date, Price;
You can use CASE statements with some help from collect_set to achieve this. A detailed answer is at http://www.analyticshut.com/big-data/hive/pivot-rows-to-columns-in-hive/
Here is the query for reference,
SELECT resource_id,
CASE WHEN COLLECT_SET(quarter_1)[0] IS NULL THEN 0 ELSE COLLECT_SET(quarter_1)[0] END AS quarter_1_spends,
CASE WHEN COLLECT_SET(quarter_2)[0] IS NULL THEN 0 ELSE COLLECT_SET(quarter_2)[0] END AS quarter_2_spends,
CASE WHEN COLLECT_SET(quarter_3)[0] IS NULL THEN 0 ELSE COLLECT_SET(quarter_3)[0] END AS quarter_3_spends,
CASE WHEN COLLECT_SET(quarter_4)[0] IS NULL THEN 0 ELSE COLLECT_SET(quarter_4)[0] END AS quarter_4_spends
FROM (
SELECT resource_id,
CASE WHEN quarter='Q1' THEN amount END AS quarter_1,
CASE WHEN quarter='Q2' THEN amount END AS quarter_2,
CASE WHEN quarter='Q3' THEN amount END AS quarter_3,
CASE WHEN quarter='Q4' THEN amount END AS quarter_4
FROM billing_info)tbl1
GROUP BY resource_id;

Oracle Insert Into Child & Parent Tables

I have a table - let's call it MASTER - with a lot of rows in it. Now, I had to create another table called 'MASTER_DETAILS', which will be populated with data from another system. Such data will be accessed via DB Link.
MASTER has a FK to MASTER_DETAILS (1 -> 1 Relationship).
I created a SQL to populate the MASTER_DETAILS table:
INSERT INTO MASTER_DETAILS(ID, DETAIL1, DETAILS2, BLAH)
WITH QUERY_FROM_EXTERNAL_SYSTEM AS (
SELECT IDENTIFIER,
FIELD1,
FIELD2,
FIELD3
FROM TABLE#DB_LINK
--- DOZENS OF INNERS AND OUTER JOINS HERE
) SELECT MASTER_DETAILS_SEQ.NEXTVAL,
QES.FIELD1,
QES.FIELD2,
QES.FIELD3
FROM MASTER M
INNER JOIN QUERY_FROM_EXTERNAL_SYSTEM QES ON QES.IDENTIFIER = M.ID
--- DOZENS OF JOINS HERE
Approach above works fine to insert all the values into the MASTER_DETAILS.
Problem is:
In the approach above, I cannot insert the value of MASTER_DETAILS_SEQ.CURRVAL into the MASTER table. So I create all the entries into the DETAILS table but I don't link them to the MASTER table.
Does anyone see a way out of this problem using only an INSERT statement? I wish I could avoid creating a complex script with LOOPS and everything to handle this problem.
Ideally I want to do something like this:
INSERT INTO MASTER_DETAILS(ID, DETAIL1, DETAILS2, BLAH) AND MASTER(MASTER_DETAILS_ID)
WITH QUERY_FROM_EXTERNAL_SYSTEM AS (
SELECT IDENTIFIER,
FIELD1,
FIELD2,
FIELD3
FROM TABLE#DB_LINK
--- DOZENS OF INNERS AND OUTER JOINS HERE
) SELECT MASTER_DETAILS_SEQ.NEXTVAL,
QES.FIELD1,
QES.FIELD2,
QES.FIELD3
FROM MASTER M
INNER JOIN QUERY_FROM_EXTERNAL_SYSTEM QES ON QES.IDENTIFIER = M.ID
--- DOZENS OF JOINS HERE,
SELECT MASTER_DETAILS_SEQ.CURRVAL FROM DUAL;
I know such approach does not work on Oracle - but I am showing this SQL to demonstrate what I want to do.
Thanks.
If there is really a 1-to-1 relationship between the two tables, then they could arguably be a single table. Presumably you have a reason to want to keep them separate. Perhaps the master is a vendor-supplied table you shouldn't touch and the detail is extra data; but then you're changing the master anyway by adding the foreign key field. Or perhaps the detail will be reloaded periodically and you don't want to update the master table; but then you have to update the foreign key field anyway. I'll assume you're required to have a separate table, for whatever reason.
If you put a foreign key on the master table that refers to the primary key on the detail table, you are restricted to it only ever being a 1-to-1 relationship. If that really is the case then conceptually it shouldn't matter which way the relationship is built - which table has the primary key and which has the foreign key. And if it isn't then your model will break when your detail table (or the remote query) comes back with two rows related to the same master - even if you're sure that won't happen today, will it always be true? The pluralisation of the name master_details suggests that might be expected. Maybe. Having the relationship the other way would prevent that being an issue.
I'm guessing you decided to put the relationship that way round so you can join the tables using the detail's key:
select m.column, md.column
from master m
join master_details md on md.id = m.detail_id
... because you expect that to be the quickest way, since md.id will be indexed (implicitly, as a primary key). But you could achieve the same effect by adding the master ID to the details table as a foreign key:
select m.column, md.column
from master m
join master_details md on md.master_id = m.id
It is good practice to index foreign keys anyway, and as long as you have an index on master_details.master_id then the performance should be the same (more or less, other factors may come in to play but I'd expect this to generally be the case). This would also allow multiple detail records in the future, without needing to modify the schema.
So as a simple example, let's say you have a master table created and populated with some dummy data:
create table master(id number, data varchar2(10),
constraint pk_master primary key (id));
create sequence seq_master start with 42;
insert into master (id, data)
values (seq_master.nextval, 'Foo ' || seq_master.nextval);
insert into master (id, data)
values (seq_master.nextval, 'Foo ' || seq_master.nextval);
insert into master (id, data)
values (seq_master.nextval, 'Foo ' || seq_master.nextval);
select * from master;
ID DATA
---------- ----------
42 Foo 42
43 Foo 43
44 Foo 44
The changes you've proposed might look like this:
create table detail (id number, other_data varchar2(10),
constraint pk_detail primary key(id));
create sequence seq_detail;
alter table master add (detail_id number,
constraint fk_master_detail foreign key (detail_id)
references detail (id));
insert into detail (id, other_data)
select seq_detail.nextval, 'Bar ' || seq_detail.nextval
from master m
-- joins etc
;
... plus the update of the master's foreign key, which is what you're struggling with, so let's do that manually for now:
update master set detail_id = 1 where id = 42;
update master set detail_id = 2 where id = 43;
update master set detail_id = 3 where id = 44;
And then you'd query as:
select m.data, d.other_data
from master m
join detail d on d.id = m.detail_id
where m.id = 42;
DATA OTHER_DATA
---------- ----------
Foo 42 Bar 1
Plan hash value: 2192253142
------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 22 | 2 (0)| 00:00:01 |
| 1 | NESTED LOOPS | | 1 | 22 | 2 (0)| 00:00:01 |
| 2 | TABLE ACCESS BY INDEX ROWID| MASTER | 1 | 13 | 1 (0)| 00:00:01 |
|* 3 | INDEX UNIQUE SCAN | PK_MASTER | 1 | | 0 (0)| 00:00:01 |
| 4 | TABLE ACCESS BY INDEX ROWID| DETAIL | 3 | 27 | 1 (0)| 00:00:01 |
|* 5 | INDEX UNIQUE SCAN | PK_DETAIL | 1 | | 0 (0)| 00:00:01 |
------------------------------------------------------------------------------------------
If you swap the relationship around the changes become:
create table detail (id number, master_id number, other_data varchar2(10),
constraint pk_detail primary key(id),
constraint fk_detail_master foreign key (master_id)
references master (id));
create index ix_detail_master_id on detail (master_id);
create sequence seq_detail;
insert into detail (id, master_id, other_data)
select seq_detail.nextval, m.id, 'Bar ' || seq_detail.nextval
from master m
-- joins etc.
;
No update of the master table is needed, and the query becomes:
select m.data, d.other_data
from master m
join detail d on d.master_id = m.id
where m.id = 42;
DATA OTHER_DATA
---------- ----------
Foo 42 Bar 1
Plan hash value: 4273661231
----------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 19 | 2 (0)| 00:00:01 |
| 1 | NESTED LOOPS | | 1 | 19 | 2 (0)| 00:00:01 |
| 2 | TABLE ACCESS BY INDEX ROWID| MASTER | 1 | 10 | 1 (0)| 00:00:01 |
|* 3 | INDEX UNIQUE SCAN | PK_MASTER | 1 | | 0 (0)| 00:00:01 |
| 4 | TABLE ACCESS BY INDEX ROWID| DETAIL | 1 | 9 | 1 (0)| 00:00:01 |
|* 5 | INDEX RANGE SCAN | IX_DETAIL_MASTER_ID | 1 | | 0 (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------
The only real difference in the plan is that you now have a range scan instead of a unique scan; if you're really sure it's 1-to-1 you could make the index unique but there's not much benefit.
SQL Fiddle of this approach.
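For anyone who wants to try the reversed relationship without an Oracle instance, here is a rough sketch using Python's built-in sqlite3 module. It is not Oracle: SQLite's auto-assigned INTEGER PRIMARY KEY stands in for seq_detail, the types are SQLite's, and the 'Bar for ...' text is made-up dummy data. The point it illustrates is the same, though: a single INSERT ... SELECT fills the detail table and links it to master, with no update of master afterwards.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE master (id INTEGER PRIMARY KEY, data TEXT);
    CREATE TABLE detail (
        id INTEGER PRIMARY KEY,  -- auto-assigned, stands in for seq_detail
        master_id INTEGER NOT NULL REFERENCES master (id),
        other_data TEXT
    );
    CREATE INDEX ix_detail_master_id ON detail (master_id);
    INSERT INTO master (id, data)
    VALUES (42, 'Foo 42'), (43, 'Foo 43'), (44, 'Foo 44');
""")

# One INSERT ... SELECT creates the detail rows and the link to master.
conn.execute(
    "INSERT INTO detail (master_id, other_data) "
    "SELECT m.id, 'Bar for ' || m.id FROM master m ORDER BY m.id"
)

detail_count = conn.execute("SELECT count(*) FROM detail").fetchone()[0]
joined = conn.execute(
    "SELECT m.data, d.other_data FROM master m "
    "JOIN detail d ON d.master_id = m.id WHERE m.id = 42"
).fetchall()

print(detail_count)  # 3
print(joined)        # [('Foo 42', 'Bar for 42')]
```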

Resources