SQL update takes 4 days for 10 million records - Oracle

I want to update a database table which has over 10 million records from a temporary table, but my update query runs for more than 4 days.
1.) I have already created an index for the update's search criteria on
tax_ledger_item_tab. The index is on party_type, identity, company.
My search criteria are party_type, identity, company, delivery_type_id,
as in the query given below; those columns are not keys in the table.
I believe I can't add delivery_type_id to the index because the query
updates it; if I added it to the index, performance would be worse.
2.) The temporary table identity_invoice_info_cfv also returns about 70,000
records.
So far I believe my update's execution cost will be around 70,000 × 10 million rows.
How can I improve the performance of the following update query? I only want to update the delivery_type_id and fetched columns.
DECLARE
  CURSOR get_records IS
    SELECT i.COMPANY, i.IDENTITY, i.CF$_DELIVERY_TYPE
    FROM identity_invoice_info_cfv i
    WHERE i.PARTY_TYPE_DB = 'CUSTOMER';
BEGIN
  FOR rec_ IN get_records LOOP
    dbms_output.put_line(sysdate);
    UPDATE tax_ledger_item_tab t
    SET t.delivery_type_id = rec_.CF$_DELIVERY_TYPE, t.fetched = 'TRUE'
    WHERE t.party_type = 'CUSTOMER'
    AND t.identity = rec_.IDENTITY
    AND t.company = rec_.COMPANY
    AND t.delivery_type_id IS NULL;
    COMMIT;
  END LOOP;
END;

Use a MERGE statement:
Oracle Setup:
CREATE TABLE identity_invoice_info_cfv ( COMPANY, IDENTITY, CF$_DELIVERY_TYPE, PARTY_TYPE_DB ) AS
SELECT 'A', 123, 456, 'CUSTOMER' FROM DUAL;
CREATE TABLE tax_ledger_item_tab ( identity, company, party_type, delivery_type_id, fetched ) AS
SELECT 123, 'A', 'CUSTOMER', CAST( NULL AS NUMBER ), 'FALSE' FROM DUAL;
Merge:
MERGE INTO tax_ledger_item_tab t
USING identity_invoice_info_cfv i
ON (
t.identity = i.identity
AND t.company = i.COMPANY
AND t.party_type = 'CUSTOMER'
AND i.PARTY_TYPE_DB = 'CUSTOMER'
)
WHEN MATCHED THEN
UPDATE
SET delivery_type_id = i.CF$_DELIVERY_TYPE,
fetched = 'TRUE'
WHERE t.delivery_type_id IS NULL;
Query:
SELECT * FROM tax_ledger_item_tab;
Output:
IDENTITY | COMPANY | PARTY_TYPE | DELIVERY_TYPE_ID | FETCHED
-------: | :------ | :--------- | ---------------: | :------
123 | A | CUSTOMER | 456 | TRUE
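If the MERGE alone is still slow across 10 million rows, enabling parallel DML for the session is worth testing. A sketch, assuming your environment permits parallel execution (the degree of 8 is illustrative):
ALTER SESSION ENABLE PARALLEL DML;

MERGE /*+ PARALLEL(t, 8) */ INTO tax_ledger_item_tab t
USING identity_invoice_info_cfv i
ON (
  t.identity = i.identity
  AND t.company = i.COMPANY
  AND t.party_type = 'CUSTOMER'
  AND i.PARTY_TYPE_DB = 'CUSTOMER'
)
WHEN MATCHED THEN
  UPDATE
  SET delivery_type_id = i.CF$_DELIVERY_TYPE,
      fetched = 'TRUE'
  WHERE t.delivery_type_id IS NULL;

COMMIT;
Either way, a single set-based MERGE does one pass instead of 70,000 separate indexed updates, which is where the original loop loses its time.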

You can achieve this using a MERGE statement as well. Below is the code. Please test it with some sample data on your side before proceeding.
MERGE INTO tax_ledger_item_tab t
USING identity_invoice_info_cfv i
ON (t.party_type = 'CUSTOMER' AND t.identity = i.IDENTITY
    AND t.company = i.COMPANY AND i.PARTY_TYPE_DB = 'CUSTOMER')
WHEN MATCHED THEN
  UPDATE SET
    t.delivery_type_id = i.CF$_DELIVERY_TYPE,
    t.fetched = 'TRUE'
  WHERE t.delivery_type_id IS NULL;
COMMIT;

Related

Function returning Sys Refcursor

How do I call a function that returns a SYS_REFCURSOR in a SELECT statement? I have created a function like the one below and I want to call it in the SELECT statement, returning both values coming from the function. I used it in the query shown after it, but it returns the cursor in place of the column values.
Function HCLT_GET_TASK_DATES(i_ownerid IN NUMBER, i_itemid IN NUMBER)
  RETURN SYS_REFCURSOR IS
  o_DATACUR SYS_REFCURSOR;
begin
  open o_DATACUR for
    select nvl(TO_CHAR(min(pref_start), 'DD-MON-YYYY'), '') AS MIN_DATE,
           nvl(TO_CHAR(max(pref_finish), 'DD-MON-YYYY'), '') AS MAX_DATE
    from autoplanallocation
    WHERE project_id = i_ownerid
    AND task_id = i_itemid;
  RETURN o_DATACUR;
END;
/
SELECT HCLT_GET_TASK_DATES(267157, 15334208),
tv.taskid,
tv.wbs_code AS wbscode,
tv.taskcode,
tv.act_name,
ltrim(regexp_replace(tv.stageactorlovs, '[^#]*#(\d+?),', ',\1'), ',') as stageactorlovs,
tv.createdat,
tv.pushedtoTaskModule,
tv.OVERALLSTATUS AS overallstatus1,
tv.ACTIVITY_CODE_ID,
tv.wbs_code,
TO_CHAR(tv.pref_st, 'DD-MON-YYYY') AS pref_st,
TO_CHAR(tv.pref_fn, 'DD-MON-YYYY') AS pref_fn,
tv.ACTL_EFFORT,
tv.rollup_effort,
tv.overAllStatus,
tv.FIELD5,
tv.FIELD4,
tv.activity_code_id
FROM task_view tv, autoplanallocation al
WHERE al.project_id = tv.ownerid(+)
and al.task_id = tv.taskid(+)
and tv.ownertype = 'Prj'
AND tv.ownerid = 267157
AND (tv.overAllStatus = 'All' OR 'All' = 'All')
AND (TaskId IN
((SELECT xyz
FROM (SELECT ToItemID xyz
FROM ItemTraceability it
WHERE it.FromOwnerType = 'Prj'
AND it.FromOwnerID = 267157
AND it.FromItemType = it.FromItemType
AND it.FromChildItemType = 'USTRY'
AND it.FromItemID = 15334208
AND it.ToOwnerType = 'Prj'
AND it.ToOwnerID = 267157
AND it.ToItemType = it.ToItemType
AND it.ToChildItemType = 'Tsk'
UNION ALL
SELECT FromItemID
FROM ItemTraceability it
WHERE it.ToOwnerType = 'Prj'
AND it.ToOwnerID = 267157
AND it.ToItemType = it.ToItemType
AND it.ToChildItemType = 'USTRY'
AND it.ToItemID = 15334208
AND it.FromOwnerType = 'Prj'
AND it.FromOwnerID = 267157
AND it.FromItemType = it.FromItemType
AND it.FromChildItemType = 'Tsk'))))
ORDER BY UPPER(wbs_code) ASC;
I do not think there is a native way of parsing nested cursors using SQL or PL/SQL code.
In Java with an Oracle JDBC database driver, you can:
Use oracle.jdbc.driver.OraclePreparedStatement.executeQuery to get a java.sql.ResultSet
Which can be cast to an oracle.jdbc.driver.OracleResultSet
Then you can iterate through the rows of the result set and for each row you can use oracle.jdbc.driver.OracleResultSet.getCursor() to get the nested cursor.
You can then iterate through that nested cursor in exactly the same way you iterated through the outer cursor to extract rows from the nested cursor.
You should then close the nested cursor (although it will be automatically closed when the containing parent cursor is closed).
Finally, close the parent cursor.
If you want a SQL solution, then do not return a cursor; return a nested table collection data type instead.
Or, for a single row with multiple columns, return an object type:
CREATE TYPE date_range_obj AS OBJECT(
start_date DATE,
end_date DATE
)
/
CREATE FUNCTION HCLT_GET_TASK_DATES(
i_ownerid IN autoplanallocation.project_id%TYPE,
i_itemid IN autoplanallocation.task_id%TYPE
)
RETURN date_range_obj
IS
v_range date_range_obj;
begin
SELECT date_range_obj(MIN(pref_start), MAX(pref_finish))
INTO v_range
FROM autoplanallocation
WHERE project_id = i_ownerid
AND task_id = i_itemid;
RETURN v_range;
END;
/
Then, for example:
SELECT HCLT_GET_TASK_DATES(1,2).start_date,
HCLT_GET_TASK_DATES(1,2).end_date
FROM DUAL;
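And if you prefer the nested-table route mentioned above, a minimal sketch reusing date_range_obj (the collection type and function names here are illustrative):
CREATE TYPE date_range_tab AS TABLE OF date_range_obj
/
CREATE FUNCTION hclt_get_task_dates_tab(
  i_ownerid IN autoplanallocation.project_id%TYPE,
  i_itemid IN autoplanallocation.task_id%TYPE
)
RETURN date_range_tab
IS
  v_ranges date_range_tab;
BEGIN
  -- BULK COLLECT fills the collection; here the aggregate yields a single row
  SELECT date_range_obj(MIN(pref_start), MAX(pref_finish))
  BULK COLLECT INTO v_ranges
  FROM autoplanallocation
  WHERE project_id = i_ownerid
  AND task_id = i_itemid;
  RETURN v_ranges;
END;
/
-- The collection can then be queried like a table:
SELECT * FROM TABLE(hclt_get_task_dates_tab(1, 2));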
If you are able to change this design, it would be better to do this with a plain join and aggregation (or possibly with a lateral left join in the case of low-cardinality input).
But there is a way to achieve the desired result in plain SQL on 11g and above, using the dbms_xmlgen package's ability to process an arbitrary cursor. Below is the code:
create table t_lkp (id, dt)
as
select
  trunc(level/4 + 1)
  , date '2022-01-01' + level
from dual
connect by level < 11;
create or replace function f_lkp (
p_id in int
)
return sys_refcursor
as
o_res sys_refcursor;
begin
open o_res for
select
min(dt) as dtfrom
, max(dt) as dtto
from t_lkp
where id = p_id;
return o_res;
end;
/
with a as (
select
level as i,
dbms_xmlgen.getxmltype(
/*ctx doesn't accept sys_refcursor, so we had to create a context*/
ctx => DBMS_XMLGEN.NEWCONTEXT(f_lkp(level))
) as val
from dual
connect by level < 6
)
select
i
, xmlquery(
'/ROWSET/ROW/DTFROM/text()'
passing a.val returning content null on empty
) as dtfrom
, xmlquery(
'/ROWSET/ROW/DTTO/text()'
passing a.val returning content null on empty
) as dtto
from a
I | DTFROM | DTTO
-: | :------------------ | :------------------
1 | 2022-01-02 00:00:00 | 2022-01-04 00:00:00
2 | 2022-01-05 00:00:00 | 2022-01-08 00:00:00
3 | 2022-01-09 00:00:00 | 2022-01-11 00:00:00
4 | null | null
5 | null | null
Please note that it will open a large number of cursors for a large input dataset or with parallel processing, which will dramatically consume resources. So it would be much better to use a plain join.

Insert into table two and update table two for BigQuery in one query

I am using Standard SQL in BigQuery. I am writing a scheduled query which inserts records into table (2). Now, given that it's scheduled, I am trying to figure out how to also update records in table (2) from the scheduled query, which is always inserting records into table (2).
In particular, when there is a record in table (2) that was not generated by my query, I want to update table (2) and set a boolean column to No.
Below is my query; where in the query would I add the update logic for table (2)?
INSERT INTO record (airport_name, icao_address, arrival, flight_number, origin_airport_icao, destination_airport_icao)
WITH
planes_stopped_in_airport AS (
SELECT
p.IATA_airport_code,
p.airport_name,
p.airport_ISO_country_code,
p.ICAO_airport_code,
timestamp,
a.icao_address,
a.latitude,
a.longitude,
a.altitude_baro,
a.speed,
heading,
callsign,
source,
a.collection_type,
vertical_rate,
squawk_code,
icao_actype,
flight_number,
origin_airport_icao,
destination_airport_icao,
scheduled_departure_time_utc,
scheduled_arrival_time_utc,
estimated_arrival_time_utc,
tail_number,
ingestion_time
FROM
`updates` a
JOIN
Polygons p
ON
1 = 1
WHERE
a.timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 20 MINUTE) and a.timestamp <= CURRENT_TIMESTAMP()
AND ( latitude IS NULL
AND longitude IS NULL
AND callsign IS NULL
AND speed IS NULL
AND heading IS NULL
AND altitude_baro IS NULL) IS FALSE
AND ST_DWithin( ST_GeogFromText( polygon ),
ST_GeogPoint(a.longitude,
a.latitude),
10)
AND a.collection_type = '1' -- and speed < 50
AND (origin_airport_icao IS NULL
AND destination_airport_icao IS NULL) IS FALSE )
SELECT
p.airport_name,
icao_address,
MIN(timestamp) AS Arrival,
flight_number,
origin_airport_icao,
destination_airport_icao
FROM
planes_stopped_in_airport p
WHERE
flight_number NOT IN (SELECT Distinct flight_number
FROM `table(2)`
)
GROUP BY
icao_address,
p.airport_name,
flight_number,
origin_airport_icao,
destination_airport_icao
HAVING
flight_number IS NOT NULL
ORDER BY
airport_name,
arrival
You can probably do it with a MERGE statement; see the details in https://cloud.google.com/bigquery/docs/reference/standard-sql/dml-syntax#merge_statement.
If I understood your requirements correctly, you need something like
MERGE dataset.Destination T
USING (SELECT * ...) S
ON T.key = S.key
WHEN MATCHED THEN
UPDATE SET T.foo = S.foo, T.bool_flag = FALSE
WHEN NOT MATCHED THEN
INSERT ...
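Since the rows you want to flag are the ones in table (2) that your query did not produce, BigQuery's WHEN NOT MATCHED BY SOURCE clause may be the closer fit. A sketch, where the staging subquery stands in for the SELECT of your scheduled query and the boolean column name is assumed:
MERGE `mydataset.record` T
USING (
  SELECT airport_name, icao_address, arrival, flight_number,
         origin_airport_icao, destination_airport_icao
  FROM `mydataset.staging`  -- stands in for the SELECT in your scheduled query
) S
ON T.flight_number = S.flight_number
WHEN NOT MATCHED THEN
  -- new flights get inserted, as your current query does
  INSERT (airport_name, icao_address, arrival, flight_number,
          origin_airport_icao, destination_airport_icao)
  VALUES (S.airport_name, S.icao_address, S.arrival, S.flight_number,
          S.origin_airport_icao, S.destination_airport_icao)
WHEN NOT MATCHED BY SOURCE THEN
  -- rows in table (2) that this run did not generate get flagged
  UPDATE SET T.is_generated = FALSE;  -- is_generated is an assumed column name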

Knowing when a table was updated in Oracle without a full scan

I'm building an Oracle connector that periodically reads data from a couple of very big tables, some of which are divided into partitions.
I'm trying to figure out which tables were updated since the last time they were read, to avoid unnecessary queries. I have the last ora_rowscn or updated_at, and the only methods I've found require a full table scan to see if there are new or updated rows in the table.
Is there a way to tell if a row was inserted or updated without the full scan?
A couple of ideas (a sketch of both follows the list):
1. Create a table to store the last DML time by table_name, and then create a simple trigger on the big table to update that meta table.
2. Create a Materialized View Log on the table and use the data from the log to determine the changes.
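A minimal sketch of both ideas, with all object names illustrative (big_table stands in for one of your monitored tables). The trigger is statement-level, so it fires once per DML statement rather than per row:
-- Idea 1: a meta table plus a statement-level trigger on the big table
CREATE TABLE table_dml_log (
  table_name VARCHAR2(30) PRIMARY KEY,
  last_dml   TIMESTAMP
);

CREATE OR REPLACE TRIGGER big_table_dml_trg
AFTER INSERT OR UPDATE OR DELETE ON big_table
BEGIN
  MERGE INTO table_dml_log l
  USING (SELECT 'BIG_TABLE' AS table_name FROM dual) s
  ON (l.table_name = s.table_name)
  WHEN MATCHED THEN
    UPDATE SET l.last_dml = SYSTIMESTAMP
  WHEN NOT MATCHED THEN
    INSERT (table_name, last_dml) VALUES (s.table_name, SYSTIMESTAMP);
END;
/

-- Idea 2: a materialized view log; changes accumulate in MLOG$_BIG_TABLE
CREATE MATERIALIZED VIEW LOG ON big_table WITH PRIMARY KEY;
The connector can then poll table_dml_log (or the MLOG$ table) instead of scanning the base tables.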
If there are archive logs for the search period, you can use the LogMiner utility. For example, suppose this insert ran:
insert into "ASOUP"."US"("KEY_COLUMN","COD_ROAD","COD_COMPUTER","COD_STATION_OPER","NUMB_TRAIN","STAT_CREAT","NUMB_SOSTAVA","STAT_APPOINT","COD_OPER","DIRECT_1","DIRECT_2","DATE_OPER","PARK","PATH","LOCOMOT","LATE","CAUSE_LATE","COD_CONNECT","CATEGORY","TIME") values ('42018740','988','0','9200','2624','8642','75','9802','1','8891','0',TO_DATE('18-Dec-2018', 'DD-Mon-RRRR'),'0','0','0','0','0','0',NULL,TO_DATE('18-Dec-2018', 'DD-Mon-RRRR'));
select name, first_time, next_time
from v$archived_log
where first_time >sysdate -3/24
/oracle/app/oracle/product/11.2/redolog/edcu/1_48060_769799469.dbf 18-Dec-2018 09:03:06 18-Dec-2018 10:22:00
/oracle/app/oracle/product/11.2/redolog/edcu/1_48061_769799469.dbf 18-Dec-2018 10:22:00 18-Dec-2018 10:30:02
/oracle/app/oracle/product/11.2/redolog/edcu/1_48062_769799469.dbf 18-Dec-2018 10:30:02 18-Dec-2018 10:56:07
Run the logminer utility.
EXECUTE DBMS_LOGMNR.add_logfile(LOGFILENAME => '/oracle/app/oracle/product/11.2/redolog/edcu/1_48060_769799469.dbf', OPTIONS => DBMS_LOGMNR.NEW);
EXECUTE DBMS_LOGMNR.add_logfile(LOGFILENAME => '/oracle/app/oracle/product/11.2/redolog/edcu/1_48061_769799469.dbf', OPTIONS => DBMS_LOGMNR.addfile);
EXECUTE DBMS_LOGMNR.add_logfile(LOGFILENAME => '/oracle/app/oracle/product/11.2/redolog/edcu/1_48062_769799469.dbf', OPTIONS => DBMS_LOGMNR.addfile);
EXECUTE DBMS_LOGMNR.START_LOGMNR(OPTIONS => DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG);
SELECT scn,ROW_ID,to_char(timestamp,'DD-MM-YYYY HH24:MI:SS'),
table_name,seg_name,operation, sql_redo,sql_undo
FROM v$logmnr_contents
where seg_owner='ASOUP' and table_name='US'
SCN ROW_ID TIMESTAMP TABLE_NAME SEG_NAME OPERATION SQL_REDO SQL_UNDO
1398405575908 AAA3q2AAoAACFweABi 18-12-2018 09:03:15 US US,ADCU201902 INSERT insert into "ASOUP"."US"("KEY_COLUMN","COD_ROAD","COD_COMPUTER","COD_STATION_OPER","NUMB_TRAIN","STAT_CREAT","NUMB_SOSTAVA","STAT_APPOINT","COD_OPER","DIRECT_1","DIRECT_2","DATE_OPER","PARK","PATH","LOCOMOT","LATE","CAUSE_LATE","COD_CONNECT","CATEGORY","TIME") values ('42018727','988','0','8800','4404','1','895','8800','1','8838','0',TO_DATE('18-Dec-2018', 'DD-Mon-RRRR'),'4','2','0','0','0','0',NULL,TO_DATE('18-Dec-2018', 'DD-Mon-RRRR')); delete from "ASOUP"."US" where "KEY_COLUMN" = '42018727' and "COD_ROAD" = '988' and "COD_COMPUTER" = '0' and "COD_STATION_OPER" = '8800' and "NUMB_TRAIN" = '4404' and "STAT_CREAT" = '1' and "NUMB_SOSTAVA" = '895' and "STAT_APPOINT" = '8800' and "COD_OPER" = '1' and "DIRECT_1" = '8838' and "DIRECT_2" = '0' and "DATE_OPER" = TO_DATE('18-Dec-2018', 'DD-Mon-RRRR') and "PARK" = '4' and "PATH" = '2' and "LOCOMOT" = '0' and "LATE" = '0' and "CAUSE_LATE" = '0' and "COD_CONNECT" = '0' and "CATEGORY" IS NULL and "TIME" = TO_DATE('18-Dec-2018', 'DD-Mon-RRRR') and ROWID = 'AAA3q2AAoAACFweABi';
You can see the inserted row without a full scan:
select * from asoup.us where ROWID = 'AAA3q2AAoAACFweABi';

Slowly changing dimensions- SCD1 and SCD2 implementation in Hive [closed]

I am looking for SCD1 and SCD2 implementations in Hive (1.2.1). I am aware of the workaround for loading SCD1 and SCD2 tables prior to Hive 0.14. Here is the link for loading SCD1 and SCD2 with the workaround approach: http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/
Now that Hive supports ACID operations, I just want to know if there is a better or more direct way of loading it.
As HDFS is immutable storage it could be argued that versioning data and keeping history (SCD2) should be the default behaviour for loading dimensions. You can create a View in your Hadoop SQL query engine (Hive, Impala, Drill etc.) that retrieves the current state/latest value using windowing functions. You can find out more about dimensional models on Hadoop in my blog post, e.g. how to handle a large dimension and fact table.
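For example, a sketch of such a view over a history table (table and column names are assumed):
CREATE VIEW dim_customer_current AS
SELECT customer_id, name, address, loaded_at
FROM (
  SELECT h.*,
         -- rank each key's rows newest-first; rn = 1 is the current state
         ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY loaded_at DESC) AS rn
  FROM dim_customer_history h
) ranked
WHERE rn = 1;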
Well, I work around it using two temp tables:
drop table if exists administrator_tmp1;
drop table if exists administrator_tmp2;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
--review_administrator
CREATE TABLE if not exists review_administrator(
admin_id bigint ,
admin_name string,
create_time string,
email string ,
password string,
status_description string,
token string ,
expire_time string ,
granter_user_id bigint ,
admin_time string ,
effect_start_date string ,
effect_end_date string
)
partitioned by (current_row_indicator string comment 'current, expired')
stored as parquet;
--tmp1 is used for saving origin data
CREATE TABLE if not exists administrator_tmp1(
admin_id bigint ,
admin_name string,
create_time string,
email string ,
password string ,
status_description string ,
token string ,
expire_time string ,
granter_user_id bigint ,
admin_time string ,
effect_start_date string ,
effect_end_date string
)
partitioned by (current_row_indicator string comment 'current, expired')
stored as parquet;
--tmp2 saving the scd data
CREATE TABLE if not exists administrator_tmp2(
admin_id bigint ,
admin_name string,
create_time string,
email string ,
password string ,
status_description string ,
token string ,
expire_time string ,
granter_user_id bigint ,
admin_time string ,
effect_start_date string ,
effect_end_date string
)
partitioned by (current_row_indicator string comment 'current, expired')
stored as parquet;
--insert origin data into tmp1
INSERT OVERWRITE TABLE administrator_tmp1 PARTITION(current_row_indicator)
SELECT
user_id as admin_id,
name as admin_name,
time as create_time,
email as email,
password as password,
status as status_description,
token as token,
expire_time as expire_time,
admin_id as granter_user_id,
admin_time as admin_time,
'{{ ds }}' as effect_start_date,
'9999-12-31' as effect_end_date,
'current' as current_row_indicator
FROM
ks_db_origin.gifshow_administrator_origin
;
--insert scd data into tmp2
--for the data unchanged
INSERT INTO TABLE administrator_tmp2 PARTITION(current_row_indicator)
SELECT
t2.admin_id,
t2.admin_name,
t2.create_time,
t2.email,
t2.password,
t2.status_description,
t2.token,
t2.expire_time,
t2.granter_user_id,
t2.admin_time,
t2.effect_start_date,
t2.effect_end_date as effect_end_date,
t2.current_row_indicator
FROM
administrator_tmp1 t1
INNER JOIN
(
SELECT * FROM review_administrator
WHERE current_row_indicator = 'current'
) t2
ON
t1.admin_id = t2.admin_id
AND t1.admin_name = t2.admin_name
AND t1.create_time = t2.create_time
AND t1.email = t2.email
AND t1.password = t2.password
AND t1.status_description = t2.status_description
AND t1.token = t2.token
AND t1.expire_time = t2.expire_time
AND t1.granter_user_id = t2.granter_user_id
AND t1.admin_time = t2.admin_time
;
--for the data changed , update the effect_end_date
INSERT INTO TABLE administrator_tmp2 PARTITION(current_row_indicator)
SELECT
t2.admin_id,
t2.admin_name,
t2.create_time,
t2.email,
t2.password,
t2.status_description,
t2.token,
t2.expire_time,
t2.granter_user_id,
t2.admin_time,
t2.effect_start_date as effect_start_date,
'{{ yesterday_ds }}' as effect_end_date,
'expired' as current_row_indicator
FROM
administrator_tmp1 t1
INNER JOIN
(
SELECT * FROM review_administrator
WHERE current_row_indicator = 'current'
) t2
ON
t1.admin_id = t2.admin_id
WHERE NOT
(
t1.admin_name = t2.admin_name
AND t1.create_time = t2.create_time
AND t1.email = t2.email
AND t1.password = t2.password
AND t1.status_description = t2.status_description
AND t1.token = t2.token
AND t1.expire_time = t2.expire_time
AND t1.granter_user_id = t2.granter_user_id
AND t1.admin_time = t2.admin_time
)
;
--for the changed data and the new data
INSERT INTO TABLE administrator_tmp2 PARTITION(current_row_indicator)
SELECT
t1.admin_id,
t1.admin_name,
t1.create_time,
t1.email,
t1.password,
t1.status_description,
t1.token,
t1.expire_time,
t1.granter_user_id,
t1.admin_time,
t1.effect_start_date,
t1.effect_end_date,
t1.current_row_indicator
FROM
administrator_tmp1 t1
LEFT OUTER JOIN
(
SELECT * FROM review_administrator
WHERE current_row_indicator = 'current'
) t2
ON
t1.admin_id = t2.admin_id
AND t1.admin_name = t2.admin_name
AND t1.create_time = t2.create_time
AND t1.email = t2.email
AND t1.password = t2.password
AND t1.status_description = t2.status_description
AND t1.token = t2.token
AND t1.expire_time = t2.expire_time
AND t1.granter_user_id = t2.granter_user_id
AND t1.admin_time = t2.admin_time
WHERE t2.admin_id IS NULL
;
--for the data already marked by 'expired'
INSERT INTO TABLE administrator_tmp2 PARTITION(current_row_indicator)
SELECT
t1.admin_id,
t1.admin_name,
t1.create_time,
t1.email,
t1.password,
t1.status_description,
t1.token,
t1.expire_time,
t1.granter_user_id,
t1.admin_time,
t1.effect_start_date,
t1.effect_end_date,
t1.current_row_indicator
FROM
review_administrator t1
WHERE t1.current_row_indicator = 'expired'
;
--populate the dim table
INSERT OVERWRITE TABLE review_administrator PARTITION(current_row_indicator)
SELECT
t1.admin_id,
t1.admin_name,
t1.create_time,
t1.email,
t1.password,
t1.status_description,
t1.token,
t1.expire_time,
t1.granter_user_id,
t1.admin_time,
t1.effect_start_date,
t1.effect_end_date,
t1.current_row_indicator
FROM
administrator_tmp2 t1
;
--drop the two temp table
drop table administrator_tmp1;
drop table administrator_tmp2;
-- --example data
-- --2017-01-01
-- insert into table review_administrator PARTITION(current_row_indicator)
-- SELECT '1','a','2016-12-31','a#ks.com','password','open','token1','2017-12-31',
-- 0,'2017-12-31','2017-01-01','9999-12-31','current'
-- FROM default.sample_07 limit 1;
-- --2017-01-02
-- insert into table administrator_tmp1 PARTITION(current_row_indicator)
-- SELECT '1','a','2016-12-31','a01#ks.com','password','open','token1','2017-12-31',
-- 0,'2017-12-31','2017-01-02','9999-12-31','current'
-- FROM default.sample_07 limit 1;
-- insert into table administrator_tmp1 PARTITION(current_row_indicator)
-- SELECT '2','b','2016-12-31','a#ks.com','password','open','token1','2017-12-31',
-- 0,'2017-12-31','2017-01-02','9999-12-31','current'
-- FROM default.sample_07 limit 1;
-- --2017-01-03
-- --id 1 is changed
-- insert into table administrator_tmp1 PARTITION(current_row_indicator)
-- SELECT '1','a','2016-12-31','a03#ks.com','password','open','token1','2017-12-31',
-- 0,'2017-12-31','2017-01-03','9999-12-31','current'
-- FROM default.sample_07 limit 1;
-- --id 2 is not changed at all
-- insert into table administrator_tmp1 PARTITION(current_row_indicator)
-- SELECT '2','b','2016-12-31','a#ks.com','password','open','token1','2017-12-31',
-- 0,'2017-12-31','2017-01-03','9999-12-31','current'
-- FROM default.sample_07 limit 1;
-- --id 3 is a new record
-- insert into table administrator_tmp1 PARTITION(current_row_indicator)
-- SELECT '3','c','2016-12-31','c#ks.com','password','open','token1','2017-12-31',
-- 0,'2017-12-31','2017-01-03','9999-12-31','current'
-- FROM default.sample_07 limit 1;
-- --now dim table will show you the right SCD.
Here's a detailed implementation of slowly changing dimension type 2 in Hive using the exclusive-join approach, assuming that the source sends a complete data file, i.e. old, updated and new records.
Steps:
Load the recent file data into the STG table.
Select all the expired records from the HIST table:
select * from HIST_TAB where exp_dt != '2099-12-31'
Select all the records which have not changed, from STG and HIST, using an inner join and a filter on HIST.column = STG.column, as below:
select hist.* from HIST_TAB hist
inner join STG_TAB stg
on hist.key = stg.key
where hist.column = stg.column
and hist.exp_dt = '2099-12-31'
Select all the new and updated records which have changed, from STG_TAB, using an exclusive left join with HIST_TAB, and set the effective and expiry dates as below:
select stg.*, cast(current_date as string) as eff_dt, '2099-12-31' as exp_dt
from STG_TAB stg
left join
(select * from HIST_TAB where exp_dt = '2099-12-31') hist
on hist.key = stg.key
where hist.key is null
or hist.column != stg.column
Select all updated old records from the HIST table using an exclusive left join with the STG table, and set their expiry date as shown below:
select hist.key, hist.column, hist.eff_dt, cast(current_date as string) as exp_dt
from (select * from HIST_TAB where exp_dt = '2099-12-31') hist
left join STG_TAB stg
on hist.key = stg.key
where stg.key is null
or hist.column != stg.column
UNION ALL the queries from steps 2-5 and INSERT OVERWRITE the result into the HIST table, as sketched below.
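Putting steps 2-5 together, the final statement might look like this sketch, reduced to one key column (id) and one tracked column (col1), which stand in for the key and column placeholders used in the steps above. Hive stages query results before overwriting, so reading HIST_TAB while overwriting it works; if in doubt, write to a temp table first, as the earlier answer does:
insert overwrite table HIST_TAB
select id, col1, eff_dt, exp_dt
from (
  -- step 2: history that is already expired
  select id, col1, eff_dt, exp_dt from HIST_TAB where exp_dt != '2099-12-31'
  union all
  -- step 3: current records that did not change
  select hist.id, hist.col1, hist.eff_dt, hist.exp_dt
  from HIST_TAB hist inner join STG_TAB stg on hist.id = stg.id
  where hist.col1 = stg.col1 and hist.exp_dt = '2099-12-31'
  union all
  -- step 4: new and changed records become the new current generation
  select stg.id, stg.col1, cast(current_date as string) as eff_dt, '2099-12-31' as exp_dt
  from STG_TAB stg
  left join (select * from HIST_TAB where exp_dt = '2099-12-31') hist on hist.id = stg.id
  where hist.id is null or hist.col1 != stg.col1
  union all
  -- step 5: old versions of changed (or vanished) records get closed out
  select hist.id, hist.col1, hist.eff_dt, cast(current_date as string) as exp_dt
  from (select * from HIST_TAB where exp_dt = '2099-12-31') hist
  left join STG_TAB stg on hist.id = stg.id
  where stg.id is null or hist.col1 != stg.col1
) merged;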
More detailed implementation of SCD type 2 can be found here-
https://github.com/sahilbhange/slowly-changing-dimension
drop table if exists harsha.emp;
drop table if exists harsha.emp_tmp1;
drop table if exists harsha.emp_tmp2;
drop table if exists harsha.init_load;
show databases;
use harsha;
show tables;
create table harsha.emp (eid int,ename string,sal int,loc string,dept int,start_date timestamp,end_date timestamp,current_status string)
comment "emp scd implementation"
row format delimited
fields terminated by ','
lines terminated by '\n'
;
create table harsha.emp_tmp1 (eid int,ename string,sal int,loc string,dept int,start_date timestamp,end_date timestamp,current_status string)
comment "emp scd implementation"
row format delimited
fields terminated by ','
lines terminated by '\n'
;
create table harsha.emp_tmp2 (eid int,ename string,sal int,loc string,dept int,start_date timestamp,end_date timestamp,current_status string)
comment "emp scd implementation"
row format delimited
fields terminated by ','
lines terminated by '\n'
;
create table harsha.init_load (eid int,ename string,sal int,loc string,dept int)
row format delimited
fields terminated by ','
lines terminated by '\n'
;
show tables;
insert into table harsha.emp select 101 as eid,'aaaa' as ename,3400 as sal,'chicago' as loc,10 as did,from_unixtime(unix_timestamp()) as start_date,from_unixtime(unix_timestamp('9999-12-31 23:59:59','yyyy-MM-dd HH:mm:ss')) as end_date,'current' as current_status from (select '123')x;
insert into table harsha.emp select 102 as eid,'abaa' as ename,6400 as sal,'ny' as loc,10 as did,from_unixtime(unix_timestamp()) as start_date,from_unixtime(unix_timestamp('9999-12-31 23:59:59','yyyy-MM-dd HH:mm:ss')) as end_date,'current' as current_status from (select '123')x;
insert into table harsha.emp select 103 as eid,'abca' as ename,2300 as sal,'sfo' as loc,20 as did,from_unixtime(unix_timestamp()) as start_date,from_unixtime(unix_timestamp('9999-12-31 23:59:59','yyyy-MM-dd HH:mm:ss')) as end_date,'current' as current_status from (select '123')x;
insert into table harsha.emp select 104 as eid,'afga' as ename,3000 as sal,'seattle' as loc,10 as did,from_unixtime(unix_timestamp()) as start_date,from_unixtime(unix_timestamp('9999-12-31 23:59:59','yyyy-MM-dd HH:mm:ss')) as end_date,'current' as current_status from (select '123')x;
insert into table harsha.emp select 105 as eid,'ikaa' as ename,1400 as sal,'LA' as loc,30 as did,from_unixtime(unix_timestamp()) as start_date,from_unixtime(unix_timestamp('9999-12-31 23:59:59','yyyy-MM-dd HH:mm:ss')) as end_date,'current' as current_status from (select '123')x;
insert into table harsha.emp select 106 as eid,'cccc' as ename,3499 as sal,'spokane' as loc,20 as did,from_unixtime(unix_timestamp()) as start_date,from_unixtime(unix_timestamp('9999-12-31 23:59:59','yyyy-MM-dd HH:mm:ss')) as end_date,'current' as current_status from (select '123')x;
insert into table harsha.emp select 107 as eid,'toiz' as ename,4000 as sal,'WA.DC' as loc,40 as did,from_unixtime(unix_timestamp()) as start_date,from_unixtime(unix_timestamp('9999-12-31 23:59:59','yyyy-MM-dd HH:mm:ss')) as end_date,'current' as current_status from (select '123')x;
load data local inpath 'Documents/hadoop_scripts/t3.txt' into table harsha.emp;
load data local inpath 'Documents/hadoop_scripts/t4.txt' into table harsha.init_load;
insert into table harsha.emp_tmp1 select eid,ename,sal,loc,dept,from_unixtime(unix_timestamp()) as start_date,from_unixtime(unix_timestamp('9999-12-31 23:59:59','yyyy-MM-dd HH:mm:ss')) as end_date,'current' as current_status
from harsha.init_load;
insert into table harsha.emp_tmp2
select a.eid,a.ename,a.sal,a.loc,a.dept,from_unixtime(unix_timestamp()) as start_date,from_unixtime(unix_timestamp('9999-12-31 23:59:59','yyyy-MM-dd HH:mm:ss')) as end_date,'updated' as current_status from emp_tmp1 a
left outer join emp b on
a.eid=b.eid and
a.ename=b.ename and
a.sal=b.sal and
a.loc = b.loc and
a.dept = b.dept
where b.eid is null
union all
select a.eid,a.ename,a.sal,a.loc,a.dept,from_unixtime(unix_timestamp()) as start_date,from_unixtime(unix_timestamp('9999-12-31 23:59:59','yyyy-MM-dd HH:mm:ss')) as end_date,'current' as current_status from emp_tmp1 a
left outer join emp b on
a.eid = b.eid and
a.ename=b.ename and
a.sal=b.sal and
a.loc=b.loc and
a.dept=b.dept
where b.eid is not null
union all
select b.eid,b.ename,b.sal,b.loc,b.dept,b.start_date as start_date,from_unixtime(unix_timestamp()) as end_date,'expired' as current_status from emp b
inner join emp_tmp1 a on
a.eid=b.eid
where
a.ename <> b.ename or
a.sal <> b.sal or
a.loc <> b.loc or
a.dept <> b.dept
;
insert into table harsha.emp select eid,ename,sal,loc,dept,start_date,end_date,current_status from emp_tmp2;
Records including expired:
select * from harsha.emp order by eid;
Latest records:
select a.* from emp a inner join (select eid, max(start_date) as start_date from emp where current_status <> 'expired' group by eid) b on a.eid = b.eid and a.start_date = b.start_date;
I used another approach when it comes to managing data with SCDs:
Never update data that exists inside your historical file or table.
Make sure that new rows are compared to the most recent generation. For instance, the load logic adds control columns: loaded_on, checksum and, if needed, a sequence column that is used when multiple loads occur on the same day. Comparing new data to the most recent generation then uses both the control columns and a key column that exists inside your data, like a customer or product key.
Now, the magic takes place by computing the checksum of all the columns involved except the control columns, creating a unique fingerprint for each row. The fingerprint (checksum) column is then used to determine whether any columns have changed compared to the most recent generation (the most recent generation being the latest state of the data based on the key, loaded_on and sequence).
Now you know whether a row coming from your daily update is new (there is no previous generation), whether it requires creating a new row (a new generation) inside your historical file or table, or whether it has no changes at all, in which case no row needs to be created because there is no difference compared to the previous generation.
The type of logic needed can be built using Apache Spark: in a single statement you can ask Spark to concatenate any number of columns of any datatype and then compute a hash value that is used to fingerprint the row.
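For instance, a sketch of the fingerprint computation in Spark SQL (table and column names are illustrative; the separator and the coalesce calls guard against nulls and accidental concatenation collisions):
SELECT
  customer_key,
  loaded_on,
  sha2(concat_ws('||',
    coalesce(cast(name AS string), ''),
    coalesce(cast(email AS string), ''),
    coalesce(cast(status AS string), '')
  ), 256) AS checksum
FROM daily_update;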
All together, you can develop a Spark-based utility that accepts any data source and outputs a well-organized, clean, slowly-changing-dimension-aware historical file or table. And last: never update, append only!

How to update a field in table1 using another table and a function

I have two tables and one function.
Table1 contains shop_code, batch_id, registry_id:
shop_code| batch_id|registry_id
123 | 100 |12
124 | 100 |13
125 | 100 |12
Table2 contains shop_code, shop_name:
shop_code| shop_name
123 | need to populate
124 | need to populate
125 | need to populate
Function1 takes the parameter registry_id from table1 and returns shop_name.
Table2's shop_name is empty; I want to populate it against the shop_code.
I have tried my best but all effort has been in vain.
It would be great if someone could help; I am using Oracle.
I tried the code below but it gives an error on the FROM keyword:
update TABLE2 set T2.SHOP_NAME = T.SHOP_NAME
from(
select GET_shop_name(t1.registry_id) as shop_name ,
t1.shop_code shop_code
from TABLE1 T1
) t where t.shop_code = t1.shop_code;
I am not entirely sure I got your question right, but I believe you want something like:
update
  table2 u
set
  shop_name = (
    select
      get_shop_name(t1.registry_id)
    from
      table1 t1
    where
      t1.shop_code = u.shop_code
  );
You can try this approach: use an inner query to get the shop_name value. I have not tested it, but I think the approach will work for you.
update TABLE2 T2
set T2.SHOP_NAME =
  (select GET_shop_name(t1.registry_id) from table1 t1 where t1.shop_code = t2.shop_code)
where T2.shop_name is null;
You want the MERGE statement.
Something like this might work:
MERGE INTO TABLE2 t2
USING (
SELECT GET_shop_name(t1.registry_id) AS shop_name,
t1.shop_code shop_code
FROM TABLE1 T1 ) t1
ON (t2.shop_code = t1.shop_code)
WHEN MATCHED THEN
UPDATE SET t2.shop_name = t1.shop_name
;
You'll have to excuse me if the exact code above doesn't work; I don't have SQL Developer where I am right now to check the syntax details. :)
