Knowing when a table was updated in Oracle without a full scan - oracle

I'm building an Oracle connector that periodically reads data from a couple of very big tables, some of which are partitioned.
I'm trying to figure out which tables were updated since the last time they were read, to avoid unnecessary queries. I have the last ora_rowscn or updated_at, but the only methods I have found require a full table scan to see whether the table has new or updated rows.
Is there a way to tell whether a row was inserted or updated in a table without a full scan?

A couple of ideas:
1. Create a table that stores the last DML time by table_name, then create a simple trigger on each table to update that meta table.
2. Create a Materialized View Log on the table and use the data from the log to determine the changes.
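A minimal sketch of the first idea, assuming a hypothetical bookkeeping table table_dml_log and a monitored table called big_table (neither name comes from the question). The trigger is statement-level, so it fires once per DML statement rather than once per row:

```sql
-- Hypothetical bookkeeping table: one row per monitored table.
CREATE TABLE table_dml_log (
  table_name   VARCHAR2(128) PRIMARY KEY,
  last_dml_at  TIMESTAMP NOT NULL
);

-- Statement-level trigger (no FOR EACH ROW): fires once per DML statement.
-- BIG_TABLE is a placeholder for the table being monitored.
CREATE OR REPLACE TRIGGER big_table_dml_trg
AFTER INSERT OR UPDATE OR DELETE ON big_table
BEGIN
  MERGE INTO table_dml_log l
  USING (SELECT 'BIG_TABLE' AS table_name FROM dual) s
  ON (l.table_name = s.table_name)
  WHEN MATCHED THEN
    UPDATE SET l.last_dml_at = SYSTIMESTAMP
  WHEN NOT MATCHED THEN
    INSERT (table_name, last_dml_at)
    VALUES (s.table_name, SYSTIMESTAMP);
END;
/

-- The connector reads one small row instead of scanning the big table.
SELECT last_dml_at FROM table_dml_log WHERE table_name = 'BIG_TABLE';
```

Two trade-offs to keep in mind: concurrent writers will serialize on the single row in the bookkeeping table, and the stored timestamp is the statement time, not the commit time, so the connector should allow for some overlap when comparing it to its last read.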

If archive logs exist for the period you are searching, you can use the LogMiner utility. For example, suppose this row was inserted:
insert into "ASOUP"."US"("KEY_COLUMN","COD_ROAD","COD_COMPUTER","COD_STATION_OPER","NUMB_TRAIN","STAT_CREAT","NUMB_SOSTAVA","STAT_APPOINT","COD_OPER","DIRECT_1","DIRECT_2","DATE_OPER","PARK","PATH","LOCOMOT","LATE","CAUSE_LATE","COD_CONNECT","CATEGORY","TIME") values ('42018740','988','0','9200','2624','8642','75','9802','1','8891','0',TO_DATE('18-Dec-2018', 'DD-Mon-RRRR'),'0','0','0','0','0','0',NULL,TO_DATE('18-Dec-2018', 'DD-Mon-RRRR'));
Find the archived logs that cover the period:
select name, first_time, next_time
from v$archived_log
where first_time > sysdate - 3/24;
/oracle/app/oracle/product/11.2/redolog/edcu/1_48060_769799469.dbf 18-Dec-2018 09:03:06 18-Dec-2018 10:22:00
/oracle/app/oracle/product/11.2/redolog/edcu/1_48061_769799469.dbf 18-Dec-2018 10:22:00 18-Dec-2018 10:30:02
/oracle/app/oracle/product/11.2/redolog/edcu/1_48062_769799469.dbf 18-Dec-2018 10:30:02 18-Dec-2018 10:56:07
Then add those files to LogMiner and start it:
EXECUTE DBMS_LOGMNR.add_logfile(LOGFILENAME => '/oracle/app/oracle/product/11.2/redolog/edcu/1_48060_769799469.dbf', OPTIONS => DBMS_LOGMNR.NEW);
EXECUTE DBMS_LOGMNR.add_logfile(LOGFILENAME => '/oracle/app/oracle/product/11.2/redolog/edcu/1_48061_769799469.dbf', OPTIONS => DBMS_LOGMNR.addfile);
EXECUTE DBMS_LOGMNR.add_logfile(LOGFILENAME => '/oracle/app/oracle/product/11.2/redolog/edcu/1_48062_769799469.dbf', OPTIONS => DBMS_LOGMNR.addfile);
EXECUTE DBMS_LOGMNR.START_LOGMNR(OPTIONS => DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG);
SELECT scn,ROW_ID,to_char(timestamp,'DD-MM-YYYY HH24:MI:SS'),
table_name,seg_name,operation, sql_redo,sql_undo
FROM v$logmnr_contents
where seg_owner='ASOUP' and table_name='US'
SCN ROW_ID TIMESTAMP TABLE_NAME SEG_NAME OPERATION SQL_REDO SQL_UNDO
1398405575908 AAA3q2AAoAACFweABi 18-12-2018 09:03:15 US US,ADCU201902 INSERT insert into "ASOUP"."US"("KEY_COLUMN","COD_ROAD","COD_COMPUTER","COD_STATION_OPER","NUMB_TRAIN","STAT_CREAT","NUMB_SOSTAVA","STAT_APPOINT","COD_OPER","DIRECT_1","DIRECT_2","DATE_OPER","PARK","PATH","LOCOMOT","LATE","CAUSE_LATE","COD_CONNECT","CATEGORY","TIME") values ('42018727','988','0','8800','4404','1','895','8800','1','8838','0',TO_DATE('18-Dec-2018', 'DD-Mon-RRRR'),'4','2','0','0','0','0',NULL,TO_DATE('18-Dec-2018', 'DD-Mon-RRRR')); delete from "ASOUP"."US" where "KEY_COLUMN" = '42018727' and "COD_ROAD" = '988' and "COD_COMPUTER" = '0' and "COD_STATION_OPER" = '8800' and "NUMB_TRAIN" = '4404' and "STAT_CREAT" = '1' and "NUMB_SOSTAVA" = '895' and "STAT_APPOINT" = '8800' and "COD_OPER" = '1' and "DIRECT_1" = '8838' and "DIRECT_2" = '0' and "DATE_OPER" = TO_DATE('18-Dec-2018', 'DD-Mon-RRRR') and "PARK" = '4' and "PATH" = '2' and "LOCOMOT" = '0' and "LATE" = '0' and "CAUSE_LATE" = '0' and "COD_CONNECT" = '0' and "CATEGORY" IS NULL and "TIME" = TO_DATE('18-Dec-2018', 'DD-Mon-RRRR') and ROWID = 'AAA3q2AAoAACFweABi';
You can then fetch the inserted row by its ROWID, without a full scan:
select * from asoup.us where ROWID = 'AAA3q2AAoAACFweABi';
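Since the question mentions keeping the last ora_rowscn, the LogMiner session can also be limited to changes after that SCN. The following is a sketch only: it assumes the log files were already registered with DBMS_LOGMNR.add_logfile as above, that it runs in SQL*Plus, and that last_scn is a hypothetical bind variable holding the SCN saved by the previous read.

```sql
-- Hypothetical: the SCN remembered from the previous read.
VARIABLE last_scn NUMBER
EXECUTE :last_scn := 1398405575908;

-- Start mining from that SCN onward.
BEGIN
  DBMS_LOGMNR.START_LOGMNR(
    STARTSCN => :last_scn,
    OPTIONS  => DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG);
END;
/

-- A non-zero count means the table changed since the last read.
SELECT COUNT(*) AS changes_since_last_read,
       MAX(scn) AS newest_scn
FROM v$logmnr_contents
WHERE seg_owner = 'ASOUP'
  AND table_name = 'US'
  AND operation IN ('INSERT', 'UPDATE');

-- Release the LogMiner session when done.
EXECUTE DBMS_LOGMNR.END_LOGMNR;
```

The newest_scn value can be stored and used as the starting point for the next run.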

Related

Getting ORA-00001 unique constraint violated error when calling a trigger

create or replace TRIGGER "DB"."TRIG_PERIOD_TRUANCY_INS_UPD"
AFTER UPDATE OR INSERT
ON AT_PERIOD_ATTENDANCE_RECORDS
REFERENCING OLD AS OLD NEW AS NEW
FOR EACH ROW
BEGIN
IF UPDATING THEN
delete at_period_truancy where period_attendance_records_id = :old.period_attendance_records_id;
END IF;
insert into at_period_truancy (period_attendance_records_id, district_number, school_id, student_id, calendar_date, school_year, minutes)
select :new.period_attendance_records_id, :new.district_number, :new.school_id, :new.student_id, :new.calendar_date, :new.school_year,
(case when :new.attendance_status = 'A' then period.end_time - period.begin_time
when coalesce(:new.tardy_time_in_time, period.begin_time) - period.begin_time >
period.end_time - coalesce(:new.tardy_time_out_time, period.end_time)
then coalesce(:new.tardy_time_in_time, period.begin_time) - period.begin_time
else period.end_time - coalesce(:new.tardy_time_out_time, period.end_time) end)*24*60
from ca_calendar cal
inner join ca_school_calendar calendar
on (cal.district_number = calendar.district_number
and cal.calendar_id = calendar.calendar_id )
inner join sc_class_meeting_pattern meeting
on (calendar.cycle_day_cd = meeting.cycle_day_cd)
inner join sc_class class
on (class.school_scheduling_param_id = meeting.school_scheduling_param_id
and class.class_id = meeting.class_id)
inner join sc_period_info period
on (meeting.school_scheduling_param_id = period.school_scheduling_param_id
and meeting.period = period.period)
where :new.district_number = cal.district_number
and cal.is_active_ind = 1
and :new.school_id = cal.school_id
and :new.school_year = cal.school_year
and :new.calendar_type_cd = cal.calendar_type_cd
and :new.track_number = cal.track_number
and :new.calendar_date = calendar.calendar_date
and :new.school_id = class.school_id
and :new.class_id = class.class_id
and 1 in (select use_in_truancy_report_ind
from enum_at_absence_reason_code
where district_number = :new.district_number
and school_id = :new.school_id
and value = :new.absence_reason_code
union all
select use_in_truancy_report_ind
from enum_at_tardy_reason_code
where district_number = :new.district_number
and school_id = :new.school_id
and value = :new.tardy_reason_code);
END TRIG_PERIOD_TRUANCY_INS_UPD;
This is the trigger that I am using. The trigger is invoked when I run the update statement, and the error occurs when I pass UN as tardy_reason_code. The update executes without any issues if I pass other values for tardy_reason_code.
The trigger inserts into the at_period_truancy table.
Since Oracle raises ORA-00001 (unique constraint violated), you are trying to insert a value that already exists in a column, or set of columns, covered by a primary key or unique constraint.
You didn't post the CREATE TABLE statement, so it is difficult to guess which columns make up the key. You should know that, though, so check which values are already in the table, compare them to the values currently being inserted, and you'll know what to do.
Maybe you'll have to modify the primary key (add other columns, or abandon the current primary key and use a sequence or identity column instead), or change the way you insert values into the table.
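If the key columns are not known offhand, the data dictionary can list them. This is a sketch that assumes the target table is visible to the current user under the name AT_PERIOD_TRUANCY; a unique index created without a constraint can raise the same error, so it is worth checking ALL_INDEXES as well.

```sql
-- List the columns of every primary key (P) and unique (U) constraint on the table.
SELECT c.constraint_name,
       c.constraint_type,
       cc.column_name,
       cc.position
FROM all_constraints c
JOIN all_cons_columns cc
  ON cc.owner = c.owner
 AND cc.constraint_name = c.constraint_name
WHERE c.table_name = 'AT_PERIOD_TRUANCY'
  AND c.constraint_type IN ('P', 'U')
ORDER BY c.constraint_name, cc.position;

-- Unique indexes that exist without a matching constraint.
SELECT index_name
FROM all_indexes
WHERE table_name = 'AT_PERIOD_TRUANCY'
  AND uniqueness = 'UNIQUE';
```

The constraint name reported in the ORA-00001 message can then be matched against the first result to see exactly which columns collided.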

Insert into table two and update table two for BigQuery in one query

I am using Standard SQL in BigQuery. I am writing a scheduled query that inserts records into table (2). Given that it is scheduled, I am trying to figure out how to also update records in table (2) from the same scheduled query, which currently only inserts records into table (2).
In particular, when a record exists in table (2) but is not generated by my query, I want to update that record in table (2) and set a boolean column to No.
Below is my query. Where in the query would I add the update logic for table (2)?
INSERT INTO record (airport_name, icao_address, arrival, flight_number, origin_airport_icao, destination_airport_icao)
WITH
planes_stopped_in_airport AS (
SELECT
p.IATA_airport_code,
p.airport_name,
p.airport_ISO_country_code,
p.ICAO_airport_code,
timestamp,
a.icao_address,
a.latitude,
a.longitude,
a.altitude_baro,
a.speed,
heading,
callsign,
source,
a.collection_type,
vertical_rate,
squawk_code,
icao_actype,
flight_number,
origin_airport_icao,
destination_airport_icao,
scheduled_departure_time_utc,
scheduled_arrival_time_utc,
estimated_arrival_time_utc,
tail_number,
ingestion_time
FROM
`updates` a
JOIN
Polygons p
ON
1 = 1
WHERE
a.timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 20 MINUTE) and a.timestamp <= CURRENT_TIMESTAMP()
AND ( latitude IS NULL
AND longitude IS NULL
AND callsign IS NULL
AND speed IS NULL
AND heading IS NULL
AND altitude_baro IS NULL) IS FALSE
AND ST_DWithin( ST_GeogFromText( polygon ),
ST_GeogPoint(a.longitude,
a.latitude),
10)
AND a.collection_type = '1' -- and speed < 50
AND (origin_airport_icao IS NULL
AND destination_airport_icao IS NULL) IS FALSE )
SELECT
p.airport_name,
icao_address,
MIN(timestamp) AS Arrival,
flight_number,
origin_airport_icao,
destination_airport_icao
FROM
planes_stopped_in_airport p
WHERE
flight_number NOT IN (SELECT Distinct flight_number
FROM `table(2)`
)
GROUP BY
icao_address,
p.airport_name,
flight_number,
origin_airport_icao,
destination_airport_icao
HAVING
flight_number IS NOT NULL
ORDER BY
airport_name,
arrival
You can probably do it with a MERGE statement; see the details at https://cloud.google.com/bigquery/docs/reference/standard-sql/dml-syntax#merge_statement.
If I understood your requirements correctly, you need something like this:
MERGE dataset.Destination T
USING (SELECT * ...) S
ON T.key = S.key
WHEN MATCHED THEN
UPDATE SET T.foo = S.foo, T.bool_flag = FALSE
WHEN NOT MATCHED THEN
INSERT ...
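The question asks to flag rows that exist in the target but are not produced by the query. In BigQuery's MERGE that case is covered by WHEN NOT MATCHED BY SOURCE. A hedged sketch follows: the dataset name, the staging table new_arrivals (standing in for the output of the scheduled query) and the boolean column is_current are all assumptions, not names from the question.

```sql
MERGE dataset.record T
USING (
  -- Hypothetical: the rows produced by the scheduled query.
  SELECT airport_name, icao_address, arrival, flight_number
  FROM dataset.new_arrivals
) S
ON T.flight_number = S.flight_number
-- Rows the query produced that the target does not have yet: insert them.
WHEN NOT MATCHED BY TARGET THEN
  INSERT (airport_name, icao_address, arrival, flight_number)
  VALUES (S.airport_name, S.icao_address, S.arrival, S.flight_number)
-- Rows in the target that the query did not produce: flag them.
WHEN NOT MATCHED BY SOURCE THEN
  UPDATE SET is_current = FALSE;
```

With this shape the NOT IN subquery against the target is no longer needed, because the ON condition already separates new rows from existing ones.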

Sql update takes 4 days for 10 million records

I want to update a database table that has over 10 million records from a temporary table.
But my update query has been running for more than 4 days.
1.) I have already created an index on tax_ledger_item_tab for the update's search criteria. The index is on party_type, identity, company.
My search criteria are on party_type, identity, company, delivery_type_id, as in the query below; those columns are not keys in the table.
I believe I can't add delivery_type_id to the index because the query updates that column, and adding it would make performance worse.
2.) The temporary table identity_invoice_info_cfv returns 70,000 records.
So far I believe the cost of my update will be around 70,000 * 10 million row visits.
How can I improve the performance of the following update query? I only want to update the delivery_type_id and fetched columns.
DECLARE
CURSOR get_records IS
SELECT i.COMPANY, i.IDENTITY, i.CF$_DELIVERY_TYPE
FROM identity_invoice_info_cfv i
WHERE i.PARTY_TYPE_DB = 'CUSTOMER';
BEGIN
FOR rec_ IN get_records LOOP
dbms_output.put_line (sysdate );
UPDATE tax_ledger_item_tab t
SET t.delivery_type_id = rec_.CF$_DELIVERY_TYPE, t.fetched = 'TRUE'
WHERE t.party_type = 'CUSTOMER'
AND t.identity = rec_.IDENTITY
AND t.company = rec_.COMPANY
AND t.delivery_type_id IS NULL;
COMMIT;
END LOOP;
END;
Use a MERGE statement:
Oracle Setup:
CREATE TABLE identity_invoice_info_cfv ( COMPANY, IDENTITY, CF$_DELIVERY_TYPE, PARTY_TYPE_DB ) AS
SELECT 'A', 123, 456, 'CUSTOMER' FROM DUAL;
CREATE TABLE tax_ledger_item_tab ( identity, company, party_type, delivery_type_id, fetched ) AS
SELECT 123, 'A', 'CUSTOMER', CAST( NULL AS NUMBER ), 'FALSE' FROM DUAL;
Merge:
MERGE INTO tax_ledger_item_tab t
USING identity_invoice_info_cfv i
ON (
t.identity = i.identity
AND t.company = i.COMPANY
AND t.party_type = 'CUSTOMER'
AND i.PARTY_TYPE_DB = 'CUSTOMER'
)
WHEN MATCHED THEN
UPDATE
SET delivery_type_id = i.CF$_DELIVERY_TYPE,
fetched = 'TRUE'
WHERE t.delivery_type_id IS NULL;
Query:
SELECT * FROM tax_ledger_item_tab;
Output:
IDENTITY | COMPANY | PARTY_TYPE | DELIVERY_TYPE_ID | FETCHED
-------: | :------ | :--------- | ---------------: | :------
123 | A | CUSTOMER | 456 | TRUE
You can also achieve this with a MERGE statement. The code is below; please test it with some sample data on your side before running it against the real table.
Merge into tax_ledger_item_tab t
using identity_invoice_info_cfv i
on (t.party_type ='CUSTOMER' and t.identity=i.IDENTITY
and t.company = i.COMPANY and i.PARTY_TYPE_DB = 'CUSTOMER')
when matched then
update set
t.delivery_type_id=i.CF$_DELIVERY_TYPE,
t.fetched = 'TRUE'
where t.delivery_type_id IS NULL;
commit;
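The MERGE replaces 70,000 separate UPDATE statements (each followed by a COMMIT) with a single set-based statement. Before running it on the full table, it may help to check the plan the optimizer chooses. This is only a sketch using EXPLAIN PLAN and DBMS_XPLAN; the plan you get depends on your statistics and indexes.

```sql
-- Explain the MERGE without executing it.
EXPLAIN PLAN FOR
MERGE INTO tax_ledger_item_tab t
USING identity_invoice_info_cfv i
ON (    t.identity      = i.identity
    AND t.company       = i.company
    AND t.party_type    = 'CUSTOMER'
    AND i.party_type_db = 'CUSTOMER')
WHEN MATCHED THEN
  UPDATE SET t.delivery_type_id = i.cf$_delivery_type,
             t.fetched          = 'TRUE'
  WHERE t.delivery_type_id IS NULL;

-- Show the plan that was just explained.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```

A single join between the two tables in the plan, rather than repeated lookups, is the sign that the work is being done in one pass.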

Oracle trying to update a table by joining a non indexed table

I looked for a similar example of my problem but could not get any of the solutions to work.
I have 2 tables, Controller and Actions.
The Actions table has the columns Step, Script, Description, Wait_Until and Ref_Code.
The Controller table can only be joined to the Actions table by Ref_Code.
The Actions table cannot have a PK on Ref_Code alone, because each Ref_Code has several rows, one per Step.
I'm getting an error when trying to update the Controller table using a MERGE statement:
ORA-30926: unable to get a stable set of rows in the source tables
My merge statement is as follows:
MERGE INTO DSTETL.SHB_FTPS_CONTROLLER ftpsc
USING (SELECT DISTINCT FTPSC.SESSION_ID,
FTPSC.ORDER_DATE,
sa.step,
sa.next_step,
LAST_ACTION_TMSTMP,
SA.ACTION_SCRIPT,
sa.ref_code,
SA.WAIT_UNTIL
FROM DSTETL.SHB_FTPS_CONTROLLER ftpsc, DSTETL.SHB_ACTIONS sa
WHERE SA.REF_CODE = FTPSC.REF_CODE
AND SA.STEP > ftpsc.curr_step
AND sa.step = ftpsc.next_step) v1
ON (v1.REF_CODE = FTPSC.REF_CODE)
WHEN MATCHED
THEN
UPDATE SET FTPSC.LAST_ACTION_TMSTMP = CURRENT_TIMESTAMP,
ftpsc.next_step = v1.next_step,
ftpsc.curr_step = v1.STEP,
ftpsc.action_script = v1.action_script
WHERE CURRENT_TIMESTAMP >= v1.LAST_ACTION_TMSTMP + v1.WAIT_UNTIL;
COMMIT;
I also tried doing this with a normal update, but I'm getting ORA-01732: data manipulation operation not legal on this view.
UPDATE (SELECT FTPSC.SESSION_ID,
FTPSC.ORDER_DATE,
FTPSC.CURR_STEP,
FTPSC.NEXT_STEP,
FTPSC.ACTION_SCRIPT,
sa.step, --New Step
sa.next_step AS "NNS", --New Next Step
FTPSC.LAST_ACTION_TMSTMP,
SA.ACTION_SCRIPT AS "NAS", --New action script
sa.ref_code,
SA.WAIT_UNTIL
FROM DSTETL.SHB_FTPS_CONTROLLER ftpsc
LEFT JOIN
DSTETL.SHB_ACTIONS sa
ON SA.REF_CODE = FTPSC.REF_CODE
AND SA.STEP > ftpsc.curr_step
AND sa.step = ftpsc.next_step) t
SET t.curr_step = t.step,
t.LAST_ACTION_TMSTMP = CURRENT_TIMESTAMP,
t.next_step = t."NNS",
t.action_script = t."NAS";
COMMIT;
Any advice would be appreciated. I already understand that this happens because the Actions table has multiple rows per Ref_Code, although REF_CODE||STEP is unique. The output of:
SELECT DISTINCT FTPSC.SESSION_ID,
FTPSC.ORDER_DATE,
sa.step,
sa.next_step,
LAST_ACTION_TMSTMP,
SA.ACTION_SCRIPT,
sa.ref_code,
SA.WAIT_UNTIL
FROM DSTETL.SHB_FTPS_CONTROLLER ftpsc, DSTETL.SHB_ACTIONS sa
WHERE SA.REF_CODE = FTPSC.REF_CODE
AND SA.STEP > ftpsc.curr_step
AND sa.step = ftpsc.next_step;
is how I want the Controller table to be updated.
Thanks in advance.
It sounds like what you want to do is: update each row in the Controller table with the matching "next step" details from the Actions table. But your Merge statement is querying the Controller table twice, which confuses things.
Is this what you're trying to do?
MERGE INTO DSTETL.SHB_FTPS_CONTROLLER ftpsc
USING (SELECT
step,
next_step,
ACTION_SCRIPT,
ref_code,
WAIT_UNTIL
FROM DSTETL.SHB_ACTIONS
) sa
ON (sa.REF_CODE = FTPSC.REF_CODE)
WHEN MATCHED
THEN
UPDATE SET FTPSC.LAST_ACTION_TMSTMP = CURRENT_TIMESTAMP,
ftpsc.next_step = sa.next_step,
ftpsc.curr_step = sa.STEP,
ftpsc.action_script = sa.action_script
WHERE CURRENT_TIMESTAMP >= ftpsc.LAST_ACTION_TMSTMP + sa.WAIT_UNTIL
AND SA.STEP > ftpsc.curr_step
AND sa.step = ftpsc.next_step;
EDIT: updated query
EDIT2: So, in your original query, in the USING section you were selecting the rows in the Controller table that you wanted to update... but you never joined those rows to the Controller table from the MERGE INTO section to match them up. Having the same alias "ftpsc" just made it less clear that they're two separate objects in the query, and which one you wanted to update.
Honestly I don't really understand why Oracle won't let you update columns that appear in the USING..ON clause. It apparently works fine in SQL Server.
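ORA-30926 is raised when a single target row is matched by more than one source row. A quick way to see whether that is happening is to count the matching Actions rows per Controller row. This is a diagnostic sketch built from the join conditions in the question, not part of the original answer:

```sql
-- Controller rows that match more than one Actions row under the intended conditions.
SELECT ftpsc.ROWID AS controller_rowid,
       COUNT(*)    AS matching_actions
FROM DSTETL.SHB_FTPS_CONTROLLER ftpsc
JOIN DSTETL.SHB_ACTIONS sa
  ON sa.REF_CODE = ftpsc.REF_CODE
 AND sa.STEP     = ftpsc.NEXT_STEP
 AND sa.STEP     > ftpsc.CURR_STEP
GROUP BY ftpsc.ROWID
HAVING COUNT(*) > 1;
```

If this returns no rows, each Controller row has at most one candidate action, and the instability comes from the join condition used in the MERGE itself (REF_CODE alone) rather than from the data.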

Slowly changing dimensions- SCD1 and SCD2 implementation in Hive [closed]

I am looking for an SCD1 and SCD2 implementation in Hive (1.2.1). I am aware of the workaround used to load SCD1 and SCD2 tables before Hive 0.14; it is described here: http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/
Now that Hive supports ACID operations, I just want to know whether there is a better or more direct way of loading them.
As HDFS is immutable storage it could be argued that versioning data and keeping history (SCD2) should be the default behaviour for loading dimensions. You can create a View in your Hadoop SQL query engine (Hive, Impala, Drill etc.) that retrieves the current state/latest value using windowing functions. You can find out more about dimensional models on Hadoop in my blog post, e.g. how to handle a large dimension and fact table.
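As a rough illustration of the view idea, assume a hypothetical append-only table customer_history with a business key customer_id and a load timestamp loaded_at (none of these names come from the question). A "current state" view built with a windowing function might look like this:

```sql
-- Hypothetical append-only history table: every load adds new versions, nothing is updated.
CREATE VIEW IF NOT EXISTS customer_current AS
SELECT customer_id, name, city, loaded_at
FROM (
  SELECT customer_id, name, city, loaded_at,
         -- rank the versions of each key, newest first
         ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY loaded_at DESC) AS rn
  FROM customer_history
) versions
WHERE rn = 1;  -- keep only the latest version per key
```

Queries that need the SCD1 view of the data read customer_current, while the full SCD2 history stays available in the underlying table.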
Well, I worked around it using two temp tables:
drop table if exists administrator_tmp1;
drop table if exists administrator_tmp2;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
--review_administrator
CREATE TABLE if not exists review_administrator(
admin_id bigint ,
admin_name string,
create_time string,
email string ,
password string,
status_description string,
token string ,
expire_time string ,
granter_user_id bigint ,
admin_time string ,
effect_start_date string ,
effect_end_date string
)
partitioned by (current_row_indicator string comment 'current, expired')
stored as parquet;
--tmp1 is used for saving origin data
CREATE TABLE if not exists administrator_tmp1(
admin_id bigint ,
admin_name string,
create_time string,
email string ,
password string ,
status_description string ,
token string ,
expire_time string ,
granter_user_id bigint ,
admin_time string ,
effect_start_date string ,
effect_end_date string
)
partitioned by (current_row_indicator string comment 'current, expired:')
stored as parquet;
--tmp2 saving the scd data
CREATE TABLE if not exists administrator_tmp2(
admin_id bigint ,
admin_name string,
create_time string,
email string ,
password string ,
status_description string ,
token string ,
expire_time string ,
granter_user_id bigint ,
admin_time string ,
effect_start_date string ,
effect_end_date string
)
partitioned by (current_row_indicator string comment 'current, expired')
stored as parquet;
--insert origin data into tmp1
INSERT OVERWRITE TABLE administrator_tmp1 PARTITION(current_row_indicator)
SELECT
user_id as admin_id,
name as admin_name,
time as create_time,
email as email,
password as password,
status as status_description,
token as token,
expire_time as expire_time,
admin_id as granter_user_id,
admin_time as admin_time,
'{{ ds }}' as effect_start_date,
'9999-12-31' as effect_end_date,
'current' as current_row_indicator
FROM
ks_db_origin.gifshow_administrator_origin
;
--insert scd data into tmp2
--for the data unchanged
INSERT INTO TABLE administrator_tmp2 PARTITION(current_row_indicator)
SELECT
t2.admin_id,
t2.admin_name,
t2.create_time,
t2.email,
t2.password,
t2.status_description,
t2.token,
t2.expire_time,
t2.granter_user_id,
t2.admin_time,
t2.effect_start_date,
t2.effect_end_date as effect_end_date,
t2.current_row_indicator
FROM
administrator_tmp1 t1
INNER JOIN
(
SELECT * FROM review_administrator
WHERE current_row_indicator = 'current'
) t2
ON
t1.admin_id = t2.admin_id
AND t1.admin_name = t2.admin_name
AND t1.create_time = t2.create_time
AND t1.email = t2.email
AND t1.password = t2.password
AND t1.status_description = t2.status_description
AND t1.token = t2.token
AND t1.expire_time = t2.expire_time
AND t1.granter_user_id = t2.granter_user_id
AND t1.admin_time = t2.admin_time
;
--for the data changed , update the effect_end_date
INSERT INTO TABLE administrator_tmp2 PARTITION(current_row_indicator)
SELECT
t2.admin_id,
t2.admin_name,
t2.create_time,
t2.email,
t2.password,
t2.status_description,
t2.token,
t2.expire_time,
t2.granter_user_id,
t2.admin_time,
t2.effect_start_date as effect_start_date,
'{{ yesterday_ds }}' as effect_end_date,
'expired' as current_row_indicator
FROM
administrator_tmp1 t1
INNER JOIN
(
SELECT * FROM review_administrator
WHERE current_row_indicator = 'current'
) t2
ON
t1.admin_id = t2.admin_id
WHERE NOT
(
t1.admin_name = t2.admin_name
AND t1.create_time = t2.create_time
AND t1.email = t2.email
AND t1.password = t2.password
AND t1.status_description = t2.status_description
AND t1.token = t2.token
AND t1.expire_time = t2.expire_time
AND t1.granter_user_id = t2.granter_user_id
AND t1.admin_time = t2.admin_time
)
;
--for the changed data and the new data
INSERT INTO TABLE administrator_tmp2 PARTITION(current_row_indicator)
SELECT
t1.admin_id,
t1.admin_name,
t1.create_time,
t1.email,
t1.password,
t1.status_description,
t1.token,
t1.expire_time,
t1.granter_user_id,
t1.admin_time,
t1.effect_start_date,
t1.effect_end_date,
t1.current_row_indicator
FROM
administrator_tmp1 t1
LEFT OUTER JOIN
(
SELECT * FROM review_administrator
WHERE current_row_indicator = 'current'
) t2
ON
t1.admin_id = t2.admin_id
AND t1.admin_name = t2.admin_name
AND t1.create_time = t2.create_time
AND t1.email = t2.email
AND t1.password = t2.password
AND t1.status_description = t2.status_description
AND t1.token = t2.token
AND t1.expire_time = t2.expire_time
AND t1.granter_user_id = t2.granter_user_id
AND t1.admin_time = t2.admin_time
WHERE t2.admin_id IS NULL
;
--for the data already marked by 'expired'
INSERT INTO TABLE administrator_tmp2 PARTITION(current_row_indicator)
SELECT
t1.admin_id,
t1.admin_name,
t1.create_time,
t1.email,
t1.password,
t1.status_description,
t1.token,
t1.expire_time,
t1.granter_user_id,
t1.admin_time,
t1.effect_start_date,
t1.effect_end_date,
t1.current_row_indicator
FROM
review_administrator t1
WHERE t1.current_row_indicator = 'expired'
;
--populate the dim table
INSERT OVERWRITE TABLE review_administrator PARTITION(current_row_indicator)
SELECT
t1.admin_id,
t1.admin_name,
t1.create_time,
t1.email,
t1.password,
t1.status_description,
t1.token,
t1.expire_time,
t1.granter_user_id,
t1.admin_time,
t1.effect_start_date,
t1.effect_end_date,
t1.current_row_indicator
FROM
administrator_tmp2 t1
;
--drop the two temp table
drop table administrator_tmp1;
drop table administrator_tmp2;
-- --example data
-- --2017-01-01
-- insert into table review_administrator PARTITION(current_row_indicator)
-- SELECT '1','a','2016-12-31','a#ks.com','password','open','token1','2017-12-31',
-- 0,'2017-12-31','2017-01-01','9999-12-31','current'
-- FROM default.sample_07 limit 1;
-- --2017-01-02
-- insert into table administrator_tmp1 PARTITION(current_row_indicator)
-- SELECT '1','a','2016-12-31','a01#ks.com','password','open','token1','2017-12-31',
-- 0,'2017-12-31','2017-01-02','9999-12-31','current'
-- FROM default.sample_07 limit 1;
-- insert into table administrator_tmp1 PARTITION(current_row_indicator)
-- SELECT '2','b','2016-12-31','a#ks.com','password','open','token1','2017-12-31',
-- 0,'2017-12-31','2017-01-02','9999-12-31','current'
-- FROM default.sample_07 limit 1;
-- --2017-01-03
-- --id 1 is changed
-- insert into table administrator_tmp1 PARTITION(current_row_indicator)
-- SELECT '1','a','2016-12-31','a03#ks.com','password','open','token1','2017-12-31',
-- 0,'2017-12-31','2017-01-03','9999-12-31','current'
-- FROM default.sample_07 limit 1;
-- --id 2 is not changed at all
-- insert into table administrator_tmp1 PARTITION(current_row_indicator)
-- SELECT '2','b','2016-12-31','a#ks.com','password','open','token1','2017-12-31',
-- 0,'2017-12-31','2017-01-03','9999-12-31','current'
-- FROM default.sample_07 limit 1;
-- --id 3 is a new record
-- insert into table administrator_tmp1 PARTITION(current_row_indicator)
-- SELECT '3','c','2016-12-31','c#ks.com','password','open','token1','2017-12-31',
-- 0,'2017-12-31','2017-01-03','9999-12-31','current'
-- FROM default.sample_07 limit 1;
-- --now dim table will show you the right SCD.
Here is a detailed implementation of slowly changing dimension type 2 in Hive using the exclusive join approach.
It assumes that the source sends a complete data file, i.e. old, updated and new records.
Steps:
1. Load the recent file data into the STG table.
2. Select all the expired records from the HIST table:
select * from HIST_TAB where exp_dt != '2099-12-31'
3. Select all the records that have not changed, using an inner join of STG and HIST with a filter on HIST.column = STG.column:
select hist.* from HIST_TAB hist
inner join STG_TAB stg
on hist.key = stg.key
where hist.column = stg.column
4. Select all the new and updated records from STG_TAB using an exclusive left join with HIST_TAB, and set their effective and expiry dates:
select stg.*, eff_dt (yyyy-MM-dd), exp_dt (2099-12-31)
from STG_TAB stg
left join
(select * from HIST_TAB where exp_dt = '2099-12-31') hist
on hist.key = stg.key
where hist.key is null
or hist.column != stg.column
5. Select all the updated old records from the HIST table using an exclusive left join with the STG table, and set their expiry date:
select hist.*, exp_dt(yyyy-MM-dd) from
(select * from HIST_TAB where exp_dt = '2099-12-31') hist
left join STG_TAB stg
on hist.key= stg.key
where hist.key is null
or hist.column!= stg.column
6. Union all the queries from steps 2-5 and insert overwrite the result into the HIST table.
A more detailed implementation of SCD type 2 can be found here:
https://github.com/sahilbhange/slowly-changing-dimension
drop table if exists harsha.emp;
drop table if exists harsha.emp_tmp1;
drop table if exists harsha.emp_tmp2;
drop table if exists harsha.init_load;
show databases;
use harsha;
show tables;
create table harsha.emp (eid int,ename string,sal int,loc string,dept int,start_date timestamp,end_date timestamp,current_status string)
comment "emp scd implementation"
row format delimited
fields terminated by ','
lines terminated by '\n'
;
create table harsha.emp_tmp1 (eid int,ename string,sal int,loc string,dept int,start_date timestamp,end_date timestamp,current_status string)
comment "emp scd implementation"
row format delimited
fields terminated by ','
lines terminated by '\n'
;
create table harsha.emp_tmp2 (eid int,ename string,sal int,loc string,dept int,start_date timestamp,end_date timestamp,current_status string)
comment "emp scd implementation"
row format delimited
fields terminated by ','
lines terminated by '\n'
;
create table harsha.init_load (eid int,ename string,sal int,loc string,dept int)
row format delimited
fields terminated by ','
lines terminated by '\n'
;
show tables;
insert into table harsha.emp select 101 as eid,'aaaa' as ename,3400 as sal,'chicago' as loc,10 as did,from_unixtime(unix_timestamp()) as start_date,from_unixtime(unix_timestamp('9999-12-31 23:59:59','yyyy-MM-dd HH:mm:ss')) as end_date,'current' as current_status from (select '123')x;
insert into table harsha.emp select 102 as eid,'abaa' as ename,6400 as sal,'ny' as loc,10 as did,from_unixtime(unix_timestamp()) as start_date,from_unixtime(unix_timestamp('9999-12-31 23:59:59','yyyy-MM-dd HH:mm:ss')) as end_date,'current' as current_status from (select '123')x;
insert into table harsha.emp select 103 as eid,'abca' as ename,2300 as sal,'sfo' as loc,20 as did,from_unixtime(unix_timestamp()) as start_date,from_unixtime(unix_timestamp('9999-12-31 23:59:59','yyyy-MM-dd HH:mm:ss')) as end_date,'current' as current_status from (select '123')x;
insert into table harsha.emp select 104 as eid,'afga' as ename,3000 as sal,'seattle' as loc,10 as did,from_unixtime(unix_timestamp()) as start_date,from_unixtime(unix_timestamp('9999-12-31 23:59:59','yyyy-MM-dd HH:mm:ss')) as end_date,'current' as current_status from (select '123')x;
insert into table harsha.emp select 105 as eid,'ikaa' as ename,1400 as sal,'LA' as loc,30 as did,from_unixtime(unix_timestamp()) as start_date,from_unixtime(unix_timestamp('9999-12-31 23:59:59','yyyy-MM-dd HH:mm:ss')) as end_date,'current' as current_status from (select '123')x;
insert into table harsha.emp select 106 as eid,'cccc' as ename,3499 as sal,'spokane' as loc,20 as did,from_unixtime(unix_timestamp()) as start_date,from_unixtime(unix_timestamp('9999-12-31 23:59:59','yyyy-MM-dd HH:mm:ss')) as end_date,'current' as current_status from (select '123')x;
insert into table harsha.emp select 107 as eid,'toiz' as ename,4000 as sal,'WA.DC' as loc,40 as did,from_unixtime(unix_timestamp()) as start_date,from_unixtime(unix_timestamp('9999-12-31 23:59:59','yyyy-MM-dd HH:mm:ss')) as end_date,'current' as current_status from (select '123')x;
load data local inpath 'Documents/hadoop_scripts/t3.txt' into table harsha.emp;
load data local inpath 'Documents/hadoop_scripts/t4.txt' into table harsha.init_load;
insert into table harsha.emp_tmp1 select eid,ename,sal,loc,dept,from_unixtime(unix_timestamp()) as start_date,from_unixtime(unix_timestamp('9999-12-31 23:59:59','yyyy-MM-dd HH:mm:ss')) as end_date,'current' as current_status
from harsha.init_load;
insert into table harsha.emp_tmp2
select a.eid,a.ename,a.sal,a.loc,a.dept,from_unixtime(unix_timestamp()) as start_date,from_unixtime(unix_timestamp('9999-12-31 23:59:59','yyyy-MM-dd HH:mm:ss')) as end_date,'updated' as current_status from emp_tmp1 a
left outer join emp b on
a.eid=b.eid and
a.ename=b.ename and
a.sal=b.sal and
a.loc = b.loc and
a.dept = b.dept
where b.eid is null
union all
select a.eid,a.ename,a.sal,a.loc,a.dept,from_unixtime(unix_timestamp()) as start_date,from_unixtime(unix_timestamp('9999-12-31 23:59:59','yyyy-MM-dd HH:mm:ss')) as end_date,'current' as current_status from emp_tmp1 a
left outer join emp b on
a.eid = b.eid and
a.ename=b.ename and
a.sal=b.sal and
a.loc=b.loc and
a.dept=b.dept
where b.eid is not null
union all
select b.eid,b.ename,b.sal,b.loc,b.dept,b.start_date as start_date,from_unixtime(unix_timestamp()) as end_date,'expired' as current_status from emp b
inner join emp_tmp1 a on
a.eid=b.eid
where
a.ename <> b.ename or
a.sal <> b.sal or
a.loc <> b.loc or
a.dept <> b.dept
;
insert into table harsha.emp select eid,ename,sal,loc,dept,start_date,end_date,current_status from emp_tmp2;
All records, including expired ones:
select * from harsha.emp order by eid;
Latest records:
select a.* from emp a inner join (select eid ,max(start_date) as start_date from emp where current_status <> 'expired' group by eid) b on a.eid=b.eid and a.start_date=b.start_date;
I used another approach for managing data with SCDs:
Never update data that already exists in your historical file or table.
Compare new rows to the most recent generation. To make that possible, the load logic adds control columns: loaded_on, checksum and, if multiple loads can occur on the same day, a sequence column. Comparing new data to the most recent generation then uses these control columns plus a key column that already exists in your data, such as a customer or product key.
The magic happens when you compute the checksum over all columns except the control columns, which creates a fingerprint for each row. The fingerprint (checksum) column is then used to determine whether any column has changed compared to the most recent generation (the latest state of the data for a key, ordered by loaded_on and sequence).
You now know which of three cases a row from your daily update falls into: it is new, because there is no previous generation; it has changed, so it needs a new row (a new generation) in your historical file or table; or it is unchanged, so no row needs to be created.
This logic can be built with Apache Spark: in a single statement you can ask Spark to concatenate any number of columns of any data type and then compute a hash value that serves as the fingerprint.
Put together, you can develop a Spark-based utility that accepts any data source and outputs a well-organized, clean, SCD-aware historical file or table. And last: never update, append only!
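The checksum comparison described above can be sketched in SQL. The example assumes hypothetical tables stg_customer (today's load) and hist_customer (the append-only history with the control columns loaded_on, load_seq and checksum), and it assumes an md5 function is available, as in Spark SQL or Hive 1.3.0 and later. It returns only the rows that need a new generation:

```sql
SELECT s.customer_key, s.name, s.city, s.checksum
FROM (
  -- Fingerprint every incoming row over its data columns only.
  -- COALESCE keeps a NULL from silently dropping out of the concatenation.
  SELECT customer_key, name, city,
         md5(concat_ws('|', COALESCE(name, ''), COALESCE(city, ''))) AS checksum
  FROM stg_customer
) s
LEFT JOIN (
  -- Most recent generation per key, by load date and then sequence.
  SELECT customer_key, checksum
  FROM (
    SELECT customer_key, checksum,
           ROW_NUMBER() OVER (PARTITION BY customer_key
                              ORDER BY loaded_on DESC, load_seq DESC) AS rn
    FROM hist_customer
  ) ranked
  WHERE rn = 1
) h
  ON s.customer_key = h.customer_key
WHERE h.customer_key IS NULL      -- new key: no previous generation
   OR h.checksum <> s.checksum;   -- changed row: fingerprints differ
```

Unchanged rows fall out of the result, so appending this result to hist_customer with the current loaded_on value keeps the history append-only.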

Resources