I am trying to figure out what are the best options to perform archive and purge given our situation.
We have roughly 50 million records in say Table A. We want to archive data into a target table and then purge those data in the source table. We would like to retain the data base on several criteria that overlap with each other. For example, we want to retain the data from the past 5 months in addition to keeping all the records with say Indicator='True'. Indicator='True' will likely return records beyond 5 months. This means I have to use OR condition in order to capture the data. Base on the conditions, we would need to retain 10 million records and archive/purge 40 million records. I would need to create a process that will run every 6 months to do this.
My question is, what are the most efficient options for me to get this done for both archiving and purging? Would a PROC/bulk delete/insert be my best option?
Partition seems to be out of the question since there are several conditions that overlap with each other.
Use composite partitioning, e.g. range (for your time dimension) and list (to distinct between the rows that should be kept long and limited time.
Example
The rows with KEEP_ID='N' should be eliminated after 5 months.
CREATE TABLE tab
( id NUMBER(38,0),
trans_dt DATE,
keep_id VARCHAR2(1)
)
PARTITION BY RANGE (trans_dt) INTERVAL (NUMTOYMINTERVAL(1,'MONTH'))
SUBPARTITION BY LIST (keep_id)
SUBPARTITION TEMPLATE
( SUBPARTITION p_catalog VALUES ('Y'),
SUBPARTITION p_internet VALUES ('N')
)
(PARTITION p_init VALUES LESS THAN (TO_DATE('01-JAN-2019','dd-MON-yyyy'))
);
Populate with sample data for 6 months
insert into tab (id, trans_dt, keep_id)
select rownum, add_months(date'2019-08-01', trunc((rownum-1) / 2)), decode(mod(rownum,2),0,'Y','N')
from dual connect by level <= 12;
select * from tab
order by trans_dt, keep_id;
ID TRANS_DT KEEP_ID
---------- ------------------- -------
1 01.08.2019 00:00:00 N --- this subpartition should be deleted
2 01.08.2019 00:00:00 Y
3 01.09.2019 00:00:00 N
4 01.09.2019 00:00:00 Y
5 01.10.2019 00:00:00 N
6 01.10.2019 00:00:00 Y
7 01.11.2019 00:00:00 N
8 01.11.2019 00:00:00 Y
9 01.12.2019 00:00:00 N
10 01.12.2019 00:00:00 Y
11 01.01.2020 00:00:00 N
12 01.01.2020 00:00:00 Y
Now use partition extended names to reference the subpartition that should be dropped.
Drop subpartition older than 5 months, but only for KEEP_ID = 'N'
alter table tab drop subpartition for (DATE'2019-08-01','N');
New data
ID TRANS_DT KEEP_ID
---------- ------------------- -------
2 01.08.2019 00:00:00 Y
3 01.09.2019 00:00:00 N
4 01.09.2019 00:00:00 Y
.....
Related
I need to figure out a query that will compare two EFFECTIVE dates for a given patient number with different HMOs and determine which is the later date of the two and then populate a TERMINATION date field for only the older of the two effective dates with the last day of the previous month of the newer effective date of the two. This needs to be done across multiple patient, HMO, effective date combinations in a table.
SELECT * FROM tablename
The output is this:
HMO PATIENT EFFECTIVE TERMINATION
16 221135 01-APR-18
18 221135 01-OCT-17
12 251181 01-SEP-16
16 251181 01-MAR-15
12 271126 01-MAR-15
16 271126 01-DEC-16
12 291141 01-DEC-16
16 291141 01-FEB-19
12 391134 09-MAY-13
16 391134 01-APR-18
What I am trying to do via a query or queries is this:
HMO PATIENT EFFECTIVE TERMINATION
16 221235 01-APR-18
18 221235 01-OCT-17 3/31/2018
12 251381 01-SEP-16
16 251381 01-MAR-15 8/31/2016
12 2711126 01-MAR-15 11/30/2016
16 2711126 01-DEC-16
12 292241 01-DEC-16 1/31/2019
16 292241 01-FEB-19
12 391534 09-MAY-13 31-MAR-19
16 391534 01-APR-18
I've tried using a case statement but it is unsurprisingly creating four rows per patient, hmo combo and populating two of the rows with dates and leaving two blank:
SELECT DISTINCT
S.HMO
,S.PATIENT
,S.EFFECTIVE
,CASE WHEN S.EFFECTIVE > E.EFFECTIVE THEN LAST_DAY(ADD_MONTHS(S.EFFECTIVE, -1))
WHEN S.EFFECTIVE < E.EFFECTIVE THEN LAST_DAY(ADD_MONTHS(E.EFFECTIVE, -1))
ELSE NULL END AS TERMINATION
FROM tablename S INNER JOIN tablename E ON S.PATIENT=E.PATIENT
WHERE S.PATIENT =221135
Any ideas or advice would be welcome.
With sample data you posted:
SQL> select * from tablename order by patient, effective;
HMO PATIENT EFFECTIVE TERMINATIO
---------- ---------- ---------- ----------
18 221135 10/01/2017
16 221135 04/01/2018
16 251181 03/01/2015
12 251181 09/01/2016
12 271126 03/01/2015
16 271126 12/01/2016
6 rows selected.
such a MERGE might do:
SQL> merge into tablename a
2 using (select patient, max(effective) max_effective,
3 min(effective) min_effective
4 from tablename
5 group by patient
6 ) x
7 on (a.patient = x.patient)
8 when matched then update set
9 a.termination = x.max_effective - 1
10 where a.effective = x.min_effective;
3 rows merged.
Result is then
SQL> select * from tablename order by patient, effective;
HMO PATIENT EFFECTIVE TERMINATIO
---------- ---------- ---------- ----------
18 221135 10/01/2017 03/31/2018
16 221135 04/01/2018
16 251181 03/01/2015 08/31/2016
12 251181 09/01/2016
12 271126 03/01/2015 11/30/2016
16 271126 12/01/2016
6 rows selected.
SQL>
I'm still fairly new to PL/SQL and I tried to solve the following problem for a long time on my own, but I can not get to a solution (and i did not find a similar problem solved) hence I thought I ask myself.
I am trying to sort a list by the date (which indicates the 'time_left_to_live') so that the result is the same list but, as you can guess, sorted by the date.
EDIT: Maybe i should be even more precise:
My original goal is to write an AFTER UPDATE TRIGGER, which sorts the table again if the time_left_to_live is updated. My idea was to write a Procedure which sorts (or more updates) the list and call it.
Example for clarification:
UPDATE testv1 SET time_left_to_live = '01.05.2020' WHERE list_id = 3;
so after that the old data of list_id 3 should be 1 since it has the shortest amount left to live and the other datas should be incremented by 1.
1 01.05.2020
2 03.05.2020
3 31.05.2020
I hope I could explain the situation somewhat good.
I tried it with a nested loop but I simply can't think inside. Any help is appreciated.
My table:
DROP TABLE testv1;
DROP PROCEDURE wlp_sort;
CREATE TABLE testv1(
list_ID INT,
time_left_to_live DATE
);
INSERT INTO testv1 VALUES (3, '01.06.2020');
INSERT INTO testv1 VALUES (2, '31.05.2020');
INSERT INTO testv1 VALUES (1, '03.05.2020');
--UPDATE testv1 SET time_left_to_live = '01.05.2020' WHERE list_ID = 2;
SELECT * FROM testv1;
I'd suggest you not to do that. I presume that table you posted as an example is exactly that - an example. In reality, it is more complex. Furthermore, I presume that list_id column identifies each row which is OK, but wrong to be used for sorting purposes, especially not the way you wanted - by updating its value via database triggers.
What should you do, then? Use ORDER BY as it is the only mechanism that guarantees that rows will be returned in desired order. A long time ago, I think the last Oracle database version was 8i, you could use group by clause which in the background sorted result for you, but those times have passed so - order by it is.
Here's an example of what I mean.
This is the original contents of your table. Query includes additional column - the one you might want to use - row_number analytic function which "calculates" ordinal number on-the-fly. At first, it matches list_id value.
SQL> select list_id,
2 time_left_to_live,
3 row_number() over (order by time_left_to_live) rn
4 from testv1
5 order by time_left_to_live;
LIST_ID TIME_LEFT_ RN
---------- ---------- ----------
1 03.05.2020 1
2 31.05.2020 2
3 01.06.2020 3
SQL>
Let's update one row (also from your example) and see what happens (note that I used date literal; you updated a date column with a string. Oracle did implicitly try - and succeed - to convert it to a valid date value, but you shouldn't rely on that. 01.05.2020 could be 1st of May or 5th of January, depending on date format which may change and might be different in different databases. Date literal is, on the other hand, always in format date 'yyyy-mm-dd' and there's no confusion):
SQL> update testv1 set time_left_to_live = date '2020-05-01' where list_id = 2;
1 row updated.
SQL> select list_id,
2 time_left_to_live,
3 row_number() over (order by time_left_to_live) rn
4 from testv1
5 order by time_left_to_live;
LIST_ID TIME_LEFT_ RN
---------- ---------- ----------
2 01.05.2020 1
1 03.05.2020 2
3 01.06.2020 3
SQL>
ORDER BY clause sorts data, but now list_id and rn are different; list_id didn't change, but rn represents new order.
If your next step is to do something with a row whose ordinal number is 1, you'd just use query I suggested as an inline view and fetch values whose rn = 1:
SQL> select list_id,
2 time_left_to_live
3 from (select list_id,
4 time_left_to_live,
5 row_number() over (order by time_left_to_live) rn
6 from testv1
7 )
8 where rn = 1;
LIST_ID TIME_LEFT_
---------- ----------
2 01.05.2020
SQL>
I suggest you use this option.
Additional drawbacks related to database trigger: if you write an update statement that modifies list_id, no problem - it works outside of trigger and list_id and rn are synchronized again:
SQL> update testv1 a set
2 a.list_id = (select x.rn
3 from (select b.list_id,
4 b.time_left_to_live,
5 row_number() over (order by b.time_left_to_live) rn
6 from testv1 b
7 ) x
8 where x.list_id = a.list_id
9 );
3 rows updated.
SQL> select list_id,
2 time_left_to_live,
3 row_number() over (order by time_left_to_live) rn
4 from testv1
5 order by time_left_to_live;
LIST_ID TIME_LEFT_ RN
---------- ---------- ----------
1 01.05.2020 1
2 03.05.2020 2
3 01.06.2020 3
SQL>
But, in a trigger, you modify a column which fires a trigger which modifies a column which fires a trigger which modifies a column ... until you exceed maximum number of recursive SQL levels and then Oracle raises an error.
Or, if you planned to select from the table and then do something with it, you'll hit the mutating table error as you can't select from a table which is just being modified. True, you might use a compound trigger (or - in previous Oracle database versions - package and custom type), but - once again, in my opinion, that's just not the way you should handle the problem.
Agree fully with #Littlefoot. But even if you could build a user written sort in a trigger there is NO quarantine that order is preserved when written. A table is by definition an unordered set of rows and the SQL engine in under no obligation to maintain any ordering of presented rows. The way to guarantee any sequence is the order by clause
I am looking for a possibly better approach to this.
I have created a temp table in Oracle 11.2 that I'm using to pre calculate values that I will need in other selects instead of always generating them again with each select.
create global temporary table temp_foo (
DT timestamp(6), --only the date part will be used in this example but for later things I will need the time
Something varchar2(100),
Customer varchar2(100),
MinDate timestamp(6),
MaxDate timestamp(6),
Filecount int,
Errorcount int,
AvgFilecount int,
constraint PK_foo primary key (DT, Customer)
) on commit preserve rows;
I then first insert some fixed values for everything except AvgFilecount. AvgFilecount should contain the average for the Filecount for the 3 previous records (going by the date in DT). It doesn’t matter that the result will be converted to an int, I don’t need the decimal places
DT | Customer | Filecount | AvgFilecount
2019-04-30 | x | 10 | avg(2+3+9)
2019-04-29 | x | 2 | based on values before this
2019-04-28 | x | 3 | based on values before this
2019-04-27 | x | 9 | based on values before this
I thought about using a normal UPDATE statement as this should be faster than looping through the values. I should mention that there are no gaps in the DT field but obviously there is a first one where I won‘t find any previous records. If I would loop through, I could easily calculate AvgFilecount with (the record before previous record/2 + previous record)/3 which I cannot with UPDATE as I cannot guarantee the order of how they are executed. So I‘m fine with just taking the last 3 records (going by DT) and calcuting it from there.
What I thought would be an easy update is giving me headaches. I‘m mostly doing SQL Server where I would just join the 3 other records but it seems is a bit different in Oracle. I have found https://stackoverflow.com/a/2446834/4040068 and wanted to use the second approach in the answer.
update
(select curr.DT, curr.temp_foo, curr.Filecount, curr.AvgFilecount as OLD, (coalesce(Minus1.Filecount, 0) + coalesce(Minus2.Filecount, 0) + coalesce(Minus3.Filecount, 0)) / 3 as NEW
from temp_foo curr
left join temp_foo Minus1 ON Minus1.Customer = curr.Customer and trunc(Minus1.DT) = trunc(curr.DT-1)
left join temp_foo Minus2 ON Minus2.Customer = curr.Customer and trunc(Minus2.DT) = trunc(curr.DT-2)
left join temp_foo Minus3 ON Minus3.Customer = curr.Customer and trunc(Minus3.DT) = curr.DT-3
order by 1, 2
)
set OLD = NEW;
Which gives me an
ORA-01779: cannot modify a column which maps to a non key-preserved
table
01779. 00000 - "cannot modify a column which maps to a non key-preserved table"
*Cause: An attempt was made to insert or update columns of a join view which
map to a non-key-preserved table.
*Action: Modify the underlying base tables directly.
I thought this should work as both join conditions are in the primary key and thus unique. I am currently implementing the first approach in the above mentioned answer but it is getting quite big and it feels like there should be a better solution to this.
Other things I thought about trying:
using a nested subselect (nested because Oracle doesn’t know top(n) and I need to sort the subselect) to select the previous 3 records ordered by DT and then he outer select with rownum <=3 and then I could just use AVG(). However, I was told subselect can be quite slow and joins are better in Oracle performance wise. Dunno if that is really the case, haven‘t done any testing
Edit: My insert right now looks like this. I am already aggregating the Filecount for a day as there can be multiple records per DT per Customer per Something.
insert into temp_foo (DT, Something, Customer, Filecount)
select dates.DT, tbl1.Something, tbl1.Customer, coalesce(sum(tbl3.Filecount),0)
from table(Function_Returning_Daterange(NULL, NULL)) dates
cross join
(SELECT Something,
Code,
Value
FROM Table2 tbl2
WHERE (Something = 'Value')) tbl1
left outer join Table3 tbl3
on tbl3.Customer = tbl1.Customer
and trunc(tbl3.MinDate) = trunc(dates.DT)
group by dates.DT, tbl1.Something, tbl1.Customer;
You could use an analytic average with a window clause:
select dt, customer, filecount,
avg(filecount) over (partition by customer order by dt
rows between 3 preceding and 1 preceding) as avgfilecount
from tmp_foo
order by dt desc;
DT CUSTOMER FILECOUNT AVGFILECOUNT
---------- -------- ---------- ------------
2019-04-30 x 10 4.66666667
2019-04-29 x 2 6
2019-04-28 x 3 9
2019-04-27 x 9
and then do the update part with a merge statement:
merge into tmp_foo t
using (
select dt, customer,
avg(filecount) over (partition by customer order by dt
rows between 3 preceding and 1 preceding) as avgfilecount
from tmp_foo
) s
on (s.dt = t.dt and s.customer = t.customer)
when matched then update set t.avgfilecount = s.avgfilecount;
4 rows merged.
select dt, customer, filecount, avgfilecount
from tmp_foo
order by dt desc;
DT CUSTOMER FILECOUNT AVGFILECOUNT
---------- -------- ---------- ------------
2019-04-30 x 10 4.66666667
2019-04-29 x 2 6
2019-04-28 x 3 9
2019-04-27 x 9
You haven't shown your original insert statement; it might be possible to add the analytic calculation to that, and avoid the separate update step.
Also, if you want the first two date values to be calculated as if the 'missing' extra days before them had zero counts, you could use sum and division instead of avg:
select dt, customer, filecount,
sum(filecount) over (partition by customer order by dt
rows between 3 preceding and 1 preceding)/3 as avgfilecount
from tmp_foo
order by dt desc;
DT CUSTOMER FILECOUNT AVGFILECOUNT
---------- -------- ---------- ------------
2019-04-30 x 10 4.66666667
2019-04-29 x 2 4
2019-04-28 x 3 3
2019-04-27 x 9
It depends what you expect those last calculated values to be.
I need to pull humongous amount of data, say 600-700 variables from different tables in a data warehouse...now the dataset in its raw form will easily touch 150 gigs - 79 MM rows and for my analysis purpose I need only a million rows...how can I pull data using proc sql directly from warehouse by doing simple random sampling on the rows.
Below code wont work as ranuni is not supported by oracle
proc sql outobs =1000000;
select * from connection to oracle(
select * from tbl1 order by ranuni(12345);
quit;
How do you propose I do it
Use the DBMS_RANDOM Package to Sort Records and Then Use A Row Limiting Clause to Restrict to the Desired Sample Size
The dbms_random.value function obtains a random number between 0 and 1 for all rows in the table and we sort in ascending order of the random value.
Here is how to produce the sample set you identified:
SELECT
*
FROM
(
SELECT
*
FROM
tbl1
ORDER BY dbms_random.value
)
FETCH FIRST 1000000 ROWS ONLY;
To demonstrate with the sample schema table, emp, we sample 4 records:
SCOTT#DEV> SELECT
2 empno,
3 rnd_val
4 FROM
5 (
6 SELECT
7 empno,
8 dbms_random.value rnd_val
9 FROM
10 emp
11 ORDER BY rnd_val
12 )
13 FETCH FIRST 4 ROWS ONLY;
EMPNO RND_VAL
7698 0.06857749035643605682648168347885993709
7934 0.07529612360785920635181751566833986766
7902 0.13618520865865754766175030040204331697
7654 0.14056380246495282237607922497308953768
SCOTT#DEV> SELECT
2 empno,
3 rnd_val
4 FROM
5 (
6 SELECT
7 empno,
8 dbms_random.value rnd_val
9 FROM
10 emp
11 ORDER BY rnd_val
12 )
13 FETCH FIRST 4 ROWS ONLY;
EMPNO RND_VAL
7839 0.00430658806761508024693197916281775492
7499 0.02188116061148367312927392115186317884
7782 0.10606515700372416131060633064729870016
7788 0.27865276349549877512032787966777990909
With the example above, notice that the empno changes significantly during the execution of the SQL*Plus command.
The performance might be an issue with the row counts you are describing.
EDIT:
With table sizes in the order of 150 gigs - 79 MM, any sorting would be painful.
If the table had a surrogate key based on a sequence incremented by 1, we could take the approach of selecting every nth record based on the key.
e.g.
--scenario n = 3000
FROM
tbl1
WHERE
mod(table_id, 3000) = 0;
This approach would not use an index (unless a function based index is created), but at least we are not performing a sort on a data set of this size.
I performed an explain plan with a table that has close to 80 million records and it does perform a full table scan (the condition forces this without a function based index) but this looks tenable.
None of the answers posted or comments helped my cause, it could but we have 87 MM rows
Now I wanted the answer with the help of sas: here is what I did: and it works. Thanks all!
libname dwh path username pwd;
proc sql;
create table sample as
(select
<all the variables>, ranuni(any arbitrary seed)
from dwh.<all the tables>
<bunch of where conditions goes here>);
quit);
Im puzzle as to how to build my fact and dimensions to procude the following results:
I want to count the number of occurences of logged people for each time interval.
In this case every 30 mins. It would look like this
Example: Person1 login at 10:05:00 and logout at 12:10:00
Person2 login at 10:45:00 and logout at 11:25:00
Person3 login at 11:05:00 and logout at 14:01:00
TimeStart TimeEnd People logged
00:00:00 00:30:00 0
00:30:00 01:00:00 0
...
10:00:00 10:30:00 1
10:30:00 11:00:00 2
11:00:00 11:30:00 3
11:30:00 12:00:00 2
12:00:00 12:30:00 2
12:30:00 13:00:00 1
13:00:00 13:30:00 1
13:30:00 14:00:00 1
14:00:00 14:30:00 0
...
23:30:00 00:00:00 0
So i have a DimTime and DimDate table that contain hour, halfhour, quarterhour
and i have a FactTimestamp table that has the following:
DateLoginID that points to DimDate dateID
DateLogoutID that points to DimDate dateID
TimeLoginID that points to DimTime timeID
TimeLogoutID that points to DimTime timeID
I'd like to know what kind of cube design i would need to achieve that?
Ive done it in sql if that can help:
--Create tmp table for time interval
CREATE TABLE #tmp(
StartRange time(0),
EndRange time(0),
);
--Interval set to 30 minutes
DECLARE #Interval int = 30
-- Example with #Date = 2017-07-27: Set starttime at 2017-07-27 00:00:00
DECLARE #StartTime datetime = DATEADD(HOUR,0, #Date)
--Set endtime at 2017-07-27 23:59:59
DECLARE #EndTime datetime = DATEADD(SECOND,59,DATEADD(MINUTE,59,DATEADD(HOUR,23, #Date)))
--Populate tmp table with the time interval. from midnight to 23:59:59
;WITH cSequence AS
(
SELECT
#StartTime AS StartRange,
DATEADD(MINUTE, #Interval, #StartTime) AS EndRange
UNION ALL
SELECT
EndRange,
DATEADD(MINUTE, #Interval, EndRange)
FROM cSequence
WHERE DATEADD(MINUTE, #Interval, EndRange) <= #EndTime
)
INSERT INTO #tmp SELECT cast(StartRange as time(0)),cast(EndRange as time(0)) FROM cSequence OPTION (MAXRECURSION 0);
--Insert last record 23:30:00 to 23:59:59
INSERT INTO #tmp (StartRange, EndRange) values ('23:30:00','23:59:59');
SELECT tmp.StartRange as [Interval], COUNT(ts.TimeIn) as [Operators]
FROM #tmp tmp
JOIN Timestamp ts ON
--If timeIn is earlier than StartRange OR within the start/end range
(CAST(ts.TimeIn as time(0)) <= tmp.StartRange OR CAST(ts.TimeIn as time(0)) BETWEEN tmp.StartRange AND tmp.EndRange)
AND
--AND If timeOut is later than EndRange OR within the start/end range
CAST(ts.[TimeOut] as time(0)) >= tmp.EndRange OR CAST(ts.[TimeOut] as time(0)) BETWEEN tmp.StartRange AND tmp.EndRange
GROUP BY tmp.StartRange, tmp.EndRange
END
Really any kind of hint as to how to achieve it in mdx would be greatly appreciated.
Honestly, I wouldn't do it in MDX against that table structure. Even if you succeed in getting an MDX query that returns that value, and surely it can be done, it will most likely be tremendously complex and hard to maintain and debug, and will probably require multiple passes on the fact table to get the numbers, hurting performance.
I think this is a clear cut case for a periodic snapshot table. Pick your granularity, but even at 1 min snapshots you get 1440 points of data per day for each tuple of all other dimensions. If your login/logout table is large you may need to decrease this to keep its size manageable. In the end, you get a table with time_id, count_of_logins, and whatever other keys you need to other dimensions, and the query you need is just a filter on which time periods you want (give me all hours of the day, but filter on only minutes 00 and 30 of each hour) and the count of total number of logged in users is trivial.