Retrieving a large dataset from a very large table - Oracle

I have a very large table in oracle that contains 140+ million rows. Currently we are doing three full table scans on this table nightly, and using some of the results to populate a tmp table. That tmp table is then turned into a very large report (usually 140K + lines).
The big table is called tasklog and has the following structure:
tasklog_id (number) - PK
document_id (number)
date_time_in (date)
+ a few more columns that aren't relevant
There are millions of different document ids, each repeated between 1 and several hundred times; date_time_in is the time this entry was put into the database.
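For reference, a minimal DDL sketch of the relevant part of the table, using just the columns and types described above (everything beyond that is omitted):
create table tasklog (
    tasklog_id   number primary key,   -- PK
    document_id  number,
    date_time_in date                  -- time the row was inserted
    -- ... plus the few extra columns that aren't relevant here
);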
All of the full table scans look like this:
DECLARE
  n_prevdocid number;
  cursor tasks is
    select *
    from tasklog
    order by document_id, date_time_in DESC;
BEGIN
  for tk in tasks
  loop
    if n_prevdocid <> tk.document_id then
      -- *code snipped*
    end if;
    n_prevdocid := tk.document_id;
  end loop;
END;
/
So my question: is there a quick(ish) way to get a distinct list of document_ids, each with the row having the most recent date_time_in? This could dramatically speed up the whole thing. Or can anyone think of a better way of retrieving this data daily?
Things that may be relevant: this table only ever has rows inserted with the current date/time. It is not range partitioned, but I can't see how that might help me. No rows are ever updated or deleted. There are about 70k - 80k rows inserted daily.

I don't think that you're going to get away from doing at least one full table scan, as the only way that it would be efficient would be if the ratio of distinct document_ids to total records was pretty small. The clustering on document_id is going to be very poor due to the way that the data is generated and inserted.
How about:
create table tmp nologging compress -- or pctfree 0
as
select ...
from (
select t.*,
max(date_time_in) over (partition by document_id) max_date_time_in
from tasklog t)
where date_time_in = max_date_time_in
Possibly, having created this once, you could then optimise further refreshes by merging into this set only the newer records. Something like ...
merge into tmp
using (
select ...
from (
select t.*,
max(date_time_in) over (partition by document_id) max_date_time_in
from tasklog t
where date_time_in > (select max(date_time_in) from tmp))
where date_time_in = max_date_time_in)
on ... blah blah
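Spelled out in full, that merge might look roughly like the following. This is only a sketch: it assumes tmp keeps one row per document_id and that the only columns carried over are document_id and date_time_in (swap in whatever the snipped select list actually contains):
merge into tmp d
using (
    select document_id, date_time_in          -- plus whatever other columns tmp keeps
    from (
        select tl.*,
               max(date_time_in) over (partition by document_id) max_date_time_in
        from   tasklog tl
        where  date_time_in > (select max(date_time_in) from tmp))
    where date_time_in = max_date_time_in
) s
on (d.document_id = s.document_id)
when matched then update set
    d.date_time_in = s.date_time_in           -- and any other kept columns
when not matched then insert (document_id, date_time_in)
    values (s.document_id, s.date_time_in);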

Have you tried:
select document_id
from tasklog t1
where date_time_in = (select max(date_time_in)
from tasklog t2
where t1.document_id=t2.document_id)

You can do something like this:
select document_id, max(date_time_in) as date_time_in
from tasklog
group by document_id
order by 2 desc;
This retrieves each distinct document_id together with its latest date_time_in.


Delete duplicate rows from a BigQuery table

I have a table with >1M rows of data and 20+ columns.
Within my table (tableX) I have identified duplicate records (~80k) in one particular column (troubleColumn).
If possible I would like to retain the original table name and remove the duplicate records from my problematic column; otherwise I could create a new table (tableXfinal) with the same schema but without the duplicates.
I am not proficient in SQL or any other programming language so please excuse my ignorance.
delete from Accidents.CleanedFilledCombined
where Fixed_Accident_Index in (
    select Fixed_Accident_Index
    from Accidents.CleanedFilledCombined
    group by Fixed_Accident_Index
    having count(Fixed_Accident_Index) > 1);
You can remove duplicates by running a query that rewrites your table (you can use the same table as the destination, or you can create a new table, verify that it has what you want, and then copy it over the old table).
A query that should work is here:
SELECT *
FROM (
SELECT
*,
ROW_NUMBER()
OVER (PARTITION BY Fixed_Accident_Index)
row_number
FROM Accidents.CleanedFilledCombined
)
WHERE row_number = 1
UPDATE 2019: To de-duplicate rows on a single partition with a MERGE, see:
https://stackoverflow.com/a/57900778/132438
An alternative to Jordan's answer - this one scales better when having too many duplicates:
#standardSQL
SELECT event.* FROM (
SELECT ARRAY_AGG(
t ORDER BY t.created_at DESC LIMIT 1
)[OFFSET(0)] event
FROM `githubarchive.month.201706` t
# GROUP BY the id you are de-duplicating by
GROUP BY actor.id
)
Or a shorter version (takes any row, instead of the newest one):
SELECT k.*
FROM (
SELECT ARRAY_AGG(x LIMIT 1)[OFFSET(0)] k
FROM `fh-bigquery.reddit_comments.2017_01` x
GROUP BY id
)
To de-duplicate rows on an existing table:
CREATE OR REPLACE TABLE `deleting.deduplicating_table`
AS
# SELECT id FROM UNNEST([1,1,1,2,2]) id
SELECT k.*
FROM (
SELECT ARRAY_AGG(row LIMIT 1)[OFFSET(0)] k
FROM `deleting.deduplicating_table` row
GROUP BY id
)
Not sure why nobody mentioned DISTINCT query.
Here is the way to clean duplicate rows:
CREATE OR REPLACE TABLE project.dataset.table
AS
SELECT DISTINCT * FROM project.dataset.table
If your schema doesn't have any RECORD (nested) fields - the below variation of Jordan's answer will work well enough, writing over the same table or a new one, etc.
SELECT <list of original fields>
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY Fixed_Accident_Index) AS pos
FROM Accidents.CleanedFilledCombined
)
WHERE pos = 1
In a more generic case - with a complex schema containing RECORD/nested fields, etc. - the above approach can be a challenge.
I would propose trying the Tabledata: insertAll API with rows[].insertId set to the respective Fixed_Accident_Index for each row.
In that case duplicate rows will be eliminated by BigQuery.
Of course, this will involve some client-side coding - so it might not be relevant for this particular question.
I haven't tried this approach myself either, but I feel it might be interesting to try :o)
If you have a large partitioned table and only have duplicates in a certain partition range, you don't want to scan or process the whole table. Use the MERGE SQL below with predicates on the partition range:
-- WARNING: back up the table before this operation
-- FOR large size timestamp partitioned table
-- -------------------------------------------
-- -- To de-duplicate rows in a given range of a partitioned table, using surrogate_key as the unique id
-- -------------------------------------------
DECLARE dt_start DEFAULT TIMESTAMP("2019-09-17T00:00:00", "America/Los_Angeles") ;
DECLARE dt_end DEFAULT TIMESTAMP("2019-09-22T00:00:00", "America/Los_Angeles");
MERGE INTO `gcp_project`.`data_set`.`the_table` AS INTERNAL_DEST
USING (
SELECT k.*
FROM (
SELECT ARRAY_AGG(original_data LIMIT 1)[OFFSET(0)] k
FROM `gcp_project`.`data_set`.`the_table` AS original_data
WHERE stamp BETWEEN dt_start AND dt_end
GROUP BY surrogate_key
)
) AS INTERNAL_SOURCE
ON FALSE
WHEN NOT MATCHED BY SOURCE
AND INTERNAL_DEST.stamp BETWEEN dt_start AND dt_end -- remove all data in partition range
THEN DELETE
WHEN NOT MATCHED THEN INSERT ROW
credit: https://gist.github.com/hui-zheng/f7e972bcbe9cde0c6cb6318f7270b67a
Easier answer, without a subselect
SELECT
*,
ROW_NUMBER()
OVER (PARTITION BY Fixed_Accident_Index)
row_number
FROM Accidents.CleanedFilledCombined
WHERE TRUE
QUALIFY row_number = 1
The WHERE TRUE is necessary because QUALIFY needs a WHERE, GROUP BY or HAVING clause.
Felipe's answer is the best approach for most cases. Here is a more elegant way to accomplish the same:
CREATE OR REPLACE TABLE Accidents.CleanedFilledCombined
AS
SELECT
Fixed_Accident_Index,
ARRAY_AGG(x LIMIT 1)[SAFE_OFFSET(0)].* EXCEPT(Fixed_Accident_Index)
FROM Accidents.CleanedFilledCombined AS x
GROUP BY Fixed_Accident_Index;
To be safe, make sure you back up the original table before you run this ^^
I don't recommend using the ROW_NUMBER() OVER() approach if possible, since you may run into BigQuery memory limits and get unexpected errors.
Update the BigQuery schema with a new table column bq_uuid, making it NULLABLE and of type STRING.

Create duplicate rows by running the same command 5 times, for example:
insert into beginner-290513.917834811114.messages (id, type, flow, updated_at) Values(19999,"hello", "inbound", '2021-06-08T12:09:03.693646')
Check if duplicate entries exist
select * from beginner-290513.917834811114.messages where id = 19999
Use the GENERATE_UUID function to generate a uuid corresponding to each message:

UPDATE beginner-290513.917834811114.messages
SET bq_uuid = GENERATE_UUID()
where id>0
Clean duplicate entries
DELETE FROM beginner-290513.917834811114.messages
WHERE bq_uuid IN
(SELECT bq_uuid
FROM
(SELECT bq_uuid,
ROW_NUMBER() OVER( PARTITION BY updated_at
ORDER BY bq_uuid ) AS row_num
FROM beginner-290513.917834811114.messages ) t
WHERE t.row_num > 1 );

HIVE equivalent of FIRST and LAST

I have a table with the following columns:
table1: ID, CODE, RESULT, RESULT2, RESULT3
I have this SAS code:
data table1;
set table1;
BY ID CODE;
IF FIRST.CODE and RESULT='A' THEN OUTPUT;
ELSE IF LAST.CODE and RESULT NE 'A' THEN OUTPUT;
RUN;
So we are grouping the data by ID and CODE, and then writing to the dataset if certain conditions are met. I want to write a hive query to replicate this. This is what I have:
proc sql;
create table temp as
select *, row_number() over (partition by ID, CODE) as rowNum
from table1;
create table temp2 as
select a.ID, a.CODE, a.RESULT, a.RESULT2, a.RESULT3
from temp a
inner join (select ID, CODE, max(rowNum) as maxRowNum
from temp
group by ID, CODE) b
on a.ID=b.ID and a.CODE=b.CODE
where (a.rowNum=1 and a.RESULT='A') or (a.rowNum=b.maxRowNum and a.RESULT NE 'A');
quit;
There are two issues I see with this.
1) The row that is first or last in each BY group is entirely dependent on the order of rows in table1 in SAS; we aren't ordering by anything. I don't think row order is preserved when translating to a hive query.
2) The SAS code is taking the first row in each BY GROUP or the last, not both. I think that my HIVE query is taking both, resulting in more rows than I want.
Any suggestions or insight on how to improve my query is appreciated. Is it even possible to replicate this SAS code in HIVE?
The SAS code has a by statement (BY ID CODE;), which tells SAS that the set dataset is sorted at those levels, so the selection for FIRST. and LAST. is not random.
That said, we can replicate this in HIVE by using the first_value and last_value window functions.
FIRST.CODE should replicate to
first_value(code) over (partition by id order by code) fcode
Similarly, LAST.CODE would be
last_value(code) over (partition by id order by code rows between unbounded preceding and unbounded following) lcode
(The explicit window frame is needed because the default frame stops at the current row, which would make last_value simply return the current row's code.)
Once you have the fcode and lcode columns, use case when statements for the result column criteria. Like,
case when (code=fcode and result='A') or (code=lcode and result<>'A')
then 1 else 0 end as op_flag
Then fetch the table with where op_flag = 1.
SAMPLE
select id, code, result from (
select *,
first_value(code) over (partition by id order by code) fcode,
last_value(code) over (partition by id order by code rows between unbounded preceding and unbounded following) lcode
from footab) f
where (code=fcode and result='A') or (code=lcode and result<>'A')
Regarding point 1): BY-group processing requires the input data to be sorted or indexed on the BY variables, so even though the code contains no ordering, the source data is processed in order. If the input data were not indexed/sorted, SAS would throw an error.
Regarding this, possible differences arise on rows with the same values of the BY variables, especially if RESULT differs.
In SAS, I would pre-sort the data by ID, CODE, RESULT, then use BY ID CODE so as not to be influenced by the physical order of rows.
Regarding 2): FIRST. and LAST. can both be true in SAS. Since your conditions on RESULT for first and last are different, I guess this is not a source of differences.
I guess you could add another field such as
row_number() over (partition by ID, CODE order by <some ordering column> desc) as rowNumDesc
to detect the last row with rowNumDesc = 1 (so that you skip the join); see the sketch below.
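A rough Hive sketch of that combined idea, assuming a hypothetical ordering column (called load_seq here, standing in for whatever physical order the SAS dataset relied on):
select ID, CODE, RESULT, RESULT2, RESULT3
from (
    select t.*,
           row_number() over (partition by ID, CODE order by load_seq asc)  as rowNumAsc,
           row_number() over (partition by ID, CODE order by load_seq desc) as rowNumDesc
    from table1 t
) x
where (rowNumAsc = 1 and RESULT = 'A')
   or (rowNumDesc = 1 and RESULT <> 'A');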
EDIT:
I think the two programs above both include a random selection of rows for groups with the same values of the ID and CODE variables, especially with the same values of RESULT. But you should get the same number of rows from both. If not, just debug it.
However, the random aspect in SAS code/storage is based on the physical order of rows, while ROW_NUMBER's randomness within a group will be influenced by the implementation of the function in the engine.

optimizing a dup delete statement Oracle

I have 2 delete statements that are taking a long time to complete. There are several indexes on the columns in the where clause.
What is a duplicate?
If 2 or more records have the same values in the columns id, cid, type, trefid, ordrefid, amount and paydt then they are duplicates.
The DELETEs delete about 1 million records.
Can they be re-written in any way to make them quicker?
DELETE FROM TABLE1 A WHERE loaddt < (
SELECT max(loaddt) FROM TABLE1 B
WHERE
a.id=b.id and
a.cid=b.cid and
NVL(a.type,'-99999') = NVL(b.type,'-99999') and
NVL(a.trefid,'-99999')=NVL(b.trefid,'-99999') and
NVL(a.ordrefid,'-99999')= NVL(b.ordrefid,'-99999') and
NVL(a.amount,'-99999')=NVL(b.amount,'-99999') and
NVL(a.paydt,TO_DATE('9999-12-31','YYYY-MM-DD'))=NVL(b.paydt,TO_DATE('9999-12-31','YYYY-MM-DD'))
);
COMMIT;
DELETE FROM TABLE1 a where rowid > (
Select min(rowid) from TABLE1 b
WHERE
a.id=b.id and
a.cid=b.cid and
NVL(a.type,'-99999') = NVL(b.type,'-99999') and
NVL(a.trefid,'-99999')=NVL(b.trefid,'-99999') and
NVL(a.ordrefid,'-99999')= NVL(b.ordrefid,'-99999') and
NVL(a.amount,'-99999')=NVL(b.amount,'-99999') and
NVL(a.paydt,TO_DATE('9999-12-31','YYYY-MM-DD'))=NVL(b.paydt,TO_DATE('9999-12-31','YYYY-MM-DD'))
);
commit;
Explain Plan:
DELETE TABLE1
  HASH JOIN  1296491
    Access Predicates
      AND
        A.ID=ITEM_1
        A.CID=ITEM_2
        ITEM_3=NVL(TYPE,'-99999')
        ITEM_4=NVL(TREFID,'-99999')
        ITEM_5=NVL(ORDREFID,'-99999')
        ITEM_6=NVL(AMOUNT,(-99999))
        ITEM_7=NVL(PAYDT,TO_DATE(' 9999-12-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
    Filter Predicates
      LOADDT<MAX(LOADDT)
    TABLE ACCESS TABLE1 FULL  267904
    VIEW VW_SQ_1  690385
      SORT GROUP BY  690385
        TABLE ACCESS TABLE1 FULL  267904
How large is the table? If the count of deleted rows is up to 12% of the table, then you may think about an index.
Could you somehow partition your table - like week by week - and then scan only the current week?
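As a sketch of that idea (assuming 11g-style interval partitioning; the column types are guesses, and only the column names from the duplicate definition above are used):
create table table1_part (
    id        number,
    cid       number,
    type      varchar2(30),
    trefid    varchar2(30),
    ordrefid  varchar2(30),
    amount    number,
    paydt     date,
    loaddt    date not null
)
partition by range (loaddt)
interval (numtodsinterval(7, 'DAY'))
(partition p_initial values less than (date '2013-01-01'));
-- the weekly dedup can then restrict itself to the newest partitions via a loaddt predicate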
Maybe this could be more efficient. When you're using an aggregate function, Oracle must walk through all relevant rows (in your case a full scan), but when you use EXISTS it stops when the first occurrence is found. (And of course the query would be much faster if there were a function-based index - because of the NVLs - on all the columns in the where clause.)
DELETE FROM TABLE1 A
WHERE exists (
SELECT 1
FROM TABLE1 B
WHERE
a.loaddt < b.loaddt and
a.id=b.id and
a.cid=b.cid and
NVL(a.type,'-99999') = NVL(b.type,'-99999') and
NVL(a.trefid,'-99999')=NVL(b.trefid,'-99999') and
NVL(a.ordrefid,'-99999')= NVL(b.ordrefid,'-99999') and
NVL(a.amount,'-99999')=NVL(b.amount,'-99999') and
NVL(a.paydt,TO_DATE('9999-12-31','YYYY-MM-DD'))=NVL(b.paydt,TO_DATE('9999-12-31','YYYY-MM-DD'))
);
Although some may disagree, I am a proponent of running large, long-running deletes procedurally. In my view it is much easier to control and track progress (and your DBA will like you better ;-). Also, I'm not sure why you need to join table1 to itself to identify duplicates (and I'd be curious whether you ever run into snapshot-too-old issues with your current approach). You also shouldn't need multiple delete statements; all duplicates should be handled in one process. Finally, you should check WHY you're constantly re-introducing duplicates each week, and perhaps change the load process (maybe doing a merge/upsert rather than all inserts).
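On that last point, if the weekly load were switched to an upsert, a rough sketch might look like the following (staging_table is a hypothetical name for the incoming feed, which is assumed to be de-duplicated itself; the join mirrors the duplicate definition above):
merge into table1 t
using staging_table s
on (    t.id  = s.id
    and t.cid = s.cid
    and nvl(t.type,'-99999')     = nvl(s.type,'-99999')
    and nvl(t.trefid,'-99999')   = nvl(s.trefid,'-99999')
    and nvl(t.ordrefid,'-99999') = nvl(s.ordrefid,'-99999')
    and nvl(t.amount,'-99999')   = nvl(s.amount,'-99999')
    and nvl(t.paydt, to_date('9999-12-31','YYYY-MM-DD'))
        = nvl(s.paydt, to_date('9999-12-31','YYYY-MM-DD')))
when matched then update
    set t.loaddt = s.loaddt                  -- just refresh the load date on an existing row
when not matched then insert (id, cid, type, trefid, ordrefid, amount, paydt, loaddt)
    values (s.id, s.cid, s.type, s.trefid, s.ordrefid, s.amount, s.paydt, s.loaddt);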
That said, you might try something like:
-- first create mat view to find all duplicates
create materialized view my_dups_mv
tablespace my_tablespace
build immediate
refresh complete on demand
as
select id,cid,type,trefid,ordrefid,amount,paydt, count(1) as cnt
from table1
group by id,cid,type,trefid,ordrefid,amount,paydt
having count(1) > 1;
-- dedup data (or put into procedure and schedule along with mat view refresh above)
declare
-- make sure my_dups_mv is refreshed first
cursor dup_cur is
select * from my_dups_mv;
type duprec_t is record(row_id rowid);
duprec duprec_t;
type duptab_t is table of duprec_t index by pls_integer;
duptab duptab_t;
l_ctr pls_integer := 0;
l_dupcnt pls_integer := 0;
begin
for rec in dup_cur
loop
l_ctr := l_ctr + 1;
-- assuming needed indexes exist
select rowid
bulk collect into duptab
from table1
where id = rec.id
and cid = rec.cid
and type = rec.type
and trefid = rec.trefid
and ordrefid = rec.ordrefid
and amount = rec.amount
and paydt = rec.paydt
-- order by whatever makes sense to make the "keeper" float to top
order by loaddt desc
;
for i in 2 .. duptab.count
loop
l_dupcnt := l_dupcnt + 1;
delete from table1 where rowid = duptab(i).row_id;
end loop;
if (mod(l_ctr, 10000) = 0) then
-- log to log table here (calling autonomous procedure you'll need to implement)
insert_logtable('Table1 deletes', 'Commit reached, deleted ' || l_dupcnt || ' rows');
commit;
end if;
end loop;
commit;
end;
Check your log table for progress status.
1. Parallel
alter session enable parallel dml;
DELETE /*+ PARALLEL */ FROM TABLE1 A WHERE loaddt < (
...
Assuming you have Enterprise Edition, a sane server configuration, and you are on 11g. If you're not on 11g, the parallel syntax is slightly different.
2. Reduce memory requirements
The plan shows a hash join, which is probably a good thing. But without any useful filters, Oracle has to hash the entire table. (Tbone's query, which only uses a GROUP BY, looks nicer and may run faster. But it will also probably run into the same problem trying to sort or hash the entire table.)
If the hash can't fit in memory it must be written to disk, which can be very slow. Since you run this query every week, only one side of the self-join needs to look at all the rows. Depending on exactly when it runs, you can add something like this to the end of the query: ) where b.loaddt >= sysdate - 14. This may significantly reduce the amount of writing to the temporary tablespace. And it may also reduce read IO if you use a partitioning strategy like jakub.petr suggested.
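Applied to the first DELETE, that filter would sit inside the correlated subquery, roughly like this (a sketch only; the two-week window assumes duplicates can only be re-introduced by recent loads):
DELETE FROM TABLE1 A WHERE loaddt < (
    SELECT max(loaddt) FROM TABLE1 B
    WHERE
    a.id=b.id and
    a.cid=b.cid and
    NVL(a.type,'-99999') = NVL(b.type,'-99999') and
    NVL(a.trefid,'-99999')=NVL(b.trefid,'-99999') and
    NVL(a.ordrefid,'-99999')= NVL(b.ordrefid,'-99999') and
    NVL(a.amount,'-99999')=NVL(b.amount,'-99999') and
    NVL(a.paydt,TO_DATE('9999-12-31','YYYY-MM-DD'))=NVL(b.paydt,TO_DATE('9999-12-31','YYYY-MM-DD')) and
    b.loaddt >= sysdate - 14   -- only recent rows are hashed/sorted on this side
);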
3. Active Report
If you want to know exactly what your query is doing, run the Active Report:
select dbms_sqltune.report_sql_monitor(sql_id => 'YOUR_SQL_ID_HERE', type => 'active')
from dual;
(Save the output to an .html file and open it with a browser.)

How to get records randomly from the oracle database?

I need to select rows randomly from an Oracle DB.
Ex: Assume a table with 100 rows; how can I randomly return 20 of those records from the entire 100 rows?
SELECT *
FROM (
SELECT *
FROM table
ORDER BY DBMS_RANDOM.RANDOM)
WHERE rownum < 21;
SAMPLE() is not guaranteed to give you exactly 20 rows, but might be suitable (and may perform significantly better than a full query + sort-by-random for large tables):
SELECT *
FROM table SAMPLE(20);
Note: the 20 here is an approximate percentage, not the number of rows desired. In this case, since you have 100 rows, to get approximately 20 rows you ask for a 20% sample.
SELECT * FROM table SAMPLE(10) WHERE ROWNUM <= 20;
This is more efficient as it doesn't need to sort the Table.
SELECT column FROM
( SELECT column, dbms_random.value FROM table ORDER BY 2 )
where rownum <= 20;
In summary, two ways were introduced:
1) using an ORDER BY DBMS_RANDOM.VALUE clause
2) using the SAMPLE([%]) clause
The first way has the advantage of correctness, which means you will never fail to get a result if one actually exists, while with the second way you may get no result even though rows satisfying the query condition exist, since information is discarded during sampling.
The second way has the advantage of efficiency, which means you will get the result faster and put a lighter load on your database. I was given a warning from a DBA that my query using the first way puts a heavy load on the database.
You can choose one of the two ways according to your interest!
In the case of huge tables, the standard way of sorting by dbms_random.value is not efficient because you need to scan the whole table, and dbms_random.value is a pretty slow function that requires context switches. For such cases, there are 3 additional methods:
1: Use the SAMPLE clause:
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/SELECT.html#GUID-CFA006CA-6FF1-4972-821E-6996142A51C6
for example:
select *
from s1 sample block(1)
order by dbms_random.value
fetch first 1 rows only
i.e. get 1% of all blocks, then sort them randomly and return just 1 row.
2: If you have an index/primary key on a column with a normal distribution, you can get the min and max values, get a random value in this range, and get the first row with a value greater than or equal to that randomly generated value.
Example:
--big table with 1 mln rows with primary key on ID with normal distribution:
Create table s1(id primary key,padding) as
select level, rpad('x',100,'x')
from dual
connect by level<=1e6;
select *
from s1
where id>=(select
dbms_random.value(
(select min(id) from s1),
(select max(id) from s1)
)
from dual)
order by id
fetch first 1 rows only;
3: Get a random table block, generate a rowid, and get the row from the table by this rowid:
select *
from s1
where rowid = (
select
DBMS_ROWID.ROWID_CREATE (
1,
objd,
file#,
block#,
1)
from
(
select/*+ rule */ file#,block#,objd
from v$bh b
where b.objd in (select o.data_object_id from user_objects o where object_name='S1' /* table_name */)
order by dbms_random.value
fetch first 1 rows only
)
);
To randomly select 20 rows I think you'd be better off selecting the lot of them randomly ordered and selecting the first 20 of that set.
Something like:
Select *
from (select *
from table
order by dbms_random.value) -- you can also use DBMS_RANDOM.RANDOM
where rownum < 21;
Best used for small tables to avoid selecting large chunks of data only to discard most of it.
Here's how to pick a random sample out of each group:
SELECT GROUPING_COLUMN,
MIN (COLUMN_NAME) KEEP (DENSE_RANK FIRST ORDER BY DBMS_RANDOM.VALUE)
AS RANDOM_SAMPLE
FROM TABLE_NAME
GROUP BY GROUPING_COLUMN
ORDER BY GROUPING_COLUMN;
I'm not sure how efficient it is, but if you have a lot of categories and sub-categories, this seems to do the job nicely.
-- Q. How to find a random 50% of the records in a table?
Use this when you want a percentage of rows chosen randomly:
SELECT *
FROM (
SELECT *
FROM table_name
ORDER BY DBMS_RANDOM.RANDOM)
WHERE rownum <= (select count(*) from table_name) * 50/100;

Improving the performance of keeping the three recent records of each account query

I have a table in an Oracle (10g XE) database, and I'm going to clean it up and keep only the three most recent records of each account. Here is what I'm doing right now:
CREATE TABLE ACCOUNT_TRANSACTION_TMP NOLOGGING AS SELECT * FROM ACCOUNT_TRANSACTION WHERE 1=2;
DECLARE
  CURSOR mbsacc_cur (account_id_var account_transaction.account_id%TYPE) IS
    SELECT * FROM account_transaction WHERE account_id = account_id_var ORDER BY transaction_time DESC;
  account_transaction_rec account_transaction%ROWTYPE;
BEGIN
  FOR i IN (SELECT DISTINCT(account_id) FROM account_transaction) LOOP
    OPEN mbsacc_cur(i.account_id);
    LOOP
      FETCH mbsacc_cur INTO account_transaction_rec;
      EXIT WHEN mbsacc_cur%NOTFOUND OR mbsacc_cur%ROWCOUNT > 3;
      INSERT /*+ append */ INTO account_transaction_tmp VALUES account_transaction_rec;
    END LOOP;
    CLOSE mbsacc_cur;
  END LOOP;
END;
/
And then I'll drop the old table, rename this new one to old one and add constraints.
But the problem is that the above code runs forever (~3-4 hours) for about 1 million records, of which approximately half should be removed.
Is there any way to improve the performance of this?
Instead of creating an empty table and populating it in an RBAR fashion, create a table with the rows you want....
CREATE TABLE ACCOUNT_TRANSACTION_TMP NOLOGGING AS
SELECT account_id, col1, col2, col3, transaction_time from
( select at.*
, row_number()
over (partition by at.account_id
order by at.transaction_time desc) as to_keep
FROM ACCOUNT_TRANSACTION at)
where to_keep <= 3
/
Then skip straight to the renaming part of your plan.
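The renaming step itself is then plain DDL along these lines (the constraint/index names below are made up; re-create whatever the original table actually had):
DROP TABLE account_transaction;
ALTER TABLE account_transaction_tmp RENAME TO account_transaction;
-- re-add the original constraints, indexes and grants here, e.g. (illustrative names only):
-- ALTER TABLE account_transaction ADD CONSTRAINT account_transaction_pk PRIMARY KEY (account_id, transaction_time);
-- CREATE INDEX account_transaction_acct_ix ON account_transaction (account_id);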
You can do that with analytics (although I am not at all well versed in it myself). Take a look at this question, which seems to address a situation similar to yours:
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:1212501913138
