oracle | delete duplicates records - oracle

I have identified some duplicates in my table:
-- DUPLICATES: ----
select PPLP_NAME,
START_TIME,
END_TIME,
count(*)
from PPLP_LOAD_GENSTAT
group by PPLP_NAME,
START_TIME,
END_TIME
having count(*) > 1
-- DUPLICATES: ----
How is it possible to delete them?

Even if you don't have the primary key, each record has a unique rowid associated.
By using the query below you delete only the records that don't have the maximum row id by self joining a table with the columns that cause duplication. This will make sure that you delete any duplicates.
DELETE FROM PPLP_LOAD_GENSTAT plg_outer
WHERE ROWID NOT IN(
select MAX(ROWID)
from PPLP_LOAD_GENSTAT plg_inner
WHERE plg_outer.pplp_name = plg_inner.pplg_name
AND plg_outer.start_time= plg_inner.start_time
AND plg_outer.end_time = plg_inner.end_time
);

I'd suggest something easier:
CREATE table NewTable as
SELECT DISTINCT pplp_name,start_time,end_time
FROM YourTable
Then delete your table, and rename the new table.
If you really want to delete records, you can find a few examples of how here.

Related

Delete logic is taking a very long time to process in Oracle

I am trying to use the following statement for the Delete process and it has to delete around 23566424 Rows, but oracle takes almost 3 hours to complete the process and we have already created an index on " SCHEDULE_DATE_KEY" but still, the process is very slow.Can someone advise on how to make Deletes faster in oracle
DELETE
FROM
EDWSOURCE.SCHEDULE_DAY_F
WHERE
SCHEDULE_DATE_KEY >
(
SELECT
LAST_PAYROLL_DATE_KEY
FROM
EDWSOURCE.LAST_PAYROLL_DATE
WHERE
CURRENT_FLAG = 'Y'
);
I don't think any index will help here, probably Oracle will decide the best approach is a full table scan to delete 20M rows from 300M. It is deleting at a rate of over 2000 rows per second, which isn't bad. In fact any additional indexes will slow it down as it has to delete the row entry from the index as well.
A quicker approach could be to create a new table of the rows you want to keep, something like:
create table EDWSOURCE.SCHEDULE_DAY_F_KEEP
as
select * from EDWSOURCE.SCHEDULE_DAY_F
where SCHEDULE_DATE_KEY <=
(
SELECT
LAST_PAYROLL_DATE_KEY
FROM
EDWSOURCE.LAST_PAYROLL_DATE
WHERE
CURRENT_FLAG = 'Y'
);
Then recreate any constraints and indexes to use the new table.
Finally drop the old table and rename the new one.
You can try testing a filtered table move. This has an online clause. So you can do this while the application is still running.
Note 12.2 and later the indexes will remain valid. In earlier versions you will need to rebuild the indexes as they will become invalid. Good Luck
Move a Table
Create and populate a new test table.
DROP TABLE t1 PURGE;
CREATE TABLE t1 AS
SELECT level AS id,
'Description for ' || level AS description
FROM dual
CONNECT BY level <= 100;
COMMIT;
Check the contents of the table.
SELECT COUNT(*) AS total_rows,
MIN(id) AS min_id,
MAX(id) AS max_id
FROM t1;
TOTAL_ROWS MIN_ID MAX_ID
---------- ---------- ----------
100 1 100
SQL>
Move the table, filtering out rows with an ID value greater than 50.
ALTER TABLE t1 MOVE ONLINE
INCLUDING ROWS WHERE id <= 50;
Check the contents of the table.
SELECT COUNT(*) AS total_rows,
MIN(id) AS min_id,
MAX(id) AS max_id
FROM t1;
TOTAL_ROWS MIN_ID MAX_ID
---------- ---------- ----------
50 1 50
SQL>
The rows with an ID value between 51 and 100 have been removed.
As mentioned above if maybe best to PARTITION the table abs drop a PARTITION every N number of days as part of a daily task.

Delete data based on the count & timestamp using pl\sql

I'm new to PL\SQL programming and I'm from DBA background. I got one requirement to delete data from both main table and reference table but need to follow below logic while deleting data because we need to delete 30M of data from the tables so we're reducing data based on the "State_ID" column below.
Following conditions need to consider
1. As per sample data given below(Main Table), sort data based on timestamp with desc order and leave the first 2 rows of data for each "State_id" and delete rest of the data from the both tables based on "state_id" column.
2. select state_id,count() from maintable group by state_id order by timestamp desc Having count()>2;
So if state_id=1 has 5 rows then has to delete 3 rows of data by leaving first 2 rows for state_id=1 and repeat for other state_id values.
Also same matching data should be deleted from the reference table as well.
Please someone help me on this issue. Thanks.
enter image description here
Main table
You should be able to do each table delete as a single SQL command. Anything else would essentially force row-by-row processing, which is the last thing you want for that much data. Something like this:
delete from main_table m
where m.row_id not in (
with keep_me as (
select row_id,
row_number() over (partition by state_id
order by time_stamp desc) id_row_number
from main_table where id_row_number<3)
select row_id from keep_me)
or
delete from main_table m
where m.row_id in (
with delete_me as (
select row_id,
row_number() over (partition by state_id
order by time_stamp desc) id_row_number
from main_table where id_row_number>2)
select row_id from delete_me)

Temporary table for deletion

I have to use a PL_SQL anonymous block to delete rows in a lot of tables.
Every table is related to the main table "TABLE1", and I cannot add CASCADE DELETE. I have to do something like
DELETE FROM table2 WHERE foreign_key in (SELECT ID FROM table1 WHERE ...).
DELETE FROM table3 WHERE foreign_key in (SELECT ID FROM table1 WHERE ...).
...
The "SELECT ID.." query may take several minute, does it make sense to put all the ID in a temporary table or something like that? So I can execute the "select" query only once.
There are alternatives?
If
select id from table1 where ....
takes a lot of time, then it depends. The result will probably be cached so the next execution (while deleting from table3) won't last that long.
Won't cost much to test it. Delete a couple of tables the old way then create a "temporary" table using CTAS
create table ids as
select id from table1 where ...
and use it in DELETE statements. You don't even need an index as you have to perform full table scan anyway.
Then choose a better option.
Another way is the use of PL/SQL-Collections:
DECLARE
id_list SYS.ODCINUMBERLIST;
BEGIN
SELECT ID
BULK COLLECT INTO id_list
FROM table1
WHERE ...
FORALL i IN id_list.FIRST..id_list.LAST
DELETE FROM table2
WHERE foreign_key = id_list(i);
FORALL i IN id_list.FIRST..id_list.LAST
DELETE FROM table3
WHERE foreign_key = id_list(i);
...
END;

Hive multiple insert goes wrong with the DISTINCT select statement

I read this code from "Hadoop the Definitive Guide":
SELECT a.ad_id, a.campaign_id, a.account_id, b.user_id
FROM dim_ads a JOIN impression_logs b ON (b.ad_id = a.ad_id)
WHERE b.dateid = '2008-12-01') x
INSERT OVERWRITE DIRECTORY 'results_gby_adid'
SELECT x.ad_id, count(1), count(DISTINCT x.user_id) GROUP BY x.ad_id
INSERT OVERWRITE DIRECTORY 'results_gby_campaignid'
SELECT x.campaign_id, count(1), count(DISTINCT x.user_id) GROUP BY x.campaign_id
INSERT OVERWRITE DIRECTORY 'results_gby_accountid'
SELECT x.account_id, count(1), count(DISTINCT x.user_id) GROUP BY x.account_id;
but as my test, using several DISTINCT cannot get right results.
my hiveql as below:
CREATE TABLE IF NOT EXISTS a (logindate int, id int);
then
load local file to this table...
CREATE TABLE IF NOT EXISTS user (id INT) PARTITIONED BY (logindate INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
then
if inserting table separately:
INSERT OVERWRITE TABLE user PARTITION(logindate=20130120) SELECT DISTINCT(id) FROM a WHERE logindate=20130120;
INSERT OVERWRITE TABLE user PARTITION(logindate=20130121) SELECT DISTINCT(id) FROM a WHERE logindate=20130121;
the results are correct;
but if choosing the next multiple insert hql:
FROM a
INSERT OVERWRITE TABLE user PARTITION(logindate=20130120) SELECT DISTINCT(id) WHERE logindate=20130120
INSERT OVERWRITE TABLE user PARTITION(logindate=20130121) SELECT DISTINCT(id) WHERE logindate=20130121;
the results are not correct, both partitions have the same number of records, seems like select from DISTINCT(id) WHERE logindate=20130120 OR logindate=20130121
so is it a bug or did I write some wrong syntax?
DISTINCT has a bit of an odd history in the code as an alias to group by.
If there is a bug, then the version of hive you are using would be important to know since bugs are addressed in each release.
This might work:
FROM a
INSERT OVERWRITE TABLE user PARTITION(logindate=20130120) SELECT id WHERE logindate=20130120 GROUP BY id
INSERT OVERWRITE TABLE user PARTITION(logindate=20130121) SELECT id WHERE logindate=20130121 GROUP BY id;
if that doesn't work, this will definitely work...even though it isn't the approach you were attempting to use...
FROM (select distinct id, logindate from a where logindate in ('20130120','20130121')) subq_a
INSERT OVERWRITE TABLE user PARTITION(logindate=20130120) SELECT id WHERE logindate=20130120
INSERT OVERWRITE TABLE user PARTITION(logindate=20130120) SELECT id WHERE logindate=20130121;

ORA-01732: data manipulation operation not legal on this view

I have this DML statement..
delete from (select key,value,computed, row_number() OVER (Partition By key, value order by seq asc) as a
from excelformats a )
where A > 1
and this throws
ORA-01732: data manipulation operation not legal on this view
This statement basically selects duplicate rows from excelFormats table those to be deleted
How can I revise so that
You could use:
DELETE FROM excelformats
WHERE rowid not in
(SELECT MIN(rowid)
FROM excelformats
GROUP BY key, value, computed);
This will delete duplicate rows in your excelformats table given the three key columns you stated.
Hope it helps...

Resources