Delete Duplicate rows in Vertica database - vertica

Vertica allows duplicates to be inserted into the tables. I can view those using the 'analyze_constraints' function.
How to delete duplicate rows from Vertica tables?

You should try to avoid/limit using DELETE with a large number of records. The following approach should be more effective:
Step 1 Create a new table with the same structure / projections as the one containing duplicates:
create table mytable_new like mytable including projections ;
Step 2 Insert into this new table de-duplicated rows:
insert /* +direct */ into mytable_new select <column list> from (
select * , row_number() over ( partition by <pk column list> ) as rownum from <table-name>
) a where a.rownum = 1 ;
Step 3 rename the original table (the one containing dups):
alter table mytable rename to mytable_orig ;
Step 4 rename the new table:
alter table mytable_new rename to mytable ;
That's all.

The answer of Mauro is correct, but there is an error in the sql of step 2. So, the complete way of working by avoiding DELETE should then be as follows:
Step 1 Create a new table with the same structure / projections as the one containing duplicates:
create table mytable_new like mytable including projections ;
Step 2 Insert into this new table de-duplicated rows:
insert /* +direct */ into mytable_new select <column list> from (
select * , row_number() over ( partition by <pk column list> ) as rownum from mytable
) a where a.rownum = 1 ;
Step 3 rename the original table (the one containing dups):
alter table mytable rename to mytable_orig ;
Step 4 rename the new table:
alter table mytable_new rename to mytable ;

Off the top of my head, and not a great answer so let's let this be the final word, you can delete both and insert one back in.

You can delete duplicates by Vertica tables by creating a temporary table and generating pseudo row_ids. Here are few steps, especially if you are removing duplicates from very large and wide tables. In the example below, i assume, k1 and k2 rows have more than 1 duplicates. For more info see here.
-- Find the duplicates
select keys, count(1) from large-table-1
where [where-conditions]
group by 1
having count(1) > 1
order by count(1) desc ;
-- Step 2: Dump the duplicates into temp table
create table test.large-table-1-dups
like large-table-1;
alter table test.large-table-1-dups -- add row_num column (pseudo row_id)
add column row_num int;
insert into test.large-table-1-dups
select *, ROW_NUMBER() OVER(PARTITION BY key)
from large-table-1
where key in ('k1', 'k2'); -- where, say, k1 has n and k2 has m exact dups
-- Step 3: Remove duplicates from the temp table
delete from test.large-table-1-dups
where row_num > 1;
select * from test.dim_line_items_dups;
-- Sanity test. Should have 1 row each of k1 & k2 rows above
-- Step 4: Delete all duplicates from main table...
delete from large-table-1
where key in ('k1', 'k2');
-- Step 5: Insert data back into main table from temp dedupe data
alter table test.large-table-1-dups
drop column row_num;
insert into large-table-1
select * from test.large-table-1-dups;

Step1: Create a intermediate table to port/load the data from original table along with row number.
Here in below sample, porting data from Table1 to Table2 along with row_num column
select * into Table2 from (select *, ROW_NUMBER() OVER(PARTITION BY A,B order by C)as row_num from Table1 ) A;
Step2: Delete data from Table1 using earlier created Table2 in above step
DELETE FROM Table1 WHERE EXISTS (SELECT NULL FROM Table2
where Table2.A=Table1.A
and Table2.B=Table1.B
and row_num > 1);
Step3: Drop table create in first step1 i.e Table2
Drop Table Table2;

You should have a look at this answer from the PostgreSQL wiki which also works for Vertica:
DELETE
FROM
tablename
WHERE
id IN(
SELECT
id
FROM
(
SELECT
id,
ROW_NUMBER() OVER(
partition BY column1,
column2,
column3
ORDER BY
id
) AS rnum
FROM
tablename
) t
WHERE
t.rnum > 1
);
It deletes all duplicate entries but the one with the lowest id.

Related

Can we use an insert overwrite after using insert all

In Snowflake I am trying to insert updated records to a table. Then I want to identify the records that were just inserted as the most recent records save that as the final table output in a new column called ACTIVE which will either be true or flase. I am having an issue incorporating some sort of updated table segment to my current query. I need everything be contained in the same query rather than break it up into separate parts.
I have my table as follows
CREATE TABLE IF NOT EXISTS MY_TABLE
(
LINK_ID BINARY NOT NULL,
LOAD TIMESTAMP NOT NULL,
SOURCE STRING NOT NULL,
SOURCE_DATE TIMESTAMP NOT NULL,
ORDER BIGINT NOT NULL,
ID BINARY NOT NULL,
ATTRIBUTE_ID BINARY NOT NULL
);
I have records being inserted in this way:
INSERT ALL
WHEN HAS_DATA AND ID_SEQ_NUM > 1 AND (SELECT COUNT(1) FROM MY_TABLE WHERE ID = KEY) = 0 THEN
INTO MY_TABLE VALUES (
LINK_KEY,
TIME,
DATASET_NAME,
DATASET_DATE,
ORDER_NUMBER,
O_KEY,
OA_KEY
)
SELECT *
FROM TEST_TABLE;
I would like my final table from this to be the output as
SELECT *, ORDER != MAX(ORDER) OVER (PARTITION BY ID) AS ACTIVE
FROM MY_TABLE;
This is so I can identify the most recent record per ID group as ACTIVE/TRUE and the previous records within that ID group as INACTIVE/FALSE
I tried to use an insert overwrite method like this
INSERT ALL
WHEN HAS_DATA AND ID_SEQ_NUM > 1 AND (SELECT COUNT(1) FROM MY_TABLE WHERE ID = KEY) = 0 THEN
INTO MY_TABLE VALUES (
LINK_KEY,
TIME,
DATASET_NAME,
DATASET_DATE,
ORDER_NUMBER,
O_KEY,
OA_KEY
)
INSERT OVERWRITE INTO MY_TABLE
SELECT *, RSRC_OFFSET != MAX(RSRC_OFFSET) OVER (PARTITION BY ID) AS ACTIVE
FROM L_OPTION_OPTION_ALLOCATION_TEST
SELECT *
FROM MY_TABLE;
However, it seems the insert overwrite doesn't work in this way (also I am not sure if I can just add a new column to the table like this?). Is there a way I can incorporate it into this query or a different way to update the table with this new ACTIVE column within this query itself?
Also I am using INSERT ALL here because I actually have multiple different tables I am inserting into at once, but this is the current table that I am trying to modify.
You can use the overwrite option with conditional multi-table inserts.
Starting with your current statement:
INSERT ALL
WHEN HAS_DATA AND ID_SEQ_NUM > 1 AND (SELECT COUNT(1) FROM MY_TABLE WHERE ID = KEY) = 0 THEN
INTO MY_TABLE VALUES (
LINK_KEY,
TIME,
DATASET_NAME,
DATASET_DATE,
ORDER_NUMBER,
O_KEY,
OA_KEY
)
SELECT *
FROM TEST_TABLE;
Add the overwrite option immediately after the insert command:
INSERT OVERWRITE ALL
WHEN HAS_DATA AND ID_SEQ_NUM > 1 AND (SELECT COUNT(1) FROM MY_TABLE WHERE ID = KEY) = 0 THEN
INTO MY_TABLE VALUES (
LINK_KEY,
TIME,
DATASET_NAME,
DATASET_DATE,
ORDER_NUMBER,
O_KEY,
OA_KEY
)
SELECT *
FROM TEST_TABLE;
Note that this will truncate and insert ALL tables in the multi-table insert. There is not a way to be selective about which tables get truncated and inserted and which don't.
https://docs.snowflake.com/en/sql-reference/sql/insert-multi-table.html#optional-parameters

Delete data returned from subquery in oracle

I have two tables. if the data in table1 is more than a predefined limit (say 2), i need to copy the remaining contents of table1 to table2 and delete those same contents from table1.
I used the below query to insert the excess data from table1 to table2.
insert into table2
SELECT * FROM table1 WHERE ROWNUM < ((select count(*) from table1)-2);
Now i need the delete query to delete the above contents from table1.
Thanks in advance.
A straightforward approach would be an interim storage in a temporary table. Its content can be used to determine the data to be deleted from table1 as well as the source to feed table 2.
Assume (slightly abusing notation) to be the PK column (or that of any candidate key) of table1 - usually there'll be some key that comprises only 1 column.
create global temporary table t_interim as
( SELECT <pk> pkc FROM table1 WHERE ROWNUM < ((select count(*) from table1)-2 )
;
insert into table2
select * from table1 where <pk> IN (
select pkc from t_interim
);
delete from table1 where <pk> IN (
select pkc from t_interim
);
Alternative
If any key of table1 spans more than 1 column, use an EXISTS clause instead as follows ( denoting the i-th component of a candidate key in table1):
create global temporary table t_interim as
( SELECT <ck_1> ck1, <ck_2> ck2, ..., <ck_n> ckn FROM table1 WHERE ROWNUM < ((select count(*) from table1)-2 )
;
insert into table2
select * from table1 t
where exists (
select 1
from t_interim t_i
where t.ck_1 = t_i.ck1
and t.ck_2 = t_i.ck2
...
and t.ck_n = t_i.ckn
)
;
delete from table1 t where
where exists (
select 1
from t_interim t_i
where t.ck_1 = t_i.ck1
and t.ck_2 = t_i.ck2
...
and t.ck_n = t_i.ckn
)
;
(Technically you could try to adjust the first scheme by synthesizing a key from the components of any CK, eg. by concatenating. You run the risk of introducing ambiguities ( (a bc, ab c) -> (abc, abc) ) or run into implementation limits ( max. varchar length ) using the first method)
Note
In case the table doesn't have a PK, you can apply the technique using any candidate key of table1. There will always be one, in the extreme case it's the set of all columns.
This situation may be the right time to improve the db design and add a (synthetic) pk column to table1 ( and any other tables in the system that lack it).

how do I split a numeric column into two separate columns in Oracle SQL

I am trying to join two tables together but the key in one table is formated like 'xxxxxx' and the key I am trying to join together in the second table is formated like '2222xxxxxx'. Is there a way I can either split the second column into two different columns to make the join, or join on only the last 6 numbers of column 2?
Notes: values are numeric. the '2222' is always the same 4 numbers.
Try this way:
select *
from table1 a
INNER JOIN table2 b
on a.fieldname = substr(b.fieldname, -6)
Here is the documentation for the SUBSTR Function in Oracle.
Know that this is a very bad design for your second table you should consider in separating this values in different columns.
EDIT
This comment from #Boneist "if there were 7 numbers after the first 4, not just 6?"
To fix this you should use: substr(b.fieldname, 5)
To separate that column on a select statement either you create a new column and update it values with the given substr or just add it in the select command.
On the select command:
select fielda, fieldb,
substr(b.fieldname, 1, 4) firstPartOfField,
substr(b.fieldname, 5) secondPartOfField
from tableb
To create another column it would be
alter table tableb add column newField number(6);
update tableb
set oldField = substr(b.fieldname,1,4),
newField = substr(b.fieldname,5);
I would suggest a slightly different approach:
If your column has varchar/char data type:
drop table tab1;
drop table tab2;
create table tab1(col char(30));
create table tab2(col char(30));
insert into tab1
select lpad(to_char(level), 6, '0') as val
from dual connect by level <=100 order by level;
insert into tab2
select '2222' || lpad(to_char(level), 6, '0') as val
from dual connect by level <=100 order by level;
create unique index ux_tab1 on tab1(col);
create unique index ux_tab2 on tab2(col);
select tab1.col, tab2.col
from tab1
join tab2
on '2222' || tab1.col = tab2.col;
-- check execution plan
explain plan for
select tab1.col, tab2.col
from tab1
join tab2
on '2222' || tab1.col = tab2.col;
select * from table(dbms_xplan.display);
If your column has number data type:
drop table tab1;
drop table tab2;
create table tab1(col number);
create table tab2(col number);
insert into tab1
select 100000 + level as val
from dual connect by level <=100 order by level;
insert into tab2
select 2222100000 + level as val
from dual connect by level <=100 order by level;
create unique index ux_tab1 on tab1(col);
create unique index ux_tab2 on tab2(col);
select tab1.col, tab2.col
from tab1
join tab2
on 2222000000 + tab1.col = tab2.col;
-- check execution plan
explain plan for
select tab1.col, tab2.col
from tab1
join tab2
on 2222000000 + tab1.col = tab2.col;
select * from table(dbms_xplan.display);
Using this approach Oracle will be able to use indexes for joining !

Convert String List to Number in Oracle DB

I am saving table ids as foreign key into another table using Oracle Apex Shuttle field like(3:4:5). Now I want to use these IDS in sql query using IN Clause. I have replaced : with , using replace function but it shows
no data found
message.
The following query works fine when I use static values.
select * from table where day_id IN(3,4,5)
But when I try to use
select * from table where id IN(Select id from table2)
it shows no data found.
From what i understand you have a list like 1:2:3:4 that you want to use in a IN clause; you can transform the list into separated values like this:
select regexp_substr('1:2:3:4','[^:]+', 1, level) as list from dual
connect by regexp_substr('1:2:3:4', '[^:]+', 1, level) is not null;
This will return:
List
1
2
3
4
Then you can simply add it to your query like this:
SELECT *
FROM TABLE
WHERE day_id IN
(SELECT regexp_substr('1:2:3:4','[^:]+', 1, level) AS list
FROM dual
CONNECT BY regexp_substr('1:2:3:4', '[^:]+', 1, level) IS NOT NULL
);
Can you try below statement. It has working as you expected.
create table table1 (id number, name varchar2(20));
alter table table1 add constraints pri_cons primary key(id);
create table table2 (id number, name varchar2(20));
alter table table2 add constraints ref_cons FOREIGN KEY(id) REFERENCES table1 (id);
begin
insert into table1 values (1,'Bala');
insert into table1 values (2,'Sathish');
insert into table1 values (3,'Subbu');
insert into table2 values (1,'Nalini');
insert into table2 values (2,'Sangeetha');
insert into table2 values (3,'Rubini');
end;
/
select * from table1 where id IN (Select id from table2);

Oracle Equivalent to MySQL INSERT IGNORE?

I need to update a query so that it checks that a duplicate entry does not exist before insertion. In MySQL I can just use INSERT IGNORE so that if a duplicate record is found it just skips the insert, but I can't seem to find an equivalent option for Oracle. Any suggestions?
If you're on 11g you can use the hint IGNORE_ROW_ON_DUPKEY_INDEX:
SQL> create table my_table(a number, constraint my_table_pk primary key (a));
Table created.
SQL> insert /*+ ignore_row_on_dupkey_index(my_table, my_table_pk) */
2 into my_table
3 select 1 from dual
4 union all
5 select 1 from dual;
1 row created.
Check out the MERGE statement. This should do what you want - it's the WHEN NOT MATCHED clause that will do this.
Do to Oracle's lack of support for a true VALUES() clause the syntax for a single record with fixed values is pretty clumsy though:
MERGE INTO your_table yt
USING (
SELECT 42 as the_pk_value,
'some_value' as some_column
FROM dual
) t on (yt.pk = t.the_pke_value)
WHEN NOT MATCHED THEN
INSERT (pk, the_column)
VALUES (t.the_pk_value, t.some_column);
A different approach (if you are e.g. doing bulk loading from a different table) is to use the "Error logging" facility of Oracle. The statement would look like this:
INSERT INTO your_table (col1, col2, col3)
SELECT c1, c2, c3
FROM staging_table
LOG ERRORS INTO errlog ('some comment') REJECT LIMIT UNLIMITED;
Afterwards all rows that would have thrown an error are available in the table errlog. You need to create that errlog table (or whatever name you choose) manually before running the insert using DBMS_ERRLOG.CREATE_ERROR_LOG.
See the manual for details
I don't think there is but to save time you can attempt the insert and ignore the inevitable error:
begin
insert into table_a( col1, col2, col3 )
values ( 1, 2, 3 );
exception when dup_val_on_index then
null;
end;
/
This will only ignore exceptions raised specifically by duplicate primary key or unique key constraints; everything else will be raised as normal.
If you don't want to do this then you have to select from the table first, which isn't really that efficient.
Another variant
Insert into my_table (student_id, group_id)
select distinct p.studentid, g.groupid
from person p, group g
where NOT EXISTS (select 1
from my_table a
where a.student_id = p.studentid
and a.group_id = g.groupid)
or you could do
Insert into my_table (student_id, group_id)
select distinct p.studentid, g.groupid
from person p, group g
MINUS
select student_id, group_id
from my_table
A simple solution
insert into t1
select from t2
where not exists
(select 1 from t1 where t1.id= t2.id)
This one isn't mine, but came in really handy when using sqlloader:
create a view that points to your table:
CREATE OR REPLACE VIEW test_view
AS SELECT * FROM test_tab
create the trigger:
CREATE OR REPLACE TRIGGER test_trig
INSTEAD OF INSERT ON test_view
FOR EACH ROW
BEGIN
INSERT INTO test_tab VALUES
(:NEW.id, :NEW.name);
EXCEPTION
WHEN DUP_VAL_ON_INDEX THEN NULL;
END test_trig;
and in the ctl file, insert into the view instead:
OPTIONS(ERRORS=0)
LOAD DATA
INFILE 'file_with_duplicates.csv'
INTO TABLE test_view
FIELDS TERMINATED BY ','
(id, field1)
How about simply adding an index with whatever fields you need to check for dupes on and say it must be unique? Saves a read check.
yet another "where not exists"-variant using dual...
insert into t1(id, unique_name)
select t1_seq.nextval, 'Franz-Xaver' from dual
where not exists (select 1 from t1 where unique_name = 'Franz-Xaver');

Resources