compare two tables and delete rows from one table if there are similar column values in two tables hive - hadoop

Table descriptions are in the link.
Table 1 and Table 2 both have rows with A and D. I need these two removed from Table 2.
Please check the link below for descriptive detail. Thank you.

You may do an INSERT OVERWRITE using a LEFT JOIN select query:
INSERT OVERWRITE TABLE table2
SELECT t2.*
FROM table2 t2
LEFT JOIN table1 t1
  ON (t1.x = t2.p)   -- use the appropriate common column name
WHERE t1.x IS NULL;  -- keep only table2 rows with no match in table1
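The anti-join pattern above (rebuild the table, keeping only the rows with no match) can be sketched with SQLite standing in for Hive; the column names x and p follow the snippet, and the sample values are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE table1 (x TEXT)")
cur.execute("CREATE TABLE table2 (p TEXT)")
cur.executemany("INSERT INTO table1 VALUES (?)", [("A",), ("D",)])
cur.executemany("INSERT INTO table2 VALUES (?)", [("A",), ("B",), ("C",), ("D",)])

# Anti-join: keep only table2 rows with no match in table1.
# (Hive's INSERT OVERWRITE is emulated by rebuilding the table contents.)
kept = cur.execute("""
    SELECT t2.p
    FROM table2 t2
    LEFT JOIN table1 t1 ON t1.x = t2.p
    WHERE t1.x IS NULL
""").fetchall()
cur.execute("DELETE FROM table2")
cur.executemany("INSERT INTO table2 VALUES (?)", kept)

remaining = [r[0] for r in cur.execute("SELECT p FROM table2 ORDER BY p")]
print(remaining)  # ['B', 'C']
```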

Related

Oracle 20 million update based on join

I have a need to do the following
UPDATE TABLE2 t2
SET t2.product_id = (SELECT t1.product_id
                     FROM table1 t1
                     WHERE t1.matching_id = t2.matching_id)
Except that TABLE2 has 27 million records. The product_id is a newly added column, hence the need to populate it.
I could use a cursor and break down my record set in TABLE2 into reasonably smaller chunks, but with 27 million records I am not sure what's the best way.
Please suggest, even if it means exporting my data to Excel.
Update - the matching columns are indexed too.
The only thing I could do differently is replace the UPDATE with a CREATE TABLE AS:
CREATE TABLE table2_new AS
SELECT t2.* /* minus product_id */, t1.product_id
FROM table1 t1
JOIN table2 t2
  ON t1.matching_id = t2.matching_id
But afterwards you will have to add the constraints manually, drop table2, and rename table2_new to table2.
Alternatively, an updatable inline view (requires the join key to be unique on the table1 side):
UPDATE (SELECT t1.product_id AS old_product_id, t2.product_id AS new_product_id
        FROM table1 t1
        JOIN table2 t2 ON (t1.matching_id = t2.matching_id)) t
SET t.new_product_id = t.old_product_id;
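The rebuild-instead-of-update idea can be exercised with SQLite (Oracle in the answer); the matching_id and product_id names come from the question, while other_col and the sample rows are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE table1 (matching_id INTEGER, product_id INTEGER)")
cur.execute("CREATE TABLE table2 (matching_id INTEGER, other_col TEXT)")
cur.executemany("INSERT INTO table1 VALUES (?, ?)", [(1, 100), (2, 200)])
cur.executemany("INSERT INTO table2 VALUES (?, ?)", [(1, "a"), (2, "b")])

# Rebuild instead of updating in place: one sequential pass over the join,
# avoiding per-row UPDATE overhead on millions of records.
cur.execute("""
    CREATE TABLE table2_new AS
    SELECT t2.matching_id, t2.other_col, t1.product_id
    FROM table1 t1
    JOIN table2 t2 ON t1.matching_id = t2.matching_id
""")
cur.execute("DROP TABLE table2")
cur.execute("ALTER TABLE table2_new RENAME TO table2")

rows = cur.execute(
    "SELECT matching_id, product_id FROM table2 ORDER BY matching_id").fetchall()
print(rows)  # [(1, 100), (2, 200)]
```

Constraints and indexes then have to be recreated on the new table by hand, as the answer notes.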

Delete Duplicate rows in Vertica database

Vertica allows duplicates to be inserted into the tables. I can view those using the 'analyze_constraints' function.
How to delete duplicate rows from Vertica tables?
You should try to avoid/limit using DELETE with a large number of records. The following approach should be more effective:
Step 1 Create a new table with the same structure / projections as the one containing duplicates:
create table mytable_new like mytable including projections ;
Step 2 Insert into this new table de-duplicated rows:
insert /* +direct */ into mytable_new select <column list> from (
select * , row_number() over ( partition by <pk column list> ) as rownum from <table-name>
) a where a.rownum = 1 ;
Step 3 rename the original table (the one containing dups):
alter table mytable rename to mytable_orig ;
Step 4 rename the new table:
alter table mytable_new rename to mytable ;
That's all.
Mauro's answer is correct, except for one detail in the SQL of Step 2: the inner SELECT should read from mytable (the table being de-duplicated) rather than from <table-name>. With that substitution, the four steps above work as written.
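The four steps can be run end-to-end with SQLite (3.25+ for window functions) standing in for Vertica; the /*+direct*/ hint and projections have no SQLite equivalent, and the (k, v) columns and sample rows are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE mytable (k TEXT, v INTEGER)")
cur.executemany("INSERT INTO mytable VALUES (?, ?)",
                [("a", 1), ("a", 1), ("b", 2), ("b", 2), ("b", 2), ("c", 3)])

# Steps 1+2: build a new table holding one row per duplicate group
cur.execute("""
    CREATE TABLE mytable_new AS
    SELECT k, v FROM (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY k, v) AS rownum
        FROM mytable
    ) a WHERE a.rownum = 1
""")
# Steps 3+4: swap the tables by renaming
cur.execute("ALTER TABLE mytable RENAME TO mytable_orig")
cur.execute("ALTER TABLE mytable_new RENAME TO mytable")

count = cur.execute("SELECT COUNT(*) FROM mytable").fetchone()[0]
print(count)  # 3
```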
Off the top of my head (and not a great answer): you can delete both duplicate rows and insert one back in.
You can delete duplicates from Vertica tables by creating a temporary table and generating pseudo row_ids. Here are a few steps, especially useful if you are removing duplicates from very large and wide tables. In the example below, I assume the k1 and k2 key values each have more than one duplicate.
-- Step 1: Find the duplicates
select key, count(1) from large-table-1
where [where-conditions]
group by 1
having count(1) > 1
order by count(1) desc;
-- Step 2: Dump the duplicates into temp table
create table test.large-table-1-dups
like large-table-1;
alter table test.large-table-1-dups -- add row_num column (pseudo row_id)
add column row_num int;
insert into test.large-table-1-dups
select *, ROW_NUMBER() OVER(PARTITION BY key)
from large-table-1
where key in ('k1', 'k2'); -- where, say, k1 has n and k2 has m exact dups
-- Step 3: Remove duplicates from the temp table
delete from test.large-table-1-dups
where row_num > 1;
-- Sanity check: should now have one row each for k1 and k2
select * from test.large-table-1-dups;
-- Step 4: Delete all duplicates from main table...
delete from large-table-1
where key in ('k1', 'k2');
-- Step 5: Insert data back into main table from temp dedupe data
alter table test.large-table-1-dups
drop column row_num;
insert into large-table-1
select * from test.large-table-1-dups;
Step 1: Create an intermediate table to port/load the data from the original table along with a row number.
In the sample below, data is ported from Table1 to Table2 along with a row_num column:
select * into Table2 from (select *, ROW_NUMBER() OVER(PARTITION BY A, B ORDER BY C) as row_num from Table1) A;
Step 2: Delete data from Table1 using Table2, created in the step above:
DELETE FROM Table1 WHERE EXISTS (SELECT NULL FROM Table2
    WHERE Table2.A = Table1.A
      AND Table2.B = Table1.B
      AND Table2.C = Table1.C  -- C must distinguish rows within an (A, B) group,
      AND row_num > 1);        -- otherwise every row of the group is deleted
Step 3: Drop the table created in Step 1, i.e. Table2:
DROP TABLE Table2;
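The three steps can be sketched with SQLite (the SELECT ... INTO above is SQL Server syntax; SQLite uses CREATE TABLE AS). The A, B, C columns follow the snippet; the sample rows are assumptions, with C distinguishing rows within each (A, B) group:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Table1 (A TEXT, B TEXT, C INTEGER)")
cur.executemany("INSERT INTO Table1 VALUES (?, ?, ?)",
                [("x", "y", 1), ("x", "y", 2), ("p", "q", 3)])

# Step 1: intermediate table with row numbers per (A, B) group
cur.execute("""
    CREATE TABLE Table2 AS
    SELECT *, ROW_NUMBER() OVER (PARTITION BY A, B ORDER BY C) AS row_num
    FROM Table1
""")
# Step 2: delete Table1 rows that are beyond the first in their group
cur.execute("""
    DELETE FROM Table1 WHERE EXISTS (
        SELECT NULL FROM Table2
        WHERE Table2.A = Table1.A
          AND Table2.B = Table1.B
          AND Table2.C = Table1.C
          AND Table2.row_num > 1)
""")
# Step 3: drop the intermediate table
cur.execute("DROP TABLE Table2")

count1 = cur.execute("SELECT COUNT(*) FROM Table1").fetchone()[0]
print(count1)  # 2
```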
You should have a look at this answer from the PostgreSQL wiki, which also works for Vertica:
DELETE FROM tablename
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY column1, column2, column3
                                  ORDER BY id) AS rnum
        FROM tablename
    ) t
    WHERE t.rnum > 1
);
It deletes all duplicate entries except the one with the lowest id.
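This keep-the-lowest-id pattern can be demonstrated with SQLite; the table and column names follow the snippet, the data is an assumption:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE tablename (id INTEGER PRIMARY KEY, column1 TEXT)")
cur.executemany("INSERT INTO tablename (column1) VALUES (?)",
                [("dup",), ("dup",), ("uniq",)])

# Delete every duplicate except the row with the lowest id in each group
cur.execute("""
    DELETE FROM tablename
    WHERE id IN (
        SELECT id FROM (
            SELECT id,
                   ROW_NUMBER() OVER (PARTITION BY column1 ORDER BY id) AS rnum
            FROM tablename
        ) t
        WHERE t.rnum > 1)
""")

ids = [r[0] for r in cur.execute("SELECT id FROM tablename ORDER BY id")]
print(ids)  # [1, 3]
```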

Stored procedure to copy data from one table to multiple tables

Need help writing a stored procedure for copying data from one table to multiple tables.
Here is the scenario example:
Table1 has 10 columns with 20 rows of data.
Table2 (primary key will be a sequence generator), Table3.
Insert 4 columns' values from Table1 into Table2, and 4 columns into Table3 (check the condition below).
Condition when inserting data into Table3:
- Whenever data is inserted into Table3 (from Table1), one field in Table3 must hold the primary key value of the corresponding Table2 row, along with some values from Table1.
Please help....
Try:
INSERT INTO dest_table (COL1, COL2, ..., COLn)
SELECT COLa, COLb, ..., COLz
FROM source_table;
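A plain INSERT ... SELECT is not quite enough for the two-target scenario in the question, because Table3 needs the key generated for the matching Table2 row. A sketch using SQLite's autoincrement rowid and lastrowid, with hypothetical column names c1..c4:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Table1 (c1 TEXT, c2 TEXT, c3 TEXT, c4 TEXT)")
cur.execute("CREATE TABLE Table2 (id INTEGER PRIMARY KEY, c1 TEXT, c2 TEXT)")
cur.execute("CREATE TABLE Table3 (t2_id INTEGER, c3 TEXT, c4 TEXT)")
cur.executemany("INSERT INTO Table1 VALUES (?, ?, ?, ?)",
                [("a", "b", "c", "d"), ("e", "f", "g", "h")])

# Row by row: insert into Table2, capture the generated key,
# then insert into Table3 referencing that key.
for c1, c2, c3, c4 in cur.execute("SELECT c1, c2, c3, c4 FROM Table1").fetchall():
    ins = conn.execute("INSERT INTO Table2 (c1, c2) VALUES (?, ?)", (c1, c2))
    conn.execute("INSERT INTO Table3 (t2_id, c3, c4) VALUES (?, ?, ?)",
                 (ins.lastrowid, c3, c4))

pairs = cur.execute("""
    SELECT t2.c1, t3.c3 FROM Table2 t2 JOIN Table3 t3 ON t3.t2_id = t2.id
    ORDER BY t2.id
""").fetchall()
print(pairs)  # [('a', 'c'), ('e', 'g')]
```

In an actual stored procedure this loop would live in the database (e.g. a cursor with RETURNING or a sequence's CURRVAL, depending on the engine).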

Counting rows by a condition in another table

Need to count the number of rows in one table which connect to a second table by (name.sample), where in the second table the (name.sample) row was created before (or after) a certain date.
SELECT COUNT(*)
FROM table1 t1
INNER JOIN table2 t2 ON t1.my_foreign_key_column = t2.my_primary_key_column
WHERE t2.creation_date >= 'my_date_literal';
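The join-then-filter count can be checked with SQLite; the sample column (standing in for name.sample), creation_date values, and cutoff are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE table2 (sample TEXT PRIMARY KEY, creation_date TEXT)")
cur.execute("CREATE TABLE table1 (sample TEXT)")
cur.executemany("INSERT INTO table2 VALUES (?, ?)",
                [("s1", "2020-01-01"), ("s2", "2021-06-01")])
cur.executemany("INSERT INTO table1 VALUES (?)", [("s1",), ("s1",), ("s2",)])

# Count table1 rows whose linked table2 row was created on/after the cutoff
count = cur.execute("""
    SELECT COUNT(*)
    FROM table1 t1
    JOIN table2 t2 ON t1.sample = t2.sample
    WHERE t2.creation_date >= '2021-01-01'
""").fetchone()[0]
print(count)  # 1
```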

How to update a table with null values with data from other table at one time?

I have 2 tables, A and B. Table A has two columns, pkey (primary key) and col1. Table B also has two columns, pr_key (primary key but not a foreign key) and column1. Both tables have 4 rows. Table B has no values in column1, while table A has column1 values for all 4 rows. So my data looks like this:
Table A
pkey col1
A 10
B 20
C 30
D 40
Table B
pr_key column1
A null
B null
C null
D null
I want to update table B to set the column1 value of each row equal to the column1 value of the equivalent row from table A in a single DML statement.
Should be something like this (depends on the SQL implementation you use, but in general the following is rather standard; in particular it should work in MS-SQL and in MySQL):
INSERT INTO tblB (pr_key, column1)
SELECT pkey, col1
FROM tblA
-- WHERE some condition (if you don't want 100% of A to be copied)
The question is a bit unclear as to the nature of tblB's pr_key: if for some reason this is a default/auto-incremented key for that table, it can simply be omitted from both the column list (in parentheses) and the SELECT that follows. In this fashion, a new value is generated upon insertion of each new row.
Edit: It appears the OP actually wants to update table B with values from A.
The syntax for this should then be something like (SQL Server style, where the UPDATE target is the alias used in the FROM clause):
UPDATE B
SET B.Column1 = A.Col1
FROM tblB AS B
JOIN tblA AS A ON B.pr_key = A.pkey
This may perform better:
MERGE INTO tableB
USING (select pkey, col1 from tableA) a
ON (tableB.pr_key = a.pkey)
WHEN MATCHED THEN UPDATE
SET tableB.column1 = a.col1;
It sounds like you want to do a correlated update. The syntax for that in Oracle is
UPDATE tableB b
SET column1 = (SELECT a.column1
FROM tableA a
WHERE a.pkey = b.pr_key)
WHERE EXISTS( SELECT 1
FROM tableA a
WHERE a.pkey = b.pr_key )
The WHERE EXISTS clause isn't necessary if tableA and tableB each have 4 rows and have the same set of keys in each. It is much safer to include that option, though, to avoid updating column1 values of tableB to NULL if there is no matching row in tableA.
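The correlated update with its WHERE EXISTS guard also runs as-is in SQLite (aside from Oracle's table alias in the UPDATE clause), using the exact data from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE tableA (pkey TEXT PRIMARY KEY, col1 INTEGER)")
cur.execute("CREATE TABLE tableB (pr_key TEXT PRIMARY KEY, column1 INTEGER)")
cur.executemany("INSERT INTO tableA VALUES (?, ?)",
                [("A", 10), ("B", 20), ("C", 30), ("D", 40)])
cur.executemany("INSERT INTO tableB VALUES (?, NULL)",
                [("A",), ("B",), ("C",), ("D",)])

# Correlated update: each tableB row takes col1 from its tableA match;
# WHERE EXISTS keeps unmatched rows from being overwritten with NULL.
cur.execute("""
    UPDATE tableB
    SET column1 = (SELECT a.col1 FROM tableA a WHERE a.pkey = tableB.pr_key)
    WHERE EXISTS (SELECT 1 FROM tableA a WHERE a.pkey = tableB.pr_key)
""")

values = [r[0] for r in cur.execute("SELECT column1 FROM tableB ORDER BY pr_key")]
print(values)  # [10, 20, 30, 40]
```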
