Upserting in GreenPlum - greenplum

How can I upsert records in GreenPlum while copying the data from a CSV file? The CSV file has multiple records for a given value of the primary key. If a row with a given key value already exists in the database, I want to update that record. Otherwise, a new row should be inserted.

One way to do this is to copy the data to a staging table, then insert/update from that table.
Here is an example of that:
-- Duplicate the definition of your table.
CREATE TEMP TABLE my_table_stage (LIKE my_table INCLUDING DEFAULTS);
-- Your COPY statement, loading into the staging table
COPY my_table_stage FROM 'my_file.csv' ...
-- Insert any "new" records
INSERT INTO my_table (key_field, data_field1, data_field2)
SELECT
stg.key_field,
stg.data_field1,
stg.data_field2
FROM
my_table_stage stg
WHERE
NOT EXISTS (SELECT 1 FROM my_table WHERE key_field = stg.key_field);
-- Update any existing records
UPDATE my_table orig
SET
data_field1 = stg.data_field1,
data_field2 = stg.data_field2
FROM
my_table_stage stg
WHERE
orig.key_field = stg.key_field;
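Since the CSV in the question can hold several rows for the same key, the staging table may need a deduplication pass before the update and insert. Below is a minimal runnable sketch of that idea. It uses Python's sqlite3 module as a stand-in for GreenPlum (an assumption, so the example runs anywhere); the line_no column recording each row's position in the file is hypothetical, and "the last row for a key wins" is an assumed rule, not something the question states.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE my_table (key_field INTEGER PRIMARY KEY, data_field1 TEXT);
-- line_no is a hypothetical column holding the row's position in the CSV
CREATE TABLE my_table_stage (line_no INTEGER, key_field INTEGER, data_field1 TEXT);
INSERT INTO my_table VALUES (1, 'old');
""")

# Rows as they might arrive from the CSV file; key 2 appears twice.
csv_rows = [(1, 1, "new"), (2, 2, "first"), (3, 2, "second")]
conn.executemany("INSERT INTO my_table_stage VALUES (?, ?, ?)", csv_rows)

# Keep only the last staged row per key (assumed rule: last row wins).
conn.execute("""
DELETE FROM my_table_stage
WHERE line_no < (
    SELECT MAX(s2.line_no) FROM my_table_stage AS s2
    WHERE s2.key_field = my_table_stage.key_field
)
""")

# Update rows whose key already exists in the main table.
conn.execute("""
UPDATE my_table
SET data_field1 = (
    SELECT s.data_field1 FROM my_table_stage AS s
    WHERE s.key_field = my_table.key_field
)
WHERE EXISTS (
    SELECT 1 FROM my_table_stage AS s
    WHERE s.key_field = my_table.key_field
)
""")

# Insert rows whose key is not in the main table yet.
conn.execute("""
INSERT INTO my_table (key_field, data_field1)
SELECT s.key_field, s.data_field1
FROM my_table_stage AS s
WHERE NOT EXISTS (SELECT 1 FROM my_table AS t WHERE t.key_field = s.key_field)
""")
conn.commit()

result = conn.execute(
    "SELECT key_field, data_field1 FROM my_table ORDER BY key_field"
).fetchall()
print(result)
```

Without the deduplication step, the insert would try to add key 2 twice and hit the primary key, and an update matching several staged rows would pick one of them arbitrarily.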

Related

How to copy all constraints and data from one schema to another in Oracle

I am using Toad for Oracle 12c. I need to copy a table and its data (40M) from one schema to another (prod to test). However, there is a unique key (not the PK of this table) on a column called record_id, which holds values like 3.000*******19E15. About 2M rows appear to have the same number (I believe because the numbers are very large), although they are unique in prod. When I try to copy the data, it violates the unique key on that column. I am using Toad's "export data to another schema" function to copy the data.
When I execute either of these queries in prod
select count(*) from table_name
or
select count(distinct record_id) from table_name
both return exactly the same number.
I don't have DBA permission. How do I copy all the data without violating the unique key of the table?
Thanks in advance!
You can use an upsert to decide between INSERT and UPDATE, or you can write a small procedure for this.
You could also use NOT EXISTS, but your data is big and it might not be resource-efficient.
insert into prod_tab
select * from other_tab t1 where NOT exists (
select 1 from prod_tab t2 where t1.id = t2.id
);
In Oracle you can use a MERGE query for that.
For each source row, the following query proceeds as follows:
- if the source record_id does not yet exist in the target table, a new record is inserted
- otherwise, the existing record is updated with the source values
For the sake of the example, I assumed that there are two other columns in the table: column1 and column2.
MERGE INTO target_table t1
USING (SELECT * FROM source_table) t2
ON (t1.record_id = t2.record_id)
WHEN MATCHED THEN UPDATE SET
t1.column1 = t2.column1,
t1.column2 = t2.column2
WHEN NOT MATCHED THEN INSERT
(record_id, column1, column2) VALUES (t2.record_id, t2.column1, t2.column2)
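To see the insert-or-update behaviour on a small data set without an Oracle instance, here is a rough runnable sketch in Python with sqlite3. SQLite has no MERGE statement, so the sketch swaps in SQLite's INSERT ... ON CONFLICT DO UPDATE, which gives the same end result for this simple case; it is an illustration of the semantics only, not the Oracle syntax. The sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE target_table (record_id INTEGER PRIMARY KEY, column1 TEXT, column2 TEXT);
CREATE TABLE source_table (record_id INTEGER PRIMARY KEY, column1 TEXT, column2 TEXT);
INSERT INTO target_table VALUES (1, 'a-old', 'b-old');
INSERT INTO source_table VALUES (1, 'a-new', 'b-new'), (2, 'c', 'd');
""")

# "WHERE 1" is needed so SQLite does not parse ON CONFLICT as a join constraint.
conn.execute("""
INSERT INTO target_table (record_id, column1, column2)
SELECT record_id, column1, column2 FROM source_table WHERE 1
ON CONFLICT (record_id) DO UPDATE SET
    column1 = excluded.column1,
    column2 = excluded.column2
""")
conn.commit()

merged = conn.execute("SELECT * FROM target_table ORDER BY record_id").fetchall()
print(merged)
```

Record 1 already existed and is updated with the source values; record 2 did not exist and is inserted.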

Hive insert overwrite truncates the table in some cases

I was working on a solution and found that in some particular cases Hive insert overwrite truncates the table, while in other cases it doesn't. Would someone please explain why it behaves like that?
To explain this, I am taking two tables, source and target, and trying to insert data into target from the source table using insert overwrite.
When Source Table has partition
If the source table has a partition and you write a condition such that the partition does not exist, then it won't truncate the target table.
create table source (name String) partitioned by (age int);
insert into source partition (age) values("gaurang", 11);
create table target (name String, age int);
insert into target values("xxx", 99);
The following query won't truncate the table even if the select doesn't return anything.
insert overwrite table temp.target select * from temp.source where name="Ddddd" and age=99;
However, the following query will truncate the table.
insert overwrite table temp.target select * from temp.source where name="Ddddd" and age=11;
It makes sense in the first case: as the partition (age=99) does not exist, it should stop further execution of the query. However, this is my assumption; I am not sure what exactly happens.
When Source Table Doesn't have partition, but Target has
In this case the target table won't be truncated even if the select statement from the source table returns 0 rows.
use temp;
drop table if exists source1;
drop table if exists target1;
create table source1 (name String, age int);
create table target1 (name String) partitioned by (age int);
insert into source1 values ("gaurang", 11);
insert into target1 partition(age) values("xxx", 99);
select * from source1;
select * from target1;
The following query won't truncate the table even if the select statement finds no data.
insert overwrite table temp.target1 partition(age) select * from temp.source1 where age=90;
When Source or Target don't have partition
In this case, if I try to insert overwrite the target and the select statement doesn't return any rows, then the target table will be truncated.
Check the example below.
use temp;
drop table if exists source1;
drop table if exists target1;
create table source1 (name String, age int);
create table target1 (name String, age int);
insert into source1 values ("gaurang", 11);
insert into target1 values("xxx", 99);
select * from source1;
select * from target1;
The following query will truncate the target table.
insert overwrite table temp.target1 select * from temp.source1 where age=90;
It is better to use the term 'overwrite' instead of 'truncate', because that is exactly what happens during insert overwrite.
When you write overwrite table temp.target1 partition(age), you instruct Hive to overwrite partitions, not the whole target1 table: only those partitions that are returned by the select.
An empty dataset will not overwrite any partitions in dynamic partition mode. The partitions to overwrite are not known in advance; they are taken from the dataset, and since the dataset is empty there is nothing to overwrite.
In the case of a non-partitioned table, it is already known that the whole table should be overwritten, whether the dataset is empty or not.
The partition column should be the last column in the select of an insert overwrite statement. The list of partitions to be overwritten in the target is the list of values in the partition column returned by the dataset. It does not matter how the source table is partitioned (you can select the target partition column from any source table column, calculate it, or use a constant); only what is returned matters.
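The rule described above can be sketched as a small toy model in Python. This is not Hive code and does not call any Hive API; it only mimics the described behaviour with plain lists and dictionaries, and the function names are made up for illustration.

```python
from collections import defaultdict


def overwrite_unpartitioned(target_rows, selected_rows):
    """Toy model: a non-partitioned target is replaced as a whole,
    so an empty select result leaves an empty table."""
    return list(selected_rows)


def overwrite_dynamic_partitions(target_parts, selected_rows):
    """Toy model: only partitions that appear in the select result are replaced.

    target_parts maps a partition value (age) to its list of names.
    selected_rows holds (name, age) tuples, partition column last.
    """
    new_parts = defaultdict(list)
    for name, age in selected_rows:
        new_parts[age].append(name)
    result = dict(target_parts)  # partitions not in the result stay untouched
    result.update(new_parts)     # partitions in the result are replaced
    return result


# Non-partitioned target, empty select: the table ends up empty.
unpartitioned = overwrite_unpartitioned([("xxx", 99)], [])

# Partitioned target, empty select: no partition is known, nothing changes.
partitioned_empty = overwrite_dynamic_partitions({99: ["xxx"]}, [])

# Partitioned target, select returns age=11: only that partition is replaced.
partitioned_hit = overwrite_dynamic_partitions(
    {99: ["xxx"], 11: ["old"]}, [("gaurang", 11)]
)
print(unpartitioned, partitioned_empty, partitioned_hit)
```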

How to create a procedure which checks if there are any recently added records to the table and if there are then move them to archive table

I have to create a procedure which searches for any recently added records and, if there are any, moves them to the ARCHIVE table.
This is my statement which filters recently added records
SELECT
CL_ID,
CL_NAME,
CL_SURNAME,
CL_PHONE,
VEH_ID,
VEH_REG_NO,
VEH_MODEL,
VEH_MAKE_YEAR,
WD_ID,
WORK_DESC,
INV_ID,
INV_SERIES,
INV_NUM,
INV_DATE,
INV_PRICE
FROM
CLIENT,
INVOICE,
VEHICLE,
WORKS,
WORKS_DONE
WHERE
Client.CL_ID = Invoice.INV_CL_ID and
Client.CL_ID = Vehicle.VEH_CL_ID and
Vehicle.VEH_ID = Works_Done.WD_VEH_ID and
Works_Done.WD_INV_ID = Invoice.INV_ID and
Works_Done.WD_WORK_ID = Works.WORK_ID and
Works_Done.Timestamp >= sysdate - 1;
You may need something like this (pseudo-code):
create or replace procedure moveRecords is
vLimitDate timestamp := systimestamp -1;
begin
insert into table2
select *
from table1
where your_date >= vLimitDate;
--
delete table1
where your_date >= vLimitDate;
end;
Here are the steps I've used for this sort of task in the past.
1. Create a global temporary table (GTT) to hold a set of ROWIDs.
2. Perform a multitable direct path insert, which selects the rows to be archived from the source table and inserts their ROWIDs into the GTT and the rest of the data into the archive table.
3. Perform a delete from the source table, where the source table ROWID is in the GTT of ROWIDs.
4. Issue a commit.
The business with the GTT and the ROWIDs ensures that you have 100% guaranteed stability in the set of rows that you are selecting and then deleting from the source table, regardless of any changes that might occur between the start of your select and the start of your delete (other than someone causing a partitioned table row migration or shrinking the table).
You could alternatively achieve that through changing the transaction isolation level.
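The ROWID-capture steps above can be sketched roughly as follows. This is a runnable illustration in Python with sqlite3 standing in for Oracle (an assumption): a TEMP table plays the role of the GTT, and because SQLite has no multitable insert, the capture and the archive insert are two statements here rather than one. The table layout, the created_day column and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE main_table (id INTEGER PRIMARY KEY, payload TEXT, created_day INTEGER);
CREATE TABLE archive_table (id INTEGER PRIMARY KEY, payload TEXT, created_day INTEGER);
-- stands in for the global temporary table of ROWIDs
CREATE TEMP TABLE rows_to_move (source_rowid INTEGER PRIMARY KEY);
INSERT INTO main_table VALUES (1, 'old', 1), (2, 'recent', 9), (3, 'recent too', 10);
""")
limit_day = 9  # hypothetical cut-off for "recently added"

with conn:  # one transaction, committed at the end of the block
    # 1. Fix the set of rows once, by rowid.
    conn.execute(
        "INSERT INTO rows_to_move SELECT rowid FROM main_table WHERE created_day >= ?",
        (limit_day,),
    )
    # 2. Archive exactly that set.
    conn.execute(
        "INSERT INTO archive_table SELECT * FROM main_table "
        "WHERE rowid IN (SELECT source_rowid FROM rows_to_move)"
    )
    # 3. Delete exactly that set, regardless of rows added in the meantime.
    conn.execute(
        "DELETE FROM main_table WHERE rowid IN (SELECT source_rowid FROM rows_to_move)"
    )
    # An Oracle GTT created ON COMMIT DELETE ROWS empties itself; here we do it by hand.
    conn.execute("DELETE FROM rows_to_move")

remaining = conn.execute("SELECT * FROM main_table ORDER BY id").fetchall()
archived = conn.execute("SELECT * FROM archive_table ORDER BY id").fetchall()
print(remaining, archived)
```

The point of the design is that the delete is driven by the captured row identifiers rather than by re-evaluating the date filter, so a row that arrives between the insert and the delete is not removed without being archived.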
OK, maybe something like this...
The downside is that it can be slow for large tables.
The upside is that there is no dependence on date and time, so you can run it at any time and synchronize your archive with the live data...
create or replace procedure archive is
begin
insert into archive_table
(
select * from main_table
minus
select * from archive_table
);
end;
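As a quick check that the set-difference approach can be run repeatedly without creating duplicates, here is a small runnable sketch in Python with sqlite3 (a stand-in for Oracle, chosen only so the example is self-contained). SQLite spells Oracle's MINUS as EXCEPT; the two-column table layout is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE main_table (id INTEGER, name TEXT);
CREATE TABLE archive_table (id INTEGER, name TEXT);
INSERT INTO main_table VALUES (1, 'a'), (2, 'b');
INSERT INTO archive_table VALUES (1, 'a');
""")


def archive(connection):
    """Copy every main_table row that is not yet present in archive_table."""
    with connection:
        connection.execute("""
            INSERT INTO archive_table
            SELECT * FROM main_table
            EXCEPT
            SELECT * FROM archive_table
        """)


archive(conn)
archive(conn)  # a second run finds nothing new to copy

archived = conn.execute("SELECT * FROM archive_table ORDER BY id").fetchall()
print(archived)
```

Note that the comparison is on whole rows, so a row that was changed in main_table after being archived would be copied again as a new archive row.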

How to insert init-data into a table in hive?

I wanted to insert some initial data into a table in Hive, so I created the HQL below,
INSERT OVERWRITE TABLE table PARTITION(dt='2014-06-26') SELECT 'key_sum' as key, '0' as value;
but it does not work.
There is another query like the above,
INSERT OVERWRITE TABLE table PARTITION(dt='2014-06-26') SELECT 'key_sum' as key, '0' as value FROM table limit 1;
But it also didn't work, as I see that the tables are empty.
How can I set the initial data into the table?
(This is the reason why I have to do a self-join.)
About the first HQL: it should have a FROM clause. It is missing, so the HQL fails.
INSERT OVERWRITE TABLE table PARTITION(dt='2014-06-26') SELECT 'key_sum' as key, '0' as value;
Regarding the second HQL, the table in the FROM clause should have at least one row, so that the constant init values can be set in your newly created table.
INSERT OVERWRITE TABLE table PARTITION(dt='2014-06-26') SELECT 'key_sum', '0' FROM table limit 1;
You can use any existing Hive table that has data in it, and give it a try.
The following query works fine if the test table has already been created in Hive.
INSERT OVERWRITE TABLE test PARTITION(dt='2014-06-26') SELECT 'key_sum' as key, '0' as value FROM test;
I think the table we insert into should be created first.
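The reason the self-select from an empty table inserts nothing is general SQL behaviour: constants in the select list are produced once per source row, so zero source rows give zero result rows. A minimal sketch in Python with sqlite3 (used here as a stand-in, not Hive; the table names are made up) shows the difference:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE empty_table (name TEXT, val TEXT);
CREATE TABLE seeded_table (name TEXT, val TEXT);
INSERT INTO seeded_table VALUES ('anything', 'x');
""")

# Constants selected from an empty table: no rows come back.
from_empty = conn.execute(
    "SELECT 'key_sum', '0' FROM empty_table LIMIT 1"
).fetchall()

# The same constants selected from a table with at least one row: one row.
from_seeded = conn.execute(
    "SELECT 'key_sum', '0' FROM seeded_table LIMIT 1"
).fetchall()

print(from_empty, from_seeded)
```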

How to duplicate all data in a table except for a single column that should be changed

I have a question regarding a unified insert query against tables with different data
structures (Oracle). Let me elaborate with an example:
tb_customers (
id NUMBER(3), name VARCHAR2(40), archive_id NUMBER(3)
)
tb_suppliers (
id NUMBER(3), name VARCHAR2(40), contact VARCHAR2(40), xxx, xxx,
archive_id NUMBER(3)
)
The only column that is present in all tables is [archive_id]. The plan is to create a new archive of the dataset by copying (duplicating) all records to a different database partition and incrementing the archive_id for those records accordingly. [archive_id] is always part of the primary key.
My problem is with select statements to do the actual duplication of the data. Because the columns are variable, I am struggling to come up with a unified select statement that will copy the data and update the archive_id.
One solution (that works), is to iterate over all the tables in a stored procedure and do a:
CREATE TABLE temp as (SELECT * from ORIGINAL_TABLE);
UPDATE temp SET archive_id=something;
INSERT INTO ORIGINAL_TABLE (select * from temp);
DROP TABLE temp;
I do not like this solution very much as the DDL commands muck up all restore points.
Does anyone else have any solution?
How about creating a global temporary table for each base table?
create global temporary table tb_customers$ as select * from tb_customers;
create global temporary table tb_suppliers$ as select * from tb_suppliers;
You don't need to create and drop these each time, just leave them as-is.
Your archive process is then a single transaction...
insert into tb_customers$ select * from tb_customers;
update tb_customers$ set archive_id = :v_new_archive_id;
insert into tb_customers select * from tb_customers$;
insert into tb_suppliers$ select * from tb_suppliers;
update tb_suppliers$ set archive_id = :v_new_archive_id;
insert into tb_suppliers select * from tb_suppliers$;
commit; -- this will clear the global temporary tables
Hope this helps.
I would suggest not having a single SQL statement for all tables and just using an insert per table.
insert into tb_customers_2
select id, name, :new_archive_id from tb_customers;
insert into tb_suppliers_2
select id, name, contact, xxx, xxx, :new_archive_id from tb_suppliers;
Or, if you really need a single SQL statement for all of them, at least pre-create all the temp tables (as temp tables) and leave them in place for next time. Then just use dynamic SQL to refer to the temp table.
insert into ORIGINAL_TABLE_TEMP (SELECT * from ORIGINAL_TABLE);
UPDATE ORIGINAL_TABLE_TEMP SET archive_id=something;
INSERT INTO NEW_TABLE (select * from ORIGINAL_TABLE_TEMP);
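One way the dynamic SQL idea could look, with no temp table at all, is to read each table's column list from the catalog and generate an INSERT ... SELECT in which only archive_id is replaced. The sketch below is a hedged illustration in Python with sqlite3 standing in for Oracle: it reads the columns with SQLite's PRAGMA table_info, whereas in Oracle the column names would come from a dictionary view such as USER_TAB_COLUMNS and the statement would be run with EXECUTE IMMEDIATE. The helper name duplicate_archive and the sample rows are made up.

```python
import sqlite3


def duplicate_archive(conn, table, old_id, new_id):
    """Copy all rows of one archive within `table`, changing only archive_id.

    `table` is interpolated into the SQL text, so it must come from a
    trusted list of table names, never from user input.
    """
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    # Every column is copied as-is except archive_id, which becomes a bind value.
    select_list = ", ".join("?" if col == "archive_id" else col for col in columns)
    sql = (
        f"INSERT INTO {table} ({', '.join(columns)}) "
        f"SELECT {select_list} FROM {table} WHERE archive_id = ?"
    )
    with conn:
        conn.execute(sql, (new_id, old_id))


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tb_customers (id INTEGER, name TEXT, archive_id INTEGER,
                           PRIMARY KEY (id, archive_id));
CREATE TABLE tb_suppliers (id INTEGER, name TEXT, contact TEXT, archive_id INTEGER,
                           PRIMARY KEY (id, archive_id));
INSERT INTO tb_customers VALUES (1, 'Ann', 1);
INSERT INTO tb_suppliers VALUES (7, 'Acme', 'Bob', 1);
""")

# The same helper works for tables with different column sets.
for table in ("tb_customers", "tb_suppliers"):
    duplicate_archive(conn, table, old_id=1, new_id=2)

customers = conn.execute("SELECT * FROM tb_customers ORDER BY archive_id").fetchall()
suppliers = conn.execute("SELECT * FROM tb_suppliers ORDER BY archive_id").fetchall()
print(customers, suppliers)
```

This avoids DDL during the archive run, which was the concern raised in the question about restore points.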

Resources