check if row exists before inserting in athena table - insert

Expected result
Result i am getting when same row is added
I want to check if the same row exists before inserting into athena to avoid any duplicate rows in table abc.
Is it possible to use "check if exists" in athena table. The below code is being used to add data in table abc.
insert into table abc
with tmp as(
select
date,
price1,
price2
from
table2
)
select
*
from
tmp
where
price1 > 100
;

You can definitely check if rows already exist - you could work with where not exists:
NOT EXISTS is satisfied if no rows are returned by the subquery
with tmp as(
select
date,
price1,
price2
from
table2
)
select
*
from
tmp
where price1 > 100
and not exists (select 1 from abc where abc.date = tmp.date)
However, keep in mind that Athena can't update data that is once written.

Related

Compare differences before insert into oracle table

Could you please tell me how to compare differences between table and my select query and insert those results in separate table? My plan is to create one base table (name RESULT) by using select statement and populate it with current result set. Then next day I would like to create procedure which will going to compare same select with RESULT table, and insert differences into another table called DIFFERENCES.
Any ideas?
Thanks!
You can create the RESULT_TABLE using CTAS as follows:
CREATE TABLE RESULT_TABLE
AS SELECT ... -- YOUR QUERY
Then you can use the following procedure which calculates the difference between your query and data from RESULT_TABLE:
CREATE OR REPLACE PROCEDURE FIND_DIFF
AS
BEGIN
INSERT INTO DIFFERENCES
--data present in the query but not in RESULT_TABLE
(SELECT ... -- YOUR QUERY
MINUS
SELECT * FROM RESULT_TABLE)
UNION
--data present in the RESULT_TABLE but not in the query
(SELECT * FROM RESULT_TABLE
MINUS
SELECT ... );-- YOUR QUERY
END;
/
I have used the UNION and the difference between both of them in a different order using MINUS to insert the deleted data also in the DIFFERENCES table. If this is not the requirement then remove the query after/before the UNION according to your requirement.
-- Create a table with results from the query, and ID as primary key
create table result_t as
select id, col_1, col_2, col_3
from <some-query>;
-- Create a table with new rows, deleted rows or updated rows
create table differences_t as
select id
-- Old values
,b.col_1 as old_col_1
,b.col_2 as old_col_2
,b.col_3 as old_col_3
-- New values
,a.col_1 as new_col_1
,a.col_2 as new_col_2
,a.col_3 as new_col_3
-- Execute the query once again
from <some-query> a
-- Outer join to detect also detect new/deleted rows
full join result_t b using(id)
-- Null aware comparison
where decode(a.col_1, b.col_1, 1, 0) = 0
or decode(a.col_2, b.col_2, 1, 0) = 0
or decode(a.col_3, b.col_3, 1, 0) = 0;

oracle | delete duplicates records

I have identified some duplicates in my table:
-- DUPLICATES: ----
select PPLP_NAME,
START_TIME,
END_TIME,
count(*)
from PPLP_LOAD_GENSTAT
group by PPLP_NAME,
START_TIME,
END_TIME
having count(*) > 1
-- DUPLICATES: ----
How is it possible to delete them?
Even if you don't have the primary key, each record has a unique rowid associated.
By using the query below you delete only the records that don't have the maximum row id by self joining a table with the columns that cause duplication. This will make sure that you delete any duplicates.
DELETE FROM PPLP_LOAD_GENSTAT plg_outer
WHERE ROWID NOT IN(
select MAX(ROWID)
from PPLP_LOAD_GENSTAT plg_inner
WHERE plg_outer.pplp_name = plg_inner.pplg_name
AND plg_outer.start_time= plg_inner.start_time
AND plg_outer.end_time = plg_inner.end_time
);
I'd suggest something easier:
CREATE table NewTable as
SELECT DISTINCT pplp_name,start_time,end_time
FROM YourTable
Then delete your table, and rename the new table.
If you really want to delete records, you can find a few examples of how here.

Invalid result for Order By in postgresql

I was trying to order a table "main_table" by date (which is a float), my initial table looks like this (select date from main_table):
date
20160424105948
20160424045955
20160424050000
20160418170003
20160419233154
20160419233155
so by creating another table with ordered rows by "Order by date", i get some rows which are not correctly ordered:
create temp table tmp (like main_table including defaults);
insert into tmp (select * from compact_table order by date asc)
and it i get (select date from tmp) :
date
20160418170908
20160418170909
20160418170910
20160420110031 <<
20160418170911
'date' is a primary key, and i have up to 600000 rows in my table, am I doing something wrong ?

Hive multiple insert goes wrong with the DISTINCT select statement

I read this code from "Hadoop the Definitive Guide":
SELECT a.ad_id, a.campaign_id, a.account_id, b.user_id
FROM dim_ads a JOIN impression_logs b ON (b.ad_id = a.ad_id)
WHERE b.dateid = '2008-12-01') x
INSERT OVERWRITE DIRECTORY 'results_gby_adid'
SELECT x.ad_id, count(1), count(DISTINCT x.user_id) GROUP BY x.ad_id
INSERT OVERWRITE DIRECTORY 'results_gby_campaignid'
SELECT x.campaign_id, count(1), count(DISTINCT x.user_id) GROUP BY x.campaign_id
INSERT OVERWRITE DIRECTORY 'results_gby_accountid'
SELECT x.account_id, count(1), count(DISTINCT x.user_id) GROUP BY x.account_id;
but as my test, using several DISTINCT cannot get right results.
my hiveql as below:
CREATE TABLE IF NOT EXISTS a (logindate int, id int);
then
load local file to this table...
CREATE TABLE IF NOT EXISTS user (id INT) PARTITIONED BY (logindate INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
then
if inserting table separately:
INSERT OVERWRITE TABLE user PARTITION(logindate=20130120) SELECT DISTINCT(id) FROM a WHERE logindate=20130120;
INSERT OVERWRITE TABLE user PARTITION(logindate=20130121) SELECT DISTINCT(id) FROM a WHERE logindate=20130121;
the results are correct;
but if choosing the next multiple insert hql:
FROM a
INSERT OVERWRITE TABLE user PARTITION(logindate=20130120) SELECT DISTINCT(id) WHERE logindate=20130120
INSERT OVERWRITE TABLE user PARTITION(logindate=20130121) SELECT DISTINCT(id) WHERE logindate=20130121;
the results are not correct, both partitions have the same number of records, seems like select from DISTINCT(id) WHERE logindate=20130120 OR logindate=20130121
so is it a bug or did I write some wrong syntax?
DISTINCT has a bit of an odd history in the code as an alias to group by.
If there is a bug, then the version of hive you are using would be important to know since bugs are addressed in each release.
This might work:
FROM a
INSERT OVERWRITE TABLE user PARTITION(logindate=20130120) SELECT id WHERE logindate=20130120 GROUP BY id
INSERT OVERWRITE TABLE user PARTITION(logindate=20130121) SELECT id WHERE logindate=20130121 GROUP BY id;
if that doesn't work, this will definitely work...even though it isn't the approach you were attempting to use...
FROM (select distinct id, logindate from a where logindate in ('20130120','20130121')) subq_a
INSERT OVERWRITE TABLE user PARTITION(logindate=20130120) SELECT id WHERE logindate=20130120
INSERT OVERWRITE TABLE user PARTITION(logindate=20130120) SELECT id WHERE logindate=20130121;

How to populate column data in one table based on an id match from another table (using the data from the matched table)

In Oracle I have two tables. Both are populated with data and have a timestamp column.
One table has that column filled with data, the other table doesn't. But I need to get the data from the table that does into the column of the other table based on a match of another column.
Each has 'code' so the timestamp from the one table should only be put in the timestamp of the other table where the codes match.
I've tried cursors etc, but it seems like I'm missing something.
Any suggestions?
It sounds like you want a correlated update. This will update every row of destinationTable with the timestamp_col from sourceTable where there is a match on the code column.
UPDATE destinationTable d
SET timestamp_col = (SELECT s.timestamp_col
FROM sourceTable s
WHERE s.code = d.code )
WHERE EXISTS( SELECT 1
FROM sourceTable s
WHERE s.code = d.code )
Just for completeness, I think I would have done something like this (untested), if the destination has a primary key or unique column:
UPDATE (
SELECT d.primary_key AS pk, d.timestamp_col AS dest_col, MAX(s.timestamp_col) AS src_col
FROM dest d
JOIN src s ON d.code = s.code
GROUP BY pk, dest_col
) j
SET j.dest_col = j.src_col

Resources