Incremental data load to hive table - hadoop

I am trying to load incremental data from one hive external table to another hive table. I have a date timestamp field on the source table to identify the newly added rows to it on a daily basis. My task is to extract the rows that are newly added to the source and insert them into the target table.
I am using Hive 0.14.
I tried the below queries but could not make it work.
INSERT INTO TABLE TARGET PARTITION (FIELD_DATE)
SELECT A.FIELD1, A.FIELD2, A.FIELD3,
CASE WHEN LENGTH(A.FIELD4)=0 THEN 0 ELSE 1 END,
CASE WHEN LENGTH(A.FIELD5)=0 THEN 0 ELSE 1 END,
FROM SOURCE A, (Select max(FIELD_TIMESTAMP) from TARGET) T
where A.FIELD_TIMESTAMP > T.FIELD_TIMESTAMP;
The above code is taking hours together without giving any result.
I also tried to execute the below query and later found that HIVE does not support subqueries in WHERE clause. (got ParseException)
INSERT INTO TABLE TARGET PARTITION (FIELD_DATE)
SELECT A.FIELD1, A.FIELD2, A.FIELD3,
CASE WHEN LENGTH(A.FIELD4)=0 THEN 0 ELSE 1 END,
CASE WHEN LENGTH(A.FIELD5)=0 THEN 0 ELSE 1 END,
FROM SOURCE A, TARGET T
where A.FIELD_TIMESTAMP > (Select max(FIELD_TIMESTAMP) from TARGET);
Please help me out in selecting only the rows that have been added after my initial load.
Thank you.

Try this
INSERT INTO TABLE TARGET PARTITION (FIELD_DATE)
SELECT A.FIELD1, A.FIELD2, A.FIELD3,
CASE WHEN LENGTH(A.FIELD4)=0 THEN 0 ELSE 1 END,
CASE WHEN LENGTH(A.FIELD5)=0 THEN 0 ELSE 1 END,
FROM SOURCE A JOIN
(Select max(FIELD_TIMESTAMP) as FIELD_TIMESTAMP from TARGET) T
on 1=1
where A.FIELD_TIMESTAMP > T.FIELD_TIMESTAMP;
This is the tested query for your reference:
insert into orders_target
select o.* from orders_source o join
(select max(o1.order_date) order_date from orders_target o1) o2
on 1=1
where o.order_date > o2.order_date;

Related

Delete duplicate rows from a BigQuery table

I have a table with >1M rows of data and 20+ columns.
Within my table (tableX) I have identified duplicate records (~80k) in one particular column (troubleColumn).
If possible I would like to retain the original table name and remove the duplicate records from my problematic column otherwise I could create a new table (tableXfinal) with the same schema but without the duplicates.
I am not proficient in SQL or any other programming language so please excuse my ignorance.
delete from Accidents.CleanedFilledCombined
where Fixed_Accident_Index
in(select Fixed_Accident_Index from Accidents.CleanedFilledCombined
group by Fixed_Accident_Index
having count(Fixed_Accident_Index) >1);
You can remove duplicates by running a query that rewrites your table (you can use the same table as the destination, or you can create a new table, verify that it has what you want, and then copy it over the old table).
A query that should work is here:
SELECT *
FROM (
SELECT
*,
ROW_NUMBER()
OVER (PARTITION BY Fixed_Accident_Index)
row_number
FROM Accidents.CleanedFilledCombined
)
WHERE row_number = 1
UPDATE 2019: To de-duplicate rows on a single partition with a MERGE, see:
https://stackoverflow.com/a/57900778/132438
An alternative to Jordan's answer - this one scales better when having too many duplicates:
#standardSQL
SELECT event.* FROM (
SELECT ARRAY_AGG(
t ORDER BY t.created_at DESC LIMIT 1
)[OFFSET(0)] event
FROM `githubarchive.month.201706` t
# GROUP BY the id you are de-duplicating by
GROUP BY actor.id
)
Or a shorter version (takes any row, instead of the newest one):
SELECT k.*
FROM (
SELECT ARRAY_AGG(x LIMIT 1)[OFFSET(0)] k
FROM `fh-bigquery.reddit_comments.2017_01` x
GROUP BY id
)
To de-duplicate rows on an existing table:
CREATE OR REPLACE TABLE `deleting.deduplicating_table`
AS
# SELECT id FROM UNNEST([1,1,1,2,2]) id
SELECT k.*
FROM (
SELECT ARRAY_AGG(row LIMIT 1)[OFFSET(0)] k
FROM `deleting.deduplicating_table` row
GROUP BY id
)
Not sure why nobody mentioned DISTINCT query.
Here is the way to clean duplicate rows:
CREATE OR REPLACE TABLE project.dataset.table
AS
SELECT DISTINCT * FROM project.dataset.table
If your schema doesn’t have any records - below variation of Jordan’s answer will work well enough with writing over same table or new one, etc.
SELECT <list of original fields>
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY Fixed_Accident_Index) AS pos,
FROM Accidents.CleanedFilledCombined
)
WHERE pos = 1
In more generic case - with complex schema with records/netsed fields, etc. - above approach can be a challenge.
I would propose to try using Tabledata: insertAll API with rows[].insertId set to respective Fixed_Accident_Index for each row.
In this case duplicate rows will be eliminated by BigQuery
Of course, this will involve some client side coding - so might be not relevant for this particular question.
I havent tried this approach by myself either but feel it might be interesting to try :o)
If you have a large-size partitioned table, and only have duplicates in a certain partition range. You don't want to overscan nor process the whole table. use the MERGE SQL below with predicates on partition range:
-- WARNING: back up the table before this operation
-- FOR large size timestamp partitioned table
-- -------------------------------------------
-- -- To de-duplicate rows of a given range of a partition table, using surrage_key as unique id
-- -------------------------------------------
DECLARE dt_start DEFAULT TIMESTAMP("2019-09-17T00:00:00", "America/Los_Angeles") ;
DECLARE dt_end DEFAULT TIMESTAMP("2019-09-22T00:00:00", "America/Los_Angeles");
MERGE INTO `gcp_project`.`data_set`.`the_table` AS INTERNAL_DEST
USING (
SELECT k.*
FROM (
SELECT ARRAY_AGG(original_data LIMIT 1)[OFFSET(0)] k
FROM `gcp_project`.`data_set`.`the_table` AS original_data
WHERE stamp BETWEEN dt_start AND dt_end
GROUP BY surrogate_key
)
) AS INTERNAL_SOURCE
ON FALSE
WHEN NOT MATCHED BY SOURCE
AND INTERNAL_DEST.stamp BETWEEN dt_start AND dt_end -- remove all data in partiion range
THEN DELETE
WHEN NOT MATCHED THEN INSERT ROW
credit: https://gist.github.com/hui-zheng/f7e972bcbe9cde0c6cb6318f7270b67a
Easier answer, without a subselect
SELECT
*,
ROW_NUMBER()
OVER (PARTITION BY Fixed_Accident_Index)
row_number
FROM Accidents.CleanedFilledCombined
WHERE TRUE
QUALIFY row_number = 1
The Where True is neccesary because qualify needs a where, group by or having clause
Felipe's answer is the best approach for most cases. Here is a more elegant way to accomplish the same:
CREATE OR REPLACE TABLE Accidents.CleanedFilledCombined
AS
SELECT
Fixed_Accident_Index,
ARRAY_AGG(x LIMIT 1)[SAFE_OFFSET(0)].* EXCEPT(Fixed_Accident_Index)
FROM Accidents.CleanedFilledCombined AS x
GROUP BY Fixed_Accident_Index;
To be safe, make sure you backup the original table before you run this ^^
I don't recommend to use ROW NUMBER() OVER() approach if possible since you may run into BigQuery memory limits and get unexpected errors.
Update BigQuery schema with new table column as bq_uuid making it NULLABLE and type STRING

Create duplicate rows by running same command 5 times for example
insert into beginner-290513.917834811114.messages (id, type, flow, updated_at) Values(19999,"hello", "inbound", '2021-06-08T12:09:03.693646')
Check if duplicate entries exist
select * from beginner-290513.917834811114.messages where id = 19999
Use generate uuid function to generate uuid corresponding to each message

UPDATE beginner-290513.917834811114.messages
SET bq_uuid = GENERATE_UUID()
where id>0
Clean duplicate entries
DELETE FROM beginner-290513.917834811114.messages
WHERE bq_uuid IN
(SELECT bq_uuid
FROM
(SELECT bq_uuid,
ROW_NUMBER() OVER( PARTITION BY updated_at
ORDER BY bq_uuid ) AS row_num
FROM beginner-290513.917834811114.messages ) t
WHERE t.row_num > 1 );

HIVE equivalent of FIRST and LAST

I have a table with 3 columns:
table1: ID, CODE, RESULT, RESULT2, RESULT3
I have this SAS code:
data table1
set table1;
BY ID, CODE;
IF FIRST.CODE and RESULT='A' THEN OUTPUT;
ELSE IF LAST.CODE and RESULT NE 'A' THEN OUTPUT;
RUN;
So we are grouping the data by ID and CODE, and then writing to the dataset if certain conditions are met. I want to write a hive query to replicate this. This is what I have:
proc sql;
create table temp as
select *, row_number() over (partition by ID, CODE) as rowNum
from table1;
create table temp2 as
select a.ID, a.CODE, a.RESULT, a.RESULT2, a.RESULT3
from temp a
inner join (select ID, CODE, max(rowNum) as maxRowNum
from temp
group by ID, CODE) b
on a.ID=b.ID and a.CODE=b.CODE
where (a.rowNum=1 and a.RESULT='A') or (a.rowNum=b.maxRowNum and a.RESULT NE 'A');
quit;
There are two issues I see with this.
1) The row that is first or last in each BY group is entirely dependant on the order of rows in table1 in SAS, we aren't ordering by anything. I don't think row order is preserved when translating to a hive query.
2) The SAS code is taking the first row in each BY GROUP or the last, not both. I think that my HIVE query is taking both, resulting in more rows than I want.
Any suggestions or insight on how to improve my query is appreciated. Is it even possible to replicate this SAS code in HIVE?
The SAS code has a by statement (BY ID CODE;), which tells SAS that the set dataset is sorted at those levels. So, not a random selection for first. and last..
That said, we can replicate this in HIVE by using the first_value and last_value window functions.
FIRST.CODE should replicate to
first_value(code) over (partition by Id order by code)fcode
Similarly, LAST.CODE would be
last_value(code) over (partition by Id order by code)lcode
Once you have the fcode and lcode columns, use case when statements for the result column criteria. Like,
case when (code=fcode and result='A') or (code=lcode and result<>'A')
then 1 else 0 end as op_flag
Then the fetch the table with where op_flag = 1
SAMPLE
select id, code, result from (
select *,
first_value(code) over (partition by id order by code)fcode,
last_value(code) over (partition by id order by code)lcode
from footab) f
where (code=fcode and result='A') or (code=lcode and result<>'A')
Regarding point 1) the BY group processing requires the input data to be sorted or indexed on BY variables, so though the code contains no ordering, the source data is processed in order. If the input data was not indexed/sorted, SAS will throw error.
Regarding this, possible differences are on rows with same values of BY variables, especially if the RESULT is different.
In SAS, I would pre-sort data by ID, CODE, RESULT, then use BY ID CODE in order to not be influenced by order of rows.
Regarding 2) FIRST and LAST can be both true in SAS. Since your condition for first and last on RESULT is different, I guess this is not a source of differences.
I guess you could add another field as
row_number() over (partition by ID, CODE desc) as rowNumDesc
to detect last row with rowNumDesc = 1 (so that you skip the join).
EDIT:
I think the two programs above both include random selection of rows for groups with same values of ID and CODE variables, especially with same values of RESULT. But you should get same number of rows from both. If not, just debug it.
However the random aspect in SAS code/storage is based on physical order of rows, while the ROW_NUMBERs randomness within a group will be influenced by the implementation of the function in the engine.

Retrieving a large dataset from a very large table

I have a very large table in oracle that contains 140+ million rows. Currently we are doing three full table scans on this table nightly, and using some of the results to populate a tmp table. That tmp table is then turned into a very large report (usually 140K + lines).
The big table is called tasklog and has the following structure has:
tasklog_id (number) - PK
document_id (number)
date_time_in (date)
+ a few more rows that aren't relevant
There are millions of different document ids each repeated between 1 and several hundred times, date_time_in is the time this entry was put into the database.
All of the full table scans looks like this
DECLARE
n_prevdocid number;
cursor tasks is
select *
from tasklog
order by document_id, date_time_in DESC;
BEGIN
for tk in tasks
loop
if n_prevdocid <> tk.document_id then
-- *code snipped*
end if;
n_prevdocid = tk.document_id;
end loop;
END;
/
So my question: is there a quick (ish) way to get a distinct list of document_ids with the row having the most recent date_time_in. This could dramatically speed up the whole thing. Or can anyone think of a better way of retrieving this data daily?
Things that may be relevant, this table only ever has rows inserted with current date time. It is not range paritioned but I can't see how that might help me. No rows are ever updated or deleted. There are about 70k - 80k rows inserted daily.
I don't think that you're going to get away from doing at least one full table scan, as the only way that it would be efficient would be is if the ratio of distinct document_id's to total records was pretty small. The clustering on the document_id is going to be very poor due to the way that the data is generated and inserted.
How about:
create table tmp nologging compress -- or pctfree 0
as
select ...
from (
select t.*,
max(date_time_in) over (partition by document_id) max_date_time_in
from tasklog t)
where date_time_in = max_date_time_in
Possibly, having created this once, you could then optimise further refreshes by merging into this set only the newer records. Something like ...
merge into tmp
using (
select ...
from (
select t.*,
max(date_time_in) over (partition by document_id) max_date_time_in
from tasklog t
where date_time_in > (select max(date_time_in) from tmp))
where date_time_in = max_date_time_in)
on ... blah blah
Have you tried:
select document_id
from tasklog t1
where date_time_in = (select max(date_time_in)
from tasklog t2
where t1.document_id=t2.document_id)
You can do something like this:
select document_id , date_time from tasklog group by date_time,document_id order by date_time desc;
By this you can retrieve distinct document_id with latest date_time colums.

how to merge data while loading them into hive?

I'm tring to use hive to analysis our log, and I have a question.
Assume we have some data like this:
A 1
A 1
A 1
B 1
C 1
B 1
How can I make it like this in hive table(order is not important, I just want to merge them) ?
A 1
B 1
C 1
without pre-process it with awk/sed or something like that?
Thanks!
Step 1: Create a Hive table for input data set .
create table if not exists table1 (fld1 string, fld2 string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
(i assumed field seprator is \t, you can replace it with actual separator)
Step 2 : Run below to get the merge data you are looking for
create table table2 as select fld1,fld2 from table1 group by fld1,fld2 ;
I tried this for below input set
hive (default)> select * from table1;
OK
A 1
A 1
A 1
B 1
C 1
B 1
create table table4 as select fld1,fld2 from table1 group by fld1,fld2 ;
hive (default)> select * from table4;
OK
A 1
B 1
C 1
You can use external table as well , but for simplicity I have used managed table here.
One idea.. you could create a table around the first file (called 'oldtable').
Then run something like this....
create table newtable select field1, max(field) from oldtable group by field1;
Not sure I have the syntax right, but the idea is to get unique values of the first field, and only one of the second. Make sense?
For merging the data, we can also use "UNION ALL" , it can also merge two different types of datatypes.
insert overwrite into table test1
(select x.* from t1 x )
UNION ALL
(select y.* from t2 y);
here we are merging two tables data (t1 and t2) into one single table test1.
There's no way to pre-process the data while it's being loaded without using an external program. You could use a view if you'd like to keep the original data intact.
hive> SELECT * FROM table1;
OK
A 1
A 1
A 1
B 1
C 1
B 1
B 2 # Added to show it will group correctly with different values
hive> CREATE VIEW table2 (fld1, fld2) AS SELECT fld1, fld2 FROM table1 GROUP BY fld1, fld2;
hive> SELECT * FROM table2;
OK
A 1
B 1
B 2
C 1

How to create select SQL statement that would produce "merged" dataset from two tables(Oracle DBMS)?

Here's my original question:
merging two data sets
Unfortunately I omitted some intircacies, that I'd like to elaborate here.
So I have two tables events_source_1 and events_source_2 tables. I have to produce the data set from those tables into resultant dataset (that I'd be able to insert into third table, but that's irrelevant).
events_source_1 contain historic event data and I have to do get the most recent event (for such I'm doing the following:
select event_type,b,c,max(event_date),null next_event_date
from events_source_1
group by event_type,b,c,event_date,null
events_source_2 contain the future event data and I have to do the following:
select event_type,b,c,null event_date, next_event_date
from events_source_2
where b>sysdate;
How to put outer join statement to fill the void (i.e. when same event_type,b,c found from event_source_2 then next_event_date will be filled with the first date found
GREATLY APPRECIATE FOR YOUR HELP IN ADVANCE.
Hope I got your question right. This should return the latest event_date of events_source_1 per event_type, b, c and add the lowest event_date of event_source_2.
Select es1.event_type, es1.b, es1.c,
Max(es1.event_date),
Min(es2.event_date) As next_event_date
From events_source_1 es1
Left Join events_source_2 es2 On ( es2.event_type = es1.event_type
And es2.b = es1.b
And es2.c = es1.c
)
Group By c1.event_type, c1.b, c1.c
You could just make the table where you need to select a max using a group by into a virtual table, and then do the full outer join as I provided in the answer to the prior question.
Add something like this to the top of the query:
with past_source as (
select event_type, b, c, max(event_date)
from event_source_1
group by event_type, b, c, event_date
)
Then you can use past_source as if it were an actual table, and continue your select right after the closing parens on the with clause shown.
I end up doing two step process: 1st step populates the data from event table 1, 2nd step MERGES the data between target (the dataset from 1st step) and another source. Please forgive me, but I had to obfuscate table name and omit some columns in the code below for legal reasons. Here's the SQL:
INSERT INTO EVENTS_TARGET (VEHICLE_ID,EVENT_TYPE_ID,CLIENT_ID,EVENT_DATE,CREATED_DATE)
select VEHICLE_ID, EVENT_TYPE_ID, DEALER_ID,
max(EVENT_INITIATED_DATE) EVENT_DATE, sysdate CREATED_DATE
FROM events_source_1
GROUP BY VEHICLE_ID, EVENT_TYPE_ID, DEALER_ID, sysdate;
Here's the second step:
MERGE INTO EVENTS_TARGET tgt
USING (
SELECT ee.VEHICLE_ID VEHICLE_ID, ee.POTENTIAL_EVENT_TYPE_ID POTENTIAL_EVENT_TYPE_ID, ee.CLIENT_ID CLIENT_ID,ee.POTENTIAL_EVENT_DATE POTENTIAL_EVENT_DATE FROM EVENTS_SOURCE_2 ee WHERE ee.POTENTIAL_EVENT_DATE>SYSDATE) src
ON (tgt.vehicle_id = src.VEHICLE_ID AND tgt.client_id=src.client_id AND tgt.EVENT_TYPE_ID=src.POTENTIAL_EVENT_TYPE_ID)
WHEN MATCHED THEN
UPDATE SET tgt.NEXT_EVENT_DATE=src.POTENTIAL_EVENT_DATE
WHEN NOT MATCHED THEN
insert (tgt.VEHICLE_ID,tgt.EVENT_TYPE_ID,tgt.CLIENT_ID,tgt.NEXT_EVENT_DATE,tgt.CREATED_DATE) VALUES (src.VEHICLE_ID, src.POTENTIAL_EVENT_TYPE_ID, src.CLIENT_ID, src.POTENTIAL_EVENT_DATE, SYSDATE)
;

Resources