Insert data listing columns with partitioning field in Hive

First of all let's setup a test environment:
CREATE TABLE IF NOT EXISTS source_table (
`col1` TIMESTAMP,
`col2` STRING
);
CREATE TABLE IF NOT EXISTS dest_table (
`col1` TIMESTAMP,
`col2` STRING,
`col3` STRING
)
PARTITIONED BY (day STRING)
STORED AS AVRO;
INSERT INTO TABLE source_table VALUES ('2018-03-21 17:08:04.401', 'test1'), ('2018-03-22 12:02:04.222', 'test2'), ('2018-03-22 07:21:04.111', 'test3');
How could I list the column names during insertion and put the partition value dynamically? The following command doesn't work:
INSERT INTO TABLE dest_table(col1, col2) PARTITION(day) SELECT col1, col2, date_format(col1, 'yyyy-MM-dd') FROM source_table;
By the way, without listing the columns of dest_table in the INSERT INTO command, everything works fine when the two tables have the same number of columns. But what if my dest_table has more fields than source_table?
Thank you for helping me.
P.S.
OK, if I hardcode NULL this works. I leave the question open because there might be better ways to achieve this.
INSERT INTO TABLE dest_table PARTITION(day) SELECT col1, col2, NULL, date_format(col1, 'yyyy-MM-dd') FROM source_table;
Anyway, isn't this method strictly bound to column order? In a real-life scenario, how could I handle lots of columns by specifying an explicit mapping, to avoid mistakes?

The syntax for inserting into a partitioned table with an explicit column list is shown below. You don't need to insert NULL into col3 yourself: Hive fills in NULL for any column that is missing from the insert's column list.
INSERT INTO TABLE dest_table PARTITION (day)(col1, col2, day)
SELECT col1, col2, date_format(col1, 'yyyy-MM-dd') FROM source_table;
Result:
col1                     col2   col3   day
2018-03-22 12:02:04.222  test2  NULL   2018-03-22
2018-03-22 07:21:04.111  test3  NULL   2018-03-22
2018-03-21 17:08:04.401  test1  NULL   2018-03-21
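Note that a fully dynamic insert like this (no static value in the PARTITION clause) needs dynamic partitioning enabled, and strict mode rejects it; depending on your configuration you may first need:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;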

Related

How to insert into a Hive table, partitioned by date, reading from a temp table?

I have a Hive temp table, without any partitions, which holds the required data. I want to select this data and insert it into another table partitioned by date. I tried the following techniques with no luck.
Source table schema:
CREATE TABLE myDB.mytbl
( col1 string,
col2 string,
col3 string);
Destination table:
CREATE TABLE cls_staging.cls_billing_address_em_tmp
( col1 string,
col2 string,
col3 string)
PARTITIONED BY (
record_date string)
STORED AS ORC;
Query for inserting into the destination table (1st attempt):
insert overwrite table cls_staging.cls_billing_address_em_tmp partition (record_date) select col1, col2, col3, FROM_UNIXTIME(UNIX_TIMESTAMP()) from myDB.mytbl;
ERROR
Dynamic partition strict mode requires at least one static partition column
2nd attempt:
insert overwrite table cls_staging.cls_billing_address_em_tmp partition (record_date = FROM_UNIXTIME(UNIX_TIMESTAMP())) select col1, col2, col3 from myDB.mytbl;
ERROR:
cannot recognize input near 'FROM_UNIXTIME' '(' 'UNIX_TIMESTAMP'
1st: switch on dynamic partitioning and non-strict mode:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table cls_staging.cls_billing_address_em_tmp partition (record_date)
select col1, col2, col3, current_timestamp from myDB.mytbl;
2nd: do not use unix_timestamp() for this purpose, because it is non-deterministic and will generate many different timestamps; use the current_timestamp constant instead. See https://stackoverflow.com/a/58081191/2700344
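Alternatively, a static partition specification also satisfies strict mode, but the partition value must be a literal constant, not an expression (which is why the 2nd attempt above failed to parse). A sketch with a hypothetical hardcoded date:
insert overwrite table cls_staging.cls_billing_address_em_tmp partition (record_date='2018-10-01')
select col1, col2, col3 from myDB.mytbl;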

How to update some rows in a partitioned table in hive?

I need to update some rows in a table partitioned by date, over a range of dates, and I don't know how to do it.
Using dynamic partitioning, you can overwrite the partitions that need updating. Use a CASE expression to identify the rows to be modified and to set the new values, as in this template:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table table_name partition (partition_column)
select col1,
col2,
case when col1='record to be updated' then 'new value' else col3 end as col3,
...
colN,
partition_column --partition_column should be the last
from table_name
where ... --partition predicate here
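For example, with a hypothetical table events(id, name, status) partitioned by load_date, correcting the status of record 42 within a single day's partition might look like:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table events partition (load_date)
select id,
       name,
       case when id = 42 then 'inactive' else status end as status,
       load_date --partition column last
from events
where load_date = '2018-03-22';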

Create Unique Constraint over function based index with existing duplicate values

I have a table with wrong data, and I'd like to prevent new wrong data from being inserted while I fix the data and find out what process, or which statement, is making this happen.
I first made a UQ constraint over the columns that shouldn't be duplicated, but this gets me into another problem: I need to apply uniqueness only when all the columns have values; if there are NULLs, I need to allow duplicate records over these columns. Something like this:
CREATE TABLE MYTAB (COL1 NUMBER, COL2 NUMBER, COL3 NUMBER, COL4 NUMBER); --EXAMPLE TABLE. I NEED NO DUPS OVER (COL1, COL3, COL4)
INSERT INTO MYTAB VALUES (1, 1, 1, 1); --OK
INSERT INTO MYTAB VALUES (1, 2, 1, 1); --NOT OK, duplicate over (COL1, COL3, COL4)
INSERT INTO MYTAB VALUES (1, 3, NULL, NULL); --OK
INSERT INTO MYTAB VALUES (1, 4, NULL, NULL); --OK
If I create a constraint like this:
ALTER TABLE MYTAB
ADD CONSTRAINT U_CONSTRAINT UNIQUE (COL1, COL3, COL4) NOVALIDATE;
Last insert will crash.
I've tried with
CREATE UNIQUE INDEX FN_UIX_MYTAB
ON MYTAB (CASE WHEN COL1 IS NOT NULL AND COL3 IS NOT NULL AND COL4 IS NOT NULL THEN COL1 ELSE NULL END,
CASE WHEN COL1 IS NOT NULL AND COL3 IS NOT NULL AND COL4 IS NOT NULL THEN COL3 ELSE NULL END,
CASE WHEN COL1 IS NOT NULL AND COL3 IS NOT NULL AND COL4 IS NOT NULL THEN COL4 ELSE NULL END);
But the CREATE crashes because the table already has duplicate data. I need to create this index without validating the existing data, which means the index would apply only to newly inserted records.
I've tried also with:
CREATE INDEX FN_IX_MYTAB
ON MYTAB (CASE WHEN COL1 IS NOT NULL AND COL3 IS NOT NULL AND COL4 IS NOT NULL THEN COL1 ELSE NULL END,
CASE WHEN COL1 IS NOT NULL AND COL3 IS NOT NULL AND COL4 IS NOT NULL THEN COL3 ELSE NULL END,
CASE WHEN COL1 IS NOT NULL AND COL3 IS NOT NULL AND COL4 IS NOT NULL THEN COL4 ELSE NULL END);
ALTER TABLE MYTAB
ADD CONSTRAINT FN_UIX_MYTAB UNIQUE (COL1, COL3, COL4) USING INDEX FN_IX_MYTAB NOVALIDATE;
But this gives me error:
ORA-14196: Specified index cannot be used to enforce the constraint.
Is there a way to do what I've explained, or should I prevent wrong inserts in another way while I look for the origin of the problem? Any advice would also be appreciated.
Here's one possible approach. Create a materialized view, with refresh on commit (preferably fast refresh, if the circumstances permit; in this case, they should). The MV would be something like
create materialized view mymv
refresh fast on commit
as
select col1, col3, col4
from mytab
where col1 is not null and col3 is not null and col4 is not null
;
And then put a unique constraint on (col1, col3, col4) on the MV.
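A minimal sketch of the remaining pieces, with hypothetical constraint and log names: fast refresh on commit needs a materialized view log on the base table (created before the MV), and the unique constraint goes on the MV's container table:
CREATE MATERIALIZED VIEW LOG ON mytab
WITH ROWID (col1, col3, col4)
INCLUDING NEW VALUES;
ALTER TABLE mymv
ADD CONSTRAINT mymv_uq UNIQUE (col1, col3, col4);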

Hive 'create table like' without including partition columns

Say I create tbl1 like so:
create table tbl1 (
col_a STRING,
col_b STRING,
col_c STRING )
partitioned by (col_d STRING);
Is there a shorthand way to create tbl2, a table with the same columns as tbl1, but without partitioning by anything (and without including the partition column)? The manual DDL for tbl2 would be:
create table tbl2 (
col_a STRING,
col_b STRING,
col_c STRING );
Thanks for any help!
You can use CTAS (CREATE TABLE AS SELECT) in Hive.
create table tbl2 as select * from tbl1;
This will not create any partitions in tbl2, even though tbl1 is partitioned. The only limitation is that you cannot create the bare structure: CTAS always needs a SELECT.
Note that the partition column is just another column of the table, so select * above includes col_d as a regular column in tbl2. If you want only the 3 columns (col_a, col_b, col_c), you need to mention them explicitly in the CTAS query:
create table tbl2 as select col_a, col_b, col_c from tbl1;
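And if you want the structure without copying any rows, one common CTAS idiom (assuming you don't mind running an empty query) is a predicate that matches nothing:
create table tbl2 as select col_a, col_b, col_c from tbl1 where 1=0;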

Oracle Equivalent to MySQL INSERT IGNORE?

I need to update a query so that it checks that a duplicate entry does not exist before insertion. In MySQL I can just use INSERT IGNORE so that if a duplicate record is found it just skips the insert, but I can't seem to find an equivalent option for Oracle. Any suggestions?
If you're on 11g you can use the hint IGNORE_ROW_ON_DUPKEY_INDEX:
SQL> create table my_table(a number, constraint my_table_pk primary key (a));
Table created.
SQL> insert /*+ ignore_row_on_dupkey_index(my_table, my_table_pk) */
2 into my_table
3 select 1 from dual
4 union all
5 select 1 from dual;
1 row created.
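The hint also accepts a column-list form that names the key columns instead of the index; a small variant of the example above:
insert /*+ ignore_row_on_dupkey_index(my_table(a)) */
into my_table
select 1 from dual;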
Check out the MERGE statement. This should do what you want - it's the WHEN NOT MATCHED clause that will do this.
Due to Oracle's lack of support for a true VALUES() clause, the syntax for a single record with fixed values is pretty clumsy though:
MERGE INTO your_table yt
USING (
SELECT 42 as the_pk_value,
'some_value' as some_column
FROM dual
) t on (yt.pk = t.the_pk_value)
WHEN NOT MATCHED THEN
INSERT (pk, some_column)
VALUES (t.the_pk_value, t.some_column);
A different approach (if you are e.g. doing bulk loading from a different table) is to use the "Error logging" facility of Oracle. The statement would look like this:
INSERT INTO your_table (col1, col2, col3)
SELECT c1, c2, c3
FROM staging_table
LOG ERRORS INTO errlog ('some comment') REJECT LIMIT UNLIMITED;
Afterwards, all rows that would have thrown an error are available in the errlog table. You need to create that table (or whatever name you choose) manually with DBMS_ERRLOG.CREATE_ERROR_LOG before running the insert.
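For example, assuming the table names above:
BEGIN
  DBMS_ERRLOG.CREATE_ERROR_LOG(
    dml_table_name     => 'YOUR_TABLE',
    err_log_table_name => 'ERRLOG');
END;
/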
See the manual for details.
I don't think there is, but to save time you can attempt the insert and ignore the inevitable error:
begin
insert into table_a( col1, col2, col3 )
values ( 1, 2, 3 );
exception when dup_val_on_index then
null;
end;
/
This will only ignore exceptions raised specifically by duplicate primary key or unique key constraints; everything else will be raised as normal.
If you don't want to do this then you have to select from the table first, which isn't really that efficient.
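If you are inserting many rows from PL/SQL, the same idea scales with FORALL ... SAVE EXCEPTIONS, which skips the failing rows and keeps going. A minimal sketch, assuming a unique key on col1 and that table_a's other columns are nullable:
DECLARE
  TYPE t_vals IS TABLE OF table_a.col1%TYPE;
  l_vals t_vals := t_vals(1, 2, 2); -- hypothetical batch containing a duplicate
  dml_errors EXCEPTION;
  PRAGMA EXCEPTION_INIT(dml_errors, -24381);
BEGIN
  FORALL i IN 1 .. l_vals.COUNT SAVE EXCEPTIONS
    INSERT INTO table_a (col1) VALUES (l_vals(i));
EXCEPTION
  WHEN dml_errors THEN
    NULL; -- rows that failed are listed in SQL%BULK_EXCEPTIONS
END;
/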
Another variant
Insert into my_table (student_id, group_id)
select distinct p.studentid, g.groupid
from person p, groups g --GROUP is a reserved word, so the table is called groups here
where NOT EXISTS (select 1
from my_table a
where a.student_id = p.studentid
and a.group_id = g.groupid)
or you could do
Insert into my_table (student_id, group_id)
select distinct p.studentid, g.groupid
from person p, groups g
MINUS
select student_id, group_id
from my_table
A simple solution
insert into t1
select t2.* from t2
where not exists
(select 1 from t1 where t1.id = t2.id)
This one isn't mine, but it came in really handy when using SQL*Loader:
create a view that points to your table:
CREATE OR REPLACE VIEW test_view
AS SELECT * FROM test_tab;
create the trigger:
CREATE OR REPLACE TRIGGER test_trig
INSTEAD OF INSERT ON test_view
FOR EACH ROW
BEGIN
INSERT INTO test_tab VALUES
(:NEW.id, :NEW.name);
EXCEPTION
WHEN DUP_VAL_ON_INDEX THEN NULL;
END test_trig;
and in the ctl file, insert into the view instead:
OPTIONS(ERRORS=0)
LOAD DATA
INFILE 'file_with_duplicates.csv'
INTO TABLE test_view
FIELDS TERMINATED BY ','
(id, name)
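Then point SQL*Loader at the view as usual; credentials and file names here are placeholders:
sqlldr userid=scott/tiger control=load_dups.ctl log=load_dups.log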
How about simply adding a unique index on whatever fields you need to check for dupes? It saves a read check.
Yet another "where not exists" variant, this one using dual:
insert into t1(id, unique_name)
select t1_seq.nextval, 'Franz-Xaver' from dual
where not exists (select 1 from t1 where unique_name = 'Franz-Xaver');
