Hive insert overwrite truncates the table in some cases - hadoop

I was working on a solution and found that in some particular cases, hive insert overwrite truncates the table, while in other cases it doesn't. Could someone please explain why it behaves like that?
To explain this, I am taking two tables, source and target, and trying to insert data into target from the source table using insert overwrite.
When the Source Table Has a Partition
If the source table has a partition and you write a condition such that the partition does not exist, then it won't truncate the target table.
create table source (name String) partitioned by (age int);
insert into source partition (age) values("gaurang", 11);
create table target (name String, age int);
insert into target values("xxx", 99);
The following query won't truncate the table even if the select doesn't return anything.
insert overwrite table temp.target select * from temp.source where name="Ddddd" and age=99;
However, the following query will truncate the table.
insert overwrite table temp.target select * from temp.source where name="Ddddd" and age=11;
It makes sense in the first case: since the partition (age=99) does not exist, it should stop further execution of the query. However, this is my assumption; I am not sure what exactly happens.
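One way to check this assumption (a sketch; explain also works on insert queries) is to compare the plans Hive produces for the two filters, which should reveal whether the age=99 query reads any input at all:
explain insert overwrite table temp.target select * from temp.source where name="Ddddd" and age=99;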
When the Source Table Doesn't Have a Partition, but the Target Does
In this case the target table won't be truncated even if the select statement from the source table returns 0 rows.
use temp;
drop table if exists source1;
drop table if exists target1;
create table source1 (name String, age int);
create table target1 (name String) partitioned by (age int);
insert into source1 values ("gaurang", 11);
insert into target1 partition(age) values("xxx", 99);
select * from source1;
select * from target1;
The following query won't truncate the table even if no data is found by the select statement.
insert overwrite table temp.target1 partition(age) select * from temp.source1 where age=90;
When Neither the Source nor the Target Has a Partition
In this case, if I try to insert overwrite the target and the select statement doesn't return any rows, then the target table will be truncated.
Check the example below.
use temp;
drop table if exists source1;
drop table if exists target1;
create table source1 (name String, age int);
create table target1 (name String, age int);
insert into source1 values ("gaurang", 11);
insert into target1 values("xxx", 99);
select * from source1;
select * from target1;
The following query will truncate the target table.
insert overwrite table temp.target1 select * from temp.source1 where age=90;

Better to use the term 'overwrite' instead of 'truncate', because that is exactly what happens during insert overwrite.
When you write overwrite table temp.target1 partition(age), you instruct Hive to overwrite partitions, not the whole target1 table: only those partitions that are returned by the select.
An empty dataset will not overwrite partitions in dynamic partition mode, because the partition to overwrite is unknown; the partition value has to be taken from the dataset, and since the dataset is empty, there is nothing to overwrite.
And in the case of a non-partitioned table, it is already known that the whole table should be overwritten, no matter whether the dataset is empty or not.
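As a contrast, a hedged sketch: with a static partition spec the partition to overwrite is known up front, so, to my understanding, it behaves like the non-partitioned case and is replaced even by an empty dataset (temp.target1 and temp.source1 are the tables from the second example above):
insert overwrite table temp.target1 partition(age=99) select name from temp.source1 where age=90;
-- partition age=99 is named explicitly, so it is replaced even though the select returns no rows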
The partition column in an insert overwrite statement should be the last one. The list of partitions to be overwritten in the target equals the list of values in the partition column returned by the dataset. It does not matter how the source table is partitioned (you can select the target partition column from any source table column, calculate it, or use a constant); only what is returned matters.
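For example, a minimal sketch against the same tables: the constant below is just an illustration, and it alone determines which single partition gets overwritten, no matter how the source is laid out:
insert overwrite table temp.target1 partition(age)
select name, 25 as age from temp.source1;
-- the partition column comes last in the select; only partition age=25 of target1 is overwritten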

Related

Loading Data into an empty Impala Table with account data partitioned by area code

I'm trying to copy data from a table called accounts into an empty table called accounts_by_area_code. I have the following fields in accounts_by_area_code: acct_num INT, first_name STRING, last_name STRING, phone_number STRING. The table is partitioned by areacode (the first 3 digits of phone_number).
I need to use a SELECT statement to extract the area code into an INSERT INTO TABLE command to copy the specified columns to the new table, dynamically partitioning by area code.
This is my last attempt:
impala-shell -q "INSERT INTO TABLE accounts_by_areacode (acct_num, first_name, last_name, phone_number, areacode) PARTITION (areacode) SELECT STRLEFT (phone_number,3) AS areacode FROM accounts;"
This generates ERROR: AnalysisException: Column permutation and PARTITION clause mention more columns (5) than the SELECT / VALUES clause and PARTITION clause return (1). I'm not convinced I have even the basic syntax correct so any help would be great as I'm new to Impala.
Impala creates partitions dynamically based on data, so I am not sure why you want to create an empty table with partitions: they will be auto-created while inserting new data.
Still, I think you can create the partitions up front like this:
impala-shell -q "INSERT INTO TABLE accounts_by_areacode (acct_num) PARTITION (areacode)
SELECT CAST(NULL as INT), STRLEFT (phone_number,3) AS areacode FROM accounts;"
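For the original goal of copying every column, a sketch (assuming accounts holds the same four data columns) is to return all five mentioned columns, with the partition column last:
impala-shell -q "INSERT INTO TABLE accounts_by_areacode (acct_num, first_name, last_name, phone_number) PARTITION (areacode) SELECT acct_num, first_name, last_name, phone_number, STRLEFT (phone_number,3) AS areacode FROM accounts;"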

Upserting in GreenPlum

How can I upsert a record in GreenPlum while copying the data from a CSV file. The CSV file has multiple records for a given value of the primary key. If a row with some value already exists in the database I want to update that record. Otherwise, it should append a new row.
One way to do this is to copy the data to a staging table, then insert/update from that table.
Here is an example of that:
-- Duplicate the definition of your table.
CREATE TEMP TABLE my_table_stage (LIKE my_table INCLUDING DEFAULTS);
-- Your COPY statement, loading into the staging table
COPY my_table_stage FROM 'my_file.csv' ...
-- Insert any "new" records
INSERT INTO my_table (key_field, data_field1, data_field2)
SELECT
stg.key_field,
stg.data_field1,
stg.data_field2
FROM
my_table_stage stg
WHERE
NOT EXISTS (SELECT 1 FROM my_table WHERE key_field = stg.key_field);
-- Update any existing records
UPDATE my_table orig
SET
data_field1 = stg.data_field1,
data_field2 = stg.data_field2
FROM
my_table_stage stg
WHERE
orig.key_field = stg.key_field;
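Since the CSV can hold several rows for the same key, a hedged sketch of deduplicating the stage first (row_number() without an ORDER BY keeps an arbitrary row per key; add one if a particular duplicate should win), then use my_table_stage_dedup in place of my_table_stage in the steps above:
-- Keep one row per key before the insert/update steps
CREATE TEMP TABLE my_table_stage_dedup AS
SELECT key_field, data_field1, data_field2
FROM (
    SELECT *, row_number() OVER (PARTITION BY key_field) AS rn
    FROM my_table_stage
) t
WHERE rn = 1;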

How to use insert statement for a Hive partitioned table?

I have a hive table dynpart.
id int
name char(30)
city char(30)
thisday string
# Partition Information
# col_name data_type comment
thisday string
It is partitioned by 'thisday' whose datatype is STRING.
How can I insert a single record into the table in a particular partition? I know there is a load command to load an entire file's data into a hive table. I just want to know how an insert statement can be written for a partitioned table. I tried to write a command like the one below, but it takes data from another table.
insert into droplater partition(thisday='30/03/2017') select * from dynpart;
The table Droplater has the same structure as dynpart, but the above command inserts the data from another table. What I'd like to learn is how to write a simple insert command into a partition, like: insert into tabname values(1,"abcd","efgh");
This will work for primitive types only (no arrays, structs etc.)
insert into tabname partition (thisday='30/03/2017') values (1,"abcd","efgh");
This will work for all types
insert into tabname partition (thisday='30/03/2017') select 1,"abcd","efgh";
P.S.
By all means, partition your table by a date (thisday date)
insert into tabname partition (thisday=date '2017-03-30') ...
or at least use the ISO date format
insert into tabname partition (thisday='2017-03-30') ...

How to insert init-data into a table in hive?

I wanted to insert some initial data into the table in hive, so I created below HQL,
INSERT OVERWRITE TABLE table PARTITION(dt='2014-06-26') SELECT 'key_sum' as key, '0' as value;
but it does not work.
There is another query like the above,
INSERT OVERWRITE TABLE table PARTITION(dt='2014-06-26') SELECT 'key_sum' as key, '0' as value FROM table limit 1;
But it also didn't work, as I see that the tables are empty.
How can I set the initial data into the table?
(There is a reason why I have to do a self-join.)
The first HQL should have a FROM clause; it is missing, so the HQL fails:
INSERT OVERWRITE TABLE table PARTITION(dt='2014-06-26') SELECT 'key_sum' as key, '0' as value;
Regarding the second HQL, the FROM table should have at least one row, so that it can set the constant init values into your newly created table.
INSERT OVERWRITE TABLE table PARTITION(dt='2014-06-26') SELECT 'key_sum', '0' FROM table limit 1;
You can use any old hive table that has data in it, and give it a try.
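A sketch of the whole sequence with a hypothetical one-row helper table named init_helper ('table' stays as the placeholder name from the question):
create table init_helper (x int);
insert into init_helper values (1); -- INSERT ... VALUES needs Hive 0.14+; otherwise load a one-line file
INSERT OVERWRITE TABLE table PARTITION(dt='2014-06-26') SELECT 'key_sum', '0' FROM init_helper LIMIT 1;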
The following query works fine if we already have the test table created in hive.
INSERT OVERWRITE TABLE test PARTITION(dt='2014-06-26') SELECT 'key_sum' as key, '0' as value FROM test;
I think the table on which we perform the insert should be created first.

How to set a default value for a column of a newly created table from a select statement in 11g

I created a table in Oracle 11g with a default value for one of the columns. The syntax is:
create table xyz(emp number,ename varchar2(100),salary number default 0);
This was created successfully. For some reason I need to create another table with the same structure and data as the old table. So I created a new table named abc as
create table abc as select * from xyz;
Here "abc" was created successfully with the same structure and data as the old table xyz. But for the column "salary" in the old table "xyz" the default value was set to "0", while in the newly created table "abc" the default value is not set.
This is all in Oracle 11g. Please tell me the reason why the default value was not set and how we can set it using the select statement.
You can specify the constraints and defaults in a CREATE TABLE AS SELECT, but the syntax is as follows
create table t1 (id number default 1 not null);
insert into t1 (id) values (2);
create table t2 (id default 1 not null)
as select * from t1;
That is, it won't inherit the constraints from the source table/select. Only the data type (length/precision/scale) is determined by the select.
The reason is that CTAS (Create table as select) does not copy any metadata from the source to the target table, namely
no primary key
no foreign keys
no grants
no indexes
...
To achieve what you want, I'd either
use dbms_metadata.get_ddl to get the complete table structure, replace the table name with the new name, execute this statement, and do an INSERT afterwards to copy the data (a sketch of this follows below),
or keep using CTAS, extract the not null constraints for the source table from user_constraints, and add them to the target table afterwards.
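A hedged sketch of the first option, using the table names from the question (get_ddl returns a CLOB whose text you edit by hand before running it):
-- dump the complete DDL for xyz, then change the name to ABC in the output and run it
SELECT dbms_metadata.get_ddl('TABLE', 'XYZ') FROM dual;
-- after creating abc from the edited DDL, copy the data
INSERT INTO abc SELECT * FROM xyz;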
You will need to alter table abc modify (salary default 0);
The new table inherits only the "not null" constraint and no other constraints.
Thus you can alter the table after creating it with the "create table as" command,
or you can define all the constraints that you need, as follows:
create table t1 (id number default 1 not null);
insert into t1 (id) values (2);
create table t2 as select * from t1;
This will create table t2 with not null constraint.
But for constraints other than "not null" you should use the following syntax
create table t1 (id number default 1 unique);
insert into t1 (id) values (2);
create table t2 (id default 1 unique)
as select * from t1;
