How to use insert statement for a Hive partitioned table? - hadoop

I have a hive table dynpart.
id int
name char(30)
city char(30)
thisday string
# Partition Information
# col_name data_type comment
thisday string
It is partitioned by 'thisday' whose datatype is STRING.
How can I insert a single record into a particular partition of the table? I know there is a LOAD command that loads an entire file into a Hive table, but I want to know how an INSERT statement can be written for a partitioned table. I tried the command below, but it takes its data from another table.
insert into droplater partition(thisday='30/03/2017') select * from dynpart;
The table droplater has the same structure as dynpart, but the command above inserts data from another table. What I'd like to learn is how to write a simple insert command for a partition, like: insert into tabname values(1,"abcd","efgh");

This will work for primitive types only (no arrays, structs etc.)
insert into tabname partition (thisday='30/03/2017') values (1,"abcd","efgh");
This will work for all types
insert into tabname partition (thisday='30/03/2017') select 1,"abcd","efgh";
P.S.
By all means, partition your table by date (thisday date):
insert into tabname partition (thisday=date '2017-03-30') ...
or at least use the ISO date format
insert into tabname partition (thisday='2017-03-30') ...
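Putting the pieces of this answer together, a minimal end-to-end sketch might look like the following. The table name dynpart_by_date is hypothetical, and the example assumes a Hive version that supports INSERT ... VALUES (0.14 or later) and DATE partition columns.

```sql
-- Hypothetical table with the same columns as dynpart,
-- but partitioned by a DATE column instead of a STRING.
CREATE TABLE IF NOT EXISTS dynpart_by_date
(
  id   INT,
  name CHAR(30),
  city CHAR(30)
)
PARTITIONED BY (thisday DATE);

-- Static partition insert: the partition value is given in the
-- PARTITION clause, so VALUES lists only the non-partition columns.
INSERT INTO TABLE dynpart_by_date PARTITION (thisday='2017-03-30')
VALUES (1, 'abcd', 'efgh');

-- Read the row back from that partition.
SELECT id, name, city, thisday
FROM dynpart_by_date
WHERE thisday = '2017-03-30';
```

Because the partition is named explicitly, no dynamic-partition settings are needed for this form.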

Related

HIVE - Cannot partition a table: semantic exception failure

I'm not able to import data on partitioned table in Hive.
Here is how I create the table
CREATE TABLE IF NOT EXISTS title_ratings
(
tconst STRING,
averageRating DOUBLE,
numVotes INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
TBLPROPERTIES("skip.header.line.count"="1");
And then I load the data into it : LOAD DATA INPATH '/title.ratings.tsv.gz' INTO TABLE eval_hive_db.title_ratings;
It works fine up to here. Now I want to create a dynamically partitioned table. First of all, I set these parameters:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
I now create my partitioned table:
CREATE TABLE IF NOT EXISTS title_ratings_part
(
tconst STRING,
numVotes INT
)
PARTITIONED BY (averageRating DOUBLE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\n'
STORED AS TEXTFILE;
insert into title_ratings_part partition(title_ratings) select tconst, averageRating, numVotes from title_ratings;
(I also tried with numVotes instead by the way)
And I receive this error: FAILED: ValidationFailureSemanticException eval_hive_db.title_ratings_part: Partition spec {title_ratings=null} contains non-partition columns
Can someone help me, please?
Ideally, I want to partition my table by averageRating (less than 2, between 2 and 4, and greater than 4)
You can run this command to check whether the column contains null values.
select count(*) from title_ratings where averageRating is null;
If the count is greater than zero, the column contains nulls; fill them in and then apply the partitioning again.
The partition column is stored as the last column of the table, so the select statement must return it last. The partition clause must also name the partition column (averageRating), not the source table.
Please change the order of the columns in the select:
insert into title_ratings_part partition(averageRating)
select
tconst,
numVotes,
averageRating -- the partition column must always come last
from title_ratings;
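For the banded partitioning the question asks for (less than 2, between 2 and 4, greater than 4), one option is to derive the partition value with a CASE expression. This is only a sketch: the table name title_ratings_banded, the column rating_band, and the band labels are all hypothetical, and the boundary handling (2 and 4 both fall into the middle band) is an assumption.

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Hypothetical target table: keeps averageRating as a regular column
-- and partitions on a derived band instead.
CREATE TABLE IF NOT EXISTS title_ratings_banded
(
  tconst        STRING,
  averageRating DOUBLE,
  numVotes      INT
)
PARTITIONED BY (rating_band STRING);

INSERT INTO TABLE title_ratings_banded PARTITION (rating_band)
SELECT
  tconst,
  averageRating,
  numVotes,
  CASE
    WHEN averageRating IS NULL THEN 'unknown'  -- keep nulls out of the real bands
    WHEN averageRating < 2     THEN 'lt_2'
    WHEN averageRating <= 4    THEN '2_to_4'
    ELSE 'gt_4'
  END AS rating_band                           -- partition column comes last
FROM title_ratings;
```

Partitioning on a few bands yields a handful of partitions, whereas partitioning directly on a DOUBLE creates one partition per distinct rating value.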

Loading Data into an empty Impala Table with account data partitioned by area code

I'm trying to copy data from a table called accounts into an empty table called accounts_by_area_code. I have the following fields in accounts_by_area_code: acct_num INT, first_name STRING, last_name STRING, phone_number STRING. The table is partitioned by areacode (the first 3 digits of phone_number).
I need to use a SELECT statement to extract the area code into an INSERT INTO TABLE command to copy the specified columns to the new table, dynamically partitioning by area code.
This is my last attempt:
impala-shell -q "INSERT INTO TABLE accounts_by_areacode (acct_num, first_name, last_name, phone_number, areacode) PARTITION (areacode) SELECT STRLEFT (phone_number,3) AS areacode FROM accounts;"
This generates ERROR: AnalysisException: Column permutation and PARTITION clause mention more columns (5) than the SELECT / VALUES clause and PARTITION clause return (1). I'm not convinced I have even the basic syntax correct so any help would be great as I'm new to Impala.
Impala creates partitions dynamically based on the data, so I'm not sure why you want to create an empty table with partitions: they will be created automatically when new data is inserted.
Still, I think you can create an empty table with partitions like this:
impala-shell -q "INSERT INTO TABLE accounts_by_areacode (acct_num) PARTITION (areacode)
SELECT CAST(NULL AS INT), STRLEFT(phone_number, 3) AS areacode FROM accounts;"
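The error in the question arises because the statement names five columns while the SELECT returns only one expression. A sketch of the full copy the question describes, assuming accounts has columns named acct_num, first_name, last_name and phone_number (the question does not show its schema), could be:

```sql
-- Assumes accounts exposes acct_num, first_name, last_name, phone_number.
-- The SELECT returns one expression per regular column, followed by the
-- dynamic partition value as the last expression.
INSERT INTO TABLE accounts_by_areacode PARTITION (areacode)
SELECT
  acct_num,
  first_name,
  last_name,
  phone_number,
  STRLEFT(phone_number, 3) AS areacode
FROM accounts;
```

The statement can be passed to impala-shell -q in the same way as the attempts above.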

Hive insert overwrites truncates the table in few cases

I was working on a solution and found that in some cases Hive's insert overwrite truncates the table, while in others it doesn't. Would someone please explain why it behaves like that?
To explain this, I am using two tables, source and target, and trying to insert data into target from source using insert overwrite.
When Source Table has partition
If the source table is partitioned and you write a condition on a partition that does not exist, then it won't truncate the target table.
create table source (name String) partitioned by (age int);
insert into source partition (age) values("gaurang", 11);
create table target (name String, age int);
insert into target values("xxx", 99);
The following query won't truncate the table, even though the select doesn't return anything.
insert overwrite table temp.target select * from temp.source where name="Ddddd" and age=99;
However, the following query will truncate the table.
insert overwrite table temp.target select * from temp.source where name="Ddddd" and age=11;
It makes sense in the first case: since the partition (age=99) does not exist, the query should stop executing there. However, this is only my assumption; I am not sure what exactly happens.
When Source Table Doesn't have partition, but Target has
In this case the target table won't be truncated even if the select statement on the source table returns 0 rows.
use temp;
drop table if exists source1;
drop table if exists target1;
create table source1 (name String, age int);
create table target1 (name String) partitioned by (age int);
insert into source1 values ("gaurang", 11);
insert into target1 partition(age) values("xxx", 99);
select * from source1;
select * from target1;
The following query won't truncate the table even if the select statement finds no data.
insert overwrite table temp.target1 partition(age) select * from temp.source1 where age=90;
When neither Source nor Target has a partition
In this case, if I insert overwrite into target and the select statement doesn't return any rows, the target table is truncated.
Check the example below.
use temp;
drop table if exists source1;
drop table if exists target1;
create table source1 (name String, age int);
create table target1 (name String, age int);
insert into source1 values ("gaurang", 11);
insert into target1 values("xxx", 99);
select * from source1;
select * from target1;
The following query will truncate the target table.
insert overwrite table temp.target1 select * from temp.source1 where age=90;
It is better to use the term 'overwrite' instead of 'truncate', because that is exactly what happens during insert overwrite.
When you write overwrite table temp.target1 partition(age), you instruct Hive to overwrite partitions, not the whole target1 table: only those partitions that the select returns.
An empty dataset will not overwrite any partitions in dynamic partition mode. The partitions to overwrite are not known in advance; they are taken from the dataset, and since the dataset is empty there is nothing to overwrite.
In the case of a non-partitioned table, it is already known that the whole table should be overwritten, whether the dataset is empty or not.
The partition column in an insert overwrite statement should be the last one selected. The list of partitions to be overwritten in the target equals the list of values the dataset returns in the partition column. It does not matter how the source table is partitioned (you can select the target partition column from any source column, calculate it, or use a constant); only what is returned matters.
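To illustrate the point about only the returned partitions being overwritten, here is a small sketch. It assumes the tables from the case where source1 is not partitioned and target1 is partitioned by age, with target1 already holding a row in partition age=99.

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The SELECT returns rows only for age=11, so only partition age=11
-- of target1 is written. Partition age=99 is not part of the result
-- set and is therefore left as it was.
INSERT OVERWRITE TABLE temp.target1 PARTITION (age)
SELECT name, age
FROM temp.source1
WHERE age = 11;

-- Lists the partitions that now exist in the target.
SHOW PARTITIONS temp.target1;
```

If the WHERE clause matched no rows at all, the result set would carry no partition values, which is why nothing is overwritten in that case.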

Hive update all values in a column

I have an external partitioned Hive table. One of its columns is a string named OLDDATE that holds the date in a different format (DD-MM-YY). I want to update the column and store the dates in YYYY-MM-DD format. All years are 20XX.
So I thought of this
select CONCAT('20',SPLIT(OLDDATE ,'-')[2],'-',SPLIT(OLDDATE ,'-')[1],'-',SPLIT(OLDDATE ,'-')[0]) from table
This gives me the dates in the format I want. Now how do I overwrite the old date with this new date?
You can effect an update by overwriting the table with its own contents, just with the date field changed according to your transformation, like this pseudo-code:
INSERT OVERWRITE table
SELECT
col1
, col2
...
, CONCAT('20',SPLIT(OLDDATE ,'-')[2],'-',SPLIT(OLDDATE ,'-')[1],'-',SPLIT(OLDDATE ,'-')[0]) AS olddate
...
, coln
FROM table;
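Before overwriting anything, the transformation can be sanity-checked against a literal. The sample value '30-03-17' is made up for illustration, and the query assumes a Hive version that allows SELECT without a FROM clause.

```sql
-- SPLIT returns an array: [0] = day, [1] = month, [2] = two-digit year.
SELECT CONCAT('20', SPLIT('30-03-17', '-')[2], '-',
              SPLIT('30-03-17', '-')[1], '-',
              SPLIT('30-03-17', '-')[0]) AS newdate;
-- returns 2017-03-30
```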
@user2441441 To overwrite a partitioned table:
INSERT OVERWRITE table PARTITION (p_col)
SELECT
col1
, col2
...
, CONCAT('20',SPLIT(OLDDATE ,'-')[2],'-',SPLIT(OLDDATE ,'-')[1],'-',SPLIT(OLDDATE ,'-')[0]) AS olddate
...
, coln
, p_col
FROM table;
Since it is a partitioned table, the partition folder names are created from the date values.
Hence you are not able to update the values in place.
One workaround is to create a new table, run your query above, and insert its output into the new table.
After that you can drop the existing table and use the new table in its place.

Hive and Sqoop partition

I have Sqooped data from a Netezza table and the output file is in HDFS, but one column is a timestamp and I want to load it as a date column in my Hive table. I want to partition the table on that date column. How can I do that?
Example: in HDFS the data looks like 2013-07-30 11:08:36
In Hive I want to load only the date (2013-07-30), not the timestamp, and partition on that column daily.
How can I pass the partition column dynamically?
I have tried loading the data into one table as a source. In the final table I will do insert overwrite table partition by (date_column=dynamic date) select * from table1
Set these 2 properties -
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
And the Query can be like -
INSERT OVERWRITE TABLE final_table PARTITION (DATE_STR)
SELECT
:
:
-- Partition Col is the last column
to_date(date_column) DATE_STR
FROM table1;
You can explore the two options of hive-import - if it is an incremental import you will be able to get the current day's partition.
--hive-partition-key
--hive-partition-value
You can load the EMP_HISTORY table from EMP by enabling dynamic partitioning and converting the timestamp to a date with the to_date function.
The code might look something like this:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE EMP_HISTORY PARTITION (join_date)
SELECT e.name AS name, e.age AS age, e.salary AS salary, e.loc AS loc, to_date(e.join_date) AS join_date FROM EMP e;
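For the timestamp format shown in the question (2013-07-30 11:08:36), a fuller sketch including the table definitions might look like this. The table and column names (events_staging, events_by_day, id, event_ts, event_date) are hypothetical, and the staging table is assumed to already point at the Sqoop output.

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Hypothetical staging table holding the raw Sqoop output,
-- with the timestamp kept as text.
CREATE TABLE IF NOT EXISTS events_staging
(
  id       INT,
  event_ts STRING
);

-- Hypothetical final table, partitioned by the day part only.
CREATE TABLE IF NOT EXISTS events_by_day
(
  id       INT,
  event_ts STRING
)
PARTITIONED BY (event_date STRING);

-- to_date() drops the time portion, so '2013-07-30 11:08:36'
-- lands in partition event_date=2013-07-30.
INSERT OVERWRITE TABLE events_by_day PARTITION (event_date)
SELECT
  id,
  event_ts,
  to_date(event_ts) AS event_date  -- partition column comes last
FROM events_staging;
```

Keeping the original timestamp as a regular column preserves the time of day while the partition stays at daily granularity.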

Resources