Hive and Sqoop partition - hadoop

I have Sqooped data from a Netezza table and the output file is in HDFS, but one column is a timestamp and I want to load it as a date column in my Hive table. Using that column I want to create a partition on the date. How can I do that?
Example: in HDFS the data looks like 2013-07-30 11:08:36
In Hive I want to load only the date (2013-07-30), not the timestamp, and I want to partition on that column daily.
How can I pass the partition-by column dynamically?
I have tried loading the data into one table as a source. Into the final table I will do insert overwrite table partition by (date_column=dynamic date) select * from table1

Set these two properties:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
And the query can look like this:
INSERT OVERWRITE TABLE table_name PARTITION (DATE_STR)
SELECT
...
-- the partition column is the last column
to_date(date_column) AS DATE_STR
FROM table1;
You can also explore the two hive-import options of Sqoop below; with an incremental import you can load straight into the current day's partition.
--hive-partition-key
--hive-partition-value
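For example, a one-off import into a fixed partition might look like this sketch (the host, database, user, and table names are placeholders, not from the question):
sqoop import \
  --connect jdbc:netezza://nz-host:5480/PRODDB \
  --username etl_user -P \
  --table SRC_TABLE \
  --hive-import \
  --hive-table staging_table \
  --hive-partition-key date_str \
  --hive-partition-value '2013-07-30'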

You can just load the EMP_HISTORY table from EMP by enabling dynamic partitioning and converting the timestamp to a date with the to_date function.
The code might look something like this:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE EMP_HISTORY PARTITION (join_date)
SELECT e.name AS name, e.age AS age, e.salary AS salary, e.loc AS loc, to_date(e.join_date) AS join_date
FROM EMP e;
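For this to work, EMP_HISTORY must already be declared with join_date as its partition column; a minimal sketch of such a DDL (the column types are assumptions):
CREATE TABLE EMP_HISTORY (
  name   string,
  age    int,
  salary double,
  loc    string
)
PARTITIONED BY (join_date date); -- use string on Hive versions without a date type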

Related

How to insert Hive partition column and value into data (parquet) file?

Request: How can I insert the partition key/value pair into each parquet file while inserting data into a Hive/Impala table?
Hive table DDL:
CREATE EXTERNAL TABLE db.tbl_name (
  col1 string,
  col2 string
)
PARTITIONED BY (date_col string)
STORED AS parquet
LOCATION 'hdfs_path/db/tbl_name';
Let's insert data into this Hive table.
INSERT INTO db.tbl_name PARTITION (date_col='2020-07-26') VALUES ('test1_col1','test1_col2');
Once the records are inserted, let's view the data in the parquet file using parquet-tools or any other tool.
parquet-tools cat hdfs_path/db/tbl_name/date_col=2020-07-26/parquet_file.parquet
Below would be the view.
**********************
col1 = test1_col1
col2 = test1_col2
**********************
However, if I fire the following HQL query on Hive/Impala, it will read the partition value from the metadata.
Query: select * from db.tbl_name
Result:
col1        col2        date_col
test1_col1  test1_col2  2020-07-26
Question: Is there any way we can view the partition column name and value in the parquet file, like below?
col1 = test1_col1
col2 = test1_col2
date_col = 2020-07-26
Please use this:
INSERT INTO db.tbl_name PARTITION (date_col) VALUES ('test1_col1','test1_col2','2020-07-26');
Always mention the partition name inside parentheses as above, and then order the partition column last in the values/select clause.
That's all you need to insert into a Hive/Impala partitioned table.
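Note that a dynamic-partition insert like the one above generally needs dynamic partitioning enabled first, so the full sequence would be:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO db.tbl_name PARTITION (date_col)
VALUES ('test1_col1','test1_col2','2020-07-26');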

Hive partition table by month from daily timestamp?

Is it possible to create a partition like 01 from a date like '2017-01-02', where 01 is the month?
I have daily sales records and I need to run queries like select * from sales where month = '01', so it would be better if I could partition my daily sales by month. But my data has dates in the format 2017-01-01, and doing
create table tl (columns ......) partitioned by (date <datatype>)
will create partitions on a daily basis, which is the last thing I want.
I need to create the partitions dynamically.
CAUTION: You need to escape the date column (using backticks around the column name) in the create statement, because date is a datatype in Hive.
You can create partitions dynamically by setting the parameter below in the query:
set hive.exec.dynamic.partition.mode=nonstrict;
Along with that, you need to select only the month part from the source table (the month of YYYY-MM-DD starts at position 6):
insert into table sales partition(`date`) select columns..., SUBSTR(`date`,6,2) from source_table
This insert statement will create partitions like:
show partitions sales
date=01
date=02
date=03
date=04
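For completeness, here is a sketch of the matching table DDL with the escaped `date` column from the caution above (the data columns are assumptions), plus the monthly query the question asks for:
CREATE TABLE sales (
  item   string,
  amount double
)
PARTITIONED BY (`date` string);

-- partition pruning then serves the monthly query directly
SELECT * FROM sales WHERE `date` = '01';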

How to use insert statement for a Hive partitioned table?

I have a Hive table dynpart:
id       int
name     char(30)
city     char(30)
thisday  string

# Partition Information
# col_name    data_type    comment
thisday       string

It is partitioned by 'thisday', whose datatype is STRING.
How can I insert a single record into the table in a particular partition? I know there is a load command to load an entire data file into a Hive table; I just want to know how an insert statement can be written for a partitioned table. I tried to write a command like the one below, but it takes data from another table:
insert into droplater partition(thisday='30/03/2017') select * from dynpart;
The table droplater has the same structure as dynpart, but the above command inserts the data from another table. What I'd like to learn is how to write a simple insert command into a partition, like insert into tabname values(1,"abcd","efgh");, into the table.
This will work for primitive types only (no arrays, structs, etc.):
insert into tabname partition (thisday='30/03/2017') values (1,"abcd","efgh");
This will work for all types:
insert into tabname partition (thisday='30/03/2017') select 1,"abcd","efgh";
P.S.
By all means, partition your table by date (thisday date):
insert into tabname partition (thisday=date '2017-03-30') ...
or at least use the ISO date format
insert into tabname partition (thisday='2017-03-30') ...
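Following that advice, a date-partitioned version of the question's table could look like this sketch (reusing the column list from the question):
CREATE TABLE tabname (
  id   int,
  name char(30),
  city char(30)
)
PARTITIONED BY (thisday date);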

Hive modify partitioned table data

Problem: one column value is null. It should be 'ab'; unfortunately I have written '' instead of 'ab'.
My table is a partitioned table. Is there any way to change that?
I found the following way, but it seems inefficient:
Create a temp table like my table.
Use INSERT OVERWRITE to read the data from my old table and write it to the new table, using a case statement to change '' to 'ab'.
Then swap the temp table with the original table.
I am looking for a solution something like update partition and msck. Is there any way to do this?
You can overwrite a single partition in this way:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table target_table partition (part_col)
select
case when column = '' then 'ab' else column end as column,
col2, -- select all the columns in the same order
col3,
part_col -- partition column is the last one
from target_table where part_col='your_partition_value';
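Alternatively, since the partition value is already fixed by the WHERE clause, you can write it as a static partition overwrite and skip the dynamic-partition settings; the partition column is then left out of the select list:
insert overwrite table target_table partition (part_col='your_partition_value')
select
case when column = '' then 'ab' else column end as column,
col2,
col3
from target_table where part_col='your_partition_value';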
One possible solution would be to perform an UPDATE on the table, provided the column is neither a partitioning nor a bucketing column.
UPDATE tablename SET column = (CASE WHEN column = '' THEN 'ab' else column END) [WHERE expr if any];
Update: to support ACID operations in Hive, set the following:
SET hive.support.concurrency=true;
SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on=true;
SET hive.compactor.worker.threads=1;
Note: works only if Hive >= 0.14
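Beyond these settings, UPDATE only works against a table created as transactional, which on Hive 0.14-2.x means a bucketed ORC table; a minimal sketch (the table name, column names, and bucket count are assumptions):
CREATE TABLE tablename (
  id   int,
  col1 string
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');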

Hive update all values in a column

I have an external partitioned Hive table. One of its columns is a string named OLDDATE that stores the date in a different format (DD-MM-YY). I want to update the column and store the dates in YYYY-MM-DD format. All years are 20XX.
So I thought of this:
select CONCAT('20',SPLIT(OLDDATE ,'-')[2],'-',SPLIT(OLDDATE ,'-')[1],'-',SPLIT(OLDDATE ,'-')[0]) from table
This gives me the dates in the format I want. Now how do I overwrite the old dates with the new ones?
You can effect an update by overwriting the table with its own contents, just with the date field changed according to your transformation, like this pseudo-code:
INSERT OVERWRITE table
SELECT
col1
, col2
...
, CONCAT('20',SPLIT(OLDDATE ,'-')[2],'-',SPLIT(OLDDATE ,'-')[1],'-',SPLIT(OLDDATE ,'-')[0]) AS olddate
...
, coln
FROM table;
@user2441441
To overwrite a partitioned table:
INSERT OVERWRITE table PARTITION (p_col)
SELECT
col1
, col2
...
, CONCAT('20',SPLIT(OLDDATE ,'-')[2],'-',SPLIT(OLDDATE ,'-')[1],'-',SPLIT(OLDDATE ,'-')[0]) AS olddate
...
, coln
, p_col
FROM table;
Since it's a partitioned table, the folder names must be created with the date values, hence you are not able to update the values in place.
One workaround would be to create a new table, run the above query to insert the data into the new table, and then drop the existing table and treat the new table as your required table.
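A sketch of that workaround, with my_table standing in for the real table and the column list trimmed to the essentials (all names are assumptions):
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- copies the full schema, including the partition column
CREATE TABLE my_table_fixed LIKE my_table;

INSERT OVERWRITE TABLE my_table_fixed PARTITION (p_col)
SELECT
  col1,
  CONCAT('20', SPLIT(olddate,'-')[2], '-', SPLIT(olddate,'-')[1], '-', SPLIT(olddate,'-')[0]) AS olddate,
  p_col
FROM my_table;

-- swap the tables once the data is verified
ALTER TABLE my_table RENAME TO my_table_old;
ALTER TABLE my_table_fixed RENAME TO my_table;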
