Add a new partition in hive external table and update the existing partition to column of the table to non-partition column - hadoop

I have below existing table emp with partition column as as_of_date(current_date -1).
CREATE EXTERNAL TABLE IF NOT EXISTS emp(
student_ID INT,
name STRING)
partitioned by (as_of_date date)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/emp';
Below are the existing partition path
user/emp/as_of_date=2021-09-02
user/emp/as_of_date=2021-09-03
user/emp/as_of_date=2021-09-04
In emp table, I have to add new partition column as businessdate(current_date) and change partition column (as_of_date) to non-partition column.
Expected output should be as below.
describe table emp;
CREATE EXTERNAL TABLE IF NOT EXISTS emp(
student_ID INT,
Name STRING,
as_of_date date)
partitioned by (businessdate date)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/emp';
After update, below will be hdfs path
user/emp/buinessdate=2021-09-03
user/emp/buinessdate=2021-09-04
user/emp/businessdate=2021-09-05
Expected output table:
|student_ID |name |as_of_date | business_date |
|--| --- | --- |----|
|1 | Sta |2021-09-02| 2021-09-03 |
|2 |Danny|2021-09-03| 2021-09-04 |
|3 |Elle |2021-09-04| 2021-09-05 |

Create new table, load data from old table, remove old table, rename new table.
--1 Create new table emp1
CREATE EXTERNAL TABLE emp1(
student_ID INT,
Name STRING,
as_of_date date)
partitioned by (businessdate date)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/emp1';
--2 Load data into emp1 from the emp with new partition column calculated
--dynamic partition mode
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table emp1 partition (business_date)
select
student_ID,
Name,
as_of_date,
date_add(as_of_date,1) as business_date
from emp;
Now you can drop old (make it managed first to drop location as well) and rename new table if necessary.

Related

partition the hive data complex data types while inserting data its shows error

i created a table using hive i want to partition the data based on location
create table student(
id bigint
,name string
,location string
, course array<string>)
ROW FORMAT DELIMiTED fields terminated by '\t'
collection items terminated by ','
stored as textfile;
and data like
100 student1 ongole java,.net,hadoop
101 student2 hyderabad .net,hadoop
102 student3 vizag java,hadoop
103 student4 ongole .net,hadoop
104 student5 vizag java,.net
105 student6 ongole java,.net,hadoop
106 student7 neollre .net,hadoop
creating partition table:
create table student_partition(
id bigint
,name string
,course array<string>)
PARTITIONED BY (address string)
ROW FORMAT DELIMiTED fields terminated by '\t'
collection items terminated by ','
stored as textfile;
INSERT OVERWRITE TABLE student_partition PARTITION(address) select *
from student;
i'm trying to partition the data based on location but it shows below error:
FAILED: SemanticException [Error 10044]: Line 1:23 Cannot insert into
target table because column number/types are different 'address':
Cannot convert column 2 from string to array.
please anyone help me.
The columns of the source and the target should match
Option 1: adjust the source to the target. The partition column goes last
insert into student_partition partition (address)
select id,name,course,location
from student
;
Option 2: adjust the target to the source
insert into student_partition partition (address) (id,name,address,course)
select *
from student
;
P.s.
You might need this -
set hive.exec.dynamic.partition.mode=nonstrict
;

How I avoid the "NULL" in the first "Field Name" of Hive table

First of all I have created the table "emp" in Hive by using below commands:
create table emp (id INT, name STRING, address STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
Then load the data in this "emp" table by this below command:
LOAD DATA LOCAL INPATH '\home\cloudera\Desktop\emp.txt' overwrite into table emp;
When I select the data from "emp" table: it show me first field of table Null
like this:
You have an header row in your file and the first value id cannot be converted into INT therefore being replaced by NULL.
add tblproperties ("skip.header.line.count"="1") to your table definition
For an existing table -
alter table emp set tblproperties ("skip.header.line.count"="1");

Alter hive table add or drop column

I have orc table in hive I want to drop column from this table
ALTER TABLE table_name drop col_name;
but I am getting the following exception
Error occurred executing hive query: OK FAILED: ParseException line 1:35 mismatched input 'user_id1' expecting PARTITION near 'drop' in drop partition statement
Can any one help me or provide any idea to do this? Note, I am using hive 0.14
You cannot drop column directly from a table using command ALTER TABLE table_name drop col_name;
The only way to drop column is using replace command. Lets say, I have a table emp with id, name and dept column. I want to drop id column of table emp. So provide all those columns which you want to be the part of table in replace columns clause. Below command will drop id column from emp table.
ALTER TABLE emp REPLACE COLUMNS( name string, dept string);
There is also a "dumb" way of achieving the end goal, is to create a new table without the column(s) not wanted. Using Hive's regex matching will make this rather easy.
Here is what I would do:
-- make a copy of the old table
ALTER TABLE table RENAME TO table_to_dump;
-- make the new table without the columns to be deleted
CREATE TABLE table AS
SELECT `(col_to_remove_1|col_to_remove_2)?+.+`
FROM table_to_dump;
-- dump the table
DROP TABLE table_to_dump;
If the table in question is not too big, this should work just well.
suppose you have an external table viz. organization.employee as: (not including TBLPROPERTIES)
hive> show create table organization.employee;
OK
CREATE EXTERNAL TABLE `organization.employee`(
`employee_id` bigint,
`employee_name` string,
`updated_by` string,
`updated_date` timestamp)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://getnamenode/apps/hive/warehouse/organization.db/employee'
You want to remove updated_by, updated_date columns from the table. Follow these steps:
create a temp table replica of organization.employee as:
hive> create table organization.employee_temp as select * from organization.employee;
drop the main table organization.employee.
hive> drop table organization.employee;
remove the underlying data from HDFS (need to come out of hive shell)
[nameet#ip-80-108-1-111 myfile]$ hadoop fs -rm hdfs://getnamenode/apps/hive/warehouse/organization.db/employee/*
create the table with removed columns as required:
hive> CREATE EXTERNAL TABLE `organization.employee`(
`employee_id` bigint,
`employee_name` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://getnamenode/apps/hive/warehouse/organization.db/employee'
insert the original records back into original table.
hive> insert into organization.employee
select employee_id, employee_name from organization.employee_temp;
finally drop the temp table created
hive> drop table organization.employee_temp;
ALTER TABLE emp REPLACE COLUMNS( name string, dept string);
Above statement can only change the schema of a table, not data.
A solution of this problem to copy data in a new table.
Insert <New Table> Select <selective columns> from <Old Table>
ALTER TABLE is not yet supported for non-native tables; i.e. what you get with CREATE TABLE when a STORED BY clause is specified.
check this https://cwiki.apache.org/confluence/display/Hive/StorageHandlers
After a lot of mistakes, in addition to above explanations, I would add simpler answers.
Case 1: Add new column named new_column
ALTER TABLE schema.table_name
ADD new_column INT COMMENT 'new number column');
Case 2: Rename a column new_column to no_of_days
ALTER TABLE schema.table_name
CHANGE new_column no_of_days INT;
Note that in renaming, both columns should be of same datatype like above as INT
For external table its simple and easy.
Just drop the table schema then edit create table schema , at last again create table with new schema.
example table: aparup_test.tbl_schema_change and will drop column id
steps:-
------------- show create table to fetch schema ------------------
spark.sql("""
show create table aparup_test.tbl_schema_change
""").show(100,False)
o/p:
CREATE EXTERNAL TABLE aparup_test.tbl_schema_change(name STRING, time_details TIMESTAMP, id BIGINT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
)
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'gs://aparup_test/tbl_schema_change'
TBLPROPERTIES (
'parquet.compress' = 'snappy'
)
""")
------------- drop table --------------------------------
spark.sql("""
drop table aparup_test.tbl_schema_change
""").show(100,False)
------------- edit create table schema by dropping column "id"------------------
CREATE EXTERNAL TABLE aparup_test.tbl_schema_change(name STRING, time_details TIMESTAMP)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
)
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'gs://aparup_test/tbl_schema_change'
TBLPROPERTIES (
'parquet.compress' = 'snappy'
)
""")
------------- sync up table schema with parquet files ------------------
spark.sql("""
msck repair table aparup_test.tbl_schema_change
""").show(100,False)
==================== DONE =====================================
Even below query is working for me.
Alter table tbl_name drop col_name

hive external partitioned table

First i created hive external table partitioned by code and date
CREATE EXTERNAL TABLE IF NOT EXISTS XYZ
(
ID STRING,
SAL BIGINT,
NAME STRING,
)
PARTITIONED BY (CODE INT,DATE STRING)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT "parquet.hive.DeprecatedParquetInputFormat"
OUTPUTFORMAT "parquet.hive.DeprecatedParquetOutputFormat"
LOCATION '/old_work/XYZ';
and then i execute insert overwrite on this table taking data from other table
INSERT OVERWRITE TABLE XYZ PARTITION (CODE,DATE)
SELECT
*
FROM TEMP_XYZ;
and after that i count the number of records in hive
select count(*) from XYZ;
it shows me 1000 records are there
and then i rename or move the location '/old_work/XYZ' to '/new_work/XYZ'
and then i again drop the XYZ table and created again pointing location to new directory
means '/new_work/XYZ'
CREATE EXTERNAL TABLE IF NOT EXISTS XYZ
(
ID STRING,
SAL BIGINT,
NAME STRING,
)
PARTITIONED BY (CODE INT,DATE STRING)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT "parquet.hive.DeprecatedParquetInputFormat"
OUTPUTFORMAT "parquet.hive.DeprecatedParquetOutputFormat"
LOCATION '/new_work/XYZ';
But then when i execute select count(*) from XYZ table in hive , it shows 0 records ,
i think i missed something , please help me on this????
You need not drop the table and re create it the second time:
As soon as you move or rename a external hdfs location of the table just do this :
msck repair table <table_name>
In your case the error was because, The hive metastore wasnt updated with the new path .

Hive - facing challenge's in Dynamic partition error

Can any one guide me where I am doing mistake while doing dynamic partition.
--Staging table:
create table staging_peopledata
(
firstname string,
secondname string,
salary float,
country string
state string
)
row format delimited fields terminated by ',' lines terminated by '\n';
--Data for Staging table:
John,David,30000,RUS,tnRUS
John,David,30000,RUS,tnRUS
Mary,David,5000,AUS,syAUS
Mary,David,5000,AUS,syAUS
Mary,David,5000,AUS,weAUS
Pierre,Cathey,6000,RUS,kaRUS
Pierre,Cathey,6000,RUS,kaRUS
Ahmed,Talib,11000,US,bcUS
Ahmed,Talib,11000,US,onUS
Ahmed,Talib,11000,US,onUS
kris,David,80000,UK,lnUK
kris,David,80000,UK,soUK
--Production table:
create table Production_peopledata
(
firstname string,
lastname string,
salary float)
partitioned by (country string, state string)
row format delimited fields terminated by ',' lines terminated by '\n';
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table Production_peopledata
partition(country,state)
select firstname, secondname, salary, country, state from staging_peopledata;
If i execute the above command I am getting error as below.
FAILED: SemanticException [Error 10096]: Dynamic partition strict mode
requires atleast one static partition column. To turn this off set
hive.exec.dynamic.partition.mode=nonstrict
Can any one tell me where I am doing the mistake.
Can you please run below command on Hive Shell.
hive>set hive.exec.dynamic.partition.mode=nonstrict;
You need to set below properties:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
The column name want to partition on should not be part of the table definition. As the partition column is dynamically generated. While filling the data in the partitioned table the partitioned column should come from the source table.
Let's say we have EMP and EMP1 tables. EMP1 is the partitioned table which will get the data from the EMP table. Initially both of these tables are same. So first we need to create a partitioned column i.e. salpart. Then we will add this column in the source table which is EMP. After successful run we can see the partitioned files in user/hive/warehouse location. The above explanation is implemented as below:
load data local inpath '/home/cloudera/myemployeedata.txt' overwrite into table emp;
CREATE TABLE IF NOT EXISTS emp ( eid int, name String,
salary String, destination String,salpart string)
COMMENT "Employee details"
ROW FORMAT DELIMITED
FIELDS TERMINATED BY "\t"
LINES TERMINATED BY "\n"
STORED AS TEXTFILE;
CREATE TABLE IF NOT EXISTS emp1 ( eid int, name String,
salary String, destination String)
COMMENT "Employee details"
partitioned by (salpart string) {this column will values will come from a seperate table }
ROW FORMAT DELIMITED
FIELDS TERMINATED BY "\t"
LINES TERMINATED BY "\n"
STORED AS TEXTFILE;
Dynamic Partition:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table emp1 partition(salpart) select eid,name,salary,destination,salpart from emp;
as per the error it seems that mode in still strict, for dynamic partitioning it need to be set to non strict
use below command
hive>set hive.exec.dynamic.partition.mode=nonstrict;
Once again try to do
Set hive.exec.dynamic.partition.mode=nonstrict
Sometimes in hive it happens even if you set this property it considers strict mode hence I suggest you to set this property once again

Resources