Partitioning Hive data with complex data types shows an error while inserting - hadoop

I created a table using Hive and I want to partition the data based on location:
create table student(
id bigint
,name string
,location string
,course array<string>)
ROW FORMAT DELIMITED fields terminated by '\t'
collection items terminated by ','
stored as textfile;
and the data looks like:
100 student1 ongole java,.net,hadoop
101 student2 hyderabad .net,hadoop
102 student3 vizag java,hadoop
103 student4 ongole .net,hadoop
104 student5 vizag java,.net
105 student6 ongole java,.net,hadoop
106 student7 neollre .net,hadoop
Creating the partitioned table:
create table student_partition(
id bigint
,name string
,course array<string>)
PARTITIONED BY (address string)
ROW FORMAT DELIMITED fields terminated by '\t'
collection items terminated by ','
stored as textfile;
INSERT OVERWRITE TABLE student_partition PARTITION(address) select *
from student;
I'm trying to partition the data based on location, but it shows the error below:
FAILED: SemanticException [Error 10044]: Line 1:23 Cannot insert into
target table because column number/types are different 'address':
Cannot convert column 2 from string to array.
Can anyone please help me?

The columns of the source and the target should match.
Option 1: adjust the source to the target. The partition column goes last.
insert into student_partition partition (address)
select id,name,course,location
from student
;
Option 2: adjust the target to the source.
insert into student_partition partition (address) (id,name,address,course)
select *
from student
;
P.S. You might need this:
set hive.exec.dynamic.partition.mode=nonstrict;
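As a quick sanity check (not part of the original answer), listing the partitions after a successful insert should show one entry per distinct location:
show partitions student_partition;
-- expected to list address=hyderabad, address=neollre, address=ongole, address=vizag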

Related

HIVE - Cannot partition a table: semantic exception failure

I'm not able to import data into a partitioned table in Hive.
Here is how I create the table:
CREATE TABLE IF NOT EXISTS title_ratings
(
tconst STRING,
averageRating DOUBLE,
numVotes INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
TBLPROPERTIES("skip.header.line.count"="1");
And then I load the data into it: LOAD DATA INPATH '/title.ratings.tsv.gz' INTO TABLE eval_hive_db.title_ratings;
It works fine up to here. Now I want to create a dynamically partitioned table. First of all, I set up these params:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
I now create my partitioned table:
CREATE TABLE IF NOT EXISTS title_ratings_part
(
tconst STRING,
numVotes INT
)
PARTITIONED BY (averageRating DOUBLE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\n'
STORED AS TEXTFILE;
insert into title_ratings_part partition(title_ratings) select tconst, averageRating, numVotes from title_ratings;
(I also tried with numVotes instead by the way)
And I receive this error: FAILED: ValidationFailureSemanticException eval_hive_db.title_ratings_part: Partition spec {title_ratings=null} contains non-partition columns
Can someone help me, please?
Ideally, I want to partition my table by averageRating (less than 2, between 2 and 4, and greater than 4).
You can run this command to check whether there are any null values:
select averageRating, count(*) from title_ratings group by averageRating;
If there are null values in this column, they will show up as a NULL group with its count; you have to fill them and then apply the partitioning again.
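If nulls do show up, one way to handle them is to substitute a default value in the select before partitioning; a sketch, not from the original answer, assuming 0.0 is an acceptable placeholder:
insert into title_ratings_part partition(averageRating)
select tconst, numVotes, coalesce(averageRating, 0.0) as averageRating
from title_ratings;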
The partition column is stored as the last column in a table, so while inserting you need to maintain the correct order in the select statement.
Please change the order of columns in the select:
insert into title_ratings_part partition(averageRating)
select
tconst,
numVotes,
averageRating -- the partition column should always be the last column in the select
from title_ratings;
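The question also asks about partitioning by rating ranges (less than 2, between 2 and 4, greater than 4) rather than raw values. That needs a derived partition column; a sketch using a hypothetical string column rating_band and a separate target table, assuming the dynamic partition settings above are still in effect:
CREATE TABLE IF NOT EXISTS title_ratings_band
(
tconst STRING,
averageRating DOUBLE,
numVotes INT
)
PARTITIONED BY (rating_band STRING)
STORED AS TEXTFILE;
INSERT INTO title_ratings_band PARTITION (rating_band)
SELECT tconst, averageRating, numVotes,
CASE WHEN averageRating < 2 THEN 'lt2'
WHEN averageRating <= 4 THEN '2to4'
ELSE 'gt4' END AS rating_band -- derived partition value, listed last
FROM title_ratings;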

Add a new partition column to a Hive external table and change the existing partition column to a non-partition column

I have the existing table emp below, with partition column as_of_date (current_date - 1).
CREATE EXTERNAL TABLE IF NOT EXISTS emp(
student_ID INT,
name STRING)
partitioned by (as_of_date date)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/emp';
Below are the existing partition paths:
user/emp/as_of_date=2021-09-02
user/emp/as_of_date=2021-09-03
user/emp/as_of_date=2021-09-04
In the emp table, I have to add a new partition column businessdate (current_date) and change the partition column as_of_date to a non-partition column.
The expected output should be as below.
describe table emp;
CREATE EXTERNAL TABLE IF NOT EXISTS emp(
student_ID INT,
Name STRING,
as_of_date date)
partitioned by (businessdate date)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/emp';
After the update, below will be the HDFS paths:
user/emp/businessdate=2021-09-03
user/emp/businessdate=2021-09-04
user/emp/businessdate=2021-09-05
Expected output table:
| student_ID | name  | as_of_date | business_date |
| ---------- | ----- | ---------- | ------------- |
| 1          | Sta   | 2021-09-02 | 2021-09-03    |
| 2          | Danny | 2021-09-03 | 2021-09-04    |
| 3          | Elle  | 2021-09-04 | 2021-09-05    |
Create a new table, load the data from the old table, remove the old table, and rename the new table.
--1 Create new table emp1
CREATE EXTERNAL TABLE emp1(
student_ID INT,
Name STRING,
as_of_date date)
partitioned by (businessdate date)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/emp1';
--2 Load data into emp1 from emp with the new partition column calculated
--dynamic partition mode
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table emp1 partition (businessdate)
select
student_ID,
Name,
as_of_date,
date_add(as_of_date,1) as businessdate
from emp;
Now you can drop the old table (make it managed first so that the location is dropped as well) and rename the new table if necessary.
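The drop-and-rename steps might look like this (a sketch; flipping the EXTERNAL property makes the old table managed, so dropping it removes the data under /user/emp as well, while the renamed table keeps its /user/emp1 location):
ALTER TABLE emp SET TBLPROPERTIES('EXTERNAL'='FALSE');
DROP TABLE emp;
ALTER TABLE emp1 RENAME TO emp;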

How to overwrite a Hive table with an updated file

I have a CSV file:
Name,Age,City,Country
SACHIN,44,PUNE,INDIA
TENDULKAR,45,MUMBAI,INDIA
SOURAV,45,NEW YORK,USA
GANGULY,45,CHICAGO,USA
I created a Hive table and loaded the data into it.
I found that the above file is wrong; the corrected file is below:
Name,Age,City,Country
SACHIN,44,PUNE,INDIA
TENDULKAR,45,MUMBAI,INDIA
SOURAV,45,NEW JERSEY,USA
GANGULY,45,CHICAGO,USA
I need to update my main table with the correct file.
I have tried the approaches below.
1- Created the main table as a table partitioned on City and dynamically loaded the first file.
Step 1- Creating a temp table and loading the old.csv file as-is, without partitioning. I am doing this step so that I can insert data into the main table dyn dynamically, without creating separate input files per partition.
create table temp(
name string,
age int,
city string,
country string)
row format delimited
fields terminated by ','
stored as textfile;
Step 2- Loading the old file into the temporary table.
load data local inpath '/home/test_data/old.csv' into table temp;
Step 3- Creating the main partitioned table.
create table dyn(
name string,
age int)
partitioned by(city string,country string)
row format delimited
fields terminated by ','
stored as textfile;
Step 4- Dynamically inserting the old.csv data into the partitioned table from the temporary table.
insert into table dyn
partition(city,country)
select name,age,city,country from temp;
The old records were dynamically inserted into the main table. In the next steps I am trying to correct the main table dyn from old.csv to new.csv.
Step 5- Creating another temporary table for the new and correct input file.
create table temp1(
name string,
age int,
city string,
country string)
row format delimited
fields terminated by ','
stored as textfile;
Step 6- Loading the new and correct input file into the second temp table, which will then be used to overwrite the main table, but only the row whose data was wrong in old.csv, i.e. SOURAV,45,NEW YORK,USA to SOURAV,45,NEW JERSEY,USA.
load data local inpath '/home/test_data/new.csv' into table temp1;
Step 7- Overwriting the main table, but only the row whose data was wrong in old.csv.
Final overwrite, attempt 1:
insert overwrite table dyn partition(country='USA' , city='NEW YORK') select city,country from temp1 t where t.city='NEW JERSEY' and t.country='USA';
Result: NULL was inserted in the Name column.
NEW JERSEY NULL NEW YORK USA
Final overwrite, attempt 2:
insert overwrite table dyn partition(country='USA' , city='NEW YORK') select name,age from temp1 t where t.city='NEW JERSEY' and t.country='USA';
Result: No change in the dyn table, same as before. NEW YORK did not update to NEW JERSEY.
Final overwrite, attempt 3:
insert overwrite table dyn partition(country='USA' , city='NEW YORK') select * from temp1 t where t.city='NEW JERSEY' and t.country='USA';
Error:- FAILED: SemanticException [Error 10044]: Line 1:23 Cannot Insert into target table because column number/types are different. Table insclause-0 has 2 columns,but query has 4 columns
What is the correct approach to this problem?
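For reference, one approach consistent with the attempts above (a sketch, not an answer from the original thread): drop the stale partition, then re-insert the corrected row with a dynamic partition insert, keeping the partition columns last in the select.
alter table dyn drop if exists partition(city='NEW YORK', country='USA');
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert into table dyn partition(city,country)
select name,age,city,country from temp1 t where t.city='NEW JERSEY' and t.country='USA';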

How to copy a TEXT format partitioned table to an ORC format table in Hive

I have a text format Hive table, like:
CREATE EXTERNAL TABLE op_log (
time string, debug string,app_id string,app_version string, ...more fields)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
Now I create an ORC format table with the same fields, like:
CREATE TABLE op_log_orc (
time string, debug string,app_id string,app_version string, ...more fields)
PARTITIONED BY (dt string)
STORED AS ORC tblproperties ("orc.compress" = "SNAPPY");
When I copy from op_log to op_log_orc, I get this error:
hive> insert into op_log_orc PARTITION(dt='2016-08-09') select * from op_log where dt='2016-08-09';
FAILED: SemanticException [Error 10044]: Line 1:12 Cannot insert into target table because column number/types are different ''2016-08-09'': Table insclause-0 has 62 columns, but query has 63 columns.
hive>
The partition key (dt) in the source table is returned in the result set as though it were a regular field, so you have the extra column. Exclude the dt field from the field list (instead of *) if you're going to specify its value in the partition key. Alternatively, just specify dt as the name of the partition, without providing a value. See CTAS (create table as select...) in the example here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableAsSelect(CTAS)
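A sketch of both variants (the column list is abbreviated, since the original DDL only shows "...more fields"):
-- static partition value: leave dt out of the select list
insert into op_log_orc partition(dt='2016-08-09')
select time, debug, app_id, app_version from op_log where dt='2016-08-09';
-- dynamic partition: keep select *, because dt comes back as the last column
set hive.exec.dynamic.partition.mode=nonstrict;
insert into op_log_orc partition(dt)
select * from op_log where dt='2016-08-09';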

Hive - facing challenges with a dynamic partition error

Can anyone guide me on where I am making a mistake while doing dynamic partitioning?
--Staging table:
create table staging_peopledata
(
firstname string,
secondname string,
salary float,
country string,
state string
)
row format delimited fields terminated by ',' lines terminated by '\n';
--Data for Staging table:
John,David,30000,RUS,tnRUS
John,David,30000,RUS,tnRUS
Mary,David,5000,AUS,syAUS
Mary,David,5000,AUS,syAUS
Mary,David,5000,AUS,weAUS
Pierre,Cathey,6000,RUS,kaRUS
Pierre,Cathey,6000,RUS,kaRUS
Ahmed,Talib,11000,US,bcUS
Ahmed,Talib,11000,US,onUS
Ahmed,Talib,11000,US,onUS
kris,David,80000,UK,lnUK
kris,David,80000,UK,soUK
--Production table:
create table Production_peopledata
(
firstname string,
lastname string,
salary float)
partitioned by (country string, state string)
row format delimited fields terminated by ',' lines terminated by '\n';
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table Production_peopledata
partition(country,state)
select firstname, secondname, salary, country, state from staging_peopledata;
If I execute the above command, I get the error below.
FAILED: SemanticException [Error 10096]: Dynamic partition strict mode
requires atleast one static partition column. To turn this off set
hive.exec.dynamic.partition.mode=nonstrict
Can anyone tell me where I am making the mistake?
Can you please run the below command in the Hive shell:
hive> set hive.exec.dynamic.partition.mode=nonstrict;
You need to set the below properties:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
The column you want to partition on should not be part of the table definition, as the partition column is generated dynamically. While filling the partitioned table, the partition column values should come from the source table.
Let's say we have EMP and EMP1 tables. EMP1 is the partitioned table, which will get its data from the EMP table. Initially both of these tables are the same, so first we need to create a partition column, i.e. salpart, and add this column to the source table, which is EMP. After a successful run we can see the partition files in the user/hive/warehouse location. The above explanation is implemented as below:
CREATE TABLE IF NOT EXISTS emp ( eid int, name String,
salary String, destination String, salpart string)
COMMENT "Employee details"
ROW FORMAT DELIMITED
FIELDS TERMINATED BY "\t"
LINES TERMINATED BY "\n"
STORED AS TEXTFILE;
load data local inpath '/home/cloudera/myemployeedata.txt' overwrite into table emp;
CREATE TABLE IF NOT EXISTS emp1 ( eid int, name String,
salary String, destination String)
COMMENT "Employee details"
partitioned by (salpart string) -- the values for this column will come from the source table
ROW FORMAT DELIMITED
FIELDS TERMINATED BY "\t"
LINES TERMINATED BY "\n"
STORED AS TEXTFILE;
Dynamic Partition:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table emp1 partition(salpart) select eid,name,salary,destination,salpart from emp;
As per the error, it seems that the mode is still strict; for dynamic partitioning it needs to be set to nonstrict.
Use the below command:
hive> set hive.exec.dynamic.partition.mode=nonstrict;
Once again, try to do:
set hive.exec.dynamic.partition.mode=nonstrict;
Sometimes in Hive, even if you set this property, it still considers strict mode, hence I suggest setting this property once again.
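To confirm the setting actually took effect in the current session, issuing SET with just the property name prints its current value:
hive> set hive.exec.dynamic.partition.mode;
hive.exec.dynamic.partition.mode=nonstrict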
