Hive - Insert data into partitioned table: partition not found - hadoop

I'm having an issue while trying to insert new data into a Hive external partitioned table.
The table is partitioned by day, and the error I get is:
FAILED: SemanticException [Error 10006]: Line 1:51 Partition not found ''18102016''
My query is as follows:
ALTER TABLE my_source_table RECOVER PARTITIONS;
INSERT OVERWRITE TABLE my_dest_table PARTITION (d = '18102016')
SELECT
'III' AS primary_alias_type,
iii_id AS primary_alias_id,
FROM
my_source_table
WHERE
d = '18102016'
The my_dest_table has been created as:
CREATE EXTERNAL TABLE my_dest_table (
primary_alias_type string,
primary_alias_id string
) PARTITIONED BY (d string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://my_bucket/my_external_tables/'
Any idea on what I'm doing wrong? Thanks!

I believe you should run RECOVER PARTITIONS on your destination table as well:
ALTER TABLE my_dest_table RECOVER PARTITIONS;
Try this.
Note: of course, you should also remove the extra comma that Alex L mentioned, which would otherwise cause another parsing error.
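If RECOVER PARTITIONS is not available (it is Amazon EMR syntax), a minimal alternative sketch, reusing the table and partition names from the question, is to register the partition explicitly before the insert, or, on stock Apache Hive, to sync all partitions from the filesystem:
-- register the one partition the INSERT targets
ALTER TABLE my_dest_table ADD IF NOT EXISTS PARTITION (d = '18102016');
-- or discover every partition directory under the table location
MSCK REPAIR TABLE my_dest_table;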

Related

HIVE - Cannot partition a table: semantic exception failure

I'm not able to import data into a partitioned table in Hive.
Here is how I create the table:
CREATE TABLE IF NOT EXISTS title_ratings
(
tconst STRING,
averageRating DOUBLE,
numVotes INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
TBLPROPERTIES("skip.header.line.count"="1");
And then I load the data into it: LOAD DATA INPATH '/title.ratings.tsv.gz' INTO TABLE eval_hive_db.title_ratings;
It works fine up to this point. Now I want to create a dynamically partitioned table. First of all, I set these params:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
I now create my partitioned table:
CREATE TABLE IF NOT EXISTS title_ratings_part
(
tconst STRING,
numVotes INT
)
PARTITIONED BY (averageRating DOUBLE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\n'
STORED AS TEXTFILE;
insert into title_ratings_part partition(title_ratings) select tconst, averageRating, numVotes from title_ratings;
(I also tried with numVotes instead, by the way.)
And I receive this error: FAILED: ValidationFailureSemanticException eval_hive_db.title_ratings_part: Partition spec {title_ratings=null} contains non-partition columns
Can someone help me, please?
Ideally, I want to partition my table by averageRating (less than 2, between 2 and 4, and greater than 4).
You can run this command to check whether there are null values or not:
select count(averageRating) from title_ratings group by averageRating;
Now, if there are null values in this column, you will get their count; you will have to fill them in and then apply the partitioning again.
The partition column is stored as the last column in a table, so when inserting you need to maintain the correct order in the select statement, and the partition spec must name the actual partition column (averageRating), not the table. Please change the partition spec and the order of columns in the select:
insert into title_ratings_part partition(averageRating)
select
tconst,
numVotes,
averageRating -- order-wise, this should always be the last column
from title_ratings
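Since the question's stated goal is range-based partitions (less than 2, between 2 and 4, greater than 4), here is a hedged sketch of one way to get there: derive a band column and partition on that instead. The table name title_ratings_band and the band labels are made up for illustration:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

CREATE TABLE IF NOT EXISTS title_ratings_band (
tconst STRING,
averageRating DOUBLE,
numVotes INT
)
PARTITIONED BY (rating_band STRING);

INSERT INTO title_ratings_band PARTITION (rating_band)
SELECT tconst, averageRating, numVotes,
CASE WHEN averageRating < 2 THEN 'lt2'
     WHEN averageRating <= 4 THEN '2to4'
     ELSE 'gt4' END AS rating_band -- partition column must come last
FROM title_ratings;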

To overwrite hive table with an updated file

I have a CSV file:
Name,Age,City,Country
SACHIN,44,PUNE,INDIA
TENDULKAR,45,MUMBAI,INDIA
SOURAV,45,NEW YORK,USA
GANGULY,45,CHICAGO,USA
I created a Hive table and loaded the data into it.
I found that the above file is wrong; the corrected file is below:
Name,Age,City,Country
SACHIN,44,PUNE,INDIA
TENDULKAR,45,MUMBAI,INDIA
SOURAV,45,NEW JERSEY,USA
GANGULY,45,CHICAGO,USA
I need to update my main table with the correct file.
I have tried the approaches below.
1- Created the main table as a partitioned table on City and dynamically loaded the first file.
Step 1: Creating a temp table and loading the old.csv file as-is, without partitioning. I am doing this step so that I can insert data into the main table dyn dynamically, without creating separate input files per partition.
create table temp(
name string,
age int,
city string,
country string)
row format delimited
fields terminated by ','
stored as textfile;
Step 2: Loading the old file into the temporary table.
load data local inpath '/home/test_data/old.csv' into table temp;
Step 3: Creating the main partitioned table.
create table dyn(
name string,
age int)
partitioned by(city string,country string)
row format delimited
fields terminated by ','
stored as textfile;
Step 4: Dynamically inserting the old.csv data into the partitioned table from the temporary table.
insert into table dyn
partition(city,country)
select name,age,city,country from temp;
The old records were dynamically inserted into the main table. In the next steps I am trying to correct the main table dyn, going from old.csv to new.csv.
Step 5: Creating another temporary table with the new and correct input file.
create table temp1(
name string,
age int,
city string,
country string)
row format delimited
fields terminated by ','
stored as textfile;
Step 6: Loading the new and correct input file into the second temp table, which will then be used to overwrite the main table, but only the row whose data was wrong in old.csv; that is, SOURAV,45,NEW YORK,USA becomes SOURAV,45,NEW JERSEY,USA.
load data local inpath '/home/test_data/new.csv' into table temp1;
Step 7, attempt 1:
insert overwrite table dyn partition(country='USA' , city='NEW YORK') select city,country from temp1 t where t.city='NEW JERSEY' and t.country='USA';
Result: inserted NULL in the Name column.
NEW JERSEY NULL NEW YORK USA
Step 7, attempt 2:
insert overwrite table dyn partition(country='USA' , city='NEW YORK') select name,age from temp1 t where t.city='NEW JERSEY' and t.country='USA';
Result: no change in the dyn table; same as before. NEW YORK did not update to NEW JERSEY.
Step 7, attempt 3:
insert overwrite table dyn partition(country='USA' , city='NEW YORK') select * from temp1 t where t.city='NEW JERSEY' and t.country='USA';
Error: FAILED: SemanticException [Error 10044]: Line 1:23 Cannot insert into target table because column number/types are different. Table insclause-0 has 2 columns, but query has 4 columns
What is the correct approach to this problem?
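A hedged sketch of one approach (not from the thread): the wrong row lives in the city='NEW YORK' partition, while the corrected row belongs in city='NEW JERSEY', so overwriting the NEW YORK partition in place cannot yield the desired layout. Instead, drop the stale partition and re-insert the corrected row dynamically from temp1 (assuming the dynamic partition settings from the earlier steps are still in effect):
-- remove the partition holding the wrong row
ALTER TABLE dyn DROP IF EXISTS PARTITION (city='NEW YORK', country='USA');
-- re-insert the corrected row; partition columns go last in the select
INSERT INTO TABLE dyn PARTITION (city, country)
SELECT name, age, city, country
FROM temp1 t
WHERE t.city = 'NEW JERSEY' AND t.country = 'USA';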

Unable to load data in Hive partitioned table

I have created a table in Hive with the following query:
create table if not exists employee(CASE_NUMBER String,
CASE_STATUS String,
CASE_RECEIVED_DATE DATE,
DECISION_DATE DATE,
EMPLOYER_NAME STRING,
PREVAILING_WAGE_PER_YEAR BIGINT,
PAID_WAGE_PER_YEAR BIGINT,
order_n int) partitioned by (JOB_TITLE_SUBGROUP STRING) row format delimited fields terminated by ',';
I tried loading data into the created table using the query below:
LOAD DATA INPATH '/salary_data.csv' overwrite into table employee partition (JOB_TITLE_SUBGROUP);
For the partitioned table, I have even set the following configuration:
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
But I am getting the error below while executing the load query:
Your query has the following error(s):
Error while compiling statement: FAILED: SemanticException org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Invalid partition key & values; keys [job_title_subgroup, ], values [])
Please help.
If you want to load data into a Hive partition, you have to provide the value of the partition itself in the LOAD DATA query. So in this case, your query would be something like this:
LOAD DATA INPATH '/salary_data.csv' overwrite into table employee partition (JOB_TITLE_SUBGROUP="Value");
where "Value" is the name of the partition into which you are loading your data. The reason is that Hive will use "Value" to create the directory in which your .csv is going to be stored, which will be something like this: .../employee/JOB_TITLE_SUBGROUP=Value. I hope this helps.
Check the documentation for details on the LOAD DATA syntax.
EDITED
Since the table uses dynamic partitioning, one solution would be loading the .csv into an external table (e.g. employee_external) and then executing an INSERT command like this:
INSERT OVERWRITE TABLE employee PARTITION(JOB_TITLE_SUBGROUP)
SELECT CASE_NUMBER, CASE_STATUS, (...), JOB_TITLE_SUBGROUP
FROM employee_external
I might be a little late to reply, but you can try the steps below.
Set these properties first:
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
Create temp table first:
CREATE EXTERNAL TABLE IF NOT EXISTS employee_temp(
ID STRING,
Name STRING,
Salary STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
tblproperties ("skip.header.line.count"="1");
Load data into the temporary table:
hive> LOAD DATA INPATH 'filepath/employee.csv' OVERWRITE INTO TABLE employee_temp;
Create Partitioned Table:
CREATE EXTERNAL TABLE IF NOT EXISTS employee_part(
ID STRING,
Name STRING)
PARTITIONED BY (Salary STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
tblproperties ("skip.header.line.count"="1");
Load data into the partitioned table from the intermediate / temp table:
INSERT OVERWRITE TABLE employee_part PARTITION (SALARY) SELECT * FROM employee_temp;
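To confirm that the dynamic insert created the expected partition directories, a quick check in standard HiveQL:
-- list the partitions Hive registered for the table
SHOW PARTITIONS employee_part;
-- spot-check a few rows
SELECT * FROM employee_part LIMIT 5;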

How to Copy TEXT format partitioned table to ORC format Table in Hive

I have a TEXT format Hive table, like:
CREATE EXTERNAL TABLE op_log (
time string, debug string,app_id string,app_version string, ...more fields)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
Now I create an ORC format table with the same fields, like:
CREATE TABLE op_log_orc (
time string, debug string,app_id string,app_version string, ...more fields)
PARTITIONED BY (dt string)
STORED AS ORC tblproperties ("orc.compress" = "SNAPPY");
When I copy from op_log to op_log_orc, I get this error:
hive> insert into op_log_orc PARTITION(dt='2016-08-09') select * from op_log where dt='2016-08-09';
FAILED: SemanticException [Error 10044]: Line 1:12 Cannot insert into target table because column number/types are different ''2016-08-09'': Table insclause-0 has 62 columns, but query has 63 columns.
The partition key (dt) in the source table is returned in the result set as though it were a regular field, so you have an extra column. Exclude the dt field from the field list (instead of using *) if you're going to specify its value in the partition spec. Alternatively, just specify dt as the name of the partition, without providing a value. See CTAS (create table as select...) in the example here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableAsSelect(CTAS)
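Both options sketched against the tables in the question (the field list is abbreviated, as in the original):
-- Option 1: keep the static partition value, but leave dt out of the select list
insert into op_log_orc PARTITION (dt='2016-08-09')
select time, debug, app_id, app_version -- ...more fields, but NOT dt
from op_log where dt='2016-08-09';

-- Option 2: name dt as a dynamic partition and let it flow through as the last column
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert into op_log_orc PARTITION (dt)
select * from op_log where dt='2016-08-09';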

Alter hive table add or drop column

I have an ORC table in Hive and I want to drop a column from this table:
ALTER TABLE table_name drop col_name;
but I am getting the following exception
Error occurred executing hive query: OK FAILED: ParseException line 1:35 mismatched input 'user_id1' expecting PARTITION near 'drop' in drop partition statement
Can anyone help me or provide any idea on how to do this? Note: I am using Hive 0.14.
You cannot drop a column directly from a table using the command ALTER TABLE table_name drop col_name;
The only way to drop a column is to use the REPLACE COLUMNS command. Let's say I have a table emp with id, name, and dept columns, and I want to drop the id column of table emp. Provide all the columns you want to keep in the REPLACE COLUMNS clause. The command below will drop the id column from the emp table.
ALTER TABLE emp REPLACE COLUMNS( name string, dept string);
There is also a "dumb" way of achieving the end goal: create a new table without the unwanted column(s). Using Hive's regex matching makes this rather easy.
Here is what I would do:
-- make a copy of the old table
ALTER TABLE table RENAME TO table_to_dump;
-- make the new table without the columns to be deleted
CREATE TABLE table AS
SELECT `(col_to_remove_1|col_to_remove_2)?+.+`
FROM table_to_dump;
-- drop the renamed original table
DROP TABLE table_to_dump;
If the table in question is not too big, this should work just fine.
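One caveat worth adding (an assumption about the Hive version, not part of the original answer): on Hive 0.13 and later, quoted identifiers are treated as column names by default, so the regex column spec above only works after disabling that behavior:
SET hive.support.quoted.identifiers=none; -- required for regex column specs on Hive 0.13+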
Suppose you have an external table, viz. organization.employee, as follows (not including TBLPROPERTIES):
hive> show create table organization.employee;
OK
CREATE EXTERNAL TABLE `organization.employee`(
`employee_id` bigint,
`employee_name` string,
`updated_by` string,
`updated_date` timestamp)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://getnamenode/apps/hive/warehouse/organization.db/employee'
You want to remove the updated_by and updated_date columns from the table. Follow these steps:
Create a temp table replica of organization.employee:
hive> create table organization.employee_temp as select * from organization.employee;
Drop the main table organization.employee:
hive> drop table organization.employee;
Remove the underlying data from HDFS (you need to exit the hive shell):
[nameet#ip-80-108-1-111 myfile]$ hadoop fs -rm hdfs://getnamenode/apps/hive/warehouse/organization.db/employee/*
Create the table without the removed columns:
hive> CREATE EXTERNAL TABLE `organization.employee`(
`employee_id` bigint,
`employee_name` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://getnamenode/apps/hive/warehouse/organization.db/employee'
Insert the original records back into the original table:
hive> insert into organization.employee
select employee_id, employee_name from organization.employee_temp;
Finally, drop the temp table that was created:
hive> drop table organization.employee_temp;
ALTER TABLE emp REPLACE COLUMNS( name string, dept string);
The above statement can only change the schema of a table, not the data. A solution to this problem is to copy the data into a new table:
Insert <New Table> Select <selective columns> from <Old Table>
ALTER TABLE is not yet supported for non-native tables, i.e. what you get with CREATE TABLE when a STORED BY clause is specified.
Check this: https://cwiki.apache.org/confluence/display/Hive/StorageHandlers
After a lot of mistakes, in addition to the explanations above, I would add these simpler answers.
Case 1: Add a new column named new_column
ALTER TABLE schema.table_name
ADD COLUMNS (new_column INT COMMENT 'new number column');
Case 2: Rename a column new_column to no_of_days
ALTER TABLE schema.table_name
CHANGE new_column no_of_days INT;
Note that when renaming, both columns should be of the same datatype, as above with INT.
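Either way, you can confirm the schema change afterwards with a standard DESCRIBE:
DESCRIBE schema.table_name; -- lists the current columns and types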
For an external table it's simple and easy.
Just drop the table, edit the CREATE TABLE schema, and finally create the table again with the new schema.
Example table: aparup_test.tbl_schema_change, from which we will drop the column id.
Steps:
------------- show create table to fetch schema ------------------
spark.sql("""
show create table aparup_test.tbl_schema_change
""").show(100,False)
o/p:
CREATE EXTERNAL TABLE aparup_test.tbl_schema_change(name STRING, time_details TIMESTAMP, id BIGINT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
)
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'gs://aparup_test/tbl_schema_change'
TBLPROPERTIES (
'parquet.compress' = 'snappy'
)
""")
------------- drop table --------------------------------
spark.sql("""
drop table aparup_test.tbl_schema_change
""").show(100,False)
------------- edit create table schema by dropping column "id" ------------------
spark.sql("""
CREATE EXTERNAL TABLE aparup_test.tbl_schema_change(name STRING, time_details TIMESTAMP)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
)
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'gs://aparup_test/tbl_schema_change'
TBLPROPERTIES (
'parquet.compress' = 'snappy'
)
""")
------------- sync up table schema with parquet files ------------------
spark.sql("""
msck repair table aparup_test.tbl_schema_change
""").show(100,False)
==================== DONE =====================================
Even the query below is working for me:
Alter table tbl_name drop col_name
