hive external partitioned table - hadoop

First i created hive external table partitioned by code and date
CREATE EXTERNAL TABLE IF NOT EXISTS XYZ
(
ID STRING,
SAL BIGINT,
NAME STRING,
)
PARTITIONED BY (CODE INT,DATE STRING)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT "parquet.hive.DeprecatedParquetInputFormat"
OUTPUTFORMAT "parquet.hive.DeprecatedParquetOutputFormat"
LOCATION '/old_work/XYZ';
and then i execute insert overwrite on this table taking data from other table
INSERT OVERWRITE TABLE XYZ PARTITION (CODE,DATE)
SELECT
*
FROM TEMP_XYZ;
and after that i count the number of records in hive
select count(*) from XYZ;
it shows me 1000 records are there
and then i rename or move the location '/old_work/XYZ' to '/new_work/XYZ'
and then i again drop the XYZ table and created again pointing location to new directory
means '/new_work/XYZ'
CREATE EXTERNAL TABLE IF NOT EXISTS XYZ
(
ID STRING,
SAL BIGINT,
NAME STRING,
)
PARTITIONED BY (CODE INT,DATE STRING)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT "parquet.hive.DeprecatedParquetInputFormat"
OUTPUTFORMAT "parquet.hive.DeprecatedParquetOutputFormat"
LOCATION '/new_work/XYZ';
But then when i execute select count(*) from XYZ table in hive , it shows 0 records ,
i think i missed something , please help me on this????

You need not drop the table and re create it the second time:
As soon as you move or rename a external hdfs location of the table just do this :
msck repair table <table_name>
In your case the error was because, The hive metastore wasnt updated with the new path .

Related

Creating external hive table from parquet file which contains json string

I have a parquet file which is stored in a partitioned directory. The format of the partition is
/dates=*/hour=*/something.parquet.
The content of parquet file looks like as follows:
{a:1,b:2,c:3}.
This is json data and i want to create external hive table.
My approach:
CREATE EXTERNAL TABLE test_table (a int, b int, c int) PARTITIONED BY (dates string, hour string) STORED AS PARQUET LOCATION '/user/output/';
After that i run MSCK REPAIR TABLE test_table; but i get following output:
hive> select * from test_table;
OK
NULL NULL NULL 2021-09-27 09
The other three columns are null. I think i have to define JSON schema somehow but i have no idea how to proceed further.
Create table with the same schema as parquet file:
CREATE EXTERNAL TABLE test_table (value string) PARTITIONED BY (dates string, hour string) STORED AS PARQUET LOCATION '/user/output/';
Run repair table to mount partitions:
MSCK REPAIR TABLE test_table;
Parse value in query:
select e.a, e.b, e.c
from test_table t
lateral view json_tuple(t.value, 'a', 'b', 'c') e as a,b,c
Cast values as int if necessary: cast(e.a as int) as a
You can also create a table for json fields as columns using this:
CREATE EXTERNAL TABLE IF NOT EXISTS test_table(
a INT,
b INT,
c INT)
partitioned by (dates string, hour string)
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS PARQUET
location '/user/output/';
Then run MSCK REPAIR TABLE test_table;
You would be able to query directly without writing any parsers.

How to add multi-level partition in hive?

I have customer managed table in the hive, partition based on date and customerName. My directory structure is like below:
user/hive/warehouse/test.db/customer/date1=2021-09-16/customerName=xyz
when I am doing show partitions customer it is not giving output. So I tried to add a partition with
MSCK REPAIR TABLE customer;
It give error Execution Error,return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
ALTER TABLE customer ADD PARTITION (date1='2021-09-15') PARTITION (customerName='xyz');
It also give error ValidationFailureSemanticException partition spec {customername=xyz} contain non partition column
How can I add these partitions in hive metastore.
hive> show create table customer;
OK
CREATE TABLE `customer`(
`act` string)
PARTITIONED BY (
`date1` string,
`customername` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
'path'='hdfs://hdcluster/user/hive/warehouse/test.db/customer')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://hdcluster/user/hive/warehouse/test.db/customer'
TBLPROPERTIES (
'spark.sql.create.version'='2.4.0',
'spark.sql.partitionProvider'='catalog',
'spark.sql.sources.provider'='orc',
'spark.sql.sources.schema.numPartCols'='2',
'spark.sql.sources.schema.numParts'='1',
'spark.sql.sources.schema.part.0'=
'{\"type\":\"struct\",\"fields\":
[{\"name\":\"act\",\"type\":\"string\",\"nullable\":true,\"metadata\":
{}}, {\"name\":\"date1\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},
{\"name\":\"customername\",\"type\":\"string\",\"nullable\":true,\"metadata\
":{}}]}','spark.sql.sources.schema.partCol.0'='date1',
'spark.sql.sources.schema.partCol.1'='customername',
'transient_lastDdlTime'='1631781225')

How I avoid the "NULL" in the first "Field Name" of Hive table

First of all I have created the table "emp" in Hive by using below commands:
create table emp (id INT, name STRING, address STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
Then load the data in this "emp" table by this below command:
LOAD DATA LOCAL INPATH '\home\cloudera\Desktop\emp.txt' overwrite into table emp;
When I select the data from "emp" table: it show me first field of table Null
like this:
You have an header row in your file and the first value id cannot be converted into INT therefore being replaced by NULL.
add tblproperties ("skip.header.line.count"="1") to your table definition
For an existing table -
alter table emp set tblproperties ("skip.header.line.count"="1");

Unable to load data in Hive partitioned table

I have created a table in Hive with the following query:
create table if not exists employee(CASE_NUMBER String,
CASE_STATUS String,
CASE_RECEIVED_DATE DATE,
DECISION_DATE DATE,
EMPLOYER_NAME STRING,
PREVAILING_WAGE_PER_YEAR BIGINT,
PAID_WAGE_PER_YEAR BIGINT,
order_n int) partitioned by (JOB_TITLE_SUBGROUP STRING) row format delimited fields terminated by ',';
I tried loading data into the create table using below query:
LOAD DATA INPATH '/salary_data.csv' overwrite into table employee partition (JOB_TITLE_SUBGROUP);
For the partitioned table, I have even set following configuration :
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
But I am getting below error while executing the load query:
Your query has the following error(s):
Error while compiling statement: FAILED: SemanticException org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Invalid partition key & values; keys [job_title_subgroup, ], values [])
Please help.
If you want to load data into a Hive partition, you have to provide the value of the partition itself in the LOAD DATA query. So in this case, your query would be something like this.
LOAD DATA INPATH '/salary_data.csv' overwrite into table employee partition (JOB_TITLE_SUBGROUP="Value");
Where "Value" is the name of the partition in which you are loading your data. The reason is because Hive will use "Value" to create the directory in which your .csv is going to be stored, which will be something like this: .../employee/JOB_TITLE_SUBGROUP=Value. I hope this helps.
Check the documentation for details on the LOAD DATA syntax.
EDITED
Since the table has dynamic partition, one solution would be loading the .csv into an external table (e.g. employee_external) and then execute an INSERT command like this:
INSERT OVERWRITE INTO TABLE employee PARTITION(JOB_TITLE_SUBGROUP)
SELECT CASE_NUMBER, CASE_STATUS, (...), JOB_TITLE_SUBGROUP
FROM employee_external
I might be little late to reply but can try below steps:
Set below properties first :
Ø set hive.exec.dynamic.partition.mode=nonstrict;
Ø set hive.exec.dynamic.partition=true;
Create temp table first:
CREATE EXTERNAL TABLE IF NOT EXISTS employee_temp(
ID STRING,
Name STRING,
Salary STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
tblproperties ("skip.header.line.count"="1");
Load Data in temporary table:
hive> LOAD DATA INPATH 'filepath/employee.csv' OVERWRITE INTO TABLE employee;
Create Partitioned Table:
CREATE EXTERNAL TABLE IF NOT EXISTS employee_part(
ID STRING,
Name STRING)
PARTITIONED BY (Salary STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
tblproperties ("skip.header.line.count"="1");
Load Data into partitioned table from intermediate / temp table:
INSERT OVERWRITE TABLE employee_part PARTITION (SALARY) SELECT * FROM employee;

Alter hive table add or drop column

I have orc table in hive I want to drop column from this table
ALTER TABLE table_name drop col_name;
but I am getting the following exception
Error occurred executing hive query: OK FAILED: ParseException line 1:35 mismatched input 'user_id1' expecting PARTITION near 'drop' in drop partition statement
Can any one help me or provide any idea to do this? Note, I am using hive 0.14
You cannot drop column directly from a table using command ALTER TABLE table_name drop col_name;
The only way to drop column is using replace command. Lets say, I have a table emp with id, name and dept column. I want to drop id column of table emp. So provide all those columns which you want to be the part of table in replace columns clause. Below command will drop id column from emp table.
ALTER TABLE emp REPLACE COLUMNS( name string, dept string);
There is also a "dumb" way of achieving the end goal, is to create a new table without the column(s) not wanted. Using Hive's regex matching will make this rather easy.
Here is what I would do:
-- make a copy of the old table
ALTER TABLE table RENAME TO table_to_dump;
-- make the new table without the columns to be deleted
CREATE TABLE table AS
SELECT `(col_to_remove_1|col_to_remove_2)?+.+`
FROM table_to_dump;
-- dump the table
DROP TABLE table_to_dump;
If the table in question is not too big, this should work just well.
suppose you have an external table viz. organization.employee as: (not including TBLPROPERTIES)
hive> show create table organization.employee;
OK
CREATE EXTERNAL TABLE `organization.employee`(
`employee_id` bigint,
`employee_name` string,
`updated_by` string,
`updated_date` timestamp)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://getnamenode/apps/hive/warehouse/organization.db/employee'
You want to remove updated_by, updated_date columns from the table. Follow these steps:
create a temp table replica of organization.employee as:
hive> create table organization.employee_temp as select * from organization.employee;
drop the main table organization.employee.
hive> drop table organization.employee;
remove the underlying data from HDFS (need to come out of hive shell)
[nameet#ip-80-108-1-111 myfile]$ hadoop fs -rm hdfs://getnamenode/apps/hive/warehouse/organization.db/employee/*
create the table with removed columns as required:
hive> CREATE EXTERNAL TABLE `organization.employee`(
`employee_id` bigint,
`employee_name` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://getnamenode/apps/hive/warehouse/organization.db/employee'
insert the original records back into original table.
hive> insert into organization.employee
select employee_id, employee_name from organization.employee_temp;
finally drop the temp table created
hive> drop table organization.employee_temp;
ALTER TABLE emp REPLACE COLUMNS( name string, dept string);
Above statement can only change the schema of a table, not data.
A solution of this problem to copy data in a new table.
Insert <New Table> Select <selective columns> from <Old Table>
ALTER TABLE is not yet supported for non-native tables; i.e. what you get with CREATE TABLE when a STORED BY clause is specified.
check this https://cwiki.apache.org/confluence/display/Hive/StorageHandlers
After a lot of mistakes, in addition to above explanations, I would add simpler answers.
Case 1: Add new column named new_column
ALTER TABLE schema.table_name
ADD new_column INT COMMENT 'new number column');
Case 2: Rename a column new_column to no_of_days
ALTER TABLE schema.table_name
CHANGE new_column no_of_days INT;
Note that in renaming, both columns should be of same datatype like above as INT
For external table its simple and easy.
Just drop the table schema then edit create table schema , at last again create table with new schema.
example table: aparup_test.tbl_schema_change and will drop column id
steps:-
------------- show create table to fetch schema ------------------
spark.sql("""
show create table aparup_test.tbl_schema_change
""").show(100,False)
o/p:
CREATE EXTERNAL TABLE aparup_test.tbl_schema_change(name STRING, time_details TIMESTAMP, id BIGINT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
)
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'gs://aparup_test/tbl_schema_change'
TBLPROPERTIES (
'parquet.compress' = 'snappy'
)
""")
------------- drop table --------------------------------
spark.sql("""
drop table aparup_test.tbl_schema_change
""").show(100,False)
------------- edit create table schema by dropping column "id"------------------
CREATE EXTERNAL TABLE aparup_test.tbl_schema_change(name STRING, time_details TIMESTAMP)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
)
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'gs://aparup_test/tbl_schema_change'
TBLPROPERTIES (
'parquet.compress' = 'snappy'
)
""")
------------- sync up table schema with parquet files ------------------
spark.sql("""
msck repair table aparup_test.tbl_schema_change
""").show(100,False)
==================== DONE =====================================
Even below query is working for me.
Alter table tbl_name drop col_name

Resources