hive add columns on partitioned table does not work - hadoop

I'd like to share my experience with adding columns to a partitioned Hive table.
As you can see, despite the CASCADE clause, the ALTER breaks my table :(
add columns on partitioned table
table description
CREATE TABLE test (
a string,
b string,
c string
)
PARTITIONED BY (
x string,
y string,
z string
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
TBLPROPERTIES (
'orc.compress'='SNAPPY'
);
duplicate the table
CREATE TABLE test_tmp...
hadoop distcp hdfs://.../test/* hdfs://.../test_tmp
MSCK REPAIR TABLE test_tmp;
SELECT * FROM test_tmp
LIMIT 100
check : OK (I get results)
modify the table
ALTER TABLE test_tmp
ADD COLUMNS(
aa timestamp,
bb string,
cc int,
dd string
) CASCADE;
SELECT * FROM test_tmp
LIMIT 100
...
]], Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:19, Vertex vertex_1502459312997_187854_4_00 [Map 1] killed/failed due to:OWN_TASK_FAILURE]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0
... 1 statement(s) executed, 0 rows affected, exec/fetch time: 21.655/0.000 sec [0 successful, 1 errors]
check : KO (I get this error)

If you are using Hive 0.x or 1.x then you are probably a victim of...
HIVE-10598 Vectorization borks when column is added to table.
...which is specific to ORC format, even if it's not apparent from the JIRA label.
There is a partial fix as of Hive 2.0 (i.e. ADD is fixed, but DROP / RENAME / CHANGE are still crippled) thanks to
HIVE-11981 ORC Schema Evolution Issues (Vectorized, ACID, and Non-Vectorized)
And another related fix as of Hive 2.1.1 for CHANGE
HIVE-14355 Schema evolution for ORC in llap is broken for Int to String conversion
To be continued...
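Since HIVE-10598 is about vectorization, one workaround that may be worth trying on an affected version is to switch vectorized execution off for the session and rerun the query. This is only a sketch under that assumption, not a confirmed fix for this exact failure; hive.vectorized.execution.enabled is a standard Hive setting, and test_tmp is the table from the question.

```sql
-- Assumption: the failure comes from the vectorized ORC reader (HIVE-10598).
-- Disable vectorized execution for the current session only.
SET hive.vectorized.execution.enabled=false;

-- Rerun the query that failed after ALTER TABLE ... ADD COLUMNS ... CASCADE.
SELECT * FROM test_tmp
LIMIT 100;
```

If the query succeeds with vectorization off, that points to the schema evolution bugs listed above rather than to the data copy.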

Related

Creating external hive table from parquet file which contains json string

I have a parquet file stored in a partitioned directory. The partition format is
/dates=*/hour=*/something.parquet.
The content of the parquet file looks as follows:
{a:1,b:2,c:3}.
This is JSON data and I want to create an external Hive table.
My approach:
CREATE EXTERNAL TABLE test_table (a int, b int, c int) PARTITIONED BY (dates string, hour string) STORED AS PARQUET LOCATION '/user/output/';
After that I run MSCK REPAIR TABLE test_table; but I get the following output:
hive> select * from test_table;
OK
NULL NULL NULL 2021-09-27 09
The other three columns are null. I think I have to define the JSON schema somehow, but I have no idea how to proceed further.
Create a table with the same schema as the parquet file:
CREATE EXTERNAL TABLE test_table (value string) PARTITIONED BY (dates string, hour string) STORED AS PARQUET LOCATION '/user/output/';
Run repair table to mount partitions:
MSCK REPAIR TABLE test_table;
Parse value in query:
select e.a, e.b, e.c
from test_table t
lateral view json_tuple(t.value, 'a', 'b', 'c') e as a,b,c
Cast values as int if necessary: cast(e.a as int) as a
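Putting the last two steps together, the query with the casts applied might look like the sketch below. It assumes, as in this answer, that each parquet row holds the JSON text in a single string column named value.

```sql
-- json_tuple returns every field as a string, so each one is cast to INT here.
SELECT
  CAST(e.a AS INT) AS a,
  CAST(e.b AS INT) AS b,
  CAST(e.c AS INT) AS c
FROM test_table t
LATERAL VIEW json_tuple(t.value, 'a', 'b', 'c') e AS a, b, c;
```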
You can also create a table for json fields as columns using this:
CREATE EXTERNAL TABLE IF NOT EXISTS test_table(
a INT,
b INT,
c INT)
partitioned by (dates string, hour string)
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS PARQUET
location '/user/output/';
Then run MSCK REPAIR TABLE test_table;
You would be able to query directly without writing any parsers.

Hive - Insert data into partitioned table: partition not found

I'm having an issue while trying to insert new data into a Hive external partitioned table.
The table is partitioned by day, and the error I get is:
FAILED: SemanticException [Error 10006]: Line 1:51 Partition not found ''18102016''
My query is as follows:
ALTER TABLE my_source_table RECOVER PARTITIONS;
INSERT OVERWRITE TABLE my_dest_table PARTITION (d = '18102016')
SELECT
'III' AS primary_alias_type,
iii_id AS primary_alias_id,
FROM
my_source_table
WHERE
d = '18102016'
my_dest_table was created as:
CREATE EXTERNAL TABLE my_dest_table (
primary_alias_type string,
primary_alias_id
) PARTITIONED BY (d string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://my_bucket/my_external_tables/'
Any idea on what I'm doing wrong? Thanks!
I believe you should run RECOVER PARTITIONS on your destination table as well, just as you did for my_source_table:
ALTER TABLE my_dest_table RECOVER PARTITIONS;
Note: of course you should also remove the extra comma that Alex L mentioned, which would otherwise cause a separate parsing error.
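Combining both suggestions, the statements from the question could be rearranged roughly as follows. This is a sketch using only the table and column names from the question; whether recovering partitions on the destination resolves the error depends on the partition directory already existing under the table location.

```sql
-- Register any partitions that already exist under each table's location.
ALTER TABLE my_source_table RECOVER PARTITIONS;
ALTER TABLE my_dest_table RECOVER PARTITIONS;

-- Same insert as in the question, without the trailing comma before FROM.
INSERT OVERWRITE TABLE my_dest_table PARTITION (d = '18102016')
SELECT
  'III' AS primary_alias_type,
  iii_id AS primary_alias_id
FROM
  my_source_table
WHERE
  d = '18102016';
```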

Hive cannot read ORC if set "orc.create.index"="false" when loading table

Hive version: 1.2.1. I created a table as follows:
CREATE TABLE ORC_NONE(
millisec bigint,
...
)
stored as orc tblproperties ("orc.create.index"="false");
insert into table ORC_NONE select * from ex_test_convert;
But any query against the table returns nothing or NULL. For example:
Select * from ORC_NONE limit 10; -- returns nothing
Select min(millisec), max(millisec) from ORC_NONE; -- returns NULL, NULL
I checked the size of ORC_NONE: it is 2G, so the table is not empty. If I create the table with "orc.create.index"="true", the queries work.
I meant to test Hive performance on ORC with and without row indexes, or more exactly, to test the skipping power of row indexes. However, it seems that Hive cannot read the data when the row index is unavailable.
Is this a bug? Or is something wrong with my loading?

Hive, Bucketing for the partitioned table

This is my script:
--table without partition
drop table if exists ufodata;
create table ufodata ( sighted string, reported string, city string, shape string, duration string, description string )
row format delimited
fields terminated by '\t'
Location '/mapreduce/hive/ufo';
--load my data in ufodata
load data local inpath '/home/training/downloads/ufo_awesome.tsv' into table ufodata;
--create partition table
drop table if exists partufo;
create table partufo ( sighted string, reported string, city string, shape string, duration string, description string )
partitioned by ( year string )
clustered by (year) into 6 buckets
row format delimited
fields terminated by '\t';
--by default dynamic partition is not set
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
--by default bucketing is false
set hive.enforce.bucketing=true;
--loading mydata
insert overwrite table partufo
partition (year)
select sighted, reported, city, shape, min, description, SUBSTR(TRIM(sighted), 1,4) from ufodata;
Error message:
FAILED: Error in semantic analysis: Invalid column reference
I tried bucketing for my partitioned table. If I remove "clustered by (year) into 6 buckets", the script works fine. How do I bucket the partitioned table?
There is an important thing to consider when bucketing in Hive.
The same column cannot be used for both bucketing and partitioning. The reason is as follows:
Clustering and sorting happen within a partition. Inside each partition there is only one value for the partition column (in your case, year), so clustering or sorting on it would have no effect. That is the reason for your error.
You can use the syntax below to create a bucketed table with a partition.
CREATE TABLE bckt_movies
(mov_id BIGINT , mov_name STRING ,prod_studio STRING, col_world DOUBLE , col_us_canada DOUBLE , col_uk DOUBLE , col_aus DOUBLE)
PARTITIONED BY (rel_year STRING)
CLUSTERED BY(mov_id) INTO 6 BUCKETS;
When you are doing a dynamic partition insert, create a temporary table with all the columns (including your partition column) and load the data into that temporary table.
Then create the actual partitioned table with the partition column. When you load data from the temporary table, the partition column must come last in the SELECT clause.
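Applied to the tables in the question, the advice above might look like the following sketch. Bucketing on city is only an illustrative choice (any non-partition column would do), the select list uses duration because that is the column name declared in ufodata, and hive.enforce.bucketing is assumed to be needed only on Hive versions before 2.0.

```sql
-- Partition on year, but bucket on a regular column (city is an assumption).
DROP TABLE IF EXISTS partufo;
CREATE TABLE partufo (
  sighted STRING, reported STRING, city STRING,
  shape STRING, duration STRING, description STRING
)
PARTITIONED BY (year STRING)
CLUSTERED BY (city) INTO 6 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- Assumed to be required only before Hive 2.0.
SET hive.enforce.bucketing=true;

-- ufodata acts as the staging table; the partition value comes last.
INSERT OVERWRITE TABLE partufo
PARTITION (year)
SELECT sighted, reported, city, shape, duration, description,
       SUBSTR(TRIM(sighted), 1, 4) AS sighted_year
FROM ufodata;
```

Dynamic partition columns are matched by position, not by name, so the alias sighted_year is arbitrary.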

Data not getting loaded into Partitioned Table in Hive

I am trying to create a partition for my table in order to update a value.
This is my sample data:
1,Anne,Admin,50000,A
2,Gokul,Admin,50000,B
3,Janet,Sales,60000,A
I want to update Janet's Department to B.
So to do that, I created a table with Department as the partition column.
create external table trail (EmployeeID Int,FirstName
String,Designation String,Salary Int) PARTITIONED BY (Department
String) row format delimited fields terminated by "," location
'/user/sreeveni/HIVE';
But after running the above command, no data is inserted into the trail table.
hive>select * from trail;
OK
Time taken: 0.193 seconds
hive>desc trail;
OK
employeeid int None
firstname string None
designation string None
salary int None
department string None
# Partition Information
# col_name data_type comment
department string None
Am I doing anything wrong?
UPDATE
As suggested I tried to insert data into my table
load data inpath '/user/aibladmin/HIVE' overwrite into table trail
Partition(Department);
But it is showing
FAILED: SemanticException [Error 10096]: Dynamic partition strict mode
requires at least one static partition column. To turn this off set
hive.exec.dynamic.partition.mode=nonstrict
Setting hive.exec.dynamic.partition.mode=nonstrict did not help either.
Is there anything else I should do?
Try setting both of the properties below:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
And when writing an insert statement for a partitioned table, make sure the partition columns come last in the SELECT clause.
You cannot directly insert data (an HDFS file) into a partitioned Hive table.
First create a normal table, then insert that table's data into the partitioned table.
With set hive.exec.dynamic.partition.mode=strict, an insert into a Hive table must specify at least one static partition column.
With set hive.exec.dynamic.partition.mode=nonstrict, no static partition column is required.
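The two-step approach described above can be sketched as follows for the trail table. The staging table name trail_staging is hypothetical, and the input path is the one used in the question's LOAD DATA attempt; note that LOAD DATA INPATH moves the files rather than copying them.

```sql
-- Hypothetical non-partitioned staging table; Department is a plain column here.
CREATE TABLE trail_staging (
  EmployeeID INT, FirstName STRING, Designation STRING,
  Salary INT, Department STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Move the raw comma-separated files into the staging table.
LOAD DATA INPATH '/user/aibladmin/HIVE' INTO TABLE trail_staging;

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The partition column comes last in the SELECT list.
INSERT OVERWRITE TABLE trail PARTITION (Department)
SELECT EmployeeID, FirstName, Designation, Salary, Department
FROM trail_staging;
```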
Try the following:
Start by creating the table:
create external table test23 (EmployeeID Int,FirstName String,Designation String,Salary Int) PARTITIONED BY (Department String) row format delimited fields terminated by "," location '/user/rocky/HIVE';
Create a directory in HDFS named after the partition:
$ hadoop fs -mkdir /user/rocky/HIVE/department=50000
Create a local file abc.txt by filtering records having department equal to 50000:
$ cat abc.txt
1,Anne,Admin,50000,A
2,Gokul,Admin,50000,B
Put it into HDFS:
$ hadoop fs -put /home/yarn/abc.txt /user/rocky/HIVE/department=50000
Now alter the table:
ALTER TABLE test23 ADD PARTITION(department=50000);
And check the result:
select * from test23 ;
Just set those two properties before you call getOrCreate() on the Spark session:
SparkSession
.builder
.config(new SparkConf())
.appName(appName)
.enableHiveSupport()
.config("hive.exec.dynamic.partition","true")
.config("hive.exec.dynamic.partition.mode", "nonstrict")
.getOrCreate()
I ran into the same problem, and yes, these two properties are needed. However, I used the JDBC driver from Scala to set these properties before executing Hive statements. The problem was that I was sending a bunch of SET statements in a single execute call, like this:
conn = DriverManager.getConnection(conf.get[String]("hive.jdbc.url"))
conn.createStatement().execute(
"SET spark.executor.memory = 2G;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.other.statements =blabla ;")
For some reason, the driver could not interpret these as separate statements, so I had to execute each one separately:
conn = DriverManager.getConnection(conf.get[String]("hive.jdbc.url"))
conn.createStatement().execute("SET spark.executor.memory = 2G;")
conn.createStatement().execute("SET hive.exec.dynamic.partition.mode=nonstrict;")
conn.createStatement().execute("SET hive.other.statements =blabla ;")
Can you try running
MSCK REPAIR TABLE table_name;
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)
