Hive: altering to MetadataTypedColumnsetSerDe in fact results in LazySimpleSerDe - hadoop

I'm trying to alter a Hive table to set its SerDe to MetadataTypedColumnsetSerDe:
alter table some_table set serde 'org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe';
But as a result I retrieve:
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
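For context, one way to check which SerDe a table actually ends up with (using the table name from the question) is DESCRIBE FORMATTED:
-- the SerDe in effect appears in the "SerDe Library:" row of the output
DESCRIBE FORMATTED some_table;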

Related

How to add multi-level partition in hive?

I have a managed table customer in Hive, partitioned by date and customerName. My directory structure is like below:
user/hive/warehouse/test.db/customer/date1=2021-09-16/customerName=xyz
When I run show partitions customer it gives no output. So I tried to add a partition with
MSCK REPAIR TABLE customer;
It gives the error: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
ALTER TABLE customer ADD PARTITION (date1='2021-09-15') PARTITION (customerName='xyz');
It also gives the error: ValidationFailureSemanticException partition spec {customername=xyz} contain non partition column
How can I add these partitions to the Hive metastore? (See the sketch after the table definition below.)
hive> show create table customer;
OK
CREATE TABLE `customer`(
`act` string)
PARTITIONED BY (
`date1` string,
`customername` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
'path'='hdfs://hdcluster/user/hive/warehouse/test.db/customer')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://hdcluster/user/hive/warehouse/test.db/customer'
TBLPROPERTIES (
'spark.sql.create.version'='2.4.0',
'spark.sql.partitionProvider'='catalog',
'spark.sql.sources.provider'='orc',
'spark.sql.sources.schema.numPartCols'='2',
'spark.sql.sources.schema.numParts'='1',
'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"act\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"date1\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"customername\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}',
'spark.sql.sources.schema.partCol.0'='date1',
'spark.sql.sources.schema.partCol.1'='customername',
'transient_lastDdlTime'='1631781225')
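For reference, Hive's ADD PARTITION expects every partition column of the table in a single partition spec, not one spec per column, which is why the statement above fails validation. A minimal sketch of what that could look like for this table; the LOCATION points at the existing directory from the question, since its customerName= casing differs from the lowercase partition column name:
-- sketch: both partition columns in one spec, location matching the actual directory
ALTER TABLE customer ADD IF NOT EXISTS
PARTITION (date1='2021-09-16', customername='xyz')
LOCATION 'hdfs://hdcluster/user/hive/warehouse/test.db/customer/date1=2021-09-16/customerName=xyz';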

Hive Error: ORC does not support type conversion from DATE to TIMESTAMP

I have a source table in Hive with DDL as:
CREATE EXTERNAL TABLE JRNL.SOURCE_TAB(
ticket_id varchar(11),
ttr_start timestamp,
ttr_stop timestamp
)
PARTITIONED BY (
exp_dt string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://my.cluster.net:8020//db/data/SOURCE_TAB'
TBLPROPERTIES (
'last_modified_by'='edpintdatd',
'last_modified_time'='1466093031',
'serialization.null.format'='',
'transient_lastDdlTime'='1466093031')
When I am querying the table:
hive> select exp_dt from JRNL.SOURCE_TAB limit 3;
It is giving me an Exception:
Failed with exception java.io.IOException:java.io.IOException: ORC does not support type conversion from DATE to TIMESTAMP
Even when I tried to create a replica table like the above source, using:
CREATE TABLE JRNL.SOURCE_TAB_BKP(
ticket_id varchar(11),
ttr_start timestamp,
ttr_stop timestamp
)
PARTITIONED BY (exp_dt string);
and then inserting data in this table using:
INSERT INTO TABLE JRNL.SOURCE_TAB_BKP PARTITION (exp_dt)
SELECT
ticket_id,
ttr_start,
ttr_stop,
exp_dt string
FROM JRNL.SOURCE_TAB;
it is still giving me the error ORC does not support type conversion from DATE to TIMESTAMP
I tried using
to_utc_timestamp(unix_timestamp(ttr_start),'UTC'),
to_utc_timestamp(unix_timestamp(ttr_stop),'UTC'),
but this isn't helping either.
I have already set the hive.exec.dynamic.partition.mode=nonstrict.
I even used CAST(.... as DATE), CAST(.... as TIMESTAMP). Didn't work either.
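One way to narrow this down, assuming the files under the partitions really are ORC, is to dump the schema stored inside one of the ORC data files and see which column it records as DATE. The file path below is purely illustrative; on older Hive releases the invocation is hive --service orcfiledump instead:
# dump the metadata (including the stored schema) of one ORC data file
hive --orcfiledump hdfs://my.cluster.net:8020/db/data/SOURCE_TAB/exp_dt=2016-06-16/000000_0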

Alter hive table add or drop column

I have an ORC table in Hive and I want to drop a column from this table:
ALTER TABLE table_name drop col_name;
but I am getting the following exception
Error occurred executing hive query: OK FAILED: ParseException line 1:35 mismatched input 'user_id1' expecting PARTITION near 'drop' in drop partition statement
Can anyone help me or provide any idea how to do this? Note: I am using Hive 0.14.
You cannot drop a column directly from a table using the command ALTER TABLE table_name drop col_name;
The only way to drop a column is to use the REPLACE COLUMNS command. Let's say I have a table emp with columns id, name, and dept, and I want to drop the id column of table emp. In the REPLACE COLUMNS clause, provide all the columns that you want to remain part of the table. The command below will drop the id column from the emp table.
ALTER TABLE emp REPLACE COLUMNS( name string, dept string);
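A minimal end-to-end sketch of that emp scenario (the column types here are assumptions, since the answer only names the columns):
-- hypothetical starting point
CREATE TABLE emp (id INT, name STRING, dept STRING);
-- drop id by listing only the columns to keep
ALTER TABLE emp REPLACE COLUMNS (name STRING, dept STRING);
-- verify the remaining columns
DESCRIBE emp;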
There is also a "dumb" way of achieving the end goal: create a new table without the unwanted column(s). Using Hive's regex column matching makes this rather easy.
Here is what I would do:
-- make a copy of the old table
ALTER TABLE table RENAME TO table_to_dump;
-- make the new table without the columns to be deleted
CREATE TABLE table AS
SELECT `(col_to_remove_1|col_to_remove_2)?+.+`
FROM table_to_dump;
-- dump the table
DROP TABLE table_to_dump;
If the table in question is not too big, this should work just fine.
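One caveat: on Hive 0.13 and later, the backtick-quoted regex column syntax used above is only honored when quoted identifiers are disabled, e.g.:
SET hive.support.quoted.identifiers=none;
-- with this setting, the backtick pattern is interpreted as a regex over column names
SELECT `(col_to_remove_1|col_to_remove_2)?+.+` FROM table_to_dump;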
Suppose you have an external table, viz. organization.employee, as follows (not including TBLPROPERTIES):
hive> show create table organization.employee;
OK
CREATE EXTERNAL TABLE `organization.employee`(
`employee_id` bigint,
`employee_name` string,
`updated_by` string,
`updated_date` timestamp)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://getnamenode/apps/hive/warehouse/organization.db/employee'
You want to remove the updated_by and updated_date columns from the table. Follow these steps:
create a temp table replica of organization.employee as:
hive> create table organization.employee_temp as select * from organization.employee;
drop the main table organization.employee.
hive> drop table organization.employee;
remove the underlying data from HDFS (you need to come out of the hive shell for this)
[nameet#ip-80-108-1-111 myfile]$ hadoop fs -rm hdfs://getnamenode/apps/hive/warehouse/organization.db/employee/*
create the table with removed columns as required:
hive> CREATE EXTERNAL TABLE `organization.employee`(
`employee_id` bigint,
`employee_name` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://getnamenode/apps/hive/warehouse/organization.db/employee'
insert the original records back into the original table:
hive> insert into organization.employee
select employee_id, employee_name from organization.employee_temp;
finally, drop the temp table that was created:
hive> drop table organization.employee_temp;
ALTER TABLE emp REPLACE COLUMNS( name string, dept string);
The above statement can only change the schema of a table, not the data.
A solution to this problem is to copy the data into a new table:
INSERT INTO <new_table> SELECT <selected columns> FROM <old_table>;
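For instance, reusing the emp example from the earlier answer (the table name emp_new and its column types are just illustrative):
-- new table keeps only the columns we want
CREATE TABLE emp_new (name STRING, dept STRING);
INSERT INTO TABLE emp_new SELECT name, dept FROM emp;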
ALTER TABLE is not yet supported for non-native tables, i.e. what you get with CREATE TABLE when a STORED BY clause is specified.
Check this: https://cwiki.apache.org/confluence/display/Hive/StorageHandlers
After a lot of mistakes, and in addition to the above explanations, I would add these simpler answers.
Case 1: Add a new column named new_column
ALTER TABLE schema.table_name
ADD COLUMNS (new_column INT COMMENT 'new number column');
Case 2: Rename a column new_column to no_of_days
ALTER TABLE schema.table_name
CHANGE new_column no_of_days INT;
Note that when renaming, both columns should be of the same datatype, like INT above.
For an external table it's simple and easy.
Just drop the table, edit the CREATE TABLE statement, and finally recreate the table with the new schema.
Example table: aparup_test.tbl_schema_change, from which we will drop the column id.
Steps:
------------- show create table to fetch schema ------------------
spark.sql("""
show create table aparup_test.tbl_schema_change
""").show(100,False)
o/p:
CREATE EXTERNAL TABLE aparup_test.tbl_schema_change(name STRING, time_details TIMESTAMP, id BIGINT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
)
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'gs://aparup_test/tbl_schema_change'
TBLPROPERTIES (
'parquet.compress' = 'snappy'
)
""")
------------- drop table --------------------------------
spark.sql("""
drop table aparup_test.tbl_schema_change
""").show(100,False)
------------- edit create table schema by dropping column "id"------------------
spark.sql("""
CREATE EXTERNAL TABLE aparup_test.tbl_schema_change(name STRING, time_details TIMESTAMP)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
)
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'gs://aparup_test/tbl_schema_change'
TBLPROPERTIES (
'parquet.compress' = 'snappy'
)
""")
------------- sync up table schema with parquet files ------------------
spark.sql("""
msck repair table aparup_test.tbl_schema_change
""").show(100,False)
==================== DONE =====================================
Even the below query is working for me:
Alter table tbl_name drop col_name

How to change the FIELD TERMINATED value for an existing Hive table?

I currently have a table t1 which was created with a value of '\t' in its FIELDS TERMINATED BY clause.
Now I would like to change that particular clause in the structure of the table t1.
Is there any way to ALTER the FIELDS TERMINATED BY clause after creation?
hive >
ALTER TABLE table_name
set serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim' = '|');
It works. Check DESC FORMATTED tbl_name before and after applying the query. Hope this helps!
As already stated by Randall, it did not work directly.
So the solution below is the one that seems to work:
ALTER TABLE table_name SET SERDEPROPERTIES ('field.delim' = ',');

describe extended table in Hive

I am storing the table in SequenceFile format and I am setting the commands below to enable SequenceFile with BLOCK compression:
set mapred.output.compress=true;
set mapred.output.compression.type=BLOCK;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec;
But when I tried viewing the table like this:
describe extended lip_table
I got the information below, in which there is a field called compressed that is set to false. Does that mean my data did not get compressed despite setting the above three commands?
Detailed Table Information Table(tableName:lip_table, dbName:default, owner:uname,
createTime:1343931235, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:
[FieldSchema(name:buyer_id, type:bigint, comment:null), FieldSchema(name:total_chkout,
type:bigint, comment:null), FieldSchema(name:total_errpds, type:bigint, comment:null)],
location:hdfs://ares-nn/apps/hdmi/uname/lip-data,
inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat,
outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat,
**compressed:false**, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:
{serialization.format= , field.delim=
I found this article that I think gives the solution to your problem.
You should instead specify your compression codec at the level of your table definition, either when creating the table or by using an ALTER statement.
At creation time:
CREATE EXTERNAL TABLE lip_table (
column1 string
, column2 string
)
PARTITIONED BY (date string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
LOCATION '/path/to/hive/tables/lip';
Using ALTER (only affects partitions created subsequently):
ALTER TABLE lip_table
SET FILEFORMAT
INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
http://www.mrbalky.com/2011/02/24/hive-tables-partitions-and-lzo-compression/
To avoid a SerDe exception, specify the SerDe class too:
ALTER TABLE <<table name>>
SET FILEFORMAT
INPUTFORMAT "<<input format class>>"
OUTPUTFORMAT "<<output format class>>"
SERDE "<<serde class>>";

Resources