Hive - Replace columns in ORC table

I have a Hive table stored in ORC files; this is the definition from the "create" command:
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
I want to drop a column from the end of the table, so I tried the "Alter Table - Replace Columns" command and left that column name out, but got this error:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Replacing columns cannot drop columns for table default.table. SerDe may be incompatible
Is there a way to replace columns in an ORC table in Hive?
Google failed me on this subject....
Thanks!

As per the Hive documentation, the REPLACE COLUMNS command can be used only for tables with a native SerDe (DynamicSerDe, MetadataTypedColumnsetSerDe, LazySimpleSerDe and ColumnarSerDe).
So for your case:
- Create a new table with the required columns.
- Insert into the new table from the old table.
- Rename the old table to some other name.
- Rename the new table to the old table's name.
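For example, a minimal sketch of that flow, using hypothetical names (my_table, my_table_new, my_table_old) and two placeholder columns:
-- 1. Create a new ORC table containing only the columns to keep
CREATE TABLE my_table_new (col1 STRING, col2 STRING)
STORED AS ORC;
-- 2. Copy the data, leaving out the dropped column
INSERT OVERWRITE TABLE my_table_new
SELECT col1, col2 FROM my_table;
-- 3. Swap the tables by renaming
ALTER TABLE my_table RENAME TO my_table_old;
ALTER TABLE my_table_new RENAME TO my_table;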
Thanks.

Related

How to delete fields from a partitioned table in Hive stored as parquet?

I'm looking for a way to modify a Parquet data table in Hive to remove some fields. The table is managed, but that doesn't matter because I can convert it to external.
The problem is that I cannot use the ALTER TABLE ... REPLACE COLUMNS command with partitioned Parquet tables.
It works well for the textfile format (partitioned or not), but only for non-partitioned Parquet tables.
I've tried to replace the columns, but this is the result:
hive> ALTER TABLE db_test.mytable REPLACE COLUMNS(name String);
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
Replacing columns cannot drop columns for table db_test.mytable.
SerDe may be incompatible
I've thought about some solutions, but none of them fits my scenario:
First
- [Optional] Convert the table to external.
- Delete the table.
- Re-create the table with the fields that I want.
- MSCK REPAIR TABLE to add HDFS partitions.
- [Optional] Convert back to managed table.
Second
- Create a temporary table as a selection from the original table with the fields that I choose.
- Delete the original table.
- Rename the temporary table to the original name.
Both options affect my process because I would lose the table's statistics. This table is consumed by MicroStrategy through Impala and I need to maintain the statistics.
In addition, the second solution is impractical for very large tables.
Any suggestions?
Thanks in advance.
You can use the first method and then run
hive> ANALYZE TABLE <db_name>.<table_name> COMPUTE STATISTICS;
to compute the statistics of the table.
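For example, a minimal sketch of that first method end to end, using the db_test.mytable name from the question; the partition column dt and the HDFS location are placeholders for whatever the real table uses:
-- [Optional] make the table external so dropping it keeps the data files
ALTER TABLE db_test.mytable SET TBLPROPERTIES ('EXTERNAL'='TRUE');
-- Drop and re-create the table with only the fields to keep
DROP TABLE db_test.mytable;
CREATE EXTERNAL TABLE db_test.mytable (name STRING)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
LOCATION '/path/to/mytable';
-- Re-register the existing HDFS partitions
MSCK REPAIR TABLE db_test.mytable;
-- Recompute the table statistics
ANALYZE TABLE db_test.mytable PARTITION (dt) COMPUTE STATISTICS;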

Is ROW FORMAT SERDE a compulsory parameter when creating a Hive table?

I created a temporary Hive table on top of a text file like this:
CREATE EXTERNAL TABLE tc (fc String,cno String,cs String,tr String,at String,act String,wa String,dn String,pnm String,rsk String,ttp String,tte String,aml String,pn String,ttn String)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
location '/home/hbaseuser/tc';
Then I created an ORC table like this:
CREATE EXTERNAL TABLE tc1 (fc String,cno String,cs String,tr String,at String,act String,wa String,dn String,pnm String,rsk String,ttp String,tte String,aml String,pn String,ttn String)
Row format delimited
Fields terminated by '\t'
STORED AS orc
location '/user/hbaseuser/tc1';
Then I used this command to import data into the Hive table:
insert overwrite table tc1 select * from tc;
Now the ORC files are available at '/user/hbaseuser/tc1' and I am able to read from the ORC table.
My question is: what is the use of the ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.ORCSerDe' clause?
When ROW FORMAT SERDE is specified, it overrides the native SerDe and uses the specified SerDe for the table.
As per the documentation:
You can create tables with a custom SerDe or using a native SerDe. A native SerDe is used if ROW FORMAT is not specified or ROW FORMAT DELIMITED is specified. Use the SERDE clause to create a table with a custom SerDe.
The STORED AS ORC statement is equivalent to writing:
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
You can use either the STORED AS clause or the ROW FORMAT SERDE clause. You can refer to the documentation below for more details:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RowFormats&SerDe
https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide#DeveloperGuide-HiveSerDe
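For illustration, a minimal sketch of the two equivalent declarations (table and column names are placeholders):
-- Short form: Hive fills in the ORC SerDe and input/output formats
CREATE TABLE t_orc_short (id STRING)
STORED AS ORC;
-- Long form: the same table definition spelled out explicitly
CREATE TABLE t_orc_long (id STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';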

Hive Alter External Table and Update Schema

I am looking for a command to add columns and update the schema of my Hive external table, which is backed by an Avro schema.
Here is what I have tried so far.
I have a Hive external table with an Avro-backed schema, created with this command:
CREATE EXTERNAL TABLE `person_hourly`(
`personid` string COMMENT '',
`name` string COMMENT ''
)
PARTITIONED BY (
`partitiontime` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION
'hdfs://nameservice1/web/PersonData/'
TBLPROPERTIES (
'avro.schema.url'='hdfs:///schemas/PersonV1.avsc'
)
I would like to add additional columns and update the schema for this table.
alter table person_hourly ADD COLUMNS (lastname string ) SET TBLPROPERTIES ('avro.schema.url' = 'hdfs:///schemas/PersonV2.avsc')
But I cannot do this since I get an error
FAILED: ParseException line 1:64 missing EOF at 'SET' near ')'
So I tried adding the column separately, which worked, but I cannot update the schema:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. at least one column must be specified for the table
The Data Definition Language (DDL) for ALTER TABLE can be found here
ALTER TABLE table_name SET TBLPROPERTIES table_properties;
 
table_properties:
  : (property_name = property_value, property_name = property_value, ... )
And from your comment:
I tried adding column separately, which worked
I think that's what you should do: add the column, then set the properties.
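A minimal sketch of that two-step approach, reusing the table, column, and schema path from the question:
-- Step 1: add the new column
ALTER TABLE person_hourly ADD COLUMNS (lastname string);
-- Step 2: point the table at the updated Avro schema
ALTER TABLE person_hourly SET TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/PersonV2.avsc');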
If you modify the schema in HDFS, it will be detected by Hive. Hive reads the schema at runtime; it doesn't save any schema information when you use an avsc file through avro.schema.url.
Regards,
Hector
The code below worked for me.
You can change the schema definition in the avsc file (with proper formatting) and then simply use an ALTER command that sets the path of the updated schema file:
ALTER TABLE table_name SET TBLPROPERTIES ('avro.schema.url'='path of updated schema avsc format file');

Malformed ORC file error

Upon upgrading a Hive external table from RC to ORC format and running MSCK REPAIR TABLE on it, when I do a select all from the table I get the following error:
Failed with exception java.io.IOException:java.io.IOException: Malformed ORC file hdfs://myServer:port/my_table/prtn_date=yyyymm/part-m-00000__xxxxxxxxxxxxx Invalid postscript length 1
What is the process, if there is one, for migrating RC-formatted historical data to the new ORC-formatted definition of the same table?
Hive doesn't automatically reformat the data when you add partitions. You have two choices:
- Leave the old partitions as RC files and make the new partitions ORC.
- Move the data to a staging table and use insert overwrite to re-write the data as ORC files.
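A minimal sketch of the second choice, assuming the ORC table my_table partitioned by prtn_date (as in the error message) and a staging table that still points at the RC-formatted files; the columns and paths are placeholders:
-- Staging table over the old RC data
CREATE EXTERNAL TABLE my_table_rc_staging (a STRING, b STRING)
PARTITIONED BY (prtn_date STRING)
STORED AS RCFILE
LOCATION '/path/to/old/rc/data';
MSCK REPAIR TABLE my_table_rc_staging;
-- Re-write the data into the ORC-defined table
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE my_table PARTITION (prtn_date)
SELECT a, b, prtn_date FROM my_table_rc_staging;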
Add the ROW FORMAT, INPUTFORMAT and OUTPUTFORMAT clauses to the create statement to solve the problem:
create external table xyz
(
a string,
b string)
PARTITIONED BY (
c string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION 'hdfs path';

Create new Hive table from existing external partitioned table

I have an external partitioned table with almost 500 partitions. I am trying to create another external table with the same properties as the old table. Then I want to copy all the partitions from my old table to the newly created table. Below is my create table query. My old table is stored as TEXTFILE and I want to save the new one as ORC.
add jar json_jarfile;
CREATE EXTERNAL TABLE new_table_orc (col1,col2,col3...col27)
PARTITIONED BY (year string, month string, day string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (....)
STORED AS orc
LOCATION 'path';
After creating this table, I am using the query below to insert the partitions from the old table into the new one. I only want to copy a few columns from the original table to the new table.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE new_table_orc PARTITION (year,month,day) SELECT col2,col3,col6,year,month,day FROM old_table;
ALTER TABLE new_table_orc RECOVER PARTITIONS;
I am getting the error below.
FAILED: SemanticException [Error 10044]: Line 2:23 Cannot insert into target table because column number/types are different 'day': Table insclause-0 has 27 columns, but query has 6 columns.
Any suggestions?
Your query has to match the number and type of columns in your new table. You have created your new table with 27 regular columns and 3 partition columns, but your query only selects six columns.
If you really only care about those six columns, then modify the new table to have only those columns. If you do want all columns, then modify your select statement to select all of those columns.
You also will not need the "recover partitions" statement. When you insert into a table with dynamic partitions, it will create those partitions both in the filesystem and in the metastore.
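For the six-column case, a minimal sketch (column types are placeholders, and the SERDEPROPERTIES and location details are elided as in the question; since the insert writes the data as ORC, the JSON SerDe is not needed on the new table):
-- New ORC table declared with only the columns being copied
CREATE EXTERNAL TABLE new_table_orc (
  col2 STRING,
  col3 STRING,
  col6 STRING)
PARTITIONED BY (year STRING, month STRING, day STRING)
STORED AS ORC
LOCATION 'path';
-- Dynamic-partition insert; no RECOVER PARTITIONS needed afterwards
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE new_table_orc PARTITION (year, month, day)
SELECT col2, col3, col6, year, month, day FROM old_table;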
