Hive Alter External Table and Update Schema - hadoop

I am looking for a command to add columns and update the schema of my Hive external table backed by an Avro schema.
Here is what I have tried so far.
I have a Hive external table with an Avro-backed schema, created with this command:
CREATE EXTERNAL TABLE `person_hourly`(
`personid` string COMMENT '',
`name` string COMMENT ''
)
PARTITIONED BY (
`partitiontime` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION
'hdfs://nameservice1/web/PersonData/'
TBLPROPERTIES (
'avro.schema.url'='hdfs:///schemas/PersonV1.avsc'
)
I would like to add additional columns and update the schema for this table.
alter table person_hourly ADD COLUMNS (lastname string ) SET TBLPROPERTIES ('avro.schema.url' = 'hdfs:///schemas/PersonV2.avsc')
But I cannot do this since I get an error
FAILED: ParseException line 1:64 missing EOF at 'SET' near ')'
So I tried adding the column separately, which worked, but I still cannot update the schema:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. at least one column must be specified for the table

The Data Definition Language (DDL) for ALTER TABLE can be found in the Hive documentation:
ALTER TABLE table_name SET TBLPROPERTIES table_properties;
 
table_properties:
  : (property_name = property_value, property_name = property_value, ... )
And per your comment:
I tried adding column separately, which worked
I think that's what you should do: add the column first, then set the table properties, as two separate statements (see the sketch below).
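A minimal sketch of that two-step approach, reusing the table name and the PersonV2.avsc path from your question:
-- 1. Add the new column to the table definition
ALTER TABLE person_hourly ADD COLUMNS (lastname string);
-- 2. Point the table at the updated Avro schema file
ALTER TABLE person_hourly
  SET TBLPROPERTIES ('avro.schema.url' = 'hdfs:///schemas/PersonV2.avsc');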

If you modify the schema file in HDFS, it will be detected by Hive. Hive reads the schema at runtime; it doesn't save any schema information when you use an .avsc file through avro.schema.url.
Regards,
Hector

The command below worked for me.
You can change the schema definition in the .avsc file (with proper formatting) and then simply use an ALTER command that points the table at the updated schema file.
ALTER TABLE table_name SET TBLPROPERTIES ('avro.schema.url' = '<path of updated avsc schema file>');

Related

Hive - Replace columns in ORC table

I have a Hive table saved as ORC files; this is the definition in the "create" command:
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
I want to drop a column from the end, so I tried the "Alter Table - Replace Columns" command, where I didn't write the column name - but got this error:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Replacing columns cannot drop columns for table default.table. SerDe may be incompatible
Is there a way to replace columns in a ORC table in Hive?
Google failed me on this subject....
Thanks!
As per the Hive documentation, the REPLACE COLUMNS command can be used only on tables with a native SerDe (DynamicSerDe, MetadataTypedColumnsetSerDe, LazySimpleSerDe and ColumnarSerDe).
So for your case (see the sketch after these steps):
Create a new table with the required columns.
Insert into the new table from the old table.
Rename the old table to some other name.
Rename the new table to the old table's name.
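A minimal sketch of these steps, with hypothetical names (my_table has columns a and b plus an unwanted trailing column c):
-- 1. Create a new ORC table with only the required columns
CREATE TABLE my_table_new (a string, b string) STORED AS ORC;
-- 2. Copy the data across, dropping the trailing column
INSERT INTO TABLE my_table_new SELECT a, b FROM my_table;
-- 3. Rename the old table out of the way
ALTER TABLE my_table RENAME TO my_table_old;
-- 4. Give the new table the original name
ALTER TABLE my_table_new RENAME TO my_table;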
Thanks.

Create hive table from table schema stored in .avsc file

I have a Hive table schema stored in an HDFS file, schema.avsc.
I want to create a Hive table with the same schema and load data into it from another HDFS path where the data is stored.
1: How can I create the table?
2: How can I load data stored in HDFS files into the created table?
How can I create the table?
The Apache Hive documentation on the AvroSerDe shows the syntax for creating a table based on an Avro schema stored in a file. For convenience, I'll repeat one of the examples here:
CREATE TABLE kst
PARTITIONED BY (ds string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
'avro.schema.url'='http://schema_provider/kst.avsc');
This example pulls the schema file from a web server. The documentation also shows other options, such as pulling from a local file, depending on your specific needs.
I recommend reading the entire AvroSerDe documentation page. There is a lot of useful information there about getting the most out of using Hive with Avro.
How can I load data stored in HDFS files into the created table?
You can define an external table that references the existing HDFS files. The documentation page for External Tables shows the syntax. Repeating an example:
CREATE EXTERNAL TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User',
country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
STORED AS TEXTFILE
LOCATION '<hdfs_location>';
After defining the external table, you can then use an INSERT-SELECT query that reads from the external table and writes to the Avro table. The documentation on Inserting data into Hive Tables from queries describes the INSERT-SELECT syntax. For example:
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.cnt

Can we use TEXT FILE format for Hive Table with Snappy compression?

I have a Hive external table in HDFS and I am trying to create a Hive managed table on top of it. I am using the TEXTFILE format with Snappy compression, but I want to know how it helps the table.
CREATE TABLE standard_cd
(
last_update_dttm TIMESTAMP,
last_operation_type CHAR (1) ,
source_commit_dttm TIMESTAMP,
transaction_dttm TIMESTAMP ,
transaction_type CHAR (1)
)
PARTITIONED BY (process_dt DATE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
TBLPROPERTIES ("orc.compress" = "SNAPPY");
Let me know if there are any issues with creating it in this format.
As such, there is no issue while creating it, but there is a difference in the table properties:
Table created and stored as TEXTFILE: (screenshot of the table properties not reproduced here)
Table created and stored as ORC: (screenshot of the table properties not reproduced here)
The size of both tables was the same, though, after loading some data.
Also check the documentation about the ORC file format.
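As a side note, 'orc.compress' is an ORC-specific property, so on a TEXTFILE table it most likely just sits in TBLPROPERTIES without doing anything. A minimal sketch of how Snappy-compressed text output is typically enabled instead, via session settings (the staging table source_table is hypothetical):
-- enable compressed output for query results written to the table
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- subsequent inserts into the TEXTFILE table then write Snappy-compressed text files
INSERT OVERWRITE TABLE standard_cd PARTITION (process_dt='2016-01-01')
SELECT last_update_dttm, last_operation_type, source_commit_dttm,
       transaction_dttm, transaction_type
FROM source_table;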

External Hive table from AVRO files says it has no data

I created an external Hive table that points to a location that has several Avro files. The create statement worked without any issues and it created the expected columns. However, the table has no data when I try to run a query. I tried to create the table a few different ways and couldn't get it to work. I have also verified that the directory has the Avro files.
CREATE EXTERNAL TABLE table_name
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED as INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/path/to/avro/data/'
TBLPROPERTIES ('avro.schema.url'='/path/to/schema/ags.avsc');
CREATE EXTERNAL TABLE table_name
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED as AVRO
LOCATION '/path/to/avro/data/'
TBLPROPERTIES ('avro.schema.url'='/path/to/schema/ags.avsc');
Any ideas?
Turns out the schema file (which was produced by Sqoop) was incorrect. I ended up creating a new schema file by using "avro-tools getschema". Once I used that schema file, everything worked as expected.
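For reference, a minimal sketch of re-pointing the table at the regenerated schema, assuming the corrected file was uploaded to a hypothetical HDFS path (the regeneration itself happens outside Hive, e.g. with avro-tools getschema on one of the data files):
-- re-point the existing table at the corrected schema file (path is hypothetical)
ALTER TABLE table_name
  SET TBLPROPERTIES ('avro.schema.url' = '/path/to/schema/ags_fixed.avsc');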

Malformed ORC file error

Upon upgrading a Hive external table from RC to ORC format and running MSCK REPAIR TABLE on it, when I do a select all from the table, I get the following error:
Failed with exception java.io.IOException:java.io.IOException: Malformed ORC file hdfs://myServer:port/my_table/prtn_date=yyyymm/part-m-00000__xxxxxxxxxxxxx Invalid postscript length 1
What is the process, if there is one, for migrating RC-formatted historical data to the new ORC-formatted definition of the same table?
Hive doesn't automatically reformat the data when you add partitions. You have two choices:
Leave the old partitions as RC files and make the new partitions ORC.
Move the data to a staging table and use INSERT OVERWRITE to rewrite the data as ORC files (see the sketch below).
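A minimal sketch of the second option, with a hypothetical staging table my_table_rc that still holds the historical data in RC format, reusing the prtn_date partition column from the error message (col1 and col2 stand in for the real columns):
-- allow all partitions to be rewritten from a single query
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- rewrite the RC-formatted rows as ORC files under the new table definition
INSERT OVERWRITE TABLE my_table PARTITION (prtn_date)
SELECT col1, col2, prtn_date
FROM my_table_rc;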
Add the row format, input format and output format in the create statement to solve the problem:
create external table xyz
(
a string,
b string)
PARTITIONED BY (
c string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION "hdfs path";
