External Hive table from AVRO files says it has no data - hadoop

I created an external Hive table that points to a location containing several Avro files. The CREATE statement worked without any issues and produced the expected columns. However, the table has no data when I run a query. I tried creating the table a few different ways and couldn't get it to work. I have also verified that the directory contains the Avro files.
CREATE EXTERNAL TABLE table_name
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED as INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/path/to/avro/data/'
TBLPROPERTIES ('avro.schema.url'='/path/to/schema/ags.avsc');
CREATE EXTERNAL TABLE table_name
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED as AVRO
LOCATION '/path/to/avro/data/'
TBLPROPERTIES ('avro.schema.url'='/path/to/schema/ags.avsc');
Any ideas?

It turns out the schema file (which was produced by Sqoop) was incorrect. I ended up creating a new schema file using "avro-tools getschema". Once I used that schema file, everything worked as expected.
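For anyone hitting the same symptom, the fix looked roughly like this: regenerate the schema from one of the data files with avro-tools, push it to HDFS, and point avro.schema.url at it. This is a sketch only; the jar version and all paths below are placeholders, not values from my cluster.
-- regenerate the schema outside Hive first, for example:
--   java -jar avro-tools-<version>.jar getschema part-m-00000.avro > ags.avsc
--   hdfs dfs -put ags.avsc /path/to/schema/ags.avsc
-- then recreate the table against the corrected schema:
DROP TABLE IF EXISTS table_name;
CREATE EXTERNAL TABLE table_name
STORED AS AVRO
LOCATION '/path/to/avro/data/'
TBLPROPERTIES ('avro.schema.url'='/path/to/schema/ags.avsc');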

Related

Hive Table Creation based on file structure

I have one doubt: is there any way in Hive to create the table during the load, either into the Hive warehouse or as an external table?
As I understand it, Hive is schema-on-read, so the table structure must match the file structure. But what if the file is huge and we don't know its structure, for example the columns and their datatypes?
How do we load such files into a Hive table?
So, in short: how do I load a file from HDFS into a Hive table without knowing its schema?
New to Hive, pardon me if my understanding is wrong.
Thanks
By using Sqoop you can create a Hive table while importing the data.
Please refer to this link to create a Hive table while importing data.
(or)
If you have imported the data in Avro format, you can generate the Avro schema using /usr/bin/Avro/avro-tools-*.jar, then use the generated schema while creating the table in Hive; Hive will use that schema to read the data from HDFS.
Please refer to this link to extract the schema from an Avro data file.
(or)
When importing data with sqoop --as-avrodatafile, Sqoop creates an .avsc file containing the schema, so we can use that .avsc file when creating the table:
CREATE EXTERNAL TABLE avro_tbl
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED as INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '<hdfs-location>'
TBLPROPERTIES ('avro.schema.url'='<schema-file>');
(or)
If you use NiFi to import the data, NiFi pulls it in Avro format; with the ExtractAvroMetadata processor we can extract the Avro schema, store it in HDFS, and create the table using that schema.
If you want the table in ORC format instead, the ConvertAvroToOrc processor adds a hive.ddl attribute to the flowfile, and we can execute that DDL statement to create the ORC table in Hive.
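For illustration, the DDL you end up executing in the ORC case is along these lines; the table name, columns, and location here are made up, so check the actual hive.ddl attribute your flow produces:
CREATE EXTERNAL TABLE IF NOT EXISTS orc_tbl (id INT, name STRING)
STORED AS ORC
LOCATION '<hdfs-location-of-orc-files>';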

Hive Alter External Table and Update Schema

I am looking for a command to add columns and update the schema for my Hive external table backed by an Avro schema.
Here is what I have tried so far.
I have a Hive external table with an Avro-backed schema, created with this command:
CREATE EXTERNAL TABLE `person_hourly`(
`personid` string COMMENT '',
`name` string COMMENT ''
)
PARTITIONED BY (
`partitiontime` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION
'hdfs://nameservice1/web/PersonData/'
TBLPROPERTIES (
'avro.schema.url'='hdfs:///schemas/PersonV1.avsc'
)
I would like to add additional columns and update the schema for this table:
alter table person_hourly ADD COLUMNS (lastname string ) SET TBLPROPERTIES ('avro.schema.url' = 'hdfs:///schemas/PersonV2.avsc')
But I cannot do this since I get an error
FAILED: ParseException line 1:64 missing EOF at 'SET' near ')'
So I tried adding the column separately, which worked, but then I cannot update the schema:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. at least one column must be specified for the table
The Data Definition Language (DDL) for ALTER TABLE can be found here
ALTER TABLE table_name SET TBLPROPERTIES table_properties;
 
table_properties:
  : (property_name = property_value, property_name = property_value, ... )
And your comment:
"I tried adding the column separately, which worked"
I think that's what you should do. Add the column, then set the properties.
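For the table in the question, that would be roughly these two statements, run one after the other (reusing the question's own paths):
ALTER TABLE person_hourly ADD COLUMNS (lastname string);
ALTER TABLE person_hourly SET TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/PersonV2.avsc');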
If you modify the schema file in HDFS, the change will be picked up by Hive. Hive reads the schema at runtime; it doesn't store any schema information when you use an .avsc file through avro.schema.url.
Regards,
Hector
The code below worked for me.
You can change the schema definition in the .avsc file (with proper formatting) and then simply run an ALTER command that sets the path of the updated schema file:
ALTER TABLE table_name SET TBLPROPERTIES ('avro.schema.url'='<path of updated schema .avsc file>');

Create hive table from table schema stored in .avsc file

I have a Hive table schema stored in an HDFS file, schema.avsc.
I want to create a Hive table with the same schema and load into it data that is stored at another HDFS path.
1: How can I create the table?
2: How can I load the data stored in HDFS into the created table?
How can I create the table?
The Apache Hive documentation on the AvroSerDe shows the syntax for creating a table based on an Avro schema stored in a file. For convenience, I'll repeat one of the examples here:
CREATE TABLE kst
PARTITIONED BY (ds string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
'avro.schema.url'='http://schema_provider/kst.avsc');
This example pulls the schema file from a web server. The documentation also shows other options, such as pulling from a local file, depending on your specific needs.
I recommend reading the entire AvroSerDe documentation page. There is a lot of useful information there about getting the most out of using Hive with Avro.
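For the asker's case, where the schema already sits in an HDFS file, the same pattern applies with an hdfs URL; the table name and path below are placeholders for wherever schema.avsc actually lives:
CREATE TABLE my_avro_table
STORED AS AVRO
TBLPROPERTIES ('avro.schema.url'='hdfs:///path/to/schema.avsc');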
How can I load the data stored in HDFS into the created table?
You can define an external table that references the existing HDFS files. The documentation page for External Tables shows the syntax. Repeating an example:
CREATE EXTERNAL TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User',
country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
STORED AS TEXTFILE
LOCATION '<hdfs_location>';
After defining the external table, you can then use an INSERT-SELECT query that reads from the external table and writes to the Avro table. The documentation on Inserting data into Hive Tables from queries describes the INSERT-SELECT syntax. For example:
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.cnt
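Adapted to this question, the same pattern would be: create the Avro table from the .avsc file, define an external table over the existing HDFS data, then run an INSERT ... SELECT from the external table into the Avro table. A minimal sketch, with the staging table and column names hypothetical since they depend on your schema:
INSERT OVERWRITE TABLE kst PARTITION (ds='2008-06-08')
SELECT s.col1, s.col2
FROM my_staging_table s;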

Partitioning with Hive

I'm having an issue with Hive creating a table from .avro files stored in multiple directories, such as a ParentDir with Child1, Child2, Child3, and Child4 as subdirectories. Each child subdirectory has multiple .avro files. I tried the following syntax to create a table, but it works on only one subdirectory at a time. How can I make the partitions cover all the subdirectories at once?
Thanks!
Edited:
Example of dirs:
/ParentDir/brand-Child1/part-m-0000.avro
/ParentDir/brand-Child2/part-m-0000.avro
/ParentDir/brand-Child3/part-m-0000.avro
/ParentDir/brand-Child4/part-m-0000.avro
CREATE EXTERNAL TABLE test
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/test/dir_with_avro_files/'
TBLPROPERTIES
('avro.schema.url'='/user/test/schema.avsc');
Fixed with this:
CREATE EXTERNAL TABLE test
PARTITIONED BY(brand STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES
('avro.schema.url'='/user/test/schema');
ALTER TABLE db.test ADD PARTITION (brand="child1") LOCATION '/user/test/output/brand-child1';
ALTER TABLE db.test ADD PARTITION (brand="child2") LOCATION '/user/test/output/brand-child2';
ALTER TABLE db.test ADD PARTITION (brand="child3") LOCATION '/user/test/output/brand-child3';
ALTER TABLE db.test ADD PARTITION (brand="child4") LOCATION '/user/test/output/brand-child4';
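Once the partitions are added, you can sanity-check that Hive now sees all four directories, for example:
SHOW PARTITIONS db.test;
SELECT brand, COUNT(*) FROM db.test GROUP BY brand;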

Insert xml file on hdfs to Hive Parquet Table

I have a gzipped 3 GB xml file that I want to map to a Hive Parquet table.
I'm using an XML SerDe to parse that file into a temporary external table, and then I'm using INSERT to load the data into a Hive Parquet table (I want the data to live in a Hive table, not just be an interface to the xml file on HDFS).
I came up with this script:
CREATE TEMPORARY EXTERNAL TABLE temp_table (someData1 INT, someData2 STRING, someData3 ARRAY<STRING>)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.someData1" ="someXpath1/text()",
"column.xpath.someData2"="someXpath2/text()",
"column.xpath.someData3"="someXpath3/text()",
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 'hdfs://locationToGzippedXmlFile'
TBLPROPERTIES (
"xmlinput.start"="<MyItem>",
"xmlinput.end"="</MyItem>"
);
CREATE TABLE parquet_table
STORED AS Parquet
AS select * from temp_table
The main point of this is to have an optimized way to access the data. I don't want to parse the xml on every query; instead, I want to parse the whole file once and put the result into the Parquet table. But the script above takes an extremely long time to run, and in the logs I can see that only 1 mapper is used.
I don't really know if this is the correct approach (maybe it's possible to do it with partitions?).
BTW, I'm using Hue with Cloudera.
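On the partitioning idea: a hedged sketch of what that could look like is below, with the partition column and its value made up for illustration. Note that partitioning only organizes the output table; it doesn't by itself change how the single gzipped source file is read.
CREATE TABLE parquet_table_part (someData1 INT, someData2 STRING, someData3 ARRAY<STRING>)
PARTITIONED BY (load_date STRING)
STORED AS PARQUET;
INSERT OVERWRITE TABLE parquet_table_part PARTITION (load_date='2018-01-01')
SELECT someData1, someData2, someData3 FROM temp_table;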
