Difference between 'Stored as InputFormat, OutputFormat' and 'Stored as' in Hive - hadoop

There is an issue when executing SHOW CREATE TABLE and then running the resulting CREATE TABLE statement if the table is ORC.
Using SHOW CREATE TABLE, you get this:
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
But if you create the table with those clauses, you then get a casting error when selecting. The error looks like:
Failed with exception
java.io.IOException:java.lang.ClassCastException:
org.apache.hadoop.hive.ql.io.orc.OrcStruct cannot be cast to
org.apache.hadoop.io.BinaryComparable
To fix this, just change the CREATE TABLE statement to STORED AS ORC.
But even with the answer to the similar question "What is the difference between 'InputFormat, OutputFormat' & 'Stored as' in Hive?", I can't figure out the reason.

STORED AS implies 3 things:
SERDE
INPUTFORMAT
OUTPUTFORMAT
You have defined only the last 2, leaving the SERDE to be defined by hive.default.serde
hive.default.serde
Default Value: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Added in: Hive 0.14 with HIVE-5976
The default SerDe Hive will use for storage formats that do not specify a SerDe.
Storage formats that currently do not specify a SerDe include 'TextFile, RcFile'.
Demo
hive.default.serde
set hive.default.serde;
hive.default.serde=org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
STORED AS ORC
create table mytable (i int)
stored as orc;
show create table mytable;
Note that the SERDE is 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
CREATE TABLE `mytable`(
`i` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'file:/home/cloudera/local_db/mytable'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',
'numFiles'='0',
'numRows'='0',
'rawDataSize'='0',
'totalSize'='0',
'transient_lastDdlTime'='1496982059')
STORED AS INPUTFORMAT ... OUTPUTFORMAT ...
create table mytable2 (i int)
STORED AS
INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
;
show create table mytable2
;
Note that the SERDE is 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
CREATE TABLE `mytable2`(
`i` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'file:/home/cloudera/local_db/mytable2'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',
'numFiles'='0',
'numRows'='0',
'rawDataSize'='0',
'totalSize'='0',
'transient_lastDdlTime'='1496982426')

You can specify INPUTFORMAT, OUTPUTFORMAT and SERDE in STORED AS when creating a table. Hive allows you to separate your record format from your file format, and you can provide custom classes for INPUTFORMAT, OUTPUTFORMAT and SERDE. See details: http://www.dummies.com/programming/big-data/hadoop/defining-table-record-formats-in-hive/
Alternatively, you can simply write STORED AS ORC or STORED AS TEXTFILE, for example.
The STORED AS ORC statement already takes care of INPUTFORMAT, OUTPUTFORMAT and SERDE, so you don't have to write those long, fully qualified Java class names. Just use STORED AS ORC instead.
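To make the "record format vs. file format" separation concrete, here is a minimal sketch (the table and column names are placeholders, and it assumes the OpenCSVSerde bundled with Hive): a custom SerDe parses the rows while the file format stays plain text.
CREATE TABLE csv_example (id INT, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE;
Here the SerDe decides how each row is read and written, while TEXTFILE still supplies the INPUTFORMAT and OUTPUTFORMAT.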

Related

Is Row format serde a compulsory parameter to be used while creating Hive table

I created a temporary Hive table on top of a text file like this:
CREATE EXTERNAL TABLE tc (fc String,cno String,cs String,tr String,at String,act String,wa String,dn String,pnm String,rsk String,ttp String,tte String,aml String,pn String,ttn String)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
location '/home/hbaseuser/tc';
Then I created an ORC table like this:
CREATE EXTERNAL TABLE tc1 (fc String,cno String,cs String,tr String,at String,act String,wa String,dn String,pnm String,rsk String,ttp String,tte String,aml String,pn String,ttn String)
Row format delimited
Fields terminated by '\t'
STORED AS orc
location '/user/hbaseuser/tc1';
Then I used this command to import the data into the Hive table:
insert overwrite table tc1 select * from tc;
Now the ORC files are available at '/user/hbaseuser/tc1' and I am able to read from the ORC table.
My question is: what is the use of the clause ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.ORCSerDe'?
When ROW FORMAT SERDE is specified, it overrides the native SerDe and that SerDe is used for the table.
As per the documentation:
You can create tables with a custom SerDe or using a native SerDe. A
native SerDe is used if ROW FORMAT is not specified or ROW FORMAT
DELIMITED is specified. Use the SERDE clause to create a table with a
custom SerDe.
The STORED AS ORC statement is equivalent to writing:
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
You can use either a STORED AS or a ROW FORMAT SERDE statement. You can refer to the documentation below for more details:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RowFormats&SerDe
https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide#DeveloperGuide-HiveSerDe
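As a quick way to verify which SerDe, INPUTFORMAT and OUTPUTFORMAT a table actually ended up with (using the tc1 table from the question as an example), you can inspect its storage information; the "Storage Information" section of the output lists the SerDe library and both formats.
DESCRIBE FORMATTED tc1;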

Hive - Replace columns in ORC table

I have a Hive table saved as ORC files; this is the definition from the CREATE command:
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
I want to drop a column from the end, so I tried the ALTER TABLE ... REPLACE COLUMNS command, listing all columns except the one I wanted to drop, but got this error:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Replacing columns cannot drop columns for table default.table. SerDe may be incompatible
Is there a way to replace columns in a ORC table in Hive?
Google failed me on this subject....
Thanks!
As per the Hive tutorial, the REPLACE COLUMNS command can be used only for tables with a native SerDe (DynamicSerDe, MetadataTypedColumnsetSerDe, LazySimpleSerDe and ColumnarSerDe).
So for your case, the workaround is (sketched below):
Create a new table with the required columns.
Insert into the new table from the old table.
Rename the old table to some other name.
Rename the new table to the old table's name.
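A minimal sketch of those steps, assuming placeholder table and column names (list only the columns you want to keep in the new table):
CREATE TABLE mytable_new (col1 STRING, col2 STRING)
STORED AS ORC;
INSERT OVERWRITE TABLE mytable_new SELECT col1, col2 FROM mytable;
ALTER TABLE mytable RENAME TO mytable_old;
ALTER TABLE mytable_new RENAME TO mytable;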
Thanks.

External Hive table from AVRO files says it has no data

I created an external Hive table that points to a location that has several Avro files. The CREATE statement worked without any issues and it created the expected columns. However, the table has no data when I try to run a query. I tried to create the table a few different ways and couldn't get it to work. I have also verified that the directory has the Avro files.
CREATE EXTERNAL TABLE table_name
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED as INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/path/to/avro/data/'
TBLPROPERTIES ('avro.schema.url'='/path/to/schema/ags.avsc');
CREATE EXTERNAL TABLE table_name
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED as AVRO
LOCATION '/path/to/avro/data/'
TBLPROPERTIES ('avro.schema.url'='/path/to/schema/ags.avsc');
Any ideas?
Turns out the schema file (which was produced by Sqoop) was incorrect. I ended up creating a new schema file by using "avro-tools getschema ". Once I used that schema file, everything worked as expected.
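If you regenerate the schema and would rather point the existing table at the corrected file than recreate the table, one possible sketch (the path is a placeholder):
ALTER TABLE table_name SET TBLPROPERTIES ('avro.schema.url'='/path/to/schema/fixed_ags.avsc');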

Partitioning with Hive

I'm having an issue with Hive creating a table from .avro files stored in multiple directories, such as a ParentDir with Child1, Child2, Child3, Child4 as subdirectories. Each child subdirectory has multiple .avro files. I tried the following syntax to create a table, but it works only on one subdir at a time. How can I make the partitions cover all the subdirs at once?
Thanks!
Edited:
Example of dirs:
/ParentDir/brand-Child1/part-m-0000.avro
/ParentDir/brand-Child2/part-m-0000.avro
/ParentDir/brand-Child3/part-m-0000.avro
/ParentDir/brand-Child4/part-m-0000.avro
CREATE EXTERNAL TABLE test
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/test/dir_with_avro_files/'
TBLPROPERTIES
('avro.schema.url'='/user/test/schema.avsc');
FIXED with this
CREATE EXTERNAL TABLE test
PARTITIONED BY(brand STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES
('avro.schema.url'='/user/test/schema');
ALTER TABLE db.test ADD PARTITION (brand="child1") LOCATION '/user/test/output/brand-child1';
ALTER TABLE db.test ADD PARTITION (brand="child2") LOCATION '/user/test/output/brand-child2';
ALTER TABLE db.test ADD PARTITION (brand="child3") LOCATION '/user/test/output/brand-child3';
ALTER TABLE db.test ADD PARTITION (brand="child4") LOCATION '/user/test/output/brand-child4';
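As a side note, and only under the assumption of a different directory layout: if the subdirectories had followed Hive's key=value naming convention (e.g. /ParentDir/brand=child1/), all partitions could have been discovered in one step instead of one ALTER TABLE per directory:
MSCK REPAIR TABLE db.test;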

Insert xml file on hdfs to Hive Parquet Table

I have a gzipped 3 GB XML file that I want to map to a Hive Parquet table.
I'm using an XML SerDe to parse that file into a temporary external table, and then I'm using INSERT to load this data into a Hive Parquet table (I want the data to be placed in a Hive table, not to create an interface to the XML file on HDFS).
I came up with this script:
CREATE TEMPORARY EXTERNAL TABLE temp_table (someData1 INT, someData2 STRING, someData3 ARRAY<STRING>)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.someData1" ="someXpath1/text()",
"column.xpath.someData2"="someXpath2/text()",
"column.xpath.someData3"="someXpath3/text()",
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 'hdfs://locationToGzippedXmlFile'
TBLPROPERTIES (
"xmlinput.start"="<MyItem>",
"xmlinput.end"="</MyItem>"
);
CREATE TABLE parquet_table
STORED AS Parquet
AS select * from temp_table
The main point is that I want an optimized way to access the data. I don't want to parse the XML on every query; instead, I want to parse the whole file once and put the result into the Parquet table. But running the script above takes a seemingly unlimited amount of time, and in the logs I can see that only 1 mapper is used.
I don't really know if this is the correct approach (maybe it's possible to do that with partitions?).
BTW, I'm using Hue with Cloudera.
