Partitioning with Hive - hadoop

I'm having an issue with Hive creating a table from .avro files stored in multiple directories, such as a ParentDir with Child1, Child2, Child3, Child4 as subdirectories. Each child subdirectory has multiple .avro files. I tried the following syntax to create a table, but it only works on one subdirectory at a time. How can I make the partitions cover all the subdirectories at once?
Thanks!
Edited:
Example of dirs:
/ParentDir/brand-Child1/part-m-0000.avro
/ParentDir/brand-Child2/part-m-0000.avro
/ParentDir/brand-Child3/part-m-0000.avro
/ParentDir/brand-Child4/part-m-0000.avro
CREATE EXTERNAL TABLE test
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/test/dir_with_avro_files/'
TBLPROPERTIES
('avro.schema.url'='/user/test/schema.avsc');
Fixed with this:
CREATE EXTERNAL TABLE test
PARTITIONED BY(brand STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES
('avro.schema.url'='/user/test/schema');
ALTER TABLE db.test ADD PARTITION (brand="child1") LOCATION '/user/test/output/brand-child1';
ALTER TABLE db.test ADD PARTITION (brand="child2") LOCATION '/user/test/output/brand-child2';
ALTER TABLE db.test ADD PARTITION (brand="child3") LOCATION '/user/test/output/brand-child3';
ALTER TABLE db.test ADD PARTITION (brand="child4") LOCATION '/user/test/output/brand-child4';
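With the partitions mounted, the partition column can be used to list what was added and to prune a query to a single subdirectory. A minimal usage sketch (the partition value in the SELECT is illustrative):
-- list the mounted partitions
SHOW PARTITIONS db.test;
-- reads only the files under /user/test/output/brand-child1
SELECT * FROM db.test WHERE brand = 'child1';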

Related

How to create partitioned hive table on dynamic hdfs directories

I am having difficulty getting Hive to discover partitions that are created in HDFS.
Here's the directory structure in HDFS:
warehouse/database/table_name/A
warehouse/database/table_name/B
warehouse/database/table_name/C
warehouse/database/table_name/D
A, B, C, D being values of a column named type.
When I create a Hive table using the following syntax:
CREATE EXTERNAL TABLE IF NOT EXISTS
table_name(`name` string, `description` string)
PARTITIONED BY (`type` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'hdfs:///tmp/warehouse/database/table_name'
I am unable to see any records when I query the table.
But when I create the directories in HDFS as below:
warehouse/database/table_name/type=A
warehouse/database/table_name/type=B
warehouse/database/table_name/type=C
warehouse/database/table_name/type=D
It works, and the partitions are discovered when I check using show partitions table_name.
Is there some configuration in Hive that enables it to detect these dynamic directories as partitions?
Creating an external table on top of a directory is not enough; the partitions need to be mounted as well. The discover-partitions feature was added in Hive 4.0.0. Use MSCK REPAIR TABLE for earlier versions:
MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS];
or its equivalent on EMR:
ALTER TABLE table_name RECOVER PARTITIONS;
And when you create dynamic partitions using INSERT OVERWRITE, the partition metadata is created automatically and the partition folders are written in the form key=value, as in the sketch below.
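A minimal sketch of that dynamic-partition path, assuming an unpartitioned staging table named staging_table with the same columns (the staging table name is illustrative):
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- Hive writes folders named type=A, type=B, ... and registers the partitions
INSERT OVERWRITE TABLE table_name PARTITION (`type`)
SELECT `name`, `description`, `type`
FROM staging_table;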

Hive - Replace columns in ORC table

I have a hive table saved in ORC files, this is the definition in the "create" command:
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
I want to drop a column from the end, so I tried the "Alter Table - Replace Columns" command, omitting that column's name, but got this error:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Replacing columns cannot drop columns for table default.table. SerDe may be incompatible
Is there a way to replace columns in a ORC table in Hive?
Google failed me on this subject....
Thanks!
As per the Hive tutorial, the REPLACE COLUMNS command can be used only for tables with a native SerDe (DynamicSerDe, MetadataTypedColumnsetSerDe, LazySimpleSerDe and ColumnarSerDe).
So for your case (see the sketch after these steps):
create a new table with the required columns.
insert into the new table from the old table.
Rename the old table to some other name.
Rename the new table to the old table's name.
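A minimal HiveQL sketch of those steps, assuming the old table is named old_table and the last column is the one being dropped (all table and column names here are illustrative):
-- 1. new table with only the columns to keep
CREATE TABLE old_table_new (col1 STRING, col2 INT)
STORED AS ORC;
-- 2. copy the kept columns from the old table
INSERT INTO TABLE old_table_new
SELECT col1, col2 FROM old_table;
-- 3. move the old table out of the way
ALTER TABLE old_table RENAME TO old_table_backup;
-- 4. give the new table the original name
ALTER TABLE old_table_new RENAME TO old_table;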
Thanks.

Difference between 'Stored as InputFormat, OutputFormat' and 'Stored as' in Hive

The issue occurs when executing a show create table and then executing the resulting create table statement, if the table is ORC.
Using show create table, you get this:
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
But if you create the table with those clauses, you will then get a casting error when selecting. The error looks like:
Failed with exception
java.io.IOException:java.lang.ClassCastException:
org.apache.hadoop.hive.ql.io.orc.OrcStruct cannot be cast to
org.apache.hadoop.io.BinaryComparable
To fix this, just change create table statement to STORED AS ORC
But, even given the answer to the similar question
What is the difference between 'InputFormat, OutputFormat' & 'Stored as' in Hive?,
I can't figure out the reason.
STORED AS implies 3 things:
SERDE
INPUTFORMAT
OUTPUTFORMAT
You have defined only the last 2, leaving the SERDE to be defined by hive.default.serde
hive.default.serde
Default Value: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Added in: Hive 0.14 with HIVE-5976
The default SerDe Hive will use for storage formats that do not specify a SerDe.
Storage formats that currently do not specify a SerDe include 'TextFile, RcFile'.
Demo
hive.default.serde
set hive.default.serde;
hive.default.serde=org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
STORED AS ORC
create table mytable (i int)
stored as orc;
show create table mytable;
Note that the SERDE is 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
CREATE TABLE `mytable`(
`i` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'file:/home/cloudera/local_db/mytable'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',
'numFiles'='0',
'numRows'='0',
'rawDataSize'='0',
'totalSize'='0',
'transient_lastDdlTime'='1496982059')
STORED AS INPUTFORMAT ... OUTPUTFORMAT ...
create table mytable2 (i int)
STORED AS
INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
;
show create table mytable2
;
Note that the SERDE is 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
CREATE TABLE `mytable2`(
`i` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'file:/home/cloudera/local_db/mytable2'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',
'numFiles'='0',
'numRows'='0',
'rawDataSize'='0',
'totalSize'='0',
'transient_lastDdlTime'='1496982426')
You can specify INPUTFORMAT, OUTPUTFORMAT and SERDE in STORED AS when creating a table. Hive allows you to separate your record format from your file format. You can provide custom classes for INPUTFORMAT, OUTPUTFORMAT and SERDE. See details: http://www.dummies.com/programming/big-data/hadoop/defining-table-record-formats-in-hive/
Alternatively you can write simply STORED AS ORC or STORED AS TEXTFILE for example.
The STORED AS ORC statement already takes care of INPUTFORMAT, OUTPUTFORMAT and SERDE. This lets you avoid writing those long fully qualified Java class names for INPUTFORMAT, OUTPUTFORMAT and SERDE; just write STORED AS ORC instead.
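For comparison, a sketch of what STORED AS ORC expands to when every piece is spelled out explicitly (the table name is illustrative):
CREATE TABLE mytable3 (i int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';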

External Hive table from AVRO files says it has no data

I created an external Hive table that points to a location that has several avro files. The create statement worked without any issues and it created the expected columns. However, the table has no data when I try to run a query. I tried to create the table a few different ways and couldn't get it to work. I have also verified that the directory has the avro files.
CREATE EXTERNAL TABLE table_name
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED as INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/path/to/avro/data/'
TBLPROPERTIES ('avro.schema.url'='/path/to/schema/ags.avsc');
CREATE EXTERNAL TABLE table_name
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED as AVRO
LOCATION '/path/to/avro/data/'
TBLPROPERTIES ('avro.schema.url'='/path/to/schema/ags.avsc');
Any ideas?
Turns out the schema file (which was produced by sqoop) was incorrect. I ended up creating a new schema file by using "avro-tools getschema ". Once I used that schema file, everything worked as expected.
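If the table already exists, one way to pick up a regenerated schema file is to repoint avro.schema.url rather than recreating the table; a minimal sketch, with an illustrative path:
-- point the existing table at the corrected schema file
ALTER TABLE table_name SET TBLPROPERTIES (
'avro.schema.url'='/path/to/schema/corrected.avsc');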

Insert xml file on hdfs to Hive Parquet Table

I have a gzipped 3 GB XML file that I want to map to a Hive Parquet table.
I'm using an XML SerDe to parse that file into a temporary external table, and then I'm using INSERT to load this data into a Hive Parquet table (I want this data to be placed in a Hive table, not to create an interface to the XML file on HDFS).
I came up with this script:
CREATE TEMPORARY EXTERNAL TABLE temp_table (someData1 INT, someData2 STRING, someData3 ARRAY<STRING>)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.someData1" ="someXpath1/text()",
"column.xpath.someData2"="someXpath2/text()",
"column.xpath.someData3"="someXpath3/text()",
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 'hdfs://locationToGzippedXmlFile'
TBLPROPERTIES (
"xmlinput.start"="<MyItem>",
"xmlinput.end"="</MyItem>"
);
CREATE TABLE parquet_table
STORED AS Parquet
AS select * from temp_table
The main point of this is that I want an optimized way to access the data. I don't want to parse the XML on every query; instead, I want to parse the whole file once and put the result into a Parquet table. Running the script above takes an unbounded amount of time, and additionally, in the logs I can see that only 1 mapper is used.
I don't really know if it's the correct approach (maybe it's possible to do that with partitions?)
BTW I'm using Hue with cloudera.
