ORC schema evolution - hadoop

After going through a sample ORC file itself I came to know that ORC file format does not store any column information, in fact all the column names will be replaced by _c0 to _cn, in such scenario how a proper schema evolution can be achieved for ORC tables?

ORC format does not store any information about hive column names. There was a bug which was storing column information if ORC file were created using PIG. You can find the details below
https://issues.apache.org/jira/browse/HIVE-7189
I think ORC file format (and other) rely on Hive Metastore for this information. if you will run describe formatted <table_name>, you will get the schema information.
something like
# col_name data_type comment
name string

Related

Schema on read in hive for tsv format file

I am new on hadoop. I have data in tsv format with 50 columns and I need to store the data into hive. How can I create and load the data into table on the fly without manually creating table using create table statementa using schema on read?
Hive requires you to run a CREATE TABLE statement because the Hive metastore must be updated with the description of what data location you're going to be querying later on.
Schema-on-read doesn't mean that you can query every possible file without knowing metadata beforehand such as storage location and storage format.
SparkSQL or Apache Drill, on the other hand, will let you infer the schema from a file, but you must again define the column types for a TSV if you don't want everything to be a string column (or coerced to unexpected types). Both of these tools can interact with a Hive metastore for "decoupled" storage of schema information
you can use Hue :
http://gethue.com/hadoop-tutorial-create-hive-tables-with-headers-and/
or with Spark you can infer the schema of csv file and you can save it as a hive table.
val df=spark.read
.option("delimiter", "\t")
.option("header",true)
.option("inferSchema", "true") // <-- HERE
.csv("/home/cloudera/Book1.csv")

Hive Table Creation based on file structure

i have one doubt, is there any way in HIVE which create table during load to hive warehouse or external table.
As i know hive is based on Schema On Read. so table structure must sync with file structure. but if file size is huge and we don't know its structure for example columns and their datatypes.
Than how to load those file to hive table.
so in short how to load file from HDFS to HIVE Table without knowing its schema structure.
New to Hive, Pardon if my understanding is wrong.
Thanks
By using sqoop you can create hive table while importing data.
Please refer to this link to create hive table while importing data
(or)
if you have imported data in AVRO format then you can generate avro schema by using
/usr/bin/Avro/avro-tools-*.jar then use the generated avro schema while creating table in hive then hive uses the schema and reads the data from HDFS.
Please refer to this link to extract schema from avro data file
(or)
While importing data using sqoop --as-avrodatefile then sqoop creates .avsc file with schema in it, so we can use this .avsc file creating the table.
CREATE EXTERNAL TABLE avro_tbl
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED as INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '<hdfs-location>'
TBLPROPERTIES ('avro.schema.url'='<schema-file>');
(or)
By using NiFi to import data NiFi pulls data in avro format by using ExtractAvroMetadata processor we can extract the avro schema and store into HDFS and create table by using this avro schema.
If you want to create table in ORC format then by using ConvertAvroToOrc processor adds hive.ddl attribute to the flowfile as we can execute the ddl statement to create orc table in hive.

Convert data from gzip to sequenceFile format using Hive on spark

I'm trying to read a large gzip file into hive through spark runtime
to convert into SequenceFile format
And, I want to do this efficiently.
As far as I know, Spark supports only one mapper per gzip file same as it does for text files.
Is there a way to change the number of mappers for a gzip file being read? or should I choose another format like parquet?
I'm stuck currently.
The problem is that my log file is json-like data save into txt-format and then was gzip - ed, so for reading I used org.apache.spark.sql.json.
The examples I have seen that show - converting data into SequenceFile have some simple delimiters as csv-format.
I used to execute this query:
create TABLE table_1
USING org.apache.spark.sql.json
OPTIONS (path 'dir_to/file_name.txt.gz');
But now I have to rewrite it in something like that:
CREATE TABLE table_1(
ID BIGINT,
NAME STRING
)
COMMENT 'This is table_1 stored as sequencefile'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS SEQUENCEFILE;
LOAD DATA INPATH 'dir_to/file_name.txt.gz' OVERWRITE INTO TABLE table_1;
LOAD DATA INPATH 'dir_to/file_name.txt.gz' INTO TABLE table_1;
INSERT OVERWRITE TABLE table_1 SELECT id, name from table_1_text;
INSERT INTO TABLE table_1 SELECT id, name from table_1_text;
Is this the optimal way of doing this, or is there a simpler approach to this problem?
Please help!
As gzip textfile file is not splitable ,only one mapper will be launched or
you have to choose other data formats if you want to use more than one
mappers.
If there are huge json files and you want to save storage on hdfs use bzip2
compression to compress your json files on hdfs.You can query .bzip2 json
files from hive without modifying anything.

How to use Parquet in my current architecture?

My current system is architected in this way.
Log parser will parse raw log at every 5 minutes with format TSV and output to HDFS. I created Hive table out of the TSV file from HDFS.
From some benchmark, I found that Parquet can save up to 30-40% of the space usage. I also found that I can create Hive table out of Parquet file starting Hive 0.13. I would like know if I can convert TSV to Parquet file.
Any suggestion is appreciated.
Yes, in Hive you can easily convert from one format to another by inserting from one table to the other.
For example, if you have a TSV table defined as:
CREATE TABLE data_tsv
(col1 STRING, col2 INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
And a Parquet table defined as:
CREATE TABLE data_parquet
(col1 STRING, col2 INT)
STORED AS PARQUET;
You can convert the data with:
INSERT OVERWRITE TABLE data_parquet SELECT * FROM data_tsv;
Or you can skip the Parquet table DDL by:
CREATE TABLE data_parquet STORED AS PARQUET AS SELECT * FROM data_tsv;

Hadoop Hive - best use cases to create a custom Hive Input and Output formats?

Just wanted to understand what best use cases to create a custom Hive InputFormat and Output format?
If anyone of you have created could you please let know when to decide to develop a custom Input / Output formats?
Thanks,
To make Hive varchar behave like Oracle varchar2:
While working on oracle to hadoop migration, we came across a setting in oracle where if the length of data for a varchar2 column exceeds the value defined in table DDL, oracle rejects the record.
Ex: Lets say we have a column 'name' in oracle and hadoop with max length 10 bytes
name varchar2(10 BYTE) - Oracle
name varchar(10) - Hive
If the value for name field="lengthgreaterthanten", oracle rejects the record as oracle applies schema during write time. Whereas hive reads "lengthgrea" i.e. 10 characters as Hive just applies the schema at the time of reading the records from HDFS.
To get over this problem we came up with a custom input format that checks the length of the varchar field by splitting on the delimiter. If the length is greater than the specified length, it continues to the next record. Else if the length is less than or equal to the specified length, the record is written to HDFS.
Hope this helps.
Thanks
one of the various file formats used for Hive are RCFile, Parquet and ORC file formats. These file formats are columnar file format. This gives an advantage that when you reading large tables you don't have to read and process all the data. Most of the aggregation queries refer to only few columns rather than all of them. This speeds up your processing hugely.
Other application could be storing , reading and processing your custom input format, where data might be stored differently than csv structure. These might be binary files or any other structure.
You will have to follow the documentation to create input formats. For details you can follow the link: Custom InputFormat with Hive

Resources