Is there a Hive SerDe for CSV files that can infer the schema from the file header?

I have a CSV file whose first line is a header. Is there a Hive SerDe that can create a table using the CSV header, and ideally infer the data types as well?

Short answer - No.
What you're looking for is outside the scope of what SerDes are designed to do. There are, however, tools available that will create a table from a CSV with headers as an intermediate step. Check out Hue.
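If you only need the header row skipped (with no type inference), one common workaround is the built-in OpenCSVSerde combined with the skip.header.line.count table property. A minimal sketch with assumed table name, columns and location; note that this SerDe treats every column as STRING, so types still have to be declared or cast by hand:

-- Hypothetical example: table name, columns and location are assumptions
CREATE EXTERNAL TABLE my_csv_table (
  id STRING,
  name STRING,
  amount STRING   -- OpenCSVSerde exposes all columns as STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION '/data/my_csv'
TBLPROPERTIES ("skip.header.line.count"="1");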

Related

What does "the container format for fields in a row" mean for a file format?

From Hadoop: The Definitive Guide:
There are two dimensions that govern table storage in Hive: the row format and the file format.

The row format dictates how rows, and the fields in a particular row, are stored. In Hive parlance, the row format is defined by a SerDe, a portmanteau word for a Serializer-Deserializer. When acting as a deserializer, which is the case when querying a table, a SerDe will deserialize a row of data from the bytes in the file to objects used internally by Hive to operate on that row of data. When used as a serializer, which is the case when performing an INSERT or CTAS (see "Importing Data" on page 500), the table's SerDe will serialize Hive's internal representation of a row of data into the bytes that are written to the output file.

The file format dictates the container format for fields in a row. The simplest format is a plain-text file, but there are row-oriented and column-oriented binary formats available, too.
What does "the container format for fields in a row" mean for a file format?
How is a file format different from a row format?
Also read the guide about SerDe:
Hive uses SerDe (and FileFormat) to read and write table rows.
HDFS files --> InputFileFormat --> <key, value> --> Deserializer --> Row object
Row object --> Serializer --> <key, value> --> OutputFileFormat --> HDFS files
You can create tables with a custom SerDe or using a native SerDe. A native SerDe is used if ROW FORMAT is not specified or ROW FORMAT DELIMITED is specified.
The file format represents the file container; it can be text or a binary format like ORC or Parquet.
The row format can be simple delimited text, or something more complex such as a regexp/template-based SerDe or JSON, for example.
Consider JSON formatted records in a Text file:
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
Or JSON records in a sequence file:
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS SEQUENCEFILE
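For reference, a complete table definition using the JSON SerDe over a text file could look like the sketch below; the table name and columns are assumptions, and the JsonSerDe class may require the hive-hcatalog-core jar to be on Hive's classpath.

-- Hypothetical table; column names are illustrative only
CREATE TABLE events_json (
  event_id   STRING,
  event_type STRING,
  ts         BIGINT
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;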
Everything is actually a Java class. What is very confusing for beginners is that shortcuts are possible in the DDL; they allow you to write DDL without specifying long and complex class names for every format. Some classes have no corresponding shortcut embedded in the DDL language.
STORED AS SEQUENCEFILE is a shortcut for
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileOutputFormat'
These two classes determine how to read/write the file container.
And this class determines how the row should be stored and read (JSON):
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
And now DDL with row format and file format without shortcuts:
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileOutputFormat'
To understand the difference even better, look at the SequenceFileOutputFormat class (which extends FileOutputFormat) and JsonSerDe (which implements SerDe). You can dig deeper and try to understand the methods implemented and the base classes/interfaces; look at the source code, in particular the serialize and deserialize methods in the JsonSerDe class.
And "the container format for fields in a row" is FileInputFormat plus FileOutputFormat mentioned in the above DDLs. In case of ORC file for example, you cannot specify row format (delimited or other SerDe). ORC file dictates that OrcSerDe will be used only for this type of file container, which has it's own internal format for storing rows and columns. Actually you can write ROW FORMAT DELIMITED STORED AS ORC in Hive, but row format delimited will be ignored in this case.

ORC schema evolution

After going through a sample ORC file itself, I came to know that the ORC file format does not store any column information; in fact, all the column names are replaced by _c0 to _cn. In such a scenario, how can proper schema evolution be achieved for ORC tables?
The ORC format does not store any information about Hive column names. There was a bug which caused column information to be stored if the ORC file was created using Pig. You can find the details below:
https://issues.apache.org/jira/browse/HIVE-7189
I think the ORC file format (and others) rely on the Hive metastore for this information. If you run describe formatted <table_name>, you will get the schema information, something like:
# col_name data_type comment
name string
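Since the authoritative schema lives in the metastore, schema evolution for an ORC table is normally done with DDL against the table itself. A minimal sketch, assuming a hypothetical table name:

-- Hypothetical table name; the new column is recorded in the metastore,
-- and existing ORC files simply return NULL for it on read
ALTER TABLE my_orc_table ADD COLUMNS (age INT);
DESCRIBE FORMATTED my_orc_table;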

Schema on read in hive for tsv format file

I am new to Hadoop. I have data in TSV format with 50 columns and I need to store the data in Hive. How can I create the table and load the data on the fly, without manually writing a CREATE TABLE statement, using schema on read?
Hive requires you to run a CREATE TABLE statement because the Hive metastore must be updated with the description of what data location you're going to be querying later on.
Schema-on-read doesn't mean that you can query every possible file without knowing metadata beforehand such as storage location and storage format.
SparkSQL or Apache Drill, on the other hand, will let you infer the schema from a file, but you must again define the column types for a TSV if you don't want everything to be a string column (or coerced to unexpected types). Both of these tools can interact with a Hive metastore for "decoupled" storage of schema information.
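In other words, for a TSV file the table still has to be declared once, even though Hive only applies the schema at read time. A minimal sketch, assuming a hypothetical table name and location and showing only a few of the 50 columns:

-- Hypothetical external table over the TSV directory; extend to all 50 columns
CREATE EXTERNAL TABLE my_tsv_table (
  col1 STRING,
  col2 INT,
  col3 DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/my_tsv';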
You can use Hue:
http://gethue.com/hadoop-tutorial-create-hive-tables-with-headers-and/
Or, with Spark, you can infer the schema of the CSV file and save it as a Hive table:
val df = spark.read
  .option("delimiter", "\t")
  .option("header", "true")
  .option("inferSchema", "true") // <-- HERE
  .csv("/home/cloudera/Book1.csv")

Creating an ORC file and not Hive table?

From what I found by googling around, there are ways of creating an ORC table using Hive, but I want an ORC file on which I can run my custom MapReduce job.
Also, please let me know: is the file created by Hive under the warehouse directory for my ORC table just a table file, or an actual ORC file I can use? For example: /user/hive/warehouse/tbl_orc/000000_0
[Wrap-up of the discussion]
- a Hive table is mapped to an HDFS directory (or a list of directories, if the table is partitioned)
- all files in that directory use the same SerDe (ORC, Parquet, Avro, Text, etc.) and have the same column set; all together, they contain all the data available for that table
- each file in that directory is the result of a previous MapReduce job -- either a Hive INSERT, a Pig dataset saved via HCatalog, a Spark dataset saved via HiveContext... or any custom job that happens to drop a file there, hopefully compliant with the table SerDe and schema (retrieved via the MetastoreClient Java API, or via the HCatalog API, whatever)
- note that a single job with 3 reducers will probably create 3 new files (and maybe 1 empty file + 1 small file + 1 big file!); and a job with 24 mappers and no reducer will create 24 files, unless some kind of "merge small files" post-processing step is enabled
- note also that most file names give absolutely no information about the way the file is encoded internally, they are just sequence numbers (i.e. the 5th job to add 12 files will typically create files 000004_0 to 000004_11)
All in all, processing an ORC fileset with a Java MapReduce program should be very similar to processing a text fileset. You just have to provide the correct SerDe and the correct field mapping -- I think the compression algorithm is explicit in the files, so the SerDe handles it auto-magically at read time. Just remember that ORC files are not splittable at record level, but at stripe level (a stripe is a bunch of records stored in columnar format with tokenization and optional compression).
Of course, that will not give you access to advanced ORC features such as vectorization or stripe pruning (somewhat similar to "smart scan" in Oracle Exadata).
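To make that concrete, here is a hedged HiveQL sketch; the table name tbl_orc comes from the question, while everything else (column names, source table, job output path) is assumed. The point is that the files Hive writes under the table directory are ordinary ORC files, and a compatible ORC file produced elsewhere can be attached to the table:

-- Files written by this INSERT land under /user/hive/warehouse/tbl_orc/
-- as plain ORC files (e.g. 000000_0) that any ORC reader can open
CREATE TABLE tbl_orc (id INT, name STRING) STORED AS ORC;
INSERT INTO TABLE tbl_orc SELECT id, name FROM some_source_table;

-- Conversely, an ORC file produced by a custom MapReduce job can be
-- attached to the table, provided it matches the table schema
LOAD DATA INPATH '/tmp/my_job_output/part-r-00000' INTO TABLE tbl_orc;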

Hadoop Hive - best use cases to create custom Hive Input and Output formats?

I just wanted to understand the best use cases for creating a custom Hive InputFormat and OutputFormat.
If any of you have created one, could you please let me know when you decided to develop custom input/output formats?
Thanks,
To make Hive varchar behave like Oracle varchar2:
While working on an Oracle-to-Hadoop migration, we came across a behaviour in Oracle where, if the length of the data for a varchar2 column exceeds the value defined in the table DDL, Oracle rejects the record.
Example: let's say we have a column 'name' in Oracle and Hadoop with a max length of 10 bytes:
name varchar2(10 BYTE) - Oracle
name varchar(10) - Hive
If the value of the name field is "lengthgreaterthanten", Oracle rejects the record because Oracle applies the schema at write time, whereas Hive reads "lengthgrea", i.e. 10 characters, because Hive applies the schema only when reading the records from HDFS.
To get around this problem we came up with a custom input format that checks the length of the varchar field by splitting on the delimiter. If the length is greater than the specified length, it continues to the next record; otherwise the record is written to HDFS.
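For what it's worth, such a custom class is wired into a table through the long-form DDL shown earlier on this page. A sketch follows; the InputFormat class name is entirely hypothetical (standing in for the length-checking class described above), while the OutputFormat shown is the standard Hive text output format:

-- 'com.example.VarcharLengthCheckingInputFormat' is a made-up class name
-- standing in for the custom length-checking InputFormat described above
CREATE TABLE names_checked (name VARCHAR(10))
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT 'com.example.VarcharLengthCheckingInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';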
Hope this helps.
Thanks
Among the various file formats used with Hive are RCFile, Parquet and ORC. These are columnar file formats. This gives the advantage that, when you read large tables, you don't have to read and process all the data: most aggregation queries refer to only a few columns rather than all of them, which speeds up processing hugely.
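For example (a sketch with assumed table and column names), an aggregation like the one below only has to read two column streams from a wide ORC table instead of every field of every row:

-- Hypothetical table; with a columnar format such as ORC, only the
-- 'region' and 'amount' columns are read for this query
SELECT region, SUM(amount) AS total
FROM sales_orc
GROUP BY region;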
Another application could be storing, reading and processing data in your own custom input format, where the data might be stored differently than a CSV structure; these might be binary files or any other structure.
You will have to follow the documentation to create input formats. For details you can follow this link: Custom InputFormat with Hive

Resources