What does "the container format for fields in a row" mean for a file format?

From Hadoop: The Definitive Guide:
There are two dimensions that govern table storage in Hive: the row
format and the file format.
The row format dictates how rows, and the
fields in a particular row, are stored. In Hive parlance, the row
format is defined by a SerDe, a portmanteau word for a
Serializer-Deserializer. When acting as a deserializer, which is the
case when querying a table, a SerDe will deserialize a row of data
from the bytes in the file to objects used internally by Hive to
operate on that row of data. When used as a serializer, which is the
case when performing an INSERT or CTAS (see “Importing Data” on page
500), the table’s SerDe will serialize Hive’s internal representation
of a row of data into the bytes that are written to the output file.
The file format dictates the container format for fields in a row. The
simplest format is a plain-text file, but there are row-oriented and
column-oriented binary formats available, too.
What does "the container format for fields in a row" mean for a file format?
How is a file format different from a row format?

Read also the Hive developer guide about SerDe:
Hive uses SerDe (and FileFormat) to read and write table rows.
HDFS files --> InputFileFormat --> <key, value> --> Deserializer --> Row object
Row object --> Serializer --> <key, value> --> OutputFileFormat --> HDFS files
You can create tables with a custom SerDe or using a native SerDe. A native SerDe is used if ROW FORMAT is not specified or ROW FORMAT DELIMITED is specified.
The file format represents the file container; it can be text or a binary format like ORC or Parquet.
The row format can be simple delimited text, or something more complex: regexp-based, template-based, or JSON, for example.
Consider JSON formatted records in a Text file:
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
Or JSON records in a sequence file:
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS SEQUENCEFILE
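To make these fragments concrete, here is a minimal complete DDL sketch (the table name, columns, and location are hypothetical, and the JsonSerDe needs the hive-hcatalog-core jar on the classpath):
CREATE EXTERNAL TABLE events_json (id BIGINT, name STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/data/events_json';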
Everything is actually a Java class. What is very confusing for beginners is that the DDL allows shortcuts, so you can write DDL without spelling out long and complex class names for every format. Some classes have no corresponding shortcut in the DDL language.
STORED AS SEQUENCEFILE is a shortcut for
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileOutputFormat'
These two classes determine how to read and write the file container.
And this class determines how the row should be stored and read (JSON):
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
And now the DDL with the row format and the file format written without shortcuts:
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileOutputFormat'
To understand the difference even better, look at the SequenceFileOutputFormat class (which extends FileOutputFormat) and JsonSerDe (which implements SerDe). You can dig deep and try to understand the methods implemented and the base classes/interfaces: look at the source code, and at the serialize and deserialize methods in the JsonSerDe class.
And "the container format for fields in a row" is the FileInputFormat plus FileOutputFormat mentioned in the DDLs above. In the case of an ORC file, for example, you cannot specify a row format (delimited or another SerDe). The ORC file format dictates that only OrcSerDe will be used for this type of file container, which has its own internal format for storing rows and columns. You can actually write ROW FORMAT DELIMITED STORED AS ORC in Hive, but the delimited row format will be ignored in that case.

Related

Is Row format serde a compulsory parameter to be used while creating Hive table

I created a temporary hive table on top of textfile like this:
CREATE EXTERNAL TABLE tc (fc String,cno String,cs String,tr String,at String,act String,wa String,dn String,pnm String,rsk String,ttp String,tte String,aml String,pn String,ttn String)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
location '/home/hbaseuser/tc';
Then I created an ORC table like this:
CREATE EXTERNAL TABLE tc1 (fc String,cno String,cs String,tr String,at String,act String,wa String,dn String,pnm String,rsk String,ttp String,tte String,aml String,pn String,ttn String)
Row format delimited
Fields terminated by '\t'
STORED AS orc
location '/user/hbaseuser/tc1';
Then I used this command to import data to hive table:
insert overwrite table tc1 select * from tc;
Now the ORC file is available at '/user/hbaseuser/tc1', and I am able to read from the ORC table.
My question is: what is the use of the clause ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.ORCSerDe'?
When ROW FORMAT SERDE is specified, it overrides the native SerDe, and that SerDe is used for the table instead.
As per documentation,
You can create tables with a custom SerDe or using a native SerDe. A
native SerDe is used if ROW FORMAT is not specified or ROW FORMAT
DELIMITED is specified. Use the SERDE clause to create a table with a
custom SerDe.
The STORED AS ORC statement is equivalent to writing
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
You can use either the STORED AS shortcut or the ROW FORMAT SERDE statement. You can refer to the documentation below for more details:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RowFormats&SerDe
https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide#DeveloperGuide-HiveSerDe

How to convert existing text data in hdfs to Avro?

I have a table in HDFS stored in text format, and I now have a requirement to add a new column in between. So I thought of loading the data into Avro, since Avro supports schema evolution, but the previous data is still in text format.
If you already have a Hive table, you can load it directly into an Avro table from Hive; if not, you can create a Hive table over the text file and load that into the Avro table.
Something like
create table test(fields type) row format delimited fields terminated by ',' stored as textfile location 'textfilepath';
create table avrotbl (fields type) stored as avro;
insert into avrotbl select * from test;
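Once the data is in Avro, adding a column becomes a metadata change. A sketch, assuming a Hive version where ALTER TABLE ... ADD COLUMNS works on Avro-backed tables (the hypothetical new column is appended at the end, and existing Avro files return NULL for it):
-- schema evolution: old files simply lack the new field
alter table avrotbl add columns (new_col string);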

Does field delimiter matter in binary file formats in Hive?

In the TEXTFILE format, data is stored as plain text with fields separated by the field delimiter. That's why we prefer a non-printable delimiter like Ctrl-A.
But does the field delimiter have any effect when creating a Hive table as RCFile, ORC, Avro, or SequenceFile?
In some Hive tutorials, I saw the delimiter used with these binary file formats too.
Example:
create table olympic_orcfile(athelete STRING,age INT,country STRING,year STRING,closing STRING,sport STRING,gold INT,silver INT,bronze INT,total INT) row format delimited fields terminated by '\t' stored as orcfile;
Is the field delimiter ignored, or does it matter, in binary file formats in Hive?
It is ignored by RCFILE, ORC, and AVRO, but it does matter for SEQUENCEFILE: a SequenceFile is just a container of key/value pairs, and the row is still stored as delimited text inside the value, so the SerDe needs the delimiter to split the fields. RCFile, ORC, and Avro use their own internal encodings instead.
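A sketch to illustrate, with hypothetical tables; the same DELIMITED clause changes how SEQUENCEFILE data is parsed, while ORC ignores it:
-- '\t' really separates the fields inside each SequenceFile value
create table demo_seq (a string, b string)
row format delimited fields terminated by '\t'
stored as sequencefile;
-- the clause is accepted here, but OrcSerDe ignores the delimiter
create table demo_orc (a string, b string)
row format delimited fields terminated by '\t'
stored as orc;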

Describe table shows "from deserializer" for column comments in Hue Hive Avro format

We have observed that when we store the data in Avro format, all the column comments get replaced by “from deserializer” in the DESCRIBE output.
We found a JIRA bug for this issue as well; a few comments confirm that it was addressed in version 0.13. We are using Hive 1.1 (Cloudera), but we are still facing the issue.
JIRA: https://issues.apache.org/jira/browse/HIVE-6681
https://www.bountysource.com/issues/1320154-describe-on-a-table-returns-from-deserializer-for-column-comments-instead-of-values-supplied-in-create-table
But when we change the input and output format to plain text (specified explicitly), the column descriptions are retained; however, the table then loses its actual Avro functionality, so the code below cannot be used.
-- Below is input and output format using text
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
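For comparison, keeping the actual Avro functionality means keeping the Avro SerDe together with the Avro container formats; STORED AS AVRO is the shortcut for:
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'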

Is there a Hive SerDe for CSV file to infer schema from file header

I have a CSV file with the first line as a header. Is there a Hive SerDe that can create a table using the CSV header? One that also infers the data types would be best.
Short answer - No.
What you're looking for is outside the scope of what SerDes are designed to do. There are, however, tools available that will create a table from a CSV with headers as an intermediate step. Check out Hue.
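For what it's worth, a SerDe can at least parse the CSV and skip the header row, but you still have to declare the columns yourself, and OpenCSVSerde reads every column as STRING. A sketch with a hypothetical table and location:
CREATE EXTERNAL TABLE people_csv (id STRING, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION '/data/people_csv'
TBLPROPERTIES ("skip.header.line.count"="1");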
