Describe table shows "from deserializer" for column comments in Hue Hive Avro format

We have observed that when we store the data in Avro format, the byte stream is converted to binary, due to which all the column comments get converted to “from deserializer”.
We found a JIRA bug for this issue as well; a few comments confirm that the issue was addressed in version 0.13. We are using Hive 1.1 (Cloudera), but we are still facing the issue.
JIRA: https://issues.apache.org/jira/browse/HIVE-6681
https://www.bountysource.com/issues/1320154-describe-on-a-table-returns-from-deserializer-for-column-comments-instead-of-values-supplied-in-create-table
But when we change the input and output format to plain text (specified explicitly), the column descriptions are retained; however, the table then seems to lose its actual Avro functionality, so the DDL below cannot be used.
-- Below is input and output format using text
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
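For comparison, and only as an illustrative sketch (the record name and fields below are made up), this is how an Avro-backed table is usually declared: the Avro SerDe plus the Avro container input/output formats, with the schema supplied in TBLPROPERTIES. In Avro itself, per-field descriptions are normally carried in the schema's doc attributes rather than in the Hive DDL.
-- Illustrative Avro-backed table; field names and doc strings are placeholders
CREATE TABLE example_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='{
  "type": "record",
  "name": "example",
  "fields": [
    {"name": "id",   "type": "long",   "doc": "surrogate key"},
    {"name": "name", "type": "string", "doc": "customer name"}
  ]
}');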

Related

What does "the container format for fields in a row" mean for a file format?

From Hadoop: The Definitive Guide:
There are two dimensions that govern table storage in Hive: the row
format and the file format.
The row format dictates how rows, and the
fields in a particular row, are stored. In Hive parlance, the row
format is defined by a SerDe, a portmanteau word for a
Serializer-Deserializer. When acting as a deserializer, which is the
case when querying a table, a SerDe will deserialize a row of data
from the bytes in the file to objects used internally by Hive to
operate on that row of data. When used as a serializer, which is the
case when performing an INSERT or CTAS (see “Importing Data” on page
500), the table’s SerDe will serialize Hive’s internal representation
of a row of data into the bytes that are written to the output file.
The file format dictates the container format for fields in a row. The
simplest format is a plain-text file, but there are row-oriented and
column-oriented binary formats available, too.
What does "the container format for fields in a row" mean for a file format?
How is a file format different from a row format?
Read also the guide about SerDe:
Hive uses SerDe (and FileFormat) to read and write table rows.
HDFS files --> InputFileFormat --> <key, value> --> Deserializer --> Row object
Row object --> Serializer --> <key, value> --> OutputFileFormat --> HDFS files
You can create tables with a custom SerDe or using a native SerDe. A native SerDe is used if ROW FORMAT is not specified or ROW FORMAT DELIMITED is specified.
The file format represents the file container; it can be text or a binary format like ORC or Parquet.
The row format can be simple delimited text, or rather complex: regexp/template-based or JSON, for example (a RegexSerDe sketch follows the JSON examples below).
Consider JSON formatted records in a Text file:
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
Or JSON records in a sequence file:
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS SEQUENCEFILE
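And, as a sketch of the regexp-based row format mentioned above (the table, pattern, and columns are only an example; the built-in RegexSerDe expects all columns to be strings), regexp-parsed records in a text file:
-- Example only: parses lines like "2019-01-01 INFO something happened"
CREATE TABLE example_log (log_date STRING, level STRING, message STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('input.regex' = '(\\S+) (\\S+) (.*)')
STORED AS TEXTFILE;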
Everything is actually a Java class. What is very confusing for beginners is that shortcuts are possible in the DDL; this lets you write DDL without specifying long and complex class names for all formats. Some classes have no corresponding shortcuts embedded in the DDL language.
STORED AS SEQUENCEFILE is a shortcut for
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileOutputFormat'
These two classes determine how to read/write file container.
And this class determines how the row should be stored and read (JSON):
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
And now DDL with row format and file format without shortcuts:
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileOutputFormat'
And for an even better understanding of the difference, look at the SequenceFileOutputFormat class (extends FileOutputFormat) and JsonSerDe (implements SerDe). You can dig deeper and try to understand the implemented methods and the base classes/interfaces; look at the source code, in particular the serialize and deserialize methods in the JsonSerDe class.
And "the container format for fields in a row" is FileInputFormat plus FileOutputFormat mentioned in the above DDLs. In case of ORC file for example, you cannot specify row format (delimited or other SerDe). ORC file dictates that OrcSerDe will be used only for this type of file container, which has it's own internal format for storing rows and columns. Actually you can write ROW FORMAT DELIMITED STORED AS ORC in Hive, but row format delimited will be ignored in this case.

ORC schema evolution

After going through a sample ORC file itself, I came to know that the ORC file format does not store any column information; in fact, all the column names are replaced by _c0 to _cn. In such a scenario, how can proper schema evolution be achieved for ORC tables?
The ORC format does not store any information about Hive column names. There was a bug where column information was stored if the ORC file was created using Pig. You can find the details below:
https://issues.apache.org/jira/browse/HIVE-7189
I think the ORC file format (and others) relies on the Hive metastore for this information. If you run describe formatted <table_name>, you will get the schema information,
something like
# col_name        data_type       comment
name              string
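As a hedged sketch of what relying on the metastore means in practice (the table and column names are made up, and how older files behave depends on the Hive/ORC version's schema-evolution support), schema changes are made against the metastore definition:
-- Illustrative only: evolve the metastore schema of an ORC-backed table
ALTER TABLE example_orc ADD COLUMNS (country STRING COMMENT 'added later');
-- Files written before the change contain no 'country' data; on versions with
-- ORC schema evolution support, old rows are read back with NULL for it
SELECT id, name, country FROM example_orc;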

Result of Hive unbase64() function is correct in the Hive table, but becomes wrong in the output file

There are two questions:
I use unbase64() to process data and the output is completely correct in both Hive and SparkSQL. But in Presto, it shows:
Then I insert the data to both a local path and HDFS, and the data in both output files is wrong:
The code I used to insert data:
insert overwrite directory '/tmp/ssss'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
select * from tmp_ol.aaa;
My question is:
1. Why can the processed data be shown correctly in both Hive and SparkSQL, but not in Presto? The Presto on my machine can display this kind of character.
2. Why can the data not be shown correctly in the output file? The files are in UTF-8 format.
You can try using CAST(... AS STRING) over the output of the unbase64() function.
spark.sql("""SELECT CAST(unbase64('UsImF1dGhvcml6ZWRSZXNvdXJjZXMiOlt7Im5h') AS STRING) AS decoded""").show(false)

Is there a Hive SerDe for CSV file to infer schema from file header

I have a CSV file with the first line as a header. Is there a Hive SerDe that can create a table using the CSV header and, ideally, also infer the data types?
Short answer - No.
What you're looking for is outside the scope of what SerDes are designed to do. There are, however, tools available that will create a table from a CSV with headers as an intermediate step. Check out Hue.
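What is available out of the box is a CSV SerDe that handles quoting, plus a table property to skip the header line; you still declare the columns yourself, and OpenCSVSerde reads every column as a string. A sketch (table and columns are only an example):
-- Example only: the header line is skipped, but columns must still be declared
CREATE TABLE example_csv (id STRING, name STRING, amount STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
TBLPROPERTIES ('skip.header.line.count'='1');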

Hadoop Hive - best use cases to create a custom Hive Input and Output formats?

I just wanted to understand the best use cases for creating a custom Hive InputFormat and OutputFormat.
If any of you have created one, could you please let me know when you decided to develop a custom input/output format?
Thanks,
To make Hive varchar behave like Oracle varchar2:
While working on an Oracle-to-Hadoop migration, we came across a setting in Oracle where, if the length of the data for a varchar2 column exceeds the value defined in the table DDL, Oracle rejects the record.
Ex: Let's say we have a column 'name' in Oracle and Hadoop with a max length of 10 bytes:
name varchar2(10 BYTE) - Oracle
name varchar(10) - Hive
If the value of the name field is "lengthgreaterthanten", Oracle rejects the record because Oracle applies the schema at write time, whereas Hive reads "lengthgrea", i.e. the first 10 characters, since Hive applies the schema only when reading the records from HDFS.
To get over this problem we came up with a custom input format that checks the length of the varchar field by splitting on the delimiter. If the length is greater than the specified length, it continues to the next record; else, if the length is less than or equal to the specified length, the record is written to HDFS.
Hope this helps.
Thanks
Some of the various file formats used with Hive are the RCFile, Parquet, and ORC file formats. These are columnar file formats, which gives you the advantage that, when reading large tables, you don't have to read and process all the data; most aggregation queries refer to only a few columns rather than all of them, so this speeds up processing considerably.
Another application could be storing, reading, and processing data in your own custom input format, where the data is laid out differently from a CSV structure. These might be binary files or any other structure.
You will have to follow the documentation to create input formats. For details you can follow the link: Custom InputFormat with Hive.
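Note that for the columnar formats mentioned above no custom code is needed; the built-in shortcuts are enough. A minimal, illustrative example:
-- Illustrative: built-in columnar container, no custom InputFormat required
CREATE TABLE example_parquet (id INT, name STRING)
STORED AS PARQUET;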
