Does field delimiter matter in binary file formats in Hive? - hadoop

In the TEXTFILE format, data is stored as plain text with fields separated by the field delimiter; that is why a non-printable delimiter like CTRL^A ('\001') is usually preferred.
But does the field delimiter have any effect when creating a Hive table stored as RCFile, ORC, Avro, or SequenceFile?
In some Hive tutorials I have seen a delimiter specified for these binary file formats too.
Example:
create table olympic_orcfile(athelete STRING, age INT, country STRING, year STRING, closing STRING, sport STRING, gold INT, silver INT, bronze INT, total INT)
row format delimited fields terminated by '\t'
stored as orcfile;
Is the field delimiter ignored, or does it matter, in binary file formats in Hive?

It is ignored by RCFILE, ORC, and AVRO, but it does matter for SEQUENCEFILE: a SequenceFile record's value is still a delimited text line, so the delimiter controls how that value is split into fields.
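A quick way to see the difference is to compare the same DDL under the two storage clauses; the table and column names below are hypothetical:

-- Delimiter is honoured: SEQUENCEFILE stores each row as a delimited text value,
-- so FIELDS TERMINATED BY '\t' controls how the value is split into columns.
CREATE TABLE events_seq (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE;

-- Delimiter is ignored: ORC has its own internal columnar layout,
-- so the clause is accepted by the parser but has no effect on storage.
CREATE TABLE events_orc (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS ORC;

-- DESCRIBE FORMATTED events_orc reports OrcSerde as the SerDe library either way.
DESCRIBE FORMATTED events_orc;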

Related

What does "the container format for fields in a row" mean for a file format?

From Hadoop: The Definitive Guide:
There are two dimensions that govern table storage in Hive: the row format and the file format.
The row format dictates how rows, and the fields in a particular row, are stored. In Hive parlance, the row format is defined by a SerDe, a portmanteau word for a Serializer-Deserializer. When acting as a deserializer, which is the case when querying a table, a SerDe will deserialize a row of data from the bytes in the file to objects used internally by Hive to operate on that row of data. When used as a serializer, which is the case when performing an INSERT or CTAS (see "Importing Data" on page 500), the table's SerDe will serialize Hive's internal representation of a row of data into the bytes that are written to the output file.
The file format dictates the container format for fields in a row. The simplest format is a plain-text file, but there are row-oriented and column-oriented binary formats available, too.
What does "the container format for fields in a row" mean for a file format?
How is a file format different from a row format?
Also read the guide about SerDe:
Hive uses SerDe (and FileFormat) to read and write table rows.
HDFS files --> InputFileFormat --> <key, value> --> Deserializer --> Row object
Row object --> Serializer --> <key, value> --> OutputFileFormat --> HDFS files
You can create tables with a custom SerDe or using a native SerDe. A native SerDe is used if ROW FORMAT is not specified or ROW FORMAT DELIMITED is specified.
The file format represents the file container; it can be text or a binary format such as ORC or Parquet.
The row format can be simple delimited text or something more complex, such as a regexp/template-based SerDe or JSON.
Consider JSON formatted records in a Text file:
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
Or JSON records in a sequence file:
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS SEQUENCEFILE
Everything is actually a Java class. What is very confusing for beginners is that shortcuts are possible in the DDL; they allow you to write DDL without specifying long and complex class names for every format. Some classes have no corresponding shortcuts embedded in the DDL language.
STORED AS SEQUENCEFILE is a shortcut for
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileOutputFormat'
These two classes determine how to read/write file container.
And this class determines how the row should be stored and read (JSON):
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
And now DDL with row format and file format without shortcuts:
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileOutputFormat'
To understand the difference even better, look at the SequenceFileOutputFormat class (which extends FileOutputFormat) and JsonSerDe (which implements SerDe). You can dig deeper into the methods they implement and their base classes/interfaces; look at the source code, in particular the serialize and deserialize methods of the JsonSerDe class.
And "the container format for fields in a row" is FileInputFormat plus FileOutputFormat mentioned in the above DDLs. In case of ORC file for example, you cannot specify row format (delimited or other SerDe). ORC file dictates that OrcSerDe will be used only for this type of file container, which has it's own internal format for storing rows and columns. Actually you can write ROW FORMAT DELIMITED STORED AS ORC in Hive, but row format delimited will be ignored in this case.

How to create a HIVE table to read semicolon separated values

I want to create a HIVE table that will read in semicolon separated values, but my code keeps giving me errors. Does anyone have any suggestions?
CREATE TABLE test_details(Time STRING, Vital STRING, sID STRING)
PARTITIONED BY(Country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ';'
STORED AS TEXTFILE;
For me nothing worked except this:
FIELDS TERMINATED BY '\u0059'
Edit: after updating Hive:
FIELDS TERMINATED BY '\u003B'
(59 is the decimal character code of the semicolon and 3B is the hexadecimal one, so which escape works depends on how your Hive version interprets the \u sequence.)
So in full:
CREATE TABLE test_details(Time STRING, Vital STRING, sID STRING)
PARTITIONED BY(Country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0059'
STORED AS TEXTFILE;
The delimiter you are using is the cause of the errors. The semicolon is the statement terminator in Hive, marking the end of a query, so it has to be escaped when used as a field delimiter.
Use the modified DDL below:
CREATE TABLE test_details(Time STRING, Vital STRING, sID STRING)
PARTITIONED BY(Country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\;'
STORED AS TEXTFILE;
This will work for you.
Is your text properly sanitized? Hive natively does not handle quotes in text nicely.
Try using a SerDe with a custom separator (a semicolon in this case); a sketch follows below.
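A minimal sketch of that suggestion using the built-in OpenCSVSerde; the table name is hypothetical, and note that this SerDe treats every column as STRING, so numeric columns would need casting in queries:

-- OpenCSVSerde lets you configure the separator and quote characters explicitly.
CREATE TABLE test_details_csv (Time STRING, Vital STRING, sID STRING)
PARTITIONED BY (Country STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ";",
  "quoteChar"     = "\""
)
STORED AS TEXTFILE;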

handling newline character in hive

I have created a table in hive as
Create table(id int, Description String)
My data looks something like the following:
1|This will return corrupt data since there is a ',' in the first string.
some text
Change the data
2|There is prob in reading data
sometext
After the data is loaded into Hive, since the default line terminator is \n, the description column cannot be read by Hive, and hence it displays a NULL value. Can anyone suggest how to handle the newline before loading into Hive?
I know this question is old, but you have a couple of options. You can't control this with FIELDS TERMINATED BY, because that only controls what terminates the fields, not the records. Records in Hive are hard-coded to be terminated by the newline character (even though there is a LINES TERMINATED BY clause, it is not implemented).
1. Write a custom InputFormat that uses a RecordReader that understands non-newline-delimited records. Look at the code for LineReader/LineRecordReader and TextInputFormat.
2. Use a format other than text/ASCII, like Parquet. I would recommend this regardless, as text is probably the worst format you can store data in anyway. (A sketch of this option follows below.)
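A minimal sketch of option 2, assuming the rows have already been parsed correctly somewhere upstream; the table names are hypothetical:

-- Parquet stores values in its own binary layout, so embedded newlines in a string
-- column survive intact instead of being misread as record boundaries.
CREATE TABLE descriptions_parquet (id INT, description STRING)
STORED AS PARQUET;

-- Populate it from whatever staging table or query already holds the parsed rows.
INSERT INTO TABLE descriptions_parquet
SELECT id, description FROM staging_descriptions;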
Try adding the property below in hive-site.xml, or set it temporarily at the Hive session level:
hive.query.result.fileformat=SequenceFile
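For a session-level override, something like this should work, assuming a Hive version where the property can be set from the client:

-- Applies only to the current session, not the cluster-wide configuration.
SET hive.query.result.fileformat=SequenceFile;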
By default Hive uses the newline character ('\n') as the line delimiter.
You can change the field delimiter using:
ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";

Hive: CREATE TABLE on unicode csv files

On an HDInsight cluster, I am trying to create a Hive table on Unicode CSV files.
Invoke-Hive -Query @"
CREATE EXTERNAL TABLE TestUnicode(Numeric1 INT, Numeric2 INT, Numeric3 INT, Name String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'wasb://$containerName@$storageAccountName.blob.core.windows.net/TestUnicode.csv';
"@
But Hive is not recognising unicode strings properly. Also all integer fields are loaded as NULL.
Change encoding of TestUnicode.csv to UTF-8. Works for me.
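If re-encoding the file is not convenient, another thing worth trying is telling the default LazySimpleSerDe which encoding the files use via the serialization.encoding SerDe property; this assumes a Hive version that honours the property and that the files are, for example, UTF-16:

-- Hypothetical alternative: declare the file encoding instead of converting the file.
ALTER TABLE TestUnicode SET SERDEPROPERTIES ('serialization.encoding'='UTF-16');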

Import flat files with key=value pair in Hive

I have raw files in HDFS in the format
name=ABC age=10 Location=QWERTY
name=DEF age=15 Location=IWIORS
How do I import data from these flat files into a Hive table with columns 'name' and 'location' only.
You can do the following.
In table declaration, use:
ROW FORMAT DELIMITED
        FIELDS TERMINATED BY ' ' --space
        MAP KEYS TERMINATED BY '='
Also, your table will have a single column with data type MAP.
You can then retrieve the data from that single column using the key; a sketch of this approach follows below.
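A minimal sketch of that idea; the table name and the use of COLLECTION ITEMS TERMINATED BY (rather than FIELDS TERMINATED BY) for the space separator are my assumptions, not part of the original answer:

-- One MAP column holds each whole line; entries are split on spaces, keys and values on '='.
CREATE TABLE raw_kv (attributes MAP<STRING, STRING>)
ROW FORMAT DELIMITED
  COLLECTION ITEMS TERMINATED BY ' '
  MAP KEYS TERMINATED BY '='
STORED AS TEXTFILE;

-- Pull out only the columns of interest by key.
SELECT attributes['name'] AS name,
       attributes['Location'] AS location
FROM raw_kv;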
Another option:
Write your own SerDe. The link below explains the process for JSON data; I am sure you can customize it for your requirements:
http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/
