Hive file formats advantages and disadvantages - hadoop

I start to work with Hive.
I wanted to know what queries should to use for each table format among formats:
rcfile, orcfile, parquet, delimited text

when you have tables with very large number of columns and you tend to use specific columns frequently, RC file format would be a good choice. Rather than reading the entire row of data you would just retrieve the required columns, thus saving time. The data is divided into groups of rows, which are then divided into groups of columns.
Delimited text file is the general file format.

For ORC file format , have a look at the hive documentation which has a detailed description here : https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC
Parquet file format stores data in column form.
eg:
Col1 Col2
A 1
B 2
C 3
Normal data is stored as A1B2C3. Using Parquet, data is stored as ABC123.
For parquet file format , have a read on https://blog.twitter.com/2013/dremel-made-simple-with-parquet

I see that there are a couple of answers but since your question didn't asked for any particular file formats, the answers addressed one or the other file format.
There are a bunch of file formats that you can use in Hive. Notable mentions are AVRO, Parquet. RCFile & ORC. There are some good documents available online that you may refer to if you want to compare the performance and space utilization of these file formats. Follows some useful links that will get you going.
This Blog Post
This link from MapR [They don't discuss Parquet though]
This link from Inquidia
The above given links will get you going. I hope this answer your query.
Thanks!

Related

Page level skip/read in apache parquet

Question: Does Parquet have the ability to skip/read certain pages in a column chunk based on the query we run?
Can page header metadata help here?
http://parquet.apache.org/documentation/latest/
Under File Format, I read this statement and it seemed doubtful
Readers are expected to first read the file metadata to find all the column chunks they are interested in. The columns chunks should then be read sequentially.

1 Billion records join(Filters) in Spark with Parquet file format vs HadoopText Input format

When reading a 1 Billion records of a table in Spark from Hive and this table have date and country columns as partitions. It is running for very long time since we are doing many transformations on it. If I change the Hive table file format to Parquet then will it be there any performance? Any suggestions on improvement of performance .
Change the Orc to Parquet maybe will not improve the performance.
But it depends of the type of data you have. If you are working with nested objects you need to use Parquet, Orc is not good for that.
But to create some improvement, I suggest you to do some steps that can help with your data in Hive.
Check the number of files in Hive.
One common thing that can create big problems in Hive Query is the number of files in each partition, and the size of these files are. If you are using Spark to store the data, I suggest you to check the size of the files and if they are stored with the size of your Hadoop block. If not, try to use the command CONCATENATE to solve that problem. As you can see here.
Predicate PushDown
This is what Hive, and Orc files can give you with the best performance in query the data. I suggest you to run one ANALYSE command to force the creation of the Statistics of your table, this will improve the performance and if the data is not efficient this will help. Check here and with this will update the Hive Metastore and will give you some relevant data information.
Ordered Data
If it is possible, try to store your data ordered by some column, and filter and do other stuffs in that column. Your join can be improved with this.

Apache Solr support for ORC file format

I have a bunch of tables in Hive, stored as ORC. I want to index their data in a SolrCloud collection.
Is there any support for indexing data stored in ORC format in Solr?
I've googled around but nothing came out.
Looks like you want SolR to read data from a specific Hive file format.
You might look at the problem the other way i.e. use Hive to write data to SolR -- and thus let Hive take care of the complexity of the actual input file format (whether ORC, Parquet, AVRO, whatever -- even HBase data files).
In the LucidWorks GitHub repo you will find a project labeled hive-solr. Have a look.
I'll accept Samson's answer.
Anyway, I'm not fully satisfied about this solution. In fact, now I still need to create an external table manually declaring all fields in the original table. In terms of operations, it is not different from creating a new table (stored ad textfile) starting from the original one, indexing the new text files and finally dropping them (of course, this may be a problem for very large tables, which is not my case).
Being ORC a self-describing format, it would be great for Solr to read both field names and data directly from the compressed files.

Hadoop Input Formats - Usage

I know different file formats in Hadoop ? By default hadoop uses text input format. what is advantage/disadvantage of using text input format.
What is advantage/disadvantage of avro over text input format.
Also please help me understand use case for different file formats(Avro, Sequence, TextInput, RCFile ).
I believe there are no advantages of Text as default other than its contents are human readable and friendly. You could easily view contents by issuing Hadoop fs -cat .
The disadvantages with Text format are
It takes more resources on disk, so would impact the production job efficiency.
Writing/Parsing the text records take more time
No option to maintain data types incase the text is composed from multiple columns.
The Sequence , Avro , RCFile format have very significant advantages over Text format.
Sequence - The key/value objects are directly stored in the binary format through the Hadoop's native serialization process by implementing Writable interface. The data types of the columns are very well maintained, and parsing the records with relevant data type also done easily. Obvoiusly it takes lesser space compared with Text due to the binary format.
Avro - Its a very compact binary storage format for hadoop key/value pairs, Reads/writes records through Avro serialization/deserialization. It is very similar to Sequence file format but also provides Language interoperability and cell versioning.
You may choose Avro over Sequence only if u need cell versioning or the data to be stored will used by few other applications written in different languages other than Java.Avro files can be processed by any languages like C, Ruby, Python, PHP, Java wherein Sequence files are specific only for Java.
RCFile - The Record Columnar File format is column oriented and it is a Hive specific storage format designed to make hive to support faster data load, reduce storage space.
Apart from this you may also consider the ORC and the Parquet file formats.

Amazon EMR JSON

I am using Amazon EMR Hadoop Hive for big data processing. Current data in my log files is in CSV format. In order to make the table from log files, I wrote regex expression to parse the data and store into different columns of external table. I know that SerDe can be used to read data in JSON format and this means that each log file line could be as JSON object. Are there any Hadoop performance advantages if my log files are in JSON format comparing CSV format.
If you can process the output of the table (that you created with the regexp) why do another processing? Try to avoid unnecessary stuff.
I think the main issue here is which format is faster to read. I believe CSV will provide better speed over JSON but don't take my word. Hadoop really doesn't care. It's all byte arrays to him, once in memory.

Resources