Choosing File Format in Hadoop

Folks,
What are the recommended file formats for the different phases of Hadoop processing?
Processing: I have been using the text format / JSON SerDe in Hive for processing. Is this a good format for a staging table where I perform the ETL (transformation) operations? Are there better formats I should be using?
I know Parquet / ORC / Avro are specialized formats, but do they fit well for ETL (transformation) operations? Also, would using a compression codec such as Snappy or Zlib be a recommended approach? (I don't want to lose performance to the extra CPU utilization that compression causes. Correct me if compression would actually give better performance.)
Reporting: Depending on my query needs
Aggregation:
Using columnar storage seems to be the logical solution. Is Parquet with Snappy compression a good fit (assuming my Hadoop distribution is Cloudera)?
Complete row fetch:
If my query pattern needs all columns in a row, would choosing columnar storage be a wise decision? Or should I choose the Avro file format?
Archive: For archiving data I plan to use Avro, as it handles schema evolution with good compression.

The choice of file format depends on the use case.
Since you are processing data in Hive, here are my recommendations.
Processing: Use ORC for processing, as you are using aggregation and other column-level operations. It can improve performance many-fold.
Compression: Using it wisely, on a case-by-case basis, helps performance by reducing the time spent on expensive I/O operations.
If the use case is row-based operations, then Avro is recommended.
I hope this helps you decide.

Related

Storing data in HBase vs Parquet files

I am new to big data and am trying to understand the various ways of persisting and retrieving data.
I understand that both Parquet and HBase are column-oriented storage formats, but Parquet is file-oriented storage and, unlike HBase, not a database.
My questions are:
What is the use case for using Parquet instead of HBase?
Is there a use case where Parquet can be used together with HBase?
When performing joins, will Parquet perform better than HBase (say, accessed through a SQL skin like Phoenix)?

As you already said in the question, Parquet is a storage format, while HBase is storage (HDFS) plus a query engine (API/shell). So a valid comparison is between Parquet + Impala/Hive/Spark and HBase. The key differences are below.
1) Disk space - Parquet takes less disk space than HBase. Parquet encoding saves more space than block compression in HBase.
2) Data ingestion - Ingesting data into Parquet is more efficient than into HBase. A simple reason could be point 1: with Parquet, less data needs to be written to disk.
3) Record lookup on key - HBase is faster, as it is a key-value store while Parquet is not. Indexing in Parquet will be supported in a future release.
4) Filter and other scan queries - Since Parquet stores more information about the records in each row group, it can skip a lot of records while scanning the data. This is why it is faster than HBase.
5) Updating records - HBase supports record updates, while this can be problematic in Parquet because the Parquet files need to be rewritten. Careful schema design and partitioning may improve updates, but it is not comparable with HBase.
Comparing the features above, HBase seems more suitable for situations where updates are required and queries mainly involve key-value lookups. Queries involving key range scans will also perform better in HBase.
Parquet is suitable for use cases where updates are very few and queries involve filters, joins and aggregations.
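To make point 4 a bit more concrete, here is a minimal sketch of how per-row-group min/max statistics let a scan skip data. This is plain Java and not the Parquet API; the class and method names (RowGroupSkippingSketch, RowGroup, scanEquals) are made up for illustration, and a real reader works on encoded pages and richer metadata.

```java
import java.util.ArrayList;
import java.util.List;

final class RowGroupSkippingSketch {

    /** A hypothetical row group: the values of one column plus its min/max statistics. */
    static final class RowGroup {
        final int[] values;
        final int min;
        final int max;

        RowGroup(int[] values) {
            this.values = values;
            int lo = Integer.MAX_VALUE;
            int hi = Integer.MIN_VALUE;
            for (int v : values) {
                lo = Math.min(lo, v);
                hi = Math.max(hi, v);
            }
            this.min = lo;
            this.max = hi;
        }
    }

    /** Result of a filtered scan: matching values and how many groups were actually read. */
    static final class ScanResult {
        final List<Integer> matches = new ArrayList<>();
        int groupsRead = 0;
    }

    /** Returns values equal to target, reading only groups whose [min, max] range can contain it. */
    static ScanResult scanEquals(List<RowGroup> groups, int target) {
        ScanResult result = new ScanResult();
        for (RowGroup g : groups) {
            if (target < g.min || target > g.max) {
                continue; // statistics prove no row can match, so the whole group is skipped
            }
            result.groupsRead++;
            for (int v : g.values) {
                if (v == target) {
                    result.matches.add(v);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<RowGroup> groups = List.of(
            new RowGroup(new int[] {1, 5, 9}),
            new RowGroup(new int[] {20, 25, 30}),
            new RowGroup(new int[] {41, 42, 42}));
        ScanResult r = scanEquals(groups, 42);
        // Only the third group overlaps the value 42, so the other two are never read.
        System.out.println("groups read: " + r.groupsRead + ", matches: " + r.matches.size());
    }
}
```

The skipping only pays off when the filtered column is reasonably clustered, for example when the data is sorted or partitioned on it. With randomly distributed values, every group's range tends to overlap the predicate and little is skipped.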

Parquet file format performance issue with map reduce

I have 2.1 TB of uncompressed data which I am loading into 2 tables, both Snappy compressed, but one uses the Parquet file format and the other uses the ORC file format. When creating the Parquet table, I keep the HDFS block size the same as parquet.block.size.
I am observing that my MapReduce queries perform very poorly with Parquet compared to ORC, by a large margin. These are aggregate queries: ORC takes under a minute, whereas Parquet takes more than 5-6 minutes. When I use the Tez execution engine, the performance is comparable.
I am using an HDP 2.5.x version of the distribution.
Has anyone faced a similar issue, and are there any hints on improving the performance with MR alone?

Hive Tremendous data size increase from converting avro to parquet

I wanted to convert one day's Avro data (~2 TB) to Parquet.
I ran a Hive query and the data was successfully converted to Parquet.
But the data size became 6 TB.
What could have caused the data to triple in size?

Typically, Parquet can be more efficient than Avro: because it is a columnar format, values of the same column (and therefore the same type) are adjacent on disk, which lets compression algorithms be more effective in some cases. We typically use Snappy, which is sufficient, easy on the CPU, and has several properties that make it suitable for Hadoop relative to other compression methods such as zip or gzip. Mainly, Snappy-compressed Parquet and Avro files remain splittable, because compression is applied per block inside the file rather than to the file as a whole. Parquet is a great format, and we have been very happy with query performance after moving from Avro (we can also use Impala, which is super fast).
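A small sketch of why adjacency matters for encoding. Run-length encoding is one of the encodings Parquet uses, and it stores one entry per run of equal adjacent values, so fewer runs means less to store. The code below is plain Java, not Parquet code; the class name and sample data are made up, and it only counts runs instead of actually encoding anything.

```java
final class RunLengthLayoutSketch {

    /** Counts runs of equal adjacent values; run-length encoding stores one entry per run. */
    static int countRuns(String[] values) {
        if (values.length == 0) {
            return 0;
        }
        int runs = 1;
        for (int i = 1; i < values.length; i++) {
            if (!values[i].equals(values[i - 1])) {
                runs++;
            }
        }
        return runs;
    }

    /** Row layout: the fields of each row are stored next to each other. */
    static String[] rowMajor(String[][] rows) {
        int cols = rows.length == 0 ? 0 : rows[0].length;
        String[] out = new String[rows.length * cols];
        int k = 0;
        for (String[] row : rows) {
            for (String field : row) {
                out[k++] = field;
            }
        }
        return out;
    }

    /** Columnar layout: all values of one column are stored next to each other. */
    static String[] columnMajor(String[][] rows) {
        int cols = rows.length == 0 ? 0 : rows[0].length;
        String[] out = new String[rows.length * cols];
        int k = 0;
        for (int c = 0; c < cols; c++) {
            for (String[] row : rows) {
                out[k++] = row[c];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample table with two columns: country and status.
        String[][] rows = {
            {"US", "active"}, {"US", "active"}, {"US", "inactive"},
            {"DE", "active"}, {"DE", "active"}, {"DE", "active"}
        };
        System.out.println("runs, row layout:      " + countRuns(rowMajor(rows)));
        System.out.println("runs, columnar layout: " + countRuns(columnMajor(rows)));
    }
}
```

This is only the "typical" case. If the data has high-cardinality or poorly clustered columns, or if the conversion job wrote the Parquet files without a compression codec while the Avro source was compressed, the Parquet output can come out larger, which would be worth checking for the 2 TB to 6 TB case in the question.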

Difference between external and internal tables performance?

I want to create a table with static data such as country codes and names in HDFS. I will use a csv to load the data into the system. It doesn't matter if I drop the table and the data because this is information you can easily find on the Internet.
Are there any performance considerations regarding external/internal tables for this type of data? Should I stick with external tables, as everyone in this post says?

As Stephen ODonnell pointed out in the comments, internal/external is really more about the location of the data and what manages it.
I would say there are other important performance factors to consider, for example the table format and whether or not compression is to be used.
(The following is from an HDP perspective; for Cloudera the general concept is the same, but the specifics would probably differ.)
For example, you could define the table as being in ORC Format, which offers many optimizations, such as predicate pushdown that allows rows to be optimized out at the storage layer before they are even added into the SQL processing layer. More details on that.
Another option would be whether or not you want to specify compression, such as Snappy, a compression algorithm which balances speed and compression ratio (see ORC link above for more info).
Generally speaking, I treat the HDFS data as a source, and Sqoop it into Hive as a managed (internal) table with ORC format and Snappy compression enabled. I find that this provides good performance, with the added benefit that any ETL can be done on this data without regard for the original source data in HDFS, since it was copied into Hive during the Sqoop import.
This does of course require extra space, which may be a consideration depending on your environment and/or specific use case.

Avro vs. Parquet

I'm planning to use one of the Hadoop file formats for my Hadoop-related project. I understand Parquet is efficient for column-based queries and Avro for full scans or when we need all the columns' data.
Before I proceed and choose one of the file formats, I want to understand the disadvantages/drawbacks of one over the other. Can anyone explain it to me in simple terms?

Avro is a row-based format. If you want to retrieve the data as a whole, you can use Avro.
Parquet is a column-based format. If your data consists of a lot of columns but you are interested in only a subset of them, you can use Parquet.
HBase is useful when frequent updating of data is involved. Avro is fast in retrieval, Parquet is much faster.

If you haven't already decided, I'd go ahead and write Avro schemas for your data. Once that's done, choosing between Avro container files and Parquet files is about as simple as swapping out e.g.,
job.setOutputFormatClass(AvroKeyOutputFormat.class);
AvroJob.setOutputKeySchema(MyAvroType.getClassSchema());
for
job.setOutputFormatClass(AvroParquetOutputFormat.class);
AvroParquetOutputFormat.setSchema(job, MyAvroType.getClassSchema());
The Parquet format does seem to be a bit more computationally intensive on the write side--e.g., requiring RAM for buffering and CPU for ordering the data etc. but it should reduce I/O, storage and transfer costs as well as make for efficient reads especially with SQL-like (e.g., Hive or SparkSQL) queries that only address a portion of the columns.
In one project, I ended up reverting from Parquet to Avro containers because the schema was too extensive and nested (being derived from some fairly hierarchical object-oriented classes) and resulted in 1000s of Parquet columns. In turn, our row groups were really wide and shallow which meant that it took forever before we could process a small number of rows in the last column of each group.
I haven't had much chance to use Parquet for more normalized/sane data yet but I understand that if used well, it allows for significant performance improvements.

Avro
Widely used as a serialization platform
Row-based, offers a compact and fast binary format
Schema is encoded on the file so the data can be untagged
Files support block compression and are splittable
Supports schema evolution
Parquet
Column-oriented binary file format
Uses the record shredding and assembly algorithm described in the Dremel paper
Each data file contains the values for a set of rows
Efficient in terms of disk I/O when specific columns need to be queried
From Choosing an HDFS data storage format- Avro vs. Parquet and more
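As a rough back-of-the-envelope illustration of the disk I/O point, the sketch below compares how many bytes have to be touched to read one column out of four. It is plain Java with made-up names and column widths, and it assumes fixed-width columns and ignores compression, encoding and file metadata, so treat the numbers as an order-of-magnitude picture only.

```java
final class ColumnPruningSketch {

    /** Bytes touched when whole rows must be read, whichever columns the query needs. */
    static long bytesReadRowLayout(int[] columnWidths, long rowCount) {
        long rowWidth = 0;
        for (int w : columnWidths) {
            rowWidth += w;
        }
        return rowWidth * rowCount;
    }

    /** Bytes touched when only the requested columns are read (columnar layout). */
    static long bytesReadColumnarLayout(int[] columnWidths, long rowCount, int... wantedColumns) {
        long total = 0;
        for (int c : wantedColumns) {
            total += (long) columnWidths[c] * rowCount;
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical table: four fixed-width columns of 8, 32, 4 and 64 bytes.
        int[] widths = {8, 32, 4, 64};
        long rows = 1_000_000L;
        // The query only aggregates over the third column (index 2).
        System.out.println("row layout:      " + bytesReadRowLayout(widths, rows) + " bytes");
        System.out.println("columnar layout: " + bytesReadColumnarLayout(widths, rows, 2) + " bytes");
    }
}
```

When a query needs every column, the two totals are the same, which is why a row format such as Avro is not at a disadvantage for full-row reads.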

Both Avro and Parquet are "self-describing" storage formats, meaning that both embed data, metadata information and schema when storing data in a file.
Which of the two storage formats to use depends on the use case. Three aspects form the basis for choosing the format that is optimal in your case:
Read/Write operation: Parquet is a column-based file format. It supports indexing. Because of that it is suitable for write-once and read-intensive, complex or analytical querying, low-latency data queries. This is generally used by end users/data scientists.
Meanwhile Avro, being a row-based file format, is best used for write-intensive operation. This is generally used by data engineers. Both support serialization and compression formats, although they do so in different ways.
Tools: Parquet is a good fit for Impala. (Impala is a Massively Parallel Processing (MPP) SQL query engine which knows how to operate on data that resides in one or a few external storage engines.) Again, Parquet lends itself well to complex/interactive querying and fast (low-latency) outputs over data in HDFS. This is supported by CDH (Cloudera Distribution Hadoop). Hadoop also supports the Apache Optimized Row Columnar (ORC) format (the selection depends on the Hadoop distribution), whereas Avro is best suited to Spark processing.
Schema Evolution: Evolving a DB schema means changing the DB's structure, therefore its data, and thus its query processing. Both Parquet and Avro support schema evolution, but to varying degrees.
Parquet is good for 'append' operations, e.g. adding columns, but not for renaming columns unless 'read' is done by index.
Avro is better suited for appending, deleting and generally mutating columns than Parquet. Historically Avro has provided a richer set of schema evolution possibilities than Parquet, and although their schema evolution capabilities tend to blur, Avro still shines in that area, when compared to Parquet.
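To illustrate the kind of schema evolution described above, here is a simplified sketch of name-based resolution in the spirit of Avro: a field that exists in the reader schema but not in the file takes the reader's default, and a written field the reader does not list is ignored. This is plain Java and not the Avro API; the class name, method name and sample fields are made up, and real Avro resolution also covers aliases and type promotion.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SchemaResolutionSketch {

    /**
     * Resolves a written record against a reader schema by field name.
     * readerDefaults maps each reader field name to its default value.
     */
    static Map<String, Object> resolve(Map<String, Object> written,
                                       Map<String, Object> readerDefaults) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : readerDefaults.entrySet()) {
            String name = field.getKey();
            // Field present in the file: use the written value. Missing: use the reader default.
            out.put(name, written.containsKey(name) ? written.get(name) : field.getValue());
        }
        // Written fields that the reader schema does not list are simply not copied.
        return out;
    }

    public static void main(String[] args) {
        // Record written with an old schema: it has "legacy" but no "country".
        Map<String, Object> written = new LinkedHashMap<>();
        written.put("id", 7);
        written.put("legacy", "x");

        // Newer reader schema: "legacy" was deleted, "country" was added with a default.
        Map<String, Object> reader = new LinkedHashMap<>();
        reader.put("id", 0);
        reader.put("country", "unknown");

        System.out.println(resolve(written, reader));
    }
}
```

Because fields are matched by name, adding and deleting columns is cheap, and old files stay readable without being rewritten.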

Your understanding is right. In fact, we ran into a similar situation during a data migration in our DWH. We chose Parquet over Avro because the disk savings we got were almost double what we got with Avro. The query processing time was also much better than with Avro. But yes, our queries were based on aggregation, column-based operations and so on, hence Parquet was predictably a clear winner.
We are using Hive 0.12 from the CDH distro. You mentioned you are running into issues with Hive + Parquet; what are they? We did not encounter any.

Silver Blaze described it nicely with an example use case and explained how Parquet was the best choice for him. It makes sense to consider one over the other depending on your requirements. I am also adding a brief description of the other file formats, along with a time and space comparison. Hope that helps.
There are a number of file formats that you can use in Hive. Notable mentions are Avro, Parquet, RCFile and ORC. There are some good documents available online that you can refer to if you want to compare the performance and space utilization of these file formats. Here are some useful links to get you going.
This Blog Post
This link from MapR [They don't discuss Parquet though]
This link from Inquidia
The links above will get you going. I hope this answers your query.
Thanks!
