Parquet vs ORC vs ORC with Snappy - hadoop

I am running a few tests on the storage formats available with Hive, using Parquet and ORC as the major options. I included ORC once with default compression and once with Snappy.
I have read many documents that claim Parquet is better than ORC in time/space complexity, but my tests show the opposite of what those documents say.
Here are some details of my data:
Table A - Text file format - 2.5 GB
Table B - ORC - 652 MB
Table C - ORC with Snappy - 802 MB
Table D - Parquet - 1.9 GB
Parquet gave the worst compression for my table.
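For reference, a minimal sketch of how tables like B, C and D could be created from the text table (a hedged example only: table names follow the list above, it assumes a SparkSession with Hive support, and the same DDL works directly in the Hive CLI):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# ORC with its default codec (zlib)
spark.sql("CREATE TABLE table_b STORED AS ORC AS SELECT * FROM table_a")
# ORC with Snappy
spark.sql("""CREATE TABLE table_c STORED AS ORC
             TBLPROPERTIES ('orc.compress' = 'SNAPPY')
             AS SELECT * FROM table_a""")
# Parquet
spark.sql("CREATE TABLE table_d STORED AS PARQUET AS SELECT * FROM table_a")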
My tests with the above tables yielded the following results (the query shapes are sketched just after these numbers).
Row count operation
Text Format Cumulative CPU - 123.33 sec
Parquet Format Cumulative CPU - 204.92 sec
ORC Format Cumulative CPU - 119.99 sec
ORC with SNAPPY Cumulative CPU - 107.05 sec
Sum of a column operation
Text Format Cumulative CPU - 127.85 sec
Parquet Format Cumulative CPU - 255.2 sec
ORC Format Cumulative CPU - 120.48 sec
ORC with SNAPPY Cumulative CPU - 98.27 sec
Average of a column operation
Text Format Cumulative CPU - 128.79 sec
Parquet Format Cumulative CPU - 211.73 sec
ORC Format Cumulative CPU - 165.5 sec
ORC with SNAPPY Cumulative CPU - 135.45 sec
Selecting 4 columns from a given range using where clause
Text Format Cumulative CPU - 72.48 sec
Parquet Format Cumulative CPU - 136.4 sec
ORC Format Cumulative CPU - 96.63 sec
ORC with SNAPPY Cumulative CPU - 82.05 sec
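The four operations above were plain aggregation/projection queries of roughly this shape (a sketch only; the column names are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("SELECT COUNT(*) FROM table_b").show()    # row count
spark.sql("SELECT SUM(col1) FROM table_b").show()   # sum of a column
spark.sql("SELECT AVG(col1) FROM table_b").show()   # average of a column
spark.sql("""SELECT col1, col2, col3, col4
             FROM table_b
             WHERE col1 BETWEEN 0 AND 100000""").show()   # 4 columns over a range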
Does that mean ORC is faster than Parquet? Or is there something I can do to make it work better in terms of query response time and compression ratio?
Thanks!

I would say that both of these formats have their own advantages.
Parquet might be better if you have highly nested data, because it stores its elements as a tree, the way Google Dremel does (see here).
Apache ORC might be better if your file structure is flat.
And as far as I know, Parquet does not support indexes yet. ORC comes with a lightweight index and, since Hive 0.14, an additional Bloom filter, which can help improve query response time, especially for sum operations.
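For example, a sketch of an ORC table with a Bloom filter on a column that is filtered frequently (table and column names are placeholders; the same TBLPROPERTIES work in plain Hive DDL, and the builder assumes Hive support):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# ORC with Snappy plus a Bloom filter on 'id' (available since Hive 0.14)
spark.sql("""
    CREATE TABLE events_orc (id BIGINT, amount DOUBLE, category STRING)
    STORED AS ORC
    TBLPROPERTIES ('orc.compress'             = 'SNAPPY',
                   'orc.bloom.filter.columns' = 'id',
                   'orc.bloom.filter.fpp'     = '0.05')
""")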
The Parquet default compression is SNAPPY. Are tables A, B, C and D holding the same dataset? If so, something looks shady when it only compresses down to 1.9 GB.

You are seeing this because:
Hive has a vectorized ORC reader but no vectorized Parquet reader.
Spark has a vectorized Parquet reader and no vectorized ORC reader.
Spark performs best with Parquet; Hive performs best with ORC.
I've seen similar differences when running ORC and Parquet with Spark.
Vectorization means that rows are decoded in batches, dramatically improving memory locality and cache utilization.
(correct as of Hive 2.0 and Spark 2.1)
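A sketch of the relevant switches (these are the standard configuration keys; defaults vary by version):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# In Hive, ORC vectorization is toggled with:
#   SET hive.vectorized.execution.enabled = true;
# In Spark, the vectorized Parquet reader is controlled by this flag:
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")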

Both Parquet and ORC have their own advantages and disadvantages. I simply try to follow a simple rule of thumb: "How nested is your data, and how many columns are there?" If you follow the Google Dremel paper you can see how Parquet is designed: it uses a hierarchical, tree-like structure to store data, and the deeper the nesting, the deeper the tree.
ORC, on the other hand, is designed for a flat file store. So if your data is flat with fewer columns, you can go with ORC; otherwise Parquet would be fine for you. Compression on flat data works amazingly well in ORC.
We did some benchmarking with a larger flat file, converted it to a Spark DataFrame, stored it in both Parquet and ORC format in S3, and queried it with Redshift Spectrum.
Size of the file in Parquet: ~7.5 GB, and it took 7 minutes to write
Size of the file in ORC: ~7.1 GB, and it took 6 minutes to write
Queries seemed faster on the ORC files.
Soon we will do some benchmarking with nested data and update the results here.
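A minimal sketch of that kind of write comparison (the source table name and bucket paths are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.table("flat_source_table")  # placeholder for the flattened dataset

# Write the same DataFrame once per format, then compare size and write time.
df.write.mode("overwrite").parquet("s3a://my-bucket/bench/parquet/")
df.write.mode("overwrite").orc("s3a://my-bucket/bench/orc/")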

We did some benchmarks comparing the different file formats (Avro, JSON, ORC, and Parquet) in different use cases.
https://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
The data is all publicly available and the benchmark code is all open source at:
https://github.com/apache/orc/tree/branch-1.4/java/bench

Both of them have their advantages. We use Parquet at work together with Hive and Impala, but I just wanted to point out a few advantages of ORC over Parquet: during long-running queries, when Hive queries ORC tables, GC is called about 10 times less frequently. That might be nothing for many projects, but might be crucial for others.
ORC also takes much less time when you need to select just a few columns from the table. Some other queries, especially those with joins, also take less time because of vectorized query execution, which is not available for Parquet.
Also, ORC compression is sometimes a bit random, while Parquet compression is much more consistent. It looks like when an ORC table has many numeric columns, it doesn't compress as well. This affects both zlib and snappy compression.
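If you want to check that on your own data, a simple sketch is to write the same DataFrame with both ORC codecs and compare the sizes on disk (the table name and paths are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.table("some_table")  # placeholder

df.write.mode("overwrite").option("compression", "zlib").orc("/tmp/orc_zlib")
df.write.mode("overwrite").option("compression", "snappy").orc("/tmp/orc_snappy")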

The default file format for Spark is Parquet, but for Hive it is ORC.
As far as I know (maybe I'm wrong), the compression ratio with zlib is higher than with Snappy, but it requires more CPU. Snappy, on the other hand, is a decent compression format when you don't want too much CPU consumption.
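In Spark, that trade-off can be made explicit per format through the session-level codec settings, for example (a sketch using the standard configuration keys):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# zlib: better ratio, more CPU; snappy: lighter on CPU, slightly larger files
spark.conf.set("spark.sql.orc.compression.codec", "zlib")
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")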
I haven't tried the Parquet API to write/read files, but I have some experience doing that with ORC. The ORC format is great, but it has what seems to be a bottleneck when you're trying to write files at the same time from different threads of the same JVM process. It also has some memory problems. I had to make some minor changes to the classes
org.apache.hadoop.hive.ql.io.orc.MemoryManager
org.apache.hadoop.hive.ql.io.orc.WriterImpl
in order to make it work better and faster (HDP 2.6.4.0).
As previous answers have said, it all depends on your data structure, the API or framework you're using to read the data, and what you're trying to do with that data.
ORC files have statistics at different levels (file, stripe and row group) that can greatly improve performance when you're filtering data, for example.
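For instance, with ORC filter pushdown enabled, those statistics let the reader skip stripes that cannot match the predicate (a sketch; the path and column are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.conf.set("spark.sql.orc.filterPushdown", "true")
# Stripe/row-group statistics allow skipping data that cannot satisfy the filter.
spark.read.orc("/data/events_orc").where("amount > 1000").count()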
ORC also has some optimizations when writing if your columns contain null values or if the same value repeats often.
A benchmark is not that useful when what you're really trying to do has nothing to do with what the benchmark is testing.

Related

How to improve performance of csv to parquet file format using pyspark?

I have a large dataset I need to convert from CSV to Parquet format using PySpark. There is approximately 500 GB of data scattered across thousands of CSV files. My initial implementation is simplistic ...
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("test") \
    .getOrCreate()

# Read every CSV file (inferring the schema) and write a single Parquet output.
df = spark.read.csv(input_files, header=True, inferSchema=True)
df.repartition(1).write.mode('overwrite').parquet(output_dir)
The performance is abysmal; I have let it run for 2+ hours before giving up. From the logging output I infer it does not even finish reading the CSV files into the DataFrame.
I am running Spark locally on a server with 128 high-performance CPU cores and 1 TB of memory. Disk storage is SSD based, with confirmed read speeds of 650 MB/s. My intuition is that I should be able to significantly improve performance given the computing resources available. I'm looking for tips on how to do this.
I have tried...
not inferring the schema; this did not produce a noticeable difference in performance (the schema is four columns of text)
using the configuration setting spark.executor.cores to match the number of physical cores on my server. The setting did not seem to have any effect; I did not observe the system using more cores.
I'm stuck using pyspark for now per management direction, but if necessary I can convince them to use a different tool.
Some suggestions based on my experience working with Spark:
You should not infer the schema when you are dealing with huge data. It might not show a significant improvement in performance, but it will definitely still save you some time.
Don't use repartition(1), as it shuffles the data and creates a single partition holding everything, which is exactly what you don't want with the huge volume of data you have. I would suggest increasing the number of partitions if possible, based on your cluster configuration, in order to get the Parquet files saved faster.
Don't cache/persist your DataFrame if you are just reading the CSV files and then saving them as Parquet files in the next step. It can increase your save time, as caching itself takes some time. Caching the DataFrame would have helped if you were performing multiple transformations and then multiple actions on it; you are performing only one action, writing the DataFrame as Parquet files, so in my opinion you should not cache it. A rough sketch putting these suggestions together is below.
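This sketch combines the suggestions above (the schema, column names and paths are placeholders, not the asker's actual values):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

input_files = "/data/csv/*.csv"   # placeholder
output_dir = "/data/parquet/"     # placeholder

# 'local' alone runs a single worker thread; 'local[*]' uses every core on the box.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("csv_to_parquet") \
    .getOrCreate()

# Explicit schema (four text columns) instead of inferSchema, which costs an extra pass.
schema = StructType([StructField(name, StringType(), True)
                     for name in ["col1", "col2", "col3", "col4"]])

df = spark.read.csv(input_files, header=True, schema=schema)

# No repartition(1) and no caching: keep the natural parallelism for the write.
df.write.mode("overwrite").parquet(output_dir)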
Some possible improvements:
Don't use .repartition(1), as you lose parallelism for the write operation.
Persist/cache the DataFrame before writing: df.persist()
If you really need to save it as one Parquet file, you can first write to a temp folder without reducing partitions, then use coalesce in a second write operation:
df = spark.read.csv(input_files, header=True, inferSchema=True).persist()
# ....
# First write keeps the original parallelism across partitions.
df.write.mode('overwrite').parquet("/temp/folder")
df.unpersist()
# Re-read the converted data and collapse it into a single output file.
df1 = spark.read.parquet("/temp/folder")
df1.coalesce(1).write.mode('overwrite').parquet(output_dir)

Storing data in HBase vs Parquet files

I am new to big data and am trying to understand the various ways of persisting and retrieving data.
I understand that both Parquet and HBase are column-oriented storage formats, but Parquet is file-oriented storage and not a database, unlike HBase.
My questions are:
What is the use case for using Parquet instead of HBase?
Is there a use case where Parquet can be used together with HBase?
When performing joins, will Parquet be more performant than HBase (say, accessed through a SQL skin like Phoenix)?
As you have already said in the question, Parquet is just storage, while HBase is storage (HDFS) + a query engine (API/shell), so a fair comparison should be between Parquet + Impala/Hive/Spark and HBase. Below are the key differences:
1) Disk space - Parquet takes less disk space than HBase. Parquet encoding saves more space than block compression in HBase.
2) Data ingestion - Data ingestion in Parquet is more efficient than in HBase. A simple reason is point 1: in the case of Parquet, less data needs to be written to disk.
3) Record lookup by key - HBase is faster, as it is a key-value store, while Parquet is not. Indexing in Parquet will be supported in a future release.
4) Filters and other scan queries - Since Parquet stores more information about the records in a row group, it can skip a lot of records while scanning the data. This is why it's faster than HBase here.
5) Updating records - HBase supports record updates, while this can be problematic with Parquet, as the Parquet files need to be rewritten. Careful design of the schema and partitioning may improve updates, but it's not comparable with HBase.
Comparing the above features, HBase seems more suitable for situations where updates are required and queries mainly involve key-value lookups. Queries involving key range scans will also perform better in HBase.
Parquet is suitable for use cases where updates are very few and queries involve filters, joins and aggregations.

Parquet file format performance issue with map reduce

I have 2.1 TB of uncompressed data which I am loading into 2 tables, both Snappy compressed, one using the Parquet file format and the other using the ORC file format. While creating the Parquet table, I am keeping the HDFS block size the same as parquet.block.size.
I am observing that my MapReduce queries perform very poorly with Parquet compared to ORC, by a large margin. These are aggregate queries; ORC takes under a minute, whereas Parquet takes more than 5-6 minutes. When I use the Tez execution engine, the performance is comparable.
I am using the HDP 2.5.x distribution.
Has anyone faced a similar issue? Any hints on improving the performance with MR alone?

Choosing File Format in hadoop

Folks,
What are the recommended file formats to use in the different phases of Hadoop processing?
Processing: I have been using the text format / JSON SerDe in Hive to do the processing. Is this a good format for a staging table where I perform the ETL (transformation) operations? Are there better formats I should be using?
I know Parquet / ORC / Avro are specialized formats, but do they fit well for ETL (transformation) operations? Also, if I use a compression technique such as Snappy or Zlib, would that be a recommended approach? (I don't want to lose performance due to the extra CPU utilization caused by compression; correct me if compression would actually give better performance.)
Reporting: depending upon my query needs
Aggregation:
Using columnar storage seems to be a logical solution. Is Parquet with Snappy compression a good fit (assuming my Hadoop distribution is Cloudera)?
Complete row fetch:
If my query pattern needs all the columns in a row, would choosing columnar storage be a wise decision? Or should I choose the Avro file format?
Archive: For archiving data I plan to use Avro, as it handles schema evolution with good compression.
Choosing the file format depends on the use case.
You are processing data in Hive, hence the recommendations below.
Processing: Use ORC for processing, since you are doing aggregations and other column-level operations. It will help increase performance many-fold.
Compression: Using it wisely on a case-by-case basis will help increase performance by reducing expensive I/O time.
If the use case involves row-based operations, then using Avro is recommended.
Hope this helps in making a decision.

Hive: tremendous data size increase when converting Avro to Parquet

I wanted to convert one day's Avro data (~2 TB) to Parquet.
I ran a Hive query and the data was successfully converted to Parquet.
But the data size became 6 TB.
What could have happened to make the data three times the size?
Typically, Parquet can be more efficient than Avro: since it's a columnar format, columns of the same type are adjacent on disk, which allows compression algorithms to be more effective in some cases. Typically we use Snappy, which is sufficient, easy on the CPU, and has several properties that make it suitable for Hadoop compared to other compression methods like zip or gzip. Mainly, Snappy inside block-based formats is splittable, and each block retains the information necessary to determine the schema. Parquet is a great format and we have been very happy with query performance after moving from Avro (and we can also use Impala, which is super fast).
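If a converted table comes out larger than the Avro source, it is worth confirming which codec was actually applied during the conversion. A sketch in Spark (paths are placeholders; reading Avro assumes the spark-avro module is on the classpath, and in Hive the analogous knob is the parquet.compression table property):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Make the Parquet codec explicit for the conversion.
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

# Requires the spark-avro module, e.g. --packages org.apache.spark:spark-avro_2.12:<version>
spark.read.format("avro").load("/data/events_avro/") \
     .write.mode("overwrite").parquet("/data/events_parquet/")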
