Difference between Apache Parquet and Arrow

I'm looking for a way to speed up my memory-intensive frontend visualization app. I saw some people recommend Apache Arrow, but while looking into it I got confused about the difference between Parquet and Arrow.
They are both columnar data formats. Originally I thought Parquet was for disk and Arrow was the in-memory format. However, I just learned that you can also save Arrow to files on disk, like abc.arrow. In that case, what's the difference? Aren't they doing the same thing?

Parquet is a columnar file format for data serialization. Reading a Parquet file requires decompressing and decoding its contents into some kind of in-memory data structure. It is designed to be space/IO-efficient at the expense of CPU utilization for decoding. It does not provide any data structures for in-memory computing. Parquet is a streaming format which must be decoded from start to end; while some "index page" facilities have been added to the storage format recently, random access operations are in general costly.
Arrow on the other hand is first and foremost a library providing columnar data structures for in-memory computing. When you read a Parquet file, you can decompress and decode the data into Arrow columnar data structures, so that you can then perform analytics in-memory on the decoded data. Arrow columnar format has some nice properties: random access is O(1) and each value cell is next to the previous and following one in memory, so it's efficient to iterate over.
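To make the "O(1) random access, contiguous in memory" point concrete, here is a minimal sketch using plain Java arrays as stand-ins for columnar buffers. It is not the Arrow API; the class and method names are hypothetical and only illustrate the layout idea.

```java
// Illustrative only: plain Java arrays standing in for columnar buffers.
// This is NOT the Arrow API; class and field names are hypothetical.
final class ColumnarSketch {
    // One contiguous array per column ("struct of arrays").
    final long[] timestamps;
    final double[] values;

    ColumnarSketch(long[] timestamps, double[] values) {
        if (timestamps.length != values.length) {
            throw new IllegalArgumentException("columns must have equal length");
        }
        this.timestamps = timestamps;
        this.values = values;
    }

    // Random access to one cell is a single array index: O(1).
    double valueAt(int row) {
        return values[row];
    }

    // Scanning one column only walks that column's contiguous memory.
    double sumValues() {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        ColumnarSketch table = new ColumnarSketch(
                new long[] {1L, 2L, 3L},
                new double[] {1.5, 2.5, 4.0});
        System.out.println(table.valueAt(1));
        System.out.println(table.sumValues());
    }
}
```

A row-oriented layout would instead store one object per row, so summing a single field means hopping between objects rather than walking one array.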
What about "Arrow files" then? Apache Arrow defines a binary "serialization" protocol for arranging a collection of Arrow columnar arrays (called a "record batch") that can be used for messaging and interprocess communication. You can put the protocol anywhere, including on disk, which can later be memory-mapped or read into memory and sent elsewhere.
This Arrow protocol is designed so that you can "map" a blob of Arrow data without doing any deserialization, so performing analytics on Arrow protocol data on disk can use memory-mapping and pay effectively zero cost. The protocol is used for many things, such as streaming data between Spark SQL and Python for running pandas functions against chunks of Spark SQL data; these are called "pandas UDFs".
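The memory-mapping idea can be sketched with the Java standard library alone. This is not the Arrow IPC format: it just writes fixed-width integers to a file and reads one back through a mapping by offset, with no decoding step. The class name, file prefix and helper methods are assumptions made for the illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: a file whose bytes already have the in-memory layout.
final class MmapSketch {
    // Writes ints as fixed-width 4-byte values, so value i lives at byte offset 4 * i.
    static void writeInts(Path file, int[] data) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(data.length * Integer.BYTES);
        for (int v : data) {
            buf.putInt(v);
        }
        Files.write(file, buf.array());
    }

    // Maps the file and reads one value by offset; there is no parsing or decoding step.
    static int readIntAt(Path file, int index) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer mapped = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            return mapped.getInt(index * Integer.BYTES);
        }
    }

    // Convenience wrapper: write to a temp file, then read one value back via the mapping.
    static int roundTrip(int[] data, int index) {
        try {
            Path file = Files.createTempFile("column", ".bin");
            file.toFile().deleteOnExit();
            writeInts(file, data);
            return readIntAt(file, index);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(new int[] {10, 20, 30, 40}, 2));
    }
}
```

The operating system pages in only the bytes that are touched, which is why reading a few values from a large mapped file can be cheap.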
In some applications, Parquet and Arrow can be used interchangeably for on-disk data serialization. Some things to keep in mind:
Parquet is intended for "archival" purposes, meaning if you write a file today, we expect that any system that says it can "read Parquet" will be able to read the file in 5 years or 7 years. We are not yet making this assertion about long-term stability of the Arrow format (though we might in the future).
Parquet is generally a lot more expensive to read because it must be decoded into some other data structure. Arrow protocol data can simply be memory-mapped.
Parquet files are often much smaller than Arrow-protocol-on-disk because of the data encoding schemes that Parquet uses. If your disk storage or network is slow, Parquet is going to be a better choice.
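As a rough illustration of why an encoded column can be much smaller but has to be decoded before use, here is a toy run-length encoder in Java. Run-length encoding is one of the schemes Parquet uses, but this sketch is not Parquet's actual implementation or on-disk layout; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy run-length encoding; not Parquet's real encoder.
final class RunLengthSketch {
    // Encodes values as (value, runLength) pairs flattened into one list.
    static List<Integer> encode(int[] values) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < values.length) {
            int run = 1;
            while (i + run < values.length && values[i + run] == values[i]) {
                run++;
            }
            out.add(values[i]);
            out.add(run);
            i += run;
        }
        return out;
    }

    // Every run has to be expanded before the data is usable: this is the CPU cost of reading.
    static int[] decode(List<Integer> encoded) {
        int total = 0;
        for (int k = 1; k < encoded.size(); k += 2) {
            total += encoded.get(k);
        }
        int[] out = new int[total];
        int pos = 0;
        for (int k = 0; k < encoded.size(); k += 2) {
            int value = encoded.get(k);
            int run = encoded.get(k + 1);
            for (int r = 0; r < run; r++) {
                out[pos++] = value;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] column = {7, 7, 7, 7, 3, 3, 9};
        System.out.println(encode(column)); // prints [7, 4, 3, 2, 9, 1]
    }
}
```

A column with long runs shrinks a lot, while random access into the encoded form means walking the runs, which matches the trade-off described above.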
So, in summary, Parquet files are designed for disk storage, Arrow is designed for in-memory (but you can put it on disk, then memory-map later). They are intended to be compatible with each other and used together in applications.
For a memory-intensive frontend app I might suggest looking at the Arrow JavaScript (TypeScript) library.

Related

How/Where can I write time series data? As Parquet format to Hadoop, or HBase, Cassandra?

I have real-time time series sensor data. My primary goal is to keep the raw data. I should do this so that the cost of storage is minimal.
My scenario is like this:
All sensors produce time series data, and I must save this raw time series data for batch analysis. The Parquet format is great for lower storage cost. But does it make sense if each incoming piece of time series data is written in Parquet format?
On the other hand, I want to process each incoming time series data in real time. For real-time scenario; I can use Kafka. But, can Hbase or Cassandra be used for both batch and real-time analysis instead of Kafka?
If I use Cassandra, how can I do batch analysis?
But, can Hbase or Cassandra be used for both batch and real-time analysis instead of Kafka?
Think of Kafka as a pipe into these stores. It is not something you replace with either of them. HBase and Cassandra are stores, and you need to "batch" the data out of them... You would use Kafka Streams (or Spark, Flink, or my personal favorite NiFi) for actual (near) real-time processing before these systems.
I would suggest using Kafka rather than have point-to-point metrics into Hadoop (or related tools). I would also encourage using something meant for such data like TimescaleDB, CrateDB or InfluxDB, maybe Prometheus with some modification to the infrastructure... You could use Kafka for ingestion into both Hadoop and these other tools that are better tuned to store such datasets (which is the benefit of "buffering" the data in Kafka first)
does it make sense if each incoming time series data are written as a parquet format?
Sure, if you want to store lots of data for large batch analysis. But if you window your stream into hourly data points and compute sums and averages, for example, then do you really need to store each and every data point?
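For illustration, hourly windowing with sums and averages could look roughly like the sketch below. It is plain Java rather than Kafka Streams, Spark or Flink, and the class, field and method names are assumptions made for the example.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of tumbling one-hour windows over sensor readings.
final class HourlyWindowSketch {
    static final long HOUR_MS = 3_600_000L;

    // Running sum and count for one window; the average is derived on demand.
    static final class Stats {
        double sum;
        long count;

        double average() {
            return count == 0 ? 0.0 : sum / count;
        }
    }

    // Groups readings by the start of the hour (epoch milliseconds) they fall into.
    static Map<Long, Stats> aggregate(long[] timestampsMs, double[] values) {
        Map<Long, Stats> windows = new TreeMap<>();
        for (int i = 0; i < timestampsMs.length; i++) {
            long windowStart = Math.floorDiv(timestampsMs[i], HOUR_MS) * HOUR_MS;
            Stats stats = windows.computeIfAbsent(windowStart, k -> new Stats());
            stats.sum += values[i];
            stats.count++;
        }
        return windows;
    }

    public static void main(String[] args) {
        Map<Long, Stats> windows = aggregate(
                new long[] {0L, 1_800_000L, 3_600_000L},
                new double[] {1.0, 3.0, 10.0});
        windows.forEach((start, s) ->
                System.out.println(start + " sum=" + s.sum + " avg=" + s.average()));
    }
}
```

Storing one sum and one count per hour instead of every reading is where the storage saving would come from, at the price of losing the raw points.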
If I use Cassandra, how can I do batch analysis?
Well, I would hope the same way you currently do it. Schedule a query against a database? Hope all the data is there? (no late-arriving records)
I have real-time time series sensor data. My primary goal is to keep the raw data. I should do this so that the cost of storage is minimal.
If your requirement is storing the raw data, you can write it to HDFS in compressed form. Using the Parquet format might not be feasible here. Formats can change.
If you have the incoming data in kafka you can use kafka connect to write to hdfs in batches from a topic.
All sensors produce time series data, and I must save this raw time series data for batch analysis. The Parquet format is great for lower storage cost. But does it make sense if each incoming piece of time series data is written in Parquet format?
Not sure if I understand correctly, but it does not make any sense to store each data point in a separate Parquet file.
parquet format has overhead compared to raw data
parquet format is specifically designed for having table like data with many rows, so that filtering on that data is fast (with local access).
batch processing and filesystems are most of the time really unhappy about lots of small files.
On the other hand, I want to process each incoming time series data in real time. For real-time scenario; I can use Kafka. But, can Hbase or Cassandra be used for both batch and real-time analysis instead of Kafka?
Depending on your use case it might be easy enough for batch processing to use hive or spark sql on the raw data.
Maybe kafka-streams processor is enough for your real time requirements.
So many options. It all depends on the use cases...

How does compression in hive results in better query performance?

Many best practices suggest that the data should be stored in a compressed format in HDFS.
There are clear performance differences when running Hive queries on a table consisting of compressed text files (chunked gzip files of around 250 MB each) vs. uncompressed text files.
Can somebody please explain what is happening behind the scenes?
As per my understanding, while the query input is being assigned to mapper tasks, there is a decompression stage and then there is a query. If this is the case, how can it provide better performance over uncompressed text file as it will have the overhead of decompression?
There are two aspects involved here:
Network overhead: The MapReduce paradigm is heavily criticized for the overhead of shuffling and sorting. Viewed selfishly, these steps contribute nothing to the processing you actually want. On top of that, when large volumes of data flow through the network, even with gigabit switches, the shuffle and sort becomes the bottleneck (unless the operation itself is very involved). Hence, the more compressed the data, the more easily it passes through the shuffle-sort bottleneck.
Sparse data: Bigger datasets are mostly sparse (exceptions exist, but take it as a rule of thumb). So compression brings down the size of the data, and again the shuffle-sort step becomes pretty small.
Data compression in Hive tables is known to give better results than uncompressed storage, both in terms of disk usage and query performance.
You can import text files compressed with Gzip directly into a table stored as TextFile. The compression will be detected automatically and the file will be decompressed on-the-fly during query execution.
RECORD compresses each value individually, while BLOCK buffers up 1 MB (by default) before doing compression.
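The difference between per-record and per-block compression can be sketched with java.util.zip.Deflater. This uses the JDK's deflate implementation rather than Hadoop's codecs, and the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Hypothetical sketch: per-record vs. per-block compression using the JDK's Deflater.
final class RecordVsBlockSketch {
    // Returns the number of bytes the input occupies after deflate compression.
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] chunk = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(chunk);
        }
        deflater.end();
        return total;
    }

    // "RECORD" style: each value is compressed on its own and cannot share context.
    static int recordCompressedSize(String[] records) {
        int total = 0;
        for (String record : records) {
            total += deflatedSize(record.getBytes(StandardCharsets.UTF_8));
        }
        return total;
    }

    // "BLOCK" style: values are buffered together, then compressed once.
    static int blockCompressedSize(String[] records) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        for (String record : records) {
            byte[] bytes = record.getBytes(StandardCharsets.UTF_8);
            buffer.write(bytes, 0, bytes.length);
        }
        return deflatedSize(buffer.toByteArray());
    }

    public static void main(String[] args) {
        String[] records = new String[200];
        java.util.Arrays.fill(records, "sensor=42,status=OK");
        System.out.println("record: " + recordCompressedSize(records));
        System.out.println("block:  " + blockCompressedSize(records));
    }
}
```

With many similar records, the block variant should come out much smaller, because repetition across records is only visible to the compressor when they are buffered together.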

Why column oriented file formats are not well suited to streaming writes?

Hadoop: The Definitive Guide (4th edition) has a paragraph on page 137:
Column-oriented formats need more memory for reading and writing,
since they have to buffer a row split in memory, rather than just a
single row. Also, it’s not usually possible to control when writes
occur (via flush or sync operations), so column-oriented formats are
not suited to streaming writes, as the current file cannot be
recovered if the writer process fails. On the other hand, row-oriented
formats like sequence files and Avro datafiles can be read up to the
last sync point after a writer failure. It is for this reason that
Flume (see Chapter 14) uses row-oriented formats.
I don't understand why the current block cannot be recovered in the case of failure. Can someone explain the technical difficulty behind this statement:
we can not control when writes occur (via flush or sync operations)
don't understand why the current block cannot be recovered in the case of failure.
Simply because there is no block to recover. The explanation is quite clear that the columnar formats (ORC, Parquet etc) make their own decision when to flush. If there was no flush then there is no 'block'. As Flume cannot control when the columnar memory buffers get written out to storage, it cannot rely on such formats.
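A toy model of that behaviour is sketched below: rows are buffered per column and only become durable when the writer itself decides a row group is full. This is not the ORC or Parquet writer API; the class, the row-group size and the in-memory list standing in for storage are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical columnar writer that decides on its own when to flush.
final class RowGroupBufferSketch {
    private final int rowGroupSize;
    private final List<Long> idColumn = new ArrayList<>();
    private final List<String> nameColumn = new ArrayList<>();
    // Stands in for durable storage; one entry per flushed row group.
    private final List<String> durable = new ArrayList<>();

    RowGroupBufferSketch(int rowGroupSize) {
        this.rowGroupSize = rowGroupSize;
    }

    // The caller hands over a row but has no way to force it to storage.
    void write(long id, String name) {
        idColumn.add(id);
        nameColumn.add(name);
        if (idColumn.size() == rowGroupSize) {
            flushRowGroup();
        }
    }

    // Writes the buffered columns out as one unit and clears the buffers.
    private void flushRowGroup() {
        durable.add("ids=" + idColumn + " names=" + nameColumn);
        idColumn.clear();
        nameColumn.clear();
    }

    int durableRowGroups() {
        return durable.size();
    }

    // Rows still sitting in memory; these vanish if the process dies now.
    int rowsLostOnCrash() {
        return idColumn.size();
    }

    public static void main(String[] args) {
        RowGroupBufferSketch writer = new RowGroupBufferSketch(3);
        for (int i = 0; i < 5; i++) {
            writer.write(i, "row-" + i);
        }
        System.out.println(writer.durableRowGroups() + " group(s) durable, "
                + writer.rowsLostOnCrash() + " row(s) only in memory");
    }
}
```

A row-oriented writer, by contrast, can append each record and sync after it, so a reader can recover everything up to the last sync point.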

Avro vs. Parquet

I'm planning to use one of the Hadoop file formats for my Hadoop-related project. I understand Parquet is efficient for column-based queries and Avro for full scans or when we need all the columns' data!
Before I proceed and choose one of the file formats, I want to understand the disadvantages/drawbacks of one over the other. Can anyone explain it to me in simple terms?
Avro is a Row based format. If you want to retrieve the data as a whole you can use Avro
Parquet is a Column based format. If your data consists of a lot of columns but you are interested in a subset of columns then you can use Parquet
HBase is useful when frequent updating of data is involved. Avro is fast in retrieval, Parquet is much faster.
If you haven't already decided, I'd go ahead and write Avro schemas for your data. Once that's done, choosing between Avro container files and Parquet files is about as simple as swapping out e.g.,
job.setOutputFormatClass(AvroKeyOutputFormat.class);
AvroJob.setOutputKeySchema(MyAvroType.getClassSchema());
for
job.setOutputFormatClass(AvroParquetOutputFormat.class);
AvroParquetOutputFormat.setSchema(job, MyAvroType.getClassSchema());
The Parquet format does seem to be a bit more computationally intensive on the write side--e.g., requiring RAM for buffering and CPU for ordering the data etc. but it should reduce I/O, storage and transfer costs as well as make for efficient reads especially with SQL-like (e.g., Hive or SparkSQL) queries that only address a portion of the columns.
In one project, I ended up reverting from Parquet to Avro containers because the schema was too extensive and nested (being derived from some fairly hierarchical object-oriented classes) and resulted in 1000s of Parquet columns. In turn, our row groups were really wide and shallow which meant that it took forever before we could process a small number of rows in the last column of each group.
I haven't had much chance to use Parquet for more normalized/sane data yet but I understand that if used well, it allows for significant performance improvements.
Avro
Widely used as a serialization platform
Row-based, offers a compact and fast binary format
Schema is encoded in the file so the data can be untagged
Files support block compression and are splittable
Supports schema evolution
Parquet
Column-oriented binary file format
Uses the record shredding and assembly algorithm described in the Dremel paper
Each data file contains the values for a set of rows
Efficient in terms of disk I/O when specific columns need to be queried
From Choosing an HDFS data storage format- Avro vs. Parquet and more
Both Avro and Parquet are "self-describing" storage formats, meaning that both embed data, metadata information and schema when storing data in a file.
The use of either storage formats depends on the use case. Three aspects constitute the basis upon which you may choose which format will be optimal in your case:
Read/Write operation: Parquet is a column-based file format. It supports indexing. Because of that it is suitable for write-once and read-intensive, complex or analytical querying, low-latency data queries. This is generally used by end users/data scientists.
Meanwhile Avro, being a row-based file format, is best used for write-intensive operation. This is generally used by data engineers. Both support serialization and compression formats, although they do so in different ways.
Tools: Parquet is a good fit for Impala. (Impala is a Massively Parallel Processing (MPP) RDBMS SQL query engine which knows how to operate on data that resides in one or a few external storage engines.) Again, Parquet lends itself well to complex/interactive querying and fast (low-latency) outputs over data in HDFS. This is supported by CDH (Cloudera Distribution Hadoop). Hadoop supports Apache's Optimized Row Columnar (ORC) format (the selection depends on the Hadoop distribution), whereas Avro is best suited to Spark processing.
Schema Evolution: Evolving a DB schema means changing the DB's structure, therefore its data, and thus its query processing. Both Parquet and Avro support schema evolution, but to varying degrees.
Parquet is good for 'append' operations, e.g. adding columns, but not for renaming columns unless 'read' is done by index.
Avro is better suited for appending, deleting and generally mutating columns than Parquet. Historically Avro has provided a richer set of schema evolution possibilities than Parquet, and although their schema evolution capabilities tend to blur, Avro still shines in that area, when compared to Parquet.
Your understanding is right. In fact, we ran into a similar situation during data migration in our DWH. We chose Parquet over Avro as the disk saving we got was almost double what we got with Avro. Also, the query processing time was much better than with Avro. But yes, our queries were based on aggregations, column-based operations etc., hence Parquet was predictably a clear winner.
We are using Hive 0.12 from CDH distro. You mentioned you are running into issues with Hive+Parquet, what are those? We did not encounter any.
Silver Blaze put description nicely with an example use case and described how Parquet was the best choice for him. It makes sense to consider one over the other depending on your requirements. I am putting up a brief description of different other file formats too along with time space complexity comparison. Hope that helps.
There are a bunch of file formats that you can use in Hive. Notable mentions are Avro, Parquet, RCFile and ORC. There are some good documents available online that you may refer to if you want to compare the performance and space utilization of these file formats. Here are some useful links that will get you going.
This Blog Post
This link from MapR [They don't discuss Parquet though]
This link from Inquidia
The links above will get you going. I hope this answers your query.
Thanks!

Storage format in HDFS

How Does HDFS store data?
I want to store huge files in a compressed fashion.
E.g : I have a 1.5 GB of file, with default replication factor of 3.
It requires (1.5)*3 = 4.5 GB of space.
I believe currently no implicit compression of data takes place.
Is there a technique to compress the file and store it in HDFS to save disk space ?
HDFS stores any file in a number of 'blocks'. The block size is configurable on a per file basis, but has a default value (like 64/128/256 MB)
So given a file of 1.5 GB, and block size of 128 MB, hadoop would break up the file into ~12 blocks (12 x 128 MB ~= 1.5GB). Each block is also replicated a configurable number of times.
If your data compresses well (like text files) then you can compress the files and store the compressed files in HDFS - the same applies as above, so if the 1.5GB file compresses to 500MB, then this would be stored as 4 blocks.
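The block arithmetic above can be checked with a few lines of Java. This is only a sketch of the calculation, not an HDFS API; the class and method names are hypothetical, and it assumes 1 GB = 1024 MB.

```java
// Hypothetical helper for the block and replication arithmetic; not an HDFS API.
final class HdfsBlockMath {
    static final long MB = 1024L * 1024L;

    // Number of blocks a file is split into (ceiling division); the last block may be partial.
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Raw disk space used across the cluster once every block is replicated.
    static long rawBytesStored(long fileBytes, int replication) {
        return fileBytes * replication;
    }

    public static void main(String[] args) {
        long uncompressed = 1536 * MB; // 1.5 GB
        long compressed = 500 * MB;    // the same file after compression
        System.out.println(blockCount(uncompressed, 128 * MB));     // prints 12
        System.out.println(blockCount(compressed, 128 * MB));       // prints 4
        System.out.println(rawBytesStored(uncompressed, 3) / MB);   // prints 4608
    }
}
```

Note that a partial last block only occupies as much disk space as the data it holds, so the replicated 500 MB file takes about 1500 MB, not 4 x 128 x 3 MB.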
However, one thing to consider when using compression is whether the compression method supports splitting the file, that is, whether you can seek to an arbitrary position in the file and recover the compressed stream from there (GZip, for example, does not support splitting; BZip2 does).
Even if the method doesn't support splitting, hadoop will still store the file in a number of blocks, but you'll lose some benefit of 'data locality' as the blocks will most probably be spread around your cluster.
In your map reduce code, Hadoop has a number of compression codecs installed by default, and will automatically recognize certain file extensions (.gz for GZip files for example), abstracting you away from worrying about whether the input / output needs to be compressed.
Hope this makes sense
EDIT Some additional info in response to comments:
When writing to HDFS as output from a Map Reduce job, see the API for FileOutputFormat, in particular the following methods:
setCompressOutput(Job, boolean)
setOutputCompressorClass(Job, Class)
When uploading files to HDFS, yes they should be pre-compressed, and with the associated file extension for that compression type (out of the box, hadoop supports gzip with the .gz extension, so file.txt.gz would denote a gzipped file)
Some time ago I tried to summarize that in a blog post here.
Essentially that is a question of data splittability, as a file is divided into blocks, which are the elementary units for replication. The NameNode is responsible for keeping track of all the blocks belonging to one file. It is essential that each block is autonomous when choosing compression; not all codecs are splittable. If the format + codec is not splittable, the whole file needs to be in one place in order to be decompressed, which has a big impact on parallelism in MapReduce: essentially it runs in a single slot.
Hope that helps.
Have a look at presentation # Hadoop_Summit, especially Slide 6 and Slide 7.
If the DFS block size is 128 MB, then 4.5 GB of storage (including the replication factor of 3) needs 4500 / 128 ≈ 35.2, so 36 blocks (the 12 blocks of the 1.5 GB file, each stored 3 times).
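Of the common codecs, only bzip2 is splittable out of the box (LZO becomes splittable once the file is indexed). With the other formats the blocks are still spread across DataNodes, but a single task has to read the entire file.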
Have a look at algorithm types and class names and codecs
@Chris White's answer provides information on how to enable compression while writing Map output
The answer to this question is to first understand the file formats available in Hadoop today. There is now a choice of formats within HDFS that manage both file layout and compression, as an alternative to explicit encoding and splitting using LZO or BZIP. Many formats today support block compression and columnar compression, along with other features.
A storage format is a way you define how information is to be stored. This is usually indicated by the extension of the file. For example, we know images can be stored in several formats, such as PNG, JPG and GIF. All these formats can store the same image, but each has specific storage characteristics.
In Hadoop filesystem you have all of traditional storage formats available to you (like you can store PNG and JPG images on HDFS if you like), but you also have some Hadoop-focused file formats to use for structured and unstructured data.
Why is it important to know these formats
In any performance tradeoffs, a huge bottleneck for HDFS-enabled applications like MapReduce, Hive, HBase, and Spark is the time it takes to find relevant data in a particular location and the time it takes to write the data back to another location. These issues are accentuated when you manage large datasets. The Hadoop file formats have evolved to ease these issues across a number of use cases.
Choosing an appropriate file format can have some significant benefits:
Optimum read time
Optimum write time
Splitting or partitioning of files (so you don’t need to read the whole file, just a part of it)
Schema adaptation (allowing field changes to a dataset)
Compression support (without sacrificing these features)
Some file formats are designed for general use, others are designed for more specific use cases (like powering a database), and some are designed with specific data characteristics in mind. So there really is quite a lot of choice when storing data in Hadoop, and one should know how to store data optimally in HDFS. Currently my go-to storage is the ORC format.
Check whether your big data components (Spark, Hive, HBase etc.) support these formats and make the decision accordingly. For example, I am currently ingesting data into Hive and converting it into ORC format, which works for me in terms of compression and performance.
Some common storage formats for Hadoop include:
Plain text storage (eg, CSV, TSV files, Delimited file etc)
Data is laid out in lines, with each line being a record. Lines are terminated by a newline character \n in the typical UNIX world. Text files are inherently splittable, but if you want to compress them you’ll have to use a file-level compression codec that supports splitting, such as BZIP2. This is not efficient and will require a bit of work when performing MapReduce tasks.
Sequence Files
Originally designed for MapReduce, and therefore very easy to integrate with Hadoop MapReduce processes. They encode a key and a value for each record and nothing more. They are stored in a binary format that is smaller than a text-based format. Even so, the format does not impose any structure on the key and value themselves. One benefit of sequence files is that they support block-level compression, so you can compress the contents of the file while also maintaining the ability to split the file into segments for multiple map tasks. Still, according to the statistics they are not as efficient as Parquet and ORC.
Avro
The format encodes the schema of its contents directly in the file, which allows you to store complex objects natively. It is a file format plus a serialization and deserialization framework. With regular old sequence files you can store complex objects, but you have to manage the process yourself. It also supports block-level compression.
Parquet
My favorite and the hot format these days. It is a columnar file storage structure: data is encoded by column as it is written to disk. So datasets are partitioned both horizontally and vertically. One huge benefit of column-oriented file formats is that data in the same column tends to be compressed together, which can yield some massive storage optimizations (as data in the same column tends to be similar). Try using this if your processing can optimally use column storage. You can refer to advantages of columnar storages.
If you’re chopping and cutting up datasets regularly then these formats can be very beneficial to the speed of your application, but frankly if you have an application that usually needs entire rows of data then the columnar formats may actually be a detriment to performance due to the increased network activity required.
ORC
ORC stands for Optimized Row Columnar, which means it can store data in a more optimized way than the other file formats. ORC reduces the size of the original data by up to 75% (e.g. a 100 GB file will become 25 GB). As a result, the speed of data processing also increases. ORC shows better performance than the Text, Sequence and RC file formats.
An ORC file contains row data in groups called stripes, along with a file footer. The ORC format improves performance when Hive is processing the data.
It is similar to Parquet but with a different encoding technique. That is out of scope for this thread, but you can look up the differences online.

Resources