How do I stream parquet using pyarrow?

I'm trying to read a large dataset of Parquet files piece by piece, do some operation on each piece, and then move on to the next one without holding them all in memory. I need to do this because the entire dataset doesn't fit into memory. Previously I used ParquetDataset, and I'm aware of RecordBatchStreamReader, but I'm not sure how to combine them.
How can I use PyArrow to do this?

At the moment, the Parquet APIs only support complete reads of individual files, so we can only limit reads at the granularity of a single file. We would like to create an implementation of arrow::RecordBatchReader (the streaming data interface) that reads from Parquet files, see https://issues.apache.org/jira/browse/ARROW-1012. Patches would be welcome.
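The answer above reflects the API at the time it was written. With more recent pyarrow releases, a piece-by-piece read can at least be approximated by iterating over files and row groups (newer versions also offer ParquetFile.iter_batches for batch-wise reads). A minimal sketch, assuming a made-up list of file paths and a pyarrow version that exposes ParquetFile.read_row_group:

import pyarrow.parquet as pq

paths = ["data/part-0000.parquet", "data/part-0001.parquet"]  # hypothetical file list

for path in paths:
    pf = pq.ParquetFile(path)
    # Read one row group at a time instead of the whole file,
    # so only that slice of the data is held in memory.
    for i in range(pf.num_row_groups):
        table = pf.read_row_group(i)  # a pyarrow.Table holding just this row group
        # ... do some operation on `table`, then let it go out of scope ...
        print(path, i, table.num_rows)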

Related

Best method to save intermediate tables in pyspark

This is my first question on Stack Overflow.
I am replicating a SAS codebase in Pyspark. The SAS codebase produces and stores scores of intermediate SAS datasets (100 when I last counted) which are used to cross check the final output and also for other analyses at a later point in time.
My purpose is to save numerous Pyspark dataframes in some format so that they can be re-used in a separate Pyspark session. I have thought of 2 options:
Save dataframes as hive tables.
Save them as parquet files.
Are there any other formats? Which method is faster? Will Parquet or CSV files have schema-related issues when re-reading them as PySpark dataframes?
The best option is to use Parquet files, as they have the following advantages:
Roughly 3x compression, which saves space
Columnar format, which enables faster predicate pushdown
Optimized by Spark's Catalyst optimizer
The schema persists, since Parquet files carry schema information.
The only issue is to make sure you are not generating many small files. The default Parquet block size is 128 MB, so make sure your files are sufficiently large; you can repartition the data to achieve this.
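As an illustration of that repartition advice, a minimal PySpark sketch (the paths and the target partition count of 8 are made-up values, not anything from the original answer):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("save-intermediate").getOrCreate()

df = spark.read.parquet("/data/intermediate/raw")  # hypothetical input path

# Reduce the number of output files so each one is reasonably large,
# aiming at roughly the 128 MB Parquet block size mentioned above.
df.repartition(8).write.mode("overwrite").parquet("/data/intermediate/scored")

Choosing the partition count so that each output file lands near the block size keeps the small-files problem away.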
Use Delta Lake to iterate over data changes: it gives you an evolvable schema, the advantages of Parquet, easy updates, change tracking, and data versioning.
Parquet is the default format for PySpark and works well, so you can simply store the data as Parquet files or a Hive table. Before pushing to HDFS/Hive, repartition the data if the source produces many small files. If the data is huge, try partitioning the Hive table by a suitable column.
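If you take the Hive-table route with a partition column, the write might look like this sketch (the table name analytics.intermediate_scores and the partition column event_date are hypothetical, chosen only for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df = spark.read.parquet("/data/intermediate/raw")  # hypothetical input path

# `event_date` is a made-up partition column; pick one whose cardinality is
# high enough to be useful but low enough to avoid producing tiny partitions.
df.write.mode("overwrite") \
    .partitionBy("event_date") \
    .format("parquet") \
    .saveAsTable("analytics.intermediate_scores")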

Effectively merge big parquet files

I'm using parquet-tools to merge parquet files, but it seems that parquet-tools needs about as much memory as the merged file itself. Are there other ways, or configurable options in parquet-tools, to use memory more effectively? I ask because I run the merge job as a map task in a Hadoop environment, and the container gets killed every time because it uses more memory than it is given.
Thank you.
I wouldn't recommend using parquet-tools merge, since it just places row groups one after another, so you will still have small row groups, just packed together in a single file. The resulting file will typically not have noticeably better performance, and under certain circumstances it may even perform worse than separate files. See PARQUET-1115 for details.
Currently the only proper way to merge Parquet files is to read all data from them and write it to a new Parquet file. You can do it with a MapReduce job (requires writing custom code for this purpose) or using Spark, Hive or Impala.
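For the Spark route mentioned above, the read-everything-and-rewrite approach can be sketched roughly like this (the input/output paths and the target file count of 4 are assumptions, not part of the original answer):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge-parquet").getOrCreate()

# Read every small Parquet file under the input directory...
df = spark.read.parquet("/data/small-files/")  # hypothetical input directory

# ...and rewrite the data as a handful of large files with full-size row groups.
df.coalesce(4).write.mode("overwrite").parquet("/data/merged/")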

Data format and database choices with Spark/Hadoop

I am working on structured data (one value per field, the same fields for each row) that I have to put in a NoSQL environment with Spark (as the analysis tool) and Hadoop. However, I am wondering what format to use. I was thinking about JSON or CSV but I'm not sure. What do you think and why? I don't have enough experience in this field to decide properly.
Second question: I have to analyse these data (stored in HDFS). So, as far as I know, I have two possibilities to query them (before the analysis):
Direct reading and filtering. I mean that it can be done with Spark, for example:
data = sqlCtxt.read.json(path_data)
Use HBase/Hive to properly make a query and then process the data.
So, I don't know what the standard way of doing all this is and, above all, what will be the fastest.
Thank you in advance!
Use Parquet. I'm not sure about CSV, but definitely don't use JSON. In my personal experience, using JSON with Spark was extremely slow to read from storage; after switching to Parquet my read times were much faster (e.g. some small files that took minutes to load as compressed JSON now take less than a second as compressed Parquet).
On top of improving read speeds, compressed Parquet is splittable, so Spark can partition it while reading, whereas compressed JSON is not. This means Parquet can be loaded onto multiple cluster workers, whereas the JSON will just be read onto a single node as one partition. That isn't a good idea if your files are large: you risk Out Of Memory exceptions, and it won't parallelise your computations either, so you'll be executing on one node. This isn't the 'Sparky' way of doing things.
Final point: you can use Spark SQL to execute queries on stored Parquet files, without having to read them into dataframes first. Very handy.
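A small sketch of that last point, querying the Parquet files in place with Spark SQL (the path /data/events/ and the column col_a are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Query the Parquet files directly, without an explicit read into a dataframe first.
result = spark.sql(
    "SELECT col_a, COUNT(*) AS n FROM parquet.`/data/events/` GROUP BY col_a"
)
result.show()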
Hope this helps :)

Caching vs Tempview

I have a Parquet file which I read at least 4-5 times within my application, and I was wondering what the most efficient thing to do is.
Option 1. While writing the Parquet file, read it back into a dataset and call cache. I am assuming that by doing an immediate read I might hit some existing HDFS/Spark cache left over from the write process.
Option 2. In my application, when I need the dataset the first time, cache it after reading it.
Option 3. While writing the Parquet file, create a temp view out of it after completion. In all subsequent usages, use the view.
I am also not very clear on the efficiency of reading from a temp view vs. a Parquet dataset.
The datasets don't all fit into memory.
You should cache the dataset (Option 2):
Writing to disk provides no improvement over Spark's in-memory format.
Temporary views don't cache anything by themselves.
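A minimal sketch of Option 2. Since the datasets don't all fit into memory, an explicit MEMORY_AND_DISK storage level is used here so partitions that don't fit can spill to disk; the path is hypothetical:

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/data/scores/")      # hypothetical path
df.persist(StorageLevel.MEMORY_AND_DISK)      # spill to disk when memory runs out

df.count()  # first action materializes the cache
# ... the 4-5 subsequent uses of `df` now read from the cache instead of re-reading Parquet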

Using Parquet for realtime queries

I'm trying to come up with a solution for doing realtime queries (maybe within 0.x seconds), and I'm going to use Parquet to store the data. I want to use Presto and an API to query the data.
My question is, since Parquet stores data in HDFS, where files are invisible until closed, how do I effectively achieve near-realtime query results?
The Parquet files must be closed in HDFS quickly enough to let the query tool see and use them. But that means I can't put too much data into each Parquet file, which ends up producing too many small files and/or not being real-time enough. Any better ideas, or is Parquet simply not a good format for realtime solutions?
Thanks for any input!
