How can one append to parquet files and how does it affect partitioning? - parquet

Does parquet allow appending to a parquet file periodically?
How does appending relate to partitioning, if at all? For example, if I were able to identify a low-cardinality column and partition by that column, and I then appended more data, would parquet automatically append the data while preserving the partitioning, or would the file have to be repartitioned?

Does parquet allow appending to a parquet file periodically?
Yes and no. The parquet spec describes a format that could be appended to by reading the existing footer, writing a new row group, and then writing out a modified footer. This process is described a bit here.
Not all implementations support this operation. The only implementation I am aware of at the moment is fastparquet (see this answer). It is usually acceptable, simpler, and potentially faster to cache and batch instead: either buffer the rows in memory, or write the small files and combine them into larger files at some point later.
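As a rough sketch of the batch-later approach (the file paths here are made up for illustration), pyarrow can read a set of small files and rewrite them as one larger file:
import glob
import pyarrow as pa
import pyarrow.parquet as pq

# Collect the small files that have accumulated so far.
small_files = sorted(glob.glob("incoming/*.parquet"))

# Read each one and concatenate them into a single in-memory table.
combined = pa.concat_tables([pq.read_table(path) for path in small_files])

# Write the batched data back out as one larger parquet file.
pq.write_table(combined, "batched/combined.parquet")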
How does appending relate to partitioning, if at all?
Parquet does not have any concept of partitioning.
Many tools that support parquet implement partitioning. For example, pyarrow has a datasets feature which supports partitioning. If you were to append new data using this feature a new file would be created in the appropriate partition directory.
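As a rough illustration with pyarrow (the column and path names here are made up), pq.write_to_dataset writes new files into the matching partition directories, so appends preserve the existing layout:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"country": ["US", "US", "DE"], "value": [1, 2, 3]})
table = pa.Table.from_pandas(df)

# Each call adds new files under dataset_root/country=US/, country=DE/, etc.
pq.write_to_dataset(table, root_path="dataset_root", partition_cols=["country"])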

It's possible to append row groups to an already existing parquet file using fastparquet.
Here is my SO answer on the same topic.
From the fastparquet docs:
append: bool (False) or ‘overwrite’
If False, construct data-set from scratch; if True, add new row-group(s) to existing data-set. In the latter case, the data-set must exist, and the schema must match the input data.
from fastparquet import write
write('output.parquet', df, append=True)
EXAMPLE UPDATE:
Here is a Python script. On the first run it creates a file with one row group. On subsequent runs it appends row groups to the same parquet file.
import os.path
import pandas as pd
from fastparquet import write

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
file_path = "C:\\Users\\nsuser\\dev\\write_parq_row_group.parquet"

if not os.path.isfile(file_path):
    # First run: create the file with a single row group.
    write(file_path, df)
else:
    # Subsequent runs: append a new row group to the existing file.
    write(file_path, df, append=True)

Related

Can I access a Parquet file via index without reading the entire file into memory?

I just read that HDF5 allows you to seek into data without reading the entire file into memory.
Is this seeking behavior possible with Parquet files without Java (non-pyspark solutions)? I am using Parquet because of its strong dtype support.
import h5py
f = h5py.File('my_file.hdf5', 'w')
dset = f.create_dataset('coords', data=my_ndarray)
f.close()
f = h5py.File('my_file.hdf5', 'r')
dset = f['coords']
my_array = dset[-2:]
https://arrow.apache.org/docs/python/parquet.html#inspecting-the-parquet-file-metadata
I see here that Parquet metadata has num_row_groups: 1 (or more). But I am not sure how that helps me fetch rows [23, 42, 117, 99293184].
Parquet allows some forms of partial / random access. However, it is limited. Each parquet file is made up of one or more row groups, and each row group is made up of one or more columns. You can retrieve any combination of row groups & columns that you want.
There is only one way to store the columns within a parquet file. However, it is up to the creator of the file how to distribute the rows into row groups. The creator could put every row in its own row group (although this would be very inefficient) or they could use a single row group for the entire file (which is quite common).
This means the ability to do partial reads depends on how the file was created. If you are creating the files and you know ahead of time what sorts of reads will be done to access the data, you can use that knowledge to size the row groups. If you don't know the access patterns ahead of time, or you have no control over the creation of the files you are reading, then you will likely have to read the entire file into memory and filter later.
Another common scenario is to store a single large dataset across many files (so that some rows are in each file). This allows for the same sort of partial read behavior that you would have from multiple row groups. However, having multiple files is sometimes easier to manage.
Both pyarrow and fastparquet should give you APIs for filtering row groups. They also expose the parquet file metadata, so you can inspect it yourself and implement a custom filtering mechanism.
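For example, a small pyarrow sketch of row-group and column level reads (the file and column names are assumptions):
import pyarrow.parquet as pq

pf = pq.ParquetFile("my_file.parquet")
print(pf.metadata.num_row_groups)   # shows how the writer split the rows

# Read only the first row group, and only the column you need.
subset = pf.read_row_group(0, columns=["coords"])

# Or read a subset of columns across the whole file.
cols_only = pq.read_table("my_file.parquet", columns=["coords"])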

Best method to save intermediate tables in pyspark

This is my first question on Stackoverflow.
I am replicating a SAS codebase in Pyspark. The SAS codebase produces and stores scores of intermediate SAS datasets (100 when I last counted) which are used to cross check the final output and also for other analyses at a later point in time.
My purpose is to save numerous Pyspark dataframes in some format so that they can be re-used in a separate Pyspark session. I have thought of 2 options:
Save dataframes as hive tables.
Save them as parquet files.
Are there any other formats? Which method is faster? Will parquet files or csv files have schema related issues while re-reading the files as Pyspark dataframes?
The best option is to use parquet files as they have the following advantages:
Roughly 3x compression saves space
Columnar format allows faster pushdowns
Optimized for the Spark Catalyst optimizer
Schema persists, as parquet files carry schema information
The only caveat is to make sure you are not generating lots of small files; the default parquet block size is 128 MB, so make sure your files are sufficiently large. You can repartition the data before writing to keep the file sizes up, as sketched below.
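A minimal sketch of that in pyspark (the partition count and output path are placeholders you would tune):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000)   # stand-in for one of your intermediate dataframes

# Reduce the number of output files so each one is reasonably large.
df.repartition(4).write.mode("overwrite").parquet("/tmp/intermediate/my_table")

# In a later session the schema comes back with the data.
df2 = spark.read.parquet("/tmp/intermediate/my_table")
df2.printSchema()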
Use Delta Lake if you need to iterate over data changes: it gives you the advantages of parquet plus schema evolution, easy updates, change tracking, and data versioning.
Parquet is the default format for pyspark and works well, so you can just store the data as parquet files or a hive table. Before pushing to HDFS/Hive you can repartition if the source produces many small files. If the data is huge, try partitioning the hive table by a suitable column.
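A sketch of what such a partitioned hive table could look like (the table and column names are placeholders):
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.createDataFrame([(1, "2020-01-01"), (2, "2020-01-02")], ["score", "dt"])

# Partition the managed table by a suitable low-cardinality column ("dt" here)
# so later reads can prune partitions.
df.write.mode("overwrite").partitionBy("dt").saveAsTable("intermediate_scores")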

How do I stream parquet using pyarrow?

I'm trying to read in a large dataset of parquet files piece by piece, do some operation and then move on to the next one without holding them all in memory. I need to do this because the entire dataset doesn't fit into memory. Previously I used ParquetDataset and I'm aware of RecordBatchStreamReader but I'm not sure how to combine them.
How can I use Pyarrow to do this?
At the moment, the Parquet APIs only support complete reads of individual files, so we can only limit reads at the granularity of a single file. We would like to create an implementation of arrow::RecordBatchReader (the streaming data interface) that reads from Parquet files, see https://issues.apache.org/jira/browse/ARROW-1012. Patches would be welcome.
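Until such a reader exists, one workaround consistent with the above is to process the dataset one file at a time (the directory path and the process() callback are placeholders):
import glob
import pyarrow.parquet as pq

for path in sorted(glob.glob("my_dataset/*.parquet")):
    table = pq.read_table(path)       # only this file is held in memory
    process(table.to_pandas())        # your own per-chunk logic goes here
    del table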

Hive partitioned column doesn't appear in rdd via sc.textFile

The Hive partition column is not part of the underlying saved data. I need to know how it can be pulled via the sc.textFile(filePath) syntax when loading into an RDD.
I know the alternative of creating a hive context and so on, but I was wondering whether there is a way to get it directly via the sc.textFile(filePath) syntax.
When the data is partitioned by a column on save, that column's values are stored in the directory structure and not in the actual files. Since sc.textFile(filePath) is made for reading single files, I do not believe it supports reading partitioned data.
I would recommend reading the data as a dataframe, for example:
val df = hiveContext.read().format("orc").load("path/to/table/")
The wholeTextFiles() method could also be used. It gives you a tuple of (file path, file data), and from that it should be possible to parse out the partition column and add it as a new column.
If storage size is not a problem, an alternative is to store the partition column's information twice: once in the file structure (done by partitioning on that column), and once more in the data itself. This is achieved by duplicating the column before saving to file. Say the column in question is named colA,
val df2 = df.withColumn("colADup", $"colA")
df2.write.partitionBy("colADup").orc("path/to/save/")
This can also easily be extended to multiple columns.

Would spark dataframe read from external source on every action?

On a spark shell I use the below code to read from a csv file
val df = spark.read.format("org.apache.spark.csv").option("header", "true").option("mode", "DROPMALFORMED").csv("/opt/person.csv") //spark here is the spark session
df.show()
Assuming this displays 10 rows. If I add a new row in the csv by editing it, would calling df.show() again show the new row? If so, does it mean that the dataframe reads from an external source (in this case a csv file) on every action?
Note that I am not caching the dataframe, nor am I recreating it using the spark session.
After each action, spark forgets about the loaded data and any intermediate variable values you used in between.
So, if you invoke 4 actions one after another, it computes everything from the beginning each time.
The reason is simple: spark works by building a DAG, which lets it trace the path of operations from reading the data to the action, and then it executes it.
That is the reason cache and broadcast variables exist. The onus is on the developer to cache the data or dataframe if they know they are going to reuse it N number of times.
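A minimal sketch of that caching in a spark shell (reusing the CSV path from the question; this assumes there is enough memory to hold the cache):
df = spark.read.option("header", "true").csv("/opt/person.csv")
df.cache()      # mark the dataframe for caching
df.count()      # the first action materializes the cache
df.show()       # later actions are served from the cache, not from the file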
TL;DR A DataFrame is no different from an RDD here; you can expect the same rules to apply.
With a simple plan like this, the answer is yes. It will read the data for every show, although if the action doesn't require all of the data (as here) it won't read the complete file.
In the general case (complex execution plans) data can be accessed from the shuffle files.
