Can I access a Parquet file via index without reading the entire file into memory? - parquet

I just read that HDF5 lets you seek into the data without reading the entire file into memory.
Is this seeking behavior possible in Parquet files without Java (non-pyspark solutions)? I am using Parquet because of the strong dtype support.
import h5py
f = h5py.File('my_file.hdf5', 'w')
dset = f.create_dataset('coords', data=my_ndarray)  # my_ndarray is an existing numpy array
f.close()
f = h5py.File('my_file.hdf5', 'r')
dset = f['coords']
my_array = dset[-2:]  # only the last two rows are read from disk
https://arrow.apache.org/docs/python/parquet.html#inspecting-the-parquet-file-metadata
I see here that Parquet metadata has num_row_groups: 1 (or more). But I am not sure how that helps me fetch rows [23, 42, 117, 99293184].

Parquet allows some forms of partial / random access. However, it is limited. Each parquet file is made up of one or more row groups, and each row group stores a chunk of every column. You can retrieve any combination of row groups & columns that you want.
The column layout is fixed: each column is always stored as its own unit within a row group. However, it is up to the creator of the file how to distribute the rows into row groups. The creator could put every row in its own row group (although this would be very inefficient) or use a single row group for the entire file (this is quite common).
This means the ability to do partial reads depends on how the file was created. If you are creating the files and you know ahead of time what sorts of reads will be done to access the data, you can use that to choose your row group boundaries (see the writer-side sketch below). If you don't know the access patterns ahead of time, or you have no control over the creation of the files you are reading, then you will likely have to read the entire file into memory and filter afterwards.
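For instance, if you are the one writing the files with pyarrow, something like this rough sketch controls how many rows land in each row group (the table contents and the 100-row group size here are made up for illustration):
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical data; substitute your real table here.
table = pa.table({'col1': list(range(1000)), 'col2': list(range(1000))})

# Cap each row group at 100 rows so a reader can later fetch a small
# slice of the file without touching the other row groups.
pq.write_table(table, 'my_file.parquet', row_group_size=100)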
Another common scenario is to store a single large dataset across many files (so that some rows are in each file). This allows for the same sort of partial read behavior that you would have from multiple row groups. However, having multiple files is sometimes easier to manage.
Both pyarrow and fastparquet should give you APIs for filtering row groups. They also expose the parquet file metadata so that you can access the metadata information yourself to implement some custom filtering mechanism.
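As a rough sketch of the reader side with pyarrow (the file name and the 'coords' column are assumptions borrowed from the question, not a guaranteed layout):
import pyarrow.parquet as pq

pf = pq.ParquetFile('my_file.parquet')
print(pf.metadata.num_row_groups)         # how many row groups the writer produced
print(pf.metadata.row_group(0).num_rows)  # rows covered by the first row group

# Read only the first row group, and only the 'coords' column from it;
# the rest of the file is never loaded into memory.
table = pf.read_row_group(0, columns=['coords'])
Mapping absolute row indices such as 23 or 99293184 onto row groups is then a matter of walking the per-row-group row counts in the metadata.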

Related

How can one append to parquet files and how does it affect partitioning?

Does parquet allow appending to a parquet file periodically?
How does appending relate to partitioning, if at all? For example, if I were able to identify a column with low cardinality and partition by that column, and I then appended more data, would parquet automatically append the data while preserving the partitioning, or would the file have to be repartitioned?
Does parquet allow appending to a parquet file periodically?
Yes and No. The parquet spec describes a format that could be appended to by reading the existing footer, writing a row group, and then writing out a modified footer. This process is described a bit here.
Not all implementations support this operation. The only implementation I am aware of at the moment is fastparquet (see this answer). It is usually simpler, and often performs better, to cache and batch instead: either buffer rows in memory, or write out small files and compact them together at some point later (see the sketch below).
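A minimal sketch of the "write small files, compact later" idea with pyarrow (the staging directory and file names are made up):
import glob
import pyarrow as pa
import pyarrow.parquet as pq

# Each incoming batch was written elsewhere as its own small file, e.g.
# pq.write_table(batch_table, 'staging/batch_0001.parquet')

# Periodically compact the small files into one larger parquet file.
small_files = sorted(glob.glob('staging/*.parquet'))
combined = pa.concat_tables([pq.read_table(path) for path in small_files])
pq.write_table(combined, 'compacted.parquet')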
How does appending relate to partitioning, if at all?
Parquet does not have any concept of partitioning.
Many tools that support parquet implement partitioning. For example, pyarrow has a datasets feature which supports partitioning. If you were to append new data using this feature, a new file would be created in the appropriate partition directory (see the example below).
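For example, with pyarrow this might look roughly like the following (the column name colA and the directory name are placeholders):
import pyarrow as pa
import pyarrow.parquet as pq

new_data = pa.table({'colA': ['red', 'blue'], 'value': [1, 2]})

# Writes files such as dataset_root/colA=red/....parquet and
# dataset_root/colA=blue/....parquet; re-running with more data
# appends new files rather than rewriting the old ones.
pq.write_to_dataset(new_data, root_path='dataset_root', partition_cols=['colA'])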
It's possible to append row groups to an already existing parquet file using fastparquet.
Here is my SO answer on the same topic.
From the fastparquet docs:
append: bool (False) or ‘overwrite’. If False, construct data-set from scratch; if True, add new row-group(s) to existing data-set. In the latter case, the data-set must exist, and the schema must match the input data.
from fastparquet import write
write('output.parquet', df, append=True)
EXAMPLE UPDATE:
Here is a Python script. On the first run it will create a file with one row group; subsequent runs will append row groups to the same parquet file.
import os.path
import pandas as pd
from fastparquet import write

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
file_path = "C:\\Users\\nsuser\\dev\\write_parq_row_group.parquet"

if not os.path.isfile(file_path):
    # First run: create the file with a single row group.
    write(file_path, df)
else:
    # Later runs: append the new rows as an extra row group.
    write(file_path, df, append=True)

How do I stream parquet using pyarrow?

I'm trying to read in a large dataset of parquet files piece by piece, do some operation and then move on to the next one without holding them all in memory. I need to do this because the entire dataset doesn't fit into memory. Previously I used ParquetDataset and I'm aware of RecordBatchStreamReader but I'm not sure how to combine them.
How can I use Pyarrow to do this?
At the moment, the Parquet APIs only support complete reads of individual files, so we can only limit reads at the granularity of a single file. We would like to create an implementation of arrow::RecordBatchReader (the streaming data interface) that reads from Parquet files, see https://issues.apache.org/jira/browse/ARROW-1012. Patches would be welcome.
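In the meantime, a workaround at file granularity might look like this sketch (it assumes the dataset is a directory of many parquet files; do_some_operation is a placeholder for your own processing):
import glob
import pyarrow.parquet as pq

# Process one file at a time so only a single file's worth of data
# is ever held in memory.
for path in sorted(glob.glob('my_dataset/*.parquet')):
    table = pq.read_table(path)
    do_some_operation(table)  # placeholder for the per-chunk work
    del table                 # drop this chunk before reading the next file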

Page level skip/read in apache parquet

Question: Does Parquet have the ability to skip/read certain pages in a column chunk based on the query we run?
Can page header metadata help here?
http://parquet.apache.org/documentation/latest/
Under File Format, I read the following statement, which left me doubtful:
Readers are expected to first read the file metadata to find all the column chunks they are interested in. The columns chunks should then be read sequentially.
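For what it's worth, the footer metadata that this passage refers to is exposed by pyarrow, so you can at least inspect the per-column-chunk statistics that a reader could use for skipping (the file name is an assumption; page-level indexes are not surfaced here):
import pyarrow.parquet as pq

md = pq.ParquetFile('my_file.parquet').metadata
for rg in range(md.num_row_groups):
    for col in range(md.num_columns):
        chunk = md.row_group(rg).column(col)
        # min/max statistics are kept per column chunk; skipping at page
        # granularity is left to the reader implementation.
        print(chunk.path_in_schema, chunk.statistics)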

Hive partitioned column doesn't appear in rdd via sc.textFile

The Hive partitioned column is not part of the underlying saved data. I need to know how it can be pulled in via the sc.textFile(filePath) syntax so it is available in the RDD.
I know the other way of creating a Hive context and so on, but I was wondering whether there is a way to get it directly via the sc.textFile(filePath) syntax and use it.
When you partition the data by a column while saving, that column's data is stored in the directory structure and not in the actual files. Since sc.textFile(filePath) is made for reading single files, I do not believe it supports reading partitioned data.
I would recommend reading the data as a dataframe, for example:
val df = hiveContext.read.format("orc").load("path/to/table/")
The wholeTextFiles() method could also be used. You would then get a tuple of (file path, file contents), and from that it should be possible to parse out the partitioned column and add it as a new column.
If the storage size is no problem, then an alternative solution would be to store the information of the partitioned column twice: once in the file structure (by partitioning on that column), and once more in the data itself. This is achieved by duplicating the column before saving it to file. Say the column in question is named colA:
val df2 = df.withColumn("colADup", $"colA")
df2.write.partitionBy("colADup").orc("path/to/save/")
This can also easily be extended to multiple columns.

How to output multiple s3 files in Parquet

Writing parquet data can be done with something like the following. But what if I want to write to more than just one file, and in particular output multiple S3 files, so that reading a single column does not require reading all of the S3 data? How can this be done?
AvroParquetWriter<GenericRecord> writer =
        new AvroParquetWriter<GenericRecord>(file, schema);
GenericData.Record record = new GenericRecordBuilder(schema)
        .set("name", "myname")
        .set("favorite_number", i)
        .set("favorite_color", "mystring")
        .build();
writer.write(record);
For example, what if I want to partition by a column value, so that all the data with a favorite_color of red goes in one file and the rows with blue go in another, to minimize the cost of certain queries? There should be something similar in a Hadoop context. All I can find are things that mention Spark using something like
df.write.parquet("hdfs:///my_file", partitionBy=["created_year", "created_month"])
But I can find no equivalent to partitionBy in plain Java with Hadoop.
In a typical Map-Reduce application, the number of output files will be the same as the number of reduces in your job. So if you want multiple output files, set the number of reduces accordingly:
job.setNumReduceTasks(N);
or alternatively via the system property:
-Dmapreduce.job.reduces=N
I don't think it is possible to have one column per file with the Parquet format. The internal structure of Parquet files is initially split by row groups, and only these row groups are then split by columns.
