Page-level skip/read in Apache Parquet

Question: Does Parquet have the ability to skip/read certain pages in a column chunk based on the query we run?
Can page header metadata help here?
http://parquet.apache.org/documentation/latest/
Under File Format, I read this statement, which made me doubtful:
"Readers are expected to first read the file metadata to find all the column chunks they are interested in. The columns chunks should then be read sequentially."
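For what it's worth, pyarrow exposes the footer metadata (row groups and per-column-chunk statistics), which is the information a reader would use to decide what to skip; page-level indexes are not surfaced there. A minimal sketch, assuming a local file named example.parquet:

import pyarrow.parquet as pq

# Open the file; only the footer metadata is read here, no data pages.
pf = pq.ParquetFile('example.parquet')
meta = pf.metadata
print(meta.num_row_groups, meta.num_rows)

# Per-column-chunk statistics (min/max, null count) are what let a reader
# skip whole row groups; finer page-level skipping is up to the implementation.
col = meta.row_group(0).column(0)
print(col.path_in_schema, col.statistics)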

Related

Partitioned parquet file takes more space and more time to query

Theoretically, a Parquet file is expected to take less space than a CSV and to return results more quickly. My experiment shows the opposite.
I am converting the CSV file to a Parquet file partitioned on the "city" field.
The conversion takes 7 minutes.
The size of the Parquet folder is 48MB, while the CSV is 2.5MB.
Querying the Parquet with a filter on "city" takes 350ms, while the CSV takes 111ms.
The code is here https://github.com/yashgt/Samples/blob/master/Parquet.ipynb
The executed notebook in PDF form is here https://github.com/yashgt/Samples/raw/master/parquet.pdf
What am I doing wrong?
You should run this test on a much larger dataset to see the expected results. Parquet is columnar storage designed for big-data analytics. It carries a fair amount of metadata, and in your case that overhead is likely large compared with the content itself, so you get no benefit from selecting only a few columns (or even all of them) at this dataset size compared with the CSV.
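For reference, here is a minimal sketch (pandas with the pyarrow engine; the file name and city value are assumptions, not taken from the notebook) of the kind of partitioned write being discussed. With a dataset this small, every "city" value becomes its own tiny Parquet file, and the per-file footer metadata can easily outweigh the data:

import pandas as pd

# Hypothetical small input; with only a few rows per city, each partition
# directory ends up holding a Parquet file dominated by metadata.
df = pd.read_csv('cities.csv')
df.to_parquet('cities_parquet', partition_cols=['city'])

# A filter on the partition column only touches matching directories,
# but at this scale the fixed per-file overhead still dominates.
subset = pd.read_parquet('cities_parquet', filters=[('city', '=', 'Pune')])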

Can I access a Parquet file via index without reading the entire file into memory?

I just read that HDF5 allows you to seek into data without reading the entire file into memory.
Is this seeking behavior possible with Parquet files without Java (i.e. non-PySpark solutions)? I am using Parquet because of its strong dtype support.
import h5py
import numpy as np

my_ndarray = np.random.rand(100, 3)  # example data; any array works here

f = h5py.File('my_file.hdf5', 'w')   # write the array out as an HDF5 dataset
dset = f.create_dataset('coords', data=my_ndarray)
f.close()

f = h5py.File('my_file.hdf5', 'r')   # reopen and slice without loading everything
dset = f['coords']
my_array = dset[-2:]                 # only the last two rows are read from disk
https://arrow.apache.org/docs/python/parquet.html#inspecting-the-parquet-file-metadata
I see here that Parquet metadata has num_row_groups: 1 (or more). But I am not sure how that helps me fetch rows [23, 42, 117, 99293184].
Parquet allows some forms of partial / random access, but it is limited. Each Parquet file is made up of one or more row groups, and each row group contains a chunk of every column. You can retrieve any combination of row groups and columns that you want.
The column layout inside a Parquet file is fixed, but it is up to the creator of the file how to distribute the rows into row groups. The creator could put every row in its own row group (although this would be very inefficient) or use a single row group for the entire file (which is quite common).
This means the ability to do partial reads depends on how the file was created. If you are creating the files and you know ahead of time what sorts of reads will be done against the data, you can use that knowledge to size the row groups. If you don't know the access patterns ahead of time, or you have no control over the creation of the files you are reading, then you will likely have to read the entire file into memory and filter afterwards.
Another common scenario is to store a single large dataset across many files (so that some rows are in each file). This allows the same sort of partial-read behavior you would get from multiple row groups, and multiple files are sometimes easier to manage.
Both pyarrow and fastparquet give you APIs for filtering and reading individual row groups. They also expose the Parquet file metadata, so you can inspect that information yourself and implement a custom filtering mechanism, as in the sketch below.
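As a rough, hedged illustration of both sides (not code from either library's documentation, just the public pyarrow API and made-up file names): the writer controls the row-group size, and the reader can then pull individual row groups or let statistics-based filters prune them.

import pyarrow as pa
import pyarrow.parquet as pq

# Writer side: choose a row-group size that matches the expected reads.
table = pa.table({'id': list(range(1_000_000)),
                  'value': [i * 2 for i in range(1_000_000)]})
pq.write_table(table, 'data.parquet', row_group_size=100_000)

# Reader side: read a single row group (and a single column) instead of the whole file.
pf = pq.ParquetFile('data.parquet')
first_group = pf.read_row_group(0, columns=['value'])

# Or let pyarrow prune row groups using their min/max statistics.
filtered = pq.read_table('data.parquet', filters=[('id', '>=', 900_000)])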

Hive file formats advantages and disadvantages

I am starting to work with Hive.
I would like to know which kinds of queries suit each of the following table formats:
rcfile, orcfile, parquet, delimited text
When you have tables with a very large number of columns and you tend to query specific columns frequently, the RCFile format is a good choice. Rather than reading an entire row of data, you retrieve just the required columns, saving time. The data is divided into groups of rows, which are then divided into groups of columns.
Delimited text is the general-purpose file format.
For the ORC file format, have a look at the Hive documentation, which has a detailed description here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC
The Parquet file format stores data in columnar form.
For example:
Col1 Col2
A 1
B 2
C 3
Stored row-wise, this data would be laid out as A1B2C3; with Parquet's columnar layout it is stored as ABC123.
For the Parquet file format, have a read of https://blog.twitter.com/2013/dremel-made-simple-with-parquet
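To make the row-wise vs. column-wise point concrete, here is a small hedged sketch (pyarrow, with a made-up file name) that writes the three-row example above to Parquet and reads back only Col2; because the storage is columnar, the bytes for Col1 never need to be decoded:

import pyarrow as pa
import pyarrow.parquet as pq

# The tiny example table from above: Col1 = A, B, C and Col2 = 1, 2, 3.
table = pa.table({'Col1': ['A', 'B', 'C'], 'Col2': [1, 2, 3]})
pq.write_table(table, 'example.parquet')

# Column projection: only the Col2 column chunk is read from the file.
only_col2 = pq.read_table('example.parquet', columns=['Col2'])
print(only_col2.to_pydict())  # {'Col2': [1, 2, 3]}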
I see that there are already a couple of answers, but since your question didn't ask about any particular file format, each answer addressed one format or another.
There are a number of file formats you can use in Hive; notable ones are Avro, Parquet, RCFile, and ORC. There are some good documents available online comparing the performance and space utilization of these file formats. The following links will get you going:
This Blog Post
This link from MapR [They don't discuss Parquet though]
This link from Inquidia
I hope this answers your query. Thanks!

Kettle: load CSV file which contains multiple data tables

I'm trying to import data from a CSV file which, unfortunately, contains multiple data tables. Actually, it's not really a pure CSV file.
It contains a header section with some metadata, and then the actual CSV data parts are separated by:
//-------------
Table <table_nr>;;;;
An example file looks as follows:
Summary;;
Reporting Date;29/05/2013;12:36:18
Report Name;xyz
Reporting Period From;20/05/2013;00:00:00
Reporting Period To;26/05/2013;23:59:59
//-------------
Table 1;;;;
header1;header2;header3;header4;header5
string_aw;0;0;0;0
string_ax;1;1;1;0
string_ay;1;2;0;1
string_az;0;0;0;0
TOTAL;2;3;1;1
//-------------
Table 2;;;
header1;header2;header3;header4
string_bv;2;2;2
string_bw;3;2;3
string_bx;1;1;1
string_by;1;1;1
string_bz;0;0;0
What would be the best way to load such data using Kettle?
Is there a way to split this file into the header and csv data parts and then process each of them as separate inputs?
Thanks in advance for any hints and tips.
Best,
Haes.
I don't think there are any steps that will really help you with data in such a format. You probably need to do some preprocessing before bringing your data into a CSV step. You could still do this in your job, though, by calling out to the shell and executing a command there first, like maybe an awk script to split up the file into its component files and then load those files via the normal Kettle pattern.
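As an alternative to awk, a small Python splitter could do that preprocessing; this is a hedged sketch with assumed file names, based on the separator shown in the example above. It writes each "Table N" block to its own CSV file that a normal CSV input step can then read:

# Split the mixed report file on the "//-------------" separator lines.
sections = []
current = []
with open('report.csv', encoding='utf-8') as src:
    for line in src:
        if line.startswith('//---'):
            sections.append(current)
            current = []
        else:
            current.append(line)
sections.append(current)  # the last block has no trailing separator

# sections[0] is the summary header; the rest are the per-table CSV parts.
for i, block in enumerate(sections[1:], start=1):
    with open(f'table_{i}.csv', 'w', encoding='utf-8') as dst:
        dst.writelines(block[1:])  # drop the "Table N;;;;" marker, keep headers and rows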

Amazon EMR JSON

I am using Amazon EMR Hadoop Hive for big-data processing. The current data in my log files is in CSV format. In order to build a table from the log files, I wrote a regex to parse the data and store it into different columns of an external table. I know that a SerDe can be used to read data in JSON format, which would mean each log file line is a JSON object. Are there any Hadoop performance advantages if my log files are in JSON format compared with CSV?
If you can process the output of the table (the one you created with the regex), why do another round of processing? Try to avoid unnecessary steps.
I think the main issue here is which format is faster to read. I believe CSV will be quicker than JSON, but don't take my word for it. Hadoop really doesn't care; it's all byte arrays to it once in memory.
