Is there a way to get the partitioning information from a child parquet file path? - parquet

Consider a partitioned parquet file.
example_partitioned_parquet_file.parquet/
├── partitioned_column=value1
│   └── part-00000.c000.snappy.parquet
└── partitioned_column=value2
    └── part-00000.c000.snappy.parquet
Could I get the partitioning information if I only have access to a leaf parquet file? For example, only to /example_partitioned_parquet_file.parquet/partitioned_column=value1/part-00000.c000.snappy.parquet.
While path-splitting and regex-based approaches are possible, I am more interested in an Arrow-based (or similar) programmatic approach. This example has one partition column (partitioned_column), but there could be more.
For example, the partitioning information can be obtained by creating a parquet dataset from the parent directory:
import pyarrow.parquet as pq
dataset_parent = pq.ParquetDataset("example_partitioned_parquet_file.parquet", use_legacy_dataset=False)
dataset_parent.fragments
[<pyarrow.dataset.ParquetFileFragment path=example_partitioned_parquet_file.parquet/partitioned_column=value1/part-00000.c000.snappy.parquet partition=[partitioned_column=value1]>,
<pyarrow.dataset.ParquetFileFragment path=example_partitioned_parquet_file.parquet/partitioned_column=value2/part-00000.c000.snappy.parquet partition=[partitioned_column=value2]>]
Can something similar be achieved if I have only the path to the leaf parquet file?
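One direction that looks possible, assuming the partition column names and types are known up front (they are not stored inside the leaf file itself), is to build a hive-flavored partitioning object and let it parse the path. A rough sketch:
import os
import pyarrow as pa
import pyarrow.dataset as ds

# Hypothetical leaf path; the partition schema below is an assumption supplied by hand.
leaf_path = "example_partitioned_parquet_file.parquet/partitioned_column=value1/part-00000.c000.snappy.parquet"
partitioning = ds.partitioning(pa.schema([("partitioned_column", pa.string())]), flavor="hive")
# Parse the directory portion of the path; segments that are not key=value pairs are ignored.
expr = partitioning.parse(os.path.dirname(leaf_path))
print(expr)  # roughly: (partitioned_column == "value1")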

Related

parquet: Dataset files with differing columns

Using pyarrow.
I have a Parquet Dataset composed of multiple parquet files. If the columns differ between the files, I get a "ValueError: Schema in was different".
Is there a way to avoid this?
Meaning I'd like to have a Dataset composed of files which each contain different columns.
I guess pyarrow could do this by filling in the values of the missing columns as NA if the columns are not present in a particular component file of the Dataset.
Thanks
Load the files into separate dataframes such as df1 and df2, then merge those dataframes by referencing THIS article.
In the article you may find two ways to merge. One is
df1.merge(df2, how='outer')
and the other, using pandas concat, is as follows:
pd.concat([df1, df2])
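Put together, a minimal self-contained sketch of the pandas route (file names are placeholders):
import pandas as pd

# Placeholder file names; each file may carry a different subset of columns.
df1 = pd.read_parquet("part1.parquet")
df2 = pd.read_parquet("part2.parquet")

# Columns missing from one file end up as NaN in the combined frame.
combined = pd.concat([df1, df2], ignore_index=True)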

Is there a way in pyarrow to query the values of parquet dataset partitions?

For example, I have a dataset that looks like this:
dataset
├── a=1
│   └── 1.parquet
├── a=2
│   └── 2.parquet
└── a=3
    └── 3.parquet
and it's loaded in as dataset = pyarrow.parquet.ParquetDataset('./dataset')
How do I query the available entries of partition "a" without reading the whole dataset into memory? Thanks~
See the pieces attribute of ParquetDataset. The partition_keys attribute of each ParquetDatasetPiece will give you the value of each partition key. If you have ideas about an API to make this simpler, please open a JIRA issue in Apache Arrow.
See also https://issues.apache.org/jira/browse/ARROW-1956 about reading specific portions of a partitioned dataset.
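With the newer pyarrow.dataset API, the same information is exposed on the fragments without reading any row data; a rough sketch, assuming hive-style directory names like a=1:
import pyarrow.dataset as ds

dataset = ds.dataset("./dataset", partitioning="hive")
# Each fragment carries its partition key/value as an expression, e.g. (a == 1).
for fragment in dataset.get_fragments():
    print(fragment.path, fragment.partition_expression)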

How to merge HDFS small files into one large file?

I have a number of small files generated from a Kafka stream, so I would like to merge them into one single file. However, the merge should be based on date: the original folder may contain a number of older files, but I only want to merge the files for a given date into one single file.
Any suggestions?
Use something like the code below to iterate over the smaller files and aggregate them into a big one (assuming that source contains the HDFS path to your smaller files, and target is the path where you want your big result file):
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SaveMode

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path(source)).map(_.getPath.toUri.getPath)
  .foreach(name => spark.read.text(name).coalesce(1).write.mode(SaveMode.Append).text(target))
This example assumes a text file format, but you can just as well read any Spark-supported format, and you can use different formats for source and target too.
You should be able to use .repartition(1) to write all results to one file. If you need to split by date, consider partitionBy("your_date_value").
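For reference, a PySpark sketch of that idea; the paths, the parquet format, and the event_date column are all assumptions:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed: the small files are readable by Spark and contain an "event_date" column.
(spark.read.parquet("hdfs:///input/small_files")
    .repartition(1)
    .write.mode("overwrite")
    .partitionBy("event_date")
    .parquet("hdfs:///output/merged"))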
If you're working with HDFS and S3, this may also be helpful. You might actually even use s3-dist-cp and stay within HDFS.
https://aws.amazon.com/blogs/big-data/seven-tips-for-using-s3distcp-on-amazon-emr-to-move-data-efficiently-between-hdfs-and-amazon-s3/#5
There's a specific option to aggregate multiple files in HDFS using a --groupBy option based on a regular-expression pattern. So if the date is in the file name, you can group based on that pattern.
You can develop a Spark application that reads the data from the small files into a dataframe and writes the dataframe to the big file in append mode.

How to output multiple s3 files in Parquet

Writing parquet data can be done with something like the following. But if I want to write to more than just one file, and moreover want to output to multiple s3 files so that reading a single column does not read all the s3 data, how can this be done?
AvroParquetWriter<GenericRecord> writer =
        new AvroParquetWriter<GenericRecord>(file, schema);
GenericData.Record record = new GenericRecordBuilder(schema)
        .set("name", "myname")
        .set("favorite_number", i)
        .set("favorite_color", "mystring").build();
writer.write(record);
For example, what if I want to partition by a column value, so that all the data with a favorite_color of red goes in one file and those with blue in another, to minimize the cost of certain queries? There should be something similar in a Hadoop context.
df.write.parquet("hdfs:///my_file", partitionBy=["created_year", "created_month"])
But I can find no equivalent to partitionBy in plain Java with Hadoop.
In a typical Map-Reduce application, the number of output files will be the same as the number of reduces in your job. So if you want multiple output files, set the number of reduces accordingly:
job.setNumReduceTasks(N);
or alternatively via the system property:
-Dmapreduce.job.reduces=N
I don't think it is possible to have one column per file with the Parquet format. The internal structure of a Parquet file is first split into row groups, and only these row groups are then split by column.
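Stepping outside plain Java for a moment, the partition-by-column layout the question describes is straightforward to produce from Python with pyarrow; a small sketch only to illustrate the resulting directory structure (the data and paths are made up):
import pyarrow as pa
import pyarrow.parquet as pq

# Made-up data; favorite_color becomes a directory level such as favorite_color=red/.
table = pa.table({
    "name": ["a", "b", "c"],
    "favorite_color": ["red", "blue", "red"],
})
pq.write_to_dataset(table, root_path="out_dataset", partition_cols=["favorite_color"])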

How can I use a for loop to import multiple data files in Neo4j?

Suppose I have a list of paths to five CSV files that I would like to import, that all have the same structure. How can I simply do something like this?
for path in paths:
LOAD CSV WITH HEADERS FROM path as row
WITH row
CREATE (n:Person { name: row.name })
;
This is not directly possible with Cypher. Use some preprocessing tool to either aggregate your csv files into one, or call LOAD CSV for each of the files.
For preprocessing, csvkit is a good choice.
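If you go the one-LOAD-CSV-per-file route, a small driver-side loop works; a sketch with the Neo4j Python driver, where the connection details and file URLs are placeholders:
from neo4j import GraphDatabase

# Placeholder connection details and file URLs.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
paths = ["file:///people1.csv", "file:///people2.csv"]

query = """
LOAD CSV WITH HEADERS FROM $path AS row
CREATE (n:Person { name: row.name })
"""

with driver.session() as session:
    for path in paths:
        session.run(query, path=path)

driver.close()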
