Partition Parquet files on Azure Blob (pyarrow) - azure-blob-storage

I have been manually partitioning files with pandas (creating an index or multi-index and then writing a separate Parquet file for each index value in a loop) and uploading them to Azure Blob.
However, when reading the docs for pyarrow, I see that it is possible to create a 'dataset' which includes a folder structure for partitioned data. https://arrow.apache.org/docs/python/parquet.html
The example with the year / month folder structure is exactly what I am trying to achieve.
dataset_name/
  year=2007/
    month=01/
      0.parq
      1.parq
      ...
    month=02/
      0.parq
      1.parq
      ...
    month=03/
      ...
  year=2008/
    month=01/
    ...
import pyarrow as pa
import pyarrow.parquet as pq

fs = pa.hdfs.connect(host, port, user=user, kerb_ticket=ticket_cache_path)
pq.write_to_dataset(table, root_path='dataset_name',
                    partition_cols=['one', 'two'], filesystem=fs)
Can I do this with Azure Blob (or Minio which uses S3 and wraps over my Azure Blob storage)? My ultimate goal is to only read files which make sense for my 'query'.

Based on my experience and your current environment (Linux on an Azure VM), I think there are two solutions for reading partitioned Parquet files from Azure Storage.
Follow the section Reading a Parquet File from Azure Blob storage of the pyarrow document Reading and Writing the Apache Parquet Format: manually list the blob names under a prefix such as dataset_name using the list_blob_names(container_name, prefix=None, num_results=None, include=None, delimiter=None, marker=None, timeout=None) API of the Azure Storage SDK for Python, read those blobs one by one into dataframes as in the sample code, and finally concat the dataframes into a single one.
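Here is a minimal sketch of that first approach, assuming the legacy azure-storage-blob SDK (BlockBlobService) and hypothetical account/container names:

import io

import pandas as pd
import pyarrow.parquet as pq
from azure.storage.blob import BlockBlobService  # legacy azure-storage-blob (pre-v12) SDK

# Hypothetical account and container names for illustration.
blob_service = BlockBlobService(account_name='myaccount', account_key='...')
container = 'mycontainer'

frames = []
for blob_name in blob_service.list_blob_names(container, prefix='dataset_name/'):
    # Download each partition file and read it into a pandas dataframe.
    # Note: the partition columns (year/month) live in the blob path, not inside the files.
    blob = blob_service.get_blob_to_bytes(container, blob_name)
    frames.append(pq.read_table(io.BytesIO(blob.content)).to_pandas())

# Concatenate all partition files into a single dataframe.
df = pd.concat(frames, ignore_index=True)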
Alternatively, use Azure/azure-storage-fuse to mount a container of Azure Blob Storage into your Linux filesystem; then you just need to follow the document section Reading from Partitioned Datasets to read the partitioned dataset locally from Azure Blob Storage.
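As a rough sketch of the second approach (the mount point /mnt/blob and the filter values are just examples), once the container is mounted with blobfuse you can let pyarrow prune partitions for you:

import pyarrow.parquet as pq

# The container is assumed to be mounted via blobfuse at /mnt/blob (hypothetical path).
# Only the matching year=/month= folders are scanned, which gives you the
# "only read files which make sense for my query" behaviour.
table = pq.read_table(
    '/mnt/blob/dataset_name',
    filters=[('year', '=', 2007), ('month', '=', 1)],
)
df = table.to_pandas()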

Related

How to load data from Lake database into Synapse Notebook through Pipeline?

I have a Spark lake database in Synapse with 6 tables. The data in all the tables was loaded from 6 different CSV files. These CSV files are loaded and updated manually by a third party whenever new data comes in, and in the future the data in these files will also be loaded manually. The file names will always be the same.
Currently, in my Synapse notebook, I am using those 6 tables to transform a new file that came in for processing, and I have transformed one file using PySpark. In my case I am manually giving the file name in my code, which is connected to Synapse ADLS since our source files arrive there, but in the future this process will be automated; the code should work for every new source file that comes in for processing.
My question is about the 6 tables in my Spark lake database: when we create an ETL process for this in Synapse and run my code in a Notebook activity, will the code still be able to read data from those 6 tables? And if new data is loaded into those 6 tables, will I see the changes in my tables, and likewise in my transformed file?
This is the code I am currently using to load data from one of the tables in my lake database into my notebook:
%%pyspark
df_IndustryData = spark.sql("SELECT * FROM DATA.Industry_data")
display(df_IndustryData)
Thanks in advance for your responses
I'm not sure I understand your question, but I think you may find this applicable.
Lake Database (Spark) Tables are persisted in ADLS as a folder of Parquet files, and then exposed as External Tables. You can then query these tables from either your notebook or Serverless SQL.
A Lake Database Table is therefore a logical schema overlaid on top of the physical files in storage. So whenever you update/overwrite the underlying data [the physical Parquet files], the Lake Database Table (External Table) will show the current data in the files.
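As a minimal sketch (using the table from your own snippet), you can see the folder a table points at, and re-running the query after the underlying Parquet files have been reloaded returns the refreshed data with no extra refresh step:

%%pyspark
# A Lake Database table is a schema over a folder of Parquet files;
# DESCRIBE TABLE EXTENDED shows the storage location it points at.
spark.sql("DESCRIBE TABLE EXTENDED DATA.Industry_data").show(truncate=False)

# Each run of the query reads whatever is currently in those files,
# so newly loaded data shows up automatically.
df_IndustryData = spark.sql("SELECT * FROM DATA.Industry_data")
display(df_IndustryData)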

Lowest storage cost option for batch data in Parquet format?

I'm implementing a batch dataset in the Parquet format, where the data files will be produced using Azure Data Factory and stored in Azure Data Lake Storage. The files will be consumed by an Azure Synapse Analytics serverless SQL pool. Which one of these options would be best to minimize storage cost?
Using Snappy compression for the files.
Using OPENROWSET to query the Parquet files.
Creating an external table that contains a subset of columns from the Parquet files.
Storing all data as strings in the Parquet files.

Blob Storage read efficiency

One question about read efficiency when using Azure Blob Storage: is it faster to read from multiple small files (e.g. 5 MB each) or from one large file (e.g. > 200 MB)?
In my current project, I need to persist stream data to Azure Blob Storage in Avro format and read it back afterwards. For example, I can either persist the data every 15 minutes into a separate Avro file, which produces 4 * 24 = 96 files per day, or I can use an append blob to append all data to one file, which produces a single Avro file per day. When reading the data of the past few days from Blob Storage, which case would be more efficient?

Build pipeline from Oracle DB to AWS DynamoDB

I have an Oracle instance running on a standalone EC2 VM, and I want to do two things.
1) Copy the data from one of my Oracle tables into a cloud directory that can be read by DynamoDB. This will only be done once.
2) Then, daily, I want to append any changes to that source table into the DynamoDB table as another row that shares an id, so I can visualize how that row changes over time.
Ideally I'd like a solution that would be as easy as piping the results of a SQL query into a program that dumps that data into a cloud file system (S3, HDFS?); then I will want to convert that data into a format that can be read by DynamoDB.
So I need these things:
1) A transport tool: I want to be able to type something like this on the command line:
sqlplus ... "SQL Query" | transport --output_path --output_type etc etc
2) For the path I need a cloud file system, S3 looks like the obvious choice since I want a turn key solution here.
3) This last part is a nice to have because I can always use a temp directory to hold my raw text and convert it in another step.
I assume the "cloud directory" or "cloud file system" you are referring to is S3? I don't see how it could be anything else in this context, but you are using very vague terms.
Triggering the DynamoDB insert to happen whenever you copy a new file to S3 is pretty simple, just have S3 trigger a Lambda function to process the data and insert into DynamoDB. I'm not clear on how you are going to get the data into S3 though. If you are just running a cron job to periodically query Oracle and dump some data to a file, which you then copy to S3, then that should work.
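If it helps, here is a rough sketch of that trigger path, with hypothetical bucket and table names, assuming the Oracle dump lands in S3 as a CSV file with an 'id' column and the DynamoDB table uses 'id' as the partition key and 'snapshot' as the sort key:

import csv
import io
import urllib.parse

import boto3

s3 = boto3.client('s3')
# Hypothetical DynamoDB table that keeps one row per (id, snapshot) so changes are visible over time.
table = boto3.resource('dynamodb').Table('oracle_changes')

def handler(event, context):
    # Invoked by an S3 ObjectCreated event whenever a new dump file is uploaded.
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])
        body = s3.get_object(Bucket=bucket, Key=key)['Body'].read().decode('utf-8')

        # Assumes a CSV export; adjust the parsing to whatever format your sqlplus dump uses.
        for row in csv.DictReader(io.StringIO(body)):
            # Use the file name as the sort key so each daily load adds a new row per id.
            row['snapshot'] = key
            table.put_item(Item=row)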
You need to know that you can't append to a file on S3; you would need to write the entire file each time you push new data to S3. If you want to stream the data somehow, then using Kinesis instead of S3 might be a better option.

Difference between Database and File Storage in Parse.com

Based on the FAQ at Parse.com:
What is the difference between database storage and file storage?
Database storage refers to data stored as Parse Objects, which are
limited to 128 KB in size. File storage refers to static assets that
are stored using the Parse File APIs, typically images, documents, and
other types of binary data.
Just want some clarification here:
So the strings, arrays, etc. that are created are considered Parse Objects and fall under database storage, and the URL of a file also counts as database storage since it is stored on a Parse Object. But the actual files themselves are considered File Storage?
Thanks.
Yes. Any file that you upload to Parse goes to the File storage, the rest is stored in the database including the URL of such files.
