Lowest storage cost option for batch data in Parquet format?

I'm implementing a batch dataset in the Parquet format, where the data files will be produced using Azure Data Factory and stored in Azure Data Lake Storage. The files will be consumed by an Azure Synapse Analytics serverless SQL pool. Which of these options would be best to minimize storage cost?
Using Snappy compression for the files.
Using OPENROWSET to query the Parquet files.
Creating an external table that contains a subset of columns from the Parquet files.
Storing all data as strings in the Parquet files.
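For reference, the first option refers to how the Parquet files themselves are compressed. Below is a minimal sketch of what a Snappy-compressed Parquet write looks like with pyarrow; the file name and sample data are made up for illustration and are not part of the original question.
import pyarrow as pa
import pyarrow.parquet as pq
# Tiny example table; in practice the data would come from the ADF pipeline.
table = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})
# Snappy is applied to each column chunk inside the Parquet file.
pq.write_table(table, "batch_data.snappy.parquet", compression="snappy")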

Related

How to load data from Lake database into Synapse Notebook through Pipeline?

I have a Spark lake database in Synapse in which I have 6 tables. The data in all of these tables was loaded from 6 different CSV files. These CSV files are loaded and updated manually by a third party whenever new data arrives, and in the future the data in these files will still be loaded manually. The file names will always stay the same.
Currently, in my Synapse notebook, I use the data from those 6 tables to transform a new file that arrives for processing, and I have already transformed one file using PySpark. Right now I am manually supplying the file name in my code, which is connected to the Synapse ADLS account where our source files land, but in the future this process will be automated. The code should work for every new source file that arrives for processing.
My question is about the 6 tables in my Spark lake database: when we create an ETL process for this in Synapse and run my code in a Notebook activity, will the code still be able to read data from those 6 tables? And if new data is loaded into those 6 tables, will I see the changes in my tables, and likewise in my transformed file?
This is the code I am currently using to load data from one of the tables in my lake database into my notebook:
%%pyspark
df_IndustryData = spark.sql("SELECT * FROM DATA.Industry_data")
display(df_IndustryData)
Thanks in advance for your responses
I'm not sure I understand your question, but I think you may find this applicable.
Lake Database (Spark) Tables are persisted in ADLS as a folder of Parquet files, and then exposed as External Tables. You can then query these tables from either your notebook or Serverless SQL.
A Lake Database Table is therefore a logical schema overlaid on top of the physical files in storage. So whenever you update/overwrite the underlying data [the physical Parquet files], the Lake Database Table (External Table) will show the current data in the files.
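To make that concrete, here is a minimal sketch of how the notebook could take the source file name from the pipeline instead of hard-coding it; the parameter name, storage path, and container are assumptions for illustration, not taken from the original post.
%%pyspark
# Assumed parameter cell: the pipeline's Notebook activity would override
# source_file_name through its base parameters.
source_file_name = "new_source_file.csv"
# Placeholder ADLS path where the incoming source files land.
source_path = f"abfss://container@youraccount.dfs.core.windows.net/source/{source_file_name}"
df_source = spark.read.csv(source_path, header=True, inferSchema=True)
# The lake database table is re-read on every run, so rows added to the
# underlying Parquet files show up here automatically.
df_IndustryData = spark.sql("SELECT * FROM DATA.Industry_data")
display(df_IndustryData)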

Blob Storage read efficiency

One question about read efficiency when using Azure Blob Storage: is it faster to read from multiple small files (e.g. 5 MB each) or from one large file (e.g. > 200 MB)?
In my current project, I need to persist stream data to Azure Blob Storage in Avro format and read it back afterwards. For example, I can either persist the data every 15 minutes into a single Avro file, which produces 4 * 24 = 96 files per day, or I can use an append blob to append all data to one file, which produces a single Avro file per day. When reading the data of the past few days from Blob Storage, which approach would be more efficient?

uncompress snappy parquet files in Azure Databricks

I have a bunch of Snappy-compressed Parquet files in a folder in Azure Data Lake.
Does anyone have code that I can use to uncompress the Snappy Parquet files to plain Parquet using Azure Databricks?
Thanks
The compression of Parquet files is internal to the format. You cannot simply uncompress them as you would with files that are compressed as a whole. In Parquet, each column chunk (or actually even smaller parts of it) is compressed individually. Thus, to uncompress, you would need to read the files in with spark.read.parquet and write them out as completely new files with different Parquet settings for the write.
Note that using no compression is actually not useful in most settings. Snappy is such a CPU-efficient codec that the small amount of CPU time it uses is negligible compared with the time saved by writing smaller files to disk or sending them over the network.
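For illustration, a minimal sketch of that read-and-rewrite approach in a Databricks notebook; the input and output paths are placeholders.
# Read the Snappy-compressed Parquet files.
df = spark.read.parquet("abfss://container@account.dfs.core.windows.net/snappy_parquet/")
# Write them back out as new, uncompressed Parquet files.
(df.write
   .option("compression", "none")
   .mode("overwrite")
   .parquet("abfss://container@account.dfs.core.windows.net/uncompressed_parquet/"))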

Partition Parquet files on Azure Blob (pyarrow)

I have been manually partitioning files with pandas (creating an index or multi-index and then writing a separate parquet file for each index in a loop) to Azure Blob.
However, when reading the docs for pyarrow, I see that it is possible to create a 'dataset' which includes a folder structure for partitioned data. https://arrow.apache.org/docs/python/parquet.html
The example for the Monthly / daily folder is exactly what I am trying to achieve.
dataset_name/
    year=2007/
        month=01/
            0.parq
            1.parq
            ...
        month=02/
            0.parq
            1.parq
            ...
        month=03/
            ...
    year=2008/
        month=01/
            ...
fs = pa.hdfs.connect(host, port, user=user, kerb_ticket=ticket_cache_path)
pq.write_to_dataset(table, root_path='dataset_name',
                    partition_cols=['one', 'two'], filesystem=fs)
Can I do this with Azure Blob (or Minio which uses S3 and wraps over my Azure Blob storage)? My ultimate goal is to only read files which make sense for my 'query'.
Just from my experience, and based on your current environment (Linux on an Azure VM), I think there are two solutions for reading partitioned Parquet files from Azure Storage.
Follow the section "Reading a Parquet File from Azure Blob storage" of the pyarrow document "Reading and Writing the Apache Parquet Format": manually list the blob names with a prefix like dataset_name using the list_blob_names(container_name, prefix=None, num_results=None, include=None, delimiter=None, marker=None, timeout=None) API of the Azure Storage SDK for Python, read those blobs one by one into dataframes as in the sample code, and finally concatenate the dataframes into a single one (see the sketch after these two options).
Try Azure/azure-storage-fuse to mount a container of Azure Blob Storage into your Linux filesystem; then you just need to follow the document section "Reading from Partitioned Datasets" to read the partitioned dataset locally from Azure Blob Storage.
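Here is a rough sketch of the first option, using the legacy Azure Storage SDK for Python (BlockBlobService) that the list_blob_names API above belongs to; the account, key, container, and prefix are placeholders.
import io
import pandas as pd
import pyarrow.parquet as pq
from azure.storage.blob import BlockBlobService  # legacy azure-storage-blob (<= 2.x)
blob_service = BlockBlobService(account_name="myaccount", account_key="mykey")
frames = []
for blob_name in blob_service.list_blob_names("mycontainer", prefix="dataset_name/"):
    if not blob_name.endswith(".parq"):
        continue  # skip anything that is not a Parquet part file
    blob = blob_service.get_blob_to_bytes("mycontainer", blob_name)
    frames.append(pq.read_table(io.BytesIO(blob.content)).to_pandas())
# One dataframe covering every partition under the dataset_name/ prefix.
df = pd.concat(frames, ignore_index=True)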

Import small stream in Impala

We are currently working on a Big Data project.
The Big Data platform is Cloudera Hadoop.
As input to our system we have a small flow of data that we collect via Kafka (approximately 80 MB/h, continuously).
The messages are then stored in HDFS to be queried via Impala.
Our client does not want to separate the hot data from the cold data. After 5 minutes, the data must be accessible in the historical (cold) data. We chose to have a single database.
To insert the data, we use the JDBC connector provided by the Impala API (e.g. INSERT INTO ...).
We are aware that this is not the recommended solution: each Impala insertion creates a small file (< 10 KB) in HDFS.
We are looking for a solution to insert a small stream into an Impala database that avoids producing many small files.
What solution would you recommend?