How to write CSV to Azure Storage Gen2 with Databricks (Python) - azure-databricks

I would like to write a regular CSV file to storage, but what I get is a folder named "sample_file.csv" with 4 files under it. How do I create a normal csv file from a data frame in Azure Storage Gen2?
I'm happy with any advice or a link to an article.
df.coalesce(1).write.option("header", "true").csv(TargetDirectory + "/sample_file.csv")
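Spark always writes a directory of part files rather than a single CSV, so a common workaround is to write to a temporary folder first and then copy the single part file to the desired name. A rough sketch, assuming Databricks dbutils is available; the temporary folder name is illustrative:
tmp_dir = TargetDirectory + "/_tmp_sample_file"

# force a single part file and write it into the temporary folder
df.coalesce(1).write.option("header", "true").mode("overwrite").csv(tmp_dir)

# locate the single part-*.csv file Spark produced
part_file = [f.path for f in dbutils.fs.ls(tmp_dir) if f.name.startswith("part-")][0]

# copy it to the desired flat file name, then remove the temporary folder
dbutils.fs.cp(part_file, TargetDirectory + "/sample_file.csv")
dbutils.fs.rm(tmp_dir, True)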

Related

DBFS FileStore Equivalent in Azure Synapse?

Is there an equivalent to Databricks' DBFS FileStore system in Azure Synapse? Is it possible to upload csv files and read them into pandas dataframes within Azure Synapse notebooks? Ideally I'd like to not load the csv into a database; looking for something as simple as DBFS' FileStore folder.
In Databricks: pd.read_csv('/dbfs/FileStore/name_of_file.csv')
In Synapse: ?
I don't see anywhere to upload csv files directly like in DBFS:
The Azure Synapse equivalent of using FileStore in Databricks is to use the data lake file system linked to your Synapse workspace. In Synapse Studio, navigate to Data -> Linked, where you can find the linked storage account. This storage account was created/assigned when you created your workspace.
This primary data lake functions much like the FileStore in Azure Databricks. You can use the UI shown in the image above to upload the required files. You can right-click on any of the files and load it into a DataFrame: as shown in the image below, right-click the file and choose New notebook -> Load to DataFrame.
The UI automatically generates code that loads the csv file into a Spark DataFrame. You can modify this code to load the file as a pandas DataFrame instead.
# This code is generated by Synapse when you select the file and choose Load to DataFrame
df = spark.read.load('abfss://data@datalk1506.dfs.core.windows.net/sample_1.csv', format='csv'
    # If a header row exists, uncomment the line below
    # , header=True
)
display(df.limit(10))

# Use the following code to load the file as a pandas DataFrame instead
import pandas as pd
df = pd.read_csv('abfss://data@datalk1506.dfs.core.windows.net/sample_1.csv')
This data lake storage is linked to the workspace via a linked service (viewable under Manage -> Linked services). The linked service is created by default from the data lake and file system information that the user must provide when creating the Synapse workspace.

Azure Data Factory Unzipping many files into partitions based on filename

I have a large zip file that has 900k json files in it. I need to process these with a data flow. I'd like to organize the files into folders using the last two digits in the file name so I can process them in chunks of 10k. My question is: how do I set up a pipeline to use part of the file name of the files in the zip file (the source) as part of the path in the sink?
current setup: zipfile.zip -> /json/XXXXXX.json
desired setup: zipfile.zip -> /json/XXX/XXXXXX.json
Please check if the references below can help.
In source transformation, you can read from a container, folder, or individual file in Azure Blob storage. Use the Source options tab to manage how the files are read. Using a wildcard pattern will instruct the service to loop through each matching folder and file in a single source transformation. This is an effective way to process multiple files within a single flow.
[ ] Matches one or more characters in the brackets.
/data/sales/**/*.csv Gets all .csv files under /data/sales
Please also go through Copy and transform data in Azure Blob storage - Azure Data Factory & Azure Synapse | Microsoft Docs for other patterns and to check all filtering possibilities in Azure Blob storage.
How to UnZip Multiple Files which are stored on Azure Blob Storage By using Azure Data Factory - Bing video
In the sink transformation, you can write to either a container or a folder in Azure Blob storage.
File name option: Determines how the destination files are named in the destination folder.
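For clarity, this is the filename-to-folder mapping the flow needs to perform, sketched in plain Python purely as an illustration; in the data flow itself it would be expressed with the sink's file name / partitioning expressions rather than code:
import os

def target_path(file_name):
    # "123456.json" -> "123456"
    stem = os.path.splitext(file_name)[0]
    # the last two digits decide the folder, e.g. "56"
    bucket = stem[-2:]
    return "/json/" + bucket + "/" + file_name

print(target_path("123456.json"))  # -> /json/56/123456.json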

Why is an empty file with the name of a folder created inside an Azure Blob storage container?

I am running a Hive QL through an HDInsight on-demand cluster which does the following:
Spool the data from a Hive view
Create a folder named abcd inside a Blob storage container named XYZ
Store the view data in a file inside the abcd folder
However, when the Hive QL is run, an empty file with the name abcd gets created outside the abcd folder.
Any idea why this is happening and how we can stop it from happening? Please suggest.
Thanks,
Surya
You get this because the Azure storage you are mounting does not have a hierarchical file system. For example, the mount is a blob storage account of type StorageV2, but the hierarchical namespace option was not enabled at creation time. A version 2 storage account with a hierarchical namespace is known as Azure Data Lake Storage generation 2 (ADLS Gen2), which essentially removes the blob vs. lake distinction you had with ADLS Gen1 versus older blob generations.
Depending on the blob API you are using, a number of tricks are used to give you the illusion of a hierarchical file system even when you don't have one, like creating empty files, or hidden ones. The main point is that the namespace is flat (i.e. there is no real hierarchy), so you can't just create an empty folder; you have to put something there.
For example, if you mount a v2 blob with the wasbs:// driver in Databricks and run mkdir -p /dbfs/mnt/mymount/this/is/a/path from a %sh cell, you will see something like this:
this folder, this empty file
this/is folder, this/is empty file
etc.
Finally, while this is perfectly fine for Azure Blob itself, it might cause trouble for anything else not expecting it, even %sh ls.
Just recreate the storage as ADLS Gen2, or upgrade the existing account in place by enabling the hierarchical namespace.
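As a side note, the flat-namespace behaviour described above can be seen directly with the azure-storage-blob Python SDK; a minimal sketch with placeholder names:
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    conn_str="<connection-string>", container_name="<container>")

# upload a blob whose name merely contains slashes; no directories are created
container.upload_blob("this/is/a/path/file.txt", b"hello", overwrite=True)

# listing with a delimiter is what makes the flat names look hierarchical
for item in container.walk_blobs(delimiter="/"):
    print(item.name)  # prints the virtual "folder" prefix, e.g. "this/"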
Thanks,

Blob files have to be renamed manually to include the parent folder path

We are new to Windows Azure and have used Windows Azure storage for blob objects while developing a Sitefinity application, but the blob files which are uploaded to this storage via publishing to Azure from Visual Studio are uploaded with only the file names and do not keep the prefix folder name and slash. Hence we have to rename all files manually on the Windows Azure management portal and put the folder name and slash at the beginning of each file name so that the page accessing these images can show them properly; otherwise the images are not shown due to the incorrect path.
Though in the Sitefinity admin panel, when we upload these images/blob files on those pages, we upload them inside a folder, and we have configured Sitefinity to use Azure storage instead of the database.
Please check the file attached to see the screenshot.
Please help me to solve this.
A few things I would like to mention first:
Windows Azure does not support rename functionality. Rename blob functionality = copy blob followed by delete blob.
Copy blob operation is asynchronous so you must wait for copy operation to finish before deleting the blob.
Blob storage does not support folder hierarchy natively. As you may have already discovered, you create an illusion of a folder by prepending a blob name (say logo.png) with the name of folder you want (say images) and separate them with slash (/) so your blob name becomes images/logo.png.
Now coming to your problem. Needless to say that manually renaming the blobs would be a cumbersome exercise. I would recommend using a storage management tool to do that. One such example would be Azure Management Studio from Cerebrata. If you use that tool, essentially what you can do is create an empty folder in the container and then move the files into that folder. That to me would be the fastest way to achieve your objective.
If you wish to write some code to do that, here are the steps you will take:
First you will list all blobs in a blob container.
Next you will loop over this list.
For each blob (let's call it the source blob), you would take its name, prepend the folder name you want, and create an instance of a CloudBlockBlob object for that new name.
Next you would initiate a copy blob operation using StartCopyFromBlob on this new blob, with your source blob as the copy source.
You would need to wait for the copy operation to finish. Once the copy operation is finished, you can safely delete the source blob.
P.S. I would have written some code but unfortunately I'm stuck with something else. I might write something later on (but please don't hold your breath for that :)).
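For reference, here is a rough sketch of those steps. It uses the current azure-storage-blob Python SDK purely as an illustration (the steps above refer to the older .NET CloudBlockBlob/StartCopyFromBlob API), and the connection string, container, and folder names are placeholders:
import time
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    conn_str="<connection-string>", container_name="<container>")
folder = "images/"

for blob in container.list_blobs():
    if blob.name.startswith(folder):
        continue  # already "inside" the target folder
    source = container.get_blob_client(blob.name)
    target = container.get_blob_client(folder + blob.name)
    target.start_copy_from_url(source.url)  # copy blob is asynchronous

    # wait for the copy to finish before deleting the source blob
    while target.get_blob_properties().copy.status == "pending":
        time.sleep(1)
    source.delete_blob()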

Unzip the .GZ file in a Worker Process of Azure

Can anyone provide me an idea of how to implement unzipping of a .gz format file in a Worker Role? If I write the unzipping code, where do I need to store the unzipped file (i.e. one text file)? Will it be placed in some location in Azure? How can I specify a path in the Windows Azure Worker process, like the current executing directory? If this approach does not work, do I need to create one more blob to store the unzipped content of the .gz file, i.e. the txt?
-mahens
In your Worker Role, it is up to you how the .gz file arrives (for example, downloaded from Azure Blob storage); however, once the file is available you can use GZipStream to compress/uncompress a .GZ file. You can also find a code sample at the link above with Compress and Decompress functions.
This SO discussion shares a few tools and code to explain how you can unzip .GZ using C#:
Unzipping a .gz file using C#
Next, when you use the Decompress/Compress code in a Worker Role, you can store the result directly in local storage (as suggested by JcFx) or use a MemoryStream to store it directly to Azure Blob Storage.
The following SO article shows how you can use GZipStream to store unzipped content into MemoryStream and then use UploadFromStream() API to store directly to Azure Blob storage:
How do I use GZipStream with System.IO.MemoryStream?
If you don't need to do anything with the unzipped file, then storing it directly to Azure Blob storage is best; however, if you have to do something with the unzipped content, you can save it locally as well as store it back to Azure Blob storage for further usage.
This example, using SharpZipLib, extracts a .gzip file to a stream. From there, you could write it to Azure local storage, or to blob storage:
http://wiki.sharpdevelop.net/GZip-and-Tar-Samples.ashx
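As an illustration of decompressing and then uploading straight back to blob storage (the thread itself is about C# and GZipStream; this sketch uses Python's gzip module and the azure-storage-blob SDK, with placeholder names):
import gzip
import io
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("<container>")

# download the .gz blob into memory
compressed = container.download_blob("input/data.txt.gz").readall()

# decompress it
text_bytes = gzip.decompress(compressed)

# upload the unzipped content directly back to blob storage
container.upload_blob("output/data.txt", io.BytesIO(text_bytes), overwrite=True)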
