Unzip a .gz file in a Windows Azure worker process

Can anyone give me an idea of how to unzip a .gz file from a Worker process? If I write the unzipping code myself, where do I store the unzipped file (i.e. one text file)? Will it be written to some location in Azure? How can I specify a path in a Windows Azure Worker process, such as the current executing directory? If this approach does not work, I will need to create another blob to store the unzipped content of the .gz file, i.e. the .txt file.
-mahens

In your Worker Role, it is up to you how the .gz file arrives (for example, downloaded from Azure Blob storage). Once the file is available, you can use GZipStream to compress or decompress it. The linked page also includes a code sample with Compress and Decompress functions.
This SO discussion shares a few tools and some code showing how to unzip a .gz file using C#:
Unzipping a .gz file using C#
Next, when you use the Decompress/Compress code in a Worker Role, you can either store the result in local storage (as suggested by JcFx) or use a MemoryStream to store it directly in Azure Blob storage.
The following SO article shows how to use GZipStream to write the unzipped content into a MemoryStream and then use the UploadFromStream() API to store it directly in Azure Blob storage:
How do I use GZipStream with System.IO.MemoryStream?
If you don't need to do anything with the unzipped file, storing it directly in Azure Blob storage is best. However, if you have to work with the unzipped content, you can save it locally and also store it back in Azure Blob storage for further use.
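As a rough illustration of the two options above (an in-memory stream versus a local file), here is a minimal Python sketch using only the standard library. It is not the C# GZipStream code this answer refers to. The function names gunzip_to_stream and gunzip_to_file are made up for this example, and the download from and upload to Blob storage are deliberately left out.

```python
import gzip
import io
import shutil
from pathlib import Path


def gunzip_to_stream(compressed: bytes) -> io.BytesIO:
    """Decompress gzip bytes into an in-memory stream, rewound to the start."""
    out = io.BytesIO()
    with gzip.GzipFile(fileobj=io.BytesIO(compressed), mode="rb") as gz:
        shutil.copyfileobj(gz, out)  # copies in chunks instead of one big read
    out.seek(0)  # rewind so a later upload or read starts at byte 0
    return out


def gunzip_to_file(compressed: bytes, target: Path) -> Path:
    """Decompress gzip bytes to a file, e.g. somewhere under local storage."""
    target.parent.mkdir(parents=True, exist_ok=True)
    with gzip.GzipFile(fileobj=io.BytesIO(compressed), mode="rb") as gz:
        with open(target, "wb") as f:
            shutil.copyfileobj(gz, f)
    return target


if __name__ == "__main__":
    sample = gzip.compress(b"one text file\n")
    print(gunzip_to_stream(sample).read())
```

The stream version corresponds to the "upload directly" route (the returned stream is what you would hand to an upload call), while the file version corresponds to the local storage route.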

This example, using SharpZipLib, extracts a .gz file to a stream. From there, you could write it to Azure local storage or to blob storage:
http://wiki.sharpdevelop.net/GZip-and-Tar-Samples.ashx

Related

Azure Data Factory Unzipping many files into partitions based on filename

I have a large zip file that contains 900k JSON files, which I need to process with a data flow. I'd like to organize the files into folders using the last two digits of the file name so I can process them in chunks of 10k. My question is: how do I set up a pipeline that uses part of the name of each file in the zip file (the source) as part of the path in the sink?
current setup: zipfile.zip -> /json/XXXXXX.json
desired setup: zipfile.zip -> /json/XXX/XXXXXX.json
Please check whether the references below help.
In source transformation, you can read from a container, folder, or
individual file in Azure Blob storage. Use the Source options tab to
manage how the files are read. Using a wildcard pattern will instruct
the service to loop through each matching folder and file in a single
source transformation. This is an effective way to process multiple
files within a single flow.
[ ] Matches one or more characters in the brackets.
/data/sales/**/*.csv Gets all .csv files under /data/sales
Please also go through "Copy and transform data in Azure Blob storage - Azure Data Factory & Azure Synapse | Microsoft Docs" for other patterns and for all the filtering options in Azure Blob storage.
How to UnZip Multiple Files which are stored on Azure Blob Storage By using Azure Data Factory - Bing video
In the sink transformation, you can write to either a container or a folder in Azure Blob storage.
File name option: Determines how the destination files are named in the destination folder.
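For reference, the file-name-to-folder mapping the question asks for can be sketched in plain Python, outside Data Factory. This is not a data flow configuration, only an illustration of the intended layout. The names partition_path and extract_partitioned, the digits parameter, and the "json" root folder are assumptions made for this example.

```python
import zipfile
from pathlib import Path, PurePosixPath


def partition_path(member_name: str, digits: int = 2, root: str = "json") -> str:
    """Map 'XXXXXX.json' to '<root>/<last digits of the stem>/XXXXXX.json'."""
    name = PurePosixPath(member_name).name  # drop any folder inside the zip
    stem = PurePosixPath(name).stem         # file name without the extension
    return f"{root}/{stem[-digits:]}/{name}"


def extract_partitioned(zip_path: Path, out_dir: Path, digits: int = 2) -> int:
    """Extract every .json member into a folder chosen from its file name."""
    count = 0
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir() or not info.filename.endswith(".json"):
                continue  # skip folders and non-JSON members
            target = out_dir / partition_path(info.filename, digits)
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(zf.read(info))
            count += 1
    return count


if __name__ == "__main__":
    print(partition_path("123456.json"))
```

With two digits there are at most 100 folders, so 900k evenly distributed files would give roughly 9k files per folder, which is close to the chunk size mentioned in the question.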

Azure Logic App - FTP connection download zip file to blob storage - output zip file corrupted

I've set up the pipeline and it works (I followed this documentation: https://learn.microsoft.com/en-us/azure/connectors/connectors-create-api-ftp). It downloads the zip file and loads it into blob storage.
However, the resulting zip file is corrupted. It has a slightly different size than the original file.
I set the infer content type option to Yes. I also tried setting it to No, but that didn't change the result.
I tried both hardcoded and dynamic naming.

How to write CSV to Azure Storage Gen2 with Databricks(Python)

I would like to write a regular CSV file to Storage, but what I get is a folder "sample_file.csv" with 4 files under it. How do I create a normal CSV file from a data frame in Azure Storage Gen2?
I'm happy with any advice or a link to an article.
df.coalesce(1).write.option("header", "true").csv(TargetDirectory + "/sample_file.csv")

Quick Way to Bulk Copy Azure Blobs

I have about 40,000 blobs in Azure storage that have been given the wrong file extension. They were uploaded with the filename <name>.png and I need to correct the name to <name>.jpg. In the first instance I'd like to simply copy the originals into the same blob store but with a new file name.
azcopy would normally be my go-to for this kind of thing, but it doesn't seem to have the options I need.
How can I bulk copy and rename files in an azure blob store?
Azure Blob Storage doesn't support renaming directly. However, you can work around this by copying the blob to a new blob with the modified name (using the StartCopy method) and then removing the original blob (using the Delete method). The copy can be pretty fast if the source and destination are in the same storage account, since it's actually a shallow copy.
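To make the name mapping concrete, here is a small Python sketch that only plans the copies. The functions plan_extension_change and copy_all are hypothetical helpers written for this example, and copy_blob stands for whatever call actually starts the copy in your storage client (for instance a wrapper around StartCopy); no real SDK call is shown.

```python
from typing import Callable, Iterable, List, Tuple


def plan_extension_change(
    names: Iterable[str], old_ext: str = ".png", new_ext: str = ".jpg"
) -> List[Tuple[str, str]]:
    """Return (source, destination) pairs for every name ending in old_ext."""
    plan = []
    for name in names:
        # Compare case-insensitively so 'photo.PNG' is picked up as well.
        if name.lower().endswith(old_ext.lower()):
            plan.append((name, name[: len(name) - len(old_ext)] + new_ext))
    return plan


def copy_all(plan: List[Tuple[str, str]], copy_blob: Callable[[str, str], None]) -> int:
    """Call the supplied copy function once per planned pair."""
    for source, destination in plan:
        copy_blob(source, destination)
    return len(plan)


if __name__ == "__main__":
    # A dict stands in for the container so the sketch runs without Azure.
    store = {"a.png": b"1", "b.PNG": b"2", "c.jpg": b"3"}
    plan = plan_extension_change(store)
    copy_all(plan, lambda src, dst: store.__setitem__(dst, store[src]))
    print(sorted(store))
```

Keeping the plan separate from the copy makes it easy to review the 40,000 name pairs before anything is touched, and the originals stay in place until you decide to delete them.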

Blob files have to be renamed manually to include parent folder path

We are new to Windows Azure and use Windows Azure storage for blob objects while developing a Sitefinity application. However, when we publish to Azure from Visual Studio, the blob files are uploaded with only their file names; the folder-name prefix and slash are not kept. As a result, we have to rename every file manually in the Windows Azure management portal, adding the folder name and a slash to the beginning of each file name, so that the pages that use these images can display them. Otherwise the images are not shown because the path is incorrect.
In the Sitefinity admin panel, however, we upload these images/blob files into a folder when adding them to those pages, and we have configured Sitefinity to use Azure storage instead of the database.
Please check the file attached to see the screenshot.
Please help me to solve this.
A few things I would like to mention first:
Windows Azure blob storage does not support renaming. Renaming a blob means copying the blob and then deleting the original.
The copy blob operation is asynchronous, so you must wait for the copy to finish before deleting the source blob.
Blob storage does not support a folder hierarchy natively. As you may have already discovered, you create the illusion of a folder by prepending the name of the folder you want (say images) to the blob name (say logo.png), separated by a slash (/), so your blob name becomes images/logo.png.
Now coming to your problem. Needless to say that manually renaming the blobs would be a cumbersome exercise. I would recommend using a storage management tool to do that. One such example would be Azure Management Studio from Cerebrata. If you use that tool, essentially what you can do is create an empty folder in the container and then move the files into that folder. That to me would be the fastest way to achieve your objective.
If you wish to write some code to do that, here are the steps you will take:
First you will list all blobs in a blob container.
Next you will loop over this list.
For each blob (let's call it the source blob), you would take its name, prepend the folder name that you want, and create an instance of a CloudBlockBlob object with that new name.
Next, you would start a copy operation by calling StartCopyFromBlob on this new blob, with your source blob as the source.
You would need to wait for the copy operation to finish. Once the copy operation is finished, you can safely delete the source blob.
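The steps above can be sketched as follows. This is a Python illustration of the flow only, not the C# storage client code: InMemoryContainer is a made-up stand-in for a blob container (its methods are not Azure SDK calls), and move_into_folder is a hypothetical helper. The stand-in finishes a copy the first time its status is polled, to mimic the asynchronous behaviour.

```python
import time
from typing import Dict, List


class InMemoryContainer:
    """Hypothetical stand-in for a blob container; not an Azure SDK class."""

    def __init__(self, blobs: Dict[str, bytes]) -> None:
        self.blobs = dict(blobs)
        self._pending: Dict[str, str] = {}  # destination -> source

    def list_names(self) -> List[str]:
        return sorted(self.blobs)

    def start_copy(self, source: str, destination: str) -> None:
        # Only records the request, mimicking an asynchronous copy.
        self._pending[destination] = source

    def copy_status(self, destination: str) -> str:
        # The copy "completes" during the first poll, which reports "pending".
        source = self._pending.pop(destination, None)
        if source is not None:
            self.blobs[destination] = self.blobs[source]
            return "pending"
        return "success" if destination in self.blobs else "failed"

    def delete(self, name: str) -> None:
        del self.blobs[name]


def move_into_folder(container, folder: str, poll_seconds: float = 0.0) -> List[str]:
    """Prefix each blob name with 'folder/' via copy, wait, then delete."""
    prefix = folder.strip("/") + "/"
    moved = []
    for source in container.list_names():           # list the blobs, then loop
        if source.startswith(prefix):
            continue  # already under the folder (a choice made for this sketch)
        destination = prefix + source               # prepend the folder name
        container.start_copy(source, destination)   # start the copy
        while (status := container.copy_status(destination)) == "pending":
            time.sleep(poll_seconds)                # wait for the copy to finish
        if status != "success":
            raise RuntimeError(f"copy of {source} ended with status {status}")
        container.delete(source)                    # only now delete the source
        moved.append(destination)
    return moved


if __name__ == "__main__":
    c = InMemoryContainer({"logo.png": b"x", "images/a.png": b"y"})
    print(move_into_folder(c, "images"))
```

The important ordering is that the source is deleted only after the copy reports success, so a failed copy never loses data.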
P.S. I would have written some code but unfortunately I'm stuck with something else. I might write something later on (but please don't hold your breath for that :)).
