Azure Data Factory Unzipping many files into partitions based on filename


I have a large zip file that has 900k JSON files in it. I need to process these with a data flow. I'd like to organize the files into folders using the last two digits in the file name so I can process them in chunks of 10k. My question is: how do I set up a pipeline to use part of the file name of the files in the zip file (the source) as part of the path in the sink?
current setup: zipfile.zip -> /json/XXXXXX.json
desired setup: zipfile.zip -> /json/XXX/XXXXXX.json

Please check whether the references below help.
In source transformation, you can read from a container, folder, or
individual file in Azure Blob storage. Use the Source options tab to
manage how the files are read. Using a wildcard pattern will instruct
the service to loop through each matching folder and file in a single
source transformation. This is an effective way to process multiple
files within a single flow.
[ ] Matches one or more characters in the brackets.
/data/sales/**/*.csv Gets all .csv files under /data/sales
Please also go through "Copy and transform data in Azure Blob storage - Azure Data Factory & Azure Synapse" (Microsoft Docs) for the other wildcard patterns and all the filtering options in Azure Blob storage.
See also the video "How to UnZip Multiple Files which are stored on Azure Blob Storage By using Azure Data Factory" (Bing video).
In the sink transformation, you can write to either a container or a folder in Azure Blob storage.
File name option: Determines how the destination files are named in the destination folder.
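To make the desired layout concrete, here is a minimal sketch of the mapping from file name to partition folder. It runs outside Data Factory, using Python's standard zipfile module, and is not a Data Factory configuration. The function names, the `json` root folder and the `digits` parameter are illustrative assumptions.

```python
import io
import posixpath
import zipfile


def partitioned_name(member_name, digits=2, root="json"):
    """Map 'XXXXXX.json' to '<root>/<last digits of the stem>/XXXXXX.json'."""
    base = posixpath.basename(member_name)
    stem, _ext = posixpath.splitext(base)
    return posixpath.join(root, stem[-digits:], base)


def iter_partitioned(zip_bytes, digits=2):
    """Yield (destination path, content) for each file in the archive.

    Yielding one entry at a time avoids holding every file in memory.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # skip directory entries, keep only files
            yield partitioned_name(info.filename, digits), zf.read(info)
```

Each yielded pair could then be written to the sink under its destination path. With `digits=2`, 900k files are spread over at most 100 folders.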

Related

Writing data into a single file for Azure instead of part files

I am trying to write a file to my Azure Storage using Mosaic Decisions. I want to write the file to Azure as a single file and not part files. How can this be achieved?

You simply need to switch on the Single File Output toggle in your Writer Node configuration menu (double-click the writer node).
When this toggle is enabled, the output is written as a single file.

Why is an empty file with the name of the folder created inside an Azure Blob storage container?

I am running a Hive QL script through an HDInsight on-demand cluster, which does the following:
Spool the data from a Hive view
Create a folder named abcd inside a Blob storage container named XYZ
Store the view data in a file inside the abcd folder
However, when the Hive QL is run, an empty file named abcd is created outside the abcd folder.
Any idea why this is happening and how we can stop it? Please suggest.
Thanks,
Surya

You get this because the Azure storage you are mounting does not have a hierarchical file system. For example, the mount is a blob storage account of type StorageV2, but you did not enable the hierarchical namespace option at creation time. A StorageV2 account with the hierarchical namespace enabled is known as Azure Data Lake Storage Gen2 (ADLS Gen2), which basically gets rid of the blob/lake difference you had between ADLS Gen1 and older blob generations.
Depending on the blob API you are using, a number of tricks are used to give you the illusion of a hierarchical FS even when you don't have one, like creating empty files or hidden ones. The main point is that the hierarchy is flat (i.e. there is none), so you can't just create an empty folder; you have to put something there.
For example, if you mount a v2 blob with the wasbs:// driver in Databricks, and you do a mkdir -p /dbfs/mnt/mymount/this/is/a/path from a %sh cell you will see something like this:
this folder, this empty file
this/is folder, this/is empty file
etc.
Finally, while this is perfectly fine for Azure blob storage itself, it might cause trouble for anything else not expecting it, even %sh ls.
Just recreate the storage as ADLS Gen2, or update it live by enabling the hierarchical FS.
Thanks,

Quick Way to Bulk Copy Azure Blobs

I have about 40,000 blobs in Azure storage that have been given the wrong file extension. They were uploaded with the filename <name>.png and I need to correct the name to <name>.jpg. In the first instance I'd like to simply copy the originals into the same blob store but with a new file name.
azcopy would normally be my go-to for this kind of thing, but it doesn't seem to have the options I need.
How can I bulk copy and rename files in an azure blob store?

Azure Blob Storage doesn't support renaming directly. However, you can work around it by copying the blob to a new blob with the modified name (using the StartCopy method) and then removing the original blob (using the Delete method). The copy can be pretty fast if the source and destination are in the same storage account, since it's actually a shallow copy.
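As a rough illustration, the first half of that workaround is working out which blobs need a new name. The sketch below is plain Python with no storage calls. `plan_extension_renames` is a hypothetical helper name, and the blob names are assumed to have been listed already.

```python
def plan_extension_renames(blob_names, old_ext=".png", new_ext=".jpg"):
    """Return (source, destination) name pairs for blobs ending in old_ext.

    The extension check is case-insensitive; other blobs are left out.
    """
    pairs = []
    for name in blob_names:
        if name.lower().endswith(old_ext):
            pairs.append((name, name[: -len(old_ext)] + new_ext))
    return pairs
```

Each pair would then be handed to a server-side copy, followed by a delete of the source once the copy has completed.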

Blob files have to be renamed manually to include parent folder path

We are new to Windows Azure and have used Windows Azure storage for blob objects while developing a Sitefinity application. However, when we publish to Azure from Visual Studio, the blob files are uploaded with only their file names and do not keep the folder name prefix and slash. Hence we have to rename all the files manually in the Windows Azure management portal, putting the folder name and slash at the beginning of each file name, so that the page accessing these images can show them properly. Otherwise the images are not shown because the path is incorrect.
In the Sitefinity admin panel, however, when we upload these images/blob files for those pages, we upload them inside a folder, and we have configured Sitefinity to use Azure storage instead of the database.
Please check the file attached to see the screenshot.
Please help me to solve this.

A few things I would like to mention first:
Windows Azure does not support rename functionality. Rename blob functionality = copy blob followed by delete blob.
Copy blob operation is asynchronous so you must wait for copy operation to finish before deleting the blob.
Blob storage does not support folder hierarchy natively. As you may have already discovered, you create an illusion of a folder by prepending a blob name (say logo.png) with the name of folder you want (say images) and separate them with slash (/) so your blob name becomes images/logo.png.
Now coming to your problem. Needless to say that manually renaming the blobs would be a cumbersome exercise. I would recommend using a storage management tool to do that. One such example would be Azure Management Studio from Cerebrata. If you use that tool, essentially what you can do is create an empty folder in the container and then move the files into that folder. That to me would be the fastest way to achieve your objective.
If you wish to write some code to do that, here are the steps you will take:
First you will list all blobs in a blob container.
Next you will loop over this list.
For each blob (let's call it source blob), you would get its name and prepend the folder name that you want and create an instance of a CloudBlockBlob object.
Next you would initiate a copy blob operation on that blob using StartCopyFromBlob on this new blob where source is your source blob.
You would need to wait for the copy operation to finish. Once the copy operation is finished, you can safely delete the source blob.
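The steps above can be sketched as follows. This is a minimal sketch in Python rather than the .NET CloudBlockBlob API the answer refers to. It assumes `container` behaves like `azure.storage.blob.ContainerClient` from the `azure-storage-blob` package (`list_blobs`, `get_blob_client`, `start_copy_from_url`, `get_blob_properties`, `delete_blob`). The function name `move_into_folder` and the polling interval are illustrative.

```python
import time


def move_into_folder(container, folder, poll_seconds=1.0):
    """Copy every blob to '<folder>/<name>', then delete the original.

    `container` is assumed to behave like azure.storage.blob.ContainerClient.
    """
    moved = []
    prefix = folder.rstrip("/") + "/"
    # List all names up front so freshly created copies are not revisited.
    names = [blob.name for blob in container.list_blobs()]
    for name in names:
        if name.startswith(prefix):
            continue  # already under the target folder
        source = container.get_blob_client(name)
        target = container.get_blob_client(prefix + name)
        target.start_copy_from_url(source.url)
        # The copy is asynchronous: poll until it is no longer pending.
        status = target.get_blob_properties().copy.status
        while status == "pending":
            time.sleep(poll_seconds)
            status = target.get_blob_properties().copy.status
        if status != "success":
            raise RuntimeError(f"copy of {name} ended with status {status}")
        # Only delete the source once the copy has succeeded.
        source.delete_blob()
        moved.append((name, prefix + name))
    return moved
```

Whether `source.url` alone is enough to authorize the copy depends on how the account is accessed; a private source may need a SAS token appended to the URL.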
P.S. I would have written some code but unfortunately I'm stuck with something else. I might write something later on (but please don't hold your breath for that :)).

Unzip the .GZ file in worker Process of Azure

Can anyone give me an idea of how to implement unzipping a .gz file in a Worker? If I write code to unzip the file, where do I store the unzipped file (i.e. one text file)? Will it be loaded in some location in Azure? How can I specify a path in a Windows Azure Worker process, such as the current executing directory? If this approach does not work, then I need to create one more blob to store the unzipped .gz file, i.e. the .txt file.
-mahens

In your Worker Role, it is up to you how the .gz file arrives (for example, downloaded from Azure Blob storage). Once the file is available, you can use GZipStream to compress or decompress it. The GZipStream documentation also includes a code sample with Compress and Decompress functions.
This SO discussion shares a few tools and code to explain how you can unzip .GZ using C#:
Unzipping a .gz file using C#
Next, when you use the Decompress/Compress code in a Worker Role, you can either store the result in local storage (as suggested by JcFx) or use a MemoryStream to store it directly in Azure Blob Storage.
The following SO article shows how you can use GZipStream to store unzipped content into MemoryStream and then use UploadFromStream() API to store directly to Azure Blob storage:
How do I use GZipStream with System.IO.MemoryStream?
If you don't need to do anything with the unzipped file, storing it directly in Azure Blob storage is best. If you do have to work with the unzipped content, you can save it locally and also store it back in Azure Blob storage for further use.
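The in-memory approach can be illustrated as below. The answer describes C# (GZipStream with MemoryStream); this sketch uses Python's standard gzip module as an analogue, not the linked code. `gunzip_to_stream` is a hypothetical helper name.

```python
import gzip
import io
import shutil


def gunzip_to_stream(gz_bytes):
    """Decompress .gz content into an in-memory stream, rewound to the start."""
    out = io.BytesIO()
    with gzip.GzipFile(fileobj=io.BytesIO(gz_bytes)) as gz:
        # Copy in chunks so the decompressed data is not read in one call.
        shutil.copyfileobj(gz, out)
    out.seek(0)  # rewind so a later upload reads from the beginning
    return out
```

The returned stream could be passed to an upload call or written to a local file, depending on whether the content needs further processing on the worker.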

This example, using SharpZipLib, extracts a .gzip file to a stream. From there, you could write it to Azure local storage, or to blob storage:
http://wiki.sharpdevelop.net/GZip-and-Tar-Samples.ashx
