How to mount Azure Blob Storage (hierarchical namespace disabled) from Databricks

I need to mount Azure Blob storage (where the hierarchical namespace is disabled) from Databricks. The mount command returns true, but when I run the fs.ls command it returns an UnknownHostException error. Please suggest.

I got a similar kind of error. I unmounted my Blob storage account and then remounted it. Now it's working fine.
Unmount the storage account:
dbutils.fs.unmount("<mount_point>")
Mount the Blob storage (note the @ between the container and storage account names, and that the account name in the account-key config must match your storage account):
dbutils.fs.mount(
  source = "wasbs://<container>@<Storage_account_name>.blob.core.windows.net/",
  mount_point = "<mount_point>",
  extra_configs = {"fs.azure.account.key.<Storage_account_name>.blob.core.windows.net": "<Access_key>"})
display(dbutils.fs.ls('/mnt/fgs'))
This command, display(dbutils.fs.ls('/mnt/fgs')), lists all the files available at the mount point (/mnt/fgs is the mount point used in this example). You can perform all the required operations and then write to this DBFS path; the changes will also be reflected in your Blob storage container.
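For example, once the mount is working you can read and write through it like any other DBFS path. This is a minimal sketch for a Databricks notebook; the file and folder names under /mnt/fgs are hypothetical:
# Minimal sketch (PySpark in a Databricks notebook). The CSV file and output
# folder under /mnt/fgs are hypothetical examples.
df = spark.read.option("header", "true").csv("/mnt/fgs/input/sample.csv")

# Anything written to the mount point lands directly in the Blob container.
df.write.mode("overwrite").parquet("/mnt/fgs/output/sample_parquet")

# Confirm the output files are visible under the mount.
display(dbutils.fs.ls("/mnt/fgs/output/sample_parquet"))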
For more information, refer to this MS document.

Related

AuthorizationPermissionMismatch error during AzCopy

I'm getting an error using AzCopy to copy an S3 bucket into an Azure container, following the guide at https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-s3
I used azcopy login to authenticate, and added the roles below to my Azure account:
Storage Blob Data Contributor
Storage Blob Data Owner
Storage Queue Data Contributor
I then try to copy my bucket with
./azcopy copy 'https://s3.us-east-1.amazonaws.com/my-bucket' 'https://my-account.blob.core.windows.net/my-container' --recursive=true
I then receive this error:
AuthorizationPermissionMismatch
RESPONSE Status: 403 This request is not authorized to perform this operation using this permission.
What other permissions could I be missing or what else could it be?
It turns out I just had to wait a few hours for the permissions to fully propagate.
Please check whether the following is missing:
To authorize with AWS S3, you need to gather your AWS access key and secret, and then set them as environment variables for the S3 source:
Windows:
set AWS_ACCESS_KEY_ID=<access-key>
set AWS_SECRET_ACCESS_KEY=<secret-access-key>
(or)
Linux:
export AWS_ACCESS_KEY_ID=<access-key>
export AWS_SECRET_ACCESS_KEY=<secret-access-key>
Also make sure you have been granted the required permissions/actions for Amazon S3 object operations to copy data from Amazon S3, for example s3:GetObject and s3:GetObjectVersion.
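Before re-running AzCopy, it can also help to confirm that the exported credentials can actually read the source bucket. Below is a minimal sketch using Python and boto3 (boto3 is an assumption here, not something AzCopy requires; the bucket name is the one from the question and the object key is a placeholder):
# Minimal sketch: verify that AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY can read
# the source bucket before handing it to AzCopy.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")  # credentials are picked up from the environment

# Requires s3:ListBucket on the bucket
resp = s3.list_objects_v2(Bucket="my-bucket", MaxKeys=5)
print([obj["Key"] for obj in resp.get("Contents", [])])

# Requires s3:GetObject on the object; replace the key with a real one
s3.head_object(Bucket="my-bucket", Key="<some-object-key>")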
References:
azcopy
Authorize with AWS S3

Blob access control

I am designing storage using Azure Blob storage. Within a container, how do I control access between different blobs?
For example, under the container "images" there are two blobs: design1/logo.png and design2/logo.png. How can I make access to design1/ and design2/ mutually exclusive?
Have you tried configuring the access permissions with RBAC?
https://learn.microsoft.com/en-us/azure/storage/blobs/assign-azure-role-data-access?tabs=portal

Why is an empty file with the name of a folder created inside an Azure Blob storage container?

I am running a Hive QL script through an HDInsight on-demand cluster which does the following:
Spool the data from a Hive view
Create a folder named abcd inside a Blob storage container named XYZ
Store the view data in a file inside the abcd folder
However, when the Hive QL is run, an empty file named abcd is created outside the abcd folder.
Any idea why this is happening and how we can stop it from happening? Please suggest.
Thanks,
Surya
You get this because the Azure storage you are mounting does not have a hierarchical file system. For example, the mount is a blob storage account of type StorageV2, but the hierarchical namespace option was not enabled at creation time. A v2 storage account with the hierarchical namespace enabled is known as Azure Data Lake Storage Gen2 (ADLS Gen2), which essentially removes the blob-versus-lake split you had with ADLS Gen1 versus the older blob storage generations.
Depending on the blob API you are using, a number of tricks are applied to give you the illusion of a hierarchical FS even when you don't have one, such as creating empty or hidden placeholder files. The main point is that the namespace is flat (i.e. there is no real hierarchy), so you can't just create an empty folder; you have to put something there.
For example, if you mount a v2 blob container with the wasbs:// driver in Databricks and run mkdir -p /dbfs/mnt/mymount/this/is/a/path from a %sh cell, you will see something like this:
a this folder plus an empty this file,
a this/is folder plus an empty this/is file,
and so on.
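You can see these placeholder entries from the storage side as well. Here is a minimal sketch using the azure-storage-blob Python SDK; the connection string and container name are placeholders:
# Minimal sketch: list what the wasbs driver actually wrote into the flat namespace.
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string("<connection_string>", "<container>")
for blob in container.list_blobs(name_starts_with="this"):
    # Expect zero-byte placeholder blobs such as "this", "this/is", "this/is/a", ...
    print(blob.name, blob.size)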
Finally, while this is perfectly fine for Azure Blob storage itself, it might cause trouble for anything else that doesn't expect it, even %sh ls.
Just recreate the storage account as ADLS Gen2, or upgrade it in place by enabling the hierarchical namespace.
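If you do move to ADLS Gen2, the mount typically uses the abfss:// driver with a service principal instead of wasbs:// with an account key. A minimal sketch, assuming you already have a service principal and a Databricks secret scope (all <...> values are placeholders):
# Minimal sketch of mounting an ADLS Gen2 container in Databricks via OAuth.
configs = {
  "fs.azure.account.auth.type": "OAuth",
  "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id": "<application-id>",
  "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope>", key="<secret-name>"),
  "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"}

dbutils.fs.mount(
  source = "abfss://<container>@<storage_account_name>.dfs.core.windows.net/",
  mount_point = "<mount_point>",
  extra_configs = configs)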

Cannot create folder in Azure blob storage $root container

Azure Storage allows for a default container called $root, as explained in the documentation.
Using the Azure Portal, when I try to upload a scripts folder to my $root container I get the error:
upload error for validate-form.js
Upload block blob to blob store failed:
Make sure blob store SAS uri is valid and permission has not expired.
Make sure CORS policy on blob store is set correctly.
StatusCode = 0, StatusText = error
How do I fix this?
I can upload to containers that are not called $root
[Update]
I guess SAS means Shared Access Signature.
I set up the container with Blob (anonymous read access for blobs only).
I will try Container (anonymous read access for containers and blobs).
[Update]
Changing the access policy made no difference.
The access policy is not displayed for $root
I am aware that one must put a file in a new folder in order for the folder to create. This is not that issue.
[Update]
Here is what my website blob looks like. I can do this for my website container but not my $root container.
As Create a container states (a small validation sketch follows the quoted rules):
Container names must start with a letter or number, and can contain only letters, numbers, and the dash (-) character.
Every dash (-) character must be immediately preceded and followed by a letter or number; consecutive dashes are not permitted in container names.
All letters in a container name must be lowercase.
Container names must be from 3 through 63 characters long.
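For illustration only, the quoted rules can be expressed as a quick check; is_valid_container_name below is a hypothetical helper, not part of any Azure SDK:
# Hypothetical helper that encodes the container-naming rules quoted above.
import re

def is_valid_container_name(name):
    # 3-63 characters; lowercase letters, digits and dashes; starts and ends with a
    # letter or digit; every dash surrounded by letters/digits (no consecutive dashes).
    return bool(re.fullmatch(r"[a-z0-9]+(?:-[a-z0-9]+)*", name)) and 3 <= len(name) <= 63

print(is_valid_container_name("my-container"))  # True
print(is_valid_container_name("$root"))         # False: $root is a reserved name, not a regular container name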
AFAIK, when managing your blob storage via the Azure Storage Client Library, you cannot create a container with the name $root. You could also leverage Azure Storage Explorer to manage your storage resources, and I assume the same container-naming limitation applies there.
But I tested it on the Azure portal and encountered the same issue you mentioned. I could create a container with the name $root, and I could upload files to the root virtual directory via the Azure portal and Azure Storage Explorer. I assume that container names starting with $ are reserved by Azure and we are not supposed to create them ourselves; if you follow the limitations for container names, you can upload your files as usual. The behavior of the Azure portal when creating a container whose name starts with $ is unusual; you could send your feedback here.
UPDATE:
As you mentioned, Working with the Root Container states:
A blob in the root container cannot include a forward slash (/) in its name.
So you cannot create virtual folder(s) under the root container the way you can in a normal blob container.
For example, for https://myaccount.blob.core.windows.net/$root/virtual-directory/myblob the blob name would be virtual-directory/myblob, which is invalid, so you cannot create it.
UPDATE 2:
Have you been able to create a folder within the $root container? If so, can you tell me how, please?
We cannot create a folder within the $root container, because the root container has this restriction on blob names. You cannot treat Azure Blob storage as a file system; the folder information is just part of the blob name, as illustrated below:
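A rough sketch with the azure-storage-blob Python SDK (the connection string is a placeholder and the container/blob names are only examples): in a normal container the "folder" is just a prefix of the blob name, while in $root a slash in the name is not allowed at all.
# Minimal sketch; <connection_string> is a placeholder.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection_string>")

# In a normal container, "scripts/" is only part of the blob name -- this works and
# the portal renders "scripts" as a virtual folder.
service.get_blob_client("website", "scripts/validate-form.js").upload_blob(b"...", overwrite=True)

# In $root, a forward slash in the blob name is not allowed, so virtual folders are impossible.
service.get_blob_client("$root", "validate-form.js").upload_blob(b"...", overwrite=True)      # OK
# service.get_blob_client("$root", "scripts/validate-form.js").upload_blob(b"...")            # rejected by the service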

What is a good way to access external data from AWS

I would like to access external data from my AWS EC2 instance.
In more detail: I would like to specify, in my user-data, the name of a folder containing about 2 MB of binary data. When my AWS instance starts up, I would like it to download the files in that folder and copy them to a specific location on the local disk. I only need to access the data once, at startup.
I don't want to store the data in S3 because, as I understand it, this would require storing my AWS credentials on the instance itself, or passing them as user-data, which is also a security risk. Please correct me if I am wrong here.
I am looking for a solution that is both secure and highly reliable.
Which operating system do you run?
You can use Elastic Block Store (EBS). It's like a device you can mount at boot (without credentials), and it gives you persistent storage there.
You can also sync up instances using something like the Gluster file system. See this thread on it.
