Databricks and Azure Blob Storage

I am running this in a Databricks notebook:
dbutils.fs.ls("/mount/valuable_folder")
and I am getting this error:
Caused by: StorageException: Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
I tried using dbutils.fs.refreshMounts() to pick up any updates from Azure Blob Storage, but I am still getting the above error.

Such errors most often arise when the credentials you used for mounting have expired: the SAS token has expired, the storage key has been rotated, or the service principal secret has expired. You need to unmount the storage with dbutils.fs.unmount and mount it again with dbutils.fs.mount. dbutils.fs.refreshMounts() only refreshes the list of mounts in the backend; it does not re-check the credentials.
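A minimal sketch of the remount, assuming a wasbs mount authenticated with the storage account key kept in a secret scope (the container, account, scope, and key names are placeholders; adjust extra_configs to match however you originally mounted):
# Unmount the stale mount point, then remount with the current credentials.
dbutils.fs.unmount("/mount/valuable_folder")
dbutils.fs.mount(
    source="wasbs://<container-name>@<storage-account-name>.blob.core.windows.net",
    mount_point="/mount/valuable_folder",
    extra_configs={
        # Placeholder secret scope/key holding the current storage account key.
        "fs.azure.account.key.<storage-account-name>.blob.core.windows.net":
            dbutils.secrets.get(scope="<scope-name>", key="<storage-account-key>")
    }
)
dbutils.fs.refreshMounts()  # let running clusters pick up the refreshed mount
display(dbutils.fs.ls("/mount/valuable_folder"))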

Related

Databricks Azure Blob Storage access

I am trying to access files stored in Azure blob storage and have followed the documentation linked below:
https://docs.databricks.com/external-data/azure-storage.html
I was able to mount the Azure Blob Storage on DBFS, but that method does not seem to be recommended anymore. So I tried to set up direct access via URI using SAS authentication.
spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set("fs.azure.sas.fixed.token.<storage-account>.dfs.core.windows.net", "<token>")
Now when I try to access any file using:
spark.read.load("abfss://<container-name>#<storage-account-name>.dfs.core.windows.net/<path-to-data>")
I get the following error:
Operation failed: "Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.", 403, HEAD,
I am able to mount the storage account using the same SAS token but this is not working.
What needs to be changed for this to work?
If you are using blob storage, then you have to use wasbs and not abfss. I tried using the same code as yours with my SAS token and got the same error with my blob storage.
spark.conf.set("fs.azure.account.auth.type.<storage_account>.dfs.core.windows.net", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type.<storage_account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set("fs.azure.sas.fixed.token.<storage_account>.dfs.core.windows.net", "<token>")
df = spark.read.load("abfss://<container>@<storage_account>.dfs.core.windows.net/input/sample1.csv")
When I used the following modified code, I was able to successfully read the data.
spark.conf.set("fs.azure.account.auth.type.<storage_account>.blob.core.windows.net", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type.<storage_account>.blob.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set("fs.azure.sas.fixed.token.<storage_account>.blob.core.windows.net", "<token>")
df = spark.read.format("csv").load("wasbs://<container>@<storage_account>.blob.core.windows.net/input/sample1.csv")
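As a side note, the legacy wasbs driver also supports a simpler per-container SAS setting documented for hadoop-azure; a minimal sketch, with the container, storage account, and token as placeholders:
spark.conf.set("fs.azure.sas.<container>.<storage_account>.blob.core.windows.net", "<token>")
df = spark.read.format("csv").load("wasbs://<container>@<storage_account>.blob.core.windows.net/input/sample1.csv")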
UPDATE:
To access files from an Azure Blob Storage account whose firewall only allows selected networks, you need to deploy the Databricks workspace in your own virtual network (VNet injection).
Then add that same virtual network to the storage account's firewall as well.
I have also selected service endpoints and subnet delegation for the Databricks subnets.
Now when I run the same code again using the file path wasbs://<container>@<storage_account>.blob.core.windows.net/<path>, the file is read successfully.

Go storage client not able to access GCP bucket

I have a Golang service which exposes an API where we try to upload a CSV to a GCP bucket. On my local host, I set the environment variable GOOGLE_APPLICATION_CREDENTIALS
and point this variable to the file path of the service account JSON. But when deploying to an actual GCP instance, I'm getting the below error while trying to access this API. Ideally, the service should talk to the GCP metadata server, fetch the credentials, and then store them in a JSON file. So there are two problems here:
Service is not querying the metadata service to get the credentials.
If the file is present (I created it manually), the service is not able to access it due to permission issues.
Any help would be appreciated.
Error while initializing storage Client:dialing: google: error getting credentials using well-known file (/root/.config/gcloud/application_default_credentials.json): open /root/.config/gcloud/application_default_credentials.json: permission denied
Finally, after long debugging and searching the web, I found out that there is already an open issue for the Go storage client: https://github.com/golang/oauth2/issues/337. I had to make a few changes in the code using this method: https://pkg.go.dev/golang.org/x/oauth2/google#ComputeTokenSource, which basically fetches the token explicitly from the metadata server before calling the subsequent cloud APIs.
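The equivalent workaround can be sketched in Python with google-auth (not the author's Go change, just the same idea): build Compute Engine credentials directly from the metadata server so the client never looks for application_default_credentials.json. The project, bucket, and object names below are placeholders:
from google.auth import compute_engine
from google.cloud import storage

# Credentials come from the GCE metadata server, not from a local JSON file.
credentials = compute_engine.Credentials()
client = storage.Client(project="<project-id>", credentials=credentials)
client.bucket("<bucket-name>").blob("uploads/sample.csv").upload_from_filename("sample.csv")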

access issue while connecting to azure data lake gen 2 from databricks

I am getting the below access issue while trying to connect from Databricks to a Gen2 data lake using a service principal and OAuth 2.0.
Steps performed (reference article):
1. Created a new service principal.
2. Provided the necessary access to this service principal from the Azure storage account IAM with Contributor role access.
3. Enabled firewalls and private endpoint connections on both Databricks and the storage account.
StatusCode=403
StatusDescription=This request is not authorized to perform this operation using this permission.
ErrorCode=AuthorizationPermissionMismatch
ErrorMessage=This request is not authorized to perform this operation using this permission.
However, when I try connecting via access keys it works well without any issue. Now I have started suspecting that #3 from my steps is the reason for this access issue. If so, do I need to give any additional access to make it succeed? Any thoughts?
When performing the steps in the "Assign the application to a role" section, make sure to assign the Storage Blob Data Contributor role to the service principal.
Repro: I provided Owner permission to the service principal and tried to run dbutils.fs.ls("/mnt/azure/"); it returned the same error message as above.
Solution: Assign the Storage Blob Data Contributor role to the service principal.
Finally, I was able to get the output without any error message after assigning the Storage Blob Data Contributor role to the service principal.
For more details, refer to "Tutorial: Azure Data Lake Storage Gen2, Azure Databricks & Spark".
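For completeness, the OAuth configuration typically used with a service principal looks like the sketch below; the storage account, application (client) ID, client secret (read from a secret scope), tenant ID, and container are placeholders:
spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net", dbutils.secrets.get(scope="<scope-name>", key="<client-secret>"))
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net", "https://login.microsoftonline.com/<tenant-id>/oauth2/token")
dbutils.fs.ls("abfss://<container>@<storage-account>.dfs.core.windows.net/")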

Data Factory Blob Storage Linked Service with Managed Identity : The remote server returned an error: (403)

I have created a linked service connection to a storage account using a managed identity, and it validates successfully, but when I try to use the linked service on a dataset I get an error:
A storage operation failed with the following error 'The remote server returned an error: (403)
The error is displayed when I attempt to browse the blob to set the file path.
The managed identity for the data factory has been assigned the Contributor role.
The blob container is set to private access.
Does anyone know how I can make this work?
It turns out I was using the wrong role. I just needed to add a role assignment for Storage Blob Data Contributor.

Amazon S3: The application is giving an access denied exception on Linux. Why?

I have a spring-boot application to upload and delete a file in an Amazon S3 bucket.
The project works fine on Windows, but when I try to upload anything using a curl command on Linux through PuTTY, it gives me an access denied exception.
The exception given is:
com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied
You probably didn't set up your AWS credentials on your Linux machine.
The instructions are here;
just make sure you have your aws_access_key_id and aws_secret_access_key configured.
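On Linux, the AWS SDK typically picks these up from ~/.aws/credentials (or from environment variables); a minimal sketch of that file, with placeholder values:
[default]
aws_access_key_id = <your-access-key-id>
aws_secret_access_key = <your-secret-access-key>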
Can you check your IAM credentials and S3 policy settings?
Credentials
Regardless of platform, it is necessary to use credentials (access key ID and secret access key). Please check that the credential files contain the same access key ID.
S3 policy
An S3 policy can allow or deny access based on credentials or IP addresses. Have you configured such a policy?
