Azure Databricks DataFrame write gives "Job aborted" error - azure-databricks

I am trying to write data to a CSV file and store the file on Azure Data Lake Storage Gen2, and I run into a job aborted error message. This same code used to work fine previously.
Error Message:
org.apache.spark.SparkException: Job aborted.
Code:
import requests
response = requests.get('https://myapiurl.com/v1/data', auth=('user', 'password'))
data = response.json()
from pyspark.sql import Row
df = spark.createDataFrame([Row(**i) for i in data])
df.write.format(source).mode("overwrite").save(path)  # error line

I summarize the solution below.
If you want to access Azure Data Lake Storage Gen2 in Azure Databricks, you have two options.
1. Mount Azure Data Lake Storage Gen2 as a file system in Azure Databricks. After doing that, you can read and write files with paths under /mnt/<mount-point>, and the mount code only needs to be run once. (A short usage example follows the mount code below.)
a. Create a service principal and assign the Storage Blob Data Contributor role to it, scoped to the Data Lake Storage Gen2 storage account:
az login
az ad sp create-for-rbac -n "MyApp" --role "Storage Blob Data Contributor" \
--scopes /subscriptions/<subscription>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>
b. Code:
configs = {"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": "<appId>",
"fs.azure.account.oauth2.client.secret": "<clientSecret>",
"fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant>/oauth2/token",
"fs.azure.createRemoteFileSystemDuringInitialization": "true"}
dbutils.fs.mount(
  source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/folder1",
  mount_point = "/mnt/flightdata",
  extra_configs = configs)
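Once the mount exists, the DataFrame from the original question can be written through the mount. A minimal sketch (the result_csv folder name is my own choice for illustration):
# Write the DataFrame built from the API response as CSV under the mount;
# the files land in folder1 of the ADLS Gen2 container behind /mnt/flightdata.
df.write.format("csv").option("header", True).mode("overwrite").save("/mnt/flightdata/result_csv")
# List the output through the mount to confirm the write succeeded.
display(dbutils.fs.ls("/mnt/flightdata/result_csv"))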
2. Access the storage account directly using the storage account access key.
We can add spark.conf.set("fs.azure.account.key.<storage-account-name>.dfs.core.windows.net", "<storage-account-access-key>") to our script. Then we can read and write files with paths of the form abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/.
For example:
from pyspark.sql.types import StringType
spark.conf.set(
"fs.azure.account.key.testadls05.dfs.core.windows.net", "<account access key>")
df = spark.createDataFrame(["10", "11", "13"], StringType()).toDF("age")
df.show()
df.coalesce(1).write.format('csv').option('header', True).mode('overwrite').save('abfss://test@testadls05.dfs.core.windows.net/result_csv')
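As a quick check (my own addition, reusing the testadls05 account and test container from the example above), the written folder can be read back through the same abfss:// path:
# Read the CSV folder written above back through the abfss:// URI to confirm
# the account-key configuration works for both reads and writes.
df_check = (spark.read.format('csv')
            .option('header', True)
            .load('abfss://test@testadls05.dfs.core.windows.net/result_csv'))
df_check.show()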
For more details, please refer to the official documentation.

Related

How to mount Azure Blob Storage (hierarchical namespace disabled) from Databricks

I need to mount Azure Blob Storage (where the hierarchical namespace is disabled) from Databricks. The mount command returns true, but when I run the fs.ls command it returns an UnknownHostException error. Please suggest a fix.
I got a similar kind of error. I unmounted my Blob Storage account and then remounted it. Now it's working fine.
Unmounting Storage account:
dbutils.fs.unmount("<mount_point>")
Mount Blob Storage:
dbutils.fs.mount(
  source = "wasbs://<container>@<Storage_account_name>.blob.core.windows.net/",
  mount_point = "<mount_point>",
  extra_configs = {"fs.azure.account.key.<Storage_account_name>.blob.core.windows.net": "<Access_key>"})
display(dbutils.fs.ls('/mnt/fgs'))
The display(dbutils.fs.ls('/mnt/fgs')) command lists all the files available at the mount point. You can perform any required operations and then write to this DBFS path, and the writes will also be reflected in your Blob Storage container, as shown in the sketch below.
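A minimal sketch of such a write (the sample DataFrame and the sample_output folder name are my own illustration, assuming the /mnt/fgs mount from above):
# Build a tiny DataFrame and write it as CSV under the mount point; the same
# files then appear in the mounted Blob Storage container.
sample_df = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])
sample_df.write.format("csv").option("header", True).mode("overwrite").save("/mnt/fgs/sample_output")
display(dbutils.fs.ls("/mnt/fgs/sample_output"))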
For more information, refer to this MS document.

Invalid configuration value detected for fs.azure.account.key copy activity fails

The Data Factory Copy activity fails when copying a Delta table from Databricks to a Storage account (Gen2).
Details:
ErrorCode=AzureDatabricksCommandError, Hit an error when running the command in Azure Databricks. Error details: Failure to initialize configuration: Invalid configuration value detected for fs.azure.account.key
Caused by: Invalid configuration value detected for fs.azure.account.key.
Appreciate your help.
The above error mainly happens because staging is not enabled. We need to enable staging to copy data from Delta Lake.
Go to the Azure Databricks cluster -> Advanced options and edit the Spark config in the format below:
spark.hadoop.fs.azure.account.key.<storage_account_name>.blob.core.windows.net <Access Key>
After that, you can follow this official document; it has a detailed explanation of the Copy activity with Delta Lake.
You can also refer to this article by RishShah-4592.
I edited the cluster with
fs.azure.account.key..dfs.core.windows.net{{secrets//}}
and it's working fine now... I'm able to copy data from the Delta Lake table to ADLS Gen2.
I think you can pass the secret as below:
spark.hadoop.fs.azure.account.key.<storage_account_name>.blob.core.windows.net {{secrets/<secret-scope-name>/<secret-name>}}
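As a side note (my own hedged sketch, not part of the original answer), the same key can also be set from a notebook session using dbutils.secrets.get with the secret scope referenced above, instead of placing it in the cluster Spark config:
# Hedged sketch: set the account key for the session, reading it from the secret
# scope referenced above rather than hard-coding it. Replace the placeholders.
spark.conf.set(
    "fs.azure.account.key.<storage_account_name>.blob.core.windows.net",
    dbutils.secrets.get(scope="<secret-scope-name>", key="<secret-name>"))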

AuthorizationPermissionMismatch error during AzCopy

I'm getting an error using AzCopy to copy an S3 bucket into an Azure container, following the guide at https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-s3.
I used azcopy login to authenticate and added the roles below to my Azure account:
Storage Blob Data Contributor
Storage Blob Data Owner
Storage Queue Data Contributor
Then I try to copy my bucket with:
./azcopy copy 'https://s3.us-east-1.amazonaws.com/my-bucket' 'https://my-account.blob.core.windows.net/my-container' --recursive=true
I then receive this error:
AuthorizationPermissionMismatch
RESPONSE Status: 403 This request is not authorized to perform this operation using this permission.
What other permissions could I be missing or what else could it be?
Turns out I just had to wait a few hours for the permissions to fully propagate.
Please check whether the following is missing:
To authorize with AWS S3, you may need to gather your AWS access key and secret and then set them as environment variables for that S3 source.
Windows:
set AWS_ACCESS_KEY_ID=<access-key>
set AWS_SECRET_ACCESS_KEY=<secret-access-key>
(or)
Linux:
export AWS_ACCESS_KEY_ID=<access-key>
export AWS_SECRET_ACCESS_KEY=<secret-access-key>
Please make sure you've been granted the required permissions/actions for Amazon S3 object operations to copy data from Amazon S3, for example s3:GetObject and s3:GetObjectVersion.
References:
azcopy
Authorize with AWS S3

DBFS FileStore Equivalent in Azure Synapse?

Is there an equivalent to Databricks' DBFS FileStore system in Azure Synapse? Is it possible to upload CSV files and read them into pandas DataFrames within Azure Synapse notebooks? Ideally I'd like not to load the CSV into a database; I'm looking for something as simple as DBFS's FileStore folder.
In Databricks: pd.read_csv('/dbfs/FileStore/name_of_file.csv')
In Synapse: ?
I don't see anywhere to upload CSV files directly like in DBFS.
The Azure Synapse equivalent of using FileStore in Databricks is the data lake file system linked to your Synapse workspace. In Synapse Studio, navigate to Data -> Linked, where you can find the linked storage account. This storage account was created/assigned when you created your workspace.
This primary data lake functions much like FileStore in Azure Databricks. You can use the Synapse Studio UI to upload the required files. You can then right-click on any of the files and load it into a DataFrame by choosing New notebook -> Load to DataFrame.
The UI automatically generates code that loads the CSV file into a Spark DataFrame. You can modify this code to load the file as a pandas DataFrame instead.
# This code is provided by Synapse when you select the file and choose "Load to DataFrame"
df = spark.read.load('abfss://data@datalk1506.dfs.core.windows.net/sample_1.csv', format='csv'
## If header exists uncomment line below
##, header=True
)
display(df.limit(10))

# Use the following code to load the file as a pandas DataFrame instead
import pandas as pd
df = pd.read_csv('abfss://data@datalk1506.dfs.core.windows.net/sample_1.csv')
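If pandas cannot resolve the abfss:// path directly in your Synapse runtime, an alternative (my own hedged sketch, reusing the same sample path) is to load the file with Spark and convert it with toPandas():
# Load with Spark, then convert to pandas; suitable for files small enough
# to fit in the driver's memory.
spark_df = spark.read.load('abfss://data@datalk1506.dfs.core.windows.net/sample_1.csv',
                           format='csv', header=True)
pandas_df = spark_df.toPandas()
print(pandas_df.head())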
This data lake storage is linked to the workspace through a linked service (which can be viewed under Manage -> Linked services). It is created by default from the data lake and file system information the user must provide while creating the Synapse workspace.

Access Azure Storage Emulator through hadoop FileSystem api

I have a Scala codebase where I am accessing Azure Blob files using the Hadoop FileSystem APIs (not the Azure Blob web client). My usage is of the format:
val hadoopConfig = new Configuration()
hadoopConfig.set(s"fs.azure.sas.${blobContainerName}.${accountName}.blob.windows.core.net",
  sasKey)
hadoopConfig.set("fs.defaultFS",
  s"wasbs://${blobContainerName}@${accountName}.blob.windows.core.net")
hadoopConfig.set("fs.wasb.impl",
  "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
hadoopConfig.set("fs.wasbs.impl",
  "org.apache.hadoop.fs.azure.NativeAzureFileSystem$Secure")
val fs = FileSystem.get(
  new java.net.URI(s"wasbs://${blobContainerName}@${accountName}.blob.windows.core.net"),
  hadoopConfig)
I am now writing unit tests for this code using the Azure Storage Emulator as the storage account. I went through this page, but it only explains how to access the emulator through the web APIs of AzureBlobClient. I need to figure out how to test the above code by accessing the Azure Storage Emulator through the Hadoop FileSystem APIs. I have tried the following, but it does not work:
val hadoopConfig = new Configuration()
hadoopConfig.set(s"fs.azure.sas.${containerName}.devstoreaccount1.blob.windows.core.net",
  "Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==")
hadoopConfig.set("fs.defaultFS",
  s"wasbs://${containerName}@devstoreaccount1.blob.windows.core.net")
hadoopConfig.set("fs.wasb.impl",
  "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
hadoopConfig.set("fs.wasbs.impl",
  "org.apache.hadoop.fs.azure.NativeAzureFileSystem$Secure")
val fs = FileSystem.get(
  new java.net.URI(s"wasbs://${containerName}@devstoreaccount1.blob.windows.core.net"),
  hadoopConfig)
I was able to solve this problem and connect to the storage emulator by adding the following two configurations:
hadoopConfig.set("fs.azure.test.emulator",
"true")
hadoopConfig.set("fs.azure.storage.emulator.account.name",
"devstoreaccount1.blob.windows.core.net")
