Writing parquet file throws "An HTTP header that's mandatory for this request is not specified"

I have two ADLSv2 storage accounts; both are hierarchical namespace enabled.
In my Python notebook, I'm reading a CSV file from one storage account and, after some enrichment, writing it as a parquet file to the other.
I am getting the below error when writing the parquet file:
StatusCode=400, An HTTP header that's mandatory for this request is not specified.
Any help is greatly appreciated.
Below is my notebook code snippet:
# Databricks notebook source
# MAGIC %python
# MAGIC
# MAGIC STAGING_MOUNTPOINT = "/mnt/inputfiles"
# MAGIC if STAGING_MOUNTPOINT in [mnt.mountPoint for mnt in dbutils.fs.mounts()]:
# MAGIC     dbutils.fs.unmount(STAGING_MOUNTPOINT)
# MAGIC
# MAGIC PERM_MOUNTPOINT = "/mnt/outputfiles"
# MAGIC if PERM_MOUNTPOINT in [mnt.mountPoint for mnt in dbutils.fs.mounts()]:
# MAGIC     dbutils.fs.unmount(PERM_MOUNTPOINT)
STAGING_STORAGE_ACCOUNT = "--------"
STAGING_CONTAINER = "--------"
STAGING_FOLDER = "--------"
PERM_STORAGE_ACCOUNT = "--------"
PERM_CONTAINER = "--------"

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "#####################",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="DemoScope", key="DemoSecret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/**********************/oauth2/token"}

STAGING_SOURCE = "abfss://{container}@{storage_acct}.blob.core.windows.net/".format(
    container=STAGING_CONTAINER, storage_acct=STAGING_STORAGE_ACCOUNT)

try:
    dbutils.fs.mount(
        source=STAGING_SOURCE,
        mount_point=STAGING_MOUNTPOINT,
        extra_configs=configs)
except Exception as e:
    if "Directory already mounted" in str(e):
        pass  # Ignore error if already mounted.
    else:
        raise e

print("Staging Storage mount Success.")

inputDemoFile = "{}/{}/demo.csv".format(STAGING_MOUNTPOINT, STAGING_FOLDER)

readDF = (spark
          .read.option("header", True)
          .schema(inputSchema)
          .option("inferSchema", True)
          .csv(inputDemoFile))

PERM_SOURCE = "abfss://{container}@{storage_acct}.blob.core.windows.net/".format(
    container=PERM_CONTAINER, storage_acct=PERM_STORAGE_ACCOUNT)

try:
    dbutils.fs.mount(
        source=PERM_SOURCE,
        mount_point=PERM_MOUNTPOINT,
        extra_configs=configs)
except Exception as e:
    if "Directory already mounted" in str(e):
        pass  # Ignore error if already mounted.
    else:
        raise e

print("Landing Storage mount Success.")

outPatientsFile = "{}/patients.parquet".format(outPatientsFilePath)
print("Writing to parquet file: " + outPatientsFile)
The call below is failing with the error:
StatusCode=400
StatusDescription=An HTTP header that's mandatory for this request is not specified.
ErrorCode=
ErrorMessage=
(readDF
 .coalesce(1)
 .write
 .mode("overwrite")
 .option("header", "true")
 .option("compression", "snappy")
 .parquet(outPatientsFile)
)

I'll summarize the solution below.
If you want to mount Azure Data Lake Storage Gen2 as an Azure Databricks file system, the URL should be of the form abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/. For more details, please refer to the documentation here.
For example:
Create an Azure Data Lake Storage Gen2 account:
az login
az storage account create \
    --name <account-name> \
    --resource-group <resource-group> \
    --location westus \
    --sku Standard_RAGRS \
    --kind StorageV2 \
    --enable-hierarchical-namespace true
Create a service principal and assign the Storage Blob Data Contributor role to it, scoped to the Data Lake Storage Gen2 storage account:
az login
az ad sp create-for-rbac -n "MyApp" --role "Storage Blob Data Contributor" \
    --scopes /subscriptions/<subscription>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>
Mount Azure Data Lake Storage Gen2 in Azure Databricks (Python):
configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": "<appId>",
           "fs.azure.account.oauth2.client.secret": "<clientSecret>",
           "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant>/oauth2/token",
           "fs.azure.createRemoteFileSystemDuringInitialization": "true"}

dbutils.fs.mount(
    source="abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/folder1",
    mount_point="/mnt/flightdata",
    extra_configs=configs)
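Once the mount succeeds, a quick sanity check (using the standard dbutils file system utilities) is to list the new mount point:
# List the contents of the mount to confirm it is wired up to the ADLS Gen2 container.
display(dbutils.fs.ls("/mnt/flightdata"))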

A couple of important points to note when mounting storage accounts in Azure Databricks:
For Azure Blob storage: source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<directory-name>"
For Azure Data Lake Storage Gen2: source = "abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/"
To mount an Azure Data Lake Storage Gen2 file system, or a folder inside it, as an Azure Databricks file system, the URL must be of the form abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/
Reference: Azure Databricks - Azure Data Lake Storage Gen2
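Applied to the code in the question, the mount sources should point at the dfs endpoint rather than the blob endpoint. A minimal sketch, reusing the variable names from the question:
# Use the dfs endpoint (not blob.core.windows.net) when mounting ADLS Gen2 via abfss://.
STAGING_SOURCE = "abfss://{container}@{storage_acct}.dfs.core.windows.net/".format(
    container=STAGING_CONTAINER, storage_acct=STAGING_STORAGE_ACCOUNT)
PERM_SOURCE = "abfss://{container}@{storage_acct}.dfs.core.windows.net/".format(
    container=PERM_CONTAINER, storage_acct=PERM_STORAGE_ACCOUNT)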

Related

Azure - Copy LARGE blobs from one container to another using logic apps

I successfully built a logic app where, whenever a blob is added to container-one, it gets copied to container-2. However, it fails when any blob larger than 50 MB (the default size) is uploaded.
Could you please guide me?
Blobs are added via REST API.
Currently, the maximum file size with chunking disabled is 50 MB. One workaround is to use Azure Functions to transfer the files from one container to another.
Below is the sample Python code that worked for me when transferring files from one container to another:
from azure.storage.blob import BlobClient, BlobServiceClient
from azure.storage.blob import ResourceTypes, AccountSasPermissions
from azure.storage.blob import generate_account_sas
from datetime import datetime, timedelta

connection_string = '<Your Connection String>'
account_key = '<Your Account Key>'
source_container_name = 'container1'
blob_name = 'samplepdf.pdf'
destination_container_name = 'container2'

# Create client
client = BlobServiceClient.from_connection_string(connection_string)

# Create sas token for blob
sas_token = generate_account_sas(
    account_name=client.account_name,
    account_key=account_key,
    resource_types=ResourceTypes(object=True),
    permission=AccountSasPermissions(read=True),
    expiry=datetime.utcnow() + timedelta(hours=4)
)

# Create blob client for source blob
source_blob = BlobClient(
    client.url,
    container_name=source_container_name,
    blob_name=blob_name,
    credential=sas_token
)

# Create new blob and start copy operation
new_blob = client.get_blob_client(destination_container_name, blob_name)
new_blob.start_copy_from_url(source_blob.url)
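start_copy_from_url only kicks off a server-side copy, so for larger blobs it can help to poll the destination blob until the copy completes. A small follow-up sketch, reusing the new_blob client from above:
import time

# Poll the destination blob until the server-side copy finishes.
props = new_blob.get_blob_properties()
while props.copy.status == "pending":
    time.sleep(5)
    props = new_blob.get_blob_properties()
print("Copy finished with status:", props.copy.status)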
REFERENCES:
General Limits
How to copy a blob from one container to another container using Azure Blob storage SDK

How to copy data from a Databricks mount point to ADLS Gen2

I'm trying to write the data from the /mnt/Demo folder to ADLS Gen2. Could you please help with the steps to do that? So far, I have been able to execute the below lines of code, copy data from ADLS to the /mnt/Demo folder, and read data from it. How do I write data to ADLS through Databricks?
configs = {"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": "", #Enter <appId> = Application ID
"fs.azure.account.oauth2.client.secret": "", #Enter <password> = Client Secret created in AAD
"fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/cccc/oauth2/token", #Enter <tenant> = Tenant ID
"fs.azure.createRemoteFileSystemDuringInitialization": "true"}
dbutils.fs.mount(
source = "abfss://yy#xxx.dfs.core.windows.net/Test2", #Enter <container-name> = filesystem name <storage-account-name> = storage name
mount_point = "/mnt/Demo17",
extra_configs = configs)
df = spark.read.csv("/mnt/Demo16/Contract.csv", header="true")
df_review = df[['AccountId', 'Id', 'Contract_End_Date_2__c', 'Contract_Type__c', 'StartDate', 'Contract_Term_Type__c', 'Status', 'Description', 'CreatedDate', 'LastModifiedDate']]
df_review.repartition(1).write.mode("append").csv("abfss://salesforcedata#storagedemovs.dfs.core.windows.net/Test2/trial")
display(df_review)
display(df)
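Since /mnt/Demo17 is already mounted on the ADLS Gen2 container, one option is to write through the mount path instead of the abfss URI. A sketch, assuming the mount above succeeded and the target folder name is just an example:
# Writing to the mount path lands the files in the backing ADLS Gen2 container.
(df_review
 .repartition(1)
 .write
 .mode("append")
 .option("header", "true")
 .csv("/mnt/Demo17/trial"))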

Reading Blob Into Pyspark

I'm trying to read a series of JSON files stored in an Azure blob into Spark using a Databricks notebook. I set the conf with my account and key, but it always returns the error:
shaded.databricks.org.apache.hadoop.fs.azure.AzureException: java.lang.IllegalArgumentException: The String is not a valid Base64-encoded string.
I've followed along with the information provided here:
https://docs.databricks.com/_static/notebooks/data-import/azure-blob-store.html
and here:
https://luminousmen.com/post/azure-blob-storage-with-pyspark
I can pull the data just fine using the Azure SDK for Python.
storage_account_name = "name"
storage_account_access_key = "key"

spark.conf.set(
    "fs.azure.account.key." + storage_account_name + ".blob.core.windows.net",
    storage_account_access_key)

file_location = "wasbs://loc/locationpath"
file_type = "json"

df = spark.read.format(file_type).option("inferSchema", "true").load(file_location)
This should return a DataFrame of the JSON file.
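For reference, "pulling the data with the Azure SDK for Python" would look roughly like the sketch below (container and blob names are placeholders). If the same key works there, the Base64 error usually points at a malformed account key value in the spark.conf.set call, since the WASB driver expects the key to be valid Base64.
# Sketch: reading the same blob with the Azure SDK (names are placeholders).
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url=f"https://{storage_account_name}.blob.core.windows.net",
    credential=storage_account_access_key)
blob = service.get_blob_client(container="<container-name>", blob="<blob-name>.json")
data = blob.download_blob().readall()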

When provisioning with Terraform, how does code obtain a reference to machine IDs (e.g. the database machine address)?

Let's say I'm using Terraform to provision two machines inside AWS:
An EC2 Machine running NodeJS
An RDS instance
How does the NodeJS code obtain the address of the RDS instance?
You've got a couple of options here. The simplest one is to create a CNAME record in Route53 for the database and then always point to that CNAME in your application.
A basic example would look something like this:
resource "aws_db_instance" "mydb" {
allocated_storage = 10
engine = "mysql"
engine_version = "5.6.17"
instance_class = "db.t2.micro"
name = "mydb"
username = "foo"
password = "bar"
db_subnet_group_name = "my_database_subnet_group"
parameter_group_name = "default.mysql5.6"
}
resource "aws_route53_record" "database" {
zone_id = "${aws_route53_zone.primary.zone_id}"
name = "database.example.com"
type = "CNAME"
ttl = "300"
records = ["${aws_db_instance.default.endpoint}"]
}
Alternative options include taking the endpoint output from the aws_db_instance and passing it into a user data script when creating the EC2 instance, or passing it to Consul and using Consul Template to control the config that your application uses.
You may also try Sparrowform, a lightweight provisioning tool for Terraform-based instances; it can build an inventory of Terraform resources and provision the related hosts, passing them all the necessary data:
$ terraform apply  # bootstrap infrastructure

$ cat sparrowfile  # this scenario
                   # fetches the DB address from the Terraform cache
                   # and populates a configuration file
                   # on the server with the Node.js code:

#!/usr/bin/env perl6

use Sparrowform;

my $rdb-address;

for tf-resources() -> $r {
  my $r-id = $r[0];  # resource id
  if ( $r-id eq 'aws_db_instance.mydb' ) {
    my $r-data = $r[1];
    $rdb-address = $r-data<address>;
    last;
  }
}

# For instance, we can
# install a configuration file.
# The next chunk of code will be applied to
# the server with the Node.js code:
template-create '/path/to/config/app.conf', %(
  source    => ( slurp 'app.conf.tmpl' ),
  variables => %(
    rdb-address => $rdb-address
  ),
);

# sparrowform --ssh_private_key=~/.ssh/aws.pem --ssh_user=ec2  # run provisioning
P.S. Disclosure: I am the tool's author.

AWS SDK v2 for S3

Can anyone point me to good documentation for uploading files to S3 using aws-sdk version 2? I checked the main docs, and in v1 we used to do:
s3 = AWS::S3.new
obj = s3.buckets['my-bucket']
Now in v2, when I try:
s3 = Aws::S3::Client.new
I end up with:
Aws::Errors::MissingRegionError: missing region; use :region option or export region name to ENV['AWS_REGION']
Can anyone help me with this?
As per the official documentation:
To use the Ruby SDK, you must configure a region and credentials.
Therefore,
s3 = Aws::S3::Client.new(region:'us-west-2')
Alternatively, a default region can be loaded from one of the following locations:
Aws.config[:region]
ENV['AWS_REGION']
Here's a complete S3 demo with the aws v2 gem that worked for me:
Aws.config.update(
  region: 'us-east-1',
  credentials: Aws::Credentials.new(
    Figaro.env.s3_access_key_id,
    Figaro.env.s3_secret_access_key
  )
)

s3 = Aws::S3::Client.new
resp = s3.list_buckets
puts resp.buckets.map(&:name)
Gist
Official list of AWS region IDs here.
If you're unsure of the region, the best guess is US Standard, which has the ID us-east-1 for config purposes, as shown above.
If you were using an aws.yml file for your credentials in Rails, you might want to create a file config/initializers/aws.rb with the following content:
filename = File.expand_path(File.join(Rails.root, "config", "aws.yml"))
config = YAML.load_file(filename)
aws_config = config[Rails.env.to_s].symbolize_keys

Aws.config.update({
  region: aws_config[:region],
  credentials: Aws::Credentials.new(aws_config[:access_key_id], aws_config[:secret_access_key])
})
The config/aws.yml file would need to be adapted to include the region:
development: &development
  region: 'your region'
  access_key_id: 'your access key'
  secret_access_key: 'your secret access key'

production:
  <<: *development
