I have a Scala codebase where I am accessing Azure blob files using the Hadoop FileSystem APIs (not the Azure blob web client). My usage is of the following format:
val hadoopConfig = new Configuration()
hadoopConfig.set(s"fs.azure.sas.${blobContainerName}.${accountName}.blob.core.windows.net",
  sasKey)
hadoopConfig.set("fs.defaultFS",
  s"wasbs://${blobContainerName}@${accountName}.blob.core.windows.net")
hadoopConfig.set("fs.wasb.impl",
  "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
hadoopConfig.set("fs.wasbs.impl",
  "org.apache.hadoop.fs.azure.NativeAzureFileSystem$Secure")
val fs = FileSystem.get(
  new java.net.URI(s"wasbs://${blobContainerName}@${accountName}.blob.core.windows.net"),
  hadoopConfig)
I am now writing unit tests for this code using the Azure Storage Emulator as the storage account. I went through this page, but it only explains how to access the emulator through the web APIs of AzureBlobClient. I need to figure out how to test the code above by accessing the Azure Storage Emulator through the Hadoop FileSystem APIs. I have tried the following, but it does not work:
val hadoopConfig = new Configuration()
hadoopConfig.set(s"fs.azure.sas.${containerName}.devstoreaccount1.blob.core.windows.net",
  "Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==")
hadoopConfig.set("fs.defaultFS",
  s"wasbs://${containerName}@devstoreaccount1.blob.core.windows.net")
hadoopConfig.set("fs.wasb.impl",
  "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
hadoopConfig.set("fs.wasbs.impl",
  "org.apache.hadoop.fs.azure.NativeAzureFileSystem$Secure")
val fs = FileSystem.get(
  new java.net.URI(s"wasbs://${containerName}@devstoreaccount1.blob.core.windows.net"),
  hadoopConfig)
I was able to solve this problem and connect to the storage emulator by adding the following two configurations:
hadoopConfig.set("fs.azure.test.emulator",
"true")
hadoopConfig.set("fs.azure.storage.emulator.account.name",
"devstoreaccount1.blob.windows.core.net")
In my Spark application I have the following Scala code:
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient
import io.confluent.kafka.schemaregistry.client.rest.RestService
import scala.collection.JavaConverters._

val restService = new RestService(schemaRegistryUrl)
val props = Map(
  "basic.auth.credentials.source" -> "USER_INFO",
  "basic.auth.user.info" -> "%s:%s".format(key, secret)
).asJava
val schemaRegistryClient = new CachedSchemaRegistryClient(restService, 100, props)
// later:
schemaRegistryClient.getSchemaById(id) // fails with 401
I've verified that I can fetch a schema by the same id from the REST API using the same basic auth credentials, but with this code I get a 401 auth error. So I'm obviously missing something here, and I'd be glad for some help.
The version of both RestService and the schema registry client is 6.2.1 (the latest, I believe).
Note that this works locally when I run the code with the same credentials against the same schema registry, and it also works from Databricks when I use Python with the same connection variables. It is only with Scala from Databricks that I get the 401.
The answer turned out to be that my assembly build dropped some important files when I discarded META-INF wholesale. Not dropping it, and instead choosing carefully which entries to keep, took some effort, but it eventually worked; a rough sketch is below.
It also turned out that the Avro version (1.10) pulled in by the Confluent schema registry client I used is not compatible with the Spark versions currently available in Databricks (3.0/3.1), which only support Avro 1.8.
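For illustration, a rough sketch of what that can look like in build.sbt. This is not my exact build definition: which META-INF entries you actually need depends on your dependency tree, and the Avro version below is simply the one Spark 3.0/3.1 ships with.
// build.sbt (sketch): keep service-loader registrations instead of discarding all of META-INF
assembly / assemblyMergeStrategy := {
  // ServiceLoader files must be concatenated, not dropped, or clients may lose providers at runtime
  case PathList("META-INF", "services", xs @ _*) => MergeStrategy.concat
  // Signature files would break the fat jar, so these are safe to discard
  case PathList("META-INF", xs @ _*)
      if xs.lastOption.exists(f => f.endsWith(".SF") || f.endsWith(".DSA") || f.endsWith(".RSA")) =>
    MergeStrategy.discard
  // Everything else: take the first copy found
  case _ => MergeStrategy.first
}

// Pin Avro to the version compatible with Spark 3.0/3.1 instead of the 1.10.x pulled in transitively
dependencyOverrides += "org.apache.avro" % "avro" % "1.8.2"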
We have a Spring Boot application running in Azure App Service that performs ETL operations on CSV files.
A file is placed in the instance's local directory, from where the application picks it up and processes it. We are facing an issue when the uploaded file is larger than 10 MB: the reader is not able to read the file and returns null. We are using Super CSV to process the CSV file.
FileReader fr = new FileReader(filePath);
BufferedReader bufferedReader = new BufferedReader(fr);
CsvListReader reader = new CsvListReader(bufferedReader, CsvPreference.EXCEL_NORTH_EUROPE_PREFERENCE);
List<String> read = reader.read();
The reader.read() method returns null. The issue happens only in Azure App Service (Linux); the same code works perfectly locally.
Can anyone help me figure out what the issue is here?
I am trying to write data to a CSV file and store it on Azure Data Lake Gen2, and I run into a "Job aborted" error message. The same code used to work fine previously.
Error Message:
org.apache.spark.SparkException: Job aborted.
Code:
import requests
response = requests.get('https://myapiurl.com/v1/data', auth=('user', 'password'))
data = response.json()
from pyspark.sql import *
df=spark.createDataFrame([Row(**i) for i in data])
df.write.format(source).mode("overwrite").save(path) #error line
I summarize the solution below.
If you want to access Azure Data Lake Gen2 in Azure Databricks, you have two choices.
Mount Azure Data Lake Gen2 as a file system in Azure Databricks. After doing that, you can read and write files with paths under /mnt/<mount-name>, and the mount code only needs to be run once (a short usage example follows the mount code below).
a. Create a service principal and assign the Storage Blob Data Contributor role to it, scoped to the Data Lake Storage Gen2 storage account:
az login
az ad sp create-for-rbac -n "MyApp" --role "Storage Blob Data Contributor" \
--scopes /subscriptions/<subscription>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>
b. Mount code:
configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": "<appId>",
           "fs.azure.account.oauth2.client.secret": "<clientSecret>",
           "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant>/oauth2/token",
           "fs.azure.createRemoteFileSystemDuringInitialization": "true"}

dbutils.fs.mount(
  source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/folder1",
  mount_point = "/mnt/flightdata",
  extra_configs = configs)
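Once mounted, reads and writes just go through the mount point. A minimal sketch (shown here in Scala; the equivalent DataFrame calls work the same way from PySpark, and the path under /mnt/flightdata is a placeholder):
// In a Databricks notebook, `spark` is the preconfigured SparkSession
import spark.implicits._

val df = Seq("10", "11", "13").toDF("age")
df.coalesce(1)
  .write
  .format("csv")
  .option("header", "true")
  .mode("overwrite")
  .save("/mnt/flightdata/result_csv") // addressed through the mount point instead of an abfss:// URI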
Access directly using the storage account access key.
We can add spark.conf.set("fs.azure.account.key.<storage-account-name>.dfs.core.windows.net", "<storage-account-access-key>") to our script. Then we can read and write files with paths of the form abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/.
For example:
from pyspark.sql.types import StringType
spark.conf.set(
"fs.azure.account.key.testadls05.dfs.core.windows.net", "<account access key>")
df = spark.createDataFrame(["10", "11", "13"], StringType()).toDF("age")
df.show()
df.coalesce(1).write.format('csv').option('header', True).mode('overwrite').save('abfss://test@testadls05.dfs.core.windows.net/result_csv')
For more details, please refer to the Azure Databricks documentation on Azure Data Lake Storage Gen2.
I have a Windows service which seems to be unable to use isolated storage (most likely because I'm doing something wrong).
The service runs under Local System.
Every time I run the code, there is no change to the isolated storage folder.
Can someone please tell me why this is so?
I have tried on Windows 7 and Windows 8.1.
IsolatedStorageFileStream configFile = new IsolatedStorageFileStream("UIsolate.cfg", FileMode.Create);
// create a writer to write to the stream
StreamWriter writer = new StreamWriter(configFile);
// write some data to the config file
writer.WriteLine("test");
// flush the buffer and clean up
writer.Flush();
writer.Close();
configFile.Close();
I am trying to find out how to access the GAE Cloud Datastore from outside GAE, ideally using a standalone Python script (on a Mac) to run GQL like
q = GqlQuery("SELECT * FROM user WHERE email = 'test@example.com'")
or maybe even a local instance of GAE using GoogleAppEngineLauncher if a standalone script is not possible.
I have done the following:
Followed "Accessing an existing App Engine Datastore from another platform" (https://cloud.google.com/datastore/docs/activate) - set up permissions with a service account + private key.
Installed the Python SDK - confirmed the SDK files are in
/usr/local/lib/python2.7/site-packages/google_appengine
/usr/local/google_appengine/google/appengine/*** (api, base, client, datastore, ..., ext, etc.)
Running print sys.path shows (among other paths):
/usr/local/Cellar/python/2.7.8_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/google_appengine
/usr/local/Cellar/python/2.7.8_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/gtk-2.0
/Library/Python/2.7/site-packages
/usr/local/lib/python2.7/site-packages
/usr/local/lib/python2.7/site-packages/google_appengine
/usr/local/lib/python2.7/site-packages/gtk-2.0
Did the exports:
export DATASTORE_SERVICE_ACCOUNT=...
export DATASTORE_PRIVATE_KEY_FILE=... (full path to .p12 file)
export DATASTORE_DATASET=...
export PYTHONPATH="$PYTHONPATH:/usr/local/google_appengine"
Ran the example adams.py file, which created a new entity called "Trivia" with name=hgtg, answer=42 as a record in PROD.
However, running a standalone script containing
from appengine.ext.db import GqlQuery (or from google.appengine.ext.db import GqlQuery)
gives me an ImportError:
No module named appengine.ext.db
I then tried to use a local GAE instance but can't figure out how to connect it to the PROD GAE Datastore.
===
The existing GAE application is using the Datastore (Java) in PRODUCTION. In the Developer Console (https://console.developers.google.com/project), this lives under Storage > Cloud Datastore > Query, where I can see the "Entities" (or kinds). Obviously, there is a fairly limited set of things you can do there, and I don't really want to touch PRODUCTION code just to run a query.
Thanks,
Chris