Failed to run Spark query in Databricks notebook after storage configuration - azure-blob-storage

I already set up a Key Vault secret scope in the notebook and established the connection to the storage account using the following settings:
spark.conf.set("fs.azure.account.auth.type."+StorageAccountName+".dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type."+StorageAccountName+".dfs.core.windows.net","org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id."+StorageAccountName+".dfs.core.windows.net",clientId)
spark.conf.set("fs.azure.account.oauth2.client.secret."+StorageAccountName+".dfs.core.windows.net",clientSecret)
spark.conf.set("fs.azure.account.oauth2.client.endpoint."+StorageAccountName+".dfs.core.windows.net","https://login.microsoftonline.com/mytenantid/oauth2/token")
The values of "StorageAccountName", "clientId", and "clientSecret" all come from Key Vault, and I am able to retrieve them properly. In my storage account's access control I also assigned the Storage Blob Data Contributor role to my service principal.
After these configurations, I assigned a connection variable:
var apptable = "abfss://container@"+StorageAccountName+".dfs.core.windows.net/path/to/data"
If I run the following command, I am able to see the files in the blob storage
display(dbutils.fs.ls(apptable))
I am also able to check the schema:
var df = spark.read.format("delta").load(apptable)
df.printSchema()
But if I try to run the following query:
var last_appt = spark.sql(s"""select max(updateddate) from apptable""").collect()(0).getTimestamp(0)
I got the error:
KeyProviderException: Failure to initialize configuration
Caused by: InvalidConfigurationValueException: Invalid configuration value detected for fs.azure.account.key
I researched online and it seems there are some issues in the Spark configs. But if access to the storage had failed, how come the display command above runs fine? What could possibly be missing in this scenario?
I have limited experience with Databricks. I appreciate any help.

I tried to reproduce the same in my environment, configured it the same way as mentioned above, and got the results below.
Please follow the code below:
Read the Spark DataFrame df:
var df = spark.read.format("delta").load(apptable)
Create a temp view:
%scala
val temp_table_name = "demtb"
df.createOrReplaceTempView(temp_table_name)
Now, using the code below, I got this output:
%scala
val aa= spark.sql("""select max(marks) from demtb""")
display(aa)
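Applied to the original question's data, the same pattern would look roughly like this sketch (updateddate comes from the question and app_delta is a made-up view name). Inside the SQL string, apptable is not substituted with the Scala variable holding the abfss path; it is resolved as a table name, which is likely why the query falls back to a different, account-key based access path. Registering the DataFrame as a temp view and querying that view avoids this:
%scala
val df = spark.read.format("delta").load(apptable)
df.createOrReplaceTempView("app_delta")
val last_appt = spark.sql("select max(updateddate) from app_delta").collect()(0).getTimestamp(0)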
Update:
As mentioned in the comment below, it's working fine for me.
df1.write.mode("overwrite").format("parquet").option("path","/FileStore/dd/").option("overwriteschema","true").saveAsTable("app")
Also, you can try this syntax for configuring Azure Gen2. You can change the file format as required; for this demo I'm using CSV.
spark.conf.set("fs.azure.account.key.<storage_account_name>.dfs.core.windows.net","Access_key")
Scala
%scala
val df1 = spark.read.format("csv").option("header", "true").load("abfss://pool@vamblob.dfs.core.windows.net/")
display(df1)
Python
df1 = spark.read.format("csv").option("header", "true").load("abfss://pool@vamblob.dfs.core.windows.net/")
display(df1)

Related

Append to an Azure Blob using SAS URL

I need to continuously append to a blob in a container for which I have been provided a SAS URL.
I am doing this:
var blobClient = new AppendBlobClient(mySASUri);
var blobContentInfo = blobClient.CreateIfNotExists();
but Create and CreateIfNotExists do not take a blob name parameter, which is strange for a create method.
And I get an authentication exception when using the following:
mySASUri="https://[myaccount].blob.core.windows.net/[my container]?sp=racwl&st=2022-02-03T08:29:46Z&se=2022-02-03T16:29:46Z&spr=https&sv=2020-08-04&sr=c&sig=[the signature]"
I have been reading a lot about Azure SAS, but everything either talks about generating a SAS or stops at a very basic level.
Thanks to anyone who looks at this and can provide either a reading reference or guidance on which API combinations should work for this use case.
Thanks,
Tauqir
Considering your SAS URL is for the container, it would be better to create an instance of BlobContainerClient first and then get an AppendBlobClient from it using the GetAppendBlobClient extension method (from the Azure.Storage.Blobs.Specialized namespace).
Something like:
var blobContainerClient = new BlobContainerClient(new Uri(mySASUri));
var appendBlobClient = blobContainerClient.GetAppendBlobClient("append-blob-name"); // requires: using Azure.Storage.Blobs.Specialized;
...do the append blob operations here...
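For completeness, the "append blob operations" part could look roughly like the sketch below, continuing from the appendBlobClient above (the content is a placeholder; the sp=racwl SAS in the question already carries the add/create/write permissions an append needs, and the snippet assumes using System.IO; and using System.Text;):
appendBlobClient.CreateIfNotExists();
// each AppendBlock call appends a block of data to the end of the blob
using (var stream = new MemoryStream(Encoding.UTF8.GetBytes("one line of data\n")))
{
    appendBlobClient.AppendBlock(stream);
}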

PyArrow get Metadata from file in S3

I want to get Parquet file statistics (such as Min/Max) from file in S3 using PyArrow.
I am able to fetch the dataset using
pq.ParquetDataset(s3_path, filesystem=s3)
and I can get the statistics if I download the file and read it using:
ParquetFile(full_path).metadata.row_group(0).column(col_idx).statistics
I hope there is a way to achieve this without downloading the whole file.
Thanks
I came to this post looking for a similar answer a few days ago. In the end I found a simple solution that works for me.
import pyarrow.parquet as pq
from pyarrow import fs
s3_files = fs.S3FileSystem(access_key)  # whatever is needed to connect to S3
# fetch the dataset
dataset = pq.ParquetDataset(s3_path, filesystem=s3_files)
metadata = {}
for fragment in dataset.fragments:
    meta = fragment.metadata
    metadata[fragment.path] = meta
    print(meta)
The metadata is stored in a dictionary where the keys are the paths to the fragments in S3 and the values are the metadata of each fragment.
To access the statistics, just use:
meta.row_group(0).column(col_idx).statistics
Something like this will be printed for every fragment:
<pyarrow._parquet.FileMetaData object at 0x7fb5a045b5e0>
created_by: parquet-cpp-arrow version 8.0.0
num_columns: 6
num_rows: 10
num_row_groups: 1
format_version: 1.0
serialized_size: 3673
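If you only need the min/max of one column, a variation of the same idea could look like the sketch below (col_idx is a placeholder for the column you care about, and access_key / s3_path are the same as above; only the Parquet footers are fetched, not the data):
import pyarrow.parquet as pq
from pyarrow import fs

s3_files = fs.S3FileSystem(access_key)  # same connection settings as above
dataset = pq.ParquetDataset(s3_path, filesystem=s3_files)

col_idx = 0  # placeholder: index of the column whose statistics you want
for fragment in dataset.fragments:
    meta = fragment.metadata
    for rg in range(meta.num_row_groups):
        stats = meta.row_group(rg).column(col_idx).statistics
        if stats is not None and stats.has_min_max:
            print(fragment.path, rg, stats.min, stats.max)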

Elasticsearch and Spark

I am trying to set up Spark and Elasticsearch using the elasticsearch-spark library with the sbt artifact: "org.elasticsearch" %% "elasticsearch-spark" % "2.3.2". When I try to configure Elasticsearch with this code:
val sparkConf = new SparkConf().setAppName("test").setMaster("local[2]")
.set("es.index.auto.create", "true")
.set("es.resource", "test")
.set("es.nodes", "test.com:9200")
I keep getting an "illegal character" error for all of the Elasticsearch set statements above. Anyone know the issue?
You must have copied the code from a website or blog; it contains unreadable characters (typically smart quotes or non-breaking spaces) that are actually giving you trouble.
Simple solution: delete all the content and retype it manually, then run it. Let me know if you face any problems again and I will help you out.
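For reference, a cleanly retyped version of that configuration (plain ASCII quotes only), plus a minimal write through the connector to confirm it works, could look like this sketch; the test/docs index and the sample record are made up:
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._  // brings saveToEs into scope

val sparkConf = new SparkConf()
  .setAppName("test")
  .setMaster("local[2]")
  .set("es.index.auto.create", "true")
  .set("es.nodes", "test.com:9200")

val sc = new SparkContext(sparkConf)

// hypothetical sample document, just to verify the connector round-trips
sc.makeRDD(Seq(Map("id" -> 1, "name" -> "alice")))
  .saveToEs("test/docs")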
You might want to set http.publish_host in your elasticsearch.yml to HOST_NAME. The es-hadoop connector sniffs the nodes from the _nodes/transport API, so it checks what the published HTTP address is.

Unable to query using a pointer using Android SDK

I am trying to run a query using version com.parse:parse-android:1.13.0 of the Parse Android SDK. I am creating a query on the local datastore and using the ParseQuery.whereMatchesQuery() method to match a column storing a pointer to another class in my database. The code that I have is the following:
ParseQuery<PracticeSessionDetails> query = ParseQuery.getQuery(PracticeSessionDetails.class);
query.fromLocalDatastore();
query.ignoreACLs();
query.whereEqualTo("user", ParseUser.getCurrentUser());
ParseQuery courseQuery = new ParseQuery("Course");
courseQuery.whereEqualTo("objectId",courseId);
query.whereMatchesQuery("course", courseQuery);
When I run the query using query.getFirst(), nothing is retrieved from the local datastore. I have already checked running the courseQuery separately, and it fetches the Course object that I need. Is this a known issue? I took this approach based on the help in this post.
I think you are mixing up the queries when matching the pointer to the other class. Use the following code to resolve your problem:
ParseQuery<PracticeSessionDetails> query = ParseQuery.getQuery(PracticeSessionDetails.class);
query.fromLocalDatastore();
query.ignoreACLs();
query.whereEqualTo("user", ParseUser.getCurrentUser());
ParseQuery<ParseObject> courseQuery = ParseQuery.getQuery("Course");
courseQuery.whereMatchesQuery("course", query);

No signature of method: groovy.lang.MissingMethodException.makeKey()

I've installed titan-0.5.0-hadoop2 with HBase and Elasticsearch support.
I've loaded the graph with
g = TitanFactory.open('conf/titan-hbase-es.properties')
==>titangraph[hbase:[127.0.0.1]]
and then I loaded the test application
GraphOfTheGodsFactory.load(g)
Now, when I try to create a new index key with:
g.makeKey('userId').dataType(String.class).indexed(Vertex.class).unique().make()
I get this error:
No signature of method: groovy.lang.MissingMethodException.makeKey() is applicable for argument types: () values: []
Possible solutions: every(), any()
Display stack trace? [yN]
Can someone help me with this?
When I list the indexed keys, I see this:
g.getIndexedKeys(Vertex.class)
==>reason
==>age
==>name
==>place
I'm not completely following what you are trying to do. It appears that you loaded the Graph of the Gods into g and then want to add userId as a new property to the schema. If that's right, then I think your syntax is wrong for the Titan 0.5 API. The method for managing the schema is very different from previous versions. Changes to the schema are performed through the ManagementSystem interface, an instance of which you can get through:
mgmt = g.getManagementSystem()
The syntax for adding a property then looks something like:
birthDate = mgmt.makePropertyKey('birthDate').dataType(Long.class).cardinality(Cardinality.SINGLE).make()
mgmt.commit()
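Applied to the userId key from the question, a sketch against the Titan 0.5 management API (the index name byUserId is made up) would be:
mgmt = g.getManagementSystem()
userId = mgmt.makePropertyKey('userId').dataType(String.class).cardinality(Cardinality.SINGLE).make()
// the old .indexed(Vertex.class).unique() chain becomes a unique composite index
mgmt.buildIndex('byUserId', Vertex.class).addKey(userId).unique().buildCompositeIndex()
mgmt.commit()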
Note that g.getIndexedKeys(Class) is not the appropriate way to get schema information either. You should use the ManagementSystem for that too.
Please see the documentation here for more information.
