Write Data to SQL DW from Apache Spark in Azure Synapse - azure-databricks

When I write data to SQL DW in Azure from Databricks I use the following code:
example1.write.format("com.databricks.spark.sqldw")
  .option("url", sqlDwUrlSmall)
  .option("dbtable", "SampleTable12")
  .option("forward_spark_azure_storage_credentials", "True")
  .option("tempdir", tempDir)
  .mode("overwrite")
  .save()
This doesn't work in a Synapse notebook. I get the error:
Py4JJavaError: An error occurred while calling o174.save.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.sqldw. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:656) Caused by: java.lang.ClassNotFoundException: com.databricks.spark.sqldw.DefaultSource
Basically, I need to know the equivalent of com.databricks.spark.sqldw for Apache Spark in Azure Synapse.
Thanks

If you are writing to a dedicated SQL pool within the same Synapse workspace as your notebook, then it's as simple as calling the synapsesql method. Here is a simple parameterised example in Scala, using the parameter cell feature of Synapse notebooks:
// Imports required for the synapsesql method and the Constants object
// (Azure Synapse Dedicated SQL Pool connector)
import org.apache.spark.sql.SqlAnalyticsConnector._
import com.microsoft.spark.sqlanalytics.utils.Constants

// Read the table
val df = spark.read.synapsesql(s"${pDatabaseName}.${pSchemaName}.${pTableName}")
// do some processing ...
// Write it back with _processed suffixed to the table name, as an internal (managed) table
df.write.synapsesql(s"${pDatabaseName}.${pSchemaName}.${pTableName}_processed", Constants.INTERNAL)
If you are trying to write from your notebook to a different dedicated SQL pool, or to the old Azure SQL Data Warehouse, then it's a bit different, but there are some great examples here.
UPDATE: The items in curly brackets with the dollar-sign (eg ${pDatabaseName}) are parameters. You can designate a parameter cell in your notebook so parameters can be passed in externally eg from Azure Data Factory (ADF) or Synapse Pipelines using the Execute Notebook activity, and reused in the notebook, as per my example above. Find out more about Synapse Notebook parameters here.
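For illustration only, a parameters cell in a PySpark notebook could look like the sketch below; the default values are hypothetical, and a Scala notebook (as in the example above) would declare the same names in its own parameters cell:
# Parameters cell: toggle "Parameters" on this cell in the Synapse notebook UI.
# Values passed in from the Execute Notebook activity in ADF or Synapse Pipelines
# override these hypothetical defaults at run time.
pDatabaseName = "myDedicatedPool"
pSchemaName = "dbo"
pTableName = "SampleTable12"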

Related

Setup Athena JDBC connector in Glue 4.0

I have a data source in Glue which is configured with partition projection. I can query the data in Athena; however, when I load this data source in a Glue 4.0 job, the Spark DataFrame comes back empty. It seems that partition projection is an Athena-only feature.
To work around the issue, I would like to set up a JDBC connector for Athena in my Glue job, so I can access the data via Athena instead of directly querying the Glue catalog. AWS provides instructions and a jar file here: https://docs.aws.amazon.com/athena/latest/ug/connect-with-jdbc.html.
So I'm adding the latest jar file (at the time of writing, AthenaJDBC42-2.0.35.1000.jar) into Spark using the --extra-jars argument, but I'm getting this error:
java.lang.SecurityException: class "org.apache.logging.log4j.core.lookup.JndiLookup"'s signer information does not match signer information of other classes in the same package
Does anyone know how I can address this error?
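For reference, the kind of JDBC read I'm aiming for once the jar loads cleanly is roughly the sketch below. The URL format, the com.simba.athena.jdbc.Driver class, and the credentials-provider setting are assumptions based on the Simba Athena JDBC driver documentation, and the region, bucket, and table names are placeholders:
# Rough sketch (Glue 4.0 / Spark 3.3, Python); `spark` is the Glue job's
# SparkSession, e.g. glueContext.spark_session. All names below are placeholders.
athena_url = (
    "jdbc:awsathena://AwsRegion=eu-west-1;"
    "S3OutputLocation=s3://my-athena-query-results/;"
    "AwsCredentialsProviderClass="
    "com.simba.athena.amazonaws.auth.DefaultAWSCredentialsProviderChain"
)

df = (
    spark.read.format("jdbc")
    .option("url", athena_url)
    .option("driver", "com.simba.athena.jdbc.Driver")
    .option("dbtable", "my_database.my_partition_projected_table")
    .load()
)
df.show(5)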

Querying data from external hive metastore

I am trying to configure an external Hive metastore for my Azure Synapse Spark pool. The rationale behind using an external metastore is to share table definitions across Databricks and Synapse workspaces.
However, I am wondering if it's possible to access the backend data via the metastore. For example, can clients like Power BI or Tableau connect to the external metastore and retrieve not just the metadata, but also the business data in the underlying tables?
Also, what additional value does an external metastore provide?
You can configure the external Hive Metastore in Synapse by creating a Linked Service for that external source and then query it from the Synapse serverless SQL pool.
Follow the steps below to connect to the external Hive Metastore.
In the Synapse portal, go to the Manage hub on the left side of the page. Click on it and then click on Linked services. To create the new Linked Service, click on + New.
Search for Azure SQL Database or Azure Database for MySQL for the external Hive Metastore. Synapse supports these two external Hive metastores. Select one and Continue.
Fill in all the required details like Name, Subscription, Server name, Database name, Username and Password, and test the connection.
You can test the connection to the Hive metastore database using the code below.
%%spark
import java.sql.DriverManager
// This JDBC URL can be copied from Azure portal > Azure SQL database > Connection strings > JDBC
val url = s"jdbc:sqlserver://<servername>.database.windows.net:1433;database=<databasename>;user=utkarsh;password=<password>;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;"
try {
  val connection = DriverManager.getConnection(url)
  // Query the Hive metastore's VERSION table so the reported value is the schema version
  val result = connection.createStatement().executeQuery("SELECT SCHEMA_VERSION FROM VERSION")
  result.next()
  println(s"Successfully tested connection. Hive Metastore version is ${result.getString(1)}")
} catch {
  case ex: Throwable => println(s"Failed to establish connection:\n $ex")
}
can clients like Power BI or Tableau connect to the external metastore and retrieve not just the metadata, but also the business data in the underlying tables?
Yes, Power BI allows us to connect to an Azure SQL Database using the built-in connector.
In Power BI Desktop, go to Get Data, click on Azure and select Azure SQL Database. Click Connect.
In the next step, give the server name in the format <servername>.database.windows.net, along with the database name, username and password, and you can now access the data in Power BI.

Oracle to Databricks Connection

I'm trying to read Oracle Database data on Azure Databricks platform.
Can someone share the step-by-step process for connecting Oracle data to Databricks? I've probably searched the whole internet and read the documentation, but I can't find a solution that actually works. Not sure if it's because I have the incorrect driver or what.
Here's my process:
Uploaded the ojdbc8.jar file to the cluster libraries (the Instant Client 19)
Tried to connect to the data in a Databricks notebook and it didn't work
Can anyone share their process?
Which jar to upload in the library and where can I find this file?
How to connect? Sample code?
Any better way to do this?
To install the library, use:
pip install cx_Oracle
Then use the below code snippet to read data from an Oracle database:
CREATE TABLE oracle_table
USING org.apache.spark.sql.jdbc
OPTIONS (
  dbtable 'table_name',
  driver 'oracle.jdbc.driver.OracleDriver',
  user 'username',
  password 'password',
  url 'jdbc:oracle:thin:@//<hostname>:1521/<db>')
To read data from an Oracle database in PySpark, you can follow this article - Reading Data From Oracle Database With Apache Spark
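A minimal PySpark sketch along those lines, assuming the ojdbc8.jar driver is installed on the cluster; the host, service name, schema, table and credentials below are placeholders:
# Read an Oracle table over JDBC into a Spark DataFrame (placeholder connection details).
jdbc_url = "jdbc:oracle:thin:@//myoraclehost.example.com:1521/ORCLPDB1"

df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "MYSCHEMA.MYTABLE")
    .option("user", "username")
    .option("password", "password")
    .option("driver", "oracle.jdbc.driver.OracleDriver")
    .load()
)

display(df)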
For more information, refer to Oracle | Databricks.

Can I Query a SQL Server Database from Azure Synapse Without Using a Pipeline?

Is it possible to perform a "SELECT" statement query against a SQL Server database from an Azure Synapse notebook using PySpark + SQL?
The only way I've been able to ingest data from a SQL Server database into Azure Synapse is by creating an integration pipeline.
I'm new to using Azure Synapse as well as Apache Spark, so any advice you can provide is much appreciated.
This is possible in theory and I have tested it with an Azure SQL Database. I'm not 100% sure it would work with a SQL Server; it would require the network security to be right and there should be a line of sight between the two databases. Is your SQL Server in Azure, for example? Are they on the same vnet or peered vnets?
A simple example in a Synapse notebook:
import pyodbc

sqlQuery = "select @@version v"

try:
    conn = pyodbc.connect( 'DRIVER={ODBC Driver 17 for SQL Server};'
                           'SERVER=someSynapseDB.sql.azuresynapse.net;'
                           'DATABASE=yourDatabaseName;UID=someReadOnlyUser;'
                           'PWD=youWish;', autocommit = True )
    cursor = conn.cursor()
    cursor.execute(sqlQuery)
    row = cursor.fetchone()
    while row:
        print(row[0])
        row = cursor.fetchone()
except:
    raise
finally:
    # Tidy up
    cursor.close()
    conn.close()
Inspired by this post by Jovan Popovic:
https://techcommunity.microsoft.com/t5/azure-synapse-analytics/query-serverless-sql-pool-from-an-apache-spark-scala-notebook/ba-p/2250968
Just out of interest, is there a particular reason you are doing this in notebooks? Synapse pipelines are a perfectly good way of doing it, and a typical pattern would be to stage the data in a data lake. Or is there some special functionality you need notebooks for?

Create external data source in Azure Synapse Analytics (Azure SQL Data warehouse) to Oracle

I am trying to create an external data source in Azure Synapse Analytics (Azure SQL Data Warehouse) pointing to an external Oracle database. I am using the following code in SSMS to do that:
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'myPassword';
CREATE DATABASE SCOPED CREDENTIAL MyCred WITH IDENTITY = 'myUserName', Secret = 'Mypassword';
CREATE EXTERNAL DATA SOURCE MyEXTSource
WITH (
LOCATION = 'oracle://<myIPAddress>:1521',
CREDENTIAL = MyCred
)
I am getting the following error:
CREATE EXTERNAL DATA SOURCE statement failed because the 'TYPE' option is not specified. Specify a value for the 'TYPE' option and try again.
I understand from the below that TYPE is not a required option for Oracle databases.
https://learn.microsoft.com/en-us/sql/t-sql/statements/create-external-data-source-transact-sql?view=azure-sqldw-latest
Not sure what the problem is here. Is this feature still not supported in Azure Synapse Analytics (Azure DW) when it is already available in MS SQL Server 2019? Any ideas are welcome.
PolyBase has different versions across the different products, each with different capabilities. Most of these are described in the documentation.
The ability to connect to Oracle is only present in the SQL Server versions, currently 2019. The documentation is quite clear that it only applies to SQL Server and not to Azure Synapse Analytics (formerly Azure SQL Data Warehouse):
https://learn.microsoft.com/en-us/sql/relational-databases/polybase/polybase-configure-oracle?view=sql-server-ver15
In summary, Azure Synapse Analytics and its version of PolyBase do not currently support access to external Oracle tables.
