Databricks SQL workspace configuration with External MetaStore - azure-databricks

I have an Azure Databricks setup with an external Hive metastore (backed by Azure SQL), and the database connection URL is configured in the Databricks cluster's advanced settings. This way I am able to see and access the Delta Lake tables (stored on an Azure storage account, ADLS) in the Data section of Databricks.
Now I want my users to access these tables through Databricks' SQL workspace. I have configured 'Data access' using a service principal in the 'SQL warehouse' section.
Per the Databricks SQL documentation, I am supposed to see the same Delta Lake tables that I can see through the 'Data Science and Engineering' section, but I can't see any schemas or tables from the metastore.
Problem: I am not able to see the tables through 'SQL workspace' > Data, and I am puzzled how it would know where my external metastore is and what the schema definitions are.
I assume something in the SQL workspace should be set up to point to the Hive metastore connection, but I am not sure, as the Databricks documentation is not very clear on this point.
Please suggest.
Below are the 'Data access' details for the service principal:
spark.hadoop.fs.azure.account.auth.type.<adlsContainer>.dfs.core.windows.net OAuth
spark.hadoop.fs.azure.account.oauth.provider.type.<adlsContainer>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.hadoop.fs.azure.account.oauth2.client.id.<adlsContainer>.dfs.core.windows.net <CLIENT_ID>
spark.hadoop.fs.azure.account.oauth2.client.secret.<adlsContainer>.dfs.core.windows.net <CLIENT_SECRET>
spark.hadoop.fs.azure.account.oauth2.client.endpoint.<adlsContainer>.dfs.core.windows.net https://login.microsoftonline.com/<TENANT_ID>/oauth2/token

As mentioned in the data access documentation, you need to configure the same Spark configuration properties for the external Hive metastore in the SQL warehouse's data access configuration as you would for "normal" Spark clusters - see the documentation on this topic.
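As a rough illustration only: alongside the storage credentials shown above, the data access configuration would also carry the external metastore properties. The exact values depend on your Hive metastore version and your Azure SQL database, so treat the following as a sketch with hypothetical placeholder values rather than a definitive configuration:
spark.hadoop.javax.jdo.option.ConnectionURL jdbc:sqlserver://<sqlServerName>.database.windows.net:1433;database=<metastoreDbName>
spark.hadoop.javax.jdo.option.ConnectionDriverName com.microsoft.sqlserver.jdbc.SQLServerDriver
spark.hadoop.javax.jdo.option.ConnectionUserName <metastoreDbUser>
spark.hadoop.javax.jdo.option.ConnectionPassword <metastoreDbPassword>
spark.sql.hive.metastore.version <hiveVersion, e.g. 2.3.7>
spark.sql.hive.metastore.jars <jars matching the metastore version, e.g. maven or a DBFS path>
With these properties in place, the SQL warehouse can resolve the same schemas and tables that the Data Science and Engineering clusters see.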

Related

Querying data from external hive metastore

I am trying to configure an external Hive metastore for my Azure Synapse Spark pool. The rationale behind using an external metastore is to share table definitions across Databricks and Synapse workspaces.
However, I am wondering if it's possible to access the backend data via the metastore. For example, can clients like Power BI or Tableau connect to the external metastore and retrieve not just the metadata, but also the business data in the underlying tables?
Also, what additional value does an external metastore provide?
You can configure the external Hive metastore in Synapse by creating a linked service for that external source and then querying it from Synapse.
Follow the steps below to connect to the external Hive metastore.
In the Synapse portal, go to the Manage hub on the left side of the page, then click Linked services. To create a new linked service, click + New.
Search for Azure SQL Database or Azure Database for MySQL for the external Hive metastore; Synapse supports these two as external metastore backends. Select one and click Continue.
Fill in all the required details (name, subscription, server name, database name, username, and password) and test the connection.
You can test the connection to the Hive metastore database using the code below.
%%spark
import java.sql.DriverManager
/** This JDBC URL can be copied from Azure portal > Azure SQL database > Connection strings > JDBC **/
val url = s"jdbc:sqlserver://<servername>.database.windows.net:1433;database=<databasename>;user=<username>;password=<password>;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;"
try {
  val connection = DriverManager.getConnection(url)
  // The VERSION table is part of the Hive metastore schema; reading it confirms both
  // the connection and that the database really contains a metastore
  val result = connection.createStatement().executeQuery("select t.SCHEMA_VERSION from VERSION t")
  result.next()
  println(s"Successful to test connection. Hive Metastore version is ${result.getString(1)}")
} catch {
  case ex: Throwable => println(s"Failed to establish connection:\n $ex")
}
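To actually attach the Synapse Spark pool to this metastore (rather than just testing the JDBC connection), the pool's Spark configuration needs to reference the linked service. As a hedged sketch only, the relevant properties at the time of writing look roughly like the following; verify the exact property names and supported Hive versions against the current Synapse external Hive metastore documentation:
spark.sql.hive.metastore.version <metastore major.minor version, e.g. 2.3>
spark.hadoop.hive.synapse.externalmetastore.linkedservice.name <yourLinkedServiceName>
Once the pool is configured this way, the table definitions stored in the external metastore become visible to Spark in Synapse, which is what enables sharing them with Databricks.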
Can clients like Power BI or Tableau connect to the external metastore and retrieve not just the metadata, but also the business data in the underlying tables?
Yes, Power BI allows you to connect to Azure SQL Database using the built-in connector.
In Power BI Desktop, go to Get Data, click Azure, and select Azure SQL Database. Click Connect.
In the next step, provide the server name in the format <servername>.database.windows.net, along with the database name, username, and password, and you can then access the data in Power BI.

How to connect to data source in HUE?

I have been given access to the Hue Hive platform by my client. I have also raised all the access requests for the database, and all of them have been approved. But I can't see any databases or tables in the Hive interface. Is there a procedure to connect to a database, or should it appear in the Hive interface automatically?

'Use Credential File' option in the Oracle Data Integrator data server - what is it used for?

I want to explore Oracle Data Integrator, but I am not able to understand what the 'Use Credential File' option in the data server does. If anyone can explain it, that would be helpful. I also want to improve the performance of my Oracle Data Integrator script; any ideas on that are welcome as well.
OK, now I think I understand: you run ODI in the cloud.
You will need a credential file in order to connect to your database.
The way you obtain that credential file is:
Credential files are downloaded from the ADW console to the ODI host in Oracle Cloud Infrastructure (OCI).
Note: When ODI is deployed from the Marketplace, client credential folders are downloaded from autonomous databases that exist in the OCI compartment containing ODI.
If ADW is in a different compartment than ODI, follow the steps below.
Download the credentials:
Connect to the ODI host using VNC. Refer to the deployment blog above for details.
Launch Firefox from the Applications > Favorites list.
Follow the steps in Downloading Autonomous Data Warehouse Credentials to obtain the client credentials compressed folder containing the wallet and network configuration files used by ODI to make the connections.
The entire way of connecting is described here.

Create external data source in Azure Synapse Analytics (Azure SQL Data warehouse) to Oracle

I am trying to create an external data source in Azure Synapse Analytics (Azure SQL Data Warehouse) pointing to an external Oracle database. I am using the following code in SSMS to do that:
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'myPassword';
CREATE DATABASE SCOPED CREDENTIAL MyCred WITH IDENTITY = 'myUserName', SECRET = 'Mypassword';
CREATE EXTERNAL DATA SOURCE MyEXTSource
WITH (
    LOCATION = 'oracle://<myIPAddress>:1521',
    CREDENTIAL = MyCred
);
I am getting the following error:
CREATE EXTERNAL DATA SOURCE statement failed because the 'TYPE' option is not specified. Specify a value for the 'TYPE' option and try again.
I understand from the documentation below that TYPE is not a required option for Oracle databases.
https://learn.microsoft.com/en-us/sql/t-sql/statements/create-external-data-source-transact-sql?view=azure-sqldw-latest
I'm not sure what the problem is here. Is this feature still not supported in Azure Synapse Analytics (Azure DW) when it is already available in MS SQL Server 2019? Any ideas are welcome.
PolyBase has different versions across the different products, with different capabilities. Most of these are described in the documentation.
The ability to connect to Oracle is only present in the SQL Server versions, currently SQL Server 2019. The documentation is quite clear that it only applies to SQL Server and not to Azure Synapse Analytics (formerly Azure SQL Data Warehouse):
https://learn.microsoft.com/en-us/sql/relational-databases/polybase/polybase-configure-oracle?view=sql-server-ver15
In summary, Azure Synapse Analytics and its version of PolyBase do not currently support access to external Oracle tables.
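For context, the external data sources that a Synapse dedicated SQL pool does support point at Hadoop-compatible Azure storage rather than Oracle. The following is a minimal sketch under that assumption; the credential, data source, container, and storage account names are all hypothetical placeholders:
-- Hedged sketch: a Synapse dedicated SQL pool external data source over ADLS Gen2, not Oracle.
-- <container>, <storageaccount>, and <storage-account-key> are placeholders.
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'user', SECRET = '<storage-account-key>';

CREATE EXTERNAL DATA SOURCE MyAzureStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'abfss://<container>@<storageaccount>.dfs.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);
Anything Oracle-specific would have to be handled outside the dedicated pool, for example by landing the Oracle data in storage first.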

AWS DMS - incremental migration from Oracle to Redshift

I am new to the AWS DMS service. The plan is to migrate on-prem Oracle to Redshift. Before going into the production environment, I am currently trying out a test Oracle RDS instance in AWS (a small subset of the actual database) as the source. So far I have been successful with the bulk load and incremental migration from RDS to Redshift.
When it comes to on-prem Oracle, particularly for the incremental load:
1) As per the document http://docs.aws.amazon.com/dms/latest/sbs/CHAP_On-PremOracle2Aurora.Steps.ConfigureOracle.html, supplemental logging needs to be enabled on the on-prem database. The plan is to use the following two commands.
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA (PRIMARY KEY) COLUMNS;
The production database has multiple logging locations. Are there any log settings other than the above two that I should be looking into for DMS to pick up multiple log locations?
2) In the same link given, point 4 says 'Create or configure a database account to be used by AWS DMS.'
Where should I create this user? On-prem Oracle or AWS?
How do I configure DMS to use this user?
You need to read this documentation:
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.Oracle.html
For your second question: you need to create a user in the Oracle source database; the section 'Working with a Self-Managed Oracle Database as a Source for AWS DMS' tells you all of the grants you need to give. You then configure the DMS source endpoint to connect with that user's credentials.
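As a rough illustration only (the authoritative, version-specific list is in that section of the documentation), the grants for a self-managed Oracle source look along these lines; dms_user is a placeholder name:
-- Illustrative subset only; see the AWS DMS Oracle source documentation for the full list.
-- dms_user is a hypothetical account name.
GRANT CREATE SESSION TO dms_user;
GRANT SELECT ANY TRANSACTION TO dms_user;
GRANT SELECT ANY TABLE TO dms_user;
GRANT SELECT ON V_$ARCHIVED_LOG TO dms_user;
GRANT SELECT ON V_$LOG TO dms_user;
GRANT SELECT ON V_$LOGFILE TO dms_user;
GRANT SELECT ON V_$DATABASE TO dms_user;
GRANT SELECT ON V_$THREAD TO dms_user;
GRANT SELECT ON V_$PARAMETER TO dms_user;
GRANT SELECT ON V_$TRANSACTION TO dms_user;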
For your first question, if you look at the SQL Server documentation:
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.SQLServer.html
It specifies the limitation of; 'SQL Server backup to multiple disks isn't supported. If the backup is defined to write the database backup to multiple files over different disks, AWS DMS can't read the data and the AWS DMS task fails.'
I can't see a similar stipulation in the Oracle documentation (the first link), so I would hazard a guess that DMS is able, in the case of Oracle, to determine and cope with multiple logging locations from a configuration value inside the database.
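Incidentally, once the two ALTER DATABASE statements from the question have been run, you can confirm supplemental logging from the database itself. A small hedged check using standard Oracle views (the columns should come back as YES):
-- Verify that minimal and primary-key supplemental logging are enabled
SELECT supplemental_log_data_min, supplemental_log_data_pk FROM v$database;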
