Is it possible to import data from a SQL database using Sqoop into a blob storage account other than the default HDInsight cluster blob storage?
Even if I set the Azure storage container's access level to "Public Blob", I get the error message "Container testcontainer in account nondefaultstorage.blob.core.windows.net not found, and we can't create it using anonymous credentials."
This is the sqoop command I am running:
import
--connect jdbc:sqlserver://sqlServerName;user=sqlLogin;password=sqlPass;database=sqlDbName
--table tableName
--target-dir wasb://testcontainer@nondefaultstorage.blob.core.windows.net/data/csv
It should work with linked storage accounts or with public containers. Public Blob access won't work, because at that level the container listing is not available. For more information on the three access types, see https://azure.microsoft.com/en-us/documentation/articles/storage-manage-access-to-resources/#restrict-access-to-containers-and-blobs
Please note that Public Container and Public Blob only grant read access to anonymous users; you still need a Shared Access Signature or the shared key in order to write.
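One way to make the non-default account reachable is to give the job the storage account key. A minimal sketch, assuming the import is run from the cluster's command line and the key is passed as a Hadoop property (it could equally be added to core-site.xml); the account, container and key values are placeholders:

sqoop import \
  -D fs.azure.account.key.nondefaultstorage.blob.core.windows.net=<STORAGE_ACCOUNT_KEY> \
  --connect "jdbc:sqlserver://sqlServerName;user=sqlLogin;password=sqlPass;database=sqlDbName" \
  --table tableName \
  --target-dir wasb://testcontainer@nondefaultstorage.blob.core.windows.net/data/csv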
I have Azure Databricks set up with an external Hive metastore (backed by Azure SQL), and the metastore connection URL is configured in the Databricks cluster's advanced settings. This way I am able to see and access the Delta Lake tables (stored on an Azure storage account, ADLS) in the Data section of Databricks.
Now I want my users to access these tables through Databricks' SQL workspace. I have configured 'Data access' using a service principal in the SQL warehouse section.
Per the Databricks SQL documentation, I am supposed to see the same Delta Lake tables that I can see through the 'Data Science and Engineering' section, but I can't see any schemas or tables from the metastore.
The problem: I am not able to see the tables under 'SQL workspace' > Data, and I am puzzled about how it would even know where my external metastore is and what the schema definitions are.
I assume something in the SQL workspace has to be set up to point at the Hive metastore connection, but I am not sure, as the Databricks documentation is not very clear on this point.
Please suggest
Below are the 'Data access' details for the service principal:
spark.hadoop.fs.azure.account.auth.type.<adlsContainer>.dfs.core.windows.net OAuth
spark.hadoop.fs.azure.account.oauth.provider.type.<adlsContainer>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.hadoop.fs.azure.account.oauth2.client.id.<adlsContainer>.dfs.core.windows.net <CLIENT_ID>
spark.hadoop.fs.azure.account.oauth2.client.secret.<adlsContainer>.dfs.core.windows.net <CLIENT_SECRET>
spark.hadoop.fs.azure.account.oauth2.client.endpoint.<adlsContainer>.dfs.core.windows.net https://login.microsoftonline.com/<TENANT_ID>/oauth2/token
As mentioned in the data access documentation, you need to configure the same Spark configuration properties for the external Hive metastore in the SQL warehouse's data access configuration as you would for "normal" Spark clusters - see the documentation on this topic.
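For reference, a sketch of the kind of properties that go into SQL warehouse > Data access configuration for an Azure SQL backed metastore; the JDBC URL, credentials and Hive version are placeholders, and spark.sql.hive.metastore.jars builtin is only valid when the version matches the Hive client bundled with Databricks (otherwise point it at maven or a jar path):

spark.sql.hive.metastore.version 2.3.9
spark.sql.hive.metastore.jars builtin
spark.hadoop.javax.jdo.option.ConnectionDriverName com.microsoft.sqlserver.jdbc.SQLServerDriver
spark.hadoop.javax.jdo.option.ConnectionURL jdbc:sqlserver://<metastore-server>.database.windows.net:1433;database=<metastore-db>
spark.hadoop.javax.jdo.option.ConnectionUserName <metastore-user>
spark.hadoop.javax.jdo.option.ConnectionPassword <metastore-password>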
I am trying to configure an external Hive metastore for my Azure Synapse Spark pool. The rationale behind using an external metastore is to share table definitions across Databricks and Synapse workspaces.
However, I am wondering if it's possible to access the backend data via the metastore. For example, can clients like Power BI or Tableau connect to the external metastore and retrieve not just the metadata, but also the business data in the underlying tables?
Also, what additional value does an external metastore provide?
You can configure an external Hive metastore in Synapse by creating a Linked Service for that external source and then querying it from a Synapse serverless pool.
Follow the steps below to connect to the external Hive metastore.
In the Synapse portal, go to the Manage icon on the left side of the page, click it, and then click Linked services. To create the new linked service, click + New.
Search for Azure SQL Database or Azure Database for MySQL for the external Hive metastore; Synapse supports these two as external metastore databases. Select one and click Continue.
Fill in all the required details (name, subscription, server name, database name, username and password) and test the connection.
You can test the connection to the metastore database using the code below.
%%spark
import java.sql.DriverManager

// This JDBC URL can be copied from Azure portal > Azure SQL Database > Connection strings > JDBC.
val url = s"jdbc:sqlserver://<servername>.database.windows.net:1433;database=<databasename>;user=utkarsh;password=<password>;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;"

try {
  val connection = DriverManager.getConnection(url)
  val result = connection.createStatement().executeQuery("select * from dbo.persons")
  result.next()
  println(s"Successfully tested the connection. First column of the first row: ${result.getString(1)}")
  connection.close()
} catch {
  case ex: Throwable => println(s"Failed to establish connection:\n $ex")
}
Can clients like Power BI or Tableau connect to the external metastore and retrieve not just the metadata, but also the business data in the underlying tables?
Yes, Power BI can connect to Azure SQL Database using its built-in connector.
In Power BI Desktop, go to Get Data, click Azure and select Azure SQL Database, then click Connect.
In the next step, give the server name in the format <servername>.database.windows.net, plus the database name, username and password, and you can then access the data in Power BI.
We have an RDS (Oracle) instance and I need to export a specific schema into a dump file. The export works and writes the dump file into DATA_PUMP_DIR. The issue is that RDS does not provide file-system access to that directory.
I need the exported DMP file either on S3 or copied to another EC2 instance.
The article (LINK) talks about copying a dump file between two RDS instances, but not to S3 or to EC2.
Third option. I am using it.
Take a look at alexandria-plsql-utils project, and especially look at: amazon_aws_auth_pkg, amazon_aws_s3_pkg and ftp_util_pkg packages.
Install required packages and dependencies.
Do your dump, then with example code like the one below you can copy a file from Amazon RDS Oracle into an S3 bucket.
declare
  b_blob blob;
begin
  -- read the dump file from the DATA_PUMP_DIR directory into a BLOB
  b_blob := file_util_pkg.get_blob_from_file('DATA_PUMP_DIR', 'my_dump.dmp');
  -- authenticate against AWS, then upload the BLOB as an S3 object
  amazon_aws_auth_pkg.init('aws_key_id', 'aws_secret', p_gmt_offset => 0);
  amazon_aws_s3_pkg.new_object('my-bucket-name', 'my_dump.dmp', b_blob, 'application/octet-stream');
end;
/
There are several ways to solve this problem.
First option.
1. Install the free Oracle XE database on an EC2 instance (it is very easy and fast).
2. Export the schema from the RDS instance into the DATA_PUMP_DIR directory. Use the DBMS_DATAPUMP package, or run expdp user/pass@rds on the EC2 instance, to create the dump file.
3. Create a database link on the RDS instance between the RDS DB and the Oracle XE DB. If you are creating a database link between two DB instances inside the same VPC or peered VPCs, the two DB instances must have a valid route between them. See "Adjusting Database Links for Use with DB Instances in a VPC".
4. Copy the dump file from the RDS instance to the Oracle XE DB on EC2 using DBMS_FILE_TRANSFER.PUT_FILE over the database link (a sketch follows this list).
5. Copy the files from the DATA_PUMP_DIR directory of Oracle XE on the EC2 instance to S3.
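A minimal sketch of the PUT_FILE step, run on the RDS instance; the database link name (TO_XE) and file names are placeholders:

begin
  -- push the dump file from DATA_PUMP_DIR on RDS to DATA_PUMP_DIR on the Oracle XE side,
  -- over the database link TO_XE
  dbms_file_transfer.put_file(
    source_directory_object      => 'DATA_PUMP_DIR',
    source_file_name             => 'my_dump.dmp',
    destination_directory_object => 'DATA_PUMP_DIR',
    destination_file_name        => 'my_dump.dmp',
    destination_database         => 'TO_XE');
end;
/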
Second option.
Use the obsolete utility exp to export. It has restrictions on the export of certain types of data and is slower.
Run exp user/password@rds on the EC2 instance (a sketch follows these steps).
Copy the files from the export directory on the EC2 instance to S3.
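A minimal sketch of such an exp call; the schema, file and log names are placeholders:

exp user/password@rds FILE=schema_export.dmp OWNER=schema_name LOG=schema_export.log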
Original Export is desupported for general use as of Oracle Database 11g. The only supported use of Original Export in 11g is backward migration of XMLType data to a database version 10g release 2 (10.2) or earlier. Therefore, Oracle recommends that you use the new Data Pump Export and Import utilities, except in the following situations which require Original Export and Import:
Original Export and Import
It's now possible to directly access an S3 bucket from an Oracle database on RDS. Please have a look at the following documentation: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/oracle-s3-integration.html
And here the official news that this is supported: https://aws.amazon.com/about-aws/whats-new/2019/02/Amazon-RDS-for-Oracle-Now-Supports-Amazon-S3-Integration/?nc1=h_ls
It seems that the first post was a little bit too early to get this news. But that post still lists further good solutions, like the database link.
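With the S3 integration option added to the RDS instance, an upload of the dump file can be started with a call along these lines (bucket name and prefixes are placeholders; the call returns a task ID whose log you can check for completion):

SELECT rdsadmin.rdsadmin_s3_tasks.upload_to_s3(
         p_bucket_name    => 'my-bucket-name',
         p_prefix         => 'my_dump.dmp',
         p_s3_prefix      => '',
         p_directory_name => 'DATA_PUMP_DIR') AS task_id
FROM dual;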
I need to migrate an existing application's database into an Oracle RDS database in Amazon Web Services.
I have the dump file, which is residing on an EC2 instance. The dump was not taken by me. I would also like to know how to take the dump so that it can be imported successfully. The EC2 instance has a regular Oracle client installed.
I have set up the Oracle RDS instance in AWS and I am able to connect to the server.
I would like to know how I can import the database dump into RDS.
I am using this command:
imp rdsuser#oracledb FILE=fulldb.dmp TOUSER=rdsuser FROMUSER=SYSTEM log=test.log buffer=100000
Any lead is appreciated.
I would also like to know which is the best method to import an existing database:
1. Take a dump, or
2. Clone all of the database files (which would require downtime on the server).
The best strategy is to take a dump and then import it into RDS. If your DB is too big, contact AWS support for help.
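A minimal sketch of such an import, run from the EC2 client against the RDS endpoint, assuming the dump was taken with expdp (a dump taken with the old exp tool can only be read by imp) and has already been transferred into the RDS instance's DATA_PUMP_DIR, for example via DBMS_FILE_TRANSFER or the S3 integration; the names are placeholders:

impdp rdsuser@oracledb DIRECTORY=DATA_PUMP_DIR DUMPFILE=fulldb.dmp REMAP_SCHEMA=SYSTEM:rdsuser LOGFILE=import.log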
Folks,
Here I have a client question. I want to get tables from SQL Server (an RDBMS) into my HDFS (Hadoop cluster), but the servers are in different locations.
1) What is the best way to access the server, given that the amount of data is huge?
2) Connecting to one server is fine, but we have many servers around the globe and have to get data from all of them.
3) Can we connect with Sqoop remotely to get the data into HDFS?
Your question is a little bit unclear, but yes, you can use sqoop to import the data from your servers into HDFS. You need to specify the connection parameters when importing the data:
sqoop import --connect <JDBC connection string> --table <tablename> --username <username> --password <password>
If you need to do multiple imports from multiple servers, I suggest you try Oozie to automate these imports. You can find a tutorial to achieve that here.
Before writing the Sqoop import, you need to have a user for each remote node that is recognized by your source DB. For example (MySQL):
create user 'username'@'<ip of remote node>' IDENTIFIED BY 'password';
You must also grant these users the appropriate permissions, depending on your requirements, for example as sketched below.
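A minimal sketch of such a grant (the database name is a placeholder; in practice grant only the privileges Sqoop needs, typically SELECT):

GRANT SELECT ON <database_name>.* TO 'username'@'<ip of remote node>';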
Then you can frame the Sqoop import; one example is below:
$SQOOP_HOME/bin/sqoop import --connect jdbc:mysql://<ip address of remote server node>:port_number/<database_name> --username user --password password --table <table to import>
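For completeness, a variant of the same import that also controls where the data lands in HDFS and how many parallel map tasks Sqoop uses; the target directory and mapper count are placeholders:

$SQOOP_HOME/bin/sqoop import --connect jdbc:mysql://<ip address of remote server node>:port_number/<database_name> --username user --password password --table <table to import> --target-dir /user/<hdfs_user>/<table to import> -m 4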
This question was 5 months old by the time of this answer, so I hope the issue has already been resolved, but this is here in case someone wants a step-by-step procedure for this requirement.
Regards,
Adil