I am trying to connect to Teradata and DB2 from PySpark.
I am using the below jars:
tdgssconfig-15.10.00.14.jar
teradata-connector-1.4.1.jar
terajdbc4-15.10.00.14.jar
&
db2jcc4.jar
Connection strings:
df1 = sqlContext.load(source="jdbc", driver="com.teradata.jdbc.TeraDriver", url=db_url,user="db_user",TMODE="TERA",password="db_pwd",dbtable="U114473.EMPLOYEE")
df = sqlContext.read.format('jdbc').options(url='jdbc:db2://10.123.321.9:50000/DB599641',user='******',password='*****',driver='com.ibm.db2.jcc.DB2Driver', dbtable='DSN1.EMPLOYEE')
Both give me a "Driver not found" error.
Can we use JDBC drivers with PySpark?
As James Tobin said, use the pyspark2 --jars /jarpath option when you start your PySpark session, or pass --jars when you submit your .py file to Spark.
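For example, a minimal sketch (the jar paths, host, credentials, and table names are illustrative placeholders, and it assumes the Spark 2.x spark session available in the shell):
# Start the shell with the driver jars on the classpath, e.g.:
#   pyspark --jars /path/to/tdgssconfig.jar,/path/to/terajdbc4.jar,/path/to/db2jcc4.jar
# Then the generic JDBC source can locate both drivers:
df_td = (spark.read.format("jdbc")
         .option("url", "jdbc:teradata://td_host/TMODE=TERA")
         .option("driver", "com.teradata.jdbc.TeraDriver")
         .option("dbtable", "U114473.EMPLOYEE")
         .option("user", "db_user")
         .option("password", "db_pwd")
         .load())

df_db2 = (spark.read.format("jdbc")
          .option("url", "jdbc:db2://10.123.321.9:50000/DB599641")
          .option("driver", "com.ibm.db2.jcc.DB2Driver")
          .option("dbtable", "DSN1.EMPLOYEE")
          .option("user", "db_user")
          .option("password", "db_pwd")
          .load())
The same --jars list also works with spark-submit.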
Related
I'm trying to get the data from my Oracle database into a Databricks cluster, but I think I'm doing it wrong:
On the cluster library I just installed ojdbc8.jar, and after that I opened a notebook and did this to connect:
CREATE TABLE oracle_table
USING org.apache.spark.sql.jdbc
OPTIONS (
dbtable 'table_name',
driver 'oracle.jdbc.driver.OracleDriver',
user 'username',
password 'pasword',
url 'jdbc:oracle:thin://@<hostname>:1521/<db>')
And it says:
java.sql.SQLException: Invalid Oracle URL specified
Can someone help? I've been reading documentation, but there's no clear step-by-step instruction on how I should actually install this jar. Am I using the wrong jar? Thanks!
I have managed to set this up in Python/PySpark as follows:
jdbcUrl = "jdbc:oracle:thin:@//hostName:port/databaseName"
connectionProperties = {
"user" : username,
"password" : password,
"driver" : "oracle.jdbc.driver.OracleDriver"
}
query = "(select * from mySchema.myTable )"
df = spark.read.jdbc(url=jdbcUrl, table=query, properties=connectionProperties)
I am using the Oracle JDBC Thin Driver instantclient-basic-linux.x64-21.5.0.0.0, as available on the Oracle webpages. The current version is 21.7 I think, but it should work the same way.
Check this link to understand the two different notations for jdbc URLs
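For reference, a quick sketch of the two thin-driver notations (the host, port, and names are illustrative):
# Service-name (EZConnect-style) notation:
url_service_name = "jdbc:oracle:thin:@//hostName:1521/serviceName"
# Older SID notation:
url_sid = "jdbc:oracle:thin:@hostName:1521:SID"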
Update to the question (6/21):
Background about Simba:
The Simba Google BigQuery JDBC Connector is delivered in a ZIP archive named SimbaBigQueryJDBC42-[Version].zip, where [Version] is the version number of the connector.
The archive contains the connector supporting the JDBC API version indicated in the archive name, as well as release notes and third-party license information.
I'm trying to connect to BigQuery from PySpark (Docker) using the Simba JDBC driver, with no success. I have reviewed many posts here but couldn't find a clue.
This is my code, which I just submit from VC within the Spark Docker image:
import pyspark
from pyspark import SparkConf
from pyspark.sql import SQLContext, SparkSession
import os
from glob import glob
my_jar = glob('/root/Downloads/BigQuery/simba_jdbc_1.2.4.1007/*.jar')
my_jar_str = ','.join(my_jar)
print(my_jar_str)
sc_conf = SparkConf()
sc_conf.setAppName("testApp")
sc_conf.setMaster('local[*]')
sc_conf.set("spark.jars", my_jar_str)
sc = pyspark.SparkContext(conf=sc_conf)
spark = SparkSession \
.builder \
.master('local') \
.appName('spark-read-from-bigquery') \
.config("spark.executor.extraClassPath",my_jar_str) \
.config("spark.driver.extraClassPath",my_jar_str) \
.config("spark.jars", my_jar_str)\
.getOrCreate()
myJDBC = '''
jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;OAuthType={OAuthType};ProjectId={ProjectId};OAuthServiceAcctEmail={OAuthServiceAcctEmail};OAuthPvtKeyPath={OAuthPvtKeyPath};
'''.format(OAuthType=0,
ProjectId='ProjectId',
OAuthServiceAcctEmail="etl@dProjectId.iam.gserviceaccount.com",
OAuthPvtKeyPath="/workspaces/code/secrets/etl.json")
pgDF = spark.read \
.format("jdbc") \
.option("url", myJDBC) \
.option("driver", "com.simba.googlebigquery.jdbc42.Driver") \
.option("dbtable", my_query) \
.load()
I'm getting this error:
File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o79.load.
: java.lang.NullPointerException
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:71)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:210)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
Is this due to missing jars, or is the logic wrong?
Any clue is appreciated.
To anyone who might have the same thought: I just found that Simba does not support Spark; instead, I have to follow the steps in https://github.com/GoogleCloudDataproc/spark-bigquery-connector.
The open issue (as of 6/23) is that I don't use Dataproc but standalone Spark, so I need to figure out how to collect a consistent set of supporting jars.
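For what it's worth, a minimal sketch of reading through the spark-bigquery-connector instead of the Simba JDBC driver; the package version, credentials path, and table name below are assumptions, not tested values:
from pyspark.sql import SparkSession

# Pull the connector via spark.jars.packages instead of shipping local jars.
spark = (SparkSession.builder
         .appName("spark-read-from-bigquery")
         .config("spark.jars.packages",
                 "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.24.2")
         .getOrCreate())

# Read a table directly with the bigquery data source (no JDBC URL needed).
df = (spark.read.format("bigquery")
      .option("credentialsFile", "/workspaces/code/secrets/etl.json")  # service-account key (illustrative path)
      .option("table", "my_project.my_dataset.my_table")               # illustrative table
      .load())
df.show()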
If ODBC also works for you, maybe this can help.
First, download and configure the ODBC driver from here:
Next, use the connection like this (note the IgnoreTransactions parameter):
import pyodbc
import pandas as pd
conn = pyodbc.connect(r'Driver={Simba ODBC Driver for Google BigQuery};OAuthMechanism=0;Catalog=<projectID>;KeyFilePath=<path to json credentials>;Email=<email of service account>;IgnoreTransactions=1')
qry = 'select * from <path to your table>'
data = pd.read_sql(qry,conn)
I had a problem with the error: Error converting value to long
My solution was to create a jar file in Java which includes a JDBC dialect:
https://github.com/Fox-sv/spark-bigquery
from pyspark.sql import SparkSession
from py4j.java_gateway import java_import
user_email = "EMAIL"
project_id = "PROJECT_ID"
creds = "PATH_TO_FILE"
jdbc_conn = f"jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;OAuthServiceAcctEmail={user_email};ProjectId={project_id};OAuthPvtKeyPath={creds};"
spark = SparkSession.builder.getOrCreate()
jvm = spark.sparkContext._gateway.jvm
java_import(jvm, "MyDialect")
jvm.org.apache.spark.sql.jdbc.JdbcDialects.registerDialect(jvm.MyDialect().change_dialect())
df = spark.read.jdbc(url=jdbc_conn,table='(SELECT * FROM babynames.names_2014) AS table')
I tried to load a table into R using h2o but got the following error:
my_data <- h2o.import_sql_table(my_sql_conn, table, username, password)
ERROR: Unexpected HTTP Status code: 500 Server Error (url = http://localhost:54321/99/ImportSQLTable)
java.lang.RuntimeException [1] "java.lang.RuntimeException: SQLException: No suitable driver found for jdbc:mysql://10.140.20.29/MySQL?&useSSL=false\nFailed to connect and read from SQL database with connection_url: jdbc:mysql://10.140.20.29/MySQL?&useSSL=false"
Can someone help me with this? Thank you so much!
You need a supported JDBC driver (built on JDBC 4.2 Core) to connect from H2O to SQL Server. You can download Microsoft JDBC Driver 4.2 for SQL Server from the link below first:
https://www.microsoft.com/en-us/download/details.aspx?id=54671
After that, please follow the article below to first test the JDBC driver from the R/Python H2O client and then connect to your database:
https://aichamp.wordpress.com/2017/03/20/building-h2o-glm-model-using-postgresql-database-and-jdbc-driver/
The above article is for Postgres; however, you can use it with SQL Server by using an appropriate driver.
For Windows, remember to use ; instead of : in the -cp argument.
java -Xmx4g -cp sqljdbc42.jar;h2o.jar water.H2OApp -port 3333
water.H2OApp is the main class in h2o.jar.
Important note: SQL Server is not supported so far (as of August 2017).
You may use MariaDB to load datasets:
From Windows console:
java -Xmx4G -cp mariadb-java-client-2.1.0.jar;h2o.jar water.H2OApp -port 3333
Note: for Linux, replace ";" with ":".
From R:
sqlConn <- "jdbc:mariadb://10.106.7.46:3306/DBName"
userName <- "dbuser"
userPass <- "dbpass."
sql_Query <- "SELECT * FROM dbname.tablename;"
mydata <- h2o.import_sql_select( sqlConn, sql_Query, userName, userPass )
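A rough Python H2O equivalent of the R snippet above, assuming H2O was started with the MariaDB driver on its classpath (the host, credentials, and table are the illustrative values from this answer):
import h2o

# Attach to the H2O instance started with the -cp command shown above.
h2o.connect(ip="localhost", port=3333)

# Pull the result of the SELECT into an H2OFrame over JDBC.
mydata = h2o.import_sql_select(
    connection_url="jdbc:mariadb://10.106.7.46:3306/DBName",
    select_query="SELECT * FROM dbname.tablename",
    username="dbuser",
    password="dbpass")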
I cannot connect to my Google BigQuery dataset via the Simba JDBC driver.
I want to connect from an R application using the RJDBC package. I set the parameters as follows:
library(RJDBC)
driver <- JDBC(driverClass = "com.simba.googlebigquery.jdbc42.Driver", classPath = "~/JDBC/GoogleBigQueryJDBC42.jar", identifier.quote = "'")
conn <- dbConnect(driver,"jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=My_project_Id;OAuthType=1;")
but I receive an error saying:
Error in .jcall(drv#jdrv, "Ljava/sql/Connection;", "connect", as.character(url)[1], :
java.lang.NoClassDefFoundError: com/google/api/client/json/JsonFactory
Can you please tell me what I am doing wrong?
I found the problem: I should have added the required libraries to the Java classpath. So in R I executed the following commands:
.jaddClassPath("jackson-core-2.1.3.jar")
.jaddClassPath("google-oauth-client-1.22.0.jar")
.jaddClassPath("google-http-client-jackson2-1.22.0.jar")
.jaddClassPath("google-http-client-1.22.0.jar")
.jaddClassPath("GoogleBigQueryJDBC41.jar")
.jaddClassPath("google-api-services-bigquery-v2-rev320-1.22.0.jar")
.jaddClassPath("google-api-client-1.22.0.jar")
I use the getLastProcessedVal2 UDF in Hive to get the latest partitions from a table. This UDF is written in Java. I would like to use the same UDF from PySpark using a Hive context.
dfsql_sel_nxt_batch_id_ini = sqlContext.sql(''' select l4_xxxx_seee.getLastProcessedVal2("/data/l4/work/hive/l4__stge/proctl_stg","APP_AMLMKTE_L1","L1_AMLMKT_MDWE","TRE_EXTION","2.1")''')
Error:
ERROR exec.FunctionRegistry: Unable to load UDF class:
java.lang.ClassNotFoundException:
Start your PySpark shell as:
pyspark --jars /path/to/udf.jar <all-other-params>
OR
submit your PySpark job with the --jars option as:
spark-submit --jars /path/to/udf.jar <all-other-params>
You could register that user-defined function using the SQLContext udf method; note that you have to pass a string as the first parameter, which will represent the name of your UDF in SQL queries.
e.g.
sqlContext.udf().register("slen",
(String arg1) -> arg1.length(),
DataTypes.IntegerType);
sqlContext.sql("SELECT slen(name) FROM user").show();
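For completeness, a rough PySpark equivalent of the Java snippet above (the function name, table, and column are the same illustrative ones):
from pyspark.sql.types import IntegerType

# Register a Python function under the name "slen" for use in SQL queries.
sqlContext.udf.register("slen", lambda s: len(s), IntegerType())
sqlContext.sql("SELECT slen(name) FROM user").show()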