How to connect to Teradata from pyspark?

I am trying to connect to Teradata and DB2 from Pyspark.
I am using the below jars:
tdgssconfig-15.10.00.14.jar
teradata-connector-1.4.1.jar
terajdbc4-15.10.00.14.jar
&
db2jcc4.jar
Connection strings:
df1 = sqlContext.load(source="jdbc", driver="com.teradata.jdbc.TeraDriver", url=db_url, user="db_user", TMODE="TERA", password="db_pwd", dbtable="U114473.EMPLOYEE")
df = sqlContext.read.format('jdbc').options(url='jdbc:db2://10.123.321.9:50000/DB599641', user='******', password='*****', driver='com.ibm.db2.jcc.DB2Driver', dbtable='DSN1.EMPLOYEE')
Both give me a "Driver not found" error.
Can we use JDBC drivers for pyspark?

As James Tobin said, use the pyspark2 --jars /jarpath option when you start your pyspark session, or pass --jars when you submit your .py file to Spark.
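For example, a minimal sketch of a Teradata read (the host, database, and credentials are placeholders, and the URL parameter syntax is an assumption based on the Teradata JDBC driver's DATABASE/TMODE options):

pyspark2 --jars tdgssconfig-15.10.00.14.jar,terajdbc4-15.10.00.14.jar,db2jcc4.jar

# inside the shell, the driver class can now be resolved
df = sqlContext.read.format("jdbc") \
    .option("url", "jdbc:teradata://<td-host>/DATABASE=U114473,TMODE=TERA") \
    .option("driver", "com.teradata.jdbc.TeraDriver") \
    .option("user", "db_user") \
    .option("password", "db_pwd") \
    .option("dbtable", "U114473.EMPLOYEE") \
    .load()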

Related

How to install JDBC driver on Databricks Cluster?

I'm trying to get the data from my Oracle Database to a Databricks Cluster. But I think I'm doing it wrong:
On the cluster library I installed the ojdbc8.jar, and after that I opened a notebook and did this to connect:
CREATE TABLE oracle_table
USING org.apache.spark.sql.jdbc
OPTIONS (
dbtable 'table_name',
driver 'oracle.jdbc.driver.OracleDriver',
user 'username',
password 'pasword',
url 'jdbc:oracle:thin://#<hostname>:1521/<db>')
And it says:
java.sql.SQLException: Invalid Oracle URL specified
Can someone help? I've been reading the documentation but there are no clear instructions on how I should actually install this jar step by step. Maybe I'm using the wrong jar? Thanks!
I have managed to set this up in Python/PySpark as follows:
jdbcUrl = "jdbc:oracle:thin:@//hostName:port/databaseName"
connectionProperties = {
  "user" : username,
  "password" : password,
  "driver" : "oracle.jdbc.driver.OracleDriver"
}
query = "(select * from mySchema.myTable) myTable"
df = spark.read.jdbc(url=jdbcUrl, table=query, properties=connectionProperties)
I am using the Oracle JDBC Thin Driver instantclient-basic-linux.x64-21.5.0.0.0, as available on the Oracle web pages. The current version is 21.7, I think, but it should work the same way.
Check this link to understand the two different notations for JDBC URLs.
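For reference, a short sketch of the two thin-driver notations (the host, port, SID, and service name below are placeholders; connectionProperties is the dictionary from the snippet above):

# older SID-based notation: jdbc:oracle:thin:@host:port:SID
sid_url = "jdbc:oracle:thin:@myhost:1521:ORCL"

# service-name notation (note the //): jdbc:oracle:thin:@//host:port/service_name
service_url = "jdbc:oracle:thin:@//myhost:1521/myservice"

df = spark.read.jdbc(url=service_url, table="mySchema.myTable", properties=connectionProperties)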

Connect to BigQuery from pyspark using simba JDBC

Update to the question (6/21):
Background about Simba:
The Simba Google BigQuery JDBC Connector is delivered in a ZIP archive named SimbaBigQueryJDBC42-[Version].zip, where [Version] is the version number of the connector.
The archive contains the connector supporting the JDBC API version indicated in the archive name, as well as release notes and third-party license information.
I'm trying to connect to BigQuery from pyspark (docker) using the Simba JDBC driver, with no success. I have reviewed many posts here but couldn't find a clue.
My code, which I just submit from VC within the Spark docker image:
import pyspark
from pyspark import SparkConf
from pyspark.sql import SQLContext, SparkSession
import os
from glob import glob
my_jar = glob('/root/Downloads/BigQuery/simba_jdbc_1.2.4.1007/*.jar')
my_jar_str = ','.join(my_jar)
print(my_jar_str)
sc_conf = SparkConf()
sc_conf.setAppName("testApp")
sc_conf.setMaster('local[*]')
sc_conf.set("spark.jars", my_jar_str)
sc = pyspark.SparkContext(conf=sc_conf)
spark = SparkSession \
    .builder \
    .master('local') \
    .appName('spark-read-from-bigquery') \
    .config("spark.executor.extraClassPath", my_jar_str) \
    .config("spark.driver.extraClassPath", my_jar_str) \
    .config("spark.jars", my_jar_str) \
    .getOrCreate()
myJDBC = '''
jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;OAuthType={OAuthType};ProjectId={ProjectId};OAuthServiceAcctEmail={OAuthServiceAcctEmail};OAuthPvtKeyPath={OAuthPvtKeyPath};
'''.format(OAuthType=0,
           ProjectId='ProjectId',
           OAuthServiceAcctEmail="etl@dProjectId.iam.gserviceaccount.com",
           OAuthPvtKeyPath="/workspaces/code/secrets/etl.json")
pgDF = spark.read \
    .format("jdbc") \
    .option("url", myJDBC) \
    .option("driver", "com.simba.googlebigquery.jdbc42.Driver") \
    .option("dbtable", my_query) \
    .load()
I'm getting this error:
File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o79.load.
: java.lang.NullPointerException
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:71)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:210)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
Is that a missing jar, or is the logic wrong?
Any clue is appreciated.
To anyone who might have the same thought: I just found that Simba does not support Spark; instead I have to follow the steps in https://github.com/GoogleCloudDataproc/spark-bigquery-connector.
The open issue (as of 6/23) is that I don't use Dataproc but rather standalone Spark, so I need to figure out how to collect a consistent set of supporting jars.
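For standalone Spark, a minimal sketch of what that connector's usage looks like (the connector version, dataset/table name, and credentials path are assumptions; check the repository's README for the artifact matching your Spark/Scala build):

# pyspark --packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.28.0
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('spark-read-from-bigquery').getOrCreate()

# read a BigQuery table directly through the connector; no JDBC URL or Simba driver involved
df = spark.read.format("bigquery") \
    .option("credentialsFile", "/workspaces/code/secrets/etl.json") \
    .option("table", "ProjectId.babynames.names_2014") \
    .load()
df.show()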
If ODBC also works for you, maybe this can help.
First, download and configure the ODBC driver from here:
Next - use the connection like this (note the IgnoreTransactions parameter):
import pyodbc
import pandas as pd
conn = pyodbc.connect(r'Driver={Simba ODBC Driver for Google BigQuery};OAuthMechanism=0;Catalog=<projectID>;KeyFilePath=<path to json credentials>;Email=<email of service account>;IgnoreTransactions=1')
qry = 'select * from <path to your table>'
data = pd.read_sql(qry,conn)
I had a problem with the error: Error converting value to long
My solution was to create a jar file from Java which includes a JDBC dialect:
https://github.com/Fox-sv/spark-bigquery
from pyspark.sql import SparkSession
from py4j.java_gateway import java_import
user_email = "EMAIL"
project_id = "PROJECT_ID"
creds = "PATH_TO_FILE"
jdbc_conn = f"jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;OAuthServiceAcctEmail={user_email};ProjectId={project_id};OAuthPvtKeyPath={creds};"
spark = SparkSession.builder.getOrCreate()
jvm = spark.sparkContext._gateway.jvm
java_import(jvm, "MyDialect")
jvm.org.apache.spark.sql.jdbc.JdbcDialects.registerDialect(jvm.MyDialect().change_dialect())
df = spark.read.jdbc(url=jdbc_conn,table='(SELECT * FROM babynames.names_2014) AS table')

How to load table from SQL server using H2o in R?

I tried to load a table into R using h2o but got the following error:
my_data <- h2o.import_sql_table(my_sql_conn, table, username, password)
ERROR: Unexpected HTTP Status code: 500 Server Error (url = http://localhost:54321/99/ImportSQLTable)
java.lang.RuntimeException [1] "java.lang.RuntimeException: SQLException: No suitable driver found for jdbc:mysql://10.140.20.29/MySQL?&useSSL=false\nFailed to connect and read from SQL database with connection_url: jdbc:mysql://10.140.20.29/MySQL?&useSSL=false"
Can someone help me with this? Thank you so much!
You need a supported JDBC driver (built on JDBC 4.2 Core) to connect from H2O to SQL Server. You can download the Microsoft JDBC Driver 4.2 for SQL Server from the link below first:
https://www.microsoft.com/en-us/download/details.aspx?id=54671
After that, please follow the article below to first test the JDBC driver from the R/Python H2O client and then connect to your database:
https://aichamp.wordpress.com/2017/03/20/building-h2o-glm-model-using-postgresql-database-and-jdbc-driver/
The above article is for Postgres; however, you can use it with SQL Server by using an appropriate driver.
For Windows, remember to use ; instead of : for the -cp argument.
java -Xmx4g -cp sqljdbc42.jar;h2o.jar water.H2OApp -port 3333
water.H2OApp is the main class in h2o.jar.
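Once H2O is up with the driver on its classpath, a minimal sketch of the same import from the Python H2O client (the host, port, connection URL, table, and credentials below are placeholders):

import h2o

# attach to the H2O instance started above on port 3333
h2o.connect(ip="localhost", port=3333)

# import the table over JDBC using the driver that was put on the classpath via -cp
connection_url = "jdbc:sqlserver://10.140.20.29:1433;databaseName=MyDB"
my_data = h2o.import_sql_table(connection_url, "dbo.mytable", "db_user", "db_pass")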
Important Note: SQL Server is not supported so far (as of August 2017).
You may use MariaDB to load datasets:
From Windows console:
java -Xmx4G -cp mariadb-java-client-2.1.0.jar;h2o.jar water.H2OApp -port 3333
Note: for Linux, replace ";" with ":".
From R:
sqlConn <- "jdbc:mariadb://10.106.7.46:3306/DBName"
userName <- "dbuser"
userPass <- "dbpass."
sql_Query <- "SELECT * FROM dbname.tablename;"
mydata <- h2o.import_sql_select( sqlConn, sql_Query, userName, userPass )

Connect to Google BigQuery in R using Simba JDBC driver

I cannot connect to my Google Bigquery dataset via Simba JDBC driver.
I want to connect from an R application using the RJDBC package. I set the parameters as follows:
library(RJDBC)
driver <- JDBC(driverClass = "com.simba.googlebigquery.jdbc42.Driver", classPath = "~/JDBC/GoogleBigQueryJDBC42.jar", identifier.quote = "'")
conn <- dbConnect(driver,"jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=My_project_Id;OAuthType=1;")
but I receive an error saying:
Error in .jcall(drv#jdrv, "Ljava/sql/Connection;", "connect", as.character(url)[1], :
java.lang.NoClassDefFoundError: com/google/api/client/json/JsonFactory
Please tell me what I am doing wrong.
I found the problem: I should have added the required libraries to the Java classpath. So in R I executed the following commands:
.jaddClassPath("jackson-core-2.1.3.jar")
.jaddClassPath("google-oauth-client-1.22.0.jar")
.jaddClassPath("google-http-client-jackson2-1.22.0.jar")
.jaddClassPath("google-http-client-1.22.0.jar")
.jaddClassPath("GoogleBigQueryJDBC41.jar")
.jaddClassPath("google-api-services-bigquery-v2-rev320-1.22.0.jar")
.jaddClassPath("google-api-client-1.22.0.jar")

How to call a hive UDF written in Java using Pyspark from Hive Context

I use the getLastProcessedVal2 UDF in Hive to get the latest partitions from a table. This UDF is written in Java. I would like to use the same UDF from pyspark using the Hive context.
dfsql_sel_nxt_batch_id_ini=sqlContext.sql(''' select l4_xxxx_seee.getLastProcessedVal2("/data/l4/work/hive/l4__stge/proctl_stg","APP_AMLMKTE_L1","L1_AMLMKT_MDWE","TRE_EXTION","2.1")''')
Error:
ERROR exec.FunctionRegistry: Unable to load UDF class:
java.lang.ClassNotFoundException:
Start your pyspark shell as:
pyspark --jars /path/to/udf.jar <all-other-params>
OR
submit your pyspark job with the --jars option as:
spark-submit --jars /path/to/udf.jar <all-other-params>
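If the function is not already registered as a permanent function in the Hive metastore, it also has to be registered with the Hive context before it can be used in SQL. A minimal sketch, where the fully qualified class name is a placeholder for the actual Java class behind getLastProcessedVal2:

# register the Java UDF under a temporary name (the class name here is hypothetical)
sqlContext.sql("CREATE TEMPORARY FUNCTION getLastProcessedVal2 AS 'com.mycompany.hive.udf.GetLastProcessedVal2'")

# it can now be called like any built-in function
dfsql_sel_nxt_batch_id_ini = sqlContext.sql('''select getLastProcessedVal2("/data/l4/work/hive/l4__stge/proctl_stg","APP_AMLMKTE_L1","L1_AMLMKT_MDWE","TRE_EXTION","2.1")''')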
You could register that user-defined function using the SQLContext udf method; you pass a string as the first parameter, and it will be the name of your UDF when it is used in SQL queries.
e.g.
sqlContext.udf().register("slen",
(String arg1) -> arg1.length(),
DataTypes.IntegerType);
sqlContext.sql("SELECT slen(name) FROM user").show();
