Connect to BigQuery from pyspark using simba JDBC - jdbc

Update the question 6/21
Background about Simba:
The Simba Google BigQuery JDBC Connector is delivered in a ZIP archive named SimbaBigQueryJDBC42-[Version].zip, where [Version] is the version number of the connector.
The archive contains the connector supporting the JDBC API version indicated in the archive name, as well as release notes and third-party license information.
I'm trying to connect to BigQuery from pyspark (docker) using simba jdbc with no success. I had reviewed many posts here but couldn't find clue
my code which I just submit from VC within spark docker image
import pyspark
from pyspark import SparkConf
from pyspark.sql import SQLContext, SparkSession
import os
from glob import glob
my_jar = glob('/root/Downloads/BigQuery/simba_jdbc_1.2.4.1007/*.jar')
my_jar_str = ','.join(my_jar)
print(my_jar_str)
sc_conf = SparkConf()
sc_conf.setAppName("testApp")
sc_conf.setMaster('local[*]')
sc_conf.set("spark.jars", my_jar_str)
sc = pyspark.SparkContext(conf=sc_conf)
spark = SparkSession \
.builder \
.master('local') \
.appName('spark-read-from-bigquery') \
.config("spark.executor.extraClassPath",my_jar_str) \
.config("spark.driver.extraClassPath",my_jar_str) \
.config("spark.jars", my_jar_str)\
.getOrCreate()
myJDBC = '''
jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;OAuthType={OAuthType};ProjectId={ProjectId};OAuthServiceAcctEmail={OAuthServiceAcctEmail};OAuthPvtKeyPath={OAuthPvtKeyPath};
'''.format(OAuthType=0,
ProjectId='ProjectId',
OAuthServiceAcctEmail="etl#dProjectId.iam.gserviceaccount.com",
OAuthPvtKeyPath="/workspaces/code/secrets/etl.json")
pgDF = spark.read \
.format("jdbc") \
.option("url", myJDBC) \
.option("driver", "com.simba.googlebigquery.jdbc42.Driver") \
.option("dbtable", my_query) \
.load()
I'm getting error:
File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o79.load.
: java.lang.NullPointerException
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:71)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:210)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
Is that missing jars or it is wrong logic?
Please any clue is appreciated

To anyone who might have the same thought. I just found that SIMBA is not supporting spark but rather I have to follow the steps in https://github.com/GoogleCloudDataproc/spark-bigquery-connector.
The open issue (as of 6/23) that I don't use Dataproc but rather standalone spark, so I need to figure how to collect consistent support jars

If ODBC also works for you, maybe this can help.
First, download and configure the ODBC driver from here:
Next - use the connection like this (note the IgnoreTransactions parameter):
import pyodbc
import pandas as pd
conn = pyodbc.connect(r'Driver={Simba ODBC Driver for Google BigQuery};OAuthMechanism=0;Catalog=<projectID>;KeyFilePath=<path to json credentials>;Email=<email of service account>;IgnoreTransactions=1')
qry = 'select * from <path to your table>'
data = pd.read_sql(qry,conn)

I had a problem with error: Error converting value to long
And my solution is creating a jar file from java which include jdbc dialect
https://github.com/Fox-sv/spark-bigquery
from pyspark.sql import SparkSession
from py4j.java_gateway import java_import
user_email = "EMAIL"
project_id = "PROJECT_ID"
creds = "PATH_TO_FILE"
jdbc_conn = f"jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;OAuthServiceAcctEmail={user_email};ProjectId={project_id};OAuthPvtKeyPath={creds};"
spark = SparkSession.builder.getOrCreate()
jvm = spark.sparkContext._gateway.jvm
java_import(jvm, "MyDialect")
jvm.org.apache.spark.sql.jdbc.JdbcDialects.registerDialect(jvm.MyDialect().change_dialect())
df = spark.read.jdbc(url=jdbc_conn,table='(SELECT * FROM babynames.names_2014) AS table')

Related

Delta Lake in EMR

I'm trying to use a delta lake through a python program that is called by a step on an EMR cluster, but the step always fails with an unknown error. I suppose the error could be related to the delta.tables import as the code is very simple.
Python program: test.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
# Spark Session creation
spark = (SparkSession.builder.appName("DeltaExercise")
.config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0")
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
.getOrCreate()
)
# Importing delta
from delta.tables import *
# Reading
enem = (
spark.read.format("csv")
.option("inferSchema", True)
.option("header", True)
.option("delimiter", ";")
.load("MyBucket/raw-data/microdados_enem_2020.csv")
)
#Writing
(
enem
.write
.mode("overwrite")
.format("delta")
.partitionBy("year")
.save("MyBucket/staging/test")
)
Step in EMR cluster:
spark-submit --deploy-mode cluster --packages io.delta:delta-core_2.12:1.0.0 --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog --master yarn MYBUCKET/emr-code/pyspark/test.py
EMR config screens:
If anyone has any tips on how to fix this, I'd appreciate it.
I found the error. It was a mistake in the EMR cluster configuration. Delta files were created successfully.

How to use Hadoop Credential provider in Spark to connect to Oracle database?

I am trying to establish a secure connection between Spark and Oracle as well as Sqoop and Oracle. After my research I have found two different option for two different setup.
Connecting Spark to Oracle where passwords are encrypted using spark.jdbc.b64password and further it has been decrypted in spark code and used it in jdbc url.
Using Hadoop credential provider to create the password file and further it has been used in Sqoop to connect to Oracle.
Now keeping password in two different files doesn't seems like a good practice. My question is can we use Hadoop credential provider in spark to use the same credential profile created for Sqoop?
If you have any other option to make this better please help.
The recommended way is to use Kerberos authentication both in Spark and Hadoop and with Oracle. The Oracle JDBC thin driver supports Kerberos authentication. A single Kerberos principal is then used to authenticate the user all the way from Spark or Hadoop to the Oracle database.
You could use all languages supported by Spark to read the jecks password from inside your code:
Python:
spark1 = SparkSession.builder.appName("xyz").master("yarn").enableHiveSupport().config("hive.exec.dynamic.partition", "true").config("hive.exec.dynamic.partition.mode", "nonstrict").getOrCreate()
x = spark1.sparkContext._jsc.hadoopConfiguration()
x.set("hadoop.security.credential.provider.path", "jceks://file///localpathtopassword")
a = x.getPassword("<password alias>")
passw = ""
for i in range(a.__len__()):
passw = passw + str(a.__getitem__(i))
In the above code you shall get the password string in passw
Scala:
import org.apache.hadoop.security.alias.CredentialProvider
import org.apache.hadoop.security.alias.CredentialProvider.CredentialEntry
import org.apache.hadoop.security.alias.CredentialProviderFactory
import org.apache.hadoop.conf.Configuration
val conf_H: Configuration = new org.apache.hadoop.conf.Configuration()
val alias = password_alias
val jceksPath = security_credential_provider_path`enter code here`
conf_H.set(CredentialProviderFactory.CREDENTIAL_PROVIDER_PATH, jceksPath)
val pass = conf_H.getPassword(alias).mkString
if (pass != null && !pass.isEmpty() && !pass.equalsIgnoreCase("")) {
jdbcPassword = pass
}
you can also allow spark to set hadoop.security.credential.provider.path in hadoop configuration in such way:
"""
Create java key store with following command:
> keytool -genseckey -alias duke -keypass 123456 -storetype jceks -keystore keystore.jceks
> export HADOOP_CREDSTORE_PASSWORD=123456
"""
jceks = os.path.join(os.path.dirname(__file__), "keystore.jceks")
print(jceks)
assert os.path.isfile(jceks)
spark_session = lambda: (SparkSession
.builder
.enableHiveSupport()
.config('spark.ui.enabled', False)
.config("spark.hadoop.hadoop.security.credential.provider.path",
"jceks://file//" + jceks)
.getOrCreate())
with spark_session() as spark:
hc = spark.sparkContext._jsc.hadoopConfiguration()
jo = hc.getPassword("duke")
expected_password = ''.join(jo)
assert len(retrieved_password) > 0
spark.hadoop.hadoop.security.credential.provider.path is little weird but spark cuts off spark.hadoop. prefix when loads hadoop settings

How to connect to Teradata from pyspark?

I am trying to connect to Teradata and DB2 from Pyspark.
I am using the below jars :
tdgssconfig-15.10.00.14.jar
teradata-connector-1.4.1.jar
terajdbc4-15.10.00.14.jar
&
db2jcc4.jar
Connection string:
df1 = sqlContext.load(source="jdbc", driver="com.teradata.jdbc.TeraDriver", url=db_url,user="db_user",TMODE="TERA",password="db_pwd",dbtable="U114473.EMPLOYEE")
df = sqlContext.read.format('jdbc').options(url='jdbc:db2://10.123.321.9:50000/DB599641',user='******',password='*****',driver='com.ibm.db2.jcc.DB2Driver', dbtable='DSN1.EMPLOYEE')
Both gives me Driver not found error.
Can we use JDBC drivers for pyspark?
Like James Tobin said, use the pyspark2 --jars /jarpath option when you start your pyspark sessioni or when you submit your py to spark

sparkSession/sparkContext can not get hadoop configuration

I am running spark 2, hive, hadoop at local machine, and I want to use spark sql to read data from hive table.
It works all fine when I have hadoop running at default hdfs://localhost:9000, but if I change to a different port in core-site.xml:
<name>fs.defaultFS</name>
<value>hdfs://localhost:9099</value>
Running a simple sql spark.sql("select * from archive.tcsv3 limit 100").show(); in spark-shell will give me the error:
ERROR metastore.RetryingHMSHandler: AlreadyExistsException(message:Database default already exists)
.....
From local/147.214.109.160 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused;
.....
I get the AlreadyExistsException before, which doesn't seem to influence the result.
I can make it work by creating a new sparkContext:
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
sc.stop()
var sc = new SparkContext()
val session = SparkSession.builder().master("local").appName("test").enableHiveSupport().getOrCreate()
session.sql("show tables").show()
My question is, why the initial sparkSession/sparkContext did not get the correct configuration? How can I fix it? Thanks!
If you are using SparkSession and you want to set configuration on the the spark context then use session.sparkContext
val session = SparkSession
.builder()
.appName("test")
.enableHiveSupport()
.getOrCreate()
import session.implicits._
session.sparkContext.hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
You don't need to import SparkContext or created it before the SparkSession

Spark 2.0: Relative path in absolute URI (spark-warehouse)

I'm trying to migrate from Spark 1.6.1 to Spark 2.0.0 and I am getting a weird error when trying to read a csv file into SparkSQL. Previously, when I would read a file from local disk in pyspark I would do:
Spark 1.6
df = sqlContext.read \
.format('com.databricks.spark.csv') \
.option('header', 'true') \
.load('file:///C:/path/to/my/file.csv', schema=mySchema)
In the latest release I think it should look like this:
Spark 2.0
spark = SparkSession.builder \
.master('local[*]') \
.appName('My App') \
.getOrCreate()
df = spark.read \
.format('csv') \
.option('header', 'true') \
.load('file:///C:/path/to/my/file.csv', schema=mySchema)
But I am getting this error no matter how many different ways I try to adjust the path:
IllegalArgumentException: 'java.net.URISyntaxException: Relative path in
absolute URI: file:/C:/path//to/my/file/spark-warehouse'
Not sure if this is just an issue with Windows or there is something I am missing. I was excited that the spark-csv package is now a part of Spark right out of the box, but I can't seem to get it to read any of my local files anymore. Any ideas?
I was able to do some digging around in the latest Spark documentation, and I notice they have a new configuration setting that I hadn't noticed before:
spark.sql.warehouse.dir
So I went ahead and added this setting when I set up my SparkSession:
spark = SparkSession.builder \
.master('local[*]') \
.appName('My App') \
.config('spark.sql.warehouse.dir', 'file:///C:/path/to/my/') \
.getOrCreate()
That seems to set the working directory, and then I can just feed my filename directly into the csv reader:
df = spark.read \
.format('csv') \
.option('header', 'true') \
.load('file.csv', schema=mySchema)
Once I set the spark warehouse, Spark was able to locate all of my files and my app finishes successfully now. The amazing thing is that it runs about 20 times faster than it did in Spark 1.6. So they really have done some very impressive work optimizing their SQL engine. Spark it up!

Resources