Connecting Glue PySpark to Oracle using SSL certificate - oracle

I am using Spark read/write operations for reading from and writing to an Oracle database.
Below is the code snippet:
empDF = spark.read \
    .format("jdbc") \
    .option("url", url) \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .option("ssl", True) \
    .option("sslmode", "require") \
    .option("dbtable", query) \
    .option("user", "******") \
    .option("password", "******") \
    .load()
But I need to add the Oracle SSL certificate to connect to the database. I tried using a wallet, which I placed in the /tmp location along with the tnsnames.ora file, and referenced it in the URL in the format below:
url = "jdbc:oracle:thin:@apm_url?TNS_ADMIN=/tmp"
But I am still getting the error below and am not able to connect:
An error occurred while calling o104.load. IO Error: IO Error PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target, connect lapse 30 ms., Authentication lapse 0 ms.

What is the version of the Oracle JDBC driver that you are using? Check out the QuickStart guide for using Oracle wallets. You need to have oraclepki.jar, osdt_core.jar, and osdt_cert.jar on the classpath.
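For reference, a minimal sketch of how this might be wired up in Glue/PySpark is below. It is not a verified setup: the apm_url alias, the /tmp wallet directory, and the wallet_location string are assumptions carried over from the question and the Oracle JDBC documentation.
# Sketch only: assumes ojdbc8.jar, oraclepki.jar, osdt_core.jar and osdt_cert.jar
# are already on the job's classpath (e.g. via the Glue job's dependent JARs path),
# and that the wallet plus tnsnames.ora sit in /tmp.
url = "jdbc:oracle:thin:@apm_url?TNS_ADMIN=/tmp"  # apm_url is the tnsnames.ora alias

empDF = (
    spark.read.format("jdbc")
    .option("url", url)
    .option("driver", "oracle.jdbc.driver.OracleDriver")
    # options Spark does not recognise are forwarded to the driver as connection properties
    .option("oracle.net.wallet_location",
            "(SOURCE=(METHOD=file)(METHOD_DATA=(DIRECTORY=/tmp)))")
    .option("dbtable", query)
    .option("user", "******")
    .option("password", "******")
    .load()
)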

Related

Connect to BigQuery from pyspark using simba JDBC

Update to the question, 6/21:
Background about Simba:
The Simba Google BigQuery JDBC Connector is delivered in a ZIP archive named SimbaBigQueryJDBC42-[Version].zip, where [Version] is the version number of the connector.
The archive contains the connector supporting the JDBC API version indicated in the archive name, as well as release notes and third-party license information.
I'm trying to connect to BigQuery from PySpark (Docker) using the Simba JDBC driver, with no success. I have reviewed many posts here but couldn't find a clue.
My code, which I submit from VC within the Spark Docker image:
import pyspark
from pyspark import SparkConf
from pyspark.sql import SQLContext, SparkSession
import os
from glob import glob

my_jar = glob('/root/Downloads/BigQuery/simba_jdbc_1.2.4.1007/*.jar')
my_jar_str = ','.join(my_jar)
print(my_jar_str)

sc_conf = SparkConf()
sc_conf.setAppName("testApp")
sc_conf.setMaster('local[*]')
sc_conf.set("spark.jars", my_jar_str)
sc = pyspark.SparkContext(conf=sc_conf)

spark = SparkSession \
    .builder \
    .master('local') \
    .appName('spark-read-from-bigquery') \
    .config("spark.executor.extraClassPath", my_jar_str) \
    .config("spark.driver.extraClassPath", my_jar_str) \
    .config("spark.jars", my_jar_str) \
    .getOrCreate()
myJDBC = '''
jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;OAuthType={OAuthType};ProjectId={ProjectId};OAuthServiceAcctEmail={OAuthServiceAcctEmail};OAuthPvtKeyPath={OAuthPvtKeyPath};
'''.format(OAuthType=0,
           ProjectId='ProjectId',
           OAuthServiceAcctEmail="etl@dProjectId.iam.gserviceaccount.com",
           OAuthPvtKeyPath="/workspaces/code/secrets/etl.json")

pgDF = spark.read \
    .format("jdbc") \
    .option("url", myJDBC) \
    .option("driver", "com.simba.googlebigquery.jdbc42.Driver") \
    .option("dbtable", my_query) \
    .load()
I'm getting this error:
File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o79.load.
: java.lang.NullPointerException
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:71)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:210)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
Is it a case of missing jars, or is the logic wrong?
Any clue is appreciated.
To anyone who might have the same thought: I just found that Simba does not support Spark; instead, I have to follow the steps in https://github.com/GoogleCloudDataproc/spark-bigquery-connector.
The open issue (as of 6/23) is that I don't use Dataproc but standalone Spark, so I need to figure out how to collect a consistent set of supporting jars.
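A rough sketch of the connector route on standalone Spark might look like the following; the package version and the table name are placeholders, not values taken from the question:
# Submit with the connector on the classpath, for example:
#   spark-submit --packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.23.2 job.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-read-from-bigquery").getOrCreate()

df = (
    spark.read.format("bigquery")
    .option("credentialsFile", "/workspaces/code/secrets/etl.json")  # service-account key from the question
    .option("table", "my-project.babynames.names_2014")              # hypothetical project.dataset.table
    .load()
)
df.show()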
If ODBC also works for you, maybe this can help.
First, download and configure the Simba ODBC Driver for Google BigQuery.
Next, use the connection like this (note the IgnoreTransactions parameter):
import pyodbc
import pandas as pd
conn = pyodbc.connect(r'Driver={Simba ODBC Driver for Google BigQuery};OAuthMechanism=0;Catalog=<projectID>;KeyFilePath=<path to json credentials>;Email=<email of service account>;IgnoreTransactions=1')
qry = 'select * from <path to your table>'
data = pd.read_sql(qry,conn)
I had a problem with the error "Error converting value to long". My solution was to create a jar file from Java that includes a JDBC dialect:
https://github.com/Fox-sv/spark-bigquery
from pyspark.sql import SparkSession
from py4j.java_gateway import java_import
user_email = "EMAIL"
project_id = "PROJECT_ID"
creds = "PATH_TO_FILE"
jdbc_conn = f"jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;OAuthServiceAcctEmail={user_email};ProjectId={project_id};OAuthPvtKeyPath={creds};"
spark = SparkSession.builder.getOrCreate()
jvm = spark.sparkContext._gateway.jvm
java_import(jvm, "MyDialect")
jvm.org.apache.spark.sql.jdbc.JdbcDialects.registerDialect(jvm.MyDialect().change_dialect())
df = spark.read.jdbc(url=jdbc_conn,table='(SELECT * FROM babynames.names_2014) AS table')

Errors in Karaf upgrade from 4.2.0.M1 to 4.2.0.M2

We were upgrading Karaf and in the transition from 4.2.0.M1 to 4.2.0.M2 we noticed several errors like this related to BootFeatures:
2021-02-04T15:43:17,674 | ERROR | activator-1-thread-2 | BootFeaturesInstaller | 11 - org.apache.karaf.features.core - 4.2.1 | Error installing boot features
org.apache.felix.resolver.reason.ReasonException: Unable to resolve root: missing requirement [root] osgi.identity; osgi.identity=ssh; type=karaf.feature; version="[4.3.1.SNAPSHOT,4.3.1.SNAPSHOT]"; filter:="(&(osgi.identity=ssh)(type=karaf.feature)(version>=4.3.1.SNAPSHOT)(version<=4.3.1.SNAPSHOT))" [caused by: Unable to resolve ssh/4.3.1.SNAPSHOT: missing requirement [ssh/4.3.1.SNAPSHOT] osgi.identity; osgi.identity=org.apache.karaf.shell.ssh; type=osgi.bundle; version="[4.3.1.SNAPSHOT,4.3.1.SNAPSHOT]"; resolution:=mandatory [caused by: Unable to resolve org.apache.karaf.shell.ssh/4.3.1.SNAPSHOT: missing requirement [org.apache.karaf.shell.ssh/4.3.1.SNAPSHOT] osgi.wiring.package; filter:="(&(osgi.wiring.package=org.apache.karaf.jaas.boot.principal)(version>=4.3.0)(!(version>=5.0.0)))" [caused by: Unable to resolve org.apache.karaf.jaas.boot/4.3.1.SNAPSHOT: missing requirement [org.apache.karaf.jaas.boot/4.3.1.SNAPSHOT] osgi.wiring.package; filter:="(&(osgi.wiring.package=org.osgi.framework)(version>=1.9.0)(!(version>=2.0.0)))"]]]
The error always looks similar although the name of the feature that gives the error is different every time (for example kar and ssh), so it seems that all the BootFeatures are failing and one at random just shows the error first. It seems as if something has changed from 4.2.0.M1 to 4.2.0.M2 in how Karaf features are managed.
We use Java 8 and OSGi 6. Besides that, we use Gradle as the build system and the Aether library (org.ops4j.pax.url.mvn) to handle Maven artifact/package resolution.
This is the content of our org.apache.karaf.features.cfg file:
featuresRepositories = \
mvn:org.apache.karaf.features/framework/4.2.0.M2/xml/features, \
mvn:org.apache.karaf.features/spring/4.2.0.M2/xml/features, \
mvn:org.apache.karaf.features/standard/4.2.0.M2/xml/features, \
mvn:org.apache.karaf.features/enterprise/4.2.0.M2/xml/features, \
mvn:org.apache.activemq/activemq-karaf/5.16.1/xml/features, \
mvn:org.apache.cxf.karaf/apache-cxf/3.2.7/xml/features, \
mvn:org.apache.cxf.dosgi/cxf-dosgi/2.3.0/xml/features, \
mvn:org.ops4j.pax.jdbc/pax-jdbc-features/1.4.5/xml/features, \
file:/opt/data/features/feature.xml
featuresBoot = \
(instance, \
package, \
log, \
ssh, \
aries-blueprint, \
framework, \
system, \
eventadmin, \
feature, \
shell, \
management, \
service, \
jaas, \
shell-compat, \
deployer, \
diagnostic, \
wrap, \
bundle, \
config, \
kar, \
jndi, \
jdbc, \
transaction, \
pax-jdbc-config, \
pax-jdbc-pool-common, \
pax-jdbc-postgresql, \
pax-jdbc-pool-c3p0, \
cxf-core, \
cxf-jaxrs, \
cxf-jaxws, \
cxf-dosgi-provider-rs, \
cxf-dosgi-provider-ws, \
activemq-broker-noweb), \
(local_bundle_1, ..., local_bundle_N)
featuresBootAsynchronous=false
Does anyone have any idea about what could be the cause of these errors after upgrading from 4.2.0.M1 to 4.2.0.M2?
Thanks in advance
Karaf is resolving the most current version of the ssh feature for you. To correct this, add blacklist entries for the versions showing up in the log to the file etc/org.apache.karaf.features.xml:
<blacklistedRepositories>
...
<repository>mvn:org.apache.karaf.features/standard/4.3.1-SNAPSHOT/xml/features</repository>
...
</blacklistedRepositories>

Spark - Oracle timezone error

I am running a Spark job to load data into Oracle, but I am getting the following error:
java.sql.SQLException: ORA-00604: error occurred at recursive SQL level 1
ORA-01882: timezone region not found
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:450)
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:392)
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:385)
at oracle.jdbc.driver.T4CTTIfun.processError(T4CTTIfun.java:1018)
at oracle.jdbc.driver.T4CTTIoauthenticate.processError(T4CTTIoauthenticate.java:501)
Here is what I have in my code:
val oracleProps = new java.util.Properties()
oracleProps.put("driver", oracleDriver)
oracleProps.put("user", oracleUser)
oracleProps.put("password", oraclePwd)
oracleProps.put("batchsize", oracleBatchSize)
dataframe.write.mode("overwrite").jdbc(oracleUrl, oracleBaseTable, oracleProps)
The same code works from spark-shell but not from spark-submit, and the same spark-submit works on other clusters.
I appreciate your help!
I had this error with a PySpark Oracle JDBC connection: "ORA-01882: timezone region not found". I was able to connect after setting oracle.jdbc.timezoneAsRegion to false.
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0.
JDBC driver used - ojdbc8.jar
df.write \
    .format("jdbc") \
    .option("url", "JDBC_URL") \
    .option('driver', 'oracle.jdbc.driver.OracleDriver') \
    .option("oracle.jdbc.timezoneAsRegion", "false") \
    .option("dbtable", "SCHEMA.TABLE") \
    .option("user", "USERID") \
    .option("password", "PASSWORD") \
    .mode("overwrite") \
    .save()
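If setting the option on the writer is not enough (spark-submit differs from spark-shell mainly in which JVMs end up with which default timezone), a hedged alternative is to push the property into both driver and executor JVMs. This is a sketch built on standard Spark configuration keys, not part of the original answer; adding user.timezone is an extra assumption.
from pyspark.sql import SparkSession

# Sketch: set the Oracle driver property as a JVM system property on driver and
# executors, so every JVM that opens a connection sees it.
spark = (
    SparkSession.builder
    .appName("oracle-load")
    .config("spark.driver.extraJavaOptions",
            "-Doracle.jdbc.timezoneAsRegion=false -Duser.timezone=UTC")
    .config("spark.executor.extraJavaOptions",
            "-Doracle.jdbc.timezoneAsRegion=false -Duser.timezone=UTC")
    .getOrCreate()
)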
I wrote a program to insert data from a file into an Oracle database using Spark [version 2.3.0.cloudera3]. In my case the Oracle database version is "Oracle Database 11g Enterprise Edition Release 11.2.0.1.0".
I was using the Oracle JDBC driver ojdbc8.jar and encountered the following problem:
java.sql.SQLException: ORA-00604: error occurred at recursive SQL level 1
ORA-01882: timezone region not found.
I then changed my Oracle JDBC driver to ojdbc6.jar, which is compatible with Oracle 11.2.0.1.0, and now it works perfectly.

Sqoop error: java.io.CharConversionException for a non-UTF-8 character

I was trying to Sqoop import data from IBM DB2 but got stuck on this error:
java.io.CharConversionException: SQL exception in nextKeyValue
Caused by: [jcc][t4][1065]..... Caught java.io.CharConversionException ERRORCODE=-4220, SQLSTATE=null
I've tried:
sqoop import --driver com.ibm.db2.jcc.DB2Driver --connect jdbc:db2://host:port/db --verbose table.views_data -m 1 --target-dir /tmp/data
It sounds like there is a bad character in the table you're loading, per this IBM article: http://www-01.ibm.com/support/docview.wss?uid=swg21684365
If you want to try to work around it without fixing the data as suggested above, the DataDirect DB2 JDBC driver has a property to override the code page with one of these values: http://media.datadirect.com/download/docs/jdbc/alljdbc/help.html#page/jdbcconnect%2Fcodepageoverride.html%23

Spark 2.0: Relative path in absolute URI (spark-warehouse)

I'm trying to migrate from Spark 1.6.1 to Spark 2.0.0 and I am getting a weird error when trying to read a CSV file into Spark SQL. Previously, when I would read a file from local disk in PySpark, I would do:
Spark 1.6
df = sqlContext.read \
    .format('com.databricks.spark.csv') \
    .option('header', 'true') \
    .load('file:///C:/path/to/my/file.csv', schema=mySchema)
In the latest release I think it should look like this:
Spark 2.0
spark = SparkSession.builder \
    .master('local[*]') \
    .appName('My App') \
    .getOrCreate()

df = spark.read \
    .format('csv') \
    .option('header', 'true') \
    .load('file:///C:/path/to/my/file.csv', schema=mySchema)
But I am getting this error no matter how many different ways I try to adjust the path:
IllegalArgumentException: 'java.net.URISyntaxException: Relative path in
absolute URI: file:/C:/path//to/my/file/spark-warehouse'
Not sure if this is just an issue with Windows or there is something I am missing. I was excited that the spark-csv package is now a part of Spark right out of the box, but I can't seem to get it to read any of my local files anymore. Any ideas?
I was able to do some digging around in the latest Spark documentation, and I noticed they have a new configuration setting that I hadn't seen before:
spark.sql.warehouse.dir
So I went ahead and added this setting when I set up my SparkSession:
spark = SparkSession.builder \
    .master('local[*]') \
    .appName('My App') \
    .config('spark.sql.warehouse.dir', 'file:///C:/path/to/my/') \
    .getOrCreate()
That seems to set the working directory, and then I can just feed my filename directly into the csv reader:
df = spark.read \
    .format('csv') \
    .option('header', 'true') \
    .load('file.csv', schema=mySchema)
Once I set the Spark warehouse directory, Spark was able to locate all of my files and my app now finishes successfully. The amazing thing is that it runs about 20 times faster than it did in Spark 1.6, so they really have done some impressive work optimizing their SQL engine. Spark it up!
