I was trying to generate an Oracle sequence value for my records via a PySpark JDBC read.
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder.master("local") \
        .appName("Seq_num") \
        .getOrCreate()

    jdbcDF = spark.read \
        .format("jdbc") \
        .option("url", "URL") \
        .option("dbtable", "(SELECT SEQ.NEXTVAL FROM DUAL) qry") \
        .option("user", "USER") \
        .option("password", "PWD") \
        .load()

    jdbcDF.show(2)
This gives me the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o50.load.
: java.sql.SQLSyntaxErrorException: ORA-02287: sequence number not allowed here
The alternative was to run the same query through cur.execute(query), which returns tuples.
Is there any other way to achieve this, so that I can save the time spent converting tuples to a DataFrame?
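For reference, a minimal sketch of the cursor-based alternative mentioned above, assuming a cx_Oracle (or python-oracledb) client is available; the connection details are placeholders, and the tuples returned by the cursor can be passed straight to spark.createDataFrame:
import cx_Oracle  # assumption: the cx_Oracle / python-oracledb client is installed

# Fetch the next sequence value on the driver side (placeholder connection details).
conn = cx_Oracle.connect("USER", "PWD", "HOST:1521/SERVICE")
cur = conn.cursor()
cur.execute("SELECT SEQ.NEXTVAL FROM DUAL")
rows = cur.fetchall()  # list of tuples, e.g. [(101,)]
cur.close()
conn.close()

# Hand the tuples directly to Spark; no manual conversion beyond naming the column.
seq_df = spark.createDataFrame(rows, ["SEQ_VAL"])
seq_df.show()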
env
spark 3.1.2
hive 3.1.2
hadoop 3.2.1
problem
Spark SQL with a Hive metastore connection.
For example, I created table A, whose data is HDFS-cached in memory, and table B, whose data is not HDFS-cached and lives on plain HDFS (the data content is the same).
I executed the same query against tables A and B.
I expected the query on table A to run faster than the query on table B, but it didn't: the execution times were almost identical.
Is there anything I need to do to enable HDFS caching with Spark SQL when using the Hive metastore?
example code
from pyspark import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import SparkSession
from pyspark.sql import functions as Fd
from pyspark.sql.window import Window as W
spark_executor_cores = 8
spark_executor_memory = '3g'
spark_instances = 50
spark_shuffle_partitions = 1000
spark_default_parallelism = 1000
conf = SparkConf()
conf.setAppName("test application")
conf.set('spark.yarn.queue', 'default')
conf.set('spark.executor.memory', str(spark_executor_memory))
conf.set('spark.executor.instances', str(spark_instances))
conf.set('spark.shuffle.sort.bypassMergeThreshold', spark_instances * int(spark_executor_cores))
conf.set("spark.dynamicAllocation.enabled", "false")
conf.set("spark.sql.shuffle.partitions", str(spark_shuffle_partitions))
conf.set("spark.default.parallelism", str(spark_default_parallelism))
conf.set('spark.sql.adaptive.enabled', 'true')
conf.set('spark.sql.adaptive.coalescePartitions.enabled', 'true')
conf.set('spark.sql.adaptive.localShuffleReader.enabled', 'true')
conf.set('spark.sql.adaptive.skewJoin.enabled', 'true')
conf.set('spark.sql.execution.arrow.pyspark.enabled', 'true')
conf.set('spark.sql.execution.arrow.pyspark.fallback.enabled', 'true')
conf.set('spark.sql.warehouse.dir', metastore_dir)
conf.set('spark.hadoop.javax.jdo.option.ConnectionURL', metastore_url)
ss = SparkSession.builder.enableHiveSupport().config(conf=conf).getOrCreate()
sc = SparkContext.getOrCreate()
spark = ss
sql = spark.sql
# example query
# partition = hour, logtype
sql("""
SELECT <columns>,
...
...
FROM <table name>
WHERE hour = <hour>
AND logtype = <logtype>
GROUP BY logtype
""").show()
I am reading data in Databricks using the following code:
acct = spark.read.csv("/mnt/syn/account/2018-06.csv", inferSchema = True, header=True)
However, I am unable to see the column headers.
The printSchema() output is as follows:
6d4cd0fe-dd7a-e811-a95c-00224800c9ff:string
5/19/2022 4:25:38 PM1:string
5/19/2022 4:25:38 PM2:string
0:string
14:string
_c5:string
_c6:string
_c7:string
18:string
_c9:string
_c10:string
_c11:string
71775000112:string
930580000:string
_c14:string
_c15:string
_c16:string
117:string
_c18:string
However, when I query the data as a table in Azure Synapse, I do get the headers successfully.
I'm pretty sure there is a simple explanation, but I can't think why this is the case with Databricks.
Please follow this sample code:
file_location = "/FileStore/tables/export.csv"
df = spark.read.format("csv") \
.option("inferSchema", "true") \
.option("header", "true") \
.option("sep", ",") \
.load(file_location)
display(df)
For more detail, refer to the official documentation.
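If the header option still shows data values as column names, the CSV itself may have no header row (the mix of data-like names and _cN placeholders in the printSchema() output above can indicate that the first data row was consumed as a header). In that case, a hedged alternative is to read with header set to false and an explicit schema; the column names below are hypothetical:
from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical column names -- replace them with the real names that
# Azure Synapse shows for this table.
schema = StructType([
    StructField("accountid", StringType(), True),
    StructField("createdon", StringType(), True),
    StructField("modifiedon", StringType(), True),
    # ... remaining columns ...
])

acct = (spark.read
        .option("header", "false")  # the first line is treated as data, not a header
        .schema(schema)             # avoids the extra pass that inferSchema needs
        .csv("/mnt/syn/account/2018-06.csv"))

acct.printSchema()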
I am trying to read data from a database periodically in PySpark, based on the condition current_time - lastReadTime > refresh_interval.
The refresh_interval I have provided is 5 minutes.
It is a Structured Streaming job with Kafka, and I join the data coming from Postgres later.
However, whenever I change the data in the database within those 5 minutes, I get the new data from the database even though 5 minutes have not passed.
Below is the code I am using.
def __init__(self, config, spark):
    self.refresh_frequency_sec = config.getint('postgres-config', 'refresh-frequency-sec')
    self.spark = spark
    self.lastMetaReadTime = time()
    self.rules = self.fetchRules()

def fetchRules(self):
    jdbcDF = self.spark.read \
        .format("jdbc") \
        .option("driver", "org.postgresql.Driver") \
        .option("url", self.connection_url) \
        .option("dbtable", self.dbtable) \
        .option("user", self.user) \
        .option("password", self.password) \
        .load()
    return jdbcDF

def getRules(self):
    if time() - self.lastMetaReadTime > self.refresh_frequency_sec:
        self.rules = self.fetchRules()
        self.lastMetaReadTime = time()
    return self.rules
What am I doing wrong?
You are calling the wrong method for self.rules in your __init__() method.
Your __init__ method should rather be:
def __init__(self, config, spark):
    self.refresh_frequency_sec = config.getint('postgres-config', 'refresh-frequency-sec')
    self.spark = spark
    self.lastMetaReadTime = time()
    self.rules = self.getRules()
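As a separate, hedged note (not part of the answer above): the DataFrame returned by the JDBC read is lazy, so the database may be queried again every time self.rules is actually used, which can also explain seeing fresh rows before the refresh interval has elapsed. One sketch of taking a point-in-time snapshot at refresh time is to persist and materialize the result inside fetchRules:
def fetchRules(self):
    # Sketch (not from the original post): cache() plus a count() forces the JDBC
    # read to happen now, so later uses of the rules see a snapshot rather than
    # re-querying Postgres on every access.
    jdbcDF = self.spark.read \
        .format("jdbc") \
        .option("driver", "org.postgresql.Driver") \
        .option("url", self.connection_url) \
        .option("dbtable", self.dbtable) \
        .option("user", self.user) \
        .option("password", self.password) \
        .load() \
        .cache()
    jdbcDF.count()  # trigger the actual database read
    return jdbcDF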
I'm trying to connect to my PostgreSQL RDS directly from my AWS Glue script. I've tried connecting using the generated code and it works, but it doesn't work with a JDBC-type connection. Here's the code:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
import pyspark.sql.functions as F
from pyspark.sql.functions import *
## Initialize
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
df = spark \
    .read \
    .format('jdbc') \
    .option('url', 'jdbc:postgresql://host/database_name') \
    .option('dbtable', "(SELECT * FROM table WHERE name = 'abcd') AS t") \
    .option('user', 'username') \
    .option('password', 'password') \
    .load()
job.commit()
Part of the error:
An error occurred while calling o74.load. : java.sql.SQLException: [Amazon](500150) Error setting/closing connection: Connection timed out. at com.amazon.redshift.client.PGClient.connect ....
Extra information:
It actually worked before, but I can't figure out what has changed since then.
I've tested the connection by using the "Test connection" feature in AWS Glue and it works.
I've configured the VPC security group for the RDS to allow inbound/outbound traffic within the same security group (this is based on this guide: https://docs.aws.amazon.com/glue/latest/dg/setup-vpc-for-glue-access.html).
Thanks in advance, and let me know if you need more information.
I just found out the reason: I didn't specify the port. I don't remember having to put the port in before, but everything works after adding it.
df = spark \
    .read \
    .format('jdbc') \
    .option('url', 'jdbc:postgresql://host:5432/database_name') \
    .option('dbtable', "(SELECT * FROM table WHERE name = 'abcd') AS t") \
    .option('user', 'username') \
    .option('password', 'password') \
    .load()
I have a use case similar to the post How to update a Static Dataframe with Streaming Dataframe in Spark structured streaming. You can take the same data as an example from that post.
static_df = spark.read.schema(schemaName).json(fileName)
streaming_df = spark.readStream(....)
new_reference_data = update_reference_df(streaming_df, static_df)
from pyspark.sql import DataFrame
from pyspark.sql.streaming import StreamingQuery

def update_reference_df(df, static_df):
    query: StreamingQuery = df \
        .writeStream \
        .outputMode("append") \
        .foreachBatch(lambda batch_df, batchId: update_static_df(batch_df, static_df)) \
        .start()
    return query

def update_static_df(batch_df, static_df):
    df1: DataFrame = static_df.union(batch_df.join(static_df,
                                                   (batch_df.SITE == static_df.SITE),
                                                   "left_anti"))
    return df1
I want to know how static_df will get refreshed with the new values from the data processed via foreachBatch, since as far as I know foreachBatch returns nothing (void). I need to use the new values from static_df in further processing. I appreciate your help.
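Not from the original post, but one sketch of a common pattern: since foreachBatch returns nothing, the refreshed DataFrame has to be published through a shared driver-side reference (the foreachBatch function runs on the driver), which downstream code then reads. The reference dict and refresh_static_df name below are hypothetical:
# Sketch (not from the original post): publish the refreshed static_df through
# a shared driver-side reference, because foreachBatch itself returns nothing.
reference = {"static_df": static_df}  # hypothetical holder

def refresh_static_df(batch_df, holder):
    current = holder["static_df"]
    new_rows = batch_df.join(current, batch_df.SITE == current.SITE, "left_anti")
    holder["static_df"] = current.union(new_rows).cache()  # keep the growing reference data in memory

query = streaming_df \
    .writeStream \
    .outputMode("append") \
    .foreachBatch(lambda batch_df, batch_id: refresh_static_df(batch_df, reference)) \
    .start()

# Downstream code always reads the latest snapshot via reference["static_df"].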