I was trying to generate an Oracle sequence value for my records via a PySpark JDBC read.
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder.master("local") \
        .appName("Seq_num") \
        .getOrCreate()

    jdbcDF = spark.read \
        .format("jdbc") \
        .option("url", "URL") \
        .option("dbtable", "(SELECT SEQ.NEXTVAL FROM DUAL) qry") \
        .option("user", "USER") \
        .option("password", "PWD") \
        .load()

    jdbcDF.show(2)
This gives me the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o50.load.
: java.sql.SQLSyntaxErrorException: ORA-02287: sequence number not allowed here
The alternative was to run the same query through cur.execute(query), which returns tuples.
Is there any other way to achieve this, so that I can save the time spent converting tuples to a DataFrame?
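For reference, a minimal sketch of the cursor-based alternative mentioned above, assuming a cx_Oracle (or python-oracledb) client is available; the connection details are placeholders, and the tuples returned by the cursor can be passed straight to spark.createDataFrame:
import cx_Oracle  # assumption: the cx_Oracle / python-oracledb client is installed

# Fetch the next sequence value on the driver side (placeholder connection details).
conn = cx_Oracle.connect("USER", "PWD", "HOST:1521/SERVICE")
cur = conn.cursor()
cur.execute("SELECT SEQ.NEXTVAL FROM DUAL")
rows = cur.fetchall()  # list of tuples, e.g. [(101,)]
cur.close()
conn.close()

# Hand the tuples directly to Spark; no manual conversion beyond naming the column.
seq_df = spark.createDataFrame(rows, ["SEQ_VAL"])
seq_df.show()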
env
spark 3.1.2
hive 3.1.2
hadoop 3.2.1
problem
Spark SQL with a Hive metastore connection.
For example, I created table A, whose data is HDFS-cached in memory, and table B, whose data is not HDFS-cached and lives on plain HDFS (the data content is the same).
I executed the same query against tables A and B.
I expected the query on table A to run faster than the query on table B, but it didn't: the execution times were almost identical.
Is there anything I need to do to enable HDFS caching with Spark SQL when using the Hive metastore?
example code
from pyspark import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import SparkSession
from pyspark.sql import functions as Fd
from pyspark.sql.window import Window as W
spark_executor_cores = 8
spark_executor_memory = '3g'
spark_instances = 50
spark_shuffle_partitions = 1000
spark_default_parallelism = 1000
conf = SparkConf()
conf.setAppName("test application")
conf.set('spark.yarn.queue', 'default')
conf.set('spark.executor.memory', str(spark_executor_memory))
conf.set('spark.executor.instances', str(spark_instances))
conf.set('spark.shuffle.sort.bypassMergeThreshold', spark_instances * int(spark_executor_cores))
conf.set("spark.dynamicAllocation.enabled", "false")
conf.set("spark.sql.shuffle.partitions", str(spark_shuffle_partitions))
conf.set("spark.default.parallelism", str(spark_default_parallelism))
conf.set('spark.sql.adaptive.enabled', 'true')
conf.set('spark.sql.adaptive.coalescePartitions.enabled', 'true')
conf.set('spark.sql.adaptive.localShuffleReader.enabled', 'true')
conf.set('spark.sql.adaptive.skewJoin.enabled', 'true')
conf.set('spark.sql.execution.arrow.pyspark.enabled', 'true')
conf.set('spark.sql.execution.arrow.pyspark.fallback.enabled', 'true')
conf.set('spark.sql.warehouse.dir', metastore_dir)
conf.set('spark.hadoop.javax.jdo.option.ConnectionURL', metastore_url)
ss = SparkSession.builder.enableHiveSupport().config(conf=conf).getOrCreate()
sc = SparkContext.getOrCreate()
spark = ss
sql = spark.sql
# example query
# partition = hour, logtype
sql("""
SELECT <columns>,
...
...
FROM <table name>
WHERE hour = <hour>
AND logtype = <logtype>
GROUP BY logtype
""").show()
I am reading data in Databricks using the following code:
acct = spark.read.csv("/mnt/syn/account/2018-06.csv", inferSchema = True, header=True)
However, I am unable to see the column headers.
The printSchema() output is as follows:
6d4cd0fe-dd7a-e811-a95c-00224800c9ff:string
5/19/2022 4:25:38 PM1:string
5/19/2022 4:25:38 PM2:string
0:string
14:string
_c5:string
_c6:string
_c7:string
18:string
_c9:string
_c10:string
_c11:string
71775000112:string
930580000:string
_c14:string
_c15:string
_c16:string
117:string
_c18:string
However, when I query the data as a table in Azure Synapse, I do get the headers successfully.
I'm pretty sure there is a simple explanation, but I can't think why this is the case with Databricks.
Please follow this sample code:
file_location = "/FileStore/tables/export.csv"
df = spark.read.format("csv") \
.option("inferSchema", "true") \
.option("header", "true") \
.option("sep", ",") \
.load(file_location)
display(df)
For more detail, refer to the official documentation.
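If the header option still shows data values as column names, the CSV itself may have no header row (the mix of data-like names and _cN placeholders in the printSchema() output above can indicate that the first data row was consumed as a header). In that case, a hedged alternative is to read with header set to false and an explicit schema; the column names below are hypothetical:
from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical column names -- replace them with the real names that
# Azure Synapse shows for this table.
schema = StructType([
    StructField("accountid", StringType(), True),
    StructField("createdon", StringType(), True),
    StructField("modifiedon", StringType(), True),
    # ... remaining columns ...
])

acct = (spark.read
        .option("header", "false")  # the first line is treated as data, not a header
        .schema(schema)             # avoids the extra pass that inferSchema needs
        .csv("/mnt/syn/account/2018-06.csv"))

acct.printSchema()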
I am trying to read data from a database periodically in PySpark, based on the condition current_time - lastReadTime > refresh_interval.
The refresh_interval I have provided is 5 minutes.
It is a Structured Streaming job with Kafka, and I join the data coming from Postgres later.
However, whenever I change the data in the database within those 5 minutes, I get the new data from the database even though 5 minutes have not passed.
Below is the code I am using.
def __init__(self, config, spark):
    self.refresh_frequency_sec = config.getint('postgres-config', 'refresh-frequency-sec')
    self.spark = spark
    self.lastMetaReadTime = time()
    self.rules = self.fetchRules()

def fetchRules(self):
    jdbcDF = self.spark.read \
        .format("jdbc") \
        .option("driver", "org.postgresql.Driver") \
        .option("url", self.connection_url) \
        .option("dbtable", self.dbtable) \
        .option("user", self.user) \
        .option("password", self.password) \
        .load()
    return jdbcDF

def getRules(self):
    if time() - self.lastMetaReadTime > self.refresh_frequency_sec:
        self.rules = self.fetchRules()
        self.lastMetaReadTime = time()
    return self.rules
What am I doing wrong?
You are calling the wrong method for self.rules in your __init__() method.
Your __init__ method should rather be:
def __init__(self, config, spark):
    self.refresh_frequency_sec = config.getint('postgres-config', 'refresh-frequency-sec')
    self.spark = spark
    self.lastMetaReadTime = time()
    self.rules = self.getRules()
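As a separate, hedged note (not part of the answer above): the DataFrame returned by the JDBC read is lazy, so the database may be queried again every time self.rules is actually used, which can also explain seeing fresh rows before the refresh interval has elapsed. One sketch of taking a point-in-time snapshot at refresh time is to persist and materialize the result inside fetchRules:
def fetchRules(self):
    # Sketch (not from the original post): cache() plus a count() forces the JDBC
    # read to happen now, so later uses of the rules see a snapshot rather than
    # re-querying Postgres on every access.
    jdbcDF = self.spark.read \
        .format("jdbc") \
        .option("driver", "org.postgresql.Driver") \
        .option("url", self.connection_url) \
        .option("dbtable", self.dbtable) \
        .option("user", self.user) \
        .option("password", self.password) \
        .load() \
        .cache()
    jdbcDF.count()  # trigger the actual database read
    return jdbcDF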
I'm trying to connect to my PostgreSQL RDS directly from my AWS Glue script. I've tried connecting using the generated code and it works, but it doesn't work with a JDBC-type connection. Here's the code:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
import pyspark.sql.functions as F
from pyspark.sql.functions import *
## Initialize
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
df = spark \
    .read \
    .format('jdbc') \
    .option('url', 'jdbc:postgresql://host/database_name') \
    .option('dbtable', "(SELECT * FROM table WHERE name = 'abcd') AS t") \
    .option('user', 'username') \
    .option('password', 'password') \
    .load()
job.commit()
Part of the error:
An error occurred while calling o74.load. : java.sql.SQLException: [Amazon](500150) Error setting/closing connection: Connection timed out. at com.amazon.redshift.client.PGClient.connect ....
Extra information:
It actually worked before, but I can't figure out what has changed since then.
I've tested the connection by using the "Test connection" feature in AWS Glue and it works.
I've configured the VPC security group for the RDS to allow inbound/outbound traffic within the same security group (this is based on this guide: https://docs.aws.amazon.com/glue/latest/dg/setup-vpc-for-glue-access.html).
Thanks in advance, and let me know if you need more information.
I just found out the reason: I didn't specify the port. I don't remember having to put the port in before, but everything works after adding it.
df = spark \
    .read \
    .format('jdbc') \
    .option('url', 'jdbc:postgresql://host:5432/database_name') \
    .option('dbtable', "(SELECT * FROM table WHERE name = 'abcd') AS t") \
    .option('user', 'username') \
    .option('password', 'password') \
    .load()
I have a use case similar to the post How to update a Static Dataframe with Streaming Dataframe in Spark structured streaming. You can take the same data as an example from that post.
static_df = spark.read.schema(schemaName).json(fileName)
streaming_df = spark.readStream(....)
new_reference_data = update_reference_df(streaming_df, static_df)
from pyspark.sql import DataFrame
from pyspark.sql.streaming import StreamingQuery

def update_reference_df(df, static_df):
    query: StreamingQuery = df \
        .writeStream \
        .outputMode("append") \
        .foreachBatch(lambda batch_df, batchId: update_static_df(batch_df, static_df)) \
        .start()
    return query

def update_static_df(batch_df, static_df):
    df1: DataFrame = static_df.union(batch_df.join(static_df,
                                                   (batch_df.SITE == static_df.SITE),
                                                   "left_anti"))
    return df1
I want to know how static_df will get refreshed with the new values from the data processed via foreachBatch, since as far as I know foreachBatch returns nothing (void). I need to use the new values from static_df in further processing. I appreciate your help.
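Not from the original post, but one sketch of a common pattern: since foreachBatch returns nothing, the refreshed DataFrame has to be published through a shared driver-side reference (the foreachBatch function runs on the driver), which downstream code then reads. The reference dict and refresh_static_df name below are hypothetical:
# Sketch (not from the original post): publish the refreshed static_df through
# a shared driver-side reference, because foreachBatch itself returns nothing.
reference = {"static_df": static_df}  # hypothetical holder

def refresh_static_df(batch_df, holder):
    current = holder["static_df"]
    new_rows = batch_df.join(current, batch_df.SITE == current.SITE, "left_anti")
    holder["static_df"] = current.union(new_rows).cache()  # keep the growing reference data in memory

query = streaming_df \
    .writeStream \
    .outputMode("append") \
    .foreachBatch(lambda batch_df, batch_id: refresh_static_df(batch_df, reference)) \
    .start()

# Downstream code always reads the latest snapshot via reference["static_df"].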