I am reading in data from databricks using the following code:
acct = spark.read.csv("/mnt/syn/account/2018-06.csv", inferSchema = True, header=True)
However, I am unable to see the column headers.
The output of printSchema() is as follows:
6d4cd0fe-dd7a-e811-a95c-00224800c9ff:string
5/19/2022 4:25:38 PM1:string
5/19/2022 4:25:38 PM2:string
0:string
14:string
_c5:string
_c6:string
_c7:string
18:string
_c9:string
_c10:string
_c11:string
71775000112:string
930580000:string
_c14:string
_c15:string
_c16:string
117:string
_c18:string
However, when I query the data as a table in Azure Synapse, I get the headers successfully.
I'm pretty sure there is a simple explanation, but I can't think why this is the case with Databricks.
Please follow this sample code:
file_location = "/FileStore/tables/export.csv"
df = spark.read.format("csv") \
.option("inferSchema", "true") \
.option("header", "true") \
.option("sep", ",") \
.load(file_location)
display(df)
For more detail, refer to the official documentation.
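The schema in the question looks like a data row that was used as the header, which would be the case if the file has no header row of its own. The sketch below is a plain-Python approximation (not Spark's actual code) of how Spark names columns from the first row when header is true under default, case-insensitive settings: empty cells become _c<index>, and repeated values get their column index appended. The first_row values are an assumption reconstructed from the printSchema() output above.

```python
def spark_like_header(first_row):
    """Approximate the column names Spark derives from the first CSV row
    when header=True (default, case-insensitive settings)."""
    lowered = [value.lower() for value in first_row if value]
    duplicates = {value for value in lowered if lowered.count(value) > 1}
    names = []
    for index, value in enumerate(first_row):
        if not value:
            # Empty header cell: fall back to a positional name.
            names.append(f"_c{index}")
        elif value.lower() in duplicates:
            # Repeated header value: append the column index to disambiguate.
            names.append(f"{value}{index}")
        else:
            names.append(value)
    return names


# Assumed first data row, reconstructed from the schema in the question.
first_row = [
    "6d4cd0fe-dd7a-e811-a95c-00224800c9ff",
    "5/19/2022 4:25:38 PM",
    "5/19/2022 4:25:38 PM",
    "0",
    "14",
    "",
    "",
    "",
    "18",
]
column_names = spark_like_header(first_row)
print(column_names)
```

If the file really has no header row, reading it with header set to false and assigning the column names explicitly avoids losing the first record to the header.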
I'm trying to connect to my PostgreSQL RDS instance directly from my AWS Glue script. Connecting with the generated code works, but connecting with a JDBC connection does not. Here's the code:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
import pyspark.sql.functions as F
from pyspark.sql.functions import *
## Initialize
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
df = spark \
.read \
.format('jdbc') \
.option('url', 'jdbc:postgresql://host/database_name') \
.option('dbtable', "(SELECT * FROM table WHERE name = 'abcd') AS t") \
.option('user', 'username') \
.option('password', 'password') \
.load()
job.commit()
Part of the error:
An error occurred while calling o74.load. : java.sql.SQLException: [Amazon](500150) Error setting/closing connection: Connection timed out. at com.amazon.redshift.client.PGClient.connect ....
Extra information:
It actually worked before, but I can't figure out what has changed since then.
I've tested the connection with "Test connection" in AWS Glue, and it works.
I've configured the VPC security group for the RDS instance to allow inbound/outbound traffic to the same security group (based on this guide: https://docs.aws.amazon.com/glue/latest/dg/setup-vpc-for-glue-access.html).
Thanks in advance, and let me know if you need more information.
I found the reason: I hadn't specified the port in the JDBC URL. I don't remember having to include it before, but everything works once it is added:
df = spark \
.read \
.format('jdbc') \
.option('url', 'jdbc:postgresql://host:5432/database_name') \
.option('dbtable', "(SELECT * FROM table WHERE name = 'abcd') AS t") \
.option('user', 'username') \
.option('password', 'password') \
.load()
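Because a missing port leaves the choice to whichever driver ends up handling the URL, it can help to check the URL before passing it to Spark. The following is a minimal sketch using only the standard library; the helper names jdbc_host_port and with_port are hypothetical, and 5432 is assumed as the port because it is the one used in the fix above.

```python
from urllib.parse import urlsplit, urlunsplit


def jdbc_host_port(jdbc_url):
    """Return (host, port) from a JDBC URL; port is None when it is omitted."""
    if not jdbc_url.startswith("jdbc:"):
        raise ValueError("not a JDBC URL: " + jdbc_url)
    parts = urlsplit(jdbc_url[len("jdbc:"):])
    return parts.hostname, parts.port


def with_port(jdbc_url, default_port=5432):
    """Return the URL unchanged if it has a port, else add default_port."""
    _, port = jdbc_host_port(jdbc_url)
    if port is not None:
        return jdbc_url
    parts = urlsplit(jdbc_url[len("jdbc:"):])
    patched = parts._replace(netloc=f"{parts.netloc}:{default_port}")
    return "jdbc:" + urlunsplit(patched)


original_url = "jdbc:postgresql://host/database_name"
fixed_url = with_port(original_url)
print(jdbc_host_port(original_url))
print(fixed_url)
```

The resulting string can then be passed to the url option exactly as in the working snippet above.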
I'm trying to combine a few technologies to implement a solution at work. Since I'm new to most of them, I sometimes get stuck, but so far I have found solutions to some of the problems I've faced. Right now both objects run on Spark, but I can't identify why the streaming is not working.
Maybe it is the way Redis implements its sink on the stream-writing side, or maybe it is the way I'm trying to do the job. Almost all of the streaming examples I found are Spark samples, such as streaming text or TCP, and the only solutions I found for relational databases are based on Kafka Connect, which I can't use right now because the company doesn't have the Oracle option for CDC on Kafka.
My scenario is as follows: build an Oracle -> Redis Stream -> MongoDB Spark application.
I built my code based on the spark-redis examples and used the sample code to try to implement a solution for my case. I load the Oracle data day by day and send it to a Redis stream, from which it will later be extracted and saved to MongoDB. Right now the sample below only tries to read from the stream and show the result on the console, but nothing is shown.
The little 'trick' I tried was to create a CSV directory, read from it, take the date from the CSV and use it to query the Oracle database, and then save the Oracle DataFrame to Redis with the foreachBatch command. The data is saved, but I think not in the right way, because nothing is received when I use the sample code to read the stream.
Here is the code:
** Writing to Stream **
object SendData extends App {
Logger.getLogger("org").setLevel(Level.INFO)
val oracleHost = scala.util.Properties.envOrElse("ORACLE_HOST", "<HOST_IP>")
val oracleService = scala.util.Properties.envOrElse("ORACLE_SERVICE", "<SERVICE>")
val oracleUser = scala.util.Properties.envOrElse("ORACLE_USER", "<USER>")
val oraclePwd = scala.util.Properties.envOrElse("ORACLE_PWD", "<PASSWD>")
val redisHost = scala.util.Properties.envOrElse("REDIS_HOST", "<REDIS_IP>")
val redisPort = scala.util.Properties.envOrElse("REDIS_PORT", "6379")
val oracleUrl = "jdbc:oracle:thin:@//" + oracleHost + "/" + oracleService
val userSchema = new StructType().add("DTPROCESS", "string")
val spark = SparkSession
.builder()
.appName("Send Data")
.master("local[*]")
.config("spark.redis.host", redisHost)
.config("spark.redis.port", redisPort)
.getOrCreate()
val sc = spark.sparkContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val csvDF = spark.readStream.option("header", "true").schema(userSchema).csv("/tmp/checkpoint/*.csv")
val output = csvDF
.writeStream
.outputMode("update")
.foreachBatch {(df :DataFrame, batchId: Long) => {
val dtProcess = df.select(col("DTPROCESS")).first.getString(0).take(10)
val query = s"""
(SELECT
<FIELDS>
FROM
TABLE
WHERE
DTPROCESS BETWEEN (TO_TIMESTAMP('$dtProcess 00:00:00.00', 'YYYY-MM-DD HH24:MI:SS.FF') + 1)
AND (TO_TIMESTAMP('$dtProcess 23:59:59.99', 'YYYY-MM-DD HH24:MI:SS.FF') + 1)
) Table
"""
val df = spark.read
.format("jdbc")
.option("url", oracleUrl)
.option("dbtable", query)
.option("user", oracleUser)
.option("password", oraclePwd)
.option("driver", "oracle.jdbc.driver.OracleDriver")
.load()
df.cache()
if (df.count() > 0) {
df.write.format("org.apache.spark.sql.redis")
.option("table", "process")
.option("key.column", "PRIMARY_KEY")
.mode(SaveMode.Append)
.save()
}
if ((new DateTime(dtProcess).toLocalDate()).equals(new LocalDate()))
Seq(dtProcess).toDF("DTPROCESS")
.coalesce(1)
.write.format("com.databricks.spark.csv")
.mode("overwrite")
.option("header", "true")
.save("/tmp/checkpoint")
else {
val nextDay = new DateTime(dtProcess).plusDays(1)
Seq(nextDay.toString(DateTimeFormat.forPattern("YYYY-MM-dd"))).toDF("DTPROCESS")
.coalesce(1)
.write.format("com.databricks.spark.csv")
.mode("overwrite")
.option("header", "true")
.save("/tmp/checkpoint")
}
}}
.start()
output.awaitTermination()
}
** Reading from Stream **
object ReceiveData extends App {
Logger.getLogger("org").setLevel(Level.INFO)
val mongoPwd = scala.util.Properties.envOrElse("MONGO_PWD", "bpedes")
val redisHost = scala.util.Properties.envOrElse("REDIS_HOST", "<REDIS_IP>")
val redisPort = scala.util.Properties.envOrElse("REDIS_PORT", "6379")
val spark = SparkSession
.builder()
.appName("Receive Data")
.master("local[*]")
.config("spark.redis.host", redisHost)
.config("spark.redis.port", redisPort)
.getOrCreate()
val processes = spark
.readStream
.format("redis")
.option("stream.keys", "process")
.schema(StructType(Array(
StructField("FIELD_1", StringType),
StructField("PRIMARY_KEY", StringType),
StructField("FIELD_3", TimestampType),
StructField("FIELD_4", LongType),
StructField("FIELD_5", StringType),
StructField("FIELD_6", StringType),
StructField("FIELD_7", StringType),
StructField("FIELD_8", TimestampType)
)))
.load()
val query = processes
.writeStream
.format("console")
.start()
query.awaitTermination()
}
This code writes the DataFrame to Redis as hashes, not to a Redis stream:
df.write.format("org.apache.spark.sql.redis")
.option("table", "process")
.option("key.column", "PRIMARY_KEY")
.mode(SaveMode.Append)
.save()
Spark-redis doesn't support writing to Redis streams out of the box, so the reader, which consumes the stream key process, never receives this data.
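One possible workaround is to append the rows to the stream yourself with the Redis XADD command. The sketch below is in Python rather than the Scala used in the question, and it is an illustration under assumptions, not spark-redis functionality: row_to_fields and send_rows are hypothetical helpers, and FakeStreamClient is a stand-in so the example runs without a Redis server. With the redis-py package, a real redis.Redis client exposes an xadd(name, fields) method that could be passed in instead.

```python
def row_to_fields(row):
    """Convert a row (dict of column name -> value) into stream fields.

    Stream entries hold flat field/value pairs, so values are stringified
    and None values are skipped.
    """
    return {key: str(value) for key, value in row.items() if value is not None}


def send_rows(rows, client, stream_key="process"):
    """Append each row to the stream with XADD and return the entry IDs."""
    return [client.xadd(stream_key, row_to_fields(row)) for row in rows]


class FakeStreamClient:
    """Stand-in for a Redis client; records XADD calls instead of sending them."""

    def __init__(self):
        self.entries = []

    def xadd(self, name, fields):
        self.entries.append((name, fields))
        return f"{len(self.entries)}-0"


client = FakeStreamClient()
rows = [
    {"PRIMARY_KEY": "1", "FIELD_4": 10, "FIELD_5": None},
    {"PRIMARY_KEY": "2", "FIELD_4": 20, "FIELD_5": "x"},
]
entry_ids = send_rows(rows, client)
print(entry_ids)
print(client.entries)
```

In a Spark job this kind of function would typically be called once per partition, creating the real client inside the partition function so that it is not serialized from the driver.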
I have a use case similar to the one in the post "How to update a Static Dataframe with Streaming Dataframe in Spark structured streaming". You can use the same example data as in that post.
static_df = spark.read.schema(schemaName).json(fileName)
streaming_df = spark.readStream(....)
new_reference_data = update_reference_df(streaming_df, static_df)

def update_reference_df(df, static_df):
    query: StreamingQuery = df \
        .writeStream \
        .outputMode("append") \
        .foreachBatch(lambda batch_df, batchId: update_static_df(batch_df, static_df)) \
        .start()
    return query

def update_static_df(batch_df, static_df):
    # Keep the existing rows and add batch rows whose SITE is not yet known.
    df1: DataFrame = static_df.union(batch_df.join(static_df,
                                                   (batch_df.SITE == static_df.SITE),
                                                   "left_anti"))
    return df1
I want to know how static_df will get refreshed with the new values from the data processed via foreachBatch. As far as I know, foreachBatch returns nothing (void). I need to use the new values from static_df in further processing. I appreciate your help.
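Since the callback's return value is discarded, one common pattern is to have the callback update a mutable holder that later code reads from. The sketch below shows only that pattern in plain Python, with lists of dicts standing in for DataFrames; ReferenceHolder and merge_batch are hypothetical names, and the merge mirrors the union plus left_anti join on SITE from the code above. It is an assumption that this fits the use case, not a confirmed answer.

```python
class ReferenceHolder:
    """Mutable holder so a callback that returns nothing can publish updates."""

    def __init__(self, rows):
        self.rows = list(rows)

    def merge_batch(self, batch_rows, key="SITE"):
        # Equivalent of static.union(batch left_anti join static on key):
        # keep every existing row and add batch rows with an unseen key.
        known = {row[key] for row in self.rows}
        new_rows = [row for row in batch_rows if row[key] not in known]
        self.rows = self.rows + new_rows


holder = ReferenceHolder([{"SITE": "A", "VALUE": 1}])


def on_batch(batch_rows, batch_id):
    """Callback with the same shape as a foreachBatch function; returns nothing."""
    holder.merge_batch(batch_rows)


# Simulate two micro-batches arriving.
on_batch([{"SITE": "A", "VALUE": 99}, {"SITE": "B", "VALUE": 2}], 0)
on_batch([{"SITE": "C", "VALUE": 3}], 1)
print(holder.rows)
```

In Spark, holder.rows would hold a DataFrame instead of a list, and code that needs the latest reference data would read holder.rows at the time it runs rather than capturing static_df once.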
I was trying to generate a sequence value in Oracle for the records via a PySpark JDBC read.
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder.master("local") \
        .appName("Seq_num") \
        .getOrCreate()
    jdbcDF = spark.read \
        .format("jdbc") \
        .option("url", "URL") \
        .option("dbtable", "(SELECT SEQ.NEXTVAL FROM DUAL) qry") \
        .option("user", "USER") \
        .option("password", "PWD") \
        .load()
    jdbcDF.show(2)
This gives me an error:
py4j.protocol.Py4JJavaError: An error occurred while calling o50.load.
: java.sql.SQLSyntaxErrorException: ORA-02287: sequence number not allowed here
The alternative was to run the query with cur.execute(query), which would give me tuples.
Is there any other way to achieve this, so that I can save the time spent converting tuples to a DataFrame?
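For context on the error itself: Spark places whatever is passed as dbtable in the FROM clause of the statements it generates, so the sequence call ends up inside a subquery, and Oracle does not allow NEXTVAL there. The sketch below only builds an approximation of such a statement as a string; the exact text Spark sends depends on its version and JDBC dialect, and schema_probe_query is a hypothetical helper name.

```python
def schema_probe_query(dbtable):
    """Approximate the statement Spark issues to resolve the schema of dbtable."""
    return f"SELECT * FROM {dbtable} WHERE 1=0"


dbtable = "(SELECT SEQ.NEXTVAL FROM DUAL) qry"
probe = schema_probe_query(dbtable)
print(probe)
```

The printed statement shows SEQ.NEXTVAL nested inside a derived table, which is the situation ORA-02287 reports.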