Streaming from Azure Eventhub to Databricks delta table - spark-streaming

I am going to stream data through Azure Event Hubs into a Databricks Delta table.
The source is a very simple .NET C# console program that generates random JSON payloads and sends them to Azure Event Hubs. The event hub has 10 TUs (throughput units).
The source generates about 10 events per second.
A Databricks notebook contains code like the following:
import org.apache.spark.eventhubs.EventHubsConf
import org.apache.spark.sql.functions.get_json_object

// Event Hubs source configuration
val customEventhubParameters =
  EventHubsConf(connStr.toString())
    .setMaxEventsPerTrigger(5)
val incomingStream = spark.readStream
  .format("eventhubs")
  .options(customEventhubParameters.toMap)
  .option("maxBytesPerTrigger", 100000)
  .load()
// Extract the JSON fields from the event body
var outputstream = incomingStream.select(
  get_json_object($"body".cast("string"), "$.DeviceID").alias("DeviceID"),
  get_json_object($"body".cast("string"), "$.detail").alias("detail"),
  get_json_object($"body".cast("string"), "$.salesnumber").alias("Salesnumber").cast("Int"),
  get_json_object($"body".cast("string"), "$.time").alias("Time"))
and the write stream:
import org.apache.spark.sql.streaming.Trigger
outputstream.writeStream
  .format("delta")
  .outputMode("append")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .option("checkpointLocation", "/temp/checkpoint")
  .start("/temp/eventdata")
  .awaitTermination()
So it runs continuously at a 10-second interval, and I found that it only writes 5 events to the Delta table at a time (which I think is far less than the generation rate of the C# program). How can I make it write larger batches to the Delta table?
It also leaves the Delta table with many small files.
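The numbers in the question can be checked with a little arithmetic. The sketch below is plain Java, independent of Spark; the class and method names are made up for illustration. It assumes that setMaxEventsPerTrigger caps the number of events read per micro-batch, which matches the observed 5 events per trigger, and computes the smallest cap that would keep up with the producer.

```java
// Hypothetical helper, not part of Spark or the Event Hubs connector.
final class TriggerSizing {
    /** Smallest per-trigger cap that keeps up with the producer. */
    static long minEventsPerTrigger(long eventsPerSecond, long triggerSeconds) {
        return eventsPerSecond * triggerSeconds;
    }

    /** Events left unread after each trigger when the cap is too small. */
    static long backlogGrowthPerTrigger(long eventsPerSecond, long triggerSeconds, long cap) {
        return Math.max(0L, minEventsPerTrigger(eventsPerSecond, triggerSeconds) - cap);
    }

    public static void main(String[] args) {
        // Figures from the question: about 10 events/s, 10 s trigger, cap of 5.
        System.out.println(minEventsPerTrigger(10, 10));        // 100
        System.out.println(backlogGrowthPerTrigger(10, 10, 5)); // 95
    }
}
```

Under these assumptions, a cap of at least 100 (or no explicit cap) would let each 10-second batch drain what was produced in that interval, so each file written per batch would hold more events.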

Related

Once in a while Spark Structured Streaming write stream is getting IllegalStateException: Race while writing batch 4

I have multiple queries running on the same spark structured streaming session.
The queries are writing parquet records to Google Bucket and checkpoint to Google Bucket.
val query1 = df1
  .select(col("key").cast("string"), from_json(col("value").cast("string"), schema, Map.empty[String, String]).as("data"))
  .select("key", "data.*")
  .writeStream.format("parquet").option("path", path).outputMode("append")
  .option("checkpointLocation", checkpoint_dir1)
  .partitionBy("key") /*.trigger(Trigger.ProcessingTime("5 seconds"))*/
  .queryName("query1").start()
val query2 = df2
  .select(col("key").cast("string"), from_json(col("value").cast("string"), schema, Map.empty[String, String]).as("data"))
  .select("key", "data.*")
  .writeStream.format("parquet").option("path", path).outputMode("append")
  .option("checkpointLocation", checkpoint_dir2)
  .partitionBy("key") /*.trigger(Trigger.ProcessingTime("5 seconds"))*/
  .queryName("query2").start()
Problem: Sometimes the job fails with java.lang.IllegalStateException: Race while writing batch 4
Logs:
Caused by: java.lang.IllegalStateException: Race while writing batch 4
at org.apache.spark.sql.execution.streaming.ManifestFileCommitProtocol.commitJob(ManifestFileCommitProtocol.scala:67)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:187)
... 20 more
20/07/24 19:40:15 INFO SparkContext: Invoking stop() from shutdown hook
This error occurs because two writers are writing to the same output path. The file streaming sink doesn't support multiple writers; it assumes that only one writer writes to a given path.
To fix this, make each query use its own output directory. When reading the data back, you can load each output directory and union them.
You can also use a streaming sink that supports multiple concurrent writers, such as the Delta Lake library. It's also supported on Google Cloud: https://cloud.google.com/blog/products/data-analytics/getting-started-with-new-table-formats-on-dataproc . This link has instructions on how to use Delta Lake on Google Cloud. It doesn't mention the streaming case, but all you need to do is change format("parquet") to format("delta") in your code.
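One way to catch this class of mistake before the queries start is a small pre-flight check that no two queries share an output path. This is only a sketch in plain Java: the class, the method and the example paths are hypothetical, and it compares path strings exactly, so "out" and "out/" would count as different paths.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical pre-flight check; names and paths are illustrative only.
final class SinkPathCheck {
    /** Returns an output path that is used by more than one query, if any. */
    static Optional<String> firstSharedPath(Map<String, String> outputPathByQuery) {
        Map<String, String> ownerByPath = new HashMap<>();
        for (Map.Entry<String, String> e : outputPathByQuery.entrySet()) {
            // putIfAbsent returns the earlier owner when the path is already taken.
            String previousOwner = ownerByPath.putIfAbsent(e.getValue(), e.getKey());
            if (previousOwner != null) {
                return Optional.of(e.getValue());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, String> shared = new LinkedHashMap<>();
        shared.put("query1", "gs://example-bucket/out");
        shared.put("query2", "gs://example-bucket/out");
        System.out.println(firstSharedPath(shared).isPresent()); // true

        Map<String, String> separate = new LinkedHashMap<>();
        separate.put("query1", "gs://example-bucket/out/query1");
        separate.put("query2", "gs://example-bucket/out/query2");
        System.out.println(firstSharedPath(separate).isPresent()); // false
    }
}
```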

Flink Hadoop Bucketing Sink performances with many parallel buckets

I'm investigating the performance of a Flink job that transports data from Kafka to an S3 sink.
We are using a BucketingSink to write Parquet files. The bucketing logic splits the messages into one folder per type of data, tenant (customer), date-time, extraction ID, and so on. As a result, each file is stored in a folder structure 9-10 levels deep (s3_bucket:/1/2/3/4/5/6/7/8/9/myFile...)
When the data arrives as bursts of messages per tenant and type, write performance is good, but when it is spread like white noise across thousands of tenants, dozens of data types and multiple extraction IDs, performance drops dramatically (on the order of 300x).
Attaching a debugger, it seems the issue is connected to the number of handlers open at the same time on S3 to write data. More specifically:
Researching the Hadoop libraries used to write to S3, I found some settings that looked like possible improvements:
<name>fs.s3a.connection.maximum</name>
<name>fs.s3a.threads.max</name>
<name>fs.s3a.threads.core</name>
<name>fs.s3a.max.total.tasks</name>
But none of these made a big difference in throughput.
I also tried flattening the folder structure to write to a single key (like 1_2_3_...), but this didn't bring any improvement either.
Note: The tests have been done on Flink 1.8 with the Hadoop FileSystem (BucketingSink), writing to S3 using the hadoop fs libraries 2.6.x (as we use Cloudera CDH 5.x for savepoints), so we can't switch to StreamingFileSink.
Following the suggestion from Kostas in https://lists.apache.org/thread.html/50ef4d26a1af408df8d9abb70589699cb6b26b2600ab6f4464e86ea4%40%3Cdev.flink.apache.org%3E , the culprit of the slow-down turned out to be this piece of code:
https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-filesystem/src/main/java/org/apache/flink/streaming/connectors/fs/bucketing/BucketingSink.java#L543-L551
This alone takes around 4-5 secs, with a total of 6 secs to open the file. Logs from an instrumented call:
2020-02-07 08:51:05,825 INFO BucketingSink - openNewPartFile FS verification
2020-02-07 08:51:09,906 INFO BucketingSink - openNewPartFile FS verification - done
2020-02-07 08:51:11,181 INFO BucketingSink - openNewPartFile FS - completed partPath = s3a://....
This, together with the bucketing sink's default 60-second inactivity rollover
https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-filesystem/src/main/java/org/apache/flink/streaming/connectors/fs/bucketing/BucketingSink.java#L195
means that with more than 10 parallel buckets on a slot, by the time we finish creating the last bucket the first one has already become stale and needs to be rotated, which creates a blocking situation.
We solved this by replacing the BucketingSink.java and deleting the FS check mentioned above:
LOG.debug("Opening new part file FS verification");
if (!fs.exists(bucketPath)) {
    try {
        if (fs.mkdirs(bucketPath)) {
            LOG.debug("Created new bucket directory: {}", bucketPath);
        }
    }
    catch (IOException e) {
        throw new RuntimeException("Could not create new bucket path.", e);
    }
}
LOG.debug("Opening new part file FS verification - done");
The sink works fine without it, and opening a file now takes ~1.2 s.
We also set the default inactivity threshold to 5 minutes. With these changes we can easily handle more than 200 buckets per slot (once the job picks up speed it ingests on all the slots, which postpones the inactivity timeout).
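The bucket limits quoted above follow from a simple ratio. The sketch below is a back-of-the-envelope model in plain Java, not Flink code; the class and method names are illustrative. It assumes buckets on a slot are opened one after another and that the first bucket receives no data while the others are being opened.

```java
// Back-of-the-envelope model; not Flink code. Names are illustrative.
final class BucketBudget {
    /**
     * Number of buckets that can be opened one after another before the
     * first one has been idle for the whole inactivity threshold.
     */
    static long bucketsBeforeFirstGoesStale(long openMillis, long inactivityThresholdMillis) {
        return inactivityThresholdMillis / openMillis;
    }

    public static void main(String[] args) {
        // Before the change: ~6 s to open a file, 60 s inactivity rollover.
        System.out.println(bucketsBeforeFirstGoesStale(6_000, 60_000));  // 10
        // After the change: ~1.2 s to open a file, 5 min inactivity threshold.
        System.out.println(bucketsBeforeFirstGoesStale(1_200, 300_000)); // 250
    }
}
```

Under this model the original setup saturates at about 10 buckets per slot, and the patched setup at about 250, which is in line with the "more than 200 buckets per slot" reported above.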

Cannot find sourceVersion Error while reading from EventHub and writing to delta lake

I am trying to read from an Event Hub and write to two Delta Lake tables. Pseudocode below:
// read from Event Hubs
val inputDF = spark.readStream.format("eventhubs").option("consumerGroup", "myapp").load()
// write to the first Delta Lake table
inputDF.writeStream.format("delta").option("checkpointLocation", "loc1").start("table_1")
// write to the second Delta Lake table
inputDF.writeStream.format("delta").option("checkpointLocation", "loc2").start("table_2")
When I start my job, it fails with the message "Cannot find 'sourceVersion'", shown below:
ERROR: Query termination received for [id=5735eea9-a2c0-42bf-b368-0918985bff3e, runId=88c17d32-d5d9-46b6-bb9c-19f5ab8598c5], with exception: java.lang.IllegalStateException: Cannot find 'sourceVersion' in {"My_EventHub_Event_Name":{"2":25,"5":33,"4":35,"7":33,"1":26,"3":28,"6":30,"0":32}}
at com.databricks.sql.transaction.tahoe.sources.DeltaSourceOffset$.validateSourceVersion(DeltaSourceOffset.scala:91)
at com.databricks.sql.transaction.tahoe.sources.DeltaSourceOffset$.apply(DeltaSourceOffset.scala:74)
at com.databricks.sql.transaction.tahoe.sources.DeltaSource.getBatch(DeltaSource.scala:269)
Any idea how to fix it?

How to write data in real time to HDFS using Flume?

I am using Flume to store sensor data in HDFS. Once the data is received through MQTT, the subscriber posts it in JSON format to the Flume HTTP listener. This currently works fine, but the problem is that Flume does not write to the HDFS file until I stop it (or the file size reaches 128 MB). I am using Hive to apply a schema on read. Unfortunately, the resulting Hive table contains only 1 entry. This is expected, because Flume has not written the newly arriving data to the file (loaded by Hive).
Is there any way to force Flume to write newly arriving data to HDFS in near real time, so that I don't need to restart it or use small files?
Here is my Flume configuration:
# Name the components on this agent
emsFlumeAgent.sources = http_emsFlumeAgent
emsFlumeAgent.sinks = hdfs_sink
emsFlumeAgent.channels = channel_hdfs
# Describe/configure the source
emsFlumeAgent.sources.http_emsFlumeAgent.type = http
emsFlumeAgent.sources.http_emsFlumeAgent.bind = localhost
emsFlumeAgent.sources.http_emsFlumeAgent.port = 41414
# Describe the sink
emsFlumeAgent.sinks.hdfs_sink.type = hdfs
emsFlumeAgent.sinks.hdfs_sink.hdfs.path = hdfs://localhost:9000/EMS/%{sensor}
emsFlumeAgent.sinks.hdfs_sink.hdfs.rollInterval = 0
emsFlumeAgent.sinks.hdfs_sink.hdfs.rollSize = 134217728
emsFlumeAgent.sinks.hdfs_sink.hdfs.rollCount=0
#emsFlumeAgent.sinks.hdfs_sink.hdfs.idleTimeout=20
# Use a channel which buffers events in memory
emsFlumeAgent.channels.channel_hdfs.type = memory
emsFlumeAgent.channels.channel_hdfs.capacity = 10000
emsFlumeAgent.channels.channel_hdfs.transactionCapacity = 100
# Bind the source and sinks to the channel
emsFlumeAgent.sources.http_emsFlumeAgent.channels = channel_hdfs
emsFlumeAgent.sinks.hdfs_sink.channel = channel_hdfs
I think the tricky bit here is that you would like to write data to HDFS in near real time but don't want small files either (for obvious reasons), and this can be difficult to achieve.
You'll need to find optimal balance between the following two parameters:
hdfs.rollSize (Default = 1024) - File size to trigger roll, in bytes (0: never roll based on file size)
and
hdfs.batchSize (Default = 100) - Number of events written to file before it is flushed to HDFS
If your data is not likely to reach 128 MB within the preferred time, you may need to reduce rollSize, but only to the extent that you don't run into the small-files problem.
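To get a feel for that trade-off, here is a rough calculation in plain Java. It is illustrative arithmetic only: the class name, the 200-byte average event size and the rate of 50 events per second are made-up inputs, not figures from the question; only the 128 MB roll size comes from the configuration above.

```java
// Illustrative arithmetic only; the event size and rate are made-up inputs.
final class RollSizing {
    /** Seconds needed to fill one file of rollSizeBytes, rounded up. */
    static long secondsToFill(long rollSizeBytes, long avgEventBytes, long eventsPerSecond) {
        long bytesPerSecond = avgEventBytes * eventsPerSecond;
        return (rollSizeBytes + bytesPerSecond - 1) / bytesPerSecond;
    }

    /** Largest rollSize (in bytes) that still rolls within maxDelaySeconds. */
    static long rollSizeForDelay(long maxDelaySeconds, long avgEventBytes, long eventsPerSecond) {
        return maxDelaySeconds * avgEventBytes * eventsPerSecond;
    }

    public static void main(String[] args) {
        // 128 MB roll size from the question; 200-byte events at 50 events/s are assumptions.
        System.out.println(secondsToFill(134_217_728L, 200, 50)); // 13422 (about 3.7 hours)
        // Roll size that would produce one file roughly every 10 minutes at that rate.
        System.out.println(rollSizeForDelay(600, 200, 50));       // 6000000
    }
}
```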
Since you have not set a batch size in your HDFS sink, you should see an HDFS flush after every 100 records, and once the flushed records together reach 128 MB, the contents should be rolled into a 128 MB file. Is this not happening either? Could you please confirm?
Hope this helps!

S3 Flume HDFS SINK Compression

I am trying to write Flume events to Amazon S3. The events are written to S3 in compressed format. My Flume configuration is given below. I am facing data loss: with this configuration, if I publish 20000 events, I receive only 1000 events and all the other data is lost. But when I disable the rollCount, rollSize and rollInterval settings, all the events are received, but 2000 small files are created. Is there anything wrong in my configuration settings? Should I add any other settings?
injector.sinks.s3_3store.type = hdfs
injector.sinks.s3_3store.channel = disk_backed4
injector.sinks.s3_3store.hdfs.fileType = CompressedStream
injector.sinks.s3_3store.hdfs.codeC = gzip
injector.sinks.s3_3store.hdfs.serializer = TEXT
injector.sinks.s3_3store.hdfs.path = s3n://CID:SecretKey#bucketName/dth=%Y-%m-%d-%H
injector.sinks.s3_1store.hdfs.filePrefix = events-%{receiver}
# Roll when files reach 256M or after 10m, whichever comes first
injector.sinks.s3_3store.hdfs.rollCount = 0
injector.sinks.s3_3store.hdfs.idleTimeout = 600
injector.sinks.s3_3store.hdfs.rollSize = 268435456
#injector.sinks.s3_3store.hdfs.rollInterval = 3600
# Flush data to buckets every 10k events
injector.sinks.s3_3store.hdfs.batchSize = 10000
For starters: if you disable your settings for rollCount, rollSize and so on, Flume reverts to its defaults, which is why you get the small files.
The relevant aspect is this:
injector.sinks.s3_3store.hdfs.batchSize = 10000
It basically tells your sink to collect 10,000 events before flushing. If you reduce that amount, you'll get smaller files too, because S3, in contrast to regular HDFS, doesn't support file appends. Once you flush, the file is closed and a new file is created.
Try to determine how many events your sink receives within a short time frame of a couple of minutes or so, and set that value as your batch size.

Resources