Spark Structured Streaming - WAL performance degradation

We have a Spark Structured Streaming query that reads data from Event Hubs, does some processing, and writes the data back to Event Hubs. We have checkpointing enabled, and we store the checkpoint data in Azure Data Lake Storage Gen2.
When we run the query, we see something odd: over time, the query's performance (latency) slowly degrades. When we run the query for the first time, the batch duration is ~3 secs. After a day of running, the batch duration is 20 secs, and after 2 days it reaches 40+ secs. Interestingly, when we delete the checkpoint folder (or otherwise reset the checkpoint), the latency goes back to normal (2 secs).
Looking at the query performance after 2 days of running on the same checkpoint directory, it is quite clear that it is the write-ahead log ("walCommit") that grows and, after some time, accounts for the majority of the processing time.
My questions are: what drives this behaviour? Is it natural for walCommit to take longer and longer? Could it be specific to Azure Data Lake Gen2? Do we even need write-ahead logs for Event Hubs? What are the general ways to improve this, short of disabling the WAL?

I wrote to you via Slack, but I will share the answer here as well.
I've experienced the same behavior; the cause was a leak of hidden .crc files in the checkpoint/offsets directory. It is a Hadoop rename bug, and a workaround for it was added in Spark 2.4.4.
Link to Spark JIRA
If the following find command, executed in the checkpoint directory, returns a number greater than ~1000, you're affected by this bug:
find . -name "*.crc" | wc -l
The workaround for Spark < 2.4.4 is to disable the creation of crc files (suggested in the JIRA comments):
--conf spark.hadoop.spark.sql.streaming.checkpointFileManagerClass=org.apache.spark.sql.execution.streaming.FileSystemBasedCheckpointFileManager --conf spark.hadoop.fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem
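For a checkpoint directory that is reachable as a local or mounted path, the same count can be taken from Python. This is only a minimal sketch: the function name count_crc_files is made up for illustration, and the ~1000 threshold is the rule of thumb from this answer, not an official limit.

```python
import os


def count_crc_files(checkpoint_dir):
    """Count *.crc files (including hidden ones) below checkpoint_dir."""
    total = 0
    for _root, _dirs, files in os.walk(checkpoint_dir):
        # os.walk lists dot-files too, so hidden ".N.crc" entries are counted.
        total += sum(1 for name in files if name.endswith(".crc"))
    return total


if __name__ == "__main__":
    # Assumes the current working directory is the checkpoint directory.
    n = count_crc_files(".")
    print(n, "crc files:", "likely affected" if n > 1000 else "probably not affected")
```

For a checkpoint stored on remote object storage, the directory would first have to be mounted or listed through the storage client; that part is not covered here.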

Thanks @tomas-bartalos for the answer!
We found another issue, which was the real cause of our problem: the properties of Azure Gen2 Storage (with hierarchical namespace enabled). It seems Azure Gen2 is slow when listing a lot of files. We tried to open the streaming checkpoint directory using Azure Explorer and it took about 20 seconds (similar to the walCommit time). We switched to Azure Blob Storage and the problem was gone. We haven't done anything about the crc files (Tomas's answer), so we concluded that the storage mode was the main issue.

Related

Magic committer not improving performance in a Spark3+Yarn3+S3 setup

What I am trying to achieve?
I am trying to enable the S3A magic committer for my Spark 3.3.0 application running on a YARN (Hadoop 3.3.1) cluster, to see performance improvements in my app during S3 writes. IIUC, my Spark application is writing about 21 GB of data with 30 tasks in the corresponding Spark stage (see the image below).
My setup
I have a server that hosts the Spark client. The Spark client submits the application to the YARN cluster in client mode with PySpark.
What I tried
I am using the following config (set via the PySpark SparkConf) to enable the committer:
"spark.sql.sources.commitProtocolClass": "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol"
"spark.sql.parquet.output.committer.class": "org.apache.hadoop.mapreduce.lib.output.BindingPathOutputCommitter"
"spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a": "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory"
"spark.hadoop.fs.s3a.committer.name": "magic"
"spark.hadoop.fs.s3a.committer.magic.enabled": "true"
I also downloaded the spark-hadoop-cloud jar to the jars/ directory of the Spark home on the NodeManagers and on my Spark client servers.
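To make the settings above easier to reuse, they can be kept in one Python dict and rendered as spark-submit arguments. This is a hedged sketch, not the asker's actual code: the names MAGIC_COMMITTER_CONF and to_submit_args are hypothetical, and the key/value pairs are copied from the list above.

```python
# Settings copied from the question; the dict name is illustrative only.
MAGIC_COMMITTER_CONF = {
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
    "spark.sql.parquet.output.committer.class":
        "org.apache.hadoop.mapreduce.lib.output.BindingPathOutputCommitter",
    "spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a":
        "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory",
    "spark.hadoop.fs.s3a.committer.name": "magic",
    "spark.hadoop.fs.s3a.committer.magic.enabled": "true",
}


def to_submit_args(conf):
    """Render a conf dict as a flat list of `--conf key=value` arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(MAGIC_COMMITTER_CONF)))
```

Inside a PySpark program the same pairs could instead be passed one by one through SparkSession.builder.config(key, value).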
Changes that I see after applying the aforementioned configs:
I see PRE __magic/ directory if I run aws s3 ls <write-path> when the job is running.
I don't see the warning WARN AbstractS3ACommitterFactory: Using standard FileOutputCommitter to commit work. This is slow and potentially unsafe. anymore.
A _SUCCESS file gets created with (JSON) content. One of the key-value that I see in that file is "committer" : "magic".
Hence, I believe my configs are getting applied correctly.
What I expect
I have read in multiple articles that this committer is expected to show a performance boost (e.g. this article claims 57-77% time reduction). Hence, I expect to see a significant reduction (from 39s) in the "duration" column of my "parquet" stage when I use the configs shared above.
Some other points that might be of value
When I use "spark.sql.sources.commitProtocolClass": "com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol", my app fails with the error java.lang.ClassNotFoundException: com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol.
I have not looked into enabling S3Guard, as S3 now provides strong consistency.
correct. you don't need S3Guard
the com.hortonworks binding was for the WIP committer work. the binding classes for wiring up spark/parquet are all in spark-hadoop-cloud and have org.apache.spark prefixes. you seem to be ok there
the simple test for which committer is live is to print the JSON _SUCCESS file. if that is a 0-byte file, you are still using the old committer. from your description, it does sound like you are using the new one.
grab the latest spark+hadoop build you can get; there are always ongoing improvements, with hadoop 3.3.5 bringing a big enhancement there.
you should see performance improvements compared to the v1 committer, with commit speed O(files) rather than O(data). it is also correct, which the v1 algorithm doesn't offer on s3 (and which v2 doesn't offer anywhere).
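The _SUCCESS check described in this answer can be sketched in a few lines of Python. This is an illustration under assumptions: the function name committer_from_success is hypothetical, the bytes of the _SUCCESS file are assumed to have been fetched already, and the "committer" key is the one the question reports seeing in that file.

```python
import json
from typing import Optional


def committer_from_success(raw: bytes) -> Optional[str]:
    """Return the committer named in a _SUCCESS file, or None for a 0-byte file."""
    if not raw:
        # A 0-byte marker means the old committer wrote it, per the answer above.
        return None
    # The new committers write a JSON document; the question shows a
    # "committer" key in it. A missing key also yields None.
    return json.loads(raw).get("committer")


if __name__ == "__main__":
    print(committer_from_success(b'{"committer": "magic"}'))
    print(committer_from_success(b""))
```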

Azure Databricks stream fails with StorageException: Could not verify copy source

We have a Databricks job that has suddenly started to consistently fail. Sometimes it runs for an hour, other times it fails after a few minutes.
The inner exception is
ERROR MicroBatchExecution: Query [id = xyz, runId = abc] terminated with error
shaded.databricks.org.apache.hadoop.fs.azure.AzureException: hadoop_azure_shaded.com.microsoft.azure.storage.StorageException: Could not verify copy source.
The job targets a notebook which consumes from event-hub with PySpark structured streaming, calculates some values based on the data, and streams data back to another event-hub topic.
The cluster is a pool with 2 workers and 1 driver running on standard Databricks 9.1 ML.
We've tried restarting the job many times, also with clean input data and a clean checkpoint location.
We struggle to determine what is causing this error.
We cannot see any 403 Forbidden errors in the logs, which are sometimes mentioned on forums as a cause.
Any assistance is greatly appreciated.
The issue was resolved by moving the checkpointing location (used internally by Spark) from standard storage to premium. I don't know why it suddenly started failing after months of running with hardly a hiccup.
Premium storage might be a better place for checkpointing anyway since I/O is cheaper.

HDFS Showing 0 Blocks after cluster reboot

I've set up a small cluster for testing / academic purposes. I have 3 nodes, one of which acts as both namenode and datanode (and secondary namenode).
I've uploaded 60 GB of files (about 6.5 million files) and uploads started to get really slow, so I read on the internet that I could stop the secondary namenode service on the main machine; at the time it had no effect on anything.
After I rebooted all 3 computers, two of my datanodes show 0 blocks (despite showing disk usage in the web interface), even with both namenode services running.
One of the nodes with the problem is the one running the namenode as well, so I am guessing it is not a network problem.
Any ideas on how I can get these blocks to be recognized again (without starting all over again, since it took about two weeks to upload everything)?
Update
About half an hour after another reboot, this showed up in the logs:
2018-03-01 08:22:50,212 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x199d1a180e357c12, containing 1 storage report(s), of which we sent 0. The reports had 6656617 total blocks and used 0 RPC(s). This took 679 msec to generate and 94 msecs for RPC and NN processing. Got back no commands.
2018-03-01 08:22:50,212 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in offerService
java.io.EOFException: End of File Exception between local host is: "Warpcore/192.168.15.200"; destination host is: "warpcore":9000; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException
This was followed by the EOF stack trace. After searching the web I discovered this [http://community.cloudera.com/t5/CDH-Manual-Installation/CDH-5-5-0-datanode-failed-to-send-a-large-block-report/m-p/34420] but I still can't understand how to fix it.
The block report is too big and needs to be split, but I don't know how or where I should configure this. I'm googling...
The problem seems to be low RAM on my namenode. As a workaround, I added more directories to the namenode configuration, as if I had multiple disks, and rebalanced the files manually as instructed in the comments here.
As Hadoop 3.0 reports each disk separately, the datanode was able to report and I was able to retrieve the files. This is an ugly workaround and not for production, but good enough for my academic purposes.
An interesting side effect was the datanode reporting the available disk space multiple times, which could lead to serious problems in production.
It seems a better solution is to use HAR to reduce the number of blocks, as described here and here

When are files closed in HDFS

I'm running into a few issues when writing to HDFS (through Flume's HDFS sink). I think these are caused mostly by IO timeouts, but I'm not sure.
I end up with files that are open for write for a very long time and give the error "Cannot obtain block length for LocatedBlock{... }". It can be fixed if I explicitly recover the lease. I'm trying to understand what could cause this. I've been trying to reproduce this outside Flume but have had no luck yet. Could someone help me understand when such a situation could happen, where a file on HDFS ends up not getting closed and stays like that until manual intervention to recover the lease?
I thought the lease is recovered automatically based on the soft and hard limits. I've tried killing my sample code that is writing to HDFS (I've also tried disconnecting the network to make sure no shutdown hooks are executed) to leave a file open for write, but couldn't reproduce it.
We have had recurring problems with Flume, but it's substantially better with Flume 1.6+. We have an agent running on servers external to our Hadoop cluster with HDFS as the sink. The agent is configured to roll to new files (close current, and start a new one on the next event) hourly.
Once an event is queued on the channel, the Flume agent operates in a transactional manner -- the file is sent, but not dequeued until the agent can confirm a successful write to HDFS.
In the case where HDFS is unavailable to the agent (restart, network issue, etc.) there are files left on HDFS that are still open. Once connectivity is restored, the Flume agent will find these stranded files and either continue writing to them, or close them normally.
However, we have found several edge cases where files seem to get stranded and left open, even after the hourly rolling has successfully renamed the file. I am not sure if this is a bug, a configuration issue, or just the way it is. When it happens, it completely messes up subsequent processing that needs to read the file.
We can find these files with hdfs fsck /foo/bar -openforwrite and can successfully hdfs dfs -mv them then hdfs dfs -cp from their new location back to their original one -- a horrible hack. We think (but have not confirmed) that hdfs debug recoverLease -path /foo/bar/openfile.fubar will cause the file to be closed, which is far simpler.
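As a rough illustration of the two commands mentioned above, the following Python sketch only builds the argument lists; it does not run anything. The helper names are hypothetical, and, as noted above, it is unconfirmed that recoverLease actually closes the file.

```python
def fsck_open_files_cmd(directory):
    """Argument list for listing files still open for write under directory."""
    return ["hdfs", "fsck", directory, "-openforwrite"]


def recover_lease_cmd(path):
    """Argument list for asking HDFS to recover the lease on one file."""
    return ["hdfs", "debug", "recoverLease", "-path", path]


if __name__ == "__main__":
    # Paths are the placeholders used in the text above.
    print(" ".join(fsck_open_files_cmd("/foo/bar")))
    print(" ".join(recover_lease_cmd("/foo/bar/openfile.fubar")))
```

On a machine with the hdfs CLI installed, each list could be passed to subprocess.run to execute it.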
Recently we had a case where we stopped HDFS for a couple minutes. This broke the flume connections, and left a bunch of seemingly stranded open files in several different states. After HDFS was restarted, the recoverLease option would close the files, but moments later there would be more files open in some intermediate state. Within an hour or so, all the files had been successfully "handled" -- my assumption is that these files were reassociated with the agent channels. Not sure why it took so long -- not that many files. Another possibility is that it's pure HDFS cleaning up after expired leases.
I am not sure this is an answer to the question (which is also 1 year old now :-) ) but it might be helpful to others.

Spark 2.0 deprecates 'DirectParquetOutputCommitter', how to live without it?

Recently we migrated from "EMR on HDFS" --> "EMR on S3" (EMRFS with consistent view enabled) and we realized that Spark 'SaveAsTable' (parquet format) writes to S3 were ~4x slower compared to HDFS, but we found a workaround: using the DirectParquetOutputCommitter [1] with Spark 1.6.
Reason for the S3 slowness: we had to pay the so-called Parquet tax [2], where the default output committer writes to a temporary table and renames it later, and the rename operation in S3 is very expensive.
We also understand the risk of using 'DirectParquetOutputCommitter', which is the possibility of data corruption with speculative tasks enabled.
Now with Spark 2.0 this class has been deprecated, and we're wondering what options we have so that we don't have to bear the ~4x slower writes when we upgrade to Spark 2.0. Any thoughts/suggestions/recommendations would be highly appreciated.
One workaround that we can think of is to save on HDFS and then copy to S3 via s3DistCp (any thoughts on how this can be done in a sane way, given that our Hive metastore points to S3?)
It also looks like Netflix has fixed this [3]; any idea when they're planning to open source it?
Thanks.
[1] - https://github.com/apache/spark/blob/21d5ca128bf3afd5c2d4c7fcc56240e28443474f/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/DirectParquetOutputCommitter.scala
[2] - https://www.appsflyer.com/blog/the-bleeding-edge-spark-parquet-and-s3/
[3] - https://www.youtube.com/watch?v=85sew9OFaYc&feature=youtu.be&t=8m39s and http://www.slideshare.net/AmazonWebServices/bdt303-running-spark-and-presto-on-the-netflix-big-data-platform
You can use: sparkContext.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")
Since you are on EMR, just use s3 (no need for s3a).
We are using Spark 2.0 and writing Parquet to S3 pretty fast (about as fast as HDFS).
If you want to read more, check out the JIRA ticket SPARK-10063.
I think the S3 committer from Netflix is already open sourced at: https://github.com/rdblue/s3committer.

Resources