Apache Spark - How to avoid failing slow tasks - performance

Dear fellow Apache Spark enthusiasts,
I recently kicked off a side project with the goal of turning a couple of ODROID XU4 computers into a standalone Spark cluster.
After setting up the cluster I ran into a problem that seems to be specific to heterogeneous multiprocessors. Spark executor tasks run extremely slowly on the XU4 when using all 8 processors. The reason, as mentioned in a comment on my post below, is that Spark does not wait for the executors that have been started on the slow processors.
http://forum.odroid.com/viewtopic.php?f=98&t=21369&sid=4276f7dc89a8d7825320e7f705011326&p=152415#p152415
One solution is to use fewer executor cores and to set the CPU affinity so that the LITTLE processors are not used. This is, however, a less-than-ideal solution.
Is there a way to ask Spark to wait a bit longer for feedback from the slower executors? Obviously waiting too long would have a negative effect on performance; the positive effect of utilising all cores should, however, balance out that negative effect.
Thanks in advance for any help!

@Dikei's response highlights two potential causes, but it turns out the problem is not the one he suspects. I have the same setup as @TJVR, and it turns out the driver is missing heartbeats from executors. To address this, I added the following to spark-env.sh:
export SPARK_DAEMON_JAVA_OPTS="-Dspark.worker.timeout=600 -Dspark.akka.timeout=200 -Dspark.shuffle.consolidateFiles=true"
export SPARK_JAVA_OPTS="-Dspark.worker.timeout=600 -Dspark.akka.timeout=200 -Dspark.shuffle.consolidateFiles=true"
This changes the default timeouts for executor heartbeats. I also set spark.shuffle.consolidateFiles to true to improve performance on my ext4 filesystem. These changes to the defaults allowed me to increase the core usage above one without frequently losing executors.
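For completeness, on more recent Spark releases the application-level heartbeat knobs are spark.network.timeout and spark.executor.heartbeatInterval. A minimal sketch of setting them from a Java driver follows; the app name and values are illustrative, and the daemon-level worker timeout still belongs in spark-env.sh as above:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class HeartbeatTimeoutSketch {
    public static void main(String[] args) {
        // Per-application equivalents of the heartbeat-related timeouts on newer
        // Spark releases. Values are illustrative, not tuned for the XU4 cluster.
        SparkConf conf = new SparkConf()
                .setAppName("xu4-heartbeat-test")               // hypothetical app name
                .set("spark.network.timeout", "600s")           // default is 120s
                .set("spark.executor.heartbeatInterval", "30s") // default is 10s
                .set("spark.shuffle.consolidateFiles", "true"); // recognized by older releases only
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... run the job ...
        sc.stop();
    }
}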

Spark does not kill slow executors, but will mark an executor as dead in two cases:
If the driver doesn't receive a heartbeat signal for a while (default: 120s): the executor has to send a heartbeat message regularly (default: every 10s) to notify the driver that it is still alive. Network issues or a long GC pause can prevent these heartbeats from arriving.
The executor has crashed due to an exception in the code or a JVM runtime error, most likely caused by GC pressure as well.
In my opinion, it's probably that GC overhead has killed your slow executor and the driver has had to redo the task on a different executor. If this is the case, you can try splitting your data into smaller partitions, so that each executor has to process less data at a time.
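A minimal sketch of that partition-splitting suggestion, assuming a Java driver; the input path, partition counts, and class name are made up:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SmallerPartitionsSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("repartition-sketch"));
        // Path and partition counts are illustrative, not tuned.
        JavaRDD<String> lines = sc.textFile("hdfs:///data/input", 64); // minPartitions hint
        // More, smaller partitions: each task (and its GC) handles a smaller slice of data.
        JavaRDD<String> smaller = lines.repartition(200);
        System.out.println("partitions: " + smaller.partitions().size());
        sc.stop();
    }
}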
Secondly, you should NOT set spark.speculation to 'true' without testing. It's 'false' by default for a reason; I've seen it do more harm than good in some cases.
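For reference, if you do decide to test it, these are the configuration knobs that control how aggressively Spark speculates, which is also the closest thing to asking Spark to "wait longer" for slow executors. The values below are purely illustrative:

import org.apache.spark.SparkConf;

public class SpeculationKnobsSketch {
    public static void main(String[] args) {
        // Illustrative values only; benchmark before enabling on a real workload.
        SparkConf conf = new SparkConf()
                .set("spark.speculation", "true")
                .set("spark.speculation.interval", "1000ms") // how often to check for stragglers
                .set("spark.speculation.multiplier", "3")    // how much slower than the median counts as slow
                .set("spark.speculation.quantile", "0.9");   // fraction of tasks that must finish first
        System.out.println(conf.toDebugString());
    }
}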
Lastly, the following assumption might not hold true:
"The positive effect of utilising all cores should however balance out the negative effect."
Slow executors (stragglers) can cause the program to perform much worse, depending on the workload. It's entirely possible that avoiding the slow cores will provide the best result.

Related

Apache Nifi slow cluster issue

I am using Apache NiFi for one of my clickstream projects to do some ETL.
I am currently getting traffic of around 300 messages per second with the following infrastructure:
RAM - 16 GB
Swap - 6 GB
CPU - 16 cores
Disk - 100 GB (persistence not required)
Cluster - 6 nodes
The entire cluster UI has become extremely slow, with the following issues:
Processors giving back pressure when some failure happens, which consumes a lot of threads
Provenance writing becomes very slow
Heartbeats across nodes (cluster heartbeat) become slow
I have the following questions on the setup:
Is RPG use recommended? It is an HTTP call, which I am using to spread load across all the nodes, since there is an existing issue with the EMQTT processor for consumer groups.
What is the recommended thread count to allot per core?
What are the guidelines for infrastructure sizing?
What are the tuning parameters for a large cluster with high incoming request rates and a lot of heavy JSON parsing for transformations?
A couple of suggestions:
Yes, RPG usage is recommended; at least from what I've experienced, RPG seems to offer better distribution. Take a look at [3] below.
Some processors are more CPU-intensive than others, so there's no clear-cut answer for what value to set for Concurrent Tasks. This is more of a trial-and-error, test-and-fine-tune approach that you'd have to master. One suggestion: if you set too many Concurrent Tasks for a CPU-intensive processor, it will have a serious impact on the nodes.
Hortonworks has written a detailed guide on this; I've provided the link below. [1]
Some best practices and handy guides:
https://community.hortonworks.com/articles/7882/hdfnifi-best-practices-for-setting-up-a-high-perfo.html
http://ijokarumawak.github.io/nifi/2016/11/22/nifi-jolt/
https://pierrevillard.com/2017/02/23/listfetch-pattern-and-remote-process-group-in-apache-nifi/

Impact of more executors than CPU cores in a Storm cluster

I have started using Apache Storm recently. Right now I am focusing on performance testing and tuning for one of my applications (it pulls data from a NoSQL database, formats it, and publishes it to a JMS queue for consumption by the requester) to enable more parallel request processing at a time. I have been able to tune the topology in terms of changing the number of bolts, MAX_SPOUT_PENDING, etc., and to throttle the data flow within the topology using a tick-based approach.
I wanted to know what happens when we define more parallelism than the number of cores we have. In my case I have a single-node, single-worker topology and the machine has 32 cores, but the total number of executors (for all the spouts and bolts) is 60. So my questions are:
Does this high number really help process requests, or does it actually degrade performance, since I believe there will be more context switching between bolt tasks to utilize the cores?
If I define 20 executors (just a random selection) for a bolt and my code flow never needs to utilize that bolt, will this impact performance? How does Storm handle this situation?
This is a very general question, so the answer is (as always): it depends.
If your load is large and a single executor fully utilizes a core, having more executors cannot give you any throughput improvement. If there is any impact, it might be negative (also with regard to contention on the internally used queues that all executors need to read from and write to for tuple transfer).
If your load is "small" and does not fully utilize your CPUs, it wouldn't matter much either way -- you would not gain or lose anything -- as your cores are not fully utilized, you have some headroom left anyway.
Furthermore, consider that Storm spawns some additional threads within each worker. Thus, if your executors fully utilize your hardware, those threads will also be impacted.
Overall, you should not run your topologies to utilize the cores completely anyway, but leave some headroom for small "spikes" etc. In operation, around 80% CPU utilization might be a good value. As a rule of thumb, one executor per core should be OK.
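For reference, a rough sketch of how the parallelism knobs fit together in the topology setup code. MySourceSpout and FormatBolt stand in for your own spout/bolt classes, the numbers are made up, and on Storm releases before 1.0 the packages are backtype.storm.* instead of org.apache.storm.*:

import org.apache.storm.Config;
import org.apache.storm.topology.TopologyBuilder;

public class ParallelismSketch {
    public static void main(String[] args) {
        // MySourceSpout / FormatBolt are placeholders for the question's own classes.
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("source", new MySourceSpout(), 4);   // 4 executors for the spout
        builder.setBolt("format", new FormatBolt(), 16)       // 16 executors ...
               .setNumTasks(32)                               // ... running 32 tasks (2 per executor)
               .shuffleGrouping("source");

        Config conf = new Config();
        conf.setNumWorkers(1);          // single worker, as in the question
        conf.setMaxSpoutPending(1000);  // topology.max.spout.pending (illustrative)
    }
}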

Recovery techniques for Spark Streaming scheduling delay

We have a Spark Streaming application that has basically zero scheduling delay for hours, but then suddenly it jumps up to multiple minutes and spirals out of control. This happens after a while even if we double the batch interval.
We are not sure what causes the delay to happen (theories include garbage collection). The cluster has generally low CPU utilization regardless of whether we use 3, 5 or 10 slaves.
We are really reluctant to further increase the batch interval, since the delay is zero for such long periods. Are there any techniques to improve recovery time from a sudden spike in scheduling delay? We've tried seeing if it will recover on its own, but it takes hours if it even recovers at all.
Open the batch links and identify which stages are delayed. Is there any external access to other DBs/applications that is impacting this delay?
Go into each job and look at the data/records processed by each executor; you can find problems here.
There may be skew in the data partitions as well. If the application is reading data from Kafka and processing it, the data can be skewed across cores if the partitioning is not well defined. Tune these parameters: the number of Kafka partitions, the number of RDD partitions, the number of executors, and the number of executor cores.
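A minimal sketch of the repartitioning part of that advice, assuming a Java driver where `stream` is the DStream coming from the Kafka source; the executor and core counts are placeholders:

import org.apache.spark.streaming.api.java.JavaDStream;

public class RebalanceSketch {
    // Assuming `stream` is the JavaDStream<String> produced by the Kafka source.
    static JavaDStream<String> rebalance(JavaDStream<String> stream) {
        int executors = 5;        // illustrative
        int coresPerExecutor = 4; // illustrative
        // With the direct Kafka API there is one RDD partition per Kafka partition,
        // so spreading records across all cores may require an explicit repartition.
        return stream.repartition(executors * coresPerExecutor);
    }
}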

Reuse JVM in Hadoop MapReduce jobs

I know we can set the property "mapred.job.reuse.jvm.num.tasks" to reuse the JVM. My questions are:
(1) How do I decide the number of tasks to set here: -1 or some other positive integer?
(2) Is it a good idea to always reuse JVMs and set this property to -1 in MapReduce jobs?
Thank you very much!
If you have very small tasks that definitely run one after another, it is useful to set this property to -1 (meaning that a spawned JVM will be reused an unlimited number of times).
So you spawn only (number of task slots in your cluster available to your job) JVMs instead of (number of tasks) JVMs.
This is a huge performance improvement for small tasks. For long-running jobs, the time spent setting up a new JVM is a very small percentage of the total runtime, so reuse doesn't give you a huge performance boost there.
Also, for long-running tasks it is good to recreate the task process, because issues like heap fragmentation would otherwise degrade your performance.
In addition, if you have medium-length jobs, you could reuse the JVM for just 2-3 tasks as a good trade-off.
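A minimal sketch of how this is typically set in the old (MR1) mapred API, which is where the property applies; the value is only illustrative:

import org.apache.hadoop.mapred.JobConf;

public class JvmReuseSketch {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // -1 means a spawned JVM is reused for an unlimited number of tasks (MR1 only).
        conf.setNumTasksToExecutePerJvm(-1);
        // Equivalent to setting the raw property directly:
        conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
        System.out.println(conf.get("mapred.job.reuse.jvm.num.tasks"));
    }
}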
JVM reuse (only possible in MR1) should help with performance because it removes the startup lag of the JVM, but the gain is only marginal and comes with a number of drawbacks (read: side effects). Most tasks run for a long time (tens of seconds or even minutes), and startup time is not the problem when you look at those task run times. You would like to start a new task with a clean slate. When you reuse a JVM there is a chance that the heap is not completely clean (it is fragmented from the previous runs). The fragmentation can lead to more GCs and nullify all the startup time gains. If there is a memory leak, it could also affect memory usage, etc. So it's better to start a new JVM for the tasks (if the tasks are not unreasonably small). In MR2 (YARN) a new JVM is always started for each task; the exception is uber tasks, which run in the ApplicationMaster's local JVM.

Is there a way to configure timeout for speculative execution in Hadoop?

I have a Hadoop job with tasks that are expected to run for a significant length of time (a few minutes). However, Hadoop starts speculative execution too soon. I do not want to turn speculative execution off completely, but I want to increase the duration Hadoop waits before considering a task for speculative execution. Is there a config option to control this timeout?
Thanks
I don't believe the speculative execution time is currently configurable. On the other hand, there's probably no need to adjust it. Speculative execution is meant to bail you out of slow-running tasks (usually due to degraded hardware performance). If you have cluster resources available such that speculative execution is kicking in, what's the harm in letting it do so? Note that a few minutes is not considered "significant" and is quite normal for medium or larger jobs.
It's also worth noting that while mapper speculative execution is almost always fine and low-overhead for the system, reducer speculative execution can hurt and should probably be disabled. The rationale is that if a mapper is progressing slowly and there are available resources where the data is local (the normal case), there's no shared overhead. If a reducer is performing slowly, starting another attempt of the same task will simply double the network load - normally the most painful part of reducer execution. If the network is what is causing the reducer to be "slow", starting a second attempt only hurts both attempts.
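If you follow that advice, turning reducer speculation off while leaving mapper speculation on is a small configuration change; a sketch (the MR1 property names noted in the comment are the older equivalents):

import org.apache.hadoop.conf.Configuration;

public class SpeculationToggleSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // MR2 (YARN) property names: keep mapper speculation on, turn reducer speculation off.
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", false);
        // MR1 equivalents: mapred.map.tasks.speculative.execution and
        // mapred.reduce.tasks.speculative.execution
        System.out.println(conf.getBoolean("mapreduce.reduce.speculative", true));
    }
}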
If you truly have a use case for adjusting the speculative execution time, it might be worth filing a JIRA at http://issues.apache.org.
Hope this helps.
