Apache Spark on EC2: massive slowdown on iterations

I have a driver program that runs a set of 5 experiments - basically, the driver just tells the application which dataset to use (there are 5 and they're very similar).
The first iteration takes 3.5 minutes, the second 6 minutes, the third 30 minutes and the fourth has been running for over 30 minutes.
After each run the SparkContext object is stopped and then re-started for the next run. I thought this approach would prevent the slowdown, because I was under the impression that when sc.stop is called the instances are cleared of all their RDD data - at least that's how it works in local mode. The dataset is quite small; according to the Spark UI only 20 MB of data is used across 2 nodes.
Does sc.stop not remove all data from a node? What would cause such a slowdown?

Call sc.stop only after all iterations are complete. Whenever we stop the SparkContext and create a new one, it takes time to reload the Spark configuration and jars and to free the driver port before the next job can execute.
Also, using the --executor-memory config you can speed up the process, depending on how much memory you have on each node.
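A minimal PySpark sketch of this suggestion: create one SparkContext, set the executor memory on it, reuse it for all five experiments, and stop it only at the end. The dataset paths and the commented-out run_experiment() call are hypothetical placeholders, not from the original question.

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("experiments")
        .set("spark.executor.memory", "4g"))   # same effect as --executor-memory 4g

sc = SparkContext(conf=conf)

datasets = ["s3://my-bucket/dataset-%d" % i for i in range(1, 6)]  # hypothetical paths
for path in datasets:
    rdd = sc.textFile(path).cache()
    # run_experiment(rdd)   # placeholder for the per-dataset logic
    rdd.unpersist()         # release cached blocks before the next iteration

sc.stop()                   # stop only once, after all iterations
```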

Stupidly, I had used T2 instances. Their burstable performance means they only run at full power for a short amount of time. Read the documentation thoroughly - lesson learnt!

Related

How to improve AWS Glue's performance?

I have a simple job on AWS that takes more than 25 minutes. I changed the number of DPUs from 10 to 100 (the max allowed), the job still takes 13 minutes.
Any other suggestions on improving the performance?
I've noticed the same behavior.
My understanding is that the job time includes spinning up an EMR cluster, which takes several minutes. So if that takes, say, 8 minutes (just a guess), then your actual job time went from 17 minutes to 5.
Unless CPU or memory was a bottleneck for your existing job, adding more DPUs (i.e. more CPU and memory) wouldn't benefit your job significantly. At least the benefits will not be linear, i.e. 10 times more DPU doesn't mean that the job will run 10 times faster.
I suggest that you gradually increase the number of DPUs to look at performance gains, and you will notice that after a certain point adding more DPUs doesn't have a major impact on performance and that probably is the right amount of DPUs for your job.
Can we take a look at your job? Sometimes simple may not be performant. We've found that seemingly simple things like the DynamicFrame.map transformation are really slow, and you might be better off registering a temp table and mapping your data with the SQLContext instead.
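A hedged PySpark sketch of that workaround: instead of a row-by-row DynamicFrame.map, convert the DynamicFrame to a Spark DataFrame, register a temp view, and express the mapping in SQL. The database, table, and column names here are made up for illustration.

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Hypothetical catalog entries
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_db", table_name="my_table")

# Temp table + SQL instead of DynamicFrame.map
dyf.toDF().createOrReplaceTempView("tmp_events")
mapped = spark.sql("""
    SELECT id,
           upper(country) AS country,
           amount * 100   AS amount_cents
    FROM tmp_events
""")

result = DynamicFrame.fromDF(mapped, glue_context, "result")
```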

Pipeline inputs 8 billion lines from GCS and does a GroupByKey to prevent fusion, group step running very slow

I read 8 billion lines from GCS, do processing on each line, then write output. My processing step can take a while, so to avoid worker leases expiring and getting the error below, I do a GroupByKey over the 8 billion lines, grouping by id, to prevent fusion.
A work item was attempted 4 times without success. Each time the
worker eventually lost contact with the service. The work item was
attempted on:
The problem is that the GroupByKey step is taking forever to complete for 8 billion lines, even on 1000 high-mem-2 nodes.
I looked into one possible cause of slow processing: a large set of values generated per key by GroupByKey. I don't think that's possible, because out of the 8 billion inputs a single id cannot appear more than 30 times. So hot keys are clearly not the problem here; something else is going on.
Any ideas on how to optimize this are appreciated. Thanks.
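For reference, a rough Apache Beam (Python SDK) sketch of the fusion-break pattern described in the question: key each line by id, GroupByKey, then flatten the groups again before the expensive per-line processing. The bucket paths, extract_id(), and process_line() are placeholders, not the asker's actual code.

```python
import apache_beam as beam

def extract_id(line):
    return line.split(",")[0]          # hypothetical: id is the first field

def process_line(line):
    return line                        # placeholder for the real per-line processing

with beam.Pipeline() as p:
    (p
     | "Read"        >> beam.io.ReadFromText("gs://my-bucket/input-*")   # hypothetical path
     | "KeyById"     >> beam.Map(lambda line: (extract_id(line), line))
     | "BreakFusion" >> beam.GroupByKey()
     | "Flatten"     >> beam.FlatMap(lambda kv: kv[1])   # emit the grouped lines again
     | "Process"     >> beam.Map(process_line)
     | "Write"       >> beam.io.WriteToText("gs://my-bucket/output"))
```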
I did manage to solve this problem. There were a number of incorrect assumptions on my part about Dataflow wall times. I was looking at my pipeline and assumed the step with the highest wall time (which was in days) was the bottleneck. But in Apache Beam a step is usually fused together with the steps downstream of it in the pipeline, and it will only run as fast as the downstream step runs. So a significant wall time is not enough to conclude that a step is the bottleneck in the pipeline. The real solution to the problem stated above came from this thread: I reduced the number of nodes my pipeline runs on and changed the node type from high-mem-2 to high-mem-4. I wish there was an easy way to get memory usage metrics for a Dataflow pipeline; I had to SSH into the VMs and run jmap.

Recovery techniques for Spark Streaming scheduling delay

We have a Spark Streaming application that has essentially zero scheduling delay for hours, but then suddenly it jumps up to multiple minutes and spirals out of control. This happens after a while even if we double the batch interval.
We are not sure what causes the delay to happen (theories include garbage collection). The cluster has generally low CPU utilization regardless of whether we use 3, 5 or 10 slaves.
We are really reluctant to further increase the batch interval, since the delay is zero for such long periods. Are there any techniques to improve recovery time from a sudden spike in scheduling delay? We've tried seeing if it will recover on its own, but it takes hours if it even recovers at all.
Open the batch links and identify which stages are delayed. Is there any external access to other DBs/applications that is impacting this delay?
Go into each job and look at the data/records processed by each executor; you may find problems there.
There may be skew across data partitions as well. If the application is reading data from Kafka and processing it, there can be skew across cores if the partitioning is not well defined. Tune the parameters: number of Kafka partitions, number of RDD partitions, number of executors, and number of executor cores.
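A hedged PySpark Streaming sketch of that tuning idea, assuming the older DStream Kafka integration: read from Kafka and repartition the stream so records are spread evenly across executor cores instead of following a skewed Kafka partitioning. The topic, broker, batch interval, and partition count are all made-up example values.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="streaming-skew-demo")
ssc = StreamingContext(sc, batchDuration=10)          # 10-second batches (example value)

stream = KafkaUtils.createDirectStream(
    ssc, ["events"], {"metadata.broker.list": "broker1:9092"})   # hypothetical topic/broker

# One RDD partition per Kafka partition by default; repartition to match total cores
# (e.g. 10 executors x 4 cores) if the Kafka partitioning is skewed.
balanced = stream.repartition(40)
balanced.foreachRDD(lambda rdd: print("records:", rdd.count()))

ssc.start()
ssc.awaitTermination()
```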

Individual Spark tasks consume more computation time when more cores are assigned

I am running a Spark job with an input file of 6.6 GB (on HDFS) with master set to local. My Spark job with 53 partitions completes quickly when I assign local[6] rather than local[2]; however, individual tasks take more computation time when more cores are assigned. Say, if I assign 1 core (local[1]), each task takes 3 seconds, whereas the same task goes up to 12 seconds if I assign 6 cores (local[6]). Where does the time get wasted? The Spark UI shows an increase in computation time for each task in the local[6] case, and I can't understand why the same code takes different computation time when more cores are assigned.
Update:
I can see more %iowait in the iostat output when I use local[6] than local[1]. Please let me know whether this is the only reason or whether there are other possible reasons. I also wonder why this iowait is not reported in the Spark UI; the increase in computing time is larger than the iowait time.
I am assuming you are referring to spark.task.cpus and not spark.cores.max.
With spark.task.cpus, each task gets assigned more cores, but it doesn't necessarily use them. If your process is single-threaded it really can't use them. You wind up with additional overhead without additional benefit, and those cores are taken away from other single-threaded tasks that could use them.
With spark.cores.max it is simply an overhead issue from transferring data around at the same time.
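A minimal sketch of the two settings discussed above, for a PySpark application. The values are examples only; local[N] in the question plays the role of the total core count.

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("local[6]")            # total cores available to the application
        .set("spark.task.cpus", "1")      # cores reserved per task; >1 only helps if
                                          # the task itself is multi-threaded
        .set("spark.cores.max", "6"))     # total-core cap on a standalone/Mesos cluster
                                          # (ignored in local mode)

sc = SparkContext(appName="core-settings-demo", conf=conf)
print(sc.defaultParallelism)              # should report 6 for local[6]
sc.stop()
```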

Performance issue for batch insertion into MarkLogic

I have a requirement to insert 10,000 docs into MarkLogic in less than 10 seconds.
I tested on a single-node MarkLogic server in the following way:
use xdmp:spawn to pass each doc-insertion task to the task server;
use xdmp:document-insert without specifying a forest explicitly;
the task server has 8 threads to process tasks;
we have CPF enabled.
The performance is very bad: it took 2 minutes to finish creating the 10,000 docs.
I'm sure the performance will be better if I tested it in a cluster environment, but I'm not sure whether it can finish in less than 10 seconds.
Please advise on how to improve the performance.
I would start by gathering more information. What version of MarkLogic is this? What OS is it running on? What's the CPU? RAM? What's the storage subsystem? How many forests are attached to the database?
Then gather OS-level metrics, to see if one of the subsystems is an obvious bottleneck. For now I won't speculate beyond that.
If you need a fast load, I wouldn't use xdmp:spawn for each individual document, nor use CPF. That said, 2 minutes for 10k docs doesn't necessarily sound slow. On the other hand, I have reached up to 3k docs/sec, but without range indexes, transforms, or anything of the sort, and with a very fast disk (e.g. SSD).
HTH!
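The answers here advise against spawning a task per document but don't prescribe a specific alternative loading path. Purely as a hedged illustration (not necessarily what the answerers had in mind; the host, port, credentials, and URIs are made up), this sketch pushes documents from the client over the MarkLogic REST API with a small thread pool instead of one xdmp:spawn per document.

```python
from concurrent.futures import ThreadPoolExecutor

import requests
from requests.auth import HTTPDigestAuth

BASE = "http://localhost:8000/v1/documents"      # default REST app server port (assumed)
AUTH = HTTPDigestAuth("admin", "admin")           # hypothetical credentials

def put_doc(i):
    # PUT a single JSON document at a known URI
    r = requests.put(BASE,
                     params={"uri": "/load-test/%d.json" % i},
                     auth=AUTH,
                     json={"id": i, "payload": "example"})
    r.raise_for_status()

# Client-side parallelism instead of one spawned task per document
with ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(put_doc, range(10000)))
```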
Assuming a 2-socket server, 128-256 GB of RAM, and fast IO (400-800 MB/sec sustained):
An appropriate number of forests (12 primary, or 6 primary/6 secondary)
More than 8 threads, assuming enough cores
CPF off
Turn on performance history, look in the metrics, and you will see where the bottleneck is.
SSD is not required - just IO throughput, which multiple spinning disks provide without issue.
