Spark Executors hang after Out of Memory - hadoop

I have a Spark application running on EMR (16 nodes: 1 master and 15 core nodes, r3.2xlarge instances). For Spark executor configuration, we use dynamic allocation.
While loading the data into the RDD, I see that sometimes, when there is a huge amount of data (around 700 GB), Spark runs out of memory, but it does not fail the app. Rather, the app just sits there hung. I'm not sure why this happens, but here is my theory:
We use DataFrames, which might be caching things.
The Spark flag spark.dynamicAllocation.cachedExecutorIdleTimeout is set to infinity.
My theory is that Spark might be caching things while creating the DataFrames, but the cache is never relinquished, and this leads to the hang.
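A sketch of how this configuration looks in code, assuming Spark 2.x (the app name is hypothetical, and the timeout comment just documents the value mentioned above):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("emr-load-job")                                // hypothetical name
  .config("spark.dynamicAllocation.enabled", "true")
  // spark.dynamicAllocation.cachedExecutorIdleTimeout is left at infinity (its default),
  // so executors holding cached data are never released.
  .getOrCreate()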
There are a few possible solutions:
Increase the cluster size (worst case)
Figure out a way to add a timeout to the Spark app
Programmatically kill the EMR step (I could not find an API which does this)
Any leads on how to go about this?

There could be two other possibilities: either the partitions are too big, or you have severe skew (the size of the partitions varies a lot).
Try to increase the number of partitions (and hence reduce their size) using repartition. This will randomly reshuffle the data across your executors (good for reducing skew, but slow). Ideally, I like my partitions to be around 64 MB, but it depends on your machines.
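A minimal sketch of that, assuming Spark 2.x and DataFrames; the paths are hypothetical and the partition count just follows the 64 MB rule of thumb (700 GB / 64 MB ≈ 11,000):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("repartition-sketch").getOrCreate()
val df = spark.read.parquet("s3://my-bucket/input/")   // hypothetical ~700 GB input
// 700 GB / 64 MB per partition ≈ 11,000 partitions
val evenlySized = df.repartition(11000)                // full shuffle: spreads data evenly, reduces skew, but is slow
evenlySized.write.parquet("s3://my-bucket/output/")    // hypothetical output path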

Related

On DSE 5.0.5 (Cassandra 3.0.11) which settings would reduce read latencies with constant writes in the background?

We have a 5-node DSE Cassandra cluster and an application whose job is to write asynchronously to keyspace A (which sits on HDDs) and read synchronously from keyspace B (which is on SSDs).
Additional info:
The table on A is using TWCS with 48h windows, while the table on keyspace B is using LCS with default settings
Spark jobs partition reads in chunks of 20h at most
Both tables are using TDE with AES256 keys and 1KB chunks
Azul Zing is being used as the JVM with default settings apart from heap sizing and GC logging
With this scenario alone, the read latencies from keyspace B are fine throughout the day, but every day we have a Spark job that reads from keyspace A and writes to B. The moment the Spark executors "attack" keyspace A, read latencies from keyspace B suffer a bit (the 99th percentile goes from 8-12 ms to 130 ms for a few seconds).
My question is: which cassandra.yaml properties would likely help the most in reducing the read latencies on keyspace B for the moment the Spark job starts? I've been trying different memtable/commitlog settings, but haven't been able to lower the read latency to acceptable levels.
It’s hard to generalize without knowing why your latency suffers; if we could, we’d bake those defaults into the database.
However, I’ll try to guess:
Throttle down concurrent reads so there are fewer concurrent requests - this will trade throughput for more consistent performance.
If your disk is busy, consider smaller compression chunk sizes.
If you’re seeing GC pauses, consider tuning your JVM - the CASSANDRA-8150 JIRA has some good suggestions.
If your sstables-per-read is more than a few, reconsider your data model to keep your partitions from spanning multiple TWCS windows.
Make sure your key cache is enabled. If you can spare the heap, raise it; it may help.
Jeff's answer should be your starting point, but if that doesn't solve it, consider moving your Spark job to an off-peak time. Keep in mind that LCS is optimized for read-heavy tables, but from the moment Spark starts to "migrate" the data, the table using LCS will for some time (until the Spark job finishes) become a write-heavy table. This is an anti-pattern for LCS. I can't know for sure without looking into the server details, but I would say that, due to the sheer number of SSTables created during the Spark job, LCS is not able to keep up with compaction and maintain the usual read latency.
If you can't schedule the Spark job at an off-peak time, then you should consider changing the compaction strategy on keyspace B to STCS.

50-60 GB of data in Spark standalone mode

I am trying to analyze around 50-60 GB of data. I thought of using Spark to do that, but I do not have access to multiple nodes in a cluster. Can this level of processing be done using Spark standalone mode? If yes, I would like to know the estimated time required to process the data. Thanks!
Short answer: yes.
Spark will partition the data into many smaller chunks. In your case, only a few chunks will be processed at a time, and those chunks need to fit in memory (you will have to play with the configuration to get this right).
To summarize: you will be able to do it, but it would be faster if you had more memory/cores so you could process more things in parallel.
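A minimal single-machine sketch, assuming Spark 2.x; the input path is hypothetical and the heap size is something you will have to tune:
// Give the driver enough heap when launching (local mode runs everything in one JVM), e.g.:
// spark-submit --driver-memory 16g your-app.jar   (the 16g is illustrative)
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("local-analysis")
  .master("local[*]")                          // use every core of the single machine
  .getOrCreate()

val data = spark.read.text("/data/input/")     // hypothetical 50-60 GB of input
println(data.count())                          // Spark works through it a few partitions at a time
spark.stop()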

Tasks taking longer over time in Apache Spark

I have a large dataset that I am trying to run with Apache Spark (around 5TB). I have noticed that when the job starts, it retrieves data really fast and the first stage of the job (a map transformation) gets done really fast.
However, after having processed around 500GB of data, that map transformation starts being slow and some of the tasks are taking several minutes or even hours to complete.
I am using 10 machines with 122 GB of RAM and 16 CPUs each, and I am allocating all resources to each of the worker nodes. I thought about increasing the number of machines, but is there anything else I could be missing?
I have tried with a small portion of my data set (30 GB) and it seemed to be working fine.
It seems that the stage gets completed locally on some nodes faster than on others. Based on that observation, here is what I would try:
Cache the RDD that you process. Do not forget to unpersist it when you don't need it anymore (see the sketch after this list).
Understanding caching, persisting in Spark.
Check if the partitions are balanced, which doesn't seem to be the case (that would explain why some local stages complete much earlier than others). Having balanced partitions is the holy grail in distributed computing, isn't it? :)
How to balance my data across the partitions?
Reduce the communication costs, i.e. use fewer workers than you do now, and see what happens. Of course, that heavily depends on your application. Sometimes communication costs become so big that they dominate, so using fewer machines, for example, speeds up the job. However, I would do that only if steps 1 and 2 do not suffice.
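For point 1, a minimal cache/unpersist sketch (the path and the map function are stand-ins for the real job; the master URL comes from spark-submit):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("cache-sketch"))
val records = sc.textFile("hdfs:///data/huge-input")    // hypothetical path
val mapped  = records.map(_.toUpperCase)                // stand-in for the real map transformation
mapped.persist(StorageLevel.MEMORY_AND_DISK)            // keep it around, spilling to disk if RAM runs short
println(mapped.count())                                 // the first action materializes the cache
println(mapped.filter(_.contains("ERROR")).count())     // later actions reuse the cached data
mapped.unpersist()                                      // release it once it is no longer needed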
Without any more info, it would seem that at some point in the computation your data gets spilled to disk because there is no more space in memory.
It's just a guess; you should check your Spark UI.

Spark SQL performance with Simple Scans

I am using Spark 1.4 on a cluster (standalone mode), across 3 machines, for a workload similar to TPC-H (analytical queries with multiple/multi-way large joins and aggregations). Each machine has 12 GB of memory and 4 cores. My total data size is 150 GB, stored in HDFS (as Hive tables), and I am running my queries through Spark SQL using a HiveContext.
After checking the performance tuning documents on the Spark site and some talks from the latest Spark Summit, I decided to set the following configs in my spark-env:
SPARK_WORKER_INSTANCES=4
SPARK_WORKER_CORES=1
SPARK_WORKER_MEMORY=2500M
(As my tasks tend to be long, the overhead of starting multiple JVMs, one per worker, is much less than the total query time.) As I monitor the job progress, I realized that while the worker memory is 2.5 GB, the executors (one per worker) have a max memory of 512 MB (the default). I enlarged this value in my application as:
conf.set("spark.executor.memory", "2.5g");
I was trying to give the maximum available memory on each worker to its only executor, but I observed that my queries run slower than in the previous case (default 512 MB). Changing 2.5g to 1g improved the performance; it is close to, but still worse than, the 512 MB case. I guess what I am missing here is the relationship between SPARK_WORKER_MEMORY and spark.executor.memory.
Isn't it the case that the worker tries to split this memory among its executors (in my case, its only executor)? Or is there other work done by the worker that needs memory?
What other important parameters do I need to look into and tune at this point to get the best response time out of my hardware? (I have read about the Kryo serializer and am about to try it; I am mainly concerned about memory-related settings and knobs related to the parallelism of my jobs.) As an example, for a simple scan-only query, Spark is worse than Hive (almost 3 times slower), while both are scanning the exact same table and file format. That is why I believe I am missing some parameters by leaving them at their defaults.
Any hint/suggestion would be highly appreciated.
SPARK_WORKER_CORES is shared across the instances. Increase the cores to, say, 8; then you should see the kind of behavior (and performance) that you had anticipated.
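Concretely, keeping the layout from the question, the spark-env would then look something like this (treat the exact numbers as a starting point to experiment with rather than a recommendation):
SPARK_WORKER_INSTANCES=4
SPARK_WORKER_CORES=8
SPARK_WORKER_MEMORY=2500M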

Why is Spark fast at word count? [duplicate]

Test case: word counting 6 GB of data in 20+ seconds with Spark.
I understand the MapReduce, FP and stream programming models, but couldn’t figure out why the word counting is so amazingly fast.
I think it’s an I/O-intensive computation in this case, and it seems impossible to scan 6 GB of files in 20+ seconds. I guess some indexing is performed before the word counting, like Lucene does. The magic should be in the RDD (Resilient Distributed Dataset) design, which I don’t understand well enough.
I would appreciate it if anyone could explain RDD for the word-counting case. Thanks!
First is startup time. Hadoop MapReduce job startup requires starting a number of separate JVMs, which is not fast. Spark job startup (on an existing Spark cluster) just has the existing JVMs spawn new task threads, which is many times faster than starting a new JVM.
Next, no indexing and no magic. 6GB file is stored in 47 blocks of 128MB each. Imagine you have a big enough Hadoop cluster that all of these 47 HDFS blocks are residing on different JBOD HDDs. Each of them would deliver you 70 MB/sec scan rate, which means you can read this data in ~2 seconds. With 10GbE network in your cluster you can transfer all of this data from one machine to another in just 7 seconds.
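Spelled out (treating 6 GB as roughly 6000 MB), the arithmetic behind those numbers is:
6000 MB / 128 MB per block ≈ 47 blocks
128 MB / 70 MB/s ≈ 1.8 s to scan one block (with all 47 blocks read in parallel)
6 GB × 8 bits/byte / 10 Gb/s ≈ 5 s of raw network transfer, i.e. on the order of the 7 seconds quoted once protocol overhead is included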
Lastly, Hadoop puts intermediate data on disk a number of times. It puts map output to disk at least once (and more if the map output is big and on-disk merges happen). It puts the data to disk again on the reduce side before the reduce itself is executed. Spark puts the data on HDDs only once, during the shuffle phase, and the reference Spark implementation recommends increasing the filesystem write cache so that this 'shuffle' data does not hit the disks.
All of this gives Spark a big performance boost compared to Hadoop. There is no magic in Spark RDDs related to this question.
Other than the factors mentioned by 0x0FFF, local combining of results also makes Spark run word count more efficiently. Spark, by default, combines results on each node before sending them to other nodes.
In the case of a word count job, Spark calculates the count for each word on a node and then sends the results to the other nodes. This reduces the amount of data to be transferred over the network. To achieve the same functionality in Hadoop MapReduce, you need to specify a combiner class: job.setCombinerClass(CustomCombiner.class)
By using combineByKey() in Spark, you can specify a custom combiner.
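As a sketch, here is a minimal Scala word count where reduceByKey does that per-node combining (the input path is hypothetical); the commented line shows the equivalent spelled out with combineByKey:
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("wordcount-sketch"))
val counts = sc.textFile("hdfs:///data/6gb-text")   // hypothetical path
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)   // counts are combined on each node before the shuffle, like a Hadoop combiner
counts.take(10).foreach(println)

// The same per-node combining, written with combineByKey:
// .combineByKey((v: Int) => v, (acc: Int, v: Int) => acc + v, (a: Int, b: Int) => a + b)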
Apache Spark processes data in memory, while Hadoop MapReduce persists back to disk after a map or reduce action. But Spark needs a lot of memory.
Spark loads the data into memory and keeps it there until further notice, for the sake of caching.
The Resilient Distributed Dataset (RDD) allows you to transparently store data in memory and persist it to disk only when needed.
Since Spark works in memory, there's no synchronisation barrier slowing you down. This is a major reason for Spark's performance.
Rather than just processing a batch of stored data, as is the case with MapReduce, Spark can also manipulate data in real time using Spark Streaming.
The DataFrames API was inspired by data frames in R and Python (Pandas), but was designed from the ground up as an extension to the existing RDD API.
A DataFrame is a distributed collection of data organized into named columns, with richer optimizations under the hood that contribute to Spark's speed.
Using RDDs, Spark simplifies complex operations like join and groupBy; in the backend you're dealing with fragmented (partitioned) data, and that fragmentation is what enables Spark to execute in parallel.
Spark lets you develop complex, multi-step data pipelines using the directed acyclic graph (DAG) pattern. It supports in-memory data sharing across DAGs, so that different jobs can work with the same data. DAGs are a major part of Spark's speed.
Hope this helps.
