Spark "ExecutorLostFailure" - how to solve? - memory-management

I've checked out some of the other answers on "ExecutorLostFailure" and most of them either:
** 1. Don't have an answer**
** 2. Insist on increasing the executor memory and the number of cores **
Here are some of the ones that I'm referring to: here here here
Is there any other solution to this? I've tried both, but it's unclear to me how to correctly gauge how much to allocate for each (memory and cores) in my SparkContext.
The error occurs within a saveAsTextFile action. Thanks.

From my experience, increasing the executor memory can help. But I'd suggest that this is a naive fix, and usually the underlying issue will remain.
The reason I say this is that one of Spark's most important features is that it allows you to perform computations on datasets that are too big to fit in memory. In theory, you could perform most calculations on a 1TB dataset with a single executor with 2GB memory.
In every case that I've encountered an OOM, it has been one of the following two reasons:
1. Insufficient executor memory overhead
This only applies if you are using a resource manager like Mesos or YARN). Check the Spark docs for guidance with this.
2. Something you are doing in your transformations is causing your RDD to become massively "horizontal".
Recall that I said Spark can handle datasets that are too big to fit in memory. The caveat to this is that the datasets must be vertically parallelizable - think a text file with 10^8 rows, where each row contains a relatively small data point (e.g. list of floats, JSON string, a single sentence etc.). Spark will then partition your dataset and send an appropriate number of rows to each executor.
The problem arises when a single row is very large. This is unlikely to occur through normal map-like operations (unless you are doing something quite weird), but is very easy to do through aggregation-type operations like groupByKey or reduceByKey. Consider the following example:
Dataset (name, age):
John 30
Kelly 36
Steve 48
Jane 36
If I then do a groupByKey with the age as key, I will get data in the form:
36 [Kelly, Jane]
30 [John]
48 [Steve]
If the number of rows in the initial dataset is very large, the rows in the resulting dataset could be very long. If they are long enough, they may be too large to fit into executor memory.
The solution?
It depends on your application. In some cases, it may indeed be unavoidable, and you may just have to increase executor memory. But usually it's possible to restructure your algorithm to avoid the issue, e.g. by replacing a groupByKey with a countByKey, or throwing away data points with a very high incidence rate (in one case I observed, it was a bot generating millions of requests that was responsible for the issue. These could be safely discarded without affecting the analysis).

Related

Physical memory usage keeps increasing for Spark application on YARN

I am running a Spark application in YARN-client mode with six executors (each four cores and executor memory = 6 GB and Overhead = 4 GB, Spark version: 1.6.3 / 2.1.0).
I find that my executor memory keeps increasing until getting killed by the node manager; and it gives out the information that tells me to boost spark.yarn.excutor.memoryOverhead.
I know that this parameter mainly control the size of memory allocated off-heap. But I don’t know when and how the Spark engine will use this part of memory. Also increasing that part of memory does not always solve my problem. Sometimes it works and sometimes not. It trends to be useless when the input data is large.
FYI, my application’s logic is quite simple. It means to combine the small files generated in one single day (one directory one day) into a single one and write back to HDFS. Here is the core code:
val df = spark.read.parquet(originpath)
.filter(s"m = ${ts.month} AND d = ${ts.day}")
.coalesce(400)
val dropDF = df.drop("hh").drop("mm").drop("mode").drop("y").drop("m").drop("d")
dropDF.repartition(1).write
.mode(SaveMode.ErrorIfExists)
.parquet(targetpath)
The source file may have hundreds to thousands level’s partition. And the total parquet file is around 1 to 5 GB.
Also I find that in the step that shuffle reading data from different machines, the size of shuffle read is about four times larger than the input size, Which is wired or some principle I don’t know.
Anyway, I have done some search myself for this problem. Some article said that it’s on the direct buffer memory (I don’t set myself).
Some article said that people solve it with more frequent full GC.
Also, I find one people on Stack Overflow with a very similar situation: Ever increasing physical memory for a Spark application in YARN
This guy claimed that it’s a bug with parquet, but a comment questioned him. People in this mail list may also receive an email hours ago from blondowski who described this problem while writing JSON: Executors - running out of memory
So it looks like to be common question for different output format.
I hope someone with experience about this problem could make an explanation about this issue. Why does this happen and what is a reliable way to solve this problem?
I just do some investigation in these days with my colleague. Here is my thought: from spark 1.2, we use Netty with off-heap memory to reduce GC during shuffle and cache block transfer. In my case, if I try to increase the memory overhead big enough. I will get the Max direct buffer exception. When Netty do block transferring, there will be five threads by default to grab the data chunk to target executor. In my situation, one single chunk is too big to fit into the buffer. So gc won’t help here. My final solution is to do another repartition before the repartition(1). Just to make 10x times more partitions than original’s. In this way, I can reduce the size of each chunk Netty transfer. In this way I finally make it.
Also I want to say that it’s not a good choice to repartition a big dataset into single file. This extremely unbalanced scenario is kind of waste your compute resources.
Welcome to any comment, I still don't understand this part well.

Tez. Slow reducers

I have strange behavior with TEZ mapreduce job.
I'm trying to read logs data from Hive, split it into some chunks by id, date and some other parameters and then write to another hive tables.
Map phase works fast enough and takes about 20 minutes, than reducers start to work and 453 from 458 reducers process all data within next 20 minutes. But last 5 reducers work for about 1 hour.
It happens because my input data includes some huge entries and processing these entries takes a lot of time.
What is the best practice for such cases? Should I make some hadoop/tez/hive tuning to allow kind of parallel processing for last reducers or it would be smarter to split input data by other parameters to avoid huge entries?
Thanks for any advice.
The magic word behind that not-so-strange behavior is skew. And it's a veeeery common issue. Usually people prefer ignoring the problem... until they really feel the pain (just like you do now).
With TEZ, since HIVE-7158 Use Tez auto-parallelism in Hive you can try to tinker with some specific properties:
hive.tez.auto.reducer.parallelism
hive.tez.max.partition.factor
hive.tez.min.partition.factor
But that "auto-parallelism" feature seems to apply when you have several abnormally small reduce datasets that can be merged, while your problem is the exact opposite (one abnormally large reduce dataset). So you should try also to tinker with
hive.exec.reducers.bytes.per.reducer
hive.exec.reducers.max
...to change the scale and make "large" the new "normal" (hence "normal" becoming "small"). But then, maybe all you will get will be 3 reducers all taking 1 hour to complete. Hard to say.
Good luck. This kind of performance tuning is more Art than Science.
Reference :
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.tez.auto.reducer.parallelism
https://www.mail-archive.com/user#tez.apache.org/msg00641.html
http://fr.slideshare.net/t3rmin4t0r/hivetez-a-performance-deep-dive/25
http://fr.slideshare.net/hortonworks/discoverhdp22faster-sql-queries-with-hive/28
~~~~~~
PS: of course, if you could remove the source of skewness by changing the way you organize your input dataset...

Cassandra partition size and performance?

I was playing around with cassandra-stress tool on my own laptop (8 cores, 16GB) with Cassandra 2.2.3 installed out of the box with having its stock configuration. I was doing exactly what was described here:
http://www.datastax.com/dev/blog/improved-cassandra-2-1-stress-tool-benchmark-any-schema
And measuring its insert performance.
My observations were:
using the code from https://gist.github.com/tjake/fb166a659e8fe4c8d4a3 without any modifications I had ~7000 inserts/sec.
when modifying line 35 in the code above (cluster: fixed(1000)) to "cluster: fixed(100)", i. e. configuring my test data distribution to have 100 clustering keys instead of 1000, the performance was jumping up to ~11000 inserts/sec
when configuring it to have 5000 clustering keys per partition, the performance was reducing to just 700 inserts/sec
The documentation says however Cassandra can support up to 2 billion rows per partition. I don't need that much still I don't get how just 5000 records per partition can slow the writes 10 times down or am I missing something?
Supporting is a little different from "best performaning". You can have very wide partitions, but the rule-of-thumb is to try to keep them under 100mb for misc performance reasons. Some operations can be performed more efficiently when the entirety of the partition can be stored in memory.
As an example (this is old example, this is a complete non issue post 2.0 where everything is single pass) but in some versions when the size is >64mb compaction has a two pass process, that halves compaction throughput. It still worked with huge partitions. I've seen many multi gb ones that worked just fine. but the systems with huge partitions were difficult to work with operationally (managing compactions/repairs/gcs).
I would say target the rule of thumb initially of 100mb and test from there to find own optimal. Things will always behave differently based on use case, to get the most out of a node the best you can do is some benchmarks closest to what your gonna do (true of all systems). This seems like something your already doing so your definitely on the right path.

Fiware-Cosmos MapReduce

I have a Question regarding the MapReduce example explained here:
http://forge.fiware.org/plugins/mediawiki/wiki/fiware/index.php/BigData_Analysis_-_Quick_Start_for_Programmers
It is indeed the most common example of hadoop MapReduce, the WordCount.
I am able to execute it with no problems at the global instance of Cosmos, but even when I give it an small input (a file with 2 or 3 lines) it takes a lot to execute it (half a minute more or less). I assume this is its normal behavior but my question is: ¿Why does it takes so long even for an small input?
I guess this method increases its efectiveness with bigger datasets where this minimal delay is negligible.
First of all, you have to take into account the current instance of Cosmos at FIWARE LAB is a shared instance of Hadoop, thus many other user may be executing MapReduce jobs at the same time resulting in a "competition" for the computation resources.
Being said that, MapReduce is designed for large datasets and larga data files. It adds a lot of overhead that it's not necessary when processing a couple of lines (because for a couple of lines analsis you don't need MapReduce! :)) but which help a lot when those lines are thounsands, even millions. In those cases the processing time is proportional to the data size, of course, but not in a let's say 1:1 proportion.

Riak performance - unexpected results

In the last days I played a bit with riak. The initial setup was easier then I thought. Now I have a 3 node cluster, all nodes running on the same vm for the sake of testing.
I admit, the hardware settings of my virtual machine are very much downgraded (1 CPU, 512 MB RAM) but still I am a quite surprised by the slow performance of riak.
Map Reduce
Playing a bit with map reduce I had around 2000 objects in one bucket, each about 1k - 2k in size as json. I used this map function:
function(value, keyData, arg) {
var data = Riak.mapValuesJson(value)[0];
if (data.displayname.indexOf("max") !== -1) return [data];
return [];
}
And it took over 2 seconds just for performing the http request returning its result, not counting the time it took in my client code to deserialze the results from json. Removing 2 of 3 nodes seemed to slightly improve the performance to just below 2 seconds, but this still seems really slow to me.
Is this to be expected? The objects were not that large in bytesize and 2000 objects in one bucket isnt that much, either.
Insert
Batch inserting of around 60.000 objects in the same size as above took rather long and actually didnt really work.
My script which inserted the objects in riak died at around 40.000 or so and said it couldnt connect to the riak node anymore. In the riak logs I found an error message which indicated that the node ran out of memory and died.
Question
This is really my first shot at riak, so there is definately the chance that I screwed something up.
Are there any settings I could tweak?
Are the hardware settings too constrained?
Maybe the PHP client library I used for interacting with riak is the limiting factor here?
Running all nodes on the same physical machine is rather stupid, but if this is a problem - how can i better test the performance of riak?
Is map reduce really that slow? I read about the performance hit that map reduce has on the riak mailing list, but if Map Reduce is slow, how are you supposed to perform "queries" for data needed nearly in realtime? I know that riak is not as fast as redis.
It would really help me a lot if anyone with more experience in riak could help me out with some of these questions.
This answer is a bit late, but I want to point out that Riak's mapreduce implementation is designed primarily to work with links, not entire buckets.
Riak's internal design is actually pretty much optimized against working with entire buckets. That's because buckets are not considered to be sequential tables but a keyspace distributed across a cluster of nodes. This means that random access is very fast — probably O(log n), but don't quote me on that — whereas serial access is very, very, very slow. Serial access, the way Riak is currently designed, necessarily means asking all nodes for their data.
Incidentally, "buckets" in Riak terminology are, confusingly and disappointingly, not implemented the way you probably think. What Riak calls a bucket is in reality just a namespace. Internally, there is only one bucket, and keys are stored with the bucket name as a prefix. This means that no matter how small or large you bucket is, enumerating the keys in a single bucket of size n will take m time, where m is the total number of keys in all buckets.
These limitations are implementation choices by Basho, not necessarily design flaws. Cassandra implements the exact same partitioning model as Riak, but supports efficient sequential range scans and mapreduce across large amounts of keys. Cassandra also implements true buckets.
A recommendation I'd have now that some time has passed and several new versions of Riak have come about is this. Never rely on full bucket map/reduce, that's not an optimized operation, and chances are very good there are other ways to optimize your map/reduce so you don't have to look through so much data to pull out the singlets you need.
Secondary indices now available in newer versions of Riak are definitely the way to go in this regard. Put an index on the objects you want to find (perhaps named 'ismax_int' with a value of 0 or 1). You can map/reduce a secondary index with hundreds of thousands of keys in microseconds which a full bucket scan would have taken multiple seconds to consider.
I don't have direct experience of Riak, but have worked with Cassandra a little, which is similar.
Firstly, performance will probably depend a lot on the number of cores available, and the memory. These systems are usually heavily pipelined and concurrent and benefit from a lot of cores. 4+ cores and 4GB+ of RAM would be a good starting point.
Secondly, MapReduce is designed for batch processing, not realtime queries.
Riak and all similar Key-Value stores are designed for high write performance, high read performance for simple lookups, no complex querying at all.
Just for comparison, Cassandra on a single node (6 core, 6GB) can do 20,000 individual inserts per second.

Resources