Physical memory usage keeps increasing for Spark application on YARN - hadoop

I am running a Spark application in YARN-client mode with six executors (each with four cores, executor memory = 6 GB, and overhead = 4 GB; Spark version: 1.6.3 / 2.1.0).
I find that my executor memory keeps increasing until the executor gets killed by the node manager, with a message telling me to boost spark.yarn.executor.memoryOverhead.
I know that this parameter mainly controls the size of memory allocated off-heap, but I don't know when and how the Spark engine uses this part of memory. Increasing that part of memory does not always solve my problem either: sometimes it works and sometimes it doesn't, and it tends to be useless when the input data is large.
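For reference, this is roughly how I have been raising the overhead (a minimal sketch; the values just mirror the numbers above):
// On the command line:
//   spark-submit --master yarn --deploy-mode client \
//     --num-executors 6 --executor-cores 4 --executor-memory 6g \
//     --conf spark.yarn.executor.memoryOverhead=4096 ...
// Or programmatically, before the SparkContext/SparkSession is created:
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .set("spark.executor.memory", "6g")
  .set("spark.yarn.executor.memoryOverhead", "4096") // MiB in Spark 1.6 / 2.1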
FYI, my application's logic is quite simple: it combines the small files generated in a single day (one directory per day) into one file and writes it back to HDFS. Here is the core code:
import org.apache.spark.sql.SaveMode

// Read one day's data, reduce the number of partitions, drop the partitioning
// columns, then write everything back as a single Parquet file.
val df = spark.read.parquet(originpath)
  .filter(s"m = ${ts.month} AND d = ${ts.day}")
  .coalesce(400)
val dropDF = df.drop("hh").drop("mm").drop("mode").drop("y").drop("m").drop("d")
dropDF.repartition(1).write
  .mode(SaveMode.ErrorIfExists)
  .parquet(targetpath)
The source directory may contain hundreds to thousands of partitions, and the total Parquet data is around 1 to 5 GB.
I also find that in the step that shuffles data between machines, the size of the shuffle read is about four times larger than the input size, which is either weird or follows some principle I don't know.
Anyway, I have done some searching on this problem myself. Some articles say it's related to the direct buffer memory (which I don't set myself).
Other articles say that people solved it with more frequent full GC.
I also found a person on Stack Overflow with a very similar situation: Ever increasing physical memory for a Spark application in YARN
He claimed that it was a bug with Parquet, but a comment questioned that. People on this mailing list may also have received an email a few hours ago from blondowski, who described this problem while writing JSON: Executors - running out of memory
So it looks like a common problem across different output formats.
I hope someone with experience of this problem can explain why it happens and what a reliable way to solve it is.

I did some investigation on this over the last few days with my colleague. Here is my current thinking: since Spark 1.2, Netty with off-heap memory is used to reduce GC during shuffle and cache block transfer. In my case, if I try to increase the memory overhead far enough, I just get the max direct buffer exception instead. When Netty does block transfers, there are five threads by default grabbing data chunks for the target executor, and in my situation a single chunk was too big to fit into the buffer, so GC won't help here.
My final solution was to do another repartition before the repartition(1), creating roughly 10x more partitions than the original data had. That reduces the size of each chunk Netty has to transfer, and this is how I finally made it work.
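In code, the fix looks roughly like the sketch below (the factor of 10 over the original 400 partitions is illustrative; the rest of the write is just repeated from above):
dropDF
  .repartition(4000)  // ~10x the original partition count, so each chunk Netty transfers is smaller
  .repartition(1)     // then collapse to a single output file as before
  .write
  .mode(SaveMode.ErrorIfExists)
  .parquet(targetpath)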
I also want to say that it's not a good choice to repartition a big dataset into a single file; such an extremely unbalanced scenario is a waste of your compute resources.
Any comments are welcome; I still don't understand this part well.

Related

Apache NiFi tuning issues

I've developed a NiFi flow prototype for data ingestion into HDFS. Now I would like to improve the overall performance, but it seems I cannot really move forward.
The flow takes csv files as input (each row has 80 fields), splits them at row level, applies some transformations to the fields (using 4 custom processors executed sequentially), buffers the new rows into csv files, and outputs them to HDFS. I've developed the processors in such a way that the content of the flow file is accessed only once, when each individual record is read and its fields are moved to flowfile attributes. Tests have been performed on an Amazon EC2 m4.4xlarge instance (16 CPU cores, 64 GB RAM).
This is what I have tried so far:
Moved the flowfile repository and the content repository to different SSD drives
Moved the provenance repository to memory (NiFi could not keep up with the event rate)
Configured the system according to the configuration best practices
Assigned multiple threads to each of the processors in order to reach different numbers of total threads
Increased nifi.queue.swap.threshold and set backpressure so the swap limit is never reached
Tried different JVM memory settings from 8 GB up to 32 GB (in combination with G1GC)
Increased the instance specifications; nothing changed
From the monitoring I've performed, it looks like the disks are not the bottleneck (they are basically idle most of the time, which suggests the computation is actually being performed in memory) and the average CPU load is below 60%.
The most I can get is 215k rows/minute, which is about 3.5k rows/second. In terms of volume, that's just 4.7 MB/s. I am aiming for something definitely greater than this.
Just as a comparison, I created a flow that reads a file, splits it into rows, merges them together in blocks, and outputs them to disk. Here I get 12k rows/second, or 17 MB/s. That doesn't look surprisingly fast either, and it makes me think I am probably doing something wrong.
Does anyone have suggestions on how to improve the performance? How much would I benefit from running NiFi on a cluster instead of growing the instance specs? Thank you all.
It turned out the poor performance was caused by a combination of the custom processors I developed and the built-in MergeContent processor. The same question, mirrored on the Hortonworks community forum, got some interesting feedback.
Regarding the first issue, a suggestion is to add the SupportsBatching annotation to the processors. This allows the processor to batch several session commits together, and lets the NiFi user favor latency or throughput for the processor's execution from the configuration menu. Additional info can be found in the documentation here.
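For illustration, a custom processor annotated this way might look roughly like the sketch below (written in Scala to match the other snippets in this thread; the processor and relationship names are made up, and a real processor would also declare its properties and handle failures):
import java.util.Collections
import org.apache.nifi.annotation.behavior.SupportsBatching
import org.apache.nifi.processor.{AbstractProcessor, ProcessContext, ProcessSession, Relationship}

// Hypothetical row-transforming processor. @SupportsBatching lets the framework
// combine several session commits and exposes the Run Duration slider, so the
// user can trade a little latency for throughput.
@SupportsBatching
class TransformFieldsProcessor extends AbstractProcessor {

  private val RelSuccess: Relationship =
    new Relationship.Builder().name("success").description("Transformed rows").build()

  override def getRelationships: java.util.Set[Relationship] =
    Collections.singleton(RelSuccess)

  override def onTrigger(context: ProcessContext, session: ProcessSession): Unit = {
    Option(session.get()).foreach { flowFile =>
      // ... read the record once and move its fields to flowfile attributes ...
      session.transfer(flowFile, RelSuccess)
    }
  }
}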
The other finding was that the built-in MergeContent processor doesn't seem to have optimal performance itself, so if possible one should consider modifying the flow to avoid the merging phase.

Spark "ExecutorLostFailure" - how to solve?

I've checked out some of the other answers on "ExecutorLostFailure" and most of them either:
1. Don't have an answer
2. Insist on increasing the executor memory and the number of cores
Here are some of the ones that I'm referring to: here here here
Is there any other solution to this? I've tried both, but it's unclear to me how to correctly gauge how much to allocate for each (memory and cores) in my SparkContext.
The error occurs within a saveAsTextFile action. Thanks.
From my experience, increasing the executor memory can help. But I'd suggest that this is a naive fix, and usually the underlying issue will remain.
The reason I say this is that one of Spark's most important features is that it allows you to perform computations on datasets that are too big to fit in memory. In theory, you could perform most calculations on a 1TB dataset with a single executor with 2GB memory.
In every case that I've encountered an OOM, it has been one of the following two reasons:
1. Insufficient executor memory overhead
This only applies if you are using a resource manager like Mesos or YARN. Check the Spark docs for guidance on this.
2. Something you are doing in your transformations is causing your RDD to become massively "horizontal".
Recall that I said Spark can handle datasets that are too big to fit in memory. The caveat to this is that the datasets must be vertically parallelizable - think a text file with 10^8 rows, where each row contains a relatively small data point (e.g. list of floats, JSON string, a single sentence etc.). Spark will then partition your dataset and send an appropriate number of rows to each executor.
The problem arises when a single row is very large. This is unlikely to occur through normal map-like operations (unless you are doing something quite weird), but is very easy to do through aggregation-type operations like groupByKey or reduceByKey. Consider the following example:
Dataset (name, age):
John 30
Kelly 36
Steve 48
Jane 36
If I then do a groupByKey with the age as key, I will get data in the form:
36 [Kelly, Jane]
30 [John]
48 [Steve]
If the number of rows in the initial dataset is very large, the rows in the resulting dataset could be very long. If they are long enough, they may be too large to fit into executor memory.
The solution?
It depends on your application. In some cases it may indeed be unavoidable, and you may just have to increase executor memory. But usually it's possible to restructure your algorithm to avoid the issue, e.g. by replacing a groupByKey with a countByKey, or by throwing away data points with a very high incidence rate (in one case I observed, a bot generating millions of requests was responsible for the issue; these could safely be discarded without affecting the analysis).
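As a rough sketch of that kind of restructuring (the RDD and value types here are made up for illustration), if all you ultimately need is a count per key, you can avoid materialising the grouped rows entirely:
import org.apache.spark.rdd.RDD

// people is keyed by age, e.g. (36, "Kelly"), (30, "John"), (36, "Jane"), ...
def countsPerAge(people: RDD[(Int, String)]): Map[Int, Long] = {
  // groupByKey would build one potentially enormous value per age:
  //   people.groupByKey()   // (36, [Kelly, Jane, ...millions more...])
  // countByKey only aggregates small per-partition counts on the driver:
  people.countByKey().toMap
}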

Optimal block size in HDFS - Can large block sizes hurt

I understand the disadvantages of small files and small block sizes in HDFS. I'm trying to understand the rationale behind the default 64/128 MB block size. Are there any drawbacks to having a large block size (say 2 GB; I read that values larger than that cause issues, the details of which I haven't yet dug into)?
Issues I see with too-large block sizes (please correct me; maybe some or all of these issues don't really exist):
Possibly, there could be issues with re-replicating a 1 GB file when a data node goes down, since with a large block size the cluster has to transfer the whole file as a single block. This seems to be a problem when considering a single file, but with a smaller block size (say 128 MB) we would have to transfer many more smaller blocks instead, which I think involves more overhead.
It could trouble mappers. Each mapper might end up with a large block, reducing the possible number of mappers. But this should not be an issue if we use a smaller split size?
This one sounded stupid when it occurred to me, but I'll throw it in anyway: since the namenode does not know the size of a file beforehand, it could consider a data node unavailable because it does not have enough disk space for a new block (given a large block size of maybe 1-2 GB). But maybe it solves this problem smartly by just cutting down the size of that particular block (which would probably be a bad solution anyway).
The right block size probably depends on the use case. I basically want to find an answer to the question: is there a situation/use case where a large block size can hurt?
Any help is appreciated. Thanks in advance.
I did extensive performance validation of high-end Hadoop clusters in which we varied the block size from 64 MB up to 2 GB. To answer the question: imagine workloads in which smallish files, say tens of MB, often need to be processed. Which block size do you think will be more performant in that case: 64 MB or 1024 MB?
For the case of large files, then yes, large block sizes tend towards better performance, since the per-mapper overhead is not negligible.
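As a back-of-the-envelope sketch of that trade-off (illustrative numbers, not measurements from the runs above): with the default FileInputFormat behaviour one map task is scheduled per split, and the split size is normally the block size, so the mapper count falls straight out of file size divided by block size.
// Rough split-count arithmetic: one map task per split, split size ≈ block size.
def numSplits(fileSizeMB: Long, blockSizeMB: Long): Long =
  math.max(1L, math.ceil(fileSizeMB.toDouble / blockSizeMB).toLong)

numSplits(10 * 1024, 128)   // 10 GB file, 128 MB blocks -> 80 mappers to schedule
numSplits(10 * 1024, 1024)  // 10 GB file, 1 GB blocks   -> 10 mappers, less overhead
numSplits(40, 1024)         // 40 MB file                -> 1 mapper either way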

Cassandra running out of memory (heap space)

We have been experimenting a bit with Cassandra lately (version 1.0.7) and we seem to have some problems with memory. We use EC2 as our test environment and we have three nodes with 3.7 GB of memory and 1 core @ 2.4 GHz, all running Ubuntu Server 11.10.
The problem is that the node we hit from our Thrift interface dies regularly (approximately after we store 2-2.5 GB of data). The error message is OutOfMemoryError: Java Heap Space, and according to the log it did in fact use all of the allocated memory.
The nodes are under relatively constant load and store about 2000-4000 row keys a minute, which are batched through the Thrift interface in groups of 10-30 row keys at once (with about 50 columns each). The number of reads is very low, around 1000-2000 a day, and they only request the data of a single row key. There is currently only one column family in use.
The initial thought was that something was wrong in the cassandra-env.sh file. So we specified the variables 'system_memory_in_mb' (3760) and 'system_cpu_cores' (1) according to our nodes' specification. We also changed 'MAX_HEAP_SIZE' to 2G and 'HEAP_NEWSIZE' to 200M (we think the latter is related to garbage collection). Unfortunately, that did not solve the issue and the node we hit via Thrift keeps dying regularly.
In case you find this useful: swap is off, and unevictable memory seems to be very high on all three servers (2.3 GB; on other Linux servers we usually observe around 0-16 KB of unevictable memory). We are not quite sure how unevictable memory ties into Cassandra; it's just something we observed while looking into the problem. The CPU is pretty much idle the entire time. The heap is clearly being reduced once in a while according to nodetool, but it obviously grows over the limit as time goes by.
Any ideas? Thanks in advance.
The cassandra-env.sh defaults are perfect for almost all workloads, so until you know why this is happening it's best to put them back to their defaults, or you may be making things worse without realizing it.
I see concurrent reads and writes of 2k/sec/node on our cluster, so 2k-4k writes per minute is very little, although the fact that it's only the node accepting your connections that is dying is a little strange.
If you connect your app to the thrift endpoint on one of the other nodes is it then that one that dies?
Client connections use memory, so it might be worth double-checking that you're not connecting too many at a time. "netstat -A inet | grep 9160" on the dying Cassandra node should tell you how many client connections you have. Depending heavily on your application, you'd expect 10s or 100s rather than 1000s.
What do the writes look like?
Are you writing the same row keys repeatedly and if so are you appending new column names or overwriting the same ones?
How big is each write? Anything else you can tell me?
If you're overwriting the same column names in the same row keys constantly compaction may be struggling.
If you're appending new column names to the same row keys constantly you might be growing your rows too large to fit into memory.
the output of "nodetool -h localhost tpstats" on the dying node might also give some clues as to where you're falling down. Anything constantly pending is probably bad news, especially at such a low write rate.
If you're going to use cassandra in production you should get graphing of the internals to better understand what's going on. jmxtrans and graphite should be your new best friends.
There are some things you can try tweaking. First, make sure you don't have row caching enabled on your column family. It's also worthwhile checking the log for errors, and tpstats in case something died due to an error and is getting backed up in a queue. The stack trace of the exception could be meaningful too, since there are actually different types of OOMs that might just require kernel tweaks.
If you're just using too much memory per node for the size of your data set, try checking cfstats; you can identify roughly how much space is spent on bloom filters. As you have more rows in a CF this grows roughly linearly, and it is part of the base minimum memory your nodes are going to require.
nodetool cfstats | grep Bloom.*Used | awk '{ SUM += $5} END { print SUM " bytes" }'
Since you don't read very often, you can probably increase the false positive rate on them. Each SSTable has a bloom filter it uses to check whether a row exists in it or not. You can change this with cqlsh:
ALTER TABLE MyColumnFamily WITH bloom_filter_fp_chance = 0.1;
After that, run an sstable upgrade on that CF on each node (this will be slow):
nodetool upgradesstables MyKeyspace MyColumnFamily
There are consequences to this: reads may take longer, since there is a roughly 10% (the 0.1) chance that it will check SSTables for rows that don't exist in them, resulting in extra disk seeks.
Another major memory sink, if you have column families with a large number of rows, is the sampling rate of the index. This can be modified at the node level in cassandra.yaml:
http://www.datastax.com/docs/1.1/configuration/node_configuration#index-interval
If you have it set up to take heap dumps on OOM (-XX:+HeapDumpOnOutOfMemoryError, on by default I believe), there should be some heap dumps available in the /var/lib/cassandra/data directory. You can open these up in VisualVM or whatever tool you like to identify which part of the heap is going where.

Riak performance - unexpected results

Over the last few days I've played a bit with Riak. The initial setup was easier than I thought. Now I have a 3-node cluster, with all nodes running on the same VM for the sake of testing.
I admit the hardware settings of my virtual machine are very much downgraded (1 CPU, 512 MB RAM), but I am still quite surprised by Riak's slow performance.
Map Reduce
Playing a bit with MapReduce, I had around 2000 objects in one bucket, each about 1-2 KB in size as JSON. I used this map function:
function(value, keyData, arg) {
  // keep only the objects whose displayname contains "max"
  var data = Riak.mapValuesJson(value)[0];
  if (data.displayname.indexOf("max") !== -1) return [data];
  return [];
}
And it took over 2 seconds just to perform the HTTP request and return its result, not counting the time it took my client code to deserialize the results from JSON. Removing 2 of the 3 nodes seemed to slightly improve the performance to just below 2 seconds, but this still seems really slow to me.
Is this to be expected? The objects were not that large in byte size, and 2000 objects in one bucket isn't that much either.
Insert
Batch inserting around 60,000 objects of the same size as above took rather long and actually didn't really work.
My script, which inserted the objects into Riak, died at around 40,000 or so and said it couldn't connect to the Riak node anymore. In the Riak logs I found an error message indicating that the node ran out of memory and died.
Question
This is really my first shot at Riak, so there is definitely a chance that I screwed something up.
Are there any settings I could tweak?
Are the hardware settings too constrained?
Maybe the PHP client library I used for interacting with Riak is the limiting factor here?
Running all nodes on the same physical machine is rather stupid, but if this is a problem, how can I better test Riak's performance?
Is MapReduce really that slow? I read about the performance hit of MapReduce on the Riak mailing list, but if MapReduce is slow, how are you supposed to perform "queries" for data needed in near real time? I know that Riak is not as fast as Redis.
It would really help me a lot if anyone with more experience in Riak could help me out with some of these questions.
This answer is a bit late, but I want to point out that Riak's mapreduce implementation is designed primarily to work with links, not entire buckets.
Riak's internal design is actually pretty much optimized against working with entire buckets. That's because buckets are not considered to be sequential tables but a keyspace distributed across a cluster of nodes. This means that random access is very fast — probably O(log n), but don't quote me on that — whereas serial access is very, very, very slow. Serial access, the way Riak is currently designed, necessarily means asking all nodes for their data.
Incidentally, "buckets" in Riak terminology are, confusingly and disappointingly, not implemented the way you probably think. What Riak calls a bucket is in reality just a namespace. Internally, there is only one bucket, and keys are stored with the bucket name as a prefix. This means that no matter how small or large you bucket is, enumerating the keys in a single bucket of size n will take m time, where m is the total number of keys in all buckets.
These limitations are implementation choices by Basho, not necessarily design flaws. Cassandra implements the exact same partitioning model as Riak, but supports efficient sequential range scans and mapreduce across large amounts of keys. Cassandra also implements true buckets.
A recommendation I'd make now that some time has passed and several new versions of Riak have come out is this: never rely on full-bucket map/reduce. That's not an optimized operation, and chances are very good there are other ways to structure your map/reduce so you don't have to look through so much data to pull out the few items you need.
Secondary indexes, now available in newer versions of Riak, are definitely the way to go in this regard. Put an index on the objects you want to find (perhaps named 'ismax_int' with a value of 0 or 1). You can map/reduce over a secondary index with hundreds of thousands of keys in microseconds, whereas a full bucket scan would have taken multiple seconds to consider the same data.
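As a rough sketch of that approach over Riak's HTTP interface (the bucket, key, and index name are made up; this assumes the default HTTP port 8098 and a backend that supports secondary indexes, such as LevelDB, and the call shape will differ if you go through the PHP client):
import java.net.{HttpURLConnection, URL}

// Store an object and tag it with a secondary index via the x-riak-index-*_int header.
def putWithIndex(bucket: String, key: String, json: String, isMax: Int): Unit = {
  val conn = new URL(s"http://127.0.0.1:8098/buckets/$bucket/keys/$key")
    .openConnection().asInstanceOf[HttpURLConnection]
  conn.setDoOutput(true)
  conn.setRequestMethod("PUT")
  conn.setRequestProperty("Content-Type", "application/json")
  conn.setRequestProperty("x-riak-index-ismax_int", isMax.toString)
  conn.getOutputStream.write(json.getBytes("UTF-8"))
  conn.getResponseCode // issue the request; a real client would check for 2xx
  conn.disconnect()
}

// Query the index directly instead of scanning the whole bucket:
// GET /buckets/<bucket>/index/ismax_int/1 returns a JSON list of matching keys.
def keysWhereMax(bucket: String): String =
  scala.io.Source.fromURL(s"http://127.0.0.1:8098/buckets/$bucket/index/ismax_int/1").mkString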
I don't have direct experience of Riak, but have worked with Cassandra a little, which is similar.
Firstly, performance will probably depend a lot on the number of cores available, and the memory. These systems are usually heavily pipelined and concurrent and benefit from a lot of cores. 4+ cores and 4GB+ of RAM would be a good starting point.
Secondly, MapReduce is designed for batch processing, not realtime queries.
Riak and all similar key-value stores are designed for high write performance, high read performance for simple lookups, and no complex querying at all.
Just for comparison, Cassandra on a single node (6 core, 6GB) can do 20,000 individual inserts per second.
