What are the likely causes of org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle? - memory-management

I'm deploying a Spark data processing job on an EC2 cluster. The job is small for the cluster (16 cores with 120G of RAM in total); the largest RDD has only 76k+ rows, but it is heavily skewed in the middle (thus requiring repartitioning) and each row holds around 100k of data after serialization. The job always gets stuck at repartitioning; namely, it constantly hits the following errors and retries:
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle
org.apache.spark.shuffle.FetchFailedException: Error in opening FileSegmentManagedBuffer
org.apache.spark.shuffle.FetchFailedException: java.io.FileNotFoundException: /tmp/spark-...
I've tried to identify the problem, but both memory and disk consumption on the machines throwing these errors seem to be below 50%. I've also tried different configurations, including:
let driver/executor memory use 60% of total memory.
let Netty prioritize the JVM shuffling buffer.
increase shuffling streaming buffer to 128m.
use KryoSerializer and max out all buffers
increase shuffling memoryFraction to 0.4
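In conf terms, the Kryo and memoryFraction tweaks above look roughly like this on Spark 1.x (the exact keys are my best guess from the loose descriptions, and the buffer value is only illustrative):
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max=512m
spark.shuffle.memoryFraction=0.4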
But none of them works. The small job always triggers the same series of errors and maxes out the retries (up to 1000 times). How can I troubleshoot this in such a situation?
Thanks a lot if you have any clue.

Check your log to see if you get an error similar to this:
ERROR 2015-05-12 17:29:16,984 Logging.scala:75 - Lost executor 13 on node-xzy: remote Akka client disassociated
Every time you get this error, it is because you lost an executor. As for why you lost the executor, that is another story; again, check your log for clues.
One thing to know: YARN can kill your job if it thinks you are using "too much memory".
Check for something like this:
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl - Container [<edited>] is running beyond physical memory limits. Current usage: 18.0 GB of 18 GB physical memory used; 19.4 GB of 37.8 GB virtual memory used. Killing container.
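If the job runs on YARN with log aggregation enabled, one way to search the aggregated container logs for such kills is the yarn CLI (the application id below is a placeholder):
yarn logs -applicationId application_1431000000000_0042 | grep -i "killing container"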
Also see: http://apache-spark-developers-list.1001551.n3.nabble.com/Lost-executor-on-YARN-ALS-iterations-td7916.html
The current state of the art is to increase spark.yarn.executor.memoryOverhead until the job stops failing. We do have plans to try to automatically scale this based on the amount of memory requested, but it will still just be a heuristic.
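As a minimal sketch of that advice, the overhead can be raised on the spark-submit command line; the 2048 (MB) value here is only a starting point to be tuned, not a recommendation:
spark-submit \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  --executor-memory 8g \
  <your usual class/jar and arguments>
If containers keep being killed, keep raising the overhead (or lowering executor memory) until the container limit is no longer exceeded.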

I was also getting the error
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle
and looking further in the log I found
Container killed on request. Exit code is 143
After searching for that exit code, I realized it's mainly related to memory allocation. So I checked the amount of memory I had configured for the executors and found that, by mistake, I had configured 7g for the driver and only 1g for the executor. After increasing the executor memory, my Spark job ran successfully.
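For illustration, that driver/executor split is typically set on the command line like this (the sizes below are placeholders, not recommendations):
spark-submit \
  --driver-memory 2g \
  --executor-memory 6g \
  <your usual class/jar and arguments>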

It seems that the changeQueue operation I performed may have caused this problem: the server changed after I changed the queue.

Related

Nifi memory continues to expand

I run a three-node NiFi cluster, version 1.16.3. The hardware is 8 cores and 32G of memory per node, with a 2T high-speed solid-state drive. The OS is CentOS 7.9 on ARM64 hardware.
The initial JVM configuration of NiFi is Xms12g and Xmx12g (bootstrap.conf).
It is a native installation; Docker is not used, only NiFi is installed on all these machines, and the embedded ZooKeeper is used.
We run 20 workflows every day from 00:00 to 03:00, with a total data size of 1.2G: CSV documents are collected into a Greenplum database.
My problem is that NiFi's memory usage increases by about 0.2G every day, on all three nodes. The memory slowly fills up and then the machine dies. This takes about a month (with the heap set to 12G).
That is to say, I need to restart the cluster every month. I use only native processors and workflows.
I can't locate the problem. Who can help me?
If my description is missing anything, please feel free to let me know, thanks.
I have made the following attempts:
I set the heap to 18G or 6G, and the speed of workflow processing did not change. The difference is that after setting it to 18G, it freezes for a shorter time.
I used OpenJRE 1.8 and tried upgrading to 11, but it was useless.
I added the following configuration, which was also useless:
java.arg.7=-XX:ReservedCodeCacheSize=256m
java.arg.8=-XX:CodeCacheMinimumFreeSpace=10m
java.arg.9=-XX:+UseCodeCacheFlushing
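For reference, the 12G heap mentioned above lives in the same bootstrap.conf; on a stock install it is usually the lower-numbered java.arg entries (the indexes may differ on your nodes):
java.arg.2=-Xms12g
java.arg.3=-Xmx12g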
The daily scheduled tasks consume few resources. Even with the heap set to 6G and 20 tasks running at the same time, memory consumption is about 30%, and they finish running within half an hour.

HDFS nodes OOM too many files?

We have an HDFS cluster with five nodes. Too often, when writing new files to the file system, we get either a "Not enough replicas" error or the following:
2016-05-29 13:30:03,972 [Thread-486536] INFO  org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream
java.io.IOException: Got error, status message , ack with firstBadLink as 10.100.1.22:50010
at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:142) ~[hadoop-hdfs-2.7.1.jar!/:na]
...
2016-05-29 13:30:03,972 [Thread-486536] INFO  org.apache.hadoop.hdfs.DFSClient - Abandoning BP-1195099512-10.100.1.21-1454597789659:blk_1085523876_11792285
2016-05-29 13:30:03,977 [Thread-486536] INFO  org.apache.hadoop.hdfs.DFSClient - Excluding datanode DatanodeInfoWithStorage[10.100.1.22:50010,DS-2f34af8d-234a-4036-a810-908c3b2bd9cf,DISK]
2016-05-29 13:30:04,003 [pool-1272-thread-3] WARN  org.apache.hadoop.hdfs.DFSClient - Slow waitForAckedSeqno took 65098ms (threshold=30000ms)
We also experience a lot of these, which seem to occur when heavy GC is happening.
[pool-9-thread-23] WARN  org.apache.hadoop.hdfs.DFSClient - Slow waitForAckedSeqno took 34607ms (threshold=30000ms)
[pool-9-thread-30] WARN  org.apache.hadoop.hdfs.DFSClient - Slow waitForAckedSeqno took 34339ms (threshold=30000ms)
[pool-9-thread-5] WARN  org.apache.hadoop.hdfs.DFSClient - Slow waitForAckedSeqno took 34593ms (threshold=30000ms)
The filesystem holds 6.5 million small (4-20 kB) files, and the nodes go down with OOM when we write new files. New files are always written in batches, and a batch can be a few hundred thousand files.
The nodes currently have a lot of RAM so that they don't OOM: 4 GB for the name node and 3 GB for the data nodes.
Is this really expected behavior? Why are the nodes eating such a huge amount of RAM?
I want to increase the number of nodes to see if we can run with stricter memory settings, like 1024 MB instead. Is that possible?
EDIT: We see a lot of GC happening, and while GC is running the nodes do not respond.
In the end the problem was the underlying storage system, which was failing. We moved the cluster to a hardware rack, added a second name node, and now periodically archive used files into HAR filesystems; HDFS is a happy panda.
If you ever see similar problems and your platform is based on any kind of virtualization, move away from the virtual mumbo jumbo immediately.
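For anyone following the HAR approach, the archiving step is a single command; the archive name and paths below are placeholders:
hadoop archive -archiveName batch-2016-05.har -p /data/ingest 2016-05 /data/archive
The packed files can then be read back through har:// paths, and deleting the originals reduces the number of objects the name node has to track.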

Tuning Hadoop job execution on YARN

A bit of intro - I'm learning about Hadoop. I have implemented a machine learning algorithm (clustering) on top of Hadoop and tested it only on a small example (30MB).
A couple of days ago I installed Ambari and created a small cluster of four machines (a master and 3 workers). The master hosts the ResourceManager and the NameNode.
Now I'm testing my algorithm with increasing amounts of data (300MB, 3GB). I'm looking for pointers on how to tune my mini-cluster. Concretely, I would like to know how to determine the MapReduce2 and YARN settings in Ambari.
How do I determine the min/max memory for a container, reserved memory for containers, sort allocation memory, and map and reduce memory?
The problem is that execution of my jobs is very slow on Hadoop (and clustering is an iterative algorithm, which makes things worse).
I have a feeling that my cluster setup is not good, because of the following reason:
I ran a job on a 30MB dataset (I set the block size for this job to 8MB, since the data is small and the processing is intensive) - execution time 30 minutes
I ran the same job, but with the same dataset multiplied 10 times - 300MB (same block size, 8MB) - execution time 2 hours
Then the same amount of data - 300MB, but with a 128MB block size - the same execution time, maybe even a bit over 2 hours
The HDFS block size is 128MB, so I thought this would cause a speedup, but that is not the case. My suspicion is that the cluster setup (min/max RAM size, map and reduce RAM) is not good, hence nothing improves even though greater data locality is achieved.
Could this be the consequence of a bad setup, or am I wrong?
Please set the property below in the YARN capacity scheduler configuration to allocate roughly 33% of the maximum YARN memory per job; the value can be altered based on your requirements. The default is
yarn.scheduler.capacity.root.default.user-limit-factor=1
and to cap a single user at about a third of the queue, change it to
yarn.scheduler.capacity.root.default.user-limit-factor=0.33
If you need further info on this, please refer to the following link: https://analyticsanvil.wordpress.com/2015/08/16/managing-yarn-memory-with-multiple-hive-users/

Out of memory error in Mapreduce shuffle phase

I am getting strange errors while running a wordcount-like MapReduce program. I have a Hadoop cluster with 20 slaves, each having 4 GB of RAM. I configured my map tasks to have a heap of 300MB and my reduce task slots get 1GB. I have 2 map slots and 1 reduce slot per node. Everything goes well until the first round of map tasks finishes; then the progress remains at 100%. I suppose the copy phase is then taking place. Each map task generates something like:
Map output bytes 4,164,335,564
Map output materialized bytes 608,800,675
(I am using SnappyCodec for compression)
After stalling for about an hour, the reduce tasks crash with the following exception:
Error: java.lang.OutOfMemoryError: Java heap space at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1703) at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1563) at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1401) at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1333
I was googling and found this link but I don't really know what to make of it:
hadoop common link
I don't understand why Hadoop would experience any problems in copying and merging if it is able to perform a terasort benchmark. It cannot be that all of the map output has to fit into the RAM of the reducer thread. So what is going on here?
In the link provided above they have a discussion about tuning the following parameters:
mapreduce.reduce.shuffle.input.buffer.percent = 0.7
mapreduce.reduce.shuffle.memory.limit.percent = 0.25
mapreduce.reduce.shuffle.parallelcopies = 5
They claim that the fact that the product of these parameters is >1 allows for heap-size errors.
EDIT: Note that 5*0.25*0.7 is still <1, so focus on my second solution post!
Before restarting this intensive simulation, I would be very happy to hear someone's opinion on the problem I am facing, since it has been bothering me for almost a week now. I also don't seem to completely understand what is happening in this copy phase; I'd expect a merge sort on disk not to require much heap space.
Thanks a lot in advance for any helpful comments and answers!
I think the clue is that the heap of my reduce task was needed almost completely for the reduce phase, but the shuffle phase was competing for the same heap space, and the resulting conflict caused my jobs to crash. I think this explains why the job no longer crashes if I lower shuffle.input.buffer.percent.
The parameter you cite, mapred.job.shuffle.input.buffer.percent, is apparently a pre-Hadoop 2 parameter. I could find that parameter in mapred-default.xml per the 1.04 docs, but its name has changed to mapreduce.reduce.shuffle.input.buffer.percent per the 2.2.0 docs.
Per the docs this parameter's description is:
The percentage of memory to be allocated from the maximum heap size to storing map outputs during the shuffle.
For a complete understanding of Sort and Shuffle see Chapter 6.4 of The Hadoop Definitive Guide. That book provides an alternate definition of the parameter mapred.job.shuffle.input.buffer.percent:
The proportion of total heap size to be allocated to the map outputs buffer during the copy phase of the shuffle.
Since you observed that decreasing the value of mapred.job.shuffle.input.buffer.percent from its default of 0.7 to 0.2 solved your problem, it is pretty safe to say that you could also have solved it by increasing the reducer's heap size.
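For illustration, those two levers could be set (e.g. in mapred-site.xml) roughly as follows, using the Hadoop 2 property names; the 0.2 and 1024m values are the ones discussed in this thread, not general recommendations:
mapreduce.reduce.shuffle.input.buffer.percent=0.2
mapreduce.reduce.java.opts=-Xmx1024m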
Even after changing shuffle.input.buffer.percent to 0.2 it didn't work for me, and I got the same error.
After trial and error on a single-node cluster, I found that there needs to be enough space in the / directory, as the process uses that space in case of a spill.
The spill directory also needs to be changed.
Related bug - https://issues.apache.org/jira/browse/MAPREDUCE-6724
Can cause a NegativeArraySizeException if the calculated maxSingleShuffleLimit > MAX_INT

Reducer's Heap out of memory

So I have a few Pig scripts that keep dying in the reduce phase of the job with errors saying the Java heap keeps running out of space. To date my only solution has been to increase the reducer count, but that doesn't seem to be getting me anywhere reliable. Now part of this may just be the massive growth in the data we are getting, but I can't be sure.
I've thought about changing the spill threshold setting (I can't recall the exact setting), but I'm not sure whether it would help or just slow things down. What other things can I look at doing to solve this issue?
On a side note, when this starts happening I occasionally also get errors about bash failing to get memory for what I assume is the spill operation. Would this be the Hadoop node running out of memory? If so, would just turning down the heap size on these boxes be the solution?
Edit 1
1) Pig 0.8.1
2) The only UDF is an eval UDF that just looks at single rows, with no bags or maps.
3) I haven't noticed any hot spots with bad key distribution. I have been using the prime number scale to reduce this issue as well.
Edit 2
Here is the error in question:
2012-01-04 09:58:11,179 FATAL org.apache.hadoop.mapred.TaskRunner: attempt_201112070707_75699_r_000054_1 : Map output copy failure : java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
Here is the bash error I keep getting:
java.io.IOException: Task: attempt_201112070707_75699_r_000054_0 - The reduce copier failed
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:380)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.io.IOException: Cannot run program "bash": java.io.IOException: error=12, Cannot allocate memory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
at org.apache.hadoop.util.Shell.run(Shell.java:134)
at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:329)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
at org.apache.hadoop.mapred.MapOutputFile.getInputFileForWrite(MapOutputFile.java:160)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2537)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2501)
Obviously you are running out of memory somewhere. Increasing the number of reducers is actually quite reasonable. Take a look at the stats on the JobTracker Web GUI and see how many bytes are going out of the mapper. Divide that by the number of reduce tasks, and that is a pretty rough estimate of what each reducer is getting. Unfortunately, this only works in the long run if your keys are evenly distributed.
In some cases, JOIN (especially the replicated kind) will cause this type of issue. This happens when you have a "hot spot" for a particular key. For example, say you are doing some sort of join and one of those keys shows up 50% of the time. Whichever reducer ends up handling that key is going to get clobbered. You may want to investigate which keys are causing hot spots and handle them accordingly. In my data, these hot spots are usually useless anyway. To find out what's hot, just do a GROUP BY and COUNT and figure out what's showing up a lot; then, if it's not useful, just FILTER it out, as in the sketch below.
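A minimal Pig sketch of that hot-key check (the relation, field, and key names here are made up for illustration):
raw = LOAD 'input' AS (key:chararray, value:chararray);
grouped = GROUP raw BY key;
key_counts = FOREACH grouped GENERATE group AS key, COUNT(raw) AS cnt;
sorted = ORDER key_counts BY cnt DESC;
top_keys = LIMIT sorted 20;
DUMP top_keys;
-- once the junk keys are known, drop them before the expensive join
clean = FILTER raw BY key != 'some_hot_junk_key';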
Another source of this problem is a Java UDF that is aggregating way too much data. For example, if you have a UDF that goes through a data bag and collects the records into some sort of list data structure, you may be blowing your memory with a hot spot value.
I found that the newer versions of Pig (0.8 and 0.9 in particular) have far fewer memory issues. I had quite a few instances of running out of heap in 0.7. These versions have much better spill-to-disk detection, so that if a job is about to blow the heap, it is smart enough to spill to disk.
In order for me to be more helpful, you could post your Pig script and also mention what version of Pig you are using.
I'm not an experienced user or anything, but I did run into a similar problem when running Pig jobs on a VM.
My particular problem was that the VM had no swap space configured, so it would eventually run out of memory. I guess you're running this on a proper Linux configuration, but it wouldn't hurt to do a free -m and see what you get as a result; maybe the problem is that you have too little swap space configured.
Just a thought, let me know if it helps. Good luck with your problem!
