Out of memory error in Mapreduce shuffle phase - hadoop

I am getting strange errors while running a wordcount-like mapreduce program. I have a hadoop cluster with 20 slaves, each having 4 GB RAM. I configured my map tasks to have a heap of 300MB and my reduce task slots get 1GB. I have 2 map slots and 1 reduce slot per node. Everything goes well until the first round of map tasks finishes. Then there progress remains at 100%. I suppose then the copy phase is taking place. Each map task generates something like:
Map output bytes 4,164,335,564
Map output materialized bytes 608,800,675
(I am using SnappyCodec for compression)
After stalling for about an hour the reduce tasks crach with the following exception:
Error: java.lang.OutOfMemoryError: Java heap space at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1703) at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1563) at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1401) at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1333
I was googling and found this link but I don't really know what to make of it:
hadoop common link
I don't understand why hadoop would experience any problems in copying and merging if it is able to perform a terasort benchmark. It cannot be that all map output should fit into the RAM of the reducer thread. So what is going on here?
In the link provided above they have a discussion about tuning the following parameters:
mapreduce.reduce.shuffle.input.buffer.percent = 0.7
mapreduce.reduce.shuffle.memory.limit.percent = 0.25
mapreduce.reduce.shuffle.parallelcopies = 5
They claim that the fact that the product of the parameters is >1 allows for heapsize errors.
EDIT: Note that 5*1.25*0.7 is still <1 so focus om my second solution post!)
Before restarting this intensive simulation I would be very happy to hear about someone's opinion concerning the problem I am facing since it is bothering for almost a week now. I also seem to not completely understand what is happening in this copy phase, I'd expect a merge sort on disk not to require much heap size?
Thanks a lot in advance for any helpful comments and answers!

I think the clue is that the heapsize of my reduce task was required almost completely for the reduce phase. But the shuffle phase is competing for the same heapspace, the conflict which arose caused my jobs to crash. I think this explains why the job no longer crashes if I lower the shuffle.input.buffer.percent.

The parameter you cite mapred.job.shuffle.input.buffer.percent is apparently a pre Hadoop 2 parameter. I could find that parameter in the mapred-default.xml per the 1.04 docs but it's name has changed to mapreduce.reduce.shuffle.input.buffer.percent per the 2.2.0 docs.
Per the docs this parameter's description is:
The percentage of memory to be allocated from the maximum heap size to storing map outputs during the shuffle.
For a complete understanding of Sort and Shuffle see Chapter 6.4 of The Hadoop Definitive Guide. That book provides an alternate definition of the parameter mapred.job.shuffle.input.buffer.percent:
The proportion of total heap size to be allocated to the map outputs buffer during the copy phase of the shuffle.
Since you observed that decreasing the value of mapred.job.shuffle.input.buffer.percent from it's default of 0.7 to 0.2 solved your problem, it is pretty safe to say that you could have also solved your problem by increasing the value of the reducer's heap size.

Even after changing the shuffle.input.buffer.percent to 0.2 it doesn't work for me and got the same error.
After doing hit and trial on single node cluster, I found that there needs to be enough space in / directory as the process uses that space in case of spill.
The spill directory also needs to be changed.

Related bug - https://issues.apache.org/jira/browse/MAPREDUCE-6724
Can cause a NegativeArraySizeException if the calculated maxSingleShuffleLimit > MAX_INT

Related

Tuning Hadoop job execution on YARN

A bit of intro - I'm learning about Hadoop. I have implemented machine learning algorithm on top of Hadoop (clustering) and tested it only on a small example (30MB).
A couple of days ago I installed Ambari and created a small cluster of four machines (master and 3 workers). Master has Resource manager and NameNode.
Now I'm testing my algorithm by increasing the amount of data (300MB, 3GB). I'm looking for a pointer how to tune up my mini-cluster. Concretely, I would like to know how to determine MapReduce2 and YARN settings in Ambari.
How to determine min/max memory for container, reserved memory for container, Sort Allocation Memory, map memory and reduce memory?
The problem is that execution of my jobs is very slow on Hadoop (and clustering is an iterative algorithm, which makes things worse).
I have a feeling that my cluster setup is not good, because of the following reason:
I run a job for a dataset of 30MB (I set-up block memory for this job to be 8MB, since data is small and processing is intensive) - execution time 30 minutes
I run the same job, but multiply same dataset 10 times - 300MB (same block size, 8MB) - execution time 2 hours
Now same amount of data - 300MB, but block size 128MB - same execution time, maybe even a bit greater than 2 hours
Size of blocks on HDFS is 128MB, so I thought that this will cause the speedup, but that is not the case. My doubts are that the cluster setup (min/max RAM size, map and reduce RAM) is not good, hence it cannot improve even though greater data locality is achieved.
Could this be the consequence of a bad setup, or am I wrong?
Please set the below properties in Yarn configuratins to allocate 33% of max yarn memory per job, which can be altered based on your requirement.
yarn.scheduler.capacity.root.default.user-limit-factor=1
yarn.scheduler.capacity.root.default.user-limit-factor=0.33
If you need further info on this, please refer following link https://analyticsanvil.wordpress.com/2015/08/16/managing-yarn-memory-with-multiple-hive-users/

Benchmarking Hadoop on EC2 gives identical performances

I am trying to benchmark Hadoop on EC2. I am using a 1GB file with 1 Master and 5 slaves. When I varied the dfs.blocksize like 1m, 64m, 128m, 500m. I was expecting the best performance at 128m since the file size is 1GB and there are 5 slaves. But to my surprise, irrespective of the block size, time taken falls more or less within the same range. How am I achieving this wierd performance?
Couple of things to think about most likely explanation first
Check you are correctly passing in the system variables to control the split size of the job, if you don't change this you won't alter the numbers of mappers (which you can check in the jobtracker UI). If you get the same number of mappers each time your not actually changing anything. To change the split size, use the system props mapred.min.split.size and mapred.max.split.size
Make sure you are really hitting the cluster and not accidentally running locally with 1 process
Be aware that (unlike Spark) Hadoop has a horrifying job initialization time. IME it's around 20 seconds, and therefore for only 1 GB of data your not really seeing much time difference as the majority of the job is spent in initialization.

why Hadoop shuffling time takes longer than expected

I am trying to figure out which steps takes how much time in simple hadoop wordcount example.
In this example 3 maps and 1 reducer is used where each map generates ~7MB shuffle data. I have a cluster which is connected via 1Gb switches. When I look at the job details, realized that shuffling takes ~7 sec after all map tasks are completed wich is more than expected to transfer such a small data. What could be the reason behind this? Thanks
Hadoop uses heartbeats to communicate with nodes. By default hadoop uses minimal heartbeat interval equals to 3seconds. Consequently hadoop completes your task within two heartbeats (roughly 6 seconds).
More details: https://issues.apache.org/jira/browse/MAPREDUCE-1906
The transfer is not the only thing to complete after the map step. Each mapper outputs their part of a given split locally and sorts it. The reducer that is tasked with a particular split then gathers the parts from each mapper output, each requiring a transfer of 7 MB. The reducer then has to merge these segments into a final sorted file.
Honestly though, the scale you are testing on is absolutely tiny. I don't know all the parts of the Hadoop shuffle step, which I understand has some involved details, but you shouldn't expect performance of such small files to be indicative of actual performance on larger files.
I think the shuffling started after first mapper started. But waited for the next two mappers.
There is option to start reduce phase (begins with shuffling) after all the mappers were finished. But that's not really speed up anything.
(BTW. 7 seconds is considered fast in Hadoop. Hadoop is poor in performance. Especially for small files. Unless somebody else is paying for this. Don't use Hadoop.)

Hadoop - Reduce the number of Spilled Records

I have an Ubuntu vm running in stand alone/pseudo mode with 4gb ram and 4 cores.
Everything is set to default except:
io.file.buffer.size=65536
io.sort.factor=50
io.sort.mb=500
mapred.tasktracker.map.tasks.maximum=4
mapred.tasktracker.reduce.tasks.maximum=4
This ofc will not be a production machine but I am fiddling with it to get the grips with the fine tuning.
My problem is that when I run my benchmark Hadoop Streaming job (get distinct records over a 1.8gb text file) I get quite a lot of spilled records and the above tweaks don't seem to reduce the spills. Also I have noticed that when I monitor the memory usage in the Ubuntu's System Monitor it never gets fully used and never goes above 2.2gb.
I have looked at chaging HADOOP_HEAP, mapred.map.child.java.opts and mapred.reduce.child.java.opts but I am not sure what to set these to as the defaults seem as though they should be enough.
Is there a setting I am missing that will allow Hadoop to utilise the remaining ram therefore reduce spilled records (hopefully speeding up jobs) or is this normal behaviour?
Many Thanks!
In addition to increasing memory, have you considered if you can run a combiner for your task after the map step, which will compress and reduce the amount of records that need to be kept in memory or spilled?
Unfortunately when you are using streaming, seems that this has to be coded in Java, and can't be in whatever language you're using.
http://wiki.apache.org/hadoop/HadoopStreaming
The default memory assigned to map/reduce task is 200mb. You can increase that value with -Dmapred.child.java.opts=-Xmx512M
Anyway, this is a very interesting material about hadoop tunning Hadoop Performance
Hope it helps!

Reducer's Heap out of memory

So I have a few Pig scripts that keep dying in there reduce phase of the job with the errors that the Java heap keeps running out of space. To this date my only solution has been to increase Reducer counts, but that doesn't seem to be getting me anywhere reliable. Now part of this may be just the massive growth in data we are getting, but can't be sure.
I've thought about changing the spill threshold setting, can't recall the setting, but not sure if they would help any or just slow it down. What other things can I look at doing to solve this issue?
On a side note when this starts happening on occasion I also get errors about bash failing to get memory for what I assume is the spill operation. Would this be the Hadoop node running out of memory? If so would just turning down the heap size on these boxes be the solution?
Edit 1
1) Pig 0.8.1
2) The only UDF is an eval udf that just looks at single rows with no bags or maps.
3) I haven't noticed there being any hotspots with bad key distrobution. I have been using the prime number scale to reduce this issue as well.
Edit 2
Here is the error in question:
2012-01-04 09:58:11,179 FATAL org.apache.hadoop.mapred.TaskRunner: attempt_201112070707_75699_r_000054_1 : Map output copy failure : java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
Here is the bash error I keep getting:
java.io.IOException: Task: attempt_201112070707_75699_r_000054_0 - The reduce copier failed
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:380)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.io.IOException: Cannot run program "bash": java.io.IOException: error=12, Cannot allocate memory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
at org.apache.hadoop.util.Shell.run(Shell.java:134)
at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:329)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
at org.apache.hadoop.mapred.MapOutputFile.getInputFileForWrite(MapOutputFile.java:160)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2537)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2501)
Obviously you are running out of memory somewhere. Increasing the number of reducers is actually quite reasonable. Take a look at the stats on the JobTracker Web GUI and see how many bytes are going out of the mapper. Divide that by the number of reduce tasks, and that is a pretty rough estimate of what each reducer is getting. Unfortunately, this only works in the long run if your keys are evenly distributed.
In some cases, JOIN (especially the replicated kind) will cause this type of issue. This happens when you have a "hot spot" of a particular key. For example, say you are doing some sort of join and one of those keys shows up 50% of the time. Whatever reducer gets lucky to handle that key is going to get clobbered. You may want to investigate which keys are causing hot spots and handle them accordingly. In my data, usually these hot spots are useless anyways. To find out what's hot, just do a GROUP BY and COUNT and figure out what's showing up a lot. Then, if it's not useful, just FILTER it out.
Another source of this problem is a Java UDF that is aggregating way too much data. For example, if you have a UDF that goes through a data bag and collects the records into some sort of list data structure, you may be blowing your memory with a hot spot value.
I found that the newer versions of Pig (.8 and .9 particularly) have far fewer memory issues. I had quite a few instances of running out of heap in .7. These versions have much better spill to disk detection so that if its about to blow the heap, it is smart enough to spill to disk.
In order for me to be more helpful, you could post your Pig script and also mention what version of Pig you are using.
I'm not an experienced user or anything, but I did run into a similar problem when runing pig jobs on a VM.
My particular problem, was that the VM had no swap space configured, it would eventually run out of memory. I guess you're trying this in a proper linux configuration, but it would't hurt to do a: free -m and see what you get in result, maybe the problem is due to you having too little swap memory configured.
Just a thought, let me know if it helps. Good luck with your problem!

Resources