Reducer's heap out of memory - Hadoop

So I have a few Pig scripts that keep dying in their reduce phase with errors that the Java heap keeps running out of space. To date my only solution has been to increase the reducer count, but that doesn't seem to be getting me anywhere reliable. Part of this may just be the massive growth in the data we are getting, but I can't be sure.
I've thought about changing the spill threshold setting (I can't recall the exact setting), but I'm not sure whether it would help or just slow things down. What other things can I look at doing to solve this issue?
On a side note, when this starts happening I occasionally also get errors about bash failing to allocate memory, for what I assume is the spill operation. Would this be the Hadoop node itself running out of memory? If so, would simply turning down the heap size on these boxes be the solution?
Edit 1
1) Pig 0.8.1
2) The only UDF is an eval UDF that just looks at single rows, with no bags or maps.
3) I haven't noticed any hotspots with bad key distribution. I have been using a prime number of reducers to reduce this issue as well.
Edit 2
Here is the error in question:
2012-01-04 09:58:11,179 FATAL org.apache.hadoop.mapred.TaskRunner: attempt_201112070707_75699_r_000054_1 : Map output copy failure : java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
Here is the bash error I keep getting:
java.io.IOException: Task: attempt_201112070707_75699_r_000054_0 - The reduce copier failed
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:380)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.io.IOException: Cannot run program "bash": java.io.IOException: error=12, Cannot allocate memory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
at org.apache.hadoop.util.Shell.run(Shell.java:134)
at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:329)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
at org.apache.hadoop.mapred.MapOutputFile.getInputFileForWrite(MapOutputFile.java:160)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2537)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2501)

Obviously you are running out of memory somewhere. Increasing the number of reducers is actually quite reasonable. Take a look at the stats on the JobTracker Web GUI and see how many bytes are going out of the mapper. Divide that by the number of reduce tasks, and that is a pretty rough estimate of what each reducer is getting. Unfortunately, this only works in the long run if your keys are evenly distributed.
In some cases, JOIN (especially the replicated kind) will cause this type of issue. This happens when you have a "hot spot" of a particular key. For example, say you are doing some sort of join and one of those keys shows up 50% of the time. Whatever reducer gets stuck handling that key is going to get clobbered. You may want to investigate which keys are causing hot spots and handle them accordingly. In my data, these hot spots are usually useless anyway. To find out what's hot, just do a GROUP BY and COUNT and figure out what's showing up a lot. Then, if it's not useful, just FILTER it out.
Another source of this problem is a Java UDF that is aggregating way too much data. For example, if you have a UDF that goes through a data bag and collects the records into some sort of list data structure, you may be blowing your memory with a hot spot value.
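To make that concrete, here is a minimal sketch of the streaming alternative (the class name StreamingCount is made up, and this is not the asker's UDF): implementing Pig's Accumulator interface lets Pig hand the UDF a bag in batches instead of materializing the whole thing in the reducer's heap.

import java.io.IOException;
import java.util.Iterator;
import org.apache.pig.Accumulator;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

// Hypothetical UDF that counts the tuples in a bag without ever holding the
// whole bag in a Java collection, so a hot key cannot blow the heap here.
public class StreamingCount extends EvalFunc<Long> implements Accumulator<Long> {
    private long count = 0;

    // Pig calls this repeatedly with partial bags when running in
    // accumulative mode, so only a slice of the data is in memory at once.
    public void accumulate(Tuple input) throws IOException {
        DataBag bag = (DataBag) input.get(0);
        Iterator<Tuple> it = bag.iterator();
        while (it.hasNext()) {
            it.next();   // inspect one row at a time; never copy into a List
            count++;
        }
    }

    public Long getValue() {
        return count;
    }

    public void cleanup() {
        count = 0;
    }

    // Plain exec() path for when Pig does not use accumulative mode.
    public Long exec(Tuple input) throws IOException {
        cleanup();
        accumulate(input);
        Long result = getValue();
        cleanup();
        return result;
    }
}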
I found that the newer versions of Pig (0.8 and 0.9 in particular) have far fewer memory issues. I had quite a few instances of running out of heap in 0.7. These versions have much better spill-to-disk detection, so that if a job is about to blow the heap, it is smart enough to spill to disk.
In order for me to be more helpful, you could post your Pig script and also mention what version of Pig you are using.

I'm not an experienced user or anything, but I did run into a similar problem when running Pig jobs on a VM.
My particular problem was that the VM had no swap space configured, so it would eventually run out of memory. I guess you're running this on a properly configured Linux box, but it wouldn't hurt to run free -m and see what you get; maybe the problem is that you have too little swap configured.
Just a thought, let me know if it helps. Good luck with your problem!

Related

Debugging Hadoop reducer OutOfMemoryError

I'm trying to debug an OutOfMemoryError I'm getting in my Hadoop reducers. The mappers complete successfully. They generate small records that are less than 128 bytes. In my reducer, I collect records with the same key (there are around 15 possible keys) and write them to separate output files with MultipleOutputs. The distribution of records per key isn't uniform.
In the middle of the reduce phase, I start getting OutOfMemoryErrors. I've checked a lot of things:
The reducer doesn't save data; once it gets a value, it writes it out to the corresponding output
I tried different values for the number of reduce tasks. Tuning this is a bit weird in my case: more than 15 won't help, because there are only 15 keys
Instantiating MultipleOutputs and closing it in reduce(), thinking it holds onto resources for the output files. This only works because keys and output files have a one-to-one mapping.
I tried adding data to the end of the keys so the data would get distributed evenly between reduce tasks
Out of paranoia, mapreduce.reduce.shuffle.memory.limit.percent=0
Verified keys and values really are small
Disabled output compression, thinking there's a memory leak in the compressor
Blindly tuning things like mapreduce.reduce.shuffle.merge.percent
I'm not sure where else memory could be going other than aggressively buffering the shuffle output.
This is running on GCP Dataproc with Hadoop 3.2.2. A lot of guides recommend setting mapreduce.reduce.java.opts. I tried this unsuccessfully, but I also assume Google chose a reasonable default for the host size, and I don't have a convincing story about where the memory's going. My one other theory is that something in GoogleHadoopOutputStream, which writes to cloud storage, is buffering. I have some output files between 10 GB and 100 GB, larger than the memory of the machine.
What else should I look at? Are there other flags I should try to tune? Attaching VisualVM doesn't look easy, but would a heap dump help?
Each GoogleHadoopOutputStream consumes around 70 MiB of JVM heap because it uploads data to Google Cloud Storage in 64 MiB chunks by default. That's why, if you are writing many objects in the same MR task using MultipleOutputs, each task will need roughly (number of outputs) x 70 MiB of JVM heap.
You can reduce the memory consumed by each GoogleHadoopOutputStream via the fs.gs.outputstream.upload.chunk.size property, but this will also reduce upload speed to Google Cloud Storage; that's why a better approach is to refactor your MR job to write a single file (or fewer files) in each MR task.
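If refactoring isn't an option right away, the chunk-size property can be set per job. A minimal driver-side sketch (assuming the value is interpreted in bytes; check your GCS connector version's documentation for the exact semantics and default):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ChunkSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Shrink the per-stream upload buffer from the 64 MiB default to
        // 8 MiB, so each open GoogleHadoopOutputStream holds far less heap,
        // at the cost of slower uploads to Cloud Storage.
        conf.setLong("fs.gs.outputstream.upload.chunk.size", 8L * 1024 * 1024);
        Job job = Job.getInstance(conf, "multiple-outputs-job");
        // ... the usual mapper/reducer/input/output setup goes here ...
    }
}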

Why does YARN take so much memory for a simple count operation?

I have a standard-configuration HDP 2.2 environment with Hive, HBase and YARN.
I've used Hive (with HBase) to perform a simple count on a table that has about 10 million rows, and it resulted in about 10 GB of memory consumption from YARN.
How can I reduce this memory consumption? Why does it need so much memory just to count rows?
A simple count operation involves a MapReduce job at the back end, and in your case that job has to work through 10 million rows. Look here for a better explanation. That only covers what happens in the background at execution time, not your question about memory requirements, but at least it gives you a heads-up on where to look. It has a few suggestions for speeding things up as well. Happy coding.

Out of memory error in Mapreduce shuffle phase

I am getting strange errors while running a wordcount-like MapReduce program. I have a Hadoop cluster with 20 slaves, each having 4 GB RAM. I configured my map tasks to have a heap of 300 MB and my reduce task slots get 1 GB. I have 2 map slots and 1 reduce slot per node. Everything goes well until the first round of map tasks finishes. Then the progress remains at 100%. I suppose the copy phase is taking place at that point. Each map task generates something like:
Map output bytes 4,164,335,564
Map output materialized bytes 608,800,675
(I am using SnappyCodec for compression)
After stalling for about an hour the reduce tasks crash with the following exception:
Error: java.lang.OutOfMemoryError: Java heap space at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1703) at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1563) at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1401) at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1333)
I was googling and found this link but I don't really know what to make of it:
hadoop common link
I don't understand why Hadoop would have problems copying and merging when it is able to run a terasort benchmark. It can't be that all of the map output has to fit into the RAM of the reducer thread. So what is going on here?
In the link provided above they have a discussion about tuning the following parameters:
mapreduce.reduce.shuffle.input.buffer.percent = 0.7
mapreduce.reduce.shuffle.memory.limit.percent = 0.25
mapreduce.reduce.shuffle.parallelcopies = 5
They claim that heap-size errors can occur when the product of these parameters is greater than 1.
EDIT: Note that 5 * 0.25 * 0.7 = 0.875 is still < 1, so focus on my second answer post!
Before restarting this intensive simulation I would be very happy to hear anyone's opinion on the problem I am facing, since it has been bothering me for almost a week now. I also don't seem to completely understand what is happening in this copy phase; I'd expect an on-disk merge sort not to require much heap.
Thanks a lot in advance for any helpful comments and answers!
I think the clue is that almost all of my reduce task's heap is needed for the reduce phase itself. But the shuffle phase competes for the same heap space, and the resulting conflict caused my jobs to crash. I think this explains why the job no longer crashes if I lower shuffle.input.buffer.percent.
The parameter you cite, mapred.job.shuffle.input.buffer.percent, is apparently a pre-Hadoop 2 parameter. I could find that parameter in mapred-default.xml per the 1.0.4 docs, but its name has changed to mapreduce.reduce.shuffle.input.buffer.percent per the 2.2.0 docs.
Per the docs this parameter's description is:
The percentage of memory to be allocated from the maximum heap size to storing map outputs during the shuffle.
For a complete understanding of Sort and Shuffle see Chapter 6.4 of The Hadoop Definitive Guide. That book provides an alternate definition of the parameter mapred.job.shuffle.input.buffer.percent:
The proportion of total heap size to be allocated to the map outputs buffer during the copy phase of the shuffle.
Since you observed that decreasing the value of mapred.job.shuffle.input.buffer.percent from its default of 0.7 to 0.2 solved your problem, it is pretty safe to say that you could also have solved your problem by increasing the value of the reducer's heap size.
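For example, both options can be set per job. A minimal sketch using the Hadoop 2 property names (the heap and container numbers below are placeholders, not recommendations):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ShuffleTuningExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Option 1: give the shuffle a smaller slice of the reducer's heap
        // (default 0.70), leaving more room for the reduce phase itself.
        conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.2f);
        // Option 2: keep the default buffer percentage but give the reducer
        // a bigger heap; the container size must be raised to match.
        conf.set("mapreduce.reduce.java.opts", "-Xmx2048m");
        conf.setInt("mapreduce.reduce.memory.mb", 2560);
        Job job = Job.getInstance(conf, "shuffle-tuning-example");
        // ... the usual job setup goes here ...
    }
}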
Even after changing shuffle.input.buffer.percent to 0.2, it didn't work for me and I got the same error.
After some trial and error on a single-node cluster, I found that there needs to be enough space in the / directory, as the process uses that space when it spills.
The spill directory (mapred.local.dir, or mapreduce.cluster.local.dir in Hadoop 2) also needs to be changed to a partition with enough free space.
Related bug - https://issues.apache.org/jira/browse/MAPREDUCE-6724
Can cause a NegativeArraySizeException if the calculated maxSingleShuffleLimit > MAX_INT

Limiting non-dfs usage per data node

I am facing a strange problem due to Hadoop's crazy data distribution and management. One or two of my data nodes are completely filled up due to non-DFS usage, whereas the others are almost empty. Is there a way I can make the non-DFS usage more uniform?
[I have already tried using dfs.datanode.du.reserved but that doesn't help either]
Example of the problem: I have 16 data nodes with 10 GB of space each. Initially, each node has approx. 7 GB free. When I start a job to process 5 GB of data (with replication factor = 1), I expect the job to complete successfully. But alas! When I monitor the job execution, I see one node suddenly run out of space because its non-DFS usage is approx. 6-7 GB; the task then retries, and another node runs out of space. I don't really want to rely on more retries, because that won't give the performance I am looking for.
Any idea how I can fix this issue?
It sounds like your input isn't being split up properly. You may want to choose a different InputFormat or write your own to better fit your data set. Also make sure that all your nodes are listed in your NameNode's slaves file.
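If a full custom InputFormat feels heavy, one lighter-weight knob to try first (just a sketch using the Hadoop 2-style API, and only useful if the input files are actually splittable) is to cap the maximum split size, so the input fans out over more map tasks and the intermediate (non-DFS) output is spread across more nodes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SmallerSplitsExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "smaller-splits");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Cap splits at 64 MB so the 5 GB input becomes ~80 map tasks spread
        // across the cluster instead of a handful of very large ones.
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
        // ... the usual mapper/reducer/output setup goes here ...
    }
}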
Another problem can be serious data skew, i.e. when a big part of the data goes to one reducer. You may need to create your own partitioner to solve it.
You cannot restrict non-DFS usage, as far as I know. I would suggest identifying exactly which input file (or which split of it) causes the problem. Then you will probably be able to find a solution.
Hadoop MR is built under the assumption that processing a single split can be done using a single node's resources, such as RAM and disk space.

Idea's for balancing out a HDFS -> HBase map reduce job

For a client, I've been scoping out the short-term feasibility of running a Cloudera-flavor Hadoop cluster on AWS EC2. For the most part the results have been as expected, with the performance of the logical volumes being mostly unreliable; that said, doing what I can, I've got the cluster to run reasonably well for the circumstances.
Last night I ran a full test of their importer script to pull data from a specified HDFS path and push it into HBase. Their data is somewhat unusual in that the records are less than 1 KB apiece and have been condensed together into 9 MB gzipped blocks. In total there are about 500K text records that get extracted from the gzips, sanity checked, then pushed on to the reducer phase.
The job runs within expectations for the environment (the amount of spilled records is expected), but one really odd problem is that when the job runs, it runs with 8 reducers, yet 2 reducers do 99% of the work while the remaining 6 do a fraction of it.
My so-far-untested hypothesis is that I'm missing a crucial shuffle or block-size setting in the job configuration which causes most of the data to be pushed into blocks that can only be consumed by 2 reducers. Unfortunately, the last time I worked on Hadoop, another client's data set was in 256 GB LZO files on a physically hosted cluster.
To clarify, my question: is there a way to tweak an M/R job to actually utilize more of the available reducers, either by lowering the output size of the maps or by causing each reducer to cut down the amount of data it will parse? Even an improvement from the current 2 reducers to 4 would be major.
It seems like you are getting hotspots in your reducers. This is likely because a particular key is very popular. What are the keys as the output of the mapper?
You have a couple of options here:
Try more reducers. Sometimes, you get weird artifacts in the randomness of the hashes, so having a prime number of reducers sometimes helps. This will likely not fix it.
Write a custom partitioner that spreads out the work better (see the sketch after this list).
Figure out why a bunch of your data is getting binned into two keys. Is there a way to make your keys more unique to split up the work?
Is there anything you can do with a combiner to reduce the amount of traffic going to the reducers?
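For the custom-partitioner option, here is a minimal sketch (the Text/IntWritable key and value types and the HOT_KEY value are made up for illustration): it sprays records for one known heavy key across every reducer and leaves everything else on the normal hash. This is only valid if your reduce logic can tolerate a hot key's records being split across reducers, which is often fine for an import job.

import java.util.Random;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: one known hot key is scattered across all
// reducers; everything else keeps plain hash partitioning.
public class SpreadHotKeyPartitioner extends Partitioner<Text, IntWritable> {
    private static final String HOT_KEY = "HOT_KEY";
    private final Random random = new Random();

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (HOT_KEY.equals(key.toString())) {
            // Spray the hot key's records over every reducer.
            return random.nextInt(numPartitions);
        }
        // Default hash behaviour for everything else.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

Register it in the driver with job.setPartitionerClass(SpreadHotKeyPartitioner.class). If you don't know the hot keys up front, a quick counting pass over the mapper output keys will tell you which ones to special-case.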
