Debugging Hadoop reducer OutOfMemoryError - hadoop

I'm trying to debug an OutOfMemoryError I'm getting in my Hadoop reducers. The mappers complete successfully and generate small records of less than 128 bytes each. In my reducer, I collect records with the same key (there are around 15 possible keys) and write them to separate output files with MultipleOutputs. The distribution of records per key isn't uniform.
In the middle of the reduce phase, I start getting OutOfMemoryErrors. I've checked a lot of things:
The reducer doesn't accumulate data; once it gets a value, it writes it out to the corresponding output
I tried different values for the number of reduce tasks. Tuning this is a bit awkward in my case: more than 15 won't help, since there are only 15 keys
I tried instantiating MultipleOutputs and closing it inside reduce(), thinking it holds onto resources for the output files. This only works because keys and output files have a one-to-one mapping.
I tried adding data to the end of the keys so the data would get distributed evenly between reduce tasks
Out of paranoia, I set mapreduce.reduce.shuffle.memory.limit.percent=0
Verified keys and values really are small
Disabled output compression, thinking there's a memory leak in the compressor
Blindly tuning things like mapreduce.reduce.shuffle.merge.percent
I'm not sure where else memory could be going other than aggressively buffering the shuffle output.
This is running on GCP Dataproc with Hadoop 3.2.2. A lot of guides recommend setting mapreduce.reduce.java.opts. I tried this without success, but I also assume Google chose a reasonable default for the host size, and I don't have a convincing story about where the memory is going. My one other theory is that something in GoogleHadoopOutputStream, which writes to cloud storage, is buffering. I have some output files between 10GB and 100GB, larger than the memory of the machine.
What else should I look at? Are there other flags I should try to tune? Attaching VisualVM doesn't look easy, but would a heap dump help?

Each GoogleHadoopOutputStream consumes around 70 MiB of JVM heap, because it uploads data to Google Cloud Storage in 64 MiB chunks by default. So if you are writing many objects from the same MR task using MultipleOutputs, each task will need roughly (number of outputs) x 70 MiB of JVM heap.
You can reduce the memory consumed by each GoogleHadoopOutputStream via the fs.gs.outputstream.upload.chunk.size property, but this will also reduce upload speed to Google Cloud Storage, so a better approach is to refactor your MR job to write a single file (or at least fewer files) per MR task.
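For reference, a minimal sketch of lowering the chunk size from the job driver; the 8 MiB value here is purely illustrative, and the property (read by the GCS connector, value in bytes) trades heap per open stream against upload throughput:
// Sketch only: shrink the per-stream upload buffer used by the GCS connector.
// Smaller chunks mean less heap per open GoogleHadoopOutputStream but slower uploads.
org.apache.hadoop.conf.Configuration conf = job.getConfiguration();
conf.setInt("fs.gs.outputstream.upload.chunk.size", 8 * 1024 * 1024);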

Related

Does Apache Spark read and process at the same time, or does it first read the entire file into memory and then start transformations?

I am curious whether Spark first reads the entire file into memory and only then starts processing it (applying transformations and actions), or whether it reads the first chunk of a file, applies transformations to it, reads the second chunk, and so on.
Is there any difference between Spark and Hadoop in this respect? I read that Spark keeps the entire file in memory most of the time, while Hadoop does not. But what about the initial step, when we read the file for the first time and map the keys?
Thanks
I think a fair characterisation would be this:
Both Hadoop (or more accurately MapReduce) and Spark use the same underlying filesystem HDFS to begin with.
During the Mapping phase both will read all data and actually write the map result to disk so that it can be sorted and distributed between nodes via the Shuffle logic.
Both of them do in fact try and cache the data just mapped in memory in addition to spilling it to disk for the Shuffle to do its work.
The difference here though is that Spark is a lot more efficient in this process, trying to optimally align the node chosen for a specific computation with the data already cached on a certain node.
Since Spark also does something called lazy evaluation, the memory use of Spark is very different from Hadoop's, as a result of planning computation and caching simultaneously.
For the steps of a word-count job, Hadoop does this:
Map all the words to 1.
Write all those mapped pairs of (word, 1) to a single file in HDFS (single file could still span multiple nodes on the distributed HDFS) (this is the shuffle phase)
Sort the rows of (word, 1) in that shared file (this is the sorting phase)
Have the reducers read sections (partitions) from that shared file that now contains all the words sorted and sum up all those 1s for every word.
Spark on the other hand will go the other way around:
It figures that, like in Hadoop, it is probably most efficient to have all those words summed up via separate reducer runs, so it decides, based on some factors, that it wants to split the job into x parts and then merge them into the final result.
So it knows that the words will have to be sorted, which will require at least part of them in memory at a given time.
After that it evaluates that such a sorted list will require all words mapped to (word, 1) pairs to start the calculation.
It works through steps 3, then 2, then 1.
Now the trick relative to Hadoop is that it knows in step 3 which in-memory cached items it will need in step 2, and in step 2 it already knows how these parts (mostly K-V pairs) will be needed in the final step 1.
This allows Spark to plan the execution of jobs very efficiently, by caching data it knows will be needed in later stages of the job. Hadoop, working from the beginning (mapping) to the end without explicitly looking ahead to the following stages, simply cannot use memory this efficiently and hence doesn't waste resources keeping the large chunks in memory that Spark would keep. Unlike Spark, it just doesn't know whether all the pairs from a map phase will be needed in the next step.
The fact that Spark appears to keep the whole dataset in memory is hence not something Spark actively does, but rather a result of the way Spark is able to plan the execution of a job.
On the other hand, Spark may be able to actually keep fewer things in memory in a different kind of job. Counting the number of distinct words is a good example here, in my opinion.
Here Spark would have planned ahead and would immediately drop a repeated word from the cache/memory when encountering it during the mapping, while Hadoop would go ahead and waste memory on shuffling the repeated words too. (I acknowledge there are a million ways to make Hadoop do this as well, but it's not out of the box; there are also ways of writing your Spark job in unfortunate ways that break these optimisations, but it's not so easy to fool Spark here :))
Hope this helps understand that the memory use is just a natural consequence of the way Spark works, but not something actively aimed at and also not something strictly required by Spark. It is also perfectly capable of repeatedly spilling data back to disk between steps of the execution when memory becomes an issue.
For more insight into this I recommend learning about the DAG scheduler in Spark from here to see how this is actually done in code.
You'll see that it always follows the pattern of working out where what data is and will be cached before figuring out what to calculate where.
Spark uses lazy iterators to process data and can spill data to disk if necessary. It doesn't read all data in memory.
The difference compared to Hadoop is that Spark can chain multiple operations together.
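To make the lazy chaining concrete, here is a minimal Java word-count sketch (assuming Spark 2.x or later, where flatMap returns an Iterator; paths and names are illustrative). The chained transformations only record a plan; nothing is read from HDFS until the final action runs:
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class WordCountSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("wordcount-sketch");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Transformations are lazy: they only build up the execution plan (DAG).
            JavaRDD<String> lines = sc.textFile("hdfs:///input/words.txt");   // illustrative path
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);
            // Only this action triggers reading, shuffling and writing.
            counts.saveAsTextFile("hdfs:///output/wordcounts");               // illustrative path
        }
    }
}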

Hadoop - Reduce the number of Spilled Records

I have an Ubuntu VM running in stand-alone/pseudo-distributed mode with 4 GB RAM and 4 cores.
Everything is set to default except:
io.file.buffer.size=65536
io.sort.factor=50
io.sort.mb=500
mapred.tasktracker.map.tasks.maximum=4
mapred.tasktracker.reduce.tasks.maximum=4
This of course will not be a production machine, but I am fiddling with it to get to grips with the fine tuning.
My problem is that when I run my benchmark Hadoop Streaming job (get distinct records over a 1.8 GB text file) I get quite a lot of spilled records, and the above tweaks don't seem to reduce the spills. I have also noticed that when I monitor the memory usage in Ubuntu's System Monitor, it never gets fully used and never goes above 2.2 GB.
I have looked at changing HADOOP_HEAP, mapred.map.child.java.opts and mapred.reduce.child.java.opts, but I am not sure what to set these to, as the defaults seem as though they should be enough.
Is there a setting I am missing that will allow Hadoop to utilise the remaining RAM and therefore reduce spilled records (hopefully speeding up jobs), or is this normal behaviour?
Many Thanks!
In addition to increasing memory, have you considered whether you can run a combiner for your task after the map step, which will compress and reduce the number of records that need to be kept in memory or spilled?
Unfortunately, when you are using streaming, it seems this has to be coded in Java and can't be written in whatever language you're using (see the sketch after the link below).
http://wiki.apache.org/hadoop/HadoopStreaming
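For what it's worth, a minimal sketch of such a Java combiner for a distinct-records job, assuming the mapper emits each record as the key with a NullWritable value (the class name is made up for illustration):
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Collapses duplicate keys on the map side so fewer records are spilled and shuffled.
public class DistinctCombiner extends Reducer<Text, NullWritable, Text, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<NullWritable> values, Context context)
            throws IOException, InterruptedException {
        // Emit each key exactly once, discarding its duplicates from this map task.
        context.write(key, NullWritable.get());
    }
}
// In a Java driver: job.setCombinerClass(DistinctCombiner.class);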
The default memory assigned to a map/reduce task is 200 MB. You can increase that value with -Dmapred.child.java.opts=-Xmx512M
Anyway, this is very interesting material about Hadoop tuning: Hadoop Performance
Hope it helps!

Ideas for balancing out an HDFS -> HBase map reduce job

For a client, I've been scoping out the short-term feasibility of running a Cloudera-flavor Hadoop cluster on AWS EC2. For the most part the results have been as expected, with the performance of the logical volumes being mostly unreliable; that said, doing what I can, I've got the cluster to run reasonably well for the circumstances.
Last night I ran a full test of their importer script to pull data from a specified HDFS path and push it into HBase. Their data is somewhat unusual in that the records are less than 1 KB apiece and have been condensed together into 9 MB gzipped blocks. In total there are about 500K text records that get extracted from the gzips, sanity checked, then pushed on to the reducer phase.
The job runs within expectations of the environment (the number of spilled records is expected by me), but one really odd problem is that when the job runs, it runs with 8 reducers yet 2 reducers do 99% of the work while the remaining 6 do a fraction of the work.
My so far untested hypothesis is that I'm missing a crucial shuffle or blocksize setting in the job configuration which causes most of the data to be pushed into blocks that can only be consumed by 2 reducers. Unfortunately the last time I worked on Hadoop, another client's data set was in 256GB lzo files on a physically hosted cluster.
To clarify my question: is there a way to tweak an M/R job to actually utilize more of the available reducers, either by lowering the output size of the maps or by causing each reducer to cut down the amount of data it will parse? Even an improvement to 4 reducers over the current 2 would be major.
It seems like you are getting hotspots in your reducers. This is likely because a particular key is very popular. What are the keys as the output of the mapper?
You have a couple of options here:
Try more reducers. Sometimes, you get weird artifacts in the randomness of the hashes, so having a prime number of reducers sometimes helps. This will likely not fix it.
Write a custom partitioner that spreads out the work better (see the sketch after this list).
Figure out why a bunch of your data is getting binned into two keys. Is there a way to make your keys more unique to split up the work?
Is there anything you can do with a combiner to reduce the amount of traffic going to the reducers?
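As a rough illustration of the custom-partitioner option, something along these lines salts hot keys by mixing the value's hash into the partition choice; note this is only safe if your reducer does not need to see all values for a key together, and the Text/Text types and class name are assumptions:
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Spreads records that share a very popular key across several reducers by
// factoring the value's hash into the partition number.
public class SaltedPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        int hash = key.hashCode() * 31 + value.hashCode();
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }
}
// In the driver: job.setPartitionerClass(SaltedPartitioner.class);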

HBase bulk load spawn high number of reducer tasks - any workaround

HBase bulk load (using the configureIncrementalLoad helper method) configures the job to create as many reducer tasks as there are regions in the HBase table. So if there are a few hundred regions, the job would spawn a few hundred reducer tasks. This could get very slow on a small cluster.
Is there any workaround possible by using MultipleOutputFormat or something else?
Thanks
Sharding the reduce stage by region is giving you a lot of long-term benefit. You get data locality once the imported data is online. You also can determine when a region has been load balanced to another server. I wouldn't be so quick to go to a coarser granularity.
Since the reduce stage is doing a single file write, you should be able to setNumReduceTasks(# of hard drives). That might speed it up more.
It's very easy to get network bottlenecked. Make sure you're compressing your HFile & your intermediate MR data.
// Compress the intermediate map output
job.getConfiguration().setBoolean("mapred.compress.map.output", true);
job.getConfiguration().setClass("mapred.map.output.compression.codec",
    org.apache.hadoop.io.compress.GzipCodec.class,
    org.apache.hadoop.io.compress.CompressionCodec.class);
// Compress the generated HFiles
job.getConfiguration().set("hfile.compression",
    Compression.Algorithm.LZO.getName());
Your data import size might be small enough that you should look at using a Put-based format. This will call the normal HTable.Put API and skip the reducer phase. See TableMapReduceUtil.initTableReducerJob(table, null, job).
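A rough sketch of that Put-based route, assuming a reasonably recent HBase client API and made-up table, column family and input layout; the mapper emits Put objects and, with a null reducer, initTableReducerJob writes them straight to the table:
import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits one Put per input line; assumes tab-separated "rowkey<TAB>value" input.
public class PutMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split("\t", 2);
        Put put = new Put(Bytes.toBytes(fields[0]));
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(fields[1]));
        context.write(new ImmutableBytesWritable(put.getRow()), put);
    }
}
// In the driver: TableMapReduceUtil.initTableReducerJob("my_table", null, job);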
When we use HFileOutputFormat, it overrides the number of reducers, whatever you set.
The number of reducers is equal to the number of regions in that HBase table.
So decrease the number of regions if you want to control the number of reducers.
You will find sample code here:
Hope this will be useful :)

Hadoop Pipes: how to pass large data records to map/reduce tasks

I'm trying to use map/reduce to process large amounts of binary data. The application is characterized by the following: the number of records is potentially large, such that I don't really want to store each record as a separate file in HDFS (I was planning to concatenate them all into a single binary sequence file), and each record is a large coherent (i.e. non-splittable) blob, between one and several hundred MB in size. The records will be consumed and processed by a C++ executable. If it weren't for the size of the records, the Hadoop Pipes API would be fine: but this seems to be based around passing the input to map/reduce tasks as a contiguous block of bytes, which is impractical in this case.
I'm not sure of the best way to do this. Does any kind of buffered interface exist that would allow each M/R task to pull multiple blocks of data in manageable chunks? Otherwise I'm thinking of passing file offsets via the API and streaming in the raw data from HDFS on the C++ side.
I'd like to hear opinions from anyone who's tried anything similar; I'm pretty new to Hadoop.
Hadoop is not designed for records of about 100MB in size. You will get OutOfMemoryError and uneven splits, because some records are 1MB and some are 100MB. By Amdahl's Law your parallelism will suffer greatly, reducing throughput.
I see two options. You can use Hadoop streaming to map your large files into your C++ executable as-is. Since this will send your data via stdin it will naturally be streaming and buffered. Your first map task must break up the data into smaller records for further processing. Further tasks then operate on the smaller records.
If you really can't break it up, make your map reduce job operate on file names. The first mapper gets some file names, runs them through your mapper C++ executable, and stores the results in more files. The reducer is given all the names of the output files, and you repeat with a reducer C++ executable. This will not run out of memory, but it will be slow. Besides the parallelism issue, you won't get reduce jobs scheduled onto nodes that already have the data, resulting in non-local HDFS reads.
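As a rough sketch of that second option, a mapper that receives HDFS paths as its input values and streams each blob in manageable chunks (the hand-off to the C++ executable is left as a placeholder, and the class name is made up):
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input: a text file listing one HDFS path per line. The mapper opens each path
// itself and reads it in buffered chunks instead of receiving the blob as a record.
public class FileNameMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text pathString, Context context)
            throws IOException, InterruptedException {
        Path path = new Path(pathString.toString());
        FileSystem fs = path.getFileSystem(context.getConfiguration());
        byte[] buffer = new byte[64 * 1024];
        try (FSDataInputStream in = fs.open(path)) {
            int bytesRead;
            while ((bytesRead = in.read(buffer)) > 0) {
                // hand buffer[0..bytesRead) to the external C++ processor here (omitted)
            }
        }
        // Record which blob was processed; the real output would name the result file.
        context.write(pathString, new Text("done"));
    }
}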
