Hadoop - Reduce the number of Spilled Records

I have an Ubuntu VM running in standalone/pseudo-distributed mode with 4 GB RAM and 4 cores.
Everything is set to default except:
io.file.buffer.size=65536
io.sort.factor=50
io.sort.mb=500
mapred.tasktracker.map.tasks.maximum=4
mapred.tasktracker.reduce.tasks.maximum=4
This of course will not be a production machine, but I am fiddling with it to get to grips with fine tuning.
My problem is that when I run my benchmark Hadoop Streaming job (getting the distinct records over a 1.8 GB text file) I get quite a lot of spilled records, and the above tweaks don't seem to reduce the spills. I have also noticed that when I monitor the memory usage in Ubuntu's System Monitor it never gets fully used and never goes above 2.2 GB.
I have looked at changing HADOOP_HEAP, mapred.map.child.java.opts and mapred.reduce.child.java.opts, but I am not sure what to set these to, as the defaults seem as though they should be enough.
Is there a setting I am missing that will allow Hadoop to utilise the remaining RAM and therefore reduce spilled records (hopefully speeding up jobs), or is this normal behaviour?
Many Thanks!

In addition to increasing memory, have you considered whether you can run a combiner for your task after the map step? It would compress and reduce the number of records that need to be kept in memory or spilled.
Unfortunately, when you are using streaming, it seems that this has to be coded in Java and can't be written in whatever language you're using for the mapper and reducer.
http://wiki.apache.org/hadoop/HadoopStreaming
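For what it's worth, a minimal sketch of such a Java combiner for a distinct-records streaming job might look like the following (the class name, the empty value and the Text/Text types are assumptions about how your mapper emits records):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical combiner: emit each map-side key exactly once so duplicates
// are dropped before they are spilled to disk or shuffled to the reducers.
public class DistinctCombiner extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    private static final Text EMPTY = new Text("");

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // The values don't matter for "distinct"; only the key is kept.
        output.collect(key, EMPTY);
    }
}

It would then be packaged into a jar on the job's classpath and passed to the streaming jar via its -combiner option.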

The default memory assigned to a map/reduce task is 200 MB. You can increase that value with -Dmapred.child.java.opts=-Xmx512M
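If you configure the job from Java code rather than the command line, the same property can be set on the JobConf (a sketch; the class name is made up and 512 MB is just the value from above):

import org.apache.hadoop.mapred.JobConf;

public class ChildHeapSettings {
    // Equivalent of -Dmapred.child.java.opts=-Xmx512M, applied in client code.
    public static JobConf create() {
        JobConf conf = new JobConf();
        conf.set("mapred.child.java.opts", "-Xmx512m");
        return conf;
    }
}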
Anyway, this is very interesting material about Hadoop tuning: Hadoop Performance
Hope it helps!

Related

Debugging Hadoop reducer OutOfMemoryError

I'm trying to debug an OutOfMemoryError I'm getting in my Hadoop reducers. The mappers complete successfully and generate small records that are less than 128 bytes each. In my reducer, I collect records with the same key (there are around 15 possible keys) and write them to separate output files with MultipleOutputs. The distribution of records per key isn't uniform.
In the middle of the reduce phase, I start getting OutOfMemoryErrors. I've checked a lot of things:
The reducer doesn't save data; once it gets a value, it writes it out to the corresponding output
I tried different values for the number of reduce tasks. Tuning this is a bit weird in my case because more than 15 won't help, since there are only 15 keys
Instantiating MultipleOutputs and closing it in reduce(), thinking it holds onto resources for the output files. This only works because keys and output files have a one-to-one mapping.
I tried adding data to the end of the keys so the data would get distributed evenly between reduce tasks
Out of paranoia, I set mapreduce.reduce.shuffle.memory.limit.percent=0
Verified keys and values really are small
Disabled output compression, thinking there's a memory leak in the compressor
Blindly tuning things like mapreduce.reduce.shuffle.merge.percent
I'm not sure where else memory could be going other than aggressively buffering the shuffle output.
This is running on GCP Dataproc with Hadoop 3.2.2. A lot of guides recommend setting mapreduce.reduce.java.opts. I tried this unsuccessfully, but I also assume Google chose a reasonable default for the host size, and I don't have a convincing story about where the memory is going. My one other theory is that something in GoogleHadoopOutputStream, which writes to cloud storage, is buffering. I have some output files between 10 GB and 100 GB, larger than the memory of the machine.
What else should I look at? Are these other flags I should try to tune? Attaching VisualVM doesn't look easy, but would a heap dump help?
Each GoogleHadoopOutputStream consumes around 70 MiB of JVM heap because it uploads data to Google Cloud Storage in 64 MiB chunks by default. That's why, if you are writing many objects from the same MR task using MultipleOutputs, each task will need roughly (number of outputs) x 70 MiB of JVM heap.
You can reduce the memory consumed by each GoogleHadoopOutputStream via the fs.gs.outputstream.upload.chunk.size property, but this will reduce upload speed to Google Cloud Storage too, which is why a better approach is to refactor your MR job to write a single file (or fewer files) in each MR task.
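As a rough sketch of the first option (the class name and the 16 MiB figure are just examples), the chunk size can be lowered in the job's Configuration before submission, trading some upload throughput for a smaller per-stream heap footprint:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SmallerGcsUploadBuffers {
    public static Job createJob() throws Exception {
        Configuration conf = new Configuration();
        // Shrink each GoogleHadoopOutputStream's upload buffer from the
        // 64 MiB default to 16 MiB (the value is in bytes).
        conf.setInt("fs.gs.outputstream.upload.chunk.size", 16 * 1024 * 1024);
        Job job = Job.getInstance(conf, "reduced-gcs-buffer-job");
        // ... set reducer, MultipleOutputs named outputs, paths, etc. ...
        return job;
    }
}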

Tuning Hadoop job execution on YARN

A bit of intro - I'm learning about Hadoop. I have implemented a machine learning algorithm (clustering) on top of Hadoop and tested it only on a small example (30 MB).
A couple of days ago I installed Ambari and created a small cluster of four machines (a master and 3 workers). The master runs the ResourceManager and NameNode.
Now I'm testing my algorithm with increasing amounts of data (300 MB, 3 GB). I'm looking for pointers on how to tune my mini-cluster. Concretely, I would like to know how to determine the MapReduce2 and YARN settings in Ambari.
How do I determine the min/max memory for a container, reserved memory for containers, sort allocation memory, and map and reduce memory?
The problem is that execution of my jobs is very slow on Hadoop (and clustering is an iterative algorithm, which makes things worse).
I have a feeling that my cluster setup is not good, because of the following reason:
I run a job on a dataset of 30 MB (I set the block size for this job to 8 MB, since the data is small and processing is intensive) - execution time 30 minutes
I run the same job, but with the same dataset multiplied 10 times - 300 MB (same block size, 8 MB) - execution time 2 hours
Now the same amount of data - 300 MB, but a block size of 128 MB - same execution time, maybe even a bit more than 2 hours
The HDFS block size is 128 MB, so I thought that this would cause a speedup, but that is not the case. My suspicion is that the cluster setup (min/max RAM size, map and reduce RAM) is not good, and hence it cannot improve even though greater data locality is achieved.
Could this be the consequence of a bad setup, or am I wrong?
Please set the property below in the YARN configuration to allocate 33% of the maximum YARN memory per job; this can be altered based on your requirements.
Default: yarn.scheduler.capacity.root.default.user-limit-factor=1
Change it to: yarn.scheduler.capacity.root.default.user-limit-factor=0.33
If you need further info on this, please refer to the following link: https://analyticsanvil.wordpress.com/2015/08/16/managing-yarn-memory-with-multiple-hive-users/

Benchmarking Hadoop on EC2 gives identical performances

I am trying to benchmark Hadoop on EC2. I am using a 1 GB file with 1 master and 5 slaves. I varied dfs.blocksize across values like 1m, 64m, 128m and 500m. I was expecting the best performance at 128m, since the file size is 1 GB and there are 5 slaves. But to my surprise, irrespective of the block size, the time taken falls more or less within the same range. How am I achieving this weird performance?
A couple of things to think about, most likely explanation first:
Check you are correctly passing in the system variables that control the split size of the job; if you don't change these you won't alter the number of mappers (which you can check in the JobTracker UI). If you get the same number of mappers each time, you're not actually changing anything. To change the split size, use the system props mapred.min.split.size and mapred.max.split.size (see the sketch below, after these points)
Make sure you are really hitting the cluster and not accidentally running locally with 1 process
Be aware that (unlike Spark) Hadoop has a horrifying job initialization time. IME it's around 20 seconds, so for only 1 GB of data you're not really seeing much time difference, as the majority of the job is spent in initialization.
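A sketch of the split-size settings mentioned in the first point (the class name and the 128 MB figure are just examples):

import org.apache.hadoop.mapred.JobConf;

public class SplitSizeSettings {
    // Pin the split size so each benchmark run really changes the mapper count.
    public static JobConf create() {
        JobConf conf = new JobConf();
        long splitBytes = 128L * 1024 * 1024; // 128 MB, in bytes
        conf.setLong("mapred.min.split.size", splitBytes);
        conf.setLong("mapred.max.split.size", splitBytes);
        return conf;
    }
}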

Flexible heap space allocation to Hadoop MapReduce Mapper tasks

I'm having trouble figuring out the best way to configure my Hadoop cluster (CDH4), running MapReduce1. I'm in a situation where I need to run mappers that require such a large amount of Java heap space that I couldn't possibly run more than 1 mapper per node, but at the same time I want to be able to run jobs that can benefit from many mappers per node.
I'm configuring the cluster through the Cloudera management UI, and the Max Map Tasks and mapred.map.child.java.opts appear to be quite static settings.
What I would like to have is something like a heap space pool with X GB available, that would accommodate both kinds of jobs without having to reconfigure the MapReduce service each time. If I run 1 mapper, it should assign X GB heap - if I run 8 mappers, it should assign X/8 GB heap.
I have considered both the Maximum Virtual Memory and the Cgroup Memory Soft/Hard limits, but neither will get me exactly what I want. Maximum Virtual Memory is not effective, since it still is a per task setting. The Cgroup setting is problematic because it does not seem to actually restrict the individual tasks to a lower amount of heap if there is more of them, but rather will allow the task to use too much memory and then kill the process when it does.
Can the behavior I want to achieve be configured?
(PS you should use the newer name of this property with Hadoop 2 / CDH4: mapreduce.map.java.opts. But both should still be recognized.)
The value you configure in your cluster is merely a default. It can be overridden on a per-job basis. You should leave the default value from CDH, or configure it to something reasonable for normal mappers.
For your high-memory job only, in your client code, set mapreduce.map.java.opts in your Configuration object for the Job before you submit it.
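A minimal sketch of that per-job override (the class name and the 4 GB heap are only illustrative values):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class HighMemoryMapperJob {
    public static Job createJob() throws Exception {
        Configuration conf = new Configuration();
        // Only this job gets the large heap; the cluster default stays put.
        conf.set("mapreduce.map.java.opts", "-Xmx4g");
        Job job = Job.getInstance(conf, "high-memory-mappers");
        // ... mapper class, input/output paths, etc., then job.submit() ...
        return job;
    }
}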
The answer gets more complex if you are running MR2/YARN since it no longer schedules by 'slots' but by container memory. So memory enters the picture in a new, different way with new, different properties. (It confuses me, and I'm even at Cloudera.)
In a way it is better, because you express your resource requirement in terms of memory, which is what matters here. You would also set mapreduce.map.memory.mb to a size about 30% larger than your JVM heap size, since this is the memory allowed to the whole process. You would set it higher for high-memory jobs in the same way. Then Hadoop can decide how many mappers to run and where to put the workers for you, and use as much of the cluster as possible per your configuration. No fussing with your own imaginary resource pool.
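Continuing the hypothetical snippet above, the roughly 30% of headroom could look like this (sizes are illustrative only):

// Size the YARN container about 30% larger than the JVM heap, since the
// container has to hold the whole process, not just the heap.
conf.set("mapreduce.map.java.opts", "-Xmx6144m"); // 6 GB heap per map task
conf.setInt("mapreduce.map.memory.mb", 8192);     // ~6144 MB heap + ~30% overhead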
In MR1, this is harder to get right. Conceptually you want to set the maximum number of mappers per worker to 1 via mapreduce.tasktracker.map.tasks.maximum, along with your heap setting, but just for the high-memory job. I don't know whether the client can request or set this on a per-job basis, though; I doubt it, as it wouldn't quite make sense. In general you can't really approach this by controlling the number of mappers, if only because you have to hack around to even find out, let alone control, how many mappers the job will run.
I don't think OS-level settings will help. In a way these resemble more how MR2 / YARN thinks about resource scheduling. Your best bet may be to (move to MR2 and) use MR2's resource controls and let it figure the rest out.

Hadoop single node configuration on the high memory machine

I have a single-node instance of Apache Hadoop 1.1.1 with default parameter values (see e.g. [1] and [2]) on a machine with a lot of RAM and very limited free disk space. I have noticed that this Hadoop instance wastes a lot of disk space during map tasks. What configuration parameters should I pay attention to in order to take advantage of the high RAM capacity and decrease disk space usage?
You can use several of the mapred.* params to compress map output, which will greatly reduce the amount of disk space needed to store mapper output. See this question for some good pointers.
Note that different compression codecs will have different issues (e.g. GZip needs more CPU than LZO, but you have to install LZO yourself). This page has a good discussion of compression issues in Hadoop, although it is a bit dated.
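As a sketch of those compression settings with the old (MR1) API that Hadoop 1.1.1 uses (the class name is made up, and GZip is only an example; see the codec trade-offs above):

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.JobConf;

public class CompressedMapOutput {
    // Compress intermediate map output so spills and shuffle data
    // take less disk space on the single node.
    public static JobConf create() {
        JobConf conf = new JobConf();
        conf.setCompressMapOutput(true);
        conf.setMapOutputCompressorClass(GzipCodec.class);
        return conf;
    }
}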
The amount of RAM you need depends upon what you are doing in your map-reduce jobs, although you can increase the heap size via mapred.map.child.java.opts in conf/mapred-site.xml.
See cluster setup for more details on this.
You can use dfs.datanode.du.reserved in hdfs-site.xml to specify an amount of disk space that HDFS will leave unused. I don't know whether Hadoop is able to compensate with higher memory usage.
You'll have a problem, though, if you run a mapreduce job that's disk i/o intensive. I don't think any amount of configuring will help you then.
