Flexible heap space allocation to Hadoop MapReduce Mapper tasks - hadoop

I'm having trouble figuring out the best way to configure my Hadoop cluster (CDH4), running MapReduce1. I'm in a situation where I need to run both mappers that require such a large amount of Java heap space that I couldn't possible run more than 1 mapper per node - but at the same time I want to be able to run jobs that can benefit from many mappers per node.
I'm configuring the cluster through the Cloudera management UI, and the Max Map Tasks and mapred.map.child.java.opts appear to be quite static settings.
What I would like to have is something like a heap space pool with X GB available, that would accommodate both kinds of jobs without having to reconfigure the MapReduce service each time. If I run 1 mapper, it should assign X GB heap - if I run 8 mappers, it should assign X/8 GB heap.
I have considered both the Maximum Virtual Memory and the Cgroup Memory Soft/Hard limits, but neither will get me exactly what I want. Maximum Virtual Memory is not effective, since it still is a per task setting. The Cgroup setting is problematic because it does not seem to actually restrict the individual tasks to a lower amount of heap if there is more of them, but rather will allow the task to use too much memory and then kill the process when it does.
Can the behavior I want to achieve be configured?

(PS you should use the newer name of this property with Hadoop 2 / CDH4: mapreduce.map.java.opts. But both should still be recognized.)
The value you configure in your cluster is merely a default. It can be overridden on a per-job basis. You should leave the default value from CDH, or configure it to something reasonable for normal mappers.
For your high-memory job only, in your client code, set mapreduce.map.java.opts in your Configuration object for the Job before you submit it.
The answer gets more complex if you are running MR2/YARN since it no longer schedules by 'slots' but by container memory. So memory enters the picture in a new, different way with new, different properties. (It confuses me, and I'm even at Cloudera.)
In a way it would be better, because you express your resource requirement in terms of memory, which is good here. You would set mapreduce.map.memory.mb as well to a size about 30% larger than your JVM heap size since this is the memory allowed to the whole process. It would be set higher by you for high-memory jobs in the same way. Then Hadoop can decide how many mappers to run, and decide where to put the workers for you, and use as much of the cluster as possible per your configuration. No fussing with your own imaginary resource pool.
In MR1, this is harder to get right. Conceptually you want to set the maximum number of mappers per worker to 1 via mapreduce.tasktracker.map.tasks.maximum, along with your heap setting, but just for the high-memory job. I don't know if the client can request or set this though on a per-job basis. I doubt it as it wouldn't quite make sense. You can't really approach this by controlling the number of mappers just because you have to hack around to even find out, let alone control, the number of mappers it will run.
I don't think OS-level settings will help. In a way these resemble more how MR2 / YARN thinks about resource scheduling. Your best bet may be to (move to MR2 and) use MR2's resource controls and let it figure the rest out.

Related

What is Memory reserved on Yarn

I managed to launch a spark application on Yarn. However memory usage is kind of weird as you can see below :
http://imgur.com/1k6VvSI
What does memory reserved mean ? How can i manage to efficiently use all the memory available ?
Thanks in advance.
Check out this blog from Cloudera that explains the new memory management in YARN.
Here's the pertinent bits:
... An implementation detail of this change that prevents applications from starving under this new flexibility is the notion of reserved containers. Imagine two jobs are running that each have enough tasks to saturate more than the entire cluster. One job wants each of its mappers to get 1GB, and another job wants its mappers to get 2GB. Suppose the first job starts and fills up the entire cluster. Whenever one of its task finishes, it will leave open a 1GB slot. Even though the second job deserves the space, a naive policy will give it to the first one because it’s the only job with tasks that fit. This could cause the second job to be starved indefinitely.
To prevent this unfortunate situation, when space on a node is offered to an application, if the application cannot immediately use it, it reserves it, and no other application can be allocated a container on that node until the reservation is fulfilled. Each node may have only one reserved container. The total reserved memory amount is reported in the ResourceManager UI. A high number means that it may take longer for new jobs to get space. ,,,
A container will become reserved state when the container is assigned to some nodemanager node which do not have enough resource(cpu or memory) for it.

Benchmarking Hadoop on EC2 gives identical performances

I am trying to benchmark Hadoop on EC2. I am using a 1GB file with 1 Master and 5 slaves. When I varied the dfs.blocksize like 1m, 64m, 128m, 500m. I was expecting the best performance at 128m since the file size is 1GB and there are 5 slaves. But to my surprise, irrespective of the block size, time taken falls more or less within the same range. How am I achieving this wierd performance?
Couple of things to think about most likely explanation first
Check you are correctly passing in the system variables to control the split size of the job, if you don't change this you won't alter the numbers of mappers (which you can check in the jobtracker UI). If you get the same number of mappers each time your not actually changing anything. To change the split size, use the system props mapred.min.split.size and mapred.max.split.size
Make sure you are really hitting the cluster and not accidentally running locally with 1 process
Be aware that (unlike Spark) Hadoop has a horrifying job initialization time. IME it's around 20 seconds, and therefore for only 1 GB of data your not really seeing much time difference as the majority of the job is spent in initialization.

Run Map-Reduce application on multiple core on the same machine

I want to run map reduce tasks on a single machine and I want to use all the cores of my machine. Which is the best approach? If I install hadoop in pseudo distributed mode it is possible to use all the cores?
You can make use of the properties mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum to increase the number of Mappers/Reducers spawned simultaneously on a TaskTracker as per your hardware specs. By default, it is set to 2, hence a maximum of 2 maps and 2 reduces will run at a given instance. But, one thing to keep in mind is that if your input is very small then framework will decide it's not worth parallelizing the execution. In such a case you need to handle it by tweaking the default split size through mapred.max.split.size.
Having said that, I, based on my personal experience, have noticed that MR jobs are normally I/O(perhaps memory, sometimes) bound. So, CPU does not really become a bottleneck under normal circumstances. As a result you might find it difficult to fully utilize all the cores on one machine at a time for a job.
I would suggest to devise some strategy to decide the proper number of Mappers/Reducers to efficiently carry out the processing to make sure that you are properly utilizing the CPU since Mappers/Reducers take up slots on each node. One approach could be to take the number of cores, multiply it by .75 and then set the number of Mappers and Reducers as per your needs. For example, you have 12 physical cores or 24 virtual cores, then you could have 24*.75 = 18 slots. Now based on your needs you can decide whether to use 9Mappers+9Reducers or 12Mappers+6Reducers or something else.
I'm reposting my answer from this question: Hadoop and map-reduce on multicore machines
For Apache Hadoop 2.7.3, my experience has been that enabling YARN will also enable multi-core support. Here is a simple guide for enabling YARN on a single node:
https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/SingleCluster.html#YARN_on_a_Single_Node
The default configuration seems to work pretty well. If you want to tune your core usage, then perhaps look into setting 'yarn.scheduler.minimum-allocation-vcores' and 'yarn.scheduler.maximum-allocation-vcores' within yarn-site.xml (https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml)
Also, see here for instructions on how to configure a simple Hadoop sandbox with multicore support: https://bitbucket.org/aperezrathke/hadoop-aee

Share resources between map/reduce

Does running map task slows down the reduce task? By slowing down I mean do they share a common resource?
Of course they are going to affect the system in one way or another. They are both java processes running on the same machines. However, in system configurations these days, it's not that big of a deal as long as you don't do something stupid with the number of slots.
Each map task or reduce task itself isn't multithreaded or multi-processes, so it'll mostly only use one CPU core. This is why a general rule of thumb is 1 map or reduce slot per core makes a bit of sense. So, if you have 12 cores, you could do something like 8 map slots and 4 reduce slots.
Also, the tasks are going to be sharing the same disk, but this isn't that big of a deal either typically since systems have several disks and disk access comes in bursts.
The best way to figure out the best configuration is just try to try different configurations out. It's not hard to set the number of slots, so just tweak it and then rerun some production-representative jobs.
Note that if you are only running one job at a time, the reducers will not be doing much while the mappers are running. In which case, they won't really affect one another. More realistically, you'll have several jobs running and the map tasks of one job will be running at the same time as the other job's reducers.

Hadoop workload

I am currently using wordcount application in hadoop as a benchmark. I find that the cpu usage is fairly nearly constant around 80-90%. I would like to have a fluctuating cpu usage. Is there any hadoop application that can give me this capability? Thanks a lot.
I don't think there's a way to throttle or specify a range for hadoop to use. Hadoop will use the CPU available to it. When I'm running a lot of jobs, I'm constantly in the 90%+ range.
One way you can control the CPU usage is to change the maximum number of mappers/reducers each tasktracker can run simultaneously. This is done through the
mapred.tasktracker.{map|reduce}.tasks.maximum setting in $HADOOP_HOME/conf/core-site.xml.
It will use less CPU on that tasktracker when the number of mapper/reducers is limited.
Another way is to set the configuration value for mapred.tasktracker.{map|reduce}.tasks when setting up the job. This will force that job to use that many mappers/reducers. This number will be split across the available tasktrackers, so if you have 4 nodes and want each node to have 1 mapper you'd set mapred.tasktracker.map.tasks to 4. It's also possible that if a node can run 4 mappers, it will run all 4, I don't know exactly how hadoop will split out the tasks, but forcing a number, per job, is an option.
I hope that helps get you to where you're going. I still don't quite understand what you are looking for. :)

Resources