Running Mappers and Reducers on different Groups of machines - hadoop

We have a nice, big, complicated elastic-mapreduce job that has wildly different constraints on hardware for the Mapper vs Collector vs Reducer.
The issue is: for the Mappers, we need tonnes of lightweight machines to run several mappers in parallel (all good there); the collectors are more memory hungry, but it should still be OK to give them about 6GB of peak heap each . . . but, the problem is the Reducers. When one of those kicks off, it will grab up about 32-64GB for processing.
The result it that we get a round-robbin type of task death because the full memory of a box is consumed, which causes that one mapper and reducer to both be restarted elsewhere.
The simplest approach would be if we could somehow specify a way to have the reducer run on a different "group" (a handful of ginormous boxes) while having the mappers/collectors running on smaller boxes. This could also lead to significant cost-savings as well, as we really shouldn't be sizing the nodes mappers are running on to the demands of the reducers.
An alternative would be to "break up" the job so that there's a 2nd cluster that can be spun up to process the mappers collector's output--but, that's obviously "sub-optimal".
So, the question are:
Is there a way do specify what "groups" a mapper or a reducer will
run upon Elastic MapReduce and/or Hadoop?
Is there a way to prevent the reducers from starting until all the mappers are done?
Does anyone have other ideas on how to approach this?
Cheers!

During a Hadoop MapReduce job, Reducers start running after all the Mappers are done. The output from the Map phase is shuffled and sorted before partitioning takes place to decide which Reducer receives which data. So, Reducers start running after the Shuffle/Sort phase has ended (after the mappers are done).

Related

How many reducers can simultaneously run?

Learning Big Data at Uni and I'm kind of confused on the topic of MapReduce. I was wondering how many reducers can run simultaneously. For example lets say if we had 864 reducers, how many could run simultaneously?
All of them can run simultaneously depending upon what is the state(health, i.e. no rouge/bad node) of cluster is, what is the capacity of the cluster is and also how free the cluster is. If there are other MR jobs running on the same cluster then out of your 864 reducers only few will go in running state, and once the capacity is free then another set of reducer will start running.
Also there is one case which happens sometimes is when your reducer/mapper keep on preempting each other and takes up the whole memory. Job fails in majority of this case. To avoid this we generally set less number of reducer.
One line answer is - all of them can run simultaneously; as each of the reducer performs an independent unit of task in map reduce framework.
Now, how many would actually run in parallel, or more precisely when each of them would be scheduled to run depends on many factors including but not limited to resource availability, scheduling mechanism, cluster configuration etc.

Hadoop combiner execution on reducers

I have a long running MapReduce job with some mappers taking considerably more time than others.
Checking the stats on the web interface, I saw that my combiner also kicked in on the reducers (which where mostly idle as just 2 mappers were still running).
Although it seems reasonable to not waste time and do some pre-aggregation until all mappers have finished, I cannot find any documentation for this behaviour. Can anyone confirm that this is indeed a feature of Hadoop or just displayed wrong on the web interface?
The combiner starts when a reasonable amount of data has been emitted by the mapper. Note that a combiner runs as an aggregation (typically) of a mapper's output (and not on the reduce side). More details can be found here.
Also, the reducers can start gathering (only) the data that are emitted by the mappers, before all the mappers have finished. That is called the shuffling phase of the reducer. You can change the time when the reducers will start gathering data, by changing the mapred.reduce.slowstart.completed.maps property (or mapreduce.job.reduce.slowstart.completedmaps in newer versions). More details on this SO post.

How many Mapreduce Jobs can be run simultaneously

I want to know How many Mapreduce Jobs can be submit/run simultaneously in a single node hadoop envirnment.Is there any limit?
From a configuration standpoint, there's no limit I'm aware of. You can set the number of map and reduce slots to whatever you want. Practically, though, each slot has to spin up a JVM capable of running some hadoop code, which requires some amount of memory, so eventually you would run out of memory on your machine. You might also have to configure job queues cleverly in order to run a ton at the same time.
Now, what is possible is a very different question than what is a good idea...
You can submit as many jobs you want, they will be queued up and scheduler will run them based on FIFO(by default) and available resources.The number of jobs being executed by hadoop will depend as described by John above.
The number of Reducer slots is set when the cluster is configured. This will limit the number of MapReduce jobs based on the number of Reducers each job requests. Mappers are generally more limited by number of DataNodes and # of processors per node.

when is it a good idea to increase/decrease the number of nodes interactively on a hadoop mapreduce job?

I have an intuition that increasing/decreasing
number of nodes interactively on running job can speed up map-heavy
jobs, but won't help wth reduce heavy jobs, where most of work is done
by reduce.
There's an faq about this but it doesn't really explain very well
http://aws.amazon.com/elasticmapreduce/faqs/#cluster-18
This question was answered by Christopher Smith, who gave me permission to post here.
As always... "it depends". One thing you can pretty much always count
on: adding nodes later on is not going to help you as much as having
the nodes from the get go.
When you create a Hadoop job, it gets split up in to tasks. These
tasks are effectively "atoms of work". Hadoop lets you tweak the # of
mapper and # of reducer tasks during job creation, but once the job is
created, it is static. Tasks are assigned to "slots". Traditionally,
each node is configured to have a certain number of slots for map
tasks, and a certain number of slots for reduce tasks, but you can
tweak that. Some newer versions of Hadoop don't require you to
designate the slots as being for map or reduce tasks. Anyway, the
JobTracker periodically assigns tasks to slots. Because this is done
dynamically, new nodes coming online can speed up the processing of a
job by providing more slots to execute the tasks.
This sets the stage for understanding the reality of adding new nodes.
There's obviously an Amdahl's law issue where having more slots than
pending tasks accomplishes little (if you have speculative execution
enabled, it does help somewhat, as Hadoop will schedule the same task
to run on many different nodes, so that a slow node's tasks can be
completed by faster nodes if there are spare resources). So, if you
didn't define your job with many map or reduce tasks, adding more
nodes isn't going to help much. Of course, each task imposes some
overhead, so you don't want to go crazy high either. That's why I
suggest a guideline for task size should be "something which takes
~2-5 minutes to execute".
Of course, when you add nodes dynamically, they have one other
disadvantage: they don't have any data local. Obviously, if you are at
the start of a EMR pipeline, none of the nodes have data in them, so
doesn't matter, but if you have an EMR pipeline made of many jobs,
with earlier jobs persisting their results to HDFS, you get a huge
performance boost because the JobTracker will favour shaping and
assigning tasks so nodes have that lovely locality of data (this is a
core trick of the whole MapReduce design to maximize performance). On
the reducer side, data is coming from other map tasks, so dynamically
added nodes are really at no disadvantage as compared to other nodes.
So, in principle, dynamically adding new nodes is actually less likely
to help with IO bound map tasks that are reading from HDFS.
Except...
Hadoop has a variety of cheats under the covers to optimize
performance. Once is that it starts transmitting map output data to
the reducers before the map task completes/the reducer starts. This
obviously is a critical optimization for jobs where the mappers
generate a lot of data. You can tweak when Hadoop starts to kick off
the transfers. Anyway, this means that a newly spun up node might be
at a disadvantage, because the existing nodes might already have such
a huge data advantage. Obviously, the more output that the mappers
have transmitted, the larger the disadvantage.
That's how it all really works. In practice though, a lot of Hadoop
jobs have mappers processing tons of data in a CPU intensive fashion,
but outputting comparatively little data to the reducers (or they
might send a lot of data to the reducers, but the reducers are still
very simple, so not CPU bound at all). Often jobs will have few
(sometimes even 0) reducer tasks, so even extra nodes could help, if
you already have a reduce slot available for every outstanding reduce
task, new nodes can't help. New nodes also disproportionately help out
with CPU bound work, for obvious reasons, so because that tends to
be map tasks more than reduce tasks, that's where people typically see
the win. If your mappers are I/O bound and pulling data from the
network, adding new nodes obviously increases the aggregate bandwidth
of the cluster, so it helps there, but if your map tasks are I/O bound
reading HDFS, the best thing is to have more initial nodes, with data
already spread over HDFS. It's not unusual to see reducers get I/O
bound because of poorly structured jobs, in which case adding more
nodes can help a lot, because it splits up the bandwidth again.
There's a caveat there too of course: with a really small cluster,
reducers get to read a lot of their data from the mappers running on
the local node, and adding more nodes shifts more of the data to being
pulled over the much slower network. You can also have cases where
reducers spend most of their time just multiplexing data processing
from all the mappers sending them data (although that is tunable as
well).
If you are asking questions like this, I'd highly recommend profiling
your job using something like Amazon's offering of KarmaSphere. It
will give you a better picture of where your bottlenecks are and what
are your best strategies for improving performance.

Idea's for balancing out a HDFS -> HBase map reduce job

For a client, I've been scoping out the short-term feasibility of running a Cloudera flavor hadoop cluster on AWS EC2. For the most part the results have been expected with the performance of the logical volumes being mostly unreliable, that said doing what I can I've got the cluster to run reasonably well for the circumstances.
Last night I ran a full test of their importer script to pull data from a specified HDFS path and push it into Hbase. Their data is somewhat unusual in that the records are less then 1KB's a piece and have been condensed together into 9MB gzipped blocks. All total there are about 500K text records that get extracted from the gzips, sanity checked, then pushed onto the reducer phase.
The job runs within expectations of the environment ( the amount of spilled records is expected by me ) but one really odd problem is that when the job runs, it runs with 8 reducers yet 2 reducers do 99% of the work while the remaining 6 do a fraction of the work.
My so far untested hypothesis is that I'm missing a crucial shuffle or blocksize setting in the job configuration which causes most of the data to be pushed into blocks that can only be consumed by 2 reducers. Unfortunately the last time I worked on Hadoop, another client's data set was in 256GB lzo files on a physically hosted cluster.
To clarify, my question; is there a way to tweak a M/R Job to actually utilize more available reducers either by lowering the output size of the maps or causing each reducer to cut down the amount of data it will parse. Even a improvement of 4 reducers over the current 2 would be a major improvement.
It seems like you are getting hotspots in your reducers. This is likely because a particular key is very popular. What are the keys as the output of the mapper?
You have a couple of options here:
Try more reducers. Sometimes, you get weird artifacts in the randomness of the hashes, so having a prime number of reducers sometimes helps. This will likely not fix it.
Write a custom partitioner that spreads out the work better.
Figure out why a bunch of your data is getting binned into two keys. Is there a way to make your keys more unique to split up the work?
Is there anything you can do with a combiner to reduce the amount of traffic going to the reducers?

Resources