Terasort scheduler in Hadoop - sorting

In Hadoop's implementation of TeraSort, there's a scheduler called TeraScheduler. Having read through the code, I gather the scheduler basically does the following:
1. Pick the host with the smallest number of splits.
2. For this host, pick a fixed number of splits with the smallest number of hosts and "pin" them to be executed on this host. The "unchosen" splits are removed from this host.
3. Repeat for all hosts.
I don't understand the rationale behind this scheduling strategy. How does it perform better than the default scheduler (what is the default scheduler, anyway)? Is there any paper explaining its benefits?

The benefits are twofold:
(1) to make the sort as local as possible;
(2) to distribute the work evenly across machines.
Both aim to improve performance.
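To make that concrete, here is a rough Java sketch of the greedy pass described in the question. The class and method names are mine, not the actual TeraScheduler code, and the data structures are simplified:

    import java.util.*;

    // Illustrative greedy pass: repeatedly take the host with the fewest
    // candidate splits, pin to it the splits that are hardest to place
    // elsewhere (fewest replica hosts), and drop the rest from that host.
    class GreedyPinSketch {
        static class Split {
            List<String> hosts = new ArrayList<>(); // hosts holding a replica
        }

        static Map<String, List<Split>> pin(Map<String, List<Split>> splitsByHost, int perHost) {
            Map<String, List<Split>> assignment = new HashMap<>();
            Set<String> remaining = new HashSet<>(splitsByHost.keySet());
            while (!remaining.isEmpty()) {
                // 1. The host with the fewest candidate splits goes first...
                String host = Collections.min(remaining,
                        Comparator.comparingInt(h -> splitsByHost.get(h).size()));
                List<Split> candidates = new ArrayList<>(splitsByHost.get(host));
                // 2. ...and keeps the splits with the fewest alternative hosts.
                candidates.sort(Comparator.comparingInt(s -> s.hosts.size()));
                List<Split> chosen = candidates.subList(0, Math.min(perHost, candidates.size()));
                assignment.put(host, new ArrayList<>(chosen));
                // 3. A pinned split is no longer a candidate anywhere else;
                //    the unchosen splits simply stop being tied to this host.
                for (String other : remaining) splitsByHost.get(other).removeAll(chosen);
                remaining.remove(host);
            }
            return assignment;
        }
    }

Handling the most constrained host first, and keeping the least replicated splits on it, pushes the result toward both goals at once: almost every split lands on a host that actually has a local replica, and no host ends up holding far more splits than the others.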

Related

Forcing SGE to use multiple servers

TL;DR: Is there any way to get SGE to round-robin between servers when scheduling jobs, instead of allocating all jobs to the same server whenever it can?
Details:
I have a large compute process that consists of many smaller jobs. I'm using SGE to distribute the work across multiple servers in a cluster.
The process requires a varying number of tasks at different points in time (technically, it is a DAG of jobs). Sometimes the number of parallel jobs is very large (~1 per CPU in the cluster), sometimes it is much smaller (~1 per server). The DAG is dynamic and not uniform, so it isn't easy to tell how many parallel jobs there are/will be at any given point.
The jobs use a lot of CPU but also do a non-trivial amount of I/O (especially at job startup and shutdown). They access a shared NFS server connected to all the compute servers. Each compute server has a narrower connection (10Gb/s), but the NFS server has several wide connections (40Gb/s) into the communication switch. Not sure what the bandwidth of the switch backbone is, but it is a monster, so it should be high.
For optimal performance, jobs should be scheduled across different servers when possible. That is, if I have 20 servers, each with 20 processors, submitting 20 jobs should run one job on each. Submitting 40 jobs should run 2 on each, etc. Submitting 400 jobs would saturate the whole cluster.
However, SGE is perversely intent on minimizing my I/O performance. Submitting 20 jobs would schedule all of them on a single server. So they all fight over a single measly 10Gb/s network connection while 19 other machines with a combined 190Gb/s of bandwidth sit idle.
I can force SGE to execute each job on a different server in several ways (using resources, using special queues, using my parallel environment and specifying '-t 1-', etc.). However, this means I will only be able to run one job per server, period. When the DAG opens up and spawns many jobs, the jobs will stall waiting for a completely free server while 19 out of the 20 processors of each machine will stay idle.
What I need is a way to tell SGE to assign each job to the next server that has an available slot, in round-robin order. Even better would be to assign each job to the least loaded server (maximal number of unused slots, or maximal fraction of unused slots, or minimal number of used slots, etc.). But a dead simple round-robin would do the trick.
This seems like a much more sensible strategy in general, compared to SGE's policy of running each job on the same server as the previous job, which is just about the worst possible strategy for my case.
I looked over SGE's configuration options but I couldn't find any way to modify the scheduling strategy. That said, SGE's documentation isn't exactly easy to navigate, so I could have easily missed something.
Does anyone know of any way to get SGE to change its scheduling strategy to round-robin or least-loaded or anything along these lines?
Thanks!
Simply change allocation_rule to $round_robin for the SGE parallel environment (sge_pe file):
allocation_rule
    The allocation rule is interpreted by the scheduler thread and helps the scheduler to decide how to distribute parallel processes among the available machines. If, for instance, a parallel environment is built for shared memory applications only, all parallel processes have to be assigned to a single machine, no matter how many suitable machines are available. If, however, the parallel environment follows the distributed memory paradigm, an even distribution of processes among machines may be favorable.
    The current version of the scheduler only understands the following allocation rules:
    <int>: An integer number fixing the number of processes per host. If the number is 1, all processes have to reside on different hosts. If the special denominator $pe_slots is used, the full range of processes as specified with the qsub(1) -pe switch has to be allocated on a single host (no matter which value belonging to the range is finally chosen for the job to be allocated).
    $fill_up: Starting from the best suitable host/queue, all available slots are allocated. Further hosts and queues are "filled up" as long as a job still requires slots for parallel tasks.
    $round_robin: From all suitable hosts a single slot is allocated until all tasks requested by the parallel job are dispatched. If more tasks are requested than suitable hosts are found, allocation starts again from the first host. The allocation scheme walks through suitable hosts in a best-suitable-first order.
Source: http://gridscheduler.sourceforge.net/htmlman/htmlman5/sge_pe.html
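Concretely, a PE definition using this rule might look like the sketch below (sge_pe(5) format; the pe_name, slot count, and proc args are placeholders, and the PE still has to be added to a queue's pe_list). You would then submit each job through it, e.g. qsub -pe round 1 job.sh, so that the $round_robin allocation applies:

    # created with: qconf -ap round   (then attach "round" to the queue's pe_list)
    pe_name            round
    slots              400
    user_lists         NONE
    xuser_lists        NONE
    start_proc_args    /bin/true
    stop_proc_args     /bin/true
    allocation_rule    $round_robin
    control_slaves     FALSE
    job_is_first_task  TRUE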

How many reducers can simultaneously run?

I'm learning Big Data at uni and I'm kind of confused about the topic of MapReduce. I was wondering how many reducers can run simultaneously. For example, let's say we had 864 reducers; how many could run simultaneously?
All of them can run simultaneously, depending on the state of the cluster (its health, i.e. no rogue/bad nodes), its capacity, and how free it is. If other MR jobs are running on the same cluster, then only a few of your 864 reducers will go into the running state, and once capacity frees up, another set of reducers will start running.
There is also a case that happens sometimes where your reducers/mappers keep preempting each other and take up all the memory. The job fails in the majority of these cases. To avoid this, we generally set a smaller number of reducers.
The one-line answer is: all of them can run simultaneously, as each reducer performs an independent unit of work in the MapReduce framework.
How many would actually run in parallel, or more precisely when each of them would be scheduled to run, depends on many factors, including but not limited to resource availability, the scheduling mechanism, cluster configuration, etc.
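Note also that 864 is just what the job requests, not a concurrency guarantee. A minimal sketch of where that number comes from (standard MapReduce API; the class name is mine):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ReducerCountSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "example");
            // Requests 864 reduce tasks; how many run at once is decided by
            // the cluster's scheduler and free capacity, not by this setting.
            job.setNumReduceTasks(864);
            // ... set mapper/reducer classes and input/output paths, then submit
        }
    }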

How does Hadoop/MapReduce scale when input data is NOT stored?

The intended use for Hadoop appears to be for when the input data is distributed (HDFS) and already stored local to the nodes at the time of the mapping process.
Suppose we have data which does not need to be stored; the data can be generated at runtime. For example, the input to the mapping process is to be every possible IP address. Is Hadoop capable of efficiently distributing the Mapper work across nodes? Would you need to explicitly define how to split the input data (i.e. the IP address space) to different nodes, or does Hadoop handle that automatically?
Let me first clarify a comment you made. Hadoop is designed to support potentially massively parallel computation across a potentially large number of nodes, regardless of where the data comes from or goes. The Hadoop design favors scalability over performance when it has to. It is true that being clever about where the data starts out, and how that data is distributed, can make a significant difference in how well/quickly a Hadoop job can run.
To your question and example: if you will generate the input data, you have the choice of generating it before the first job runs, or generating it within the first mapper. If you generate it within the mapper, then you can figure out which node the mapper is running on and generate just the data that would be reduced in that partition (use a Partitioner to direct data between mappers and reducers).
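For reference, a Partitioner only controls which reducer receives each map output key. A minimal sketch, assuming hypothetical Text keys holding dotted IP addresses:

    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Route each IP key to a reducer by its first octet, so each reducer
    // handles a contiguous slice of the address space.
    public class FirstOctetPartitioner extends Partitioner<Text, NullWritable> {
        @Override
        public int getPartition(Text key, NullWritable value, int numReducers) {
            int firstOctet = Integer.parseInt(key.toString().split("\\.")[0]);
            return (firstOctet * numReducers) / 256; // maps 0..255 onto 0..numReducers-1
        }
    }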
This is going to be a problem you'll have with any distributed platform. Storm, for example, lets you have some say in which bolt instance will process each tuple. The terminology might be different, but you'll be implementing roughly the same shuffle algorithm in Storm as you would in Hadoop.
You are probably trying to run a non-MapReduce task on a MapReduce cluster, then (e.g. IP scanning?). There may be more appropriate tools for this, you know...
One thing few people realize is that MapReduce is about checkpointing. It was developed for huge clusters, where you can expect machines to fail during the computation. By having checkpointing and recovery built into the architecture, the consequences of failures and slow hosts are reduced.
And that is why everything goes from disk to disk in MapReduce: it's checkpointed before, and it's checkpointed after. If a task fails, only that part of the job is re-run.
You can easily outperform MapReduce by leaving out the checkpointing. If you have 10 nodes, you will win easily. If you have 100 nodes, you will usually win. If you have a major computation and 1000 nodes, chances are that one node fails and you will wish you had been doing similar checkpointing...
Now, your task doesn't sound like a MapReduce job, because the input data is virtual. It sounds much more as if you should be running some other distributed computing tool, and maybe just writing your initial results to HDFS for later processing via MapReduce.
But of course there are ways to hack around this. For example, you could use /16 subnets as input: each mapper reads a /16 subnet and does its job on that. It's not that much fake input to generate once you realize that you don't need to generate all 2^32 IPs, unless you have that many nodes in your cluster...
The number of mappers depends on the number of splits generated by the InputFormat implementation.
There is NLineInputFormat, which you could configure to generate as many splits as there are lines in the input file. You could create a file where each line is an IP range. I have not used it personally, and there are many reports that it does not work as expected.
If you really need it, you could create your own implementation of the InputFormat which generates the InputSplits for your virtual data and force as many mappers as you need.
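A rough sketch of that idea, building on the /16 suggestion above. All class names are illustrative, not a real library API, and a real job would probably want coarser ranges than 65,536 splits:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.InputFormat;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    // Virtual input: one split per /16 subnet, no files anywhere.
    public class SubnetInputFormat extends InputFormat<Text, NullWritable> {

        // A split identified only by the upper 16 bits of its IP range.
        public static class SubnetSplit extends InputSplit implements Writable {
            int prefix; // the a.b of a.b.0.0/16
            public SubnetSplit() {}                  // required for deserialization
            SubnetSplit(int prefix) { this.prefix = prefix; }
            @Override public long getLength() { return 1L << 16; }
            @Override public String[] getLocations() { return new String[0]; } // no locality
            @Override public void write(DataOutput out) throws IOException { out.writeInt(prefix); }
            @Override public void readFields(DataInput in) throws IOException { prefix = in.readInt(); }
        }

        @Override
        public List<InputSplit> getSplits(JobContext context) {
            List<InputSplit> splits = new ArrayList<>();
            for (int p = 0; p < (1 << 16); p++) {
                splits.add(new SubnetSplit(p)); // one mapper per /16 subnet
            }
            return splits;
        }

        @Override
        public RecordReader<Text, NullWritable> createRecordReader(InputSplit split,
                                                                   TaskAttemptContext context) {
            return new RecordReader<Text, NullWritable>() {
                private int prefix, next;
                private final Text key = new Text();
                @Override public void initialize(InputSplit s, TaskAttemptContext c) {
                    prefix = ((SubnetSplit) s).prefix;
                }
                @Override public boolean nextKeyValue() {
                    if (next >= (1 << 16)) return false;        // 65,536 IPs per split
                    long ip = ((long) prefix << 16) | next++;
                    key.set(((ip >>> 24) & 0xff) + "." + ((ip >>> 16) & 0xff) + "."
                            + ((ip >>> 8) & 0xff) + "." + (ip & 0xff));
                    return true;
                }
                @Override public Text getCurrentKey() { return key; }
                @Override public NullWritable getCurrentValue() { return NullWritable.get(); }
                @Override public float getProgress() { return next / (float) (1 << 16); }
                @Override public void close() {}
            };
        }
    }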

Hadoop Local Mode: number of mappers and reducers

I need to prototype some Hadoop MR code in Hadoop local mode on my Mac, and I would like to hear about any gotchas there might be.
One particular question is about the number of mappers and reducers. Will it basically be one of each? Would specifying more than 1 work at all? I am going to use a smaller sample.
You cannot specify the number of mappers and reducers in local mode; it is always single-threaded. At the same time, if you want to profile your mapper or reducer performance, it will be quite realistic.
The nearest mode that can have many mappers and reducers is pseudo-distributed mode, where all daemons run on a single machine.
Neither of the above will take into account possible problems with data locality or shuffle performance. I also do not expect your dev machine to have the same disk subsystem as production...
In a nutshell: if you see low single-mapper/reducer performance in local mode, you can start fixing it there. If it works well, try it on real hardware before planning your cluster.
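For reference, local mode is just configuration; a minimal sketch (property names per Hadoop 2.x; the class name is mine):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class LocalModeSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // LocalJobRunner: the whole job runs inside this single JVM
            conf.set("mapreduce.framework.name", "local");
            // Read and write the local filesystem instead of HDFS
            conf.set("fs.defaultFS", "file:///");
            Job job = Job.getInstance(conf, "local-prototype");
            // ... set mapper/reducer classes and input/output paths as usual
        }
    }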

Hadoop / AWS elastic map reduce performance

I am looking for a ballpark, if anyone has experience with this...
Does anyone have benchmarks on the speed of AWS's map reduce?
Let's say I have 100 million records and I am using Hadoop streaming (a PHP script) to map, group, and reduce (with some simple PHP calculations). The average group will contain 1-6 records.
Also, is it better/more cost-effective to run a bunch of small instances or larger ones? I realize the work is broken up into nodes within an instance, but regardless, will larger nodes have higher I/O, and therefore be faster per node per server (and more cost-efficient)?
Also, with streaming, how is the ratio of mappers to reducers determined?
I don't know if you can give a meaningful benchmark -- it's kind of like asking how fast a computer program generally runs. It's not possible to say how fast your program will run without knowing anything about the script.
If you mean, how fast are the instances that power an EMR job: they're the same spec as the underlying instances that you specify from AWS.
If you want a very rough take on the how EMR performs differently: I'd say you will probably run into I/O bottleneck before CPU bottleneck.
In theory this means you should run many small instances and ask for rack diversity, in order to maybe grab more I/O resources from across more machines rather than letting them compete. In practice I've found that fewer, higher-I/O instances can be more effective. But even this impression doesn't always hold; it really depends on how busy the zone is and where your jobs are scheduled.
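On the streaming question: there is no fixed ratio. The number of map tasks follows from the input splits, while the reducer count is a job parameter you can set yourself (EMR picks a default based on the cluster). An example invocation for illustration; the jar path, HDFS paths, and script names are placeholders:

    hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
        -D mapreduce.job.reduces=16 \
        -input /data/records \
        -output /data/grouped \
        -mapper map.php \
        -reducer reduce.php \
        -file map.php \
        -file reduce.php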
