Hadoop: why are map tasks not local to the file's blocks?

I ran a Hadoop job, and when I look at some map tasks I see they are not running where the file's blocks are. E.g., the map task runs on slave1, but the file's blocks (all of them) are on slave2. The files are all gzipped.
Why is that happening, and how can I resolve it?
UPDATE: note that there are many pending tasks, so this is not a case of a node being idle and therefore hosting tasks that read from other nodes.

Hadoop's default (FIFO) scheduler works like this: when a node has spare capacity, it contacts the master and asks for more work. The master tries to assign a data-local task, or a rack-local task, but if it can't, it assigns any task in the queue (of waiting tasks) to that node. However, while this node was being assigned the non-local task (call it task X), it is possible that another node also had spare capacity and contacted the master asking for work. Even if that other node actually had a local copy of the data required by X, it will not be assigned the task, because the first node acquired the master's lock slightly faster. The result is poor data locality but fast task assignment.
In contrast, the Fair Scheduler uses a technique called delay scheduling to achieve higher locality: it delays non-local task assignment for a "little bit" (configurable), at the small cost of delaying some tasks.
Other people are working on better schedulers, and this will likely improve in the future. For now, you can choose to use the Fair Scheduler if you wish to achieve higher data locality.
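For reference, switching to the Fair Scheduler and giving it a locality delay is just configuration. Here is a minimal sketch using the classic (MR1) property names; the exact names vary by Hadoop version, so treat them as assumptions to verify against your release (they normally live in mapred-site.xml rather than in code):

import org.apache.hadoop.conf.Configuration;

public class FairSchedulerSetup {
    public static Configuration configure() {
        Configuration conf = new Configuration();
        // Replace the default FIFO scheduler with the Fair Scheduler (MR1 JobTracker).
        conf.set("mapred.jobtracker.taskScheduler",
                 "org.apache.hadoop.mapred.FairScheduler");
        // Delay scheduling: wait up to ~5 seconds for a node-local slot before
        // accepting a non-local assignment (assumed property name; check your version).
        conf.setLong("mapred.fairscheduler.locality.delay", 5000L);
        return conf;
    }
}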
I disagree with Donald Miner's conclusion that "With a default replication factor of 3, you don't see very many tasks that are not data local." He is correct in noting that more replicas will improve your locality percentage, but the percentage of data-local tasks may still be very low. I've also run experiments myself and saw very low data locality with the FIFO scheduler. You can achieve high locality if your job is large (has many tasks), but more common, smaller jobs suffer from a problem called "head-of-line scheduling". Quoting from this paper:
The first locality problem occurs in small jobs (jobs that
have small input files and hence have a small number of data
blocks to read). The problem is that whenever a job reaches
the head of the sorted list [...] (i.e. has the fewest
running tasks), one of its tasks is launched on the next slot
that becomes free, no matter which node this slot is on. If
the head-of-line job is small, it is unlikely to have data on
the node that is given to it. For example, a job with data on
10% of nodes will only achieve 10% locality.
That paper goes on to cite numbers from a production cluster at Facebook, where they reported observing just 5% data locality in a large production environment.
Final note: should you care if you have low data locality? Not too much. The running time of your jobs may be dominated by the stragglers (tasks that take longer to complete) and the shuffle phase, so improving data locality would only yield a very modest improvement in running time (if any at all).

Unfortunately, the default scheduler isn't that smart. I'm not sure exactly what's going on, but I think it's using some sort of greedy-style scheduling where it tries to schedule whatever it can right now for the next task, and then moves on. There could definitely be improvements made to the Hadoop scheduler, and there have been a few academic attempts at making Hadoop scheduling more optimal.
This research paper shows that the default Hadoop scheduler is not optimal. In the results, they show that increasing the replication factor to three improves data locality significantly, with diminishing returns after that.
So, why hasn't the default scheduler been improved? Here is my opinion/theory: with a default replication factor of 3, you don't see very many tasks that are not data local. By having more replicas, you give the scheduler more flexibility to fit tasks in the right spots. Basically, it's a coincidence that you have 3 replicas, and the default scheduler takes advantage of that by being implemented in a lazy manner. Since you typically have 3 replicas for redundancy's sake already... there isn't much motivation to help scheduler performance for people with a replication factor of 1.
If you have the space, I suggest just upping the replication factor to two or three. There really isn't much downside.
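If you only want to raise replication for an existing input directory (rather than cluster-wide), a small sketch along these lines works; the path and factor are made up for illustration, and hadoop fs -setrep -w 3 <path> does the same thing from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RaiseReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Hypothetical input directory; substitute your job's input path.
        Path input = new Path("/data/myjob/input");
        for (FileStatus status : fs.listStatus(input)) {
            if (!status.isDir()) {
                // Ask HDFS to keep 3 copies of every block of this file.
                fs.setReplication(status.getPath(), (short) 3);
            }
        }
        fs.close();
    }
}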

Related

What is the principle of "code moving to data" rather than data to code?

In a recent discussion about distributed processing and streaming I came across the concept of 'code moving to data'. Can someone please help explain it? A reference for this phrase is MapReduceWay.
In terms of Hadoop, it's stated in a question, but I still could not figure out an explanation of the principle in a tech-agnostic way.
The basic idea is easy: if code and data are on different machines, one of them must be moved to the other machine before the code can be executed on the data. If the code is smaller than the data, better to send the code to the machine holding the data than the other way around, if all the machines are equally fast and code-compatible. [Arguably you can send the source and JIT compile as needed].
In the world of Big Data, the code is almost always smaller than the data.
On many supercomputers, the data is partitioned across many nodes, and all the code for the entire application is replicated on all nodes, precisely because the entire application is small compared to even the locally stored data. Then any node can run the part of the program that applies to the data it holds. No need to send the code on demand.
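A toy sketch of the underlying cost comparison; the sizes below are invented purely to illustrate the argument:

public class ShipCodeOrData {
    // Decide which side to move, assuming the dominant cost is the number of
    // bytes that must cross the network.
    static String cheaperToMove(long codeBytes, long dataBytes) {
        return codeBytes <= dataBytes ? "ship the code to the data"
                                      : "ship the data to the code";
    }

    public static void main(String[] args) {
        long jobJar = 20L * 1024 * 1024;        // ~20 MB of application code
        long input = 1000L * 128 * 1024 * 1024; // ~1000 blocks of 128 MB input
        System.out.println(cheaperToMove(jobJar, input));
        // Prints "ship the code to the data": the jar is tiny next to the input,
        // which is why MapReduce copies the job jar to the nodes holding the blocks.
    }
}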
I also just came across the sentence “Moving Computation is Cheaper than Moving Data” (from the Apache Hadoop documentation) and after some reading I think this refers to the principle of data locality.
Data locality is a strategy for task scheduling aimed at optimizing performance, based on the observation that moving data across a network is costly. So, when choosing which task to prioritize whenever a computing/data node is free, preference is given to the task that is going to operate on data stored on that free node or in its proximity.
This (from Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling, Zaharia et al., 2010) explains it clearly:
Hadoop’s default scheduler runs jobs in FIFO order, with five priority levels. When the scheduler receives a heartbeat indicating that a map
or reduce slot is free, it scans through jobs in order of priority and submit time to find one with a task of the required type. For maps,
Hadoop uses a locality optimization as in Google’s MapReduce [18]: after selecting a job, the scheduler greedily picks the map task in
the job with data closest to the slave (on the same node if possible, otherwise on the same rack, or finally on a remote rack).
Note that the fact that Hadoop replicates data across nodes increases the chances of data-local scheduling (the higher the replication factor, the higher the probability that a task has its data on the next free node and hence gets picked to run next).
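The greedy pick described in that quote boils down to a simple preference order. The sketch below is not Hadoop's actual scheduler code, just a stripped-down illustration with made-up host and rack names:

import java.util.Arrays;
import java.util.List;

public class LocalityPreference {
    // Minimal stand-in for a pending map task: the hosts holding its input block
    // and the rack those hosts sit on.
    static class MapTask {
        final List<String> hostsWithData;
        final String rackOfData;
        MapTask(List<String> hosts, String rack) { hostsWithData = hosts; rackOfData = rack; }
    }

    // Pick a task for the slave that just heartbeated: node-local first,
    // then rack-local, then any remaining (off-rack) task.
    static MapTask pick(List<MapTask> pending, String slaveHost, String slaveRack) {
        for (MapTask t : pending) if (t.hostsWithData.contains(slaveHost)) return t;
        for (MapTask t : pending) if (t.rackOfData.equals(slaveRack)) return t;
        return pending.isEmpty() ? null : pending.get(0);
    }

    public static void main(String[] args) {
        List<MapTask> pending = Arrays.asList(
                new MapTask(Arrays.asList("slave2"), "rack1"),
                new MapTask(Arrays.asList("slave3"), "rack2"));
        // slave1 holds no block, but shares rack1 with the first task's data,
        // so the scheduler settles for a rack-local assignment.
        System.out.println(pick(pending, "slave1", "rack1").rackOfData); // prints rack1
    }
}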

Map Job Performance on cluster

Suppose I have 15 blocks of data and two clusters. The first cluster has 5 nodes and a replication factor of 1, while the second one has a replication factor of 3. If I run my map job, should I expect any change in the performance or the execution time of the map job?
In other words, how does replication affect the performance of the mapper on a cluster?
When the JobTracker assigns a job to a TaskTracker on HDFS, tasks are assigned to particular nodes based upon the locality of the data (preference is the same node first, then the same network switch/frame). With a lower replication factor, you limit the JobTracker's ability to assign a node local to the data (the JobTracker will still assign the task to a node, but without the benefits of locality). The effect is to restrict the number of TaskTracker nodes which are both local to the data (either data on the task node, or data on the same switch/frame) and free, thus affecting performance of your tasks (reducing parallelization).
Your smaller cluster likely has a single switch, so data is local to the network/frame, and the only bottleneck you might experience would be data transfer from one TaskTracker to another, as the JobTracker is likely to assign jobs to all available TaskTrackers.
But with a larger Hadoop cluster, a replication factor of 1 would limit the number of TaskTracker nodes local to the data and thus able to efficiently operate on your data.
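One way to see this effect on your own cluster is to print where the replicas of your input actually live; a rough sketch (the path is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockHosts {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Hypothetical input file; substitute one of your job's inputs.
        FileStatus status = fs.getFileStatus(new Path("/data/myjob/part-00000"));
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            // With replication 1 each block lists a single host; with 3 it lists three,
            // i.e. three TaskTrackers that could run a node-local map task for it.
            System.out.println(String.join(",", block.getHosts()));
        }
        fs.close();
    }
}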
There are several papers which support data locality, http://web.eecs.umich.edu/~michjc/papers/tandon_hpdic_minimizeRemoteAccess.pdf, this paper which you cited also supports data locality, http://assured-cloud-computing.illinois.edu/sites/default/files/PID1974767.pdf, and this one, http://www.eng.auburn.edu/~xqin/pubs/hcw10.pdf (which tested a 5 node cluster, same as the OP).
This paper quotes significant benefits to data locality, http://grids.ucs.indiana.edu/ptliupages/publications/InvestigationDataLocalityInMapReduce_CCGrid12_Submitted.pdf, and observes that an increase in replication factor gives better locality.
Note that this paper claims there is little difference (8%) between network throughput and local disk access, http://www.cs.berkeley.edu/~ganesha/disk-irrelevant_hotos2011.pdf, but reports orders of magnitude difference in performance between local memory access and either disk or network access. Furthermore, the paper reports that a large fraction of jobs (64%) find their data cached in memory "in large part due to the heavy-tailed nature of the workload", as most jobs "access only a small fraction of the blocks".
EDIT: This part of my answer is obsolete now that the other answer was edited: "The other answer is not entirely correct." This was meant to address the incorrect implication that fewer replicas = less parallelism. The rest of my answer (below) still applies.
Any node can execute your tasks, regardless of whether the data is located on that node or not. Hadoop will try to achieve data locality (the preference order is: node-local, then rack-local, then any node), but if it can't, it will choose any node that has available compute capacity to run your task.
Performance-wise, in a typical multi-rack installation, rack-local performs almost as well as node-local, since the bottleneck occurs when transmitting data across racks. However, with high-end networking equipment (i.e., full-bisection bandwidth), it wouldn't matter whether your computation is rack-local or not. For more details on this, read this paper.
How much performance improvement can you expect from having more replicas (and thus achieving higher data locality)? Not much; a 5-20% improvement at the very most. And that is an upper limit, reached when you implement additional popularity-based replication as in this and this project. NOTE: I did not just make up those performance improvement numbers; they come from the papers I linked.
Since vanilla Hadoop does not have these mechanisms in place, I would expect your performance to improve by at most 1-5%. This is just a ballpark guess, but you can easily run some tests yourself. The reason is that more replicas could improve the performance of some of your map tasks (the ones that are now able to run with a data-local copy of the block), but they would not improve your shuffle and reduce phases. Furthermore, even if just one mapper takes longer than the rest, that one will determine the length of your whole map phase; so for many jobs, increasing locality will likely not improve their running times at all. Finally, I/O-bound jobs can be map-input IO bound, shuffle IO bound (map-output heavy), or reduce-output IO bound. Only the first type (map-input IO bound) would benefit from locality. More details on MapReduce workload characterization are in this paper.
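To make the reasoning concrete, here is a back-of-the-envelope sketch; the phase breakdown and the locality gain are invented numbers, purely to show how small the end-to-end effect tends to be:

public class LocalityGainEstimate {
    public static void main(String[] args) {
        // Hypothetical breakdown of a job's running time (fractions sum to 1.0).
        double mapInputRead = 0.30;   // time spent reading map input
        double everythingElse = 0.70; // map CPU, shuffle, reduce, stragglers, overhead

        // Assume better locality speeds up ONLY the map-input reads, say by 20%.
        double readSpeedup = 0.20;

        double newRuntime = mapInputRead * (1 - readSpeedup) + everythingElse;
        System.out.printf("End-to-end improvement: %.0f%%%n", (1 - newRuntime) * 100);
        // Prints 6%: even a sizable locality win only moves a small slice of the job.
    }
}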
If you are further interested in this, you can also read this paper, in which the authors improve the running times of mappers by having the input data of ALL the mappers in memory.

For a large mapreduce job, with a few lingering reducers, can this job be safely downsized?

Chris Smith answered this question and said I could post it.
If you have a 200-node mapreduce job, with just 3 running reduce jobs
left lingering, is it safe to switch off all nodes except the master
and the 3 with the running jobs?
Plus maybe a handful more in case of a bad node needing replacement?
If the answer to this question is "yes", it's strange that EMR doesn't
automatically turn off most of the nodes when they're not in
use.
Lately there's been several jobs that mostly finished, but with a few
reduces lingering. I think this is costing us since the not-in-use
nodes stay up.
There are these issues I can think of:
-- when does data get copied to S3? If a node is not in use in terms
of running reduce, could it still be needed for copying to S3? In that
case, the answer to my question is that you're basically never safe to switch
off nodes
-- what happens if one of 3 jobs fails? Master/job coordinator should
reassign it to another node. I guess you're safe as long as it can
keep track of what boxes are up, and not wrongly assign to a box that
has been shut off.
If you have a 200-node mapreduce job, with just 3 running reduce jobs
left lingering, is it safe to switch off all nodes except the master
and the 3 with the running jobs?
Plus maybe a handful more in case of a bad node needing replacement?
If the answer to this question is "yes", it's strange that EMR doesn't
automatically turn off most of the nodes when they're not in
use.
Keep in mind, EMR is a very thin layer over Hadoop. If you were doing distributed computation on Amazon's fabric, you could be a TON more efficient with something customized for its specific needs which would not really resemble Hadoop or Map/Reduce at all. If you're doing a lot of heavy work with Hadoop, you are often better off with your own cluster or at least with a dedicated cluster in the cloud (that way data is already sliced up on local disk and output need only be persisted to local disk). EMR's main virtue is that it is quick and dirty and hooks in nicely to other parts of AWS (like S3).
Lately there's been several jobs that mostly finished, but with a few
reduces lingering. I think this is costing us since the not-in-use
nodes stay up.
It most definitely is costing you, particularly in terms of runtime. I'd start by being concerned about why the completion times are so non-uniform.
There are these issues I can think of:
-- when does data get copied to S3? If a node is not in use in terms
of running reduce, could it still be needed for copying to S3? In that
case, the answer to my question is that you're basically never safe to switch
off nodes
If you are referring to the output of a job, if you have S3 as your output path for your job configuration, then data from a given task will be written out to S3 before the task exits.
-- what happens if one of 3 jobs fails? Master/job coordinator should
reassign it to another node. I guess you're safe as long as it can
keep track of what boxes are up, and not wrongly assign to a box that
has been shut off.
Well... it's a bit more complicated than that... When the new node is assigned the job, it has to pull the data from somewhere. That somewhere is typically the mappers who generated the data in the first place. If they aren't there anymore, the map tasks may need to be rerun (or, more likely: the job will fail). Normally the replication factor on map output is 1, so this is an entirely plausible scenario. This is one of a few reasons why Hadoop jobs can have their "% complete" go backwards... mappers can even go back from 100% to <100%.
Related to this: it's conceivable, depending on the stage those reducer jobs are in, that they have yet to receive all of the map output that feeds in to them. Obviously in THAT case killing the wrong mapper is deadly.
I think it is important to highlight the difference between taking offline TaskTracker-only nodes vs. nodes running the TaskTracker + DataNode services. If you take off more than a couple of the latter, you're going to lose blocks in HDFS, which is usually not a great thing for your job (unless you really don't use HDFS for anything other than distributing your job). You can take off a couple of nodes at a time, and then run a rebalancer to "encourage" HDFS to get the replication factor of all blocks back up to 3. Of course, this triggers network traffic and disk I/O, which might slow down your remaining tasks.
tl;dr: there can be problems killing nodes. While you can be confident that a completed task, which writes its output to S3, has completely written out all of its output by the time the JobTracker is notified the task has completed, the same can't be said for map tasks, which write out to their local directory and transfer data to reducers asynchronously. Even if all the map output has been transferred to its target reducers, if your reducers fail (or if speculative execution triggers the spinning up of a task on another node), you may really need those other nodes, as Hadoop will likely turn to them for input data for a reassigned reducer.
--
Chris
P.S. This can actually be a big pain point for non-EMR Hadoop setups as well (instead of paying for nodes longer than you need them, it presents as having nodes sitting idle when you have work they could be doing, along with massive compute time loss due to node failures). As a general rule, the tricks to avoid the problem are: keep your task sizes pretty consistent and in the 1-5 minute range, enable speculative execution (really crucial in the EMR world where node performance is anything but consistent), keep replication factors up well above your expected node losses for a given job (depending on your node reliability, once you cross >400 nodes with day-long job runs, you start thinking about a replication factor of 4), and use a job scheduler that allows new jobs to kick off while old jobs are still finishing up (these days this is usually the default, but it was a totally new thing introduced ~Hadoop 0.20 IIRC). I've even heard of crazy things like using SSDs for map output dirs (while they can wear out fast from all the writes, their failure scenarios tend to be less catastrophic for a Hadoop job).
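For what it's worth, two of those tricks are plain configuration with the classic API; a minimal sketch, with the caveat that dfs.replication is normally set cluster-wide in hdfs-site.xml and property names differ across versions:

import org.apache.hadoop.mapred.JobConf;

public class JobTuning {
    public static JobConf tune(JobConf conf) {
        // Speculative execution: duplicate slow ("straggler") task attempts on other
        // nodes so one flaky node doesn't hold the whole job hostage.
        conf.setMapSpeculativeExecution(true);
        conf.setReduceSpeculativeExecution(true);
        // Keep more block copies than the node losses you expect during a long run
        // (3 is the usual default; the note above suggests 4 for very long/large jobs).
        conf.setInt("dfs.replication", 4);
        return conf;
    }
}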

When is it a good idea to increase/decrease the number of nodes interactively on a Hadoop MapReduce job?

I have an intuition that increasing/decreasing the
number of nodes interactively on a running job can speed up map-heavy
jobs, but won't help with reduce-heavy jobs, where most of the work is done
by reduce.
There's an FAQ about this, but it doesn't really explain it very well:
http://aws.amazon.com/elasticmapreduce/faqs/#cluster-18
This question was answered by Christopher Smith, who gave me permission to post here.
As always... "it depends". One thing you can pretty much always count
on: adding nodes later on is not going to help you as much as having
the nodes from the get go.
When you create a Hadoop job, it gets split up in to tasks. These
tasks are effectively "atoms of work". Hadoop lets you tweak the # of
mapper and # of reducer tasks during job creation, but once the job is
created, it is static. Tasks are assigned to "slots". Traditionally,
each node is configured to have a certain number of slots for map
tasks, and a certain number of slots for reduce tasks, but you can
tweak that. Some newer versions of Hadoop don't require you to
designate the slots as being for map or reduce tasks. Anyway, the
JobTracker periodically assigns tasks to slots. Because this is done
dynamically, new nodes coming online can speed up the processing of a
job by providing more slots to execute the tasks.
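To make that concrete: with the classic API, the task counts are baked in when the job is configured, roughly like this (the numbers are illustrative only):

import org.apache.hadoop.mapred.JobConf;

public class TaskCounts {
    public static JobConf sizeJob(JobConf conf) {
        // A hint only: the real number of map tasks is driven by the input splits.
        conf.setNumMapTasks(200);
        // Taken literally, and fixed for the life of the job once it is submitted.
        conf.setNumReduceTasks(40);
        return conf;
    }
}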
This sets the stage for understanding the reality of adding new nodes.
There's obviously an Amdahl's law issue where having more slots than
pending tasks accomplishes little (if you have speculative execution
enabled, it does help somewhat, as Hadoop will schedule the same task
to run on many different nodes, so that a slow node's tasks can be
completed by faster nodes if there are spare resources). So, if you
didn't define your job with many map or reduce tasks, adding more
nodes isn't going to help much. Of course, each task imposes some
overhead, so you don't want to go crazy high either. That's why I
suggest a guideline for task size should be "something which takes
~2-5 minutes to execute".
Of course, when you add nodes dynamically, they have one other
disadvantage: they don't have any data local. Obviously, if you are at
the start of an EMR pipeline, none of the nodes have data in them, so it
doesn't matter, but if you have an EMR pipeline made of many jobs,
with earlier jobs persisting their results to HDFS, you get a huge
performance boost because the JobTracker will favour shaping and
assigning tasks so nodes have that lovely locality of data (this is a
core trick of the whole MapReduce design to maximize performance). On
the reducer side, data is coming from other map tasks, so dynamically
added nodes are really at no disadvantage as compared to other nodes.
So, in principle, dynamically adding new nodes is actually less likely
to help with IO bound map tasks that are reading from HDFS.
Except...
Hadoop has a variety of cheats under the covers to optimize
performance. One is that it starts transmitting map output data to
the reducers before the map task completes/the reducer starts. This
obviously is a critical optimization for jobs where the mappers
generate a lot of data. You can tweak when Hadoop starts to kick off
the transfers. Anyway, this means that a newly spun up node might be
at a disadvantage, because the existing nodes might already have such
a huge data advantage. Obviously, the more output that the mappers
have transmitted, the larger the disadvantage.
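The knob being alluded to ("you can tweak when Hadoop starts to kick off the transfers") is, as far as I know, the reduce slow-start fraction in the classic API; a hedged sketch, since the property name changed in later releases:

import org.apache.hadoop.mapred.JobConf;

public class SlowStart {
    public static JobConf configure(JobConf conf) {
        // Fraction of map tasks that must finish before reducers are launched and
        // start pulling map output. Newer releases spell this
        // mapreduce.job.reduce.slowstart.completedmaps.
        conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.80f);
        return conf;
    }
}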
That's how it all really works. In practice though, a lot of Hadoop
jobs have mappers processing tons of data in a CPU intensive fashion,
but outputting comparatively little data to the reducers (or they
might send a lot of data to the reducers, but the reducers are still
very simple, so not CPU bound at all). Often jobs will have few
(sometimes even 0) reducer tasks, so even where extra nodes could help, if
you already have a reduce slot available for every outstanding reduce
task, new nodes can't help. New nodes also disproportionately help out
with CPU bound work, for obvious reasons, and because that tends to
be map tasks more than reduce tasks, that's where people typically see
the win. If your mappers are I/O bound and pulling data from the
network, adding new nodes obviously increases the aggregate bandwidth
of the cluster, so it helps there, but if your map tasks are I/O bound
reading HDFS, the best thing is to have more initial nodes, with data
already spread over HDFS. It's not unusual to see reducers get I/O
bound because of poorly structured jobs, in which case adding more
nodes can help a lot, because it splits up the bandwidth again.
There's a caveat there too of course: with a really small cluster,
reducers get to read a lot of their data from the mappers running on
the local node, and adding more nodes shifts more of the data to being
pulled over the much slower network. You can also have cases where
reducers spend most of their time just multiplexing data processing
from all the mappers sending them data (although that is tunable as
well).
If you are asking questions like this, I'd highly recommend profiling
your job using something like Amazon's offering of KarmaSphere. It
will give you a better picture of where your bottlenecks are and what
your best strategies are for improving performance.

Estimating Hadoop Scalability Performance on pseudo-distributed nodes?

Are there any tools, packages, or methodologies available to estimate / simulate the scalability of Hadoop using only a single machine running in a pseudo-distributed architecture? Such a system would need to make accurate estimations based on jobs that do not interfere with each other in the simulation (e.g., with blocked I/O).
In my mind, how this would work is that I'd run all my map / reduce jobs sequentially, and use some metric to estimate how well the system is scaling (e.g., take the longest running map job and estimate that the run time will be bottlenecked by it).
Additionally, I have multiple map/reduce jobs which are being chained together to form the output.
I think it largely depends on the nature of your job. Let us try to take a few examples:
1. Your job has heavy input formatting and mapper processing, with minimal data passed to the reducer. In this case I would estimate that a pseudo-distributed cluster will realistically reflect real cluster performance (per slot), and you can assume that a 5-node cluster will have about 5x the performance. I would suggest putting in enough data that the job takes at least 5-10 times the job start-up time. This estimation will be better if you have enough splits to ensure data locality during processing.
If you plan to have a lot of relatively small files, put enough of them in your test to simulate per-task overhead.
2. Your job relies heavily on Hadoop's distributed sort capability (shuffling). Its performance on one node and on a real cluster can be quite different, and the factor is hard to estimate.
To summarize: the throughput of the mapper (and, to some extent, the reducer) in terms of MB/sec per slot can be estimated as described above. A real cluster will probably not have better per-slot performance.
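As a rough illustration of that estimate; every figure below is hypothetical and should be replaced with what you actually measure on the pseudo-distributed node:

public class ScalingEstimate {
    public static void main(String[] args) {
        // Measured on the single pseudo-distributed node (hypothetical numbers).
        double mapMbPerSecPerSlot = 25.0; // mapper throughput per slot
        double inputGb = 200.0;           // total input size for the real job
        double startupSec = 30.0;         // fixed per-job start-up overhead

        // Target cluster: 5 nodes x 4 map slots, assuming per-slot throughput
        // carries over (case 1 above: mapper-heavy, little data to the reducer).
        int slots = 5 * 4;
        double mapPhaseSec = (inputGb * 1024.0) / (mapMbPerSecPerSlot * slots);
        System.out.printf("Estimated map phase: %.0f s (+ ~%.0f s start-up)%n",
                          mapPhaseSec, startupSec);
    }
}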
