I wonder if it is possible to install a "background" Hadoop cluster. After all, Hadoop is meant to cope with nodes being unavailable or slow at times.
So assume some university has a computer lab: say, 100 boxes, all with upscale desktop hardware, gigabit Ethernet, and probably even an identical software installation. Linux is really popular here, too.
However, these 100 boxes are of course meant to be desktop systems for students. There are times when the lab will be full, but also times when it will be empty. User data is mostly stored on central storage (say, NFS), so the local disks are not used much.
Using the systems as a Hadoop cluster in their idle time sounds like a good idea to me. The simplest setup would of course be a cron job that starts the cluster at night and shuts it down in the morning. However, many computers are unused during the day as well.
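For concreteness, a minimal sketch of the cron variant, assuming the stock start-all.sh/stop-all.sh scripts run from a designated master node (the installation path, log path, and times are my assumptions):

# Hypothetical crontab on the master: cluster runs only at night.
# Assumes passwordless SSH from the master to all workers, as the
# standard start-all.sh/stop-all.sh scripts require.
0 22 * * *  /opt/hadoop/bin/start-all.sh >> /var/log/hadoop-cron.log 2>&1
30 6 * * *  /opt/hadoop/bin/stop-all.sh  >> /var/log/hadoop-cron.log 2>&1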
However, how would Hadoop react to, e.g., nodes being shut down when a user logs in? Is it possible to easily "pause" (preempt!) a node in Hadoop, moving it to swap when needed? Ideally, we would give Hadoop a chance to move the computation away before suspending the task (also to free up memory). How would one do such a setup? Is there a way to signal Hadoop that a node will be suspended?
As far as I can tell, datanodes should not be stopped, and replication may need to be increased to more than 3 copies. With YARN there might also be the problem that, because the task tracker can be placed on an arbitrary node, it may be the one that gets suspended at some point. But maybe it could be arranged that a small set of nodes is always on and runs the task trackers.
Is it appropriate to just stop the tasktracker, or to send it SIGSTOP (and later resume with SIGCONT)? The first would probably give Hadoop a chance to react; the second would resume faster if the user logs out soon (as the job can then simply continue). How about YARN?
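For concreteness, the suspend/resume variant I have in mind would be something like this hypothetical wrapper (the pid-file path is an assumption based on hadoop-daemon.sh defaults; the task JVMs are separate child processes, so a real version would have to freeze those too, e.g. via a cgroup freezer):

#!/bin/sh
# suspend-tt.sh: freeze or thaw the local tasktracker JVM (sketch only).
PIDFILE=/var/run/hadoop/hadoop-hadoop-tasktracker.pid
case "$1" in
  suspend)
    # SIGSTOP freezes the process; its memory can then be paged out.
    [ -f "$PIDFILE" ] && kill -STOP "$(cat "$PIDFILE")"
    ;;
  resume)
    # SIGCONT resumes it; if the jobtracker hasn't timed it out yet,
    # the running tasks simply continue.
    [ -f "$PIDFILE" ] && kill -CONT "$(cat "$PIDFILE")"
    ;;
esac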
First of all, Hadoop doesn't support preemption as you describe it. Hadoop simply restarts a task if it detects that the tasktracker is dead. So in your case, when a user logs into a host, some script simply kills the tasktracker, and the jobtracker marks all mappers/reducers that were running on the killed tasktracker as FAILED. These tasks are then rescheduled on different nodes.
Of course, such a scenario is not free. By design, mappers and reducers keep all intermediate data on their local hosts. Moreover, reducers fetch map output directly from the tasktrackers where the mappers ran. So when a tasktracker is killed, all of that data is lost. For mappers this is not a big problem, since a mapper usually works on a relatively small amount of data (gigabytes?), but a reducer suffers more. A reducer runs the shuffle, which is costly in terms of network bandwidth and CPU. If a tasktracker was running a reducer, restarting that reducer means all of its data has to be downloaded once more onto the new host.
Also, as I recall, the jobtracker doesn't notice immediately that a tasktracker is dead, so the killed tasks won't be restarted immediately either.
If your workload is light, the datanodes can live forever; don't take them offline when a user logs in. A datanode uses a small amount of memory (256 MB should be enough for a small amount of data) and, under a light workload, doesn't use much CPU or disk I/O.
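To keep the always-on datanode unobtrusive on a desktop, its heap can be capped in hadoop-env.sh; a sketch matching the 256 MB estimate above (the variable is the standard one, the value is the assumption):

# conf/hadoop-env.sh fragment: cap the datanode heap at 256 MB.
export HADOOP_DATANODE_OPTS="-Xmx256m $HADOOP_DATANODE_OPTS"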
In conclusion, you can set up such a configuration, but don't rely on good, predictable job execution under moderate workloads.
Related
I am following the tutorial on the hadoop website: https://hadoop.apache.org/docs/r3.1.2/hadoop-project-dist/hadoop-common/SingleCluster.html.
I run the following example in Pseudo-Distributed Mode.
time hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar grep input output 'dfs[a-z.]+'
It takes 1:47min to complete. When I turn off the network (wifi), it finishes in approx 50 seconds.
When I run the same command using the Local (Standalone) Mode, it finishes in approx 5 seconds (on a mac).
I understand that in Pseudo-Distributed Mode there is more overhead involved and hence it will take more time, but in this case it takes way more time. The CPU is completely idle during the run.
Do you have any idea what can cause this issue?
First, I don't have an explanation for why turning off your network would result in faster times. You'd have to dig through the Hadoop logs to figure out that problem.
This is typical behavior most people encounter when running Hadoop on a single node. Effectively, you are trying to use FedEx to deliver something to your next-door neighbor. It will always be faster to walk it over, because of the inherent overhead of operating a distributed system. When you run local mode, you are only performing the Map-Reduce function. When you run pseudo-distributed, it will use all the Hadoop servers (NameNode and DataNodes for data; ResourceManager and NodeManagers for compute), and what you are seeing is the latencies involved in that.
When you submit your job, the ResourceManager has to schedule it. As your cluster is not busy, it will ask for resources from the NodeManager. The NodeManager will give it a container, which will run your ApplicationMaster. Typically, this loop takes about 10 seconds. Once your AM is running, it will ask the ResourceManager for resources for its Map and Reduce tasks. This takes another 10 seconds. Also, when you submit your job, there is around a 3-second wait before the job is actually submitted to the ResourceManager. So far that's 23 seconds, and you haven't done any computation yet.
Once the job is running, the most likely cause of waiting is memory allocation. On smaller systems (< 32 GB of memory) the OS might take a while to allocate space. If you were to run the same thing on what is considered commodity hardware for Hadoop (16+ cores, 64+ GB), you would probably see run times closer to 25-30 seconds.
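One knob worth trying for such tiny jobs is YARN's "uber" mode, which runs the whole job inside the ApplicationMaster's JVM and skips the extra container round-trips. A sketch using the same example job (the threshold values are my assumptions):

hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar grep \
  -D mapreduce.job.ubertask.enable=true \
  -D mapreduce.job.ubertask.maxmaps=9 \
  -D mapreduce.job.ubertask.maxreduces=1 \
  input output 'dfs[a-z.]+'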
Does a stateless node just mean being independent of the others? Can you explain this concept with respect to Hadoop?
The explanation can be as follows: each mapper/reducer has no idea about any of the other mappers/reducers (i.e., their current states, their particular outputs if any, etc.). Such statelessness is not great for certain data-processing workloads (e.g., graph data) but allows easy parallelization: a particular map/reduce task can be run on any node, meaning a failed mapper/reducer is not an issue; just start a new one on the same input split / mappers' outputs.
I would say that statefulness of the nodes in computing infrastructures has a slightly different meaning from what you have defined. Remember that there is always a coordination process running somewhere, so there is no complete independence between the nodes.
What it can actually mean in computing infrastructures is that the nodes do not store anything about the computation they are performing on persistent storage. Consider the following: you have a master running on some machine, delegating tasks to the workers; the workers maintain the information in RAM, retrieve it from RAM when necessary for the computation, and write their results into RAM as well. You can consider the worker nodes stateless, since whenever a worker node fails (from a power cut, for example) it has no mechanism that would allow it to recover the execution from the point where it stopped. But the master will still know that the node has failed and will delegate the task to another machine in the cluster.
Regarding Hadoop, the architecture is stateful. First of all, whenever a job starts its execution, all the metadata is transferred to the worker node (the jar file, split locations, etc.). Secondly, when a task is scheduled on a node that does not contain the input data, the data is transferred there. Additionally, the intermediate data is stored on disk, precisely for failure-recovery reasons, so that the failure-recovery mechanisms can resume the job from the point where execution stopped.
Chris Smith answered this question and said I could post it.
If you have a 200-node mapreduce job, with just 3 running reduce jobs
left lingering, is it safe to switch off all nodes except the master
and the 3 with the running jobs?
Plus maybe a handful more in case a bad node needs replacement?
If the answer to this question is "yes", it's strange that EMR doesn't
automatically turn off most of the nodes when they're not in use.
Keep in mind, EMR is a very thin layer over Hadoop. If you were doing distributed computation on Amazon's fabric, you could be a TON more efficient with something customized for its specific needs which would not really resemble Hadoop or Map/Reduce at all. If you're doing a lot of heavy work with Hadoop, you are often better off with your own cluster or at least with a dedicated cluster in the cloud (that way data is already sliced up on local disk and output need only be persisted to local disk). EMR's main virtue is that it is quick and dirty and hooks in nicely to other parts of AWS (like S3).
Lately there have been several jobs that mostly finished, but with a few
reduces lingering. I think this is costing us, since the not-in-use
nodes stay up.
It most definitely is costing you, particularly in terms of runtime. I'd start by being concerned about why the completion times are so non-uniform.
There are these issues I can think of:
-- when does data get copied to S3? If a node is not in use in terms
of running reduce, could it still be needed for copying to S3? In that
case, the answer to my question is that you're basically never safe to
switch off nodes
If you are referring to the output of a job, if you have S3 as your output path for your job configuration, then data from a given task will be written out to S3 before the task exits.
-- what happens if one of the 3 jobs fails? The master/job coordinator
should reassign it to another node. I guess you're safe as long as it
can keep track of which boxes are up, and doesn't wrongly assign to a
box that has been shut off.
Well... it's a bit more complicated than that... When the new node is assigned the task, it has to pull the data from somewhere. That somewhere is typically the mappers that generated the data in the first place. If they aren't there anymore, the map tasks may need to be rerun (or, more likely, the job will fail). Normally the replication factor on map output is 1, so this is an entirely plausible scenario. This is one of a few reasons why Hadoop jobs can have their "% complete" go backwards... mappers can even go back from 100% to <100%.
Related to this: it's conceivable, depending on the stage those reducer tasks are in, that they have yet to receive all of the map output that feeds into them. Obviously, in THAT case, killing the wrong mapper is deadly.
I think it is important to highlight the difference between taking offline TaskTracker-only nodes versus nodes running both the TaskTracker and DataNode services. If you take off more than a couple of the latter, you're going to lose blocks in HDFS, which is usually not a great thing for your job (unless you really don't use HDFS for anything other than distributing your job). You can take off a couple of nodes at a time and then run the rebalancer to "encourage" HDFS to get the replication factor of all blocks back up to 3. Of course, this triggers network traffic and disk I/O, which might slow down your remaining tasks.
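For reference, a sketch of draining a TaskTracker + DataNode box gracefully before switching it off, using HDFS decommissioning (Hadoop 1.x-era commands; the excludes-file path is an assumption tied to the dfs.hosts.exclude setting):

# Mark the node for decommissioning and tell the NameNode to re-read the list.
echo "node42.example.com" >> /opt/hadoop/conf/dfs.exclude
hadoop dfsadmin -refreshNodes
# Wait until the node reports "Decommissioned" (its blocks re-replicated):
hadoop dfsadmin -report
# Then even out block placement across the remaining nodes:
hadoop balancer -threshold 10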
tl;dr: there can be problems killing nodes. While you can be confident that a completed task which writes its output to S3 has completely written out all of its output by the time the JobTracker is notified that the task has completed, the same can't be said for map tasks, which write out to their local directory and transfer data to reducers asynchronously. Even if all the map output has been transferred to its target reducers, if your reducers fail (or if speculative execution triggers the spinning up of a task on another node), you may really need those other nodes, as Hadoop will likely turn to them for input data for a reassigned reducer.
--
Chris
P.S. This can actually be a big pain point for non-EMR Hadoop setups as well (instead of paying for nodes longer than you need them, it presents as nodes sitting idle when you have work they could be doing, along with massive compute-time loss due to node failures). As a general rule, the tricks to avoid the problem are: keep your task sizes pretty consistent and in the 1-5 minute range; enable speculative execution (really crucial in the EMR world, where node performance is anything but consistent); keep replication factors well above your expected node losses for a given job (depending on your node reliability, once you cross 400 nodes with day-long job runs, you start thinking about a replication factor of 4); and use a job scheduler that allows new jobs to kick off while old jobs are still finishing up (these days this is usually the default, but it was a totally new thing introduced around Hadoop 0.20, IIRC). I've even heard of crazy things like using SSDs for map output dirs (while they can wear out fast from all the writes, their failure scenarios tend to be less catastrophic for a Hadoop job).
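For what it's worth, the speculative-execution and replication tricks above can be set per job; a sketch with the 0.20/1.x-era property names (job.jar and MyJob are placeholders, not a real job):

hadoop jar job.jar MyJob \
  -D mapred.map.tasks.speculative.execution=true \
  -D mapred.reduce.tasks.speculative.execution=true \
  -D dfs.replication=4 \
  input output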
So, usually, for a 20-node cluster, submitting a job to process 3 GB (200 splits) of data takes about 30 seconds, and the actual execution about 1 minute.
I want to understand what the bottleneck is in the job-submission process, and to understand the following quote:
Per-MapReduce overhead is significant: Starting/ending MapReduce job costs time
Some processes I'm aware of:
1. data splitting
2. jar file sharing
A few things to understand about HDFS and M/R that help explain this latency:
HDFS stores your files as data chunks distributed on multiple machines called datanodes.
M/R runs multiple programs called mappers on each of the data chunks or blocks. The (key, value) outputs of these mappers are compiled together into results by reducers (think of summing the various results from multiple mappers).
Each mapper and reducer is a full-fledged program that is spawned on these distributed systems. It does take time to spawn full-fledged programs, even if, let us say, they do nothing (no-op MapReduce programs).
When the size of the data to be processed becomes very big, these spawn times become insignificant, and that is when Hadoop shines.
If you were to process a file with 1000 lines of content, you would be better off using a normal file read-and-process program. The Hadoop infrastructure for spawning processes on a distributed system yields no benefit; it only adds the overhead of locating the datanodes containing the relevant data chunks, starting the processing programs on them, and tracking and collecting the results.
Now expand that to hundreds of petabytes of data, and these overheads look completely insignificant compared to the time it would take to process it. The parallelization of the processors (mappers and reducers) shows its advantage here.
So, before analyzing the performance of your M/R job, you should first benchmark your cluster so that you understand the overheads better.
How much time does it take to do a no-operation map-reduce program on a cluster?
Use MRBench for this purpose:
MRbench loops a small job a number of times
Checks whether small job runs are responsive and running efficiently on your cluster.
Its impact on the HDFS layer is very limited
To run this program, try the following (check the correct approach for the latest versions):
hadoop jar /usr/lib/hadoop-0.20/hadoop-test.jar mrbench -numRuns 50
Surprisingly on one of our dev clusters it was 22 seconds.
Another issue is file size.
If the file sizes are smaller than the HDFS block size, then Map/Reduce programs have significant overhead. Hadoop typically tries to spawn a mapper per block. That means that if you have 30 files of 5 KB each, Hadoop may end up spawning 30 mappers, one per file, even though the files are tiny. This is real wastage, as the per-program overhead is significant compared to the time spent actually processing each small file.
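A common mitigation, sketched below, is to merge the small files into one large file so that mappers are spawned per block rather than per tiny file (the paths are illustrative; CombineFileInputFormat is the more principled fix):

# Concatenate the small files into a single HDFS file, then run the job on it.
hadoop fs -cat /user/me/input/part-* | hadoop fs -put - /user/me/merged/all.txt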
As far as I know, there is no single bottleneck causing the job-run latency; if there were, it would have been solved a long time ago.
There are a number of steps that take time, and there are reasons why the process is slow. I will try to list them and give estimates where I can:
Running the Hadoop client. It runs Java, and I think about 1 second of overhead can be assumed.
Putting the job into the queue and letting the current scheduler run it. I am not sure what the overhead is, but, because of the asynchronous nature of the process, some latency should exist.
Calculating splits.
Running and synchronizing tasks. Here we face the fact that TaskTrackers poll the JobTracker, not the other way around. I think this is done for scalability's sake. It means that when the JobTracker wants to execute some task, it does not call the tasktracker, but waits for the appropriate tracker to ping it and pick up the job. Tasktrackers cannot ping the JobTracker too frequently; otherwise, in large clusters, they would kill it.
Running tasks. Without JVM reuse this takes about 3 seconds; with it, the overhead is about 1 second per task (see the sketch after this list).
The client polls the JobTracker for the results (at least I think so), and that also adds some latency before it learns that the job is finished.
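The JVM-reuse setting mentioned in the task-running step can be passed per job; a sketch with the MRv1-era property name (job.jar and MyJob are placeholders; -1 means unlimited reuse within a job):

hadoop jar job.jar MyJob \
  -D mapred.job.reuse.jvm.num.tasks=-1 \
  input output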
I have seen a similar issue, and the solution can be broken into the following steps:
When HDFS stores too many small files with a fixed block size, there will be efficiency issues in HDFS; the best approach is to remove all unnecessary files and those small files that hold data. Try again.
Try working with the data nodes and name nodes (see the command sketch after these steps):
Stop all the services using stop-all.sh.
Format the name node.
Reboot the machine.
Start all services using start-all.sh.
Check the data and name nodes.
Try installing a lower version of Hadoop (Hadoop 2.5.2), which worked in two cases; I found it by trial and error.
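The restart sequence above, as commands (a sketch; note that formatting the name node erases all HDFS metadata, so only do this if the data is disposable):

stop-all.sh
hdfs namenode -format        # 'hadoop namenode -format' on older releases
# ...reboot the machine...
start-all.sh
hdfs dfsadmin -report        # verify the data nodes registered with the name node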
I'm trying to run the Fair Scheduler, but it's not assigning Map tasks to some nodes when only one job is running. My understanding is that the Fair Scheduler will use the configured slot limits unless multiple jobs exist, at which point the fairness calculations kick in. I've also tried setting all queues to FIFO in fair-scheduler.xml, but I get the same results.
I've set the scheduler in all mapred-site.xml files with the mapreduce.jobtracker.taskscheduler parameter (although I believe only the JobTracker needs it), and some nodes have no problem receiving and running Map tasks. However, other nodes either never get any Map tasks, or get one round of Map tasks (i.e., all slots filled once) and then never get any again.
I tried this as a prerequisite to developing my own LoadManager, so I went ahead and put a debug LoadManager together. From log messages, I can see that the problem nodes keep requesting Map tasks, and that their slots are empty. However, they're never assigned any.
All nodes work perfectly with the default scheduler. I just started having this issue when I enabled the Fair Scheduler.
Any ideas? Does someone have this working, and has taken a step that I've missed?
EDIT: It's worth noting that the Fair Scheduler web UI page indicates the correct Fair Share count, but the Running column is always lower. I'm using the default per-user pools and have only 1 user and 1 job at a time.
The reason was the undocumented mapred.fairscheduler.locality.delay parameter. The problematic nodes were located on a different rack with HDFS disabled, making all tasks on those nodes non-rack-local. Because of this, they were incurring large delays due to the Fair Scheduler's delay scheduling algorithm, described here.
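For anyone hitting the same thing, a sketch of the fix: set the undocumented delay to zero in mapred-site.xml on the JobTracker and restart it (property name per the Hadoop 1.x Fair Scheduler; the paths are assumptions):

# Add inside <configuration> in conf/mapred-site.xml:
#   <property>
#     <name>mapred.fairscheduler.locality.delay</name>
#     <value>0</value>   <!-- milliseconds; 0 disables delay scheduling -->
#   </property>
# Then restart the JobTracker so the scheduler picks up the change:
/opt/hadoop/bin/hadoop-daemon.sh stop jobtracker
/opt/hadoop/bin/hadoop-daemon.sh start jobtracker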