I'm currently working on a cluster running ClusterVisionOS 3.1. This is my first time working with a cluster, so I may well have missed the "obvious".
I can submit a single job to the cluster with the "qsub" command (I have this working properly).
But the problem starts when submitting multiple jobs at once. I could write a script that sends them all at once, but then all the nodes would be occupied with my jobs, and there are other people here who also want to submit their jobs.
So here's the deal:
32 nodes (4 processors/slots each)
The best thing would be to tell the cluster to use only 3 nodes (12 processors) and queue all my jobs on those nodes/processors, if that is even possible. If each job could then take a single processor, that would be perfect.
OK, so I guess I found out there is no built-in solution to this problem. My personal workaround is to write a script that connects to the cluster through ssh and checks how many jobs are already running under my user name. The script checks that this number does not exceed, let's say, 20 jobs at the same time; as long as that limit is not reached, it keeps submitting jobs.
Maybe it's an ugly solution, but it works!
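For what it's worth, here is a minimal sketch of that throttling loop in Python (assuming PBS-style qstat/qsub are reachable, e.g. by running on the head node rather than over ssh; the script names, the header-line count, and the limit of 20 are all placeholders):

```python
import getpass
import subprocess
import time

MAX_JOBS = 20       # cap on my simultaneous jobs
POLL_SECONDS = 60   # how often to re-check the queue
job_scripts = ["job_%03d.sh" % i for i in range(200)]  # hypothetical job scripts

def my_job_count():
    """Count jobs currently queued/running under my user name via qstat."""
    out = subprocess.run(["qstat", "-u", getpass.getuser()],
                         capture_output=True, text=True).stdout
    return max(0, len(out.splitlines()) - 5)  # PBS qstat -u prints ~5 header lines

while job_scripts:
    # Top up the queue whenever we are below the limit, then wait and re-check.
    while job_scripts and my_job_count() < MAX_JOBS:
        subprocess.run(["qsub", job_scripts.pop(0)])
    time.sleep(POLL_SECONDS)
```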
As for the processor question: the jobs were already being assigned to individual processors, fully utilizing the nodes.
I'm learning Big Data at uni and I'm a bit confused about MapReduce. I was wondering how many reducers can run simultaneously. For example, if we had 864 reducers, how many could run at the same time?
All of them can run simultaneously, depending on the state of the cluster (its health, i.e. no rogue/bad nodes), its capacity, and how free it is. If other MR jobs are running on the same cluster, only a few of your 864 reducers will go into the running state at first; once capacity frees up, another set of reducers will start running.
There is also a case that sometimes happens where your reducers and mappers keep preempting each other and consume all the memory; the job fails in the majority of these cases. To avoid this, we generally configure a smaller number of reducers.
The one-line answer is: all of them can run simultaneously, since each reducer performs an independent unit of work in the MapReduce framework.
Now, how many actually run in parallel, or more precisely when each of them gets scheduled to run, depends on many factors, including but not limited to resource availability, the scheduling mechanism, and cluster configuration.
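To make that concrete, a rough back-of-the-envelope illustration (all numbers invented):

```python
# How many of 864 reducers actually run at once, given the free reduce slots.
total_reducers = 864
nodes = 50                       # healthy worker nodes
reduce_slots_per_node = 4        # configured reduce slots per node
slots_taken_by_other_jobs = 120  # capacity already used by other MR jobs

free_slots = nodes * reduce_slots_per_node - slots_taken_by_other_jobs  # 80
running_now = min(total_reducers, free_slots)                           # 80
waiting = total_reducers - running_now                                  # 784
print(running_now, "running,", waiting, "waiting for slots to free up")
```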
I manage a small team of developers, and at any given time we have several ongoing (one-off) data projects that could be considered "embarrassingly parallel". These generally involve running a single script on a single computer for several days; a classic example would be processing several thousand PDF files to extract some key text and place it into a CSV file for later insertion into a database.
We are now doing enough of these types of tasks that I started investigating a simple job queue system using RabbitMQ and a few spare servers (with an eye toward using Amazon SQS/S3/EC2 for projects that need larger scaling).
In searching for examples of others doing this I keep coming across the classic Hadoop New York Times example:
The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4 TB of raw image TIFF data (stored in S3) into 11 million finished PDFs in the space of 24 hours at a computation cost of about $240 (not including bandwidth)
Which sounds perfect, so I researched Hadoop and Map/Reduce.
But what I can't work out is how they did it? Or why they did it?
Converting TIFFs into PDFs is surely not a Map/Reduce problem? Wouldn't a simple job queue have been better?
The other classic Hadoop example, the "wordcount" from the Yahoo Hadoop Tutorial, seems a perfect fit for Map/Reduce, and I can see why it is such a powerful tool for Big Data.
What I don't understand is how these "embarrassingly parallel" tasks are put into the Map/Reduce pattern.
TL;DR
This is very much a conceptual question: basically, how would I fit a task like "process several thousand PDF files to extract some key text and place it into a CSV file" into the Map/Reduce pattern?
If you know of any examples, that would be perfect; I'm not asking you to write it for me.
(Note: we already have code to process the PDFs; I'm not asking for that - it's just an example, and it could be any task. I'm asking about putting processes like that into the Hadoop Map/Reduce pattern when there are no clear "map" or "reduce" elements to the task.)
Cheers!
Your thinking is right.
The examples you mentioned used only part of what Hadoop offers: they definitely used Hadoop's parallel computing ability plus the distributed file system. You will not always need a reduce step. If there is no data interdependency between the parallel processes being run, you can eliminate the reduce step.
I think your problem will also fit into the Hadoop solution domain.
You have huge data - a huge number of PDF files
And a long-running job
You can process these files in parallel by placing them on HDFS and running a MapReduce job. Your processing time theoretically improves with the number of nodes in your cluster. If you do not need to aggregate the data sets produced by the individual tasks, you do not need a reduce step; otherwise you will need to design a reduce step as well.
The point is that if you do not need a reduce step, you are simply leveraging Hadoop's parallel computing ability, and you are also equipped to run your jobs on inexpensive hardware.
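As a concrete sketch of the map-only version of the PDF task, here is what a Hadoop Streaming mapper could look like in Python (this assumes the job's input is a text file listing PDF paths one per line, that the job is run with zero reduce tasks, and that extract_key_text is a stand-in for your existing extraction code):

```python
#!/usr/bin/env python
# Map-only Hadoop Streaming mapper: one PDF path per input line in,
# one CSV row per PDF out. No reducer is needed because the rows are
# independent; the mappers' output files can be concatenated afterwards.
import csv
import sys

def extract_key_text(pdf_path):
    """Stand-in for your existing PDF extraction code."""
    return "key text extracted from " + pdf_path  # placeholder

writer = csv.writer(sys.stdout)
for line in sys.stdin:
    pdf_path = line.strip()
    if not pdf_path:
        continue
    try:
        writer.writerow([pdf_path, extract_key_text(pdf_path)])
    except Exception as exc:
        # Don't let one bad file kill the whole task; report it and move on.
        sys.stderr.write("failed on %s: %s\n" % (pdf_path, exc))
```

The "map" here is simply "one input record in, zero or more output records out"; with the reduce count set to zero, Hadoop writes each mapper's output straight to the job's output directory.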
I need to add one more thing: error handling and retry. In a distributed environment, node failure is pretty common. I regularly run EMR clusters of several hundred nodes at a time for 3-8 days, and it is very likely that 3 or 4 nodes will fail during that period.
The Hadoop JobTracker will nicely resubmit failed tasks (up to a certain number of times) on a different node.
Chris Smith answered this question and said I could post it.
If you have a 200-node mapreduce job, with just 3 running reduce jobs
left lingering, is it safe to switch off all nodes except the master
and the 3 with the running jobs?
Plus maybe a handful more in case of a bad node needing replacement?
If the answer to this question is "yes", it's strange that EMR doesn't
automatically turn off most of the nodes when they're not in use.
Lately there have been several jobs that mostly finished, but with a few
reduces lingering. I think this is costing us, since the not-in-use
nodes stay up.
There are these issues I can think of:
-- when does data get copied to S3? If a node is not in use in terms
of running a reduce, could it still be needed for copying to S3? In that
case, the answer to my question is that you're basically never safe to
switch off nodes
-- what happens if one of the 3 jobs fails? The master/job coordinator should
reassign it to another node. I guess you're safe as long as it can
keep track of which boxes are up, and doesn't wrongly assign work to a box
that has been shut off.
If you have a 200-node mapreduce job, with just 3 running reduce jobs
left lingering, is it safe to switch off all nodes except the master
and the 3 with the running jobs?
Plus maybe a handful more in case of a bad node needing replacement?
If the answer to this question is "yes", it's strange that EMR doesn't
automatically turn off most of the nodes when they're not in use.
Keep in mind, EMR is a very thin layer over Hadoop. If you were doing distributed computation on Amazon's fabric, you could be a TON more efficient with something customized for its specific needs which would not really resemble Hadoop or Map/Reduce at all. If you're doing a lot of heavy work with Hadoop, you are often better off with your own cluster or at least with a dedicated cluster in the cloud (that way data is already sliced up on local disk and output need only be persisted to local disk). EMR's main virtue is that it is quick and dirty and hooks in nicely to other parts of AWS (like S3).
Lately there have been several jobs that mostly finished, but with a few
reduces lingering. I think this is costing us, since the not-in-use
nodes stay up.
It most definitely is costing you, particularly in terms of runtime. I'd start by being concerned about why the completion times are so non-uniform.
There are these issues I can think of:
-- when does data get copied to S3? If a node is not in use in terms
of running a reduce, could it still be needed for copying to S3? In that
case, the answer to my question is that you're basically never safe to
switch off nodes
If you are referring to the output of a job: when S3 is the output path in your job configuration, the data from a given task will be written out to S3 before the task exits.
-- what happens if one of the 3 jobs fails? The master/job coordinator should
reassign it to another node. I guess you're safe as long as it can
keep track of which boxes are up, and doesn't wrongly assign work to a box
that has been shut off.
Well... it's a bit more complicated than that... When the new node is assigned the job, it has to pull the data from somewhere. That somewhere is typically the mappers that generated the data in the first place. If they aren't there anymore, the map tasks may need to be rerun (or, more likely, the job will fail). Normally the replication factor on map output is 1, so this is an entirely plausible scenario. This is one of a few reasons why Hadoop jobs can have their "% complete" go backwards... mappers can even go back from 100% to <100%.
Related to this: it's conceivable, depending on the stage those reducer jobs are in, that they have yet to receive all of the map output that feeds into them. Obviously in THAT case killing the wrong mapper is deadly.
I think it is important to highlight the difference between taking offline TaskTracker-only nodes vs. nodes running the TaskTracker + DataNode services. If you take off more than a couple of the latter, you're going to lose blocks in HDFS, which is usually not a great thing for your job (unless you really don't use HDFS for anything other than distributing your job). You can take off a couple of nodes at a time, and then run a rebalancer to "encourage" HDFS to get the replication factor of all blocks back up to 3. Of course, this triggers network traffic and disk I/O, which might slow down your remaining tasks.
tl;dr: there can be problems killing nodes. While you can be confident that a completed task, which writes its output to S3, has completely written out all of its output by the time the JobTracker is notified the task has completed, the same can't be said for map tasks, which write out to their local directory and transfer data to reducers asynchronously. Even if all the map output has been transferred to its target reducers, if your reducers fail (or if speculative execution triggers the spinning up of a task on another node), you may really need those other nodes, as Hadoop will likely turn to them for input data for a reassigned reducer.
--
Chris
P.S. This can actually be a big pain point for non-EMR Hadoop setups as well (instead of paying for nodes longer than you need them, it presents as having nodes sitting idle when you have work they could be doing, along with massive compute time loss due to node failures). As a general rule, the tricks to avoid the problem are: keep your task sizes pretty consistent and in the 1-5 minute range, enable speculative execution (really crucial in the EMR world where node performance is anything but consistent), keep replication factors up well above your expected node losses for a given job (depending on your node reliability, once you cross >400 nodes with day-long job runs, you start thinking about a replication factor of 4), and use a job scheduler that allows new jobs to kick off while old jobs are still finishing up (these days this is usually the default, but it was a totally new thing introduced ~Hadoop 0.20 IIRC). I've even heard of crazy things like using SSDs for mapout dirs (while they can wear out fast from all the writes, their failure scenarios tend to be less catastrophic for a Hadoop job).
I'm starting to play around with Hadoop (but don't have access to a cluster yet, so I'm just playing around in standalone mode). My question is: once it's in a cluster setup, how are tasks distributed, and can the code base be transferred to new nodes?
Ideally, I would like to run large batch jobs and, if I need more capacity, add new nodes to the cluster; but I'm not sure whether I'll have to copy the same code that's running locally, or do something special, so that I can add capacity while the batch job is running. I thought I could store my codebase on HDFS and have it pulled locally to run every time I need it, but that still means I need some kind of initial script on the server and have to run it manually first.
Any suggestions or advice on if this is possible would be great!
Thank you.
When you schedule a MapReduce job using the hadoop jar command, the JobTracker will determine how many mappers are needed to execute your job. This is usually determined by the number of blocks in the input file, and this number is fixed no matter how many worker nodes you have. It will then enlist one or more tasktrackers to execute your job.
The application jar (along with any other jars specified using the -libjars argument) is copied automatically to all of the machines running the tasktrackers that are used to execute your job. All of that is handled by the Hadoop infrastructure.
Adding additional tasktrackers will increase the parallelism of your job assuming that there are as-yet-unscheduled map tasks. What it will not do is automatically re-partition the input to parallelize across additional map capacity. So if you have a map capacity of 24 (assuming 6 mappers on each of 4 data nodes), and you have 100 map tasks with the first 24 executing, and you add another data node, you'll get some additional speed. If you have only 12 map tasks, adding machines won't help you.
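To put rough numbers on that, a small back-of-the-envelope helper (mirroring the figures above):

```python
import math

def map_waves(map_tasks, nodes, map_slots_per_node):
    """How many 'waves' of map tasks are needed at a given map capacity."""
    capacity = nodes * map_slots_per_node
    return math.ceil(map_tasks / capacity)

# 100 map tasks on 4 nodes x 6 slots = 24 slots -> 5 waves of maps
print(map_waves(100, 4, 6))                      # 5
# add a fifth node -> 30 slots -> 4 waves, so some extra speed
print(map_waves(100, 5, 6))                      # 4
# only 12 map tasks -> already fits in one wave; extra machines don't help
print(map_waves(12, 4, 6), map_waves(12, 5, 6))  # 1 1
```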
Finally, you need to be aware of data reference locality. Since the data should ideally be processed on the same machines that store it initially, adding new task trackers will not necessarily add proportional processing speed, since the data will not be local on those nodes initially and will need to be copied over the network.
I do not quite agree with Daniel's reply.
Primarily because, if "on starting a job, jar code will be copied to all the nodes that the cluster knows of" were true, then even if you use 100 mappers and there are 1000 nodes, the code for every job would always be copied to all the nodes. That does not make sense.
Instead, Chris Shain's reply makes more sense: whenever the JobScheduler on the JobTracker chooses a job to be executed and identifies a task to be executed by a particular datanode, it somehow conveys to that tasktracker where to copy the codebase from.
Initially (before the MapReduce job starts), the codebase is copied to multiple locations, as defined by the mapred.submit.replication parameter. Hence, a tasktracker can copy the codebase from several locations, a list of which may be sent to it by the jobtracker.
Before attempting to build a Hadoop cluster I would suggest playing with Hadoop using Amazon's Elastic MapReduce.
With respect to the problem that you are trying to solve, I am not sure that Hadoop is a proper fit. Hadoop is useful for trivially parallelizable batch jobs: parsing thousands (or more) of documents, sorting, re-bucketing data. Hadoop Streaming will allow you to create mappers and reducers using any language you like, but the inputs and outputs must be in a fixed format. There are many uses but, in my opinion, process control was not one of the design goals.
[EDIT] Perhaps ZooKeeper is closer to what you are looking for.
You could add capacity to the batch job if you want, but it needs to be supported as a possibility in your codebase. For example, if you have a mapper that contains a set of inputs you want to assign to multiple nodes to relieve the pressure, you can. All of this can be done, but not with the default Hadoop install.
I'm currently working on a Nested Map-Reduce framework that extends the Hadoop codebase and allows you to spawn more nodes based on the inputs that the mapper or reducer gets. If you're interested, drop me a line and I'll explain more.
Also, when it comes to the -libjars option, this only works for the nodes that are assigned by the jobtracker as instructed by the job you write. So if you specify 10 mappers, the -libjars option will copy your code there. If you want to start with 10 but work your way up, the nodes you add will not have the code.
The easiest way to bypass this is to add your jar to the classpath in the hadoop-env.sh script. Then, whenever a job starts, that jar will be copied to all the nodes the cluster knows of.
I'm trying to run the Fair Scheduler, but it's not assigning Map tasks to some nodes, even with only one job running. My understanding is that the Fair Scheduler will use the configured slot limits unless multiple jobs exist, at which point the fairness calculations kick in. I've also tried setting all queues to FIFO in fair-scheduler.xml, but I get the same results.
I've set the scheduler in all mapred-site.xml files with the mapreduce.jobtracker.taskscheduler parameter (although I believe only the JobTracker needs it), and some nodes have no problem receiving and running Map tasks. However, other nodes either never get any Map tasks, or get one round of Map tasks (i.e., all slots filled once) and then never get any again.
I tried this as a prerequisite to developing my own LoadManager, so I went ahead and put a debug LoadManager together. From log messages, I can see that the problem nodes keep requesting Map tasks, and that their slots are empty. However, they're never assigned any.
All nodes work perfectly with the default scheduler. I just started having this issue when I enabled the Fair Scheduler.
Any ideas? Does someone have this working, and has taken a step that I've missed?
EDIT: It's worth noting that the Fair Scheduler web UI page indicates the correct Fair Share count, but that the Running column is always less. I'm using the default per-user pools and only have 1 user and 1 job at a time.
The reason was the undocumented mapred.fairscheduler.locality.delay parameter. The problematic nodes were located on a different rack with HDFS disabled, making all tasks on these nodes non-rack local. Because of this, they were incurring large delays due to the Fair Scheduler's Delay Scheduling algorithm, described here.
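For anyone else hitting this, the gist of delay scheduling is roughly the following (a simplified sketch, not the actual Fair Scheduler code; the real implementation also distinguishes node-local from rack-local assignments and works in terms of heartbeats):

```python
import time

LOCALITY_DELAY = 15.0  # seconds; stands in for mapred.fairscheduler.locality.delay

class PendingMapTasks:
    """Hand out map tasks, preferring hosts that hold the task's input block."""
    def __init__(self, preferred_hosts_by_task):
        self.pending = dict(preferred_hosts_by_task)  # {task_id: set of hosts}
        self.skipped_since = None  # when we first declined a non-local slot

    def assign(self, requesting_host):
        if not self.pending:
            return None
        # First choice: a task whose input is local to the requesting host.
        for task_id, hosts in list(self.pending.items()):
            if requesting_host in hosts:
                self.skipped_since = None
                del self.pending[task_id]
                return task_id
        # No local task: decline this slot until the locality delay expires.
        now = time.time()
        if self.skipped_since is None:
            self.skipped_since = now
        if now - self.skipped_since < LOCALITY_DELAY:
            return None  # the node keeps asking and keeps getting nothing
        # Delay expired: give up on locality and hand out any pending task.
        self.skipped_since = None
        task_id = next(iter(self.pending))
        del self.pending[task_id]
        return task_id
```

Since no task could ever look local (or even rack-local) to those nodes, every one of their requests fell into the "decline" branch until the delay expired, which is why they received tasks so rarely.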