Can Hadoop distribute tasks and code base?

I'm starting to play around with Hadoop (I don't have access to a cluster yet, so I'm just running it in standalone mode). My question is: once it's in a cluster setup, how are tasks distributed, and can the code base be transferred to new nodes?
Ideally, I would like to run large batch jobs and, if I need more capacity, add new nodes to the cluster. I'm not sure whether I'll have to copy the same code that's running locally to each new node, or do something special so that I can add capacity while the batch job is running. I thought I could store my codebase on HDFS and have it pulled locally every time I need to run it, but that still means I need some kind of initial script on the server and need to run it manually first.
Any suggestions or advice on if this is possible would be great!
Thank you.

When you schedule a MapReduce job using the hadoop jar command, the jobtracker determines how many mappers are needed to execute your job. This is usually determined by the number of blocks in the input file, and this number is fixed no matter how many worker nodes you have. The jobtracker then enlists one or more tasktrackers to execute your job.
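As a rough illustration of "one mapper per block", here is a minimal sketch of the arithmetic. The function name and the 1 GB / 64 MB figures are made up for the example; the real count depends on the input format and split settings, so treat this as an estimate only.

```python
def estimate_map_tasks(input_bytes: int, block_bytes: int) -> int:
    """Estimate map tasks for one input file, assuming one task per block."""
    if input_bytes <= 0:
        return 0
    # Integer ceiling division: a partially filled last block still needs a task.
    return (input_bytes + block_bytes - 1) // block_bytes


MB = 1024 * 1024
GB = 1024 * MB

# Hypothetical input: a 1 GB file stored with a 64 MB block size.
print(estimate_map_tasks(1 * GB, 64 * MB))
```

The point is that the result depends only on the input and the block size, not on how many worker nodes are available.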
The application jar (along with any other jars specified using the -libjars argument) is copied automatically to all of the machines running the tasktrackers that are used to execute your job. All of that is handled by the Hadoop infrastructure.
Adding additional tasktrackers will increase the parallelism of your job assuming that there are as-yet-unscheduled map tasks. What it will not do is automatically re-partition the input to parallelize across additional map capacity. So if you have a map capacity of 24 (assuming 6 mappers on each of 4 data nodes), and you have 100 map tasks with the first 24 executing, and you add another data node, you'll get some additional speed. If you have only 12 map tasks, adding machines won't help you.
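The numbers above can be checked with a small back-of-the-envelope sketch. It assumes, purely for illustration, that every map task takes the same time and that the capacity is fixed for the whole job, so the job runs in "waves" of tasks. The function name is hypothetical.

```python
def map_waves(map_tasks: int, nodes: int, slots_per_node: int) -> int:
    """Number of rounds needed to run all map tasks at a given map capacity."""
    capacity = nodes * slots_per_node
    # Ceiling division: a partly filled last wave still costs a full round.
    return -(-map_tasks // capacity)


# 100 map tasks: a fifth node (6 more slots) removes one wave.
print(map_waves(100, 4, 6), map_waves(100, 5, 6))

# 12 map tasks: everything already fits in one wave, so extra nodes do nothing.
print(map_waves(12, 4, 6), map_waves(12, 5, 6))
```

With 100 tasks the extra node helps because there are still unscheduled tasks to hand out; with 12 tasks there is nothing left to parallelize.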
Finally, you need to be aware of data locality. Since data should ideally be processed on the machines that already store it, adding new tasktrackers will not necessarily add proportional processing speed: the data will not initially be local to those nodes and will need to be copied over the network.
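To make the locality point concrete, here is a toy sketch of how a scheduler might prefer a node that already holds the data, then a node in the same rack, and only then any free node. This is not Hadoop's scheduler code; the function, node names and rack map are all invented for the example.

```python
def pick_node(split_hosts, free_nodes, rack_of):
    """Return (node, locality) for one input split, preferring local data."""
    # Best case: a free node that stores a replica of the split.
    for node in free_nodes:
        if node in split_hosts:
            return node, "node-local"
    # Next best: a free node in the same rack as some replica.
    split_racks = {rack_of[host] for host in split_hosts}
    for node in free_nodes:
        if rack_of[node] in split_racks:
            return node, "rack-local"
    # Otherwise any free node; the data must cross racks over the network.
    if free_nodes:
        return free_nodes[0], "off-rack"
    return None, None


# Hypothetical cluster: n5 is a freshly added node in its own rack.
rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2", "n5": "r3"}
replicas = ["n1", "n3"]

print(pick_node(replicas, ["n5", "n3"], rack_of))
print(pick_node(replicas, ["n2", "n5"], rack_of))
print(pick_node(replicas, ["n5"], rack_of))
```

A newly added node such as n5 holds no blocks yet, so in this model every task it receives is off-rack and has to pull its input over the network.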

I do not quite agree with Daniel's reply.
Primarily because, if "on starting a job, jar code will be copied to all the nodes that the cluster knows of" were true, then even if you use 100 mappers on a 1000-node cluster, the code for every job would always be copied to all 1000 nodes. That does not make sense.
Chris Shain's reply makes more sense: whenever the job scheduler on the jobtracker chooses a job to execute and assigns a task to a particular node, it somehow tells that node's tasktracker at that point where to copy the codebase from.
Initially (before the MapReduce job starts), the codebase is copied to multiple locations, as defined by the mapred.submit.replication parameter. Hence, the tasktracker can copy the codebase from any of several locations, a list of which may be sent to it by the jobtracker.

Before attempting to build a Hadoop cluster I would suggest playing with Hadoop using Amazon's Elastic MapReduce.
With respect to the problem that you are trying to solve, I am not sure that Hadoop is a proper fit. Hadoop is useful for trivially parallelizable batch jobs (parsing thousands or more documents, sorting, re-bucketing data). Hadoop Streaming allows you to create mappers and reducers in any language you like, but the inputs and outputs must be in a fixed format. There are many uses but, in my opinion, process control was not one of the design goals.
[EDIT] Perhaps ZooKeeper is closer to what you are looking for.

You could add capacity to the batch job if you want, but that possibility needs to be built into your codebase. For example, if you have a mapper holding a set of inputs that you want to spread across multiple nodes to relieve the pressure, you can do that. All of this can be done, but not with the default Hadoop install.
I'm currently working on a Nested Map-Reduce framework that extends the Hadoop codebase and allows you to spawn more nodes based on the inputs that the mapper or reducer gets. If you're interested, drop me a line and I'll explain more.
Also, when it comes to the -libjars option, this only works for the nodes that are assigned by the jobtracker as instructed by the job you write. So if you specify 10 mappers, -libjars will copy your code there. If you want to start with 10 but work your way up, the nodes you add will not have the code.
The easiest way to bypass this is to add your jar to the classpath of the script. When a job starts, that will always copy the jar to all the nodes that the cluster knows of.


Specify the number of map-slots in concurrent hadoop jobs

I have a Hadoop system running. It has 8 map slots in parallel altogether. The DFS block size is 128 MB.
Now suppose I have two jobs, both with large input files, say a hundred GB each. I want them to run in parallel on the Hadoop system (because the users do not want to wait; they want to see some progress). I want the first job to take 5 map slots in parallel and the second to run on the remaining 3 map slots. Is it possible to specify the number of map slots? Currently I start a job from the command line as hadoop jar jarfile classname input output. Can I specify it on the command line?
Thank you very much for the help.
Resource allocation can be done using a scheduler. Classic Hadoop uses a JobQueueTaskScheduler, while YARN uses CapacityScheduler by default. According to the Hadoop documentation:
"This document describes the CapacityScheduler, a pluggable scheduler for Hadoop which allows for multiple-tenants to securely share a large cluster such that their applications are allocated resources in a timely manner under constraints of allocated capacities."

Big Data File Processing in Map Reduce

I am trying to understand how MapReduce works in general. What I know is that there are mappers that run in parallel over several computers and create a result set, which is then used by reducers running in parallel over several machines to create the intended data set.
My questions are:
1. Does one job run on a fixed number of files? That is, at the start of a job, is there a fixed set of files that need to be processed to produce some data?
2. If no, how can we process a stream of data that may be coming from different sources, such as Twitter feeds?
3. If yes, please explain how MapReduce finds out when all the mappers have finished and the reducing task should begin, because possibly there is no point of reference.
1. Yes. Basically a job starts, processes files and ends. It does not run forever.
2. Stream processing can be handled by Storm or similar technologies, but not by Hadoop alone, since it is a batch processing system. You can also look into how Hadoop YARN and Storm can work together.
3. There should be a point of reference, because the tasktrackers running on different nodes periodically send status information about the tasks being run (map tasks and reduce tasks) to the jobtracker, which coordinates the job run.
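The "point of reference" idea can be sketched as a toy coordinator that collects status reports and knows when every map task has finished. This is only a simplified model with invented class and method names, not the real jobtracker protocol.

```python
class ToyJobTracker:
    """Toy coordinator: tracks which map tasks are still outstanding."""

    def __init__(self, map_task_ids):
        self.pending = set(map_task_ids)
        self.done = set()

    def heartbeat(self, tracker_name, finished_tasks):
        """Record the tasks that one tasktracker reports as finished."""
        for task in finished_tasks:
            if task in self.pending:
                self.pending.remove(task)
                self.done.add(task)

    def all_maps_done(self):
        """True once no map task is outstanding, so reducing can proceed."""
        return not self.pending


tracker = ToyJobTracker(["m1", "m2", "m3"])
tracker.heartbeat("tt1", ["m1"])
print(tracker.all_maps_done())
tracker.heartbeat("tt2", ["m2", "m3"])
print(tracker.all_maps_done())
```

Because the set of map tasks is fixed when the job starts, the coordinator always knows exactly how many completions it is waiting for.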

Hadoop - "Code moves near data for computation"

I just want to clarify the quote "Code moves near data for computation":
1. Does this mean that all Java MR code written by a developer is deployed to all servers in the cluster?
2. If 1 is true and someone changes an MR program, how is it distributed to all the servers?
Hadoop puts the MR job's jar into HDFS, its distributed file system. The tasktrackers that need it take it from there. So it is distributed to some nodes and then loaded on demand by the nodes that actually need it. Usually this need means that the node is going to process local data.
A Hadoop cluster is "stateless" in relation to jobs. Each job is treated as something new, and "side effects" of previous jobs are not used.
Indeed, when a small number of files (or splits, to be precise) are to be processed on a large cluster, an optimization that sends the jar only to the few hosts where the data actually resides might somewhat reduce job latency. I do not know if such an optimization is planned.
In a Hadoop cluster you use the same nodes for data and computation. That means your HDFS datanodes are set up on the same machines that the tasktrackers use for computation. So when you execute MR jobs, the jobtracker looks at where your data is stored. In other computation models the data is not stored on the same cluster, and you may have to move data to some compute node while doing your computation.
After you start a job, each map function gets a split of your input file. The map functions are executed where their split of the input file is close to them, in other words in the same rack. This is what we mean by computation being done closer to the data.
So, to answer your question: every time you run an MR job, its code is copied to all the nodes. If we change the code, the new code is copied to all the nodes.

Synchronize data to HBase/HDFS and use it as input to MapReduce job

I would like to synchronize data to a Hadoop filesystem. This data is intended to be used as input for a scheduled MapReduce job.
This example might explain more:
Let's say I have an input stream of documents that contain a bunch of words, and these words are needed as input for a MapReduce WordCount job. So, for each document, all words should be parsed out and uploaded to the filesystem. However, if the same document arrives from the input stream again, I only want the changes to be uploaded to (or deleted from) the filesystem.
How should the data be stored; should I use HDFS or HBase? The amount of data is not very large, maybe a couple of GB.
Is it possible to start scheduled MapReduce jobs with input from HDFS and/or HBase?
I would first pick the best tool for the job, or do some research to make a reasonable choice. You're asking the question, which is the most important step. Given the amount of data you're planning to process, Hadoop is probably just one option. If this is the first step towards bigger and better things, then that would narrow the field.
I would then start off with the simplest approach that I expect to work, which typically means using the tools I already know. Write code flexibly to make it easier to replace original choices with better ones as you learn more or run into roadblocks. Given what you've stated in your question, I'd start off by using HDFS, using the Hadoop command-line tools to push the data to an HDFS folder (hadoop fs -put ...). Then I'd write an MR job or jobs to do the processing, running them manually. When that was working, I'd probably use cron to handle scheduling of the jobs.
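As a minimal sketch of that workflow, the helpers below only build the command lines involved; nothing is executed. The helper names, file paths, jar name, class name and cron schedule are all placeholders chosen for the example.

```python
import shlex


def put_command(local_path, hdfs_dir):
    """Argument list for pushing a local file into an HDFS folder."""
    return ["hadoop", "fs", "-put", local_path, hdfs_dir]


def job_command(jar, main_class, input_dir, output_dir):
    """Argument list for launching an MR job with the hadoop jar command."""
    return ["hadoop", "jar", jar, main_class, input_dir, output_dir]


def cron_line(schedule, argv):
    """Crontab entry that runs the given command on the given schedule."""
    return schedule + " " + " ".join(shlex.quote(arg) for arg in argv)


# Placeholder names: docs.txt, /data/in, /data/out, wordcount.jar, WordCount.
print(" ".join(put_command("docs.txt", "/data/in")))
job = job_command("wordcount.jar", "WordCount", "/data/in", "/data/out")
print(cron_line("0 2 * * *", job))
```

Running the commands by hand first (for example via subprocess.run on the argument lists) and only then moving the job line into a crontab matches the "manual first, schedule later" order suggested above.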
That's a place to start. As you build the process, if you reach a point where HBase seems like a natural fit for what you want to store, then switch over to that. Solve one problem at a time, and that will give you clarity on which tools are the right choice each step of the way. For example, you might get to the scheduling step and know by that time that cron won't do what you need - perhaps your organization has requirements for job scheduling that cron won't fulfil. So, you pick a different tool.

In Hadoop can we control the number of nodes per job programmatically?

I am running a job timing analysis. I have a preconfigured cluster with 8 nodes. I want to run a given job with 8 nodes, 6 nodes, 4 nodes and 2 nodes respectively and note down the corresponding run times. Is there a way I can do this programmatically, i.e. by using appropriate settings in the job configuration in Java code?
There are a couple of ways. I would prefer them in the order listed.
1. Exclude files can be used to prevent some of the tasktrackers/datanodes from connecting to the jobtracker/namenode. Check this faq. The properties to be used are mapreduce.jobtracker.hosts.exclude.filename and dfs.hosts.exclude. Note that once the files have been changed, the namenode and the jobtracker have to be refreshed using the mradmin and dfsadmin commands with the refreshNodes option, and it might take some time for the cluster to settle because data blocks have to be moved off the excluded nodes.
2. Another way is to stop the tasktracker on some nodes. Then map/reduce tasks will not be scheduled on those nodes. But the data will still be fetched from all the datanodes, so the datanodes also need to be stopped. Make sure that the namenode gets out of safe mode and that the replication factor is set properly (with 2 datanodes, the replication factor can't be 3).
3. A Capacity Scheduler can also be used to limit the usage of a cluster by a particular job. But when resources are free/idle, the scheduler will allocate resources beyond capacity for better utilization of the cluster. I am not sure if this can be stopped.
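For the exclude-file approach, a small helper can generate the list of hosts to exclude for each run of the timing experiment. This sketch assumes the exclude file is a plain list with one hostname per line, and the hostnames node1 to node8 are invented; it only produces the file contents and does not touch the cluster.

```python
def exclude_list(all_hosts, active_count):
    """Hosts to put in the exclude file so only active_count nodes remain."""
    if not 0 < active_count <= len(all_hosts):
        raise ValueError("active_count must be between 1 and the cluster size")
    # Keep the first active_count hosts and exclude the rest.
    return all_hosts[active_count:]


# Hypothetical 8-node cluster.
hosts = [f"node{i}" for i in range(1, 9)]

for active in (8, 6, 4, 2):
    contents = "\n".join(exclude_list(hosts, active))
    print(f"--- exclude file for {active} active nodes ---")
    print(contents)
```

After writing each variant to the configured exclude files, the refreshNodes step described in option 1 still has to be run, and the cluster given time to settle, before timing the job.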
Are you good with scripting? If so, play around with the start scripts of the daemons. Since this is an experimental setup, I think restarting Hadoop for each experiment should be fine.
