Hadoop - "Code moves near data for computation"

I just want to clarify this quote, "Code moves near data for computation":
1. Does this mean all Java MR code written by a developer is deployed to all servers in the cluster?
2. If 1 is true, and someone changes an MR program, how is it distributed to all the servers?
Thanks

Hadoop puts the MR job's jar into HDFS, its distributed file system. The task trackers that need it take it from there. So the jar is distributed to some nodes and then loaded on demand by the nodes that actually need it; usually this need means the node is going to process local data.
A Hadoop cluster is "stateless" with respect to jobs: each job is viewed as something new, and "side effects" of the previous job are not reused.
Indeed, when only a small number of files (or splits, to be precise) is to be processed on a large cluster, an optimization that sends the jar only to the few hosts where the data actually resides might somewhat reduce job latency. I do not know if such an optimization is planned.
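For reference, here is a minimal sketch of a standard MR driver, with illustrative class names; submitting the job via waitForCompletion is what stages the jar named by setJarByClass, from where the nodes running tasks fetch it:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                ctx.write(word, ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        // Tells Hadoop which jar to ship: the one containing this class.
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Submitting stages the jar in HDFS; nodes running tasks fetch it.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}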

In a Hadoop cluster you use the same nodes for data and computation, meaning your HDFS datanode runs on the same machines used by the task trackers for computation. So when you execute MR jobs, the job tracker looks at where your data is stored. In other computation models, the data is not stored on the same cluster, and you may have to move it while doing your computation on some compute node.
After you start a job, each map task gets a split of your input file. Map tasks are scheduled so that the split of the input file is close to them, on the same node or at least in the same rack. This is what we mean by computation being done close to the data.
So, to answer your question: every time you run an MR job its code is copied to the nodes that execute it, so if you change the code, the new code is copied out the next time you submit the job.
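To see where that locality information comes from, here is a hedged sketch (the input path is illustrative): every InputSplit reports the hosts holding its data via getLocations(), and the scheduler uses those hostnames as placement hints.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitLocations {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        FileInputFormat.addInputPath(job, new Path("/data/input")); // illustrative path
        for (InputSplit split : new TextInputFormat().getSplits(job)) {
            // Hostnames holding replicas of this split's block(s); the
            // scheduler tries to run the task on one of them.
            System.out.println(split + " -> " + String.join(",", split.getLocations()));
        }
    }
}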

Related

Spark running on YARN - What does a real-life example's workflow look like?

I have been reading up on Hadoop, YARN and Spark. What makes sense to me thus far is what I have summarized below.
Hadoop MapReduce: The client chooses an input file and hands it off to Hadoop (or YARN). Hadoop takes care of splitting the file based on the user's InputFormat and stores it on as many nodes as are available and configured. The client then submits a job (map-reduce) to YARN, which copies the jar to the available data nodes and executes the job. YARN is the orchestrator that takes care of all the scheduling and running of the actual tasks.
Spark: Given a job, input and a bunch of configuration parameters, it can run your job, which could be a series of transformations, and provide you the output.
I also understand that MapReduce is a batch-based processing paradigm and Spark is more suited to micro-batch or stream-based data.
There are a lot of articles that talk about how Spark can run on YARN and how they are complementary, but none have managed to help me understand how those two come together during an actual workflow. For example, when a client has a job to submit, reading a huge file and doing a bunch of transformations, what does the workflow look like when using Spark on YARN? Let us assume that the client's input file is a 100GB text file. Please include as much detail as possible.
Any help with this would be greatly appreciated.
Thanks
Kay
Let's assume the large file is stored in HDFS. In HDFS the file is divided into blocks of some size (default 128 MB).
That means your 100GB file will be divided into 800 blocks. Each block will be replicated and can be stored on different nodes in the cluster.
When reading the file with a Hadoop InputFormat, a list of splits with their locations is obtained first. Then one task is created per split, so you get 800 parallel tasks that are executed by the runtime.
Basically the input process is the same for MapReduce and Spark, because both of them use Hadoop InputFormats.
Both of them process each InputSplit in a separate task. The main difference is that Spark has a richer set of transformations and can optimize the workflow if there is a chain of transformations that can be applied at once, as opposed to MapReduce, where there are always only a map and a reduce phase.
YARN stands for "Yet Another Resource Negotiator". When a new job with some resource requirement (memory, processors) is submitted, it is the responsibility of YARN to check whether the needed resources are available on the cluster. If other jobs running on the cluster are taking up too many of the resources, the new job will be made to wait until the previous jobs complete and resources become available.
YARN will allocate enough containers in the cluster for the workers, plus one for the Spark driver. In each of these containers a JVM is started with the given resources. Each Spark worker can process multiple tasks in parallel (depending on the configured number of cores per executor).
For example: if you request 100 executors with 8 cores each, YARN tries to allocate 101 containers in the cluster to run 100 Spark workers + 1 Spark master (driver). Each of the workers will process 8 tasks in parallel (because of the 8 cores).
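To make the workflow concrete, here is a hedged sketch of the client side, assuming an illustrative HDFS path and the resource numbers from the example above; the job would typically be launched with something like "spark-submit --master yarn --deploy-mode cluster --num-executors 100 --executor-cores 8 --class LineLengths app.jar":

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LineLengths {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("line-lengths");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // One partition per HDFS block: ~800 for a 100GB file in
            // 128 MB blocks. The path is illustrative.
            JavaRDD<String> lines = sc.textFile("hdfs:///data/big.txt");
            // Two chained narrow transformations; Spark pipelines them into
            // a single stage, one task per input split.
            long longLines = lines.map(String::trim)
                                  .filter(s -> s.length() > 80)
                                  .count();
            System.out.println("Long lines: " + longLines);
        }
    }
}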

Why are map task outputs written to the local disk and not to HDFS?

I am prepping for an exam and here is a question in the lecture notes:
Why are map task outputs written to the local disk and not to HDFS?
Here are my thoughts:
Reduced network traffic, as the reducer may run on the same machine as the output, so copying is not required.
The fault tolerance of HDFS is not needed: if the job dies halfway, we can always just re-run the map task.
What are other possible reasons? Are my answers reasonable?
Your reasoning is correct. However, I would like to add a few points: what if map outputs were written to HDFS? Writing to HDFS is not like writing to a local disk. It is a more involved process, with the namenode assuring that at least dfs.replication.min copies are written; the namenode also runs a background thread to make additional copies of under-replicated blocks.
Suppose the user kills the job midway, or the job simply fails. There would be lots of intermediate files sitting on HDFS for no reason, which you would have to delete manually. And if this happens too many times, your cluster's performance will degrade. HDFS is optimized for appending, not frequent deleting.
Also, if the job fails during the map phase, it performs a cleanup before exiting. If the output were on HDFS, the deletion process would require the namenode to send a block-deletion message to the appropriate datanodes, invalidating those blocks and removing them from the blocksMap. So many operations involved, just for a failed cleanup, and for no gain!
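As a hedged illustration of that cost difference (paths are illustrative): an HDFS write goes through the namenode and a replication pipeline, while a local write is a plain file create, which is why map outputs spill to the task's local directories instead.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteComparison {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // HDFS write: the namenode allocates blocks and a pipeline of
        // datanodes replicates each one before the write is acknowledged.
        FileSystem hdfs = FileSystem.get(URI.create("hdfs:///"), conf);
        try (FSDataOutputStream out = hdfs.create(new Path("/tmp/spill"))) {
            out.writeBytes("intermediate data\n");
        }
        // Local write: a plain file create, with no namenode round trips
        // and no replication; map outputs spill this way to local dirs.
        FileSystem local = FileSystem.getLocal(conf);
        try (FSDataOutputStream out2 = local.create(new Path("/tmp/spill.local"))) {
            out2.writeBytes("intermediate data\n");
        }
    }
}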
Because it doesn’t use valuable cluster bandwidth. This is called the data locality optimization. Sometimes, however, all the nodes hosting the HDFS block replicas for a map task’s input split are running other map tasks, so the job scheduler will look for a free map slot on a node in the same rack as one of the blocks. Very occasionally even this is not possible, so an off-rack node is used, which results in an inter-rack network transfer.
from "Hadoop The Definitive Guide 4 edition"
One more point about writing the map output to the local file system: the outputs of all the mappers eventually get merged and finally become the input for the shuffle and sort stages that precede the reduce phase.

hadoop node unused for map tasks

I've noticed that all map and all reduce tasks are running on a single node (node1). I tried creating a file consisting of a single HDFS block which resides on node2. When running a mapreduce task whose input consists only of this block resident on node2, the task still runs on node1. I was under the impression that hadoop prioritizes running tasks on the nodes that contain the input data. I see no errors reported in the log files. Any idea what might be going on here?
I have a 3-node cluster running on kvms created by following the cloudera cdh4 distributed installation guide.
I was under the impression that hadoop prioritizes running tasks on the nodes that contain the input data.
Well, there might be an exceptional case:
If the node holding the data block doesn't have any free CPU slots, it won't be able to start any mappers on that particular node. In such a scenario, instead of waiting, the data block will be moved to a nearby node and processed there. But before that, the framework will try to process a replica of that block locally on another node that holds it (if the replication factor > 1).
HTH
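If you want to verify what the scheduler actually did, here is a hedged sketch using the built-in job counters of the Hadoop 2 mapreduce API: after the job completes, DATA_LOCAL_MAPS vs RACK_LOCAL_MAPS shows how many map tasks ran local to their data.

import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobCounter;

public class LocalityCheck {
    // Call after job.waitForCompletion(true) on the same Job instance.
    static void printLocality(Job job) throws Exception {
        Counters counters = job.getCounters();
        // Map tasks whose input block was on the same node:
        System.out.println("data-local maps: "
                + counters.findCounter(JobCounter.DATA_LOCAL_MAPS).getValue());
        // Map tasks that read from another node in the same rack:
        System.out.println("rack-local maps: "
                + counters.findCounter(JobCounter.RACK_LOCAL_MAPS).getValue());
    }
}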
I don't understand when you say "I tried creating a file consisting of a single hdfs block which resides on node2". I don't think you can "direct" a hadoop cluster to store some block on some specific node.
Hadoop will decide the number of mappers based on the input's size. If the input size is less than the HDFS block size (the default, I think, is 64m), it will spawn just one mapper.
You can set the job parameter "mapred.max.split.size" to whatever size you want to force spawning multiple mappers (the default should suffice in most cases).
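For example, a hedged sketch of capping the split size in a driver (the 16 MB value is illustrative); setMaxInputSplitSize sets mapreduce.input.fileinputformat.split.maxsize, the newer name for mapred.max.split.size:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitTuning {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split tuning");
        // Cap each split at 16 MB so a single 64 MB file yields ~4 map
        // tasks instead of 1.
        FileInputFormat.setMaxInputSplitSize(job, 16L * 1024 * 1024);
        System.out.println(job.getConfiguration()
                .get("mapreduce.input.fileinputformat.split.maxsize"));
    }
}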

Remotely retrieve a file from hdfs and store it locally in a node

I want to write a job in which each mapper checks whether a file from HDFS is stored on the node where it is executing. If it isn't, I want to retrieve it from HDFS and store it locally on that node. Is this possible?
EDIT: I am trying to do this, (3) Preprocessing for Repartition Join, as described here: link
The DistributedCache feature in Hadoop can be used to distribute the side data or auxiliary data required for the completion of the job. Here (1, 2) are some interesting articles on it.
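A hedged sketch of that approach, assuming the newer mapreduce API and illustrative paths: the cached file is localized onto each task's node before the mapper runs, which is effectively what the question asks for.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheFileExample {

    public static class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void setup(Context context) throws IOException {
            // The file added via addCacheFile below is localized next to this
            // task; the #fragment makes it visible as ./side_data.
            try (BufferedReader r = new BufferedReader(new FileReader("side_data"))) {
                String line = r.readLine();
                // ... load the side data into memory for the join ...
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "repartition-join prep");
        job.setJarByClass(CacheFileExample.class);
        job.setMapperClass(JoinMapper.class);
        // Hypothetical side file; "#side_data" names the local symlink.
        job.addCacheFile(new URI("hdfs:///data/side_data.txt#side_data"));
        // ... set input/output paths and formats, then submit ...
    }
}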
Why would you want to do this? The data locality principle used by Hadoop does this for you. Well, it does not move the data; it moves the program.
This comes from the Wikipedia page about Hadoop:
The jobtracker schedules map/reduce jobs to tasktrackers with an awareness of the data location. An example of this would be if node A contained data (x,y,z) and node B contained data (a,b,c). The jobtracker will schedule node B to perform map/reduce tasks on (a,b,c) and node A would be scheduled to perform map/reduce tasks on (x,y,z).
And the reason the computation is moved to the data, and not the other way around, is explained in the Hadoop documentation itself:
"Moving Computation is Cheaper than Moving Data": A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.

Can Hadoop distribute tasks and code base?

I'm starting to play around with Hadoop (but I don't have access to a cluster yet, so I'm just playing around in standalone mode). My question is: once it's in a cluster setup, how are tasks distributed, and can the code base be transferred to new nodes?
Ideally, I would like to run large batch jobs and, if I need more capacity, add new nodes to the cluster, but I'm not sure whether I'll have to copy the same code that's running locally or do something special so that while the batch job is running I can add capacity. I thought I could store my codebase on HDFS and have it pulled locally to run every time I need it, but that still means I need some kind of initial script on the server and have to run it manually first.
Any suggestions or advice on if this is possible would be great!
Thank you.
When you schedule a mapreduce job using the hadoop jar command, the jobtracker determines how many mappers are needed to execute your job. This is usually determined by the number of blocks in the input file, and this number is fixed, no matter how many worker nodes you have. It will then enlist one or more tasktrackers to execute your job.
The application jar (along with any other jars specified using the -libjars argument) is copied automatically to all of the machines running the tasktrackers that are used to execute your job. All of that is handled by the Hadoop infrastructure.
Adding additional tasktrackers will increase the parallelism of your job, assuming there are as-yet-unscheduled map tasks. What it will not do is automatically re-partition the input to parallelize across the additional map capacity. So if you have a map capacity of 24 (assuming 6 mappers on each of 4 data nodes) and 100 map tasks, with the first 24 executing, and you add another data node, you'll get some additional speed. If you have only 12 map tasks, adding machines won't help you.
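To make that capacity arithmetic concrete, a tiny sketch using the numbers above (the waves helper is just ceiling division):

public class MapWaves {
    static int waves(int tasks, int slots) {
        return (tasks + slots - 1) / slots; // ceiling division
    }
    public static void main(String[] args) {
        System.out.println(waves(100, 24)); // 5 waves on 4 nodes x 6 slots
        System.out.println(waves(100, 30)); // 4 waves after adding a 5th node
        System.out.println(waves(12, 24));  // 1 wave; extra nodes don't help
        System.out.println(waves(12, 30));  // still 1 wave
    }
}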
Finally, you need to be aware of data locality. Since the data should ideally be processed on the same machines that initially store it, adding new task trackers will not necessarily add proportional processing speed, since the data will not initially be local to those nodes and will need to be copied over the network.
I do not quite agree with Daniel's reply. Primarily because if "on starting a job, the jar code will be copied to all the nodes that the cluster knows of" were true, then even if you use 100 mappers and there are 1000 nodes, the code for every job would always be copied to all the nodes. That does not make sense.
Instead, Chris Shain's reply makes more sense: whenever the JobScheduler on the JobTracker chooses a job to be executed and identifies a task to be executed by a particular datanode, it somehow conveys to the tasktracker where to copy the codebase from.
Initially (before the mapreduce job starts), the codebase is copied to multiple locations, as defined by the mapred.submit.replication parameter. Hence, a tasktracker can copy the codebase from one of several locations, a list of which may be sent to it by the jobtracker.
Before attempting to build a Hadoop cluster I would suggest playing with Hadoop using Amazon's Elastic MapReduce.
With respect to the problem that you are trying to solve, I am not sure that Hadoop is a proper fit. Hadoop is useful for trivially parallelizable batch jobs: parsing thousands (or more) of documents, sorting, re-bucketing data. Hadoop Streaming will allow you to create mappers and reducers using any language you like, but the inputs and outputs must be in a fixed format. There are many uses but, in my opinion, process control was not one of the design goals.
[EDIT] Perhaps ZooKeeper is closer to what you are looking for.
You could add capacity to the batch job if you want, but it needs to be presented as a possibility in your codebase. For example, if you have a mapper with a set of inputs that you want to assign to multiple nodes to relieve the pressure, you can do that. All of this can be done, but not with the default Hadoop install.
I'm currently working on a Nested Map-Reduce framework that extends the Hadoop codebase and allows you to spawn more nodes based on the inputs the mapper or reducer gets. If you're interested, drop me a line and I'll explain more.
Also, when it comes to the -libjars option, this only works for the nodes that are assigned by the jobtracker as instructed by the job you write. So if you specify 10 mappers, -libjars will copy your code there. If you want to start with 10 but work your way up, the nodes you add will not have the code.
The easiest way to bypass this is to add your jar to the classpath in the hadoop-env.sh script. That way, whenever a job starts, the jar is picked up on all the nodes the cluster knows of.
