Physical Process Tree of MapReduce Jobs in Hadoop (over the cluster nodes) - hadoop

I have read a lot of references, book chapters and articles, but I'm still trying to glue everything together:
I understand the MapReduce logical chain fairly well, but I would specifically like to know which processes are launched on which physical node over time.
I guess mappers are executed "on site" on the datanode machines, but what about the other processes, specifically the reducers, which need to access data spread over multiple datanodes?
Also, if I understand correctly, the map and reduce programs are submitted from the master node where the command is executed, and this results in new JVMs being launched all over the cluster. Is that right?

I recommend visiting http://bytepadding.com/map-reduce/ for an overview.
The MapReduce client can be launched locally or on a datanode (e.g. by an Oozie launcher).
Based on the InputFormat, the file locations are fetched from the NameNode by the MapReduce driver (Application Master).
Based on the file split policy, mappers are launched, and the framework tries to spawn each mapper as close as possible to its file block.
Mappers are spawned on datanodes.
After the mappers have finished, reducers are launched on datanodes and the map output is copied over to those specific machines.
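To make that placement concrete, here is a minimal sketch (new mapreduce API; the input path is hypothetical and the job is never actually submitted) that prints which datanodes host each split. The scheduler uses exactly this location information to spawn mappers close to the blocks.

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitLocations {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "inspect-splits");
        FileInputFormat.addInputPath(job, new Path("/data/input"));   // hypothetical HDFS path

        // The InputFormat asks the NameNode where each block lives; the scheduler
        // later tries to run each map task on one of these datanodes.
        for (InputSplit split : new TextInputFormat().getSplits(job)) {
            System.out.println(split + " hosted on " + Arrays.toString(split.getLocations()));
        }
    }
}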

Related

Spark running on YARN - What does a real life example's workflow look like?

I have been reading up on Hadoop, YARN and SPARK. What makes sense to me thus far is what I have summarized below.
Hadoop MapReduce: The client chooses an input file and hands it off to Hadoop (or YARN). Hadoop takes care of splitting the file based on the user's InputFormat and stores it on as many nodes as are available and configured. The client then submits a job (map-reduce) to YARN, which copies the jar to the available data nodes and executes the job. YARN is the orchestrator that takes care of all the scheduling and running of the actual tasks.
Spark: Given a job, input and a bunch of configuration parameters, it can run your job, which could be a series of transformations, and provide you the output.
I also understand that MapReduce is a batch-based processing paradigm and Spark is better suited to micro-batch or stream-based data.
There are a lot of articles that talk about how Spark can run on YARN and how they are complementary, but none have managed to help me understand how the two come together during an actual workflow. For example, when a client has a job to submit (read a huge file and do a bunch of transformations), what does the workflow look like when using Spark on YARN? Let us assume that the client's input file is a 100GB text file. Please include as much detail as possible.
Any help with this would be greatly appreciated
Thanks
Kay
Let's assume the large file is stored in HDFS. In HDFS the file is divided into blocks of some size (default 128 MB).
That means your 100GB file will be divided into 800 blocks. Each block is replicated and can be stored on different nodes in the cluster.
When reading the file with a Hadoop InputFormat, a list of splits with their locations is obtained first. Then one task is created per split, so you get 800 parallel tasks executed by the runtime.
Basically the input process is the same for MapReduce and Spark, because both of them use Hadoop InputFormats.
Both of them process each InputSplit in a separate task. The main difference is that Spark has a richer set of transformations and can optimize the workflow if there is a chain of transformations that can be applied at once, as opposed to MapReduce, where there is always only a map and a reduce phase.
YARN stands for "Yet Another Resource Negotiator". When a new job with some resource requirements (memory, processors) is submitted, it is the responsibility of YARN to check whether the needed resources are available on the cluster. If other jobs running on the cluster are taking up too many of the resources, the new job will be made to wait until the previous jobs complete and resources become available.
YARN will allocate enough containers in the cluster for the workers, plus one for the Spark driver. In each of these containers a JVM is started with the given resources. Each Spark worker can process multiple tasks in parallel (depending on the configured number of cores per executor).
For example, if you request 100 Spark executors with 8 cores each, YARN tries to allocate 101 containers in the cluster to run the 100 Spark workers plus 1 Spark master (driver). Each worker will process 8 tasks in parallel (because of its 8 cores).
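As a concrete illustration, a Spark job over that 100 GB file might look like the sketch below (Spark's Java API; the HDFS path is hypothetical, and executor count and cores are assumed to be passed via spark-submit). Reading the file yields roughly 800 partitions, and the chained transformations are pipelined into those same 800 tasks inside the YARN containers.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class YarnWorkflowSketch {
    public static void main(String[] args) {
        // Executors and cores (e.g. 100 executors x 8 cores) come from spark-submit flags.
        SparkConf conf = new SparkConf().setAppName("yarn-workflow-sketch");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // ~800 partitions, one per 128 MB HDFS block of the 100 GB file (hypothetical path).
        JavaRDD<String> lines = sc.textFile("hdfs:///data/big-file.txt");

        // Narrow transformations are pipelined into the same tasks; no extra pass over the data.
        JavaRDD<Integer> lengths = lines.filter(l -> !l.isEmpty()).map(String::length);

        // The action triggers the job; the YARN containers execute the tasks in waves.
        System.out.println("non-empty lines: " + lengths.count());
        sc.close();
    }
}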

Why are Map task outputs written to the local disk and not to HDFS?

I am prepping for an exam and here is a question in the lecture notes:
Why are Map task outputs written to the local disk and not to HDFS?
Here are my thoughts:
Reduced network traffic, since the reducer may run on the same machine as the map output, so no copying is required.
No need for the fault tolerance of HDFS. If the job dies halfway, we can always just re-run the map task.
What are other possible reasons? Are my answers reasonable?
Your reasoning is correct. However, I would like to add a few points: what if map outputs were written to HDFS? Writing to HDFS is not like writing to local disk; it is a more involved process, with the NameNode ensuring that at least dfs.replication.min copies are written. The NameNode also runs a background thread to make additional copies of under-replicated blocks.
Suppose the user kills the job in between, or the job simply fails. There would be lots of intermediate files sitting on HDFS for no reason, which you would have to delete manually. If this happens too often, your cluster's performance will degrade. HDFS is optimized for appending, not for frequent deleting.
Also, during the map phase, if the job fails, it performs a cleanup before exiting. If the output were on HDFS, the deletion would require the NameNode to send a block-deletion message to the appropriate datanodes, which would invalidate those blocks and remove them from the blocksMap. That is a lot of work for a failed cleanup, and for no gain!
Because it doesn’t use valuable cluster bandwidth. This is called the data locality optimization. Sometimes, however, all the nodes hosting the HDFS block replicas for a map task’s input split are running other map tasks, so the job scheduler will look for a free map slot on a node in the same rack as one of the blocks. Very occasionally even this is not possible, so an off-rack node is used, which results in an inter-rack network transfer.
from "Hadoop The Definitive Guide 4 edition"
One more point about writing the map output to the local file system: the outputs of all the mappers eventually get merged and become the input for the shuffle and sort stages that precede the reducer phase.
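For reference, those intermediate files end up under the TaskTracker's local directories, which in the classic (Hadoop 1.x) configuration are controlled by the mapred.local.dir property; the snippet below is just a hedged way of checking what your cluster has configured (the property name is the old-style one; newer releases use different keys).

import org.apache.hadoop.conf.Configuration;

public class LocalDirCheck {
    public static void main(String[] args) {
        // Loads core-site.xml / mapred-site.xml from the classpath, if present.
        Configuration conf = new Configuration();
        // Map task spill and output files live under these local directories, not in HDFS.
        System.out.println(conf.get("mapred.local.dir", "<not set: falls back to hadoop.tmp.dir>"));
    }
}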

hadoop node unused for map tasks

I've noticed that all map and all reduce tasks are running on a single node (node1). I tried creating a file consisting of a single HDFS block which resides on node2. When running a MapReduce job whose input consists only of this block resident on node2, the task still runs on node1. I was under the impression that Hadoop prioritizes running tasks on the nodes that contain the input data. I see no errors reported in the log files. Any idea what might be going on here?
I have a 3-node cluster running on kvms created by following the cloudera cdh4 distributed installation guide.
I was under the impression that hadoop prioritizes running tasks on
the nodes that contain the input data.
Well, there might be an exceptional case:
If the node holding the data block doesn't have any free CPU slots, it won't be able to start any mappers on that particular node. In such a scenario, instead of waiting, the data block will be moved to a nearby node and processed there. But before that, the framework will try to process a replica of that block locally (if the replication factor is > 1).
HTH
I don't understand what you mean by "I tried creating a file consisting of a single hdfs block which resides on node2". I don't think you can direct a Hadoop cluster to store a particular block on a specific node.
Hadoop decides the number of mappers based on the input size. If the input is smaller than the HDFS block size (the default is 64 MB, I think), it will spawn just one mapper.
You can set the job parameter "mapred.max.split.size" to whatever size you want in order to force spawning multiple mappers (the default should suffice in most cases).
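For completeness, a hedged driver sketch of that setting (old-style property name as quoted above; the 32 MB cap is just an example value):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ForceMoreMappers {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cap each input split at 32 MB so a single 64/128 MB block is split into
        // several map tasks instead of one (example value, tune for your data).
        conf.setLong("mapred.max.split.size", 32L * 1024 * 1024);

        Job job = Job.getInstance(conf, "force-more-mappers");
        // ... set mapper, reducer, input/output paths as usual, then job.waitForCompletion(true)
    }
}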

Hadoop cluster and MapReduce logic

I'm new to Hadoop development. I have read about the Hadoop cluster structure and understood that there is one namenode, a jobtracker, tasktrackers and multiple datanodes.
When we write map-reduce programs we implement a mapper and a reducer. I also understood the logic of these classes, but I don't understand how they are executed on the Hadoop cluster.
Is the mapper executed only on the namenode?
Are the reducers executed separately on the datanodes?
I need to do a lot of parallel computations and don't want to use HDFS; how can I be sure that each output collection (from a mapper) is processed separately on all the datanodes?
Please explain the connection between the Hadoop cluster and the map/reduce logic.
Thanks a lot!
MapReduce jobs are executed by the JobTracker and the TaskTrackers.
The JobTracker initiates the job by dividing the input file(s) into splits. The TaskTrackers are given these splits and run the map tasks on them (one map task per split). After the mappers emit their output, it is passed on to the reducers according to the map output keys: identical keys are sent to the same reducer. There can be more than one reducer, depending on your configuration. Reducer processes also run on the TaskTracker nodes only.
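The "identical keys go to the same reducer" routing is just the default hash partitioner; conceptually it is the small sketch below (the same formula as Hadoop's HashPartitioner, shown standalone for illustration).

public class KeyRouting {
    // Same formula as Hadoop's default HashPartitioner: the mask keeps the hash
    // non-negative, the modulo maps it onto one of the configured reduce tasks.
    static int reducerFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // With 3 reducers, every occurrence of the key "hadoop" lands on the same reducer.
        System.out.println(reducerFor("hadoop", 3));
    }
}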
You can see the stats of the job on the JobTracker UI, which by default runs on port 50030.
You can also visit my website for example topics on Big Data technologies, and you can post your questions there; I will try to answer.
http://souravgulati.webs.com/apps/forums/show/14108248-bigdata-learnings-hadoop-hbase-hive-and-other-bigdata-technologies-

Does Amazon Elastic MapReduce run one or several mapper processes per instance?

My question is: should I take care of multiprocessing in my mapper myself (read tasks from stdin, distribute them over worker processes, combine the results in a master process and write to stdout), or will Hadoop take care of it automatically?
I haven't found the answer either in the Hadoop Streaming documentation or in the Amazon Elastic MapReduce FAQ.
Hadoop has a notion of "slots". A slot is a place where a mapper process will run. You configure the number of slots per tasktracker node; this is the theoretical maximum number of map processes that will run in parallel per node. It can be fewer if there are not enough separate portions of the input data (called FileSplits).
Elastic MapReduce has its own estimate of how many slots to allocate depending on the instance capabilities.
At the same time, I can imagine a scenario where your processing is more efficient when one data stream is processed by many cores. If your mapper has built-in multi-core usage, you can reduce the number of slots. But this is not usually the case for typical Hadoop tasks.
See the EMR documentation [1] for the number of map/reduce tasks per instance type.
In addition to David's answer, you can also have Hadoop run multiple threads per map slot by setting...
conf.setMapRunnerClass(MultithreadedMapRunner.class);
The default is 10 threads but it's tunable with
-D mapred.map.multithreadedrunner.threads=5
I often find this useful for custom high IO stuff.
[1] http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/HadoopMemoryDefault_AMI2.html
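Putting those two settings together in a driver (old mapred API; class and property names exactly as above, the thread count is just the example value from the command line):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

public class MultithreadedJobSketch {
    public static void main(String[] args) {
        JobConf conf = new JobConf(MultithreadedJobSketch.class);
        // Run each map task with an internal thread pool instead of a single thread.
        conf.setMapRunnerClass(MultithreadedMapRunner.class);
        // Default is 10 threads; 5 matches the -D example above.
        conf.setInt("mapred.map.multithreadedrunner.threads", 5);
        // ... set mapper, input/output formats and paths, then JobClient.runJob(conf)
    }
}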
My question is: should I take care of multiprocessing in my mapper myself (read tasks from stdin, distribute them over worker processes, combine the results in a master process and write to stdout), or will Hadoop take care of it automatically?
Once a Hadoop cluster has been set up, the minimum required to submit a job is:
Input format and location
Output format and location
Map and Reduce functions for processing the data
Location of the NameNode and the JobTracker
Hadoop will take care of distributing the job to the different nodes, monitoring them, reading the data from the input and writing the data to the output. If the user had to do all of those tasks, there would be no point in using Hadoop.
I suggest going through the Hadoop documentation and a couple of tutorials.
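A minimal driver covering those four items might look like the sketch below (new mapreduce API; the class names, paths and the NameNode/JobTracker addresses are hypothetical placeholders, and the map/reduce logic is a trivial line-count just to keep the example complete).

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MinimalJob {

    // Trivial map function: emit each line with a count of 1.
    public static class LineMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(value, ONE);
        }
    }

    // Trivial reduce function: sum the counts for identical lines.
    public static class LineReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");           // where the NameNode lives (placeholder)
        conf.set("mapred.job.tracker", "jobtracker:8021");          // where the JobTracker lives (classic MR1 setup, placeholder)

        Job job = Job.getInstance(conf, "minimal-job");
        job.setJarByClass(MinimalJob.class);

        job.setMapperClass(LineMapper.class);                       // map function
        job.setReducerClass(LineReducer.class);                     // reduce function
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/data/in"));    // input location (default TextInputFormat)
        FileOutputFormat.setOutputPath(job, new Path("/data/out")); // output location

        // Hadoop distributes the tasks, monitors them, reads the input and writes the output.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}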
