What is the purpose of "uber mode" in hadoop? - hadoop

Hi, I am a big data newbie. I searched all over the internet to find out what exactly uber mode is, but the more I searched the more confused I got. Can anybody please help me by answering my questions?
What does uber mode do?
Does it work differently in MapReduce 1.x and 2.x?
And where can I find the setting for it?

What is UBER mode in Hadoop2?
Normally, mappers and reducers run in containers scheduled by the ResourceManager (RM); the RM allocates a separate container for each map and reduce task.
The uber configuration allows the mappers and reducers to run in the same process as the ApplicationMaster (AM).
Uber jobs:
Uber jobs are jobs that are executed within the MapReduce ApplicationMaster, rather than asking the RM to create separate mapper and reducer containers.
The AM runs the map and reduce tasks within its own process and avoids the overhead of launching and communicating with remote containers.
Why?
If you have a small dataset, or you want to run MapReduce on a small amount of data, the uber configuration helps by cutting out the extra time that MapReduce normally spends launching and coordinating the mapper and reducer containers.
Can I configure uber mode for all MapReduce jobs?
As of now, only
map-only jobs and
jobs with one reducer are supported.

An uber job occurs when the mappers and reducers of a job are combined to run in a single container, the one hosting the ApplicationMaster, instead of in separate containers. There are four core settings around the configuration of uber jobs in mapred-site.xml. Configuration options for uber jobs:
mapreduce.job.ubertask.enable
mapreduce.job.ubertask.maxmaps
mapreduce.job.ubertask.maxreduces
mapreduce.job.ubertask.maxbytes
You can find more details here: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.15/bk_using-apache-hadoop/content/uber_jobs.html
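For illustration, the same four properties can also be set from a job driver instead of mapred-site.xml. This is only a minimal sketch; the class name, job name and threshold values below are placeholders, while the property keys are exactly the four listed above:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class UberModeDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.job.ubertask.enable", true);      // allow uberized execution
        conf.setInt("mapreduce.job.ubertask.maxmaps", 9);            // job must have at most this many maps
        conf.setInt("mapreduce.job.ubertask.maxreduces", 1);         // at most one reduce is supported
        conf.setLong("mapreduce.job.ubertask.maxbytes", 134217728L); // input must fit in roughly one 128 MB block

        Job job = Job.getInstance(conf, "uber-mode-example");
        // ... set jar, mapper, reducer and input/output paths as usual, then job.waitForCompletion(true) ...
    }
}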

In terms of Hadoop 2.x, uber jobs are jobs that are launched within the MapReduce ApplicationMaster itself, i.e. no separate containers are created for the map and reduce tasks, so the overhead of creating containers and communicating with them is saved.
As far as behaviour is concerned, uber mode is really a MapReduce 2 (YARN) feature; in MapReduce 1 there is no direct equivalent, since small jobs were always run as ordinary tasks on tasktrackers.
The configuration params are the same as those mentioned by Navneet Kumar in his answer.
PS: Use it only with small datasets.

Pretty good answers are given for "What is Uber Mode?"
Just to add some more information for "Why?"
The application master decides how to run the tasks that make up the MapReduce job. If the job is small, the application master may choose to run the tasks in the same JVM as itself. This happens when it judges that the overhead of allocating and running tasks in new containers outweighs the gain of running them in parallel, compared to running them sequentially on one node.
Now the question could be raised: "What qualifies as a small job?"
By default, a small job is one that has fewer than 10 mappers, only one reducer, and an input size that is less than the size of one HDFS block.
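Those defaults correspond to the mapreduce.job.ubertask.* properties shown earlier. As a rough, simplified sketch of the check the application master performs (illustrative only, not the actual MRAppMaster code), the decision boils down to something like this:
import org.apache.hadoop.conf.Configuration;

// Illustrative only: a rough version of the "small job" test described above.
public class SmallJobCheck {
    static boolean qualifiesAsUberTask(Configuration conf, int maps, int reduces, long inputBytes) {
        return conf.getBoolean("mapreduce.job.ubertask.enable", false)
                && maps       <= conf.getInt("mapreduce.job.ubertask.maxmaps", 9)
                && reduces    <= conf.getInt("mapreduce.job.ubertask.maxreduces", 1)
                && inputBytes <= conf.getLong("mapreduce.job.ubertask.maxbytes", 134217728L); // ~one HDFS block
    }
}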

Related

Multi-threading in Hadoop/Spark

I have an idea about multi-threading in general, but I am not sure how it is used in Hadoop.
Based on my knowledge, YARN is responsible for managing/controlling the resources of Spark/MapReduce jobs, and I can't see where multi-threading fits in there. I am not sure whether it can be used anywhere else in the Hadoop ecosystem.
I would appreciate it if anybody could provide some information on this.
Many thanks,
Actually, YARN is responsible for managing the resource allocation and de-allocation for containers requested by the Application Master (MR AppMaster or Spark driver). The RPC between them is all about negotiating resource agreements; it does not concern itself with how tasks run inside MapReduce or Spark.
For MapReduce on Hadoop, each task (mapper or reducer) is a single process running in its own JVM; it doesn't employ multi-threading there.
For Spark, each executor is composed of many worker threads. Each Spark task corresponds to a task (a single process) in MapReduce, so Spark is built on a multi-threaded model to lower the JVM overhead and the cost of data shuffling between tasks.
Based on my experience, the multi-threaded model lowers the overhead but suffers from a large fault-tolerance cost: if an executor in Spark fails, all the tasks running inside that executor have to re-run, whereas only the single failed task needs to re-run in MapReduce. Spark also suffers from heavy memory pressure, because all the tasks inside an executor need to cache data as RDDs, while a MapReduce task only processes one block at a time.
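To make the thread-vs-process point concrete, here is a small sketch (app name and numbers are illustrative; submit it with spark-submit, and the output appears in the executor logs). Each task prints its JVM name and worker thread name, so on an executor with several cores you should see one JVM name paired with several different thread names:
import java.lang.management.ManagementFactory;
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ThreadsPerExecutorDemo {
    public static void main(String[] args) {
        // Submit with spark-submit (e.g. --master yarn); no master is hard-coded here.
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("threads-per-executor-demo"));

        // 8 partitions -> 8 tasks; each prints "pid@host / thread-name".
        sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8), 8)
          .foreach(i -> System.out.println(
                  ManagementFactory.getRuntimeMXBean().getName()
                          + " / " + Thread.currentThread().getName() + " -> task input " + i));

        sc.stop();
    }
}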
Hope this is helpful.
It is possible to run multithreaded code in Spark. Take this example of Java code in Spark:
AnyCollections.parallelStream().forEach(temo -> {
    // Add your Spark code here.
});
Based on the number of cores available to the driver, parallelStream will spawn multiple threads on the driver, and each thread can submit Spark work that the scheduler then runs in parallel across the executors.
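A fuller, self-contained version of the same idea might look like the sketch below; the input paths and app name are hypothetical. The point is that parallelStream gives you several driver threads, and Spark's scheduler is thread-safe, so each thread can submit its own action and the resulting jobs run concurrently on the cluster:
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ParallelJobSubmission {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("parallel-submission-demo"));

        // Hypothetical input paths; replace with your own.
        List<String> paths = Arrays.asList("hdfs:///data/a", "hdfs:///data/b", "hdfs:///data/c");

        // Each driver thread submits an independent count() action;
        // Spark schedules the resulting jobs concurrently across the executors.
        paths.parallelStream().forEach(path ->
                System.out.println(path + " -> " + sc.textFile(path).count() + " lines"));

        sc.stop();
    }
}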

Spark running on YARN - What does a real life example's workflow look like?

I have been reading up on Hadoop, YARN and SPARK. What makes sense to me thus far is what I have summarized below.
Hadoop MapReduce: the client chooses an input file and hands it off to Hadoop (or YARN). Hadoop takes care of splitting the file based on the user's InputFormat and stores it on as many nodes as are available and configured. The client then submits a job (map-reduce) to YARN, which copies the jar to the available data nodes and executes the job. YARN is the orchestrator that takes care of all the scheduling and running of the actual tasks.
Spark: given a job, input and a bunch of configuration parameters, it can run your job, which could be a series of transformations, and provide you the output.
I also understand that MapReduce is a batch-based processing paradigm and Spark is more suited to micro-batch or stream-based data.
There are a lot of articles that talk about how Spark can run on YARN and how they are complementary, but none have managed to help me understand how those two come together during an actual workflow. For example, when a client has a job to submit that reads a huge file and does a bunch of transformations, what does the workflow look like when using Spark on YARN? Let us assume that the client's input file is a 100GB text file. Please include as much detail as possible.
Any help with this would be greatly appreciated
Thanks
Kay
Let's assume the large file is stored in HDFS. In HDFS the file is divided into blocks of some size (128 MB by default).
That means your 100GB file will be divided into 800 blocks. Each block is replicated and can be stored on different nodes in the cluster.
When reading the file with a Hadoop InputFormat, the list of splits (with their locations) is obtained first. Then one task is created per split, so you get 800 parallel tasks that are executed by the runtime.
Basically the input process is the same for MapReduce and Spark, because both of them use Hadoop InputFormats.
Both of them will process each InputSplit in a separate task. The main difference is that Spark has a richer set of transformations and can optimize the workflow if there is a chain of transformations that can be applied at once, as opposed to MapReduce, where there are always only the map and reduce phases.
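You can verify the split-to-task mapping from Spark itself. The path below is hypothetical; with a 128 MB block size, a 100GB text file should report roughly 800 partitions, i.e. one partition (and hence one task) per HDFS block/input split:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SplitCountExample {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("split-count"));

        // Hypothetical path to the 100GB file from the question.
        JavaRDD<String> lines = sc.textFile("hdfs:///data/huge-100gb.txt");

        // One partition per input split -> roughly 800 for 100GB of 128 MB blocks.
        System.out.println("partitions = " + lines.getNumPartitions());

        sc.stop();
    }
}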
YARN stands for "Yet Another Resource Negotiator". When a new job with some resource requirements (memory, processors) is submitted, it is the responsibility of YARN to check whether the needed resources are available on the cluster. If other jobs running on the cluster are taking up too much of the resources, the new job will be made to wait until the previous jobs complete and resources become available.
YARN will allocate enough containers in the cluster for the workers and also one for the Spark driver. In each of these containers a JVM is started with the given resources. Each Spark worker can process multiple tasks in parallel (depending on the configured number of cores per executor).
e.g.
If you set 8 cores per Spark executor, YARN tries to allocate 101 containers in the cluster to run 100 Spark workers + 1 Spark master (driver). Each of the workers will process 8 tasks in parallel (because of the 8 cores).
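Expressed as Spark configuration (a sketch with example values, equivalent to the spark-submit flags --num-executors and --executor-cores), that sizing would look roughly like this:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class YarnSizingExample {
    public static void main(String[] args) {
        // 100 executors x 8 cores = up to 800 concurrent tasks, roughly one wave
        // for the ~800 input splits, plus one extra YARN container for the
        // ApplicationMaster (which in yarn-cluster mode also hosts the driver).
        SparkConf conf = new SparkConf()
                .setAppName("yarn-sizing-demo")          // placeholder name
                .set("spark.executor.instances", "100")  // --num-executors 100
                .set("spark.executor.cores", "8")        // --executor-cores 8
                .set("spark.executor.memory", "4g");     // example value only
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... read the 100GB input and apply transformations here ...
        sc.stop();
    }
}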

Can map task and reduce task be in the same node?

I am new to Hadoop. Since the data transfer between the map node and the reduce node may reduce the efficiency of MapReduce, why aren't the map task and the reduce task put together on the same node?
Actually you can run map and reduce in the same JVM if the data is "small" enough. This is possible in Hadoop 2.0 (aka YARN) and is now called an uber task.
From the great "Hadoop: The Definitive Guide" book:
If the job is small, the application master may choose to run the tasks in the same JVM as itself. This happens when it judges the overhead of allocating and running tasks in new containers outweighs the gain to be had in running them in parallel, compared to running them sequentially on one node. (This is different from MapReduce 1, where small jobs are never run on a single tasktracker.) Such a job is said to be uberized, or run as an uber task.
The amount of data to be processed is usually too large, and that's why map and reduce run on separate nodes. If the amount of data to be processed is small, then you can definitely run map and reduce on the same node.
Hadoop is usually used when the amount of data is very large, in which case, for high availability and concurrency, separate nodes are needed for both the map and reduce operations.
Hope this clears your doubt.
An uber job occurs when the mappers and reducers of a job are combined to execute inside the ApplicationMaster.
So, assuming the job to be executed has at most 9 mappers and at most 1 reducer, the ResourceManager (RM) creates an ApplicationMaster and the job executes entirely within the ApplicationMaster, using its own JVM.
SET mapreduce.job.ubertask.enable=TRUE;
So the advantage of an uberised job is that the round-trip overhead of the ApplicationMaster asking the ResourceManager (RM) for containers, and the RM allocating those containers back to the ApplicationMaster, is eliminated.

Comparison between Hadoop Classic and Yarn

I have two clusters, each running a different version of Hadoop. I am working on a POC where I need to understand how YARN provides the capability to run multiple applications simultaneously, which was not possible with the classic MapReduce framework.
Hadoop Classic:
I have a wordcount.jar file and executed it on a single cluster (2 mappers & 2 reducers). I started two jobs in parallel; the one that was lucky enough to start first got both mappers, completed its tasks, and then the second job started. This is the expected behaviour.
Hadoop Yarn:
Same wordcount.jar with a different cluster (4 cores, so 4 machines in total). As YARN does not pre-assign mappers and reducers, any core can be used as a mapper or a reducer. Here also I submitted two jobs in parallel.
Expected behaviour: both jobs should start with 2 mappers each, or whatever allocation the resource manager assigns, but at least both jobs should start.
Reality: one job starts with 3 mappers and 1 reducer. The second job waits until the first is completed.
Can someone please help me understand this behaviour, and whether the parallelism is better demonstrated on a multi-node cluster?
Not sure if this is the exact reason, but the classic Hadoop and YARN architectures use different schedulers. Classic Hadoop uses the JobQueueTaskScheduler, while YARN uses the CapacityScheduler by default.
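If you want to check which scheduler your ResourceManager is using, the property below is the standard one (normally set cluster-side in yarn-site.xml, not in client code); this is just a small sketch that reads it from whatever configuration is on the classpath:
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SchedulerCheck {
    public static void main(String[] args) {
        YarnConfiguration conf = new YarnConfiguration();
        // Falls back to the YARN default (CapacityScheduler) if nothing is configured.
        System.out.println(conf.get(
                "yarn.resourcemanager.scheduler.class",
                "org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler"));
    }
}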

Can Hadoop distribute tasks and code base?

I'm starting to play around with Hadoop (but don't have access to a cluster yet, so I'm just playing around in standalone mode). My question is: once it's in a cluster setup, how are tasks distributed, and can the code base be transferred to new nodes?
Ideally, I would like to run large batch jobs and, if I need more capacity, add new nodes to the cluster, but I'm not sure whether I'll have to copy the same code that's running locally or do something special so that I can add capacity while the batch job is running. I thought I could store my codebase on HDFS and have it pulled locally to run every time I need it, but that still means I need some kind of initial script on the server and have to run it manually first.
Any suggestions or advice on if this is possible would be great!
Thank you.
When you schedule a mapreduce job using the hadoop jar command, the jobtracker will determine how many mappers are needed to execute your job. This is usually determined by the number of blocks in the input file, and this number is fixed, no matter how many worker nodes you have. It then will enlist one or more tasktrackers to execute your job.
The application jar (along with any other jars that are specified using the -libjars argument), is copied automatically to all of the machines running the tasktrackers that are used to execute your jars. All of that is handled by the Hadoop infrastructure.
Adding additional tasktrackers will increase the parallelism of your job assuming that there are as-yet-unscheduled map tasks. What it will not do is automatically re-partition the input to parallelize across additional map capacity. So if you have a map capacity of 24 (assuming 6 mappers on each of 4 data nodes), and you have 100 map tasks with the first 24 executing, and you add another data node, you'll get some additional speed. If you have only 12 map tasks, adding machines won't help you.
Finally, you need to be aware of data reference locality. Since the data should ideally be processed on the same machines that store it initially, adding new task trackers will not necessarily add proportional processing speed, since the data will not be local on those nodes initially and will need to be copied over the network.
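For completeness, the usual way to make -libjars (and the other generic options) take effect is to write the driver with Tool/ToolRunner. The class and jar names below are placeholders; this is only a minimal sketch:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf(); // -libjars / -files / -D options are already applied here
        // ... build and submit the Job as usual ...
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // Running e.g.:  hadoop jar myjob.jar MyDriver -libjars extra.jar <in> <out>
        // lets Hadoop ship extra.jar to the nodes that actually run the tasks.
        System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
    }
}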
I do not quite agree with Daniel's reply.
Primarily because, if "on starting a job, jar code will be copied to all the nodes that the cluster knows of" were true, then even if you use 100 mappers and there are 1000 nodes, the code for every job would always be copied to all the nodes. That does not make sense.
Instead, Chris Shain's reply makes more sense: whenever the JobScheduler on the JobTracker chooses a job to be executed and identifies a task to be executed by a particular datanode, it somehow tells that tasktracker where to copy the codebase from.
Initially (before the MapReduce job starts), the codebase is copied to multiple locations, as defined by the mapred.submit.replication parameter. Hence a tasktracker can copy the codebase from any of several locations, a list of which may be sent to it by the jobtracker.
Before attempting to build a Hadoop cluster I would suggest playing with Hadoop using Amazon's Elastic MapReduce.
With respect to the problem that you are trying to solve, I am not sure that Hadoop is a proper fit. Hadoop is useful for trivially parallelizable batch jobs: parsing thousands (or more) of documents, sorting, re-bucketing data. Hadoop Streaming will allow you to create mappers and reducers using any language that you like, but the inputs and outputs must be in a fixed format. There are many uses but, in my opinion, process control was not one of the design goals.
[EDIT] Perhaps ZooKeeper is closer to what you are looking for.
You can add capacity to the batch job if you want, but it needs to be anticipated in your codebase. For example, if you have a mapper that works over a set of inputs that you want to spread across multiple nodes to relieve the pressure, you can do that. All of this can be done, but not with the default Hadoop install.
I'm currently working on a Nested Map-Reduce framework that extends the Hadoop codebase and allows you to spawn more nodes based on the inputs that the mapper or reducer gets. If you're interested, drop me a line and I'll explain more.
Also, when it comes to the -libjars option, this only works for the nodes that are assigned by the jobtracker as instructed by the job you write. So if you specify 10 mappers, -libjars will copy your code there. If you want to start with 10 but work your way up, the nodes you add will not have the code.
The easiest way to bypass this is to add your jar to the classpath in the hadoop-env.sh script. That way, whenever a job starts, the jar is copied to all the nodes that the cluster knows of.
