I wanted to know if mapreduce.* parameters are applicable in Spark.
As far as I know, in Spark there is no buffer for the map output, and for the reduce task the whole process is also different. Parameters like mapreduce.task.io.sort.mb, mapreduce.reduce.shuffle.input.buffer.percent or mapreduce.reduce.input.buffer.percent control these kinds of buffers.
I'm working on optimising parameters for Spark tasks/jobs running in a Hadoop/YARN cluster.
Is it safe to say that these mapreduce parameters don't matter and that I should only care about spark.* parameters, since the map, shuffle and reduce parts are different?
It's safe because Spark doesn't use MapReduce as its processing engine; it interacts directly with YARN to submit operations. Thus, when you use Spark, there is no MapReduce job scheduled; you have a Spark application and Spark jobs instead.
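So tune the spark.* settings instead. For example (a rough sketch, not tuning advice: the keys below are real Spark settings, but the values are only illustrative), the Spark-side counterparts of those shuffle buffers can be set like this in Java:

import org.apache.spark.SparkConf;

public class ShuffleTuningExample {
    public static SparkConf tunedConf() {
        return new SparkConf()
                .setAppName("shuffle-tuning-example")
                // in-memory buffer used when writing shuffle (map output) files,
                // the closest analogue to the map-side sort/spill buffers in MapReduce
                .set("spark.shuffle.file.buffer", "64k")
                // maximum size of map output fetched simultaneously by each reduce task,
                // roughly the counterpart of the reduce-side shuffle input buffers
                .set("spark.reducer.maxSizeInFlight", "96m")
                // fraction of the JVM heap shared by execution (shuffles, joins) and storage
                .set("spark.memory.fraction", "0.6");
    }
}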
I am a bit confused about what the term "MapReduce" means with respect to Hadoop 1.x. In this context I come across various terms like JobTracker and TaskTracker (the daemons in MapReduce). Now, when we say MapReduce, does it refer to these daemons or to the API which a developer uses to code MapReduce applications?
Does the user application execute on the TaskTracker or the JobTracker? Is MapReduce itself a run-time environment?
Can anyone please help me understand this in simple words?
MapReduce is the programming model for data processing (in Hadoop).
Its implementation in Hadoop 1.x is often referred to as the classic MapReduce implementation (or MapReduce v1), which uses Hadoop's JobTracker and TaskTrackers for the execution of jobs, and its corresponding APIs (user-facing, client-side features) for writing them.
JobTracker coordinates the Job run.
TaskTrackers run the tasks that the job has been split into.
To sum up, the MapReduce APIs determine how a MapReduce program has to be written, whereas the implementation determines how a job written using this programming model is executed.
The YARN implementation of the MapReduce programming model (MapReduce v2) keeps largely the same APIs for writing jobs but differs in the daemons (ResourceManager, ApplicationMaster and NodeManagers) used for execution.
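To make the API/implementation distinction concrete, here is a minimal sketch of a user-facing Mapper (the class name and tokenizing logic are just an example, not from the question): it is written purely against the MapReduce API, and the same code runs whether the underlying implementation is v1 (JobTracker/TaskTrackers) or v2 (YARN daemons).

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Written against the MapReduce API only; nothing here refers to JobTracker or YARN daemons.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit (word, 1) for every token in the input line.
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}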
I have an idea about multi-threading in general, but I am not sure how it is used in Hadoop.
Based on my knowledge, YARN is responsible for managing/controlling Spark/MapReduce job resources, and I can't think of multi-threading there. I am not sure whether it is used anywhere else in the Hadoop ecosystem.
I would appreciate it if anybody could provide some information on this.
Many thanks,
Actually, YARN is responsible for managing the resource allocation and de-allocation for containers requested by the Application Master (MR AppMaster or Spark driver). So the RPC between them is all about negotiating resource agreements; it does not consider any details of how tasks run inside MapReduce or Spark.
For MapReduce on Hadoop, each task (mapper or reducer) is a single process running in its own JVM; it doesn't employ any multi-threading here.
For Spark, each executor is actually composed of many worker threads. Here, each Spark task corresponds to one task (a single process) in MapReduce. So Spark is implemented on a multi-threaded model to lower the JVM overhead and the cost of shuffling data between tasks.
Based on my experience, multi-threaded models lower the overhead but suffer from a higher cost of fault tolerance: if an executor in Spark fails, all the tasks running inside that executor have to re-run, whereas only a single task needs to re-run in MapReduce. Spark also suffers from heavy memory pressure, because all the tasks inside an executor need to cache data in the form of RDDs, while a MapReduce task only processes one block at a time.
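If you want to see the threading model for yourself, here is a small sketch (a made-up standalone app, assuming it is submitted with spark-submit): every element prints the JVM name and thread name it is processed on, and elements handled by the same executor show the same JVM but different threads.

import java.lang.management.ManagementFactory;
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ExecutorThreadDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("executor-thread-demo");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // 8 partitions -> 8 tasks; tasks landing on the same executor share one JVM
            // but run on different worker threads.
            sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8), 8)
              .foreach(i -> System.out.println(
                  "element " + i
                  + " -> JVM " + ManagementFactory.getRuntimeMXBean().getName()
                  + ", thread " + Thread.currentThread().getName()));
        }
    }
}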
Hope this is helpful.
It is possible to run multithreaded code in Spark. Take this example of Java code in Spark:

AnyCollections.parallelStream().forEach(item -> {
    // Add your Spark code here.
});

Now, based on the number of cores available to the driver, the parallel stream will run multiple threads that submit Spark work concurrently and do the stuff in parallel.
By using YARN, we can run non-MapReduce applications.
But how does that work?
In HDFS, everything gets stored in blocks. For each block, one mapper task would get created, so that together they process the whole dataset.
But for a non-MapReduce application, how will it process the datasets on the different data nodes without using MapReduce?
Please explain this to me.
Do not confuse the MapReduce paradigm with other applications like, for instance, Spark. Spark can run under YARN but does not use mappers or reducers.
Instead it uses executors; these executors are aware of data locality, the same way MapReduce is.
The Spark driver will start executors on data nodes and will try to keep data locality in mind when doing so.
Also, do not confuse MapReduce's default behaviour with mandatory behaviour: you do not need to have one mapper per input split.
Also, HDFS and MapReduce are two different things: HDFS is just the storage layer, while MapReduce handles processing.
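As a concrete (hypothetical) illustration, here is what a non-MapReduce application reading an HDFS file can look like with Spark; the file path is made up. Spark asks HDFS for the block locations and schedules its executor tasks close to the data, with no mappers or reducers involved.

import org.apache.spark.sql.SparkSession;

public class ReadHdfsWithoutMapReduce {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("read-hdfs-without-mapreduce")
                .getOrCreate();

        // Each HDFS block typically becomes one partition, and partitions are
        // processed by executor threads scheduled with data locality in mind.
        long lines = spark.read().textFile("hdfs:///data/input/huge.txt").count();
        System.out.println("Line count: " + lines);

        spark.stop();
    }
}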
I have been reading up on Hadoop, YARN and Spark. What makes sense to me thus far is what I have summarized below.
Hadoop MapReduce: The client chooses an input file and hands it off to Hadoop (or YARN). Hadoop takes care of splitting the file based on the user's InputFormat and stores it on as many nodes as are available and configured. The client then submits a job (map-reduce) to YARN, which copies the jar to the available data nodes and executes the job. YARN is the orchestrator that takes care of all the scheduling and running of the actual tasks.
Spark: Given a job, input and a bunch of configuration parameters, it can run your job, which could be a series of transformations, and provide you the output.
I also understand that MapReduce is a batch-based processing paradigm and Spark is better suited for micro-batch or stream-based data.
There are a lot of articles that talk about how Spark can run on YARN and how they are complementary, but none have managed to help me understand how those two come together during an actual workflow. For example, when a client has a job to submit, to read a huge file and do a bunch of transformations, what does the workflow look like when using Spark on YARN? Let us assume that the client's input file is a 100GB text file. Please include as many details as possible.
Any help with this would be greatly appreciated
Thanks
Kay
Let's assume the large file is stored in HDFS. In HDFS the file is divided into blocks of some size (128 MB by default).
That means your 100GB file will be divided into 800 blocks. Each block will be replicated and can be stored on different nodes in the cluster.
When reading the file with a Hadoop InputFormat, a list of splits with their locations is obtained first. Then one task is created per split, so you get 800 tasks that can be executed in parallel by the runtime.
Basically the input process is the same for MapReduce and Spark, because both of them use Hadoop InputFormats.
Both of them will process each InputSplit in a separate task. The main difference is that Spark has a richer set of transformations and can optimize the workflow if there is a chain of transformations that can be applied at once, as opposed to MapReduce, where there are always only a map and a reduce phase.
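For example (a hypothetical word count with a made-up input path), the three narrow transformations below are pipelined by Spark into a single stage over each input split, and only the reduceByKey introduces a shuffle boundary, which is roughly where the reduce phase would sit in MapReduce:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class PipelinedTransformations {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("pipelined-transformations");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaPairRDD<String, Integer> counts = sc.textFile("hdfs:///data/input/huge.txt")
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator()) // narrow
                    .filter(word -> !word.isEmpty())                               // narrow
                    .mapToPair(word -> new Tuple2<>(word, 1))                      // narrow
                    .reduceByKey(Integer::sum); // shuffle boundary (the "reduce" part)
            counts.saveAsTextFile("hdfs:///data/output/wordcounts");
        }
    }
}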
YARN stands for "Yet Another Resource Negotiator". When a new job with some resource requirements (memory, processors) is submitted, it is the responsibility of YARN to check whether the needed resources are available on the cluster. If other jobs running on the cluster are taking up too much of the resources, then the new job will be made to wait until the previous jobs complete and resources become available.
YARN will allocate enough containers in the cluster for the workers and also one for the Spark driver. In each of these containers a JVM is started with the given resources. Each Spark worker can process multiple tasks in parallel (depending on the configured number of cores per executor).
For example, if you set 8 cores per Spark executor, YARN tries to allocate 101 containers in the cluster to run 100 Spark workers plus 1 container for the Spark driver. Each of the workers will process 8 tasks in parallel (because of the 8 cores).
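In code, that sizing could be expressed like this (a sketch using the numbers from the example above; the app name and memory value are assumptions):

import org.apache.spark.SparkConf;

public class SizingExample {
    public static SparkConf sizingConf() {
        return new SparkConf()
                .setAppName("process-100gb-file")
                .set("spark.executor.instances", "100") // 100 executor containers
                .set("spark.executor.cores", "8")       // 8 parallel tasks per executor JVM
                .set("spark.executor.memory", "8g");    // per-container memory (assumed value)
        // 100 executors x 8 cores = 800 task slots, one per input split,
        // plus 1 extra container for the driver / ApplicationMaster = 101 containers.
    }
}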
Hi, I am a big data newbie. I searched all over the internet to find out what exactly uber mode is. The more I searched, the more confused I got. Can anybody please help me by answering my questions?
What does uber mode do?
Does it work differently in mapred 1.x and 2.x?
And where can I find the setting for it?
What is uber mode in Hadoop 2?
Normally, mappers and reducers run in containers granted by the ResourceManager (RM); a separate container is created for each mapper and reducer.
The uber configuration allows you to run the mappers and reducers in the same process as the ApplicationMaster (AM).
Uber jobs:
Uber jobs are jobs that are executed within the MapReduce ApplicationMaster, rather than communicating with the RM to create the mapper and reducer containers.
The AM runs the map and reduce tasks within its own process and avoids the overhead of launching and communicating with remote containers.
Why?
If you have a small dataset, or you want to run MapReduce on a small amount of data, the uber configuration will help you out by reducing the additional time that MapReduce normally spends on launching the mapper and reducer phases.
Can I configure uber mode for every MapReduce job?
As of now, only map-only jobs and jobs with one reducer are supported.
An uber job occurs when the mappers and reducers are combined to run inside a single container (the AM's container). There are four core settings around the configuration of uber jobs in mapred-site.xml. Configuration options for uber jobs:
mapreduce.job.ubertask.enable
mapreduce.job.ubertask.maxmaps
mapreduce.job.ubertask.maxreduces
mapreduce.job.ubertask.maxbytes
You can find more details here: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.15/bk_using-apache-hadoop/content/uber_jobs.html
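For instance (a sketch; the class and job names are made up, and the thresholds shown are the usual defaults), you can set these programmatically when building a job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SmallUberJob {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.job.ubertask.enable", true); // allow uber mode
        conf.setInt("mapreduce.job.ubertask.maxmaps", 9);       // at most 9 map tasks
        conf.setInt("mapreduce.job.ubertask.maxreduces", 1);    // at most 1 reduce task
        // mapreduce.job.ubertask.maxbytes defaults to the HDFS block size;
        // set it explicitly only if you need a different threshold.
        return Job.getInstance(conf, "small-uber-job");
    }
}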
In terms of Hadoop 2.x, uber jobs are the jobs which are launched in the MapReduce ApplicationMaster itself, i.e. no separate containers are created for the map and reduce tasks, and hence the overhead of creating containers and communicating with them is saved.
As far as the way it works (with Hadoop 1.x and 2.x) is concerned, I suppose the difference is only observable when it comes to the terminologies of 1.x and 2.x; there is no difference in how it works.
The configuration params are the same as those mentioned by Navneet Kumar in his answer.
PS: Use it only with small datasets.
Pretty good answers are given for "What is Uber Mode?"
Just to add some more information for "Why?"
The application master decides how to run the tasks that make up the MapReduce job. If the job is small, the application master may choose to run the tasks in the same JVM as itself. This happens when it judges that the overhead of allocating and running tasks in new containers outweighs the gain of running them in parallel, compared to running them sequentially on one node.
Now the question can be raised: "What qualifies as a small job?"
By default, a small job is one that has fewer than 10 mappers, only one reducer, and an input size that is less than the size of one HDFS block.