Multi-threading in Hadoop/Spark - hadoop

I have an idea about Multi-threading in general but not sure how it is used in Hadoop.
Based on my knowledge, Yarn is responsible for managing/controlling Spark/Mapreduce job resources, can't think of Multi-threading here. Not sure whether it can be used anywhere else in Hadoop Eco System.
I would appreciate if anybody could provide some information on this.
Many thanks,

actually, YARN is responsible for managing the resource allocation and de-allocation for containers requested by Application Master(MR-AppMaster or Spark-Driver). So the RPC between them are all about negotiation of resource agreement and it does not consider any details how tasks are running inside MapReduce and Spark.
For MapReduce-Hadoop, each task(mapper or reducer) is a single process running on a JVM, it doesn't employ any multi-threaded here.
For Spark, each executor are actually composed of many worker threads. Here each Spark task is corresponding to each task(single process) in MapReduce. So Spark does implement based on multi-threads models for lower
overhead of JVM and data shuffling between tasks.
Based on my experiences, Multi-threads models lower the overhead but suffers from the huge cost of fault tolerance. If an executor in Spark fails, all the tasks running inside the executor have to re-run but only single task needs to re-run for MapReduce. Also Spark suffers from huge memory pressure because all the tasks in side a executor needs to cache data in terms of RDD. But Mapreduce task only process one block at a time.
Hope this is helpful.

It is possible to run multithreaded code in Spark. Take an example of Java code in Spark
AnyCollections.parallelStream().forEach(temo -> {
// Add your spark code here.
}
Now based on the number of cores in the driver it will spawn multiple executors and do stuff in parallel.

Related

Spark performance advantage vs. Hadoop MapReduce [duplicate]

This question already has answers here:
Why is Spark faster than Hadoop Map Reduce
(2 answers)
Closed 5 years ago.
I am hearing that Spark has an advantage over hadoop due to spark's in-memory computation. However, one of the obvious problems is not all the data can fit into one computers memory. So is Spark then limited to smaller datasets. At the same time, there is the notion of spark cluster. So I am not following the purported advantages of spark over hadoop MR.
Thanks
Hadoop MapReduce has been the mainstay on Hadoop for batch jobs for a long time. However, two very promising technologies have emerged, Apache Drill, which is a low-density SQL engine for self-service data exploration and Apache Spark, which is a general-purpose compute engine that allows you to run batch, interactive and streaming jobs on the cluster using the same unified frame. Let's dig a little bit more into Spark.
To understand Spark, you have to understand really three big concepts.
First is RDDs, the resilient distributed data sets. This is really a representation of the data that's coming into your system in an object format and allows you to do computations on top of it. RDDs are resilient because they have a long lineage. Whenever there's a failure in the system, they can recompute themselves using the prior information using lineage.
The second concept is transformations. Transformations is what you do to RDDs to get other resilient RDDs. Examples of transformations would be things like opening a file and creating an RDD or doing functions like printer that would then create other resilient RDDs.
The third and the final concept is actions. These are things which will do where you're actually asking for an answer that the system needs to provide you, for instance, count or asking a question about what's the first line that has Spark in it. The interesting thing with Spark is that it does lazy elevation which means that these RDDs are not loaded and pushed into the system as in when the system encounters an RDD but they're only done when there is actually an action to be performed.
One thing that comes up with RDDs is that when we come back to them being that they are resilient and in main memory is that how do they compare with distributed shared memory architectures and most of what are familiar from our past? There are a few differences. Let's go with them in a small, brief way. First of all, writes in RDDs are core of Spark. They are happening at an RDD level. Writes in distributor-shared memory are typically fine-grained. Reads and distributor-shared memory are fine-grained as well. Writes in RDD can be fine or course-grained.
The second piece is recovery. What happens if there is a part in the system, how do we recover it? Since RDDs build this lineage graph if something goes bad, they can go back and recompute based on that graph and regenerate the RDD. Lineage is used very strongly in RDDs to recovery. In distributor-shared memories we typically go back to check-pointing done at intervals or any other semantic check-pointing mechanism. Consistency is relatively trivial in RDDs because the data underneath it is assumed to be immutable. If, however, the data was changing, then consistency would be a problem here. Distributor-shared memory doesn't make any assumptions about mutability and, therefore, leaves the consistency semantics to the application to take care of.
At last let's look at the benefits of Spark:
Spark provides full recovery using lineage.
Spark is optimized in making computations as well as placing the computations optimally using the directory cyclic graph.
Very easy programming paradigms using the transformation and actions on RDDs as well as a ready-rich library support for machine learning, graphics and recently data frames.
At this point a question comes up. If Spark is so great, does Spark actually replace Hadoop? The answer is clearly no because Spark provides an application framework for you to write your big data applications. However, it still needs to run on a storage system or on a no-SQL system.
Spark is never limited to smaller dataset and its not always about in-memorycomputation. Spark has very good number higher APIS . Spark can process the in GB as well. In my realtime experience i have used Spark to handle the streaming application where we usually gets the data in GB/Hour basic . And we have used Spark in Telecommunication to handle bigger dataset as well . Check this RDD Persistence how to accommodate bigger datasets.
In case of real world problem we can't solve them just by one MapReduce program which is having a Mapper class and a reducer class, We mostly need to build a pipeline. A pipeline will consists of multiple stages each having MapReduce program , and out put of one stage will be fed to one or multiple times to the subsequent stages. And this is a pain because of the amount of IO it involves.
In case of MapReduce there are these Map and Reduce tasks subsequent to which there is a synchronization barrier and one needs to preserve the data to the disc. This feature of MapReduce framework was developed with the intent that in case of failure the jobs can be recovered but the drawback to this is that, it does not leverage the memory of the Hadoop cluster to the maximum. And this becomes worse when you have a iterative algorithm in your pipeline. Every iteration will cause significant amount of Disk IO.
So in order to solve the problem , Spark introduced a new Data Structure called RDD . A DS that can hold the information like how the data can be read from the disk and what to compute. Spark also provided easy programming paradigm to create pipeline(DAG) by transforming RDDs . And what you get it a series of RDD which knows how to get the data and what to compute.
Finally when an Action is invoked Spark framework internally optimize the pipeline , group together the portion that can be executed together(map phases), and create a final optimized execution plan from the logical pipeline. And then executes it. It also provides user the flexibility to select the data user wanted to be cached. Hence spark is able to achieve near about 10 to 100 times faster batch processing than MapReduce.
Spark advantages over hadoop.
As spark tasks across stages can be executed on same executor nodes, the time to spawn the Executor is saved for multiple task.
Even if you have huge memory, MapReduce can never make any advantage of caching data in memory and using the in memory data for subsequent steps.
Spark on other hand can cache data if huge JVM is available to it. Across stages the inmemory data is used.
In Spark task run as threads on same executor, making the task memory footprint light.
In MapReduce the Map of reduce Task are processes and not threads.
Spark uses efficient serialization format to store data on disk.
Follow this for detail understanding http://bytepadding.com/big-data/spark/understanding-spark-through-map-reduce/

Spark and Map-Reduce together

What is the best approach to run Spark on a cluster that runs map reduce jobs?
First question is about co-locality with data. When I start a Spark application, it allocates executors, right? How does it know where to allocate them so they are in the same nodes as the data that jobs will need? (one job may want one piece of data while the job after it may need another)
If I keep the Spark application up, then the executors take slots from the machines in the cluster does it mean that for co-locality I need to have a Spark executor on every node?
With executors running, it means that there are less resources for my map reduce jobs, right? I can stop and start the Spark application for every job, but then it takes away from the speed advantages of having the executors up and running, correct (also the benefits of hotspot for long running processes?)
I have read that container re-sizing (YARN-1197) will help, but doesn't that just mean that executors will stop and start? Isn't that the same as stopping the spark application (in other words, if there are no live executors, what is the benefit of having the Spark application up vs shutting it down and starting when a job requires executors)
Data Locality of executors : Spark does not deal with Data locality while launching executors but while launching tasks on them. So you might need to have executors on each data node(HDFS Redundancy can help you even if you dont have executors on each node).
Long Running process: Whether to shutdown your application or not depends on the use case. If you want to serve real time application requests/spark streaming you will not want to shut down the spark. But if you are doing batch processing you should shut down your executor. For Caching of data across jobs you should consider either HDFS Cache or tachyon. You can also consider dynamic allocation of spark with which you can free executors if they are not used for some time.(http://spark.apache.org/docs/latest/configuration.html#dynamic-allocation).
YARN-1197 - will help in releasing the number of cpus/memory you allocated to containers. I am not sure though if spark supports this or not.

Spark running on YARN - What does a real life example's workflow look like?

I have been reading up on Hadoop, YARN and SPARK. What makes sense to me thus far is what I have summarized below.
Hadoop MapReduce: Client choses an input file and hands if off to
Hadoop (or YARN). Hadoop takes care of splitting the flie based on
user's InputFormat and stores it on as many nodes that are available
and configured Client submits a job (map-reduce) to YARN, which
copeies the jar to available Data Nodes and executes the job. YARN is
the orchestrator that takes care of all the scheduling and running of
the actual tasks
Spark: Given a job, input and a bunch of configuration parameters, it
can run your job, which could be a series of transformations and
provide you the output.
I also understand MapReduce is a batch based processing paradigm and
SPARK is more suited for micro batch or stream based data.
There are a lot of articles that talks about how Spark can run on YARN and how they are complimentary, but none have managed to help me understand how those two come together during an acutal workflow. For example when a client has a job to submit, read a huge file and do a bunch of transformations what does the workflow look like when using Spark on YARN. Let us assume that the client's input file is a 100GB text file. Please include as much details as possible
Any help with this would be greatly appreciated
Thanks
Kay
Let's assume the large file is stored in HDFS. In HDFS the file is divided into blocks of some size (default 128 MB).
That means your 100GB file will be divided into 800 blocks. Each block will be replicated and can be stored on different node in the cluster.
When reading the file with Hadoop InputFormat list of splits with location is obtained first. Then there is created one task per each splits. That you will get 800 parallel tasks that are executed by runtime.
Basically the input process is the same for MapReduce and Spark, because both of the use Hadoop Input Formats.
Both of them will process each InputSplit in separate task. The main difference is that Spark has more rich set of transformations and can optimize the workflow if there is a chain of transformations that can be applied at once. As opposed to MapReduce where is always map and reduce phase only.
YARN stands for "Yet another resource negotiator". When a new job with some resource requirement (memory, processors) is submitted it is the responsibility of YARN to check if the needed resources are available on the cluster. If other jobs are running on the cluster are taking up too much of the resources then the new job will be made to wait till the prevoius jobs complete and resources are available.
YARN will allocate enough containers in the cluster for the workers and also one for the Spark driver. In each of these containers JVM is started with given resources. Each Spark worker can process multiple tasks in parallel (depends on the configured number of cores per executor).
e.g.
If you set 8 cores per Spark executor, YARN tries to allocated 101 containers in the cluster tu run 100 Spark workers + 1 Spark master (driver). Each of the workers will process 8 tasks in parallel (because of 8 cores).

What is the purpose of "uber mode" in hadoop?

Hi I am a big data newbie. I searched all over the internet to find what exactly uber mode is. The more I searched the more I got confused. Can anybody please help me by answering my questions?
What does uber mode do?
Does it works differently in mapred 1.x and 2.x?
And where can I find the setting for it?
What is UBER mode in Hadoop2?
Normally mappers and reducers will run by ResourceManager (RM), RM will create separate container for mapper and reducer.
Uber configuration, will allow to run mapper and reducers in the same process as the ApplicationMaster (AM).
Uber jobs :
Uber jobs are jobs that are executed within the MapReduce ApplicationMaster. Rather then communicate with RM to create the mapper and reducer containers.
The AM runs the map and reduce tasks within its own process and avoided the overhead of launching and communicate with remote containers.
Why
If you have a small dataset or you want to run MapReduce on small amount of data, Uber configuration will help you out, by reducing additional time that MapReduce normally spends in mapper and reducers phase.
Can I configure an Uber for all MapReduce job?
As of now,
map-only jobs and
jobs with one reducer are supported.
Uber Job occurs when multiple mapper and reducers are combined to use a single container. There are four core settings around the configuration of Uber Jobs in the mapred-site.xml. Configuration options for Uber Jobs:
mapreduce.job.ubertask.enable
mapreduce.job.ubertask.maxmaps
mapreduce.job.ubertask.maxreduces
mapreduce.job.ubertask.maxbytes
You can find more details here: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.15/bk_using-apache-hadoop/content/uber_jobs.html
In terms of hadoop2.x, Uber jobs are the jobs which are launched in mapreduce ApplicationMaster itself i.e. no separate containers are created for map and reduce jobs and hence the overhead of creating containers and communicating with them is saved.
As far as working (with hadoop 1.x and 2.x) is concerned, I suppose the difference is only observable when it comes to terminologies of 1.x and 2.x, no difference in working.
Configuration params are same as those mentioned by Navneet Kumar in his answer.
PS: Use it only with small dataset.
Pretty good answers are given for "What is Uber Mode?"
Just to add some more information for "Why?"
The application master decides how to run the tasks that make
up the MapReduce job. If the job is small, the application master may choose to run the tasks in the same JVM as itself. This happens when it judges the overhead of allocating and running tasks in new containers outweighs the gain in running them in parallel, when compared to running them sequentially on one node.
Now, the questions could be raised as "What qualifies as a small job?
By default, a small job is one that has less than 10 mappers, only one reducer, and an input size that is less than the size of one HDFS block.

Differences between MapReduce and Yarn

I was searching about hadoop and mapreduce with respect to straggler problems and the papers in this problem
but yesterday I found that there is hadoop 2 with Yarn ,,
unfortunately no paper is talking about straggler problem in Yarn
So I want to know what is difference between MapReduce and Yarn in the part straggler?
is Yarn suffer from straggler problem?
and when MRmaster asks resource manger for resources , resource manger will give MRmaster all resources it needs or it is according to cluster computing capabilities ?
thanks so much,,
Here are the MapReduce 1.0 and MapReduce 2.0 (YARN)
MapReduce 1.0
In a typical Hadoop cluster, racks are interconnected via core switches. Core switches should connect to top-of-rack switches Enterprises using Hadoop should consider using 10GbE, bonded Ethernet and redundant top-of-rack switches to mitigate risk in the event of failure. A file is broken into 64MB chunks by default and distributed across Data Nodes. Each chunk has a default replication factor of 3, meaning there will be 3 copies of the data at any given time. Hadoop is “Rack Aware” and HDFS has replicated chunks on nodes on different racks. JobTracker assign tasks to nodes closest to the data depending on the location of nodes and helps the NameNode determine the ‘closest’ chunk to a client during reads. The administrator supplies a script which tells Hadoop which rack the node is in, for example: /enterprisedatacenter/rack2.
Limitations of MapReduce 1.0 – Hadoop can scale up to 4,000 nodes. When it exceeds that limit, it raises unpredictable behavior such as cascading failures and serious deterioration of overall cluster. Another issue being multi-tenancy – it is impossible to run other frameworks than MapReduce 1.0 on a Hadoop cluster.
MapReduce 2.0
MapReduce 2.0 has two components – YARN that has cluster resource management capabilities and MapReduce.
In MapReduce 2.0, the JobTracker is divided into three services:
ResourceManager, a persistent YARN service that receives and runs applications on the cluster. A MapReduce job is an application.
JobHistoryServer, to provide information about completed jobs
Application Master, to manage each MapReduce job and is terminated when the job completes.
TaskTracker has been replaced with the NodeManager, a YARN service that manages resources and deployment on a node. NodeManager is responsible for launching containers that could either be a map or reduce task.
This new architecture breaks JobTracker model by allowing a new ResourceManager to manage resource usage across applications, with ApplicationMasters taking the responsibility of managing the execution of jobs. This change removes a bottleneck and lets Hadoop clusters scale up to larger configurations than 4000 nodes. This architecture also allows simultaneous execution of a variety of programming models such as graph processing, iterative processing, machine learning, and general cluster computing, including the traditional MapReduce.
You say "Differences between MapReduce and YARN". MapReduce and YARN definitely different. MapReduce is Programming Model, YARN is architecture for distribution cluster. Hadoop 2 using YARN for resource management. Besides that, hadoop support programming model which support parallel processing that we known as MapReduce. Before hadoop 2, hadoop already support MapReduce. In short, MapReduce run above YARN Architecture. Sorry, i don't mention in part of straggler problem.
"when MRmaster asks resource manger for resources?"
when user submit MapReduce Job. After MapReduce job has done, resource will be back to free.
"resource manger will give MRmaster all resources it needs or it is according to cluster computing capabilities"
I don't get this question point. Obviously, the resources manager will give all resource it needs no matter what cluster computing capabilities. Cluster computing capabilities will influence on processing time.
There is no YARN in MapReduce 1. In MapReduce there is Yarn.
If for straggler problem you mean that if first guy waits 'something' which then causes more waits along a road who depends on that first guy then I guess there is always this problem in MR jobs. Getting allocated resources naturally participate to this problem along with all other things which may cause components to wait something.
Tez which is supposed to be a drop-in replacement for MR job runtime makes a things differently. Instead of doing task runs in a same way current MR Appmaster does it tries to use DAG of tasks which does a much better job of not getting into bad straggler problem.
You need to understand a relationship between MR and YARN. YARN is simply a dummy resource scheduler meaning it doesn't schedule 'tasks'. What it gives to MR Appmaster is a set or resources(in a sense it's only combination of memory and cpu and location). It's then MR Appmaster responsibility to decide what to do with those resources.

Resources