Spark and Map-Reduce together - hadoop

What is the best approach to run Spark on a cluster that runs map reduce jobs?
First question is about co-locality with data. When I start a Spark application, it allocates executors, right? How does it know where to allocate them so they are in the same nodes as the data that jobs will need? (one job may want one piece of data while the job after it may need another)
If I keep the Spark application up, then the executors take slots from the machines in the cluster. Does that mean that for co-locality I need to have a Spark executor on every node?
With executors running, there are fewer resources for my map reduce jobs, right? I can stop and start the Spark application for every job, but that takes away the speed advantage of having the executors up and running, correct? (And also the benefits of HotSpot for long-running processes?)
I have read that container re-sizing (YARN-1197) will help, but doesn't that just mean that executors will stop and start? Isn't that the same as stopping the Spark application? (In other words, if there are no live executors, what is the benefit of having the Spark application up vs shutting it down and starting it when a job requires executors?)

Data locality of executors: Spark does not take data locality into account when launching executors, only when scheduling tasks on them. So you might need executors on each data node (HDFS replication can help even if you don't have executors on every node, since a block can be read locally on any node that holds a replica).
Long-running process: Whether to shut down your application depends on the use case. If you serve real-time application requests or run Spark Streaming, you will not want to shut Spark down. But if you are doing batch processing, you should shut down your executors when the job is done. For caching data across jobs, consider either HDFS caching or Tachyon. You can also consider Spark's dynamic allocation, which releases executors that have not been used for some time (http://spark.apache.org/docs/latest/configuration.html#dynamic-allocation).
YARN-1197: this will help in releasing CPU and memory that you allocated to running containers. I am not sure, though, whether Spark supports this.
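As a rough illustration of the dynamic allocation option mentioned above, a spark-defaults.conf fragment might look like the following. The property names come from the Spark configuration page linked above; the values are placeholders picked for illustration, not recommendations, and on YARN the external shuffle service typically also has to be set up on the NodeManagers.

```properties
# Illustrative values only; tune for your own cluster.
spark.dynamicAllocation.enabled              true
# The external shuffle service lets executors be removed without losing shuffle files.
spark.shuffle.service.enabled                true
# Allow the application to shrink to zero executors while idle ...
spark.dynamicAllocation.minExecutors         0
# ... and cap how much of the cluster it can take from MapReduce jobs (assumed cap).
spark.dynamicAllocation.maxExecutors         20
# Release an executor after it has been idle this long.
spark.dynamicAllocation.executorIdleTimeout  60s
```

With settings like these the application itself stays up, but idle executors are handed back to YARN, which is the middle ground between a permanently running set of executors and restarting Spark for every job.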

Related

Multi-threading in Hadoop/Spark

I have a general idea of multi-threading but am not sure how it is used in Hadoop.
As far as I know, YARN is responsible for managing and controlling the resources of Spark/MapReduce jobs, so I can't see where multi-threading comes in here. I am not sure whether it is used anywhere else in the Hadoop ecosystem.
I would appreciate it if anybody could provide some information on this.
Many thanks,
Actually, YARN is responsible for managing the allocation and de-allocation of resources for containers requested by the Application Master (the MR AppMaster or the Spark driver). So the RPC between them is all about negotiating resources, and it does not consider any details of how tasks run inside MapReduce or Spark.
For Hadoop MapReduce, each task (mapper or reducer) is a single process running in its own JVM; it doesn't employ multi-threading here.
For Spark, each executor is actually composed of many worker threads. Each Spark task corresponds to one task (a single process) in MapReduce. So Spark is built on a multi-threaded model, for lower JVM overhead and cheaper data shuffling between tasks.
In my experience, the multi-threaded model lowers the overhead but makes fault tolerance much more expensive. If an executor in Spark fails, all the tasks running inside that executor have to re-run, whereas in MapReduce only a single task needs to re-run. Spark also suffers from heavy memory pressure, because all the tasks inside an executor need to cache data as RDDs, whereas a MapReduce task only processes one block at a time.
Hope this is helpful.
It is possible to run multithreaded code in Spark. Take this example of Java code in a Spark driver:
anyCollection.parallelStream().forEach(item -> {
    // Add your Spark code here.
});
Depending on the number of cores available to the driver, this runs the lambda on several driver threads, so the Spark jobs it submits are executed in parallel.
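To make the threading behaviour of parallelStream concrete, here is a minimal, Spark-free sketch. The class name and the runJob method are hypothetical stand-ins for "submit a Spark job for this input"; nothing here calls a real Spark API. It only shows that the lambda runs on several JVM threads of the calling process.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

final class ParallelDriverSketch {
    // Hypothetical stand-in for triggering one Spark job per input.
    static int runJob(int input) {
        return input * input;
    }

    // Runs all "jobs" through a parallel stream and adds up their results.
    static int sumOfJobs(List<Integer> inputs) {
        return inputs.parallelStream().mapToInt(ParallelDriverSketch::runJob).sum();
    }

    public static void main(String[] args) {
        List<Integer> inputs =
                IntStream.rangeClosed(1, 100).boxed().collect(Collectors.toList());

        // Record which threads execute the lambda; the count depends on the machine.
        Set<String> threads = ConcurrentHashMap.newKeySet();
        inputs.parallelStream().forEach(i -> threads.add(Thread.currentThread().getName()));

        System.out.println("threads used: " + threads.size());
        System.out.println("sum of results: " + sumOfJobs(inputs));
    }
}
```

The number of threads is bounded by the cores of the machine running this code (the driver, in a Spark application), not by the number of executors in the cluster.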

Data locality with Spark standalone and HDFS

I have a job that needs to access parquet files on HDFS and I would like to minimise the network activity. So far I have HDFS Datanodes and Spark Workers started on the same nodes, but when I launch my job the data locality is always ANY, whereas it should be NODE_LOCAL since the data is distributed among all the nodes.
Is there any option I should configure to tell Spark to start the tasks where the data is?
The property you are looking for is spark.locality.wait. If you increase its value, Spark will schedule tasks more locally: it waits longer for a free slot on the worker where the data resides instead of running the task on another worker just because that one is busy. However, setting the value too high might result in longer execution times, because you do not utilise the workers efficiently.
Also have a look here:
http://spark.apache.org/docs/latest/configuration.html
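For illustration, the setting could be written in spark-defaults.conf roughly as below. The property names are taken from the Spark configuration page; the values are assumptions chosen only to show the shape (the documented default for spark.locality.wait is 3s).

```properties
# Wait up to 10s for a more local slot before falling back to a less local level.
spark.locality.wait       10s
# Optional per-level overrides; when unset they inherit spark.locality.wait.
spark.locality.wait.node  10s
spark.locality.wait.rack  3s
```

The same key can be passed per job, for example with --conf spark.locality.wait=10s on spark-submit, which makes it easier to compare run times with different values.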

Spark running on YARN - What does a real life example's workflow look like?

I have been reading up on Hadoop, YARN and SPARK. What makes sense to me thus far is what I have summarized below.
Hadoop MapReduce: The client chooses an input file and hands it off to
Hadoop (or YARN). Hadoop takes care of splitting the file based on the
user's InputFormat and stores it on as many nodes as are available
and configured. The client submits a job (map-reduce) to YARN, which
copies the jar to the available Data Nodes and executes the job. YARN is
the orchestrator that takes care of all the scheduling and running of
the actual tasks.
Spark: Given a job, input and a bunch of configuration parameters, it
can run your job, which could be a series of transformations and
provide you the output.
I also understand MapReduce is a batch based processing paradigm and
SPARK is more suited for micro batch or stream based data.
There are a lot of articles that talk about how Spark can run on YARN and how they are complementary, but none have managed to help me understand how the two come together during an actual workflow. For example, when a client has a job to submit that reads a huge file and does a bunch of transformations, what does the workflow look like when using Spark on YARN? Let us assume that the client's input file is a 100GB text file. Please include as much detail as possible.
Any help with this would be greatly appreciated.
Thanks
Kay
Let's assume the large file is stored in HDFS. In HDFS the file is divided into blocks of a fixed size (128 MB by default).
That means your 100GB file will be divided into 800 blocks (100 × 1024 MB / 128 MB). Each block is replicated, and the replicas can be stored on different nodes in the cluster.
When the file is read with a Hadoop InputFormat, a list of splits with their locations is obtained first. Then one task is created per split. That way you get 800 tasks that the runtime can execute in parallel.
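The block arithmetic above can be checked with a few lines of Java. This is only a sketch: the class and method names are made up for illustration, and it assumes a splittable file for which one input split corresponds to one HDFS block.

```java
final class SplitMath {
    static final long MB = 1024L * 1024L;
    static final long GB = 1024L * MB;

    // Number of blocks needed for a file, rounding up for a partial last block.
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long blocks = blockCount(100 * GB, 128 * MB);
        // One split per block is assumed, so this is also the number of input tasks.
        System.out.println("blocks / input tasks: " + blocks);
    }
}
```

For the 100GB file and 128 MB blocks this gives the 800 blocks, and therefore 800 tasks, mentioned in the answer.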
Basically the input process is the same for MapReduce and Spark, because both of them use Hadoop InputFormats.
Both of them process each InputSplit in a separate task. The main difference is that Spark has a richer set of transformations and can optimize the workflow when there is a chain of transformations that can be applied at once, as opposed to MapReduce, where there is always only a map phase and a reduce phase.
YARN stands for "Yet Another Resource Negotiator". When a new job with some resource requirements (memory, processors) is submitted, it is the responsibility of YARN to check whether the needed resources are available on the cluster. If other jobs running on the cluster are taking up too many of the resources, the new job is made to wait until the previous jobs complete and resources become available.
YARN will allocate enough containers in the cluster for the workers, plus one for the Spark driver. In each of these containers a JVM is started with the given resources. Each Spark worker can process multiple tasks in parallel (depending on the configured number of cores per executor).
e.g.
If you set 8 cores per Spark executor, YARN tries to allocate 101 containers in the cluster to run 100 Spark workers + 1 Spark master (driver), since 800 tasks / 8 cores = 100 workers. Each of the workers will process 8 tasks in parallel (because of its 8 cores).
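The container count in this example follows from simple arithmetic, sketched below in Java. The names are hypothetical, and the sketch assumes one task per core and exactly one extra container for the driver; in practice the number of executors is whatever the application requests, not something YARN derives by itself.

```java
final class ContainerMath {
    // Executors needed to run all tasks at once, assuming one task per core.
    static long executorsNeeded(long tasks, int coresPerExecutor) {
        return (tasks + coresPerExecutor - 1) / coresPerExecutor;
    }

    // Executors plus one container for the Spark driver.
    static long containersNeeded(long tasks, int coresPerExecutor) {
        return executorsNeeded(tasks, coresPerExecutor) + 1;
    }

    public static void main(String[] args) {
        System.out.println("executors:  " + executorsNeeded(800, 8));
        System.out.println("containers: " + containersNeeded(800, 8));
    }
}
```

With 800 tasks and 8 cores per executor this yields the 100 executors and 101 containers from the example.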

Differences between MapReduce and Yarn

I was reading about Hadoop and MapReduce with respect to the straggler problem and the papers on this problem,
but yesterday I found that there is Hadoop 2 with YARN.
Unfortunately, no paper talks about the straggler problem in YARN.
So I want to know: what is the difference between MapReduce and YARN with regard to stragglers?
Does YARN suffer from the straggler problem?
And when the MR master asks the resource manager for resources, will the resource manager give the MR master all the resources it needs, or does it depend on the cluster's computing capabilities?
Thanks so much,
Here is an overview of MapReduce 1.0 and MapReduce 2.0 (YARN).
MapReduce 1.0
In a typical Hadoop cluster, racks are interconnected via core switches. Core switches should connect to top-of-rack switches. Enterprises using Hadoop should consider using 10GbE, bonded Ethernet and redundant top-of-rack switches to mitigate risk in the event of failure. A file is broken into 64MB chunks by default and distributed across Data Nodes. Each chunk has a default replication factor of 3, meaning there will be 3 copies of the data at any given time. Hadoop is “Rack Aware” and HDFS places replicated chunks on nodes in different racks. The JobTracker assigns tasks to the nodes closest to the data, depending on the location of the nodes, and rack awareness also helps the NameNode determine the ‘closest’ chunk to a client during reads. The administrator supplies a script which tells Hadoop which rack a node is in, for example: /enterprisedatacenter/rack2.
Limitations of MapReduce 1.0 – Hadoop can scale up to 4,000 nodes. Beyond that limit it exhibits unpredictable behavior such as cascading failures and serious deterioration of the overall cluster. Another issue is multi-tenancy – it is impossible to run frameworks other than MapReduce 1.0 on a Hadoop cluster.
MapReduce 2.0
MapReduce 2.0 has two components: YARN, which provides cluster resource management, and MapReduce.
In MapReduce 2.0, the JobTracker is divided into three services:
ResourceManager, a persistent YARN service that receives and runs applications on the cluster. A MapReduce job is an application.
JobHistoryServer, which provides information about completed jobs.
ApplicationMaster, which manages each MapReduce job and is terminated when the job completes.
TaskTracker has been replaced with the NodeManager, a YARN service that manages resources and deployment on a node. The NodeManager is responsible for launching containers, each of which could run either a map or a reduce task.
This new architecture breaks up the JobTracker model by allowing a new ResourceManager to manage resource usage across applications, with ApplicationMasters taking responsibility for managing the execution of jobs. This change removes a bottleneck and lets Hadoop clusters scale to configurations larger than 4,000 nodes. This architecture also allows simultaneous execution of a variety of programming models such as graph processing, iterative processing, machine learning, and general cluster computing, including the traditional MapReduce.
You ask about the "Differences between MapReduce and YARN". MapReduce and YARN are definitely different things. MapReduce is a programming model; YARN is an architecture for managing a distributed cluster. Hadoop 2 uses YARN for resource management. Besides that, Hadoop supports a programming model for parallel processing, which we know as MapReduce. Hadoop already supported MapReduce before Hadoop 2. In short, MapReduce runs on top of the YARN architecture. Sorry, I can't comment on the straggler problem.
"when MRmaster asks resource manager for resources?"
When the user submits a MapReduce job. After the MapReduce job is done, the resources are freed again.
"resource manager will give MRmaster all resources it needs or it is according to cluster computing capabilities"
I don't quite get the point of this question. The resource manager will eventually grant all the resources the job needs, whatever the cluster's computing capabilities are; those capabilities influence the processing time.
There is no YARN in MapReduce 1. In MapReduce 2 there is YARN.
If by the straggler problem you mean that one participant waiting on something causes further waits down the line for everything that depends on it, then I guess this problem always exists in MR jobs. Waiting for resources to be allocated naturally contributes to this problem, along with everything else that may cause components to wait.
Tez, which is supposed to be a drop-in replacement for the MR job runtime, does things differently. Instead of running tasks the way the current MR AppMaster does, it uses a DAG of tasks, which does a much better job of avoiding bad straggler problems.
You need to understand the relationship between MR and YARN. YARN is simply a dumb resource scheduler, meaning it doesn't schedule 'tasks'. What it gives to the MR AppMaster is a set of resources (in a sense, only a combination of memory, CPU and location). It is then the MR AppMaster's responsibility to decide what to do with those resources.

Can map task and reduce task be in the same node?

I am new to Hadoop. Since the data transfer between the map node and the reduce node may reduce the efficiency of MapReduce, why aren't the map task and the reduce task placed together on the same node?
Actually, you can run map and reduce in the same JVM if the data is 'small' enough. This is possible in Hadoop 2.0 (aka YARN), where it is called an uber task.
From the great "Hadoop: The Definitive Guide" book:
If the job is small, the application master may choose to run the tasks in the same JVM as itself. This happens when it judges the overhead of allocating and running tasks in new containers outweighs the gain to be had in running them in parallel, compared to running them sequentially on one node. (This is different from MapReduce 1, where small jobs are never run on a single tasktracker.) Such a job is said to be uberized, or run as an uber task.
Normally the amount of data to be processed is too large, which is why map and reduce run on separate nodes. If the amount of data to be processed is small, then you can definitely run map and reduce on the same node.
Hadoop is usually used when the amount of data is very large; in that case, separate nodes are needed for the map and reduce operations for high availability and concurrency.
Hope this will clear your doubt.
An uber job occurs when multiple mappers and reducers are combined to execute inside the Application Master.
So, assuming the job to be executed has at most 9 mappers and at most 1 reducer, the Resource Manager (RM) creates an Application Master, and the job executes entirely within the Application Master, using its own JVM.
SET mapreduce.job.ubertask.enable=TRUE;
So the advantage of an uberised job is that it eliminates the round-trip overhead of the Application Master requesting containers for the job from the Resource Manager (RM) and the RM allocating those containers to the Application Master.
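The eligibility rule described above can be written as a small predicate. This is a simplified sketch, not Hadoop's actual implementation: the class and method names are invented, and only the two thresholds quoted in the answer are modelled (in Hadoop they correspond to mapreduce.job.ubertask.maxmaps and mapreduce.job.ubertask.maxreduces). The real Application Master also takes input size and memory into account, which is left out here.

```java
final class UberSketch {
    // Thresholds as quoted in the answer above.
    static final int MAX_MAPS = 9;
    static final int MAX_REDUCES = 1;

    // True if a job is small enough to run inside the Application Master's JVM.
    static boolean isUberCandidate(boolean uberEnabled, int numMaps, int numReduces) {
        return uberEnabled && numMaps <= MAX_MAPS && numReduces <= MAX_REDUCES;
    }

    public static void main(String[] args) {
        System.out.println("4 maps, 1 reduce:   " + isUberCandidate(true, 4, 1));
        System.out.println("800 maps, 1 reduce: " + isUberCandidate(true, 800, 1));
    }
}
```

A job like the 800-task example from the earlier question would never qualify, which matches the point that uber tasks only pay off for small inputs.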

Resources