How non mapreduce applications work in YARN? - hadoop

By using YARN, we can run non mapreduce application.
But how it works?
In HDFS, All gets stored in Blocks. For each blocks one mapper tasks would get create to process whole dataset.
But Non mapreduce applications, how it will process the datasets in different data node with out using mapreduce?
Please explain me.

Do not confuse the Map reduce paradigm with other applications like for instance Spark. Spark can run under Yarn but does not use mappers or reducers.
Instead it uses executors, these executors are aware of the datalocality, the same way mapreduce is.
The spark Driver will start executors on data nodes and will try to keep the data locality in mind when doing so.
Also do not confuse Map Reduce default behaviour with standard behaviour. you do not need to have 1 mapper per input split.
Also HDFS and Map Reduce are two different things. HDFS is just the storage layer while Map Reduce handles processing.

Related

mapreduce parameters in Spark

I wanted to know if mapreduce.* parameters are applicable in Spark.
As far as I know in Spark there is no buffer for the map output and for the reduce task the whole process is also different. Parameters like mapreduce.task.io.sort.mb ,mapreduce.reduce.shuffle.input.buffer.percent or mapreduce.reduce.input.buffer.percent control these kind of buffers.
I'm working in optimising parameters for spark tasks/jobs running in a hadoop/yarn cluster.
It is safe to say that these mapreduce parameters don't matter and that I should only care about spark.* parameters since the map, shuffle and reduce parts are different?
It's safe because Spark doesn't use MapReduce as processing engine, but it interacts directly with YARN to submit operations. Thus, when you use Spark, there is no MapReduce job scheduled, but you have a Spark application and Spark jobs.

Spark running on YARN - What does a real life example's workflow look like?

I have been reading up on Hadoop, YARN and SPARK. What makes sense to me thus far is what I have summarized below.
Hadoop MapReduce: Client choses an input file and hands if off to
Hadoop (or YARN). Hadoop takes care of splitting the flie based on
user's InputFormat and stores it on as many nodes that are available
and configured Client submits a job (map-reduce) to YARN, which
copeies the jar to available Data Nodes and executes the job. YARN is
the orchestrator that takes care of all the scheduling and running of
the actual tasks
Spark: Given a job, input and a bunch of configuration parameters, it
can run your job, which could be a series of transformations and
provide you the output.
I also understand MapReduce is a batch based processing paradigm and
SPARK is more suited for micro batch or stream based data.
There are a lot of articles that talks about how Spark can run on YARN and how they are complimentary, but none have managed to help me understand how those two come together during an acutal workflow. For example when a client has a job to submit, read a huge file and do a bunch of transformations what does the workflow look like when using Spark on YARN. Let us assume that the client's input file is a 100GB text file. Please include as much details as possible
Any help with this would be greatly appreciated
Thanks
Kay
Let's assume the large file is stored in HDFS. In HDFS the file is divided into blocks of some size (default 128 MB).
That means your 100GB file will be divided into 800 blocks. Each block will be replicated and can be stored on different node in the cluster.
When reading the file with Hadoop InputFormat list of splits with location is obtained first. Then there is created one task per each splits. That you will get 800 parallel tasks that are executed by runtime.
Basically the input process is the same for MapReduce and Spark, because both of the use Hadoop Input Formats.
Both of them will process each InputSplit in separate task. The main difference is that Spark has more rich set of transformations and can optimize the workflow if there is a chain of transformations that can be applied at once. As opposed to MapReduce where is always map and reduce phase only.
YARN stands for "Yet another resource negotiator". When a new job with some resource requirement (memory, processors) is submitted it is the responsibility of YARN to check if the needed resources are available on the cluster. If other jobs are running on the cluster are taking up too much of the resources then the new job will be made to wait till the prevoius jobs complete and resources are available.
YARN will allocate enough containers in the cluster for the workers and also one for the Spark driver. In each of these containers JVM is started with given resources. Each Spark worker can process multiple tasks in parallel (depends on the configured number of cores per executor).
e.g.
If you set 8 cores per Spark executor, YARN tries to allocated 101 containers in the cluster tu run 100 Spark workers + 1 Spark master (driver). Each of the workers will process 8 tasks in parallel (because of 8 cores).

Spark Fundamentals

I am new to Spark... some basic things i am not clear when going through fundamentals:
Query 1. For distributing processing - Can Spark work without HDFS - Hadoop file system on a cluster (like by creating it's own distributed file system) or does it requires some base distributed file system in place as a per-requisite like HDFS, GPFS, etc.
Query 2. If we already have a file loaded in HDFS (as distributed blocks) - then will Spark again be converting it into blocks and redistributes at it's level (for distributed processing) or will just use the block distribution as per the Haddop HDFS cluster.
Query 3. Other than defining of a DAG does SPARK also creates the partitions like MapReduce does and shuffles partitions to the reducer nodes for further computation?
I am confused on same, as till DAG creation it's clear that Spark Executor working on each Worker node loads data blocks as RDD in memory and computation is applied as per DAG .... but where does the part goes required for partitioning the data as per Keys and taking them to other nodes where reducer task will be performed (just like mapreduce) how that is done in-memory??
This would be better asked as separate questions and question 3 is hard to understand. Anyway:
No, Spark does not require a distributed file system.
By default Spark will create one partition per HDFS block, and will co-locate computation with the data if possible.
You're asking about shuffle. Shuffle creates blocks on the mappers that the reducers will fetch from them. The spark.shuffle.memoryFraction parameter controls how much memory to allocate to shuffle block files. (20% by default.) The spark.shuffle.spill parameter controls whether to spill shuffle blocks to local disk when the memory runs out.
Query 1. For distributing processing - Can Spark work without HDFS ?
For distributed processing, Spark does not require HDFS. But it may read/write data from/to HDFS system. For some use case, it may write data to HDFS. For teragen sorting world record program, it used HDFS for sorting the data instead of using in-memoery.
Spark don't provide distributed storage. But integration with HDFS is one option for storage. But Spark can use other storage systems like Cassnadra etc. Have a look at this article for more details : https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
Query 2. If we already have a file loaded in HDFS (as distributed blocks) - then will Spark again be converting it into blocks and redistributes at it's level
I agree with Daniel Darabos response. Spark will create one partition per HDFS block.
Query 3: on shuffle
Depending on size of the data, shuffle will be done in-memory Or it may use disk (e.g. teragen sorting) or it may use both. Have a look at this excellent article on Spark shuffle.
Fine with this. What if you don’t have enough memory to store the whole “map” output? You might need to spill intermediate data to the disk. Parameter spark.shuffle.spill is responsible for enabling/disabling spilling, and by default spilling is enabled
The amount of memory that can be used for storing “map” outputs before spilling them to disk is “JVM Heap Size” * spark.shuffle.memoryFraction * spark.shuffle.safetyFraction, with default values it is “JVM Heap Size” * 0.2 * 0.8 = “JVM Heap Size” * 0.16.
Query 1.
Yes it can work with others as well . Spark works with RDDs , if you have corresponding RDD implemented thats it .When you actually create a RDD by opening a file in HDFS , it inherently creates a HADOOP RDD which has implementation for understanding the HDFS , if you write your own Distributed file system you can write your own implementation for the same and instantiate the class its done . But writing the connector RDD to our own DFS is the challenge . For more you can look at the RDD interface in spark code
Query 2. It wont re create , instead my means of the HADOOP/HDFS RDD connector it knows where the blocks are .It will also try to use the same yarn nodes to run the jvm task to do processing .
Query 3. Not sure about this
Query 1 :- For simple spark provide distribute processing because of abstraction RDD (resilent distribute dataset), and without HDFS this cann't provide distribute Storage.
Query 2:- No it won't recreate.Here Spark will provide every block as partition(which means reference to that block) so it launch yarn on same block
Query 3:- no idea.

Hadoop cluster for non-MapReduce algorithms in parallel

The Apache Hadoop is inspired by the Google MapReduce paper. The flow of MapReduce can be considered as two set of SIMDs (single instruction multiple data), one for Mappers, another for Reducers. Reducers, through predefined "key", consume the output of Mappers. The essence of MapReduce framework (and Hadoop) is to automatically partition the data, determine the number of partitions and parallel jobs, and manage distributed resources.
I have a general algorithm (not necessarily MapReducable) to be run in parallel. I am not implementing the algorithm itself the MapReduce-way. Instead, the algorithm is just a single-machine python/java program. I want to run 64 copies of this program in parallel (assuming there is no concurrency issue in the program). i.e. I am more interested in the computing resources in the Hadoop cluster than the MapReduce frameworks. Is there anyway I can use the Hadoop cluster in this old fashion?
Other way of thinking about MapReduce, is MR does the transformation and Reduce does some sort of aggregations.
Hadoop also allows for a Map only job. This way it should be possible to run 64 copies of the Map program run in parallel.
Hadoop has the concept of slots. By default there will be 2 map and 2 reduce slots per node/machine. So, for 64 processes in parallel, 32 nodes are required. If the nodes are of higher end configuration, then the number of M/R slots per node can also be bumped up.

Does Amazon Elastic Map Reduce runs one or several mapper processes per instance?

My question is: should I care about multiprocessing in my mapper myself (read tasks from stdin then distribute them over worker processes, combine results in a master process and output to stdout) or Hadoop will take care of it automatically?
I haven't found the answer neither in Hadoop Streaming documentation, nor in Amazon Elastic MapReduce FAQ.
Hadoop has a notion of "slots". Slot is a place where mapper process will run. You configure number of slots per tasktracker node. It is teoretical maximum of map process which will run parralel per node. It can be less if there is not enough separate poprtions of the input data (called FileSplits).
Elastic MapReduce do have its own estimation how much slots to allocate depending on the instance capabilities.
In the same time I can imagine scenario where your processing will be more efficeint when one datastream is prcessed by many cores. If you have your mapper with built-in multicore usage - you can reduce number of slots. But it is inot usually a case in the typycial Hadoop tasks.
See the EMR doco [1] for the number of map/reduce tasks per instance type.
In addition to David's answers you can also have Hadoop run multiple threads per map slot by setting...
conf.setMapRunnerClass(MultithreadedMapRunner.class);
The default is 10 threads but it's tunable with
-D mapred.map.multithreadedrunner.threads=5
I often find this useful for custom high IO stuff.
[1] http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/HadoopMemoryDefault_AMI2.html
My question is: should I care about multiprocessing in my mapper myself (read tasks from stdin then distribute them over worker processes, combine results in a master process and output to stdout) or Hadoop will take care of it automatically?
Once a Hadoop cluster has been set, the minimum required to submit a job is
Input format and location
Output format and location
Map and Reduce functions for processing the data
Location of the NameNode and the JobTracker
Hadoop will take care of distributing the job to different nodes, monitoring them, reading the data from the i/p and writing the data to the o/p. If the user has to do all those tasks, then there is no point of using Hadoop.
Suggest, to go through the Hadoop documentation and a couple of tutorials.

Resources