How is RAM used in MapReduce processing? - hadoop

I need clarification on processing. Daemons such as the NameNode, DataNode, JobTracker and TaskTracker all live in a cluster (in a single-node cluster, they are distributed on the hard disk).
What is the role of RAM or cache in MapReduce processing, and how is it accessed by the various processes in MapReduce?

The JobTracker and TaskTracker were used to manage cluster resources in MapReduce 1.x. They were removed because that approach was not efficient, and since MapReduce 2.x a new mechanism called YARN has taken over. You can visit this link http://javacrunch.in/Yarn.jsp for an in-depth look at how YARN works. Hadoop daemons use RAM to optimize job execution. For example, in MapReduce, RAM is used to keep resource logs in memory when a new job is submitted, so that the ResourceManager can decide how to distribute the job across the cluster. One more important point is that Hadoop MapReduce runs disk-oriented jobs: it uses disk while executing a job, and that is a major reason why it is slower than Spark.
Hope this solves your query.

You mentioned a cluster in your question. A single server or machine is not called a cluster.
Daemons (processes) are not distributed across hard disks. They use RAM to run.
Regarding cache, look into this answer.

RAM is used while a MapReduce application is being processed.
Once the data has been read through InputSplits (from HDFS blocks) into memory (RAM), processing happens on the data held in RAM.
mapreduce.map.memory.mb = The amount of memory to request from the scheduler for each map task.
mapreduce.reduce.memory.mb = The amount of memory to request from the scheduler for each reduce task.
The default value for both parameters is 1024 MB (1 GB).
Several more memory-related parameters are used in the map and reduce phases. Have a look at the documentation page for mapred-site.xml for more details.
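As a rough illustration of what the two settings above mean for one machine, the sketch below estimates how many tasks could run at once given the per-task memory request. The 8192 MB node size is a made-up figure, and the function name is hypothetical. Real schedulers also consider CPU and other limits, so treat this as a back-of-the-envelope calculation only.

```python
# Minimal sketch: how many tasks fit on a node, by memory alone.
# node_memory_mb is an assumed, illustrative figure (not from the question).

def max_concurrent_tasks(node_memory_mb: int, task_memory_mb: int = 1024) -> int:
    """Return how many tasks of task_memory_mb fit into node_memory_mb."""
    if task_memory_mb <= 0:
        raise ValueError("task_memory_mb must be positive")
    return node_memory_mb // task_memory_mb


if __name__ == "__main__":
    # With the 1024 MB default request per task
    print(max_concurrent_tasks(8192))
    # After raising the per-task request to 2048 MB
    print(max_concurrent_tasks(8192, 2048))
```

Doubling the per-task request halves the number of tasks that fit, which is the trade-off to keep in mind when raising mapreduce.map.memory.mb or mapreduce.reduce.memory.mb.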
Related SE questions:
Mapreduce execution in a hadoop cluster

Related

Why can a YARN Application Master require more than 1 GB?

On my Hadoop cluster I have an issue where the ApplicationMaster (AM) is killed by the NodeManager because the AM tries to allocate more than the default 1 GB. The MR application that the AM is in charge of is a mapper-only job (1(!) mapper, no reducers, downloads data from a remote source). At the moment the AM is killed, the MR job itself is fine (it uses about 70% of its RAM limit). The MR job doesn't use any custom counters, distributed caches etc.; it just downloads data (in portions) via a custom input format.
To fix this issue I raised the memory limit for the AM, but I want to know why a trivial job like mine eats up 1 GB (!).

Running parallel queries in Spark

How does Spark handle concurrent queries? I have read a bit about Spark and its underlying RDDs, but I am unable to understand how concurrent queries would be handled.
For example, if I run a query which loads data into memory and consumes all the available memory, and at the same time someone else runs a query involving another set of data, how would Spark allocate memory to both queries? Also, what would the impact be if priorities are taken into account?
Also, can running lots of parallel queries result in the machines hanging?
First, Spark does not use more memory (RAM) than its configured threshold.
Spark tries to allocate the default amount of memory to every job.
If there is insufficient memory for a new job, it tries to spill the in-memory content of the least recently used (LRU) RDD to disk and then allocates the freed memory to the new job.
Optionally, you can also specify the storage level of an RDD, such as memory only, disk only, or memory and disk (MEMORY_ONLY, DISK_ONLY, MEMORY_AND_DISK).
Scenario: consider a low-memory machine with a huge number of jobs. With the above approach, most of the RDDs will end up on disk only.
So the jobs will continue to run, but they will not get the benefit of Spark's in-memory processing.
In this way Spark handles memory allocation largely by itself.
If Spark runs on top of YARN, the ResourceManager also takes part in resource allocation.
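The LRU spilling described above can be pictured with a small toy model. This is plain Python, not Spark code: the class ToyCache, its methods, and the sizes are all hypothetical and only mimic the idea of evicting the least recently used block to disk when memory runs short.

```python
from collections import OrderedDict


class ToyCache:
    """Toy LRU cache that moves evicted blocks to a 'disk' dict.

    Hypothetical illustration only; Spark's real block manager is
    far more involved.
    """

    def __init__(self, capacity_mb: int):
        self.capacity_mb = capacity_mb
        self.memory = OrderedDict()  # block name -> size, least recently used first
        self.disk = {}               # blocks that were spilled

    def used_mb(self) -> int:
        return sum(self.memory.values())

    def touch(self, name: str) -> None:
        """Mark a block as recently used."""
        if name in self.memory:
            self.memory.move_to_end(name)

    def put(self, name: str, size_mb: int) -> None:
        # A block larger than the whole cache goes straight to disk.
        if size_mb > self.capacity_mb:
            self.disk[name] = size_mb
            return
        # Spill least recently used blocks until the new block fits.
        while self.used_mb() + size_mb > self.capacity_mb:
            old_name, old_size = self.memory.popitem(last=False)
            self.disk[old_name] = old_size
        self.memory[name] = size_mb


if __name__ == "__main__":
    cache = ToyCache(capacity_mb=100)
    cache.put("a", 40)
    cache.put("b", 40)
    cache.touch("a")      # "b" is now the least recently used block
    cache.put("c", 40)    # does not fit, so "b" is spilled
    print(list(cache.memory), list(cache.disk))
```

In this run the block that was not touched recently is the one that ends up on disk, which matches the scenario above: under memory pressure, work continues but more of the data is served from disk.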

Map Reduce Slot Definition

I am on my way to becoming a Cloudera Hadoop administrator. Since I started, I have been hearing a lot about computing slots per machine in a Hadoop cluster, such as defining the number of map slots and reduce slots.
I have searched the internet for a long time for a beginner-level definition of a MapReduce slot but didn't find any.
I am really pissed off after going through PDFs explaining the configuration of MapReduce.
Please explain what exactly a computing slot is on a machine in a cluster.
In MapReduce v1, mapreduce.tasktracker.map.tasks.maximum and mapreduce.tasktracker.reduce.tasks.maximum in mapred-site.xml configure the number of map slots and reduce slots respectively.
Starting with MapReduce v2 (YARN), the more generic term "containers" is used instead of slots. Containers represent the maximum number of tasks that can run in parallel on a node, regardless of whether they are map tasks, reduce tasks or application master tasks (in YARN).
Generally it depends on CPU and memory.
In our cluster, we set 20 map slots and 15 reduce slots for a machine with 32 cores and 64 GB of memory.
1. Approximately one slot needs one CPU core.
2. The number of map slots should be a little higher than the number of reduce slots.
In MRv1, each machine had a fixed number of slots dedicated to map and reduce tasks.
In general, each machine is configured with a 4:1 ratio of map slots to reduce slots.
Logically, the maps read a lot of data and the reducers crunch it down to a small set.
In MRv2 the concept of containers came in, and any container can run a map task, a reduce task or a shell script.
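The practical difference between fixed slots and generic containers can be sketched with a little arithmetic. The 20 map slots and 15 reduce slots come from the answer above. Treating the same node as offering 35 interchangeable containers under MRv2 is an assumption made purely for comparison, and both function names are hypothetical.

```python
# Minimal sketch: fixed MRv1 slots versus generic MRv2 containers.
# The container count of 35 is an assumption (20 + 15), not a Hadoop default.

def running_maps_mrv1(pending_maps: int, map_slots: int) -> int:
    """In MRv1, map tasks can only use map slots; idle reduce slots stay idle."""
    return min(pending_maps, map_slots)


def running_maps_mrv2(pending_maps: int, containers: int, busy_containers: int = 0) -> int:
    """In MRv2, any free container can run a map task."""
    return min(pending_maps, max(containers - busy_containers, 0))


if __name__ == "__main__":
    pending = 30  # a map-only phase with 30 map tasks waiting
    print(running_maps_mrv1(pending, map_slots=20))
    print(running_maps_mrv2(pending, containers=35))
```

With no reduce work to do, the fixed-slot node leaves its 15 reduce slots unused, while the container-based node can give that capacity to map tasks.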
A bit late, but I'll answer anyway.
Computing slot: think of all the various computations in Hadoop that require some resource, i.e. memory, CPUs or disk space.
Resource = memory, CPU cores or disk space required.
Examples are allocating resources to start a container, or allocating resources to perform a map or a reduce task.
It is all about how you want to manage the resources you have at hand, namely RAM, cores and disk space.
The goal is to ensure your processing is not constrained by any one of these cluster resources. You want your processing to be as dynamic as possible.
As an example, Hadoop YARN allows you to configure the minimum RAM required to start a YARN container, the minimum RAM required to start a map or reduce task, the JVM heap size (for map and reduce tasks) and the amount of virtual memory each task gets.
Unlike in Hadoop MR1, you do not have to pre-configure resources (RAM size, for example) before you even begin executing MapReduce tasks. The idea is for resource allocation to be as elastic as possible, i.e. to be able to dynamically increase RAM or CPU cores for either a map or a reduce task.
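To make the "minimum RAM to start a container" idea concrete, here is a minimal sketch of how a memory request might be rounded to a container size. It assumes the scheduler rounds each request up to a multiple of the minimum allocation and rejects requests above a maximum. The exact behaviour depends on the YARN scheduler and its configuration, so both the rounding rule and the 1024 MB / 8192 MB figures should be read as assumptions, and the function name is hypothetical.

```python
# Minimal sketch, assuming requests are rounded up to a multiple of the
# minimum allocation. Real YARN behaviour depends on the scheduler in use.

def normalize_request(requested_mb: int,
                      min_alloc_mb: int = 1024,
                      max_alloc_mb: int = 8192) -> int:
    """Round a memory request up to the next multiple of min_alloc_mb."""
    if requested_mb > max_alloc_mb:
        raise ValueError("request exceeds the maximum allocation")
    # Ceiling division without floating point
    multiples = -(-requested_mb // min_alloc_mb)
    return max(min_alloc_mb, multiples * min_alloc_mb)


if __name__ == "__main__":
    for request in (512, 1024, 1500, 3000):
        print(request, "->", normalize_request(request))
```

Under this assumption a task asking for 1500 MB would occupy a 2048 MB container, which is why per-task requests are usually chosen as multiples of the minimum allocation.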

Spark vs MapReduce: why is Spark faster than MR, in principle?

As far as I know, Spark preloads the data from every node's disk (HDFS) into every node's RDD to compute on it. But I would guess that MapReduce must also load the data from HDFS into memory and then compute on it in memory. So why is Spark faster?
Is it just because MapReduce loads the data into memory every time it wants to compute, whereas Spark preloads the data? Thank you very much.
Spark uses the concept of a Resilient Distributed Dataset (RDD), which lets it transparently keep data in memory and persist it to disk when needed.
In MapReduce, on the other hand, after the map and reduce tasks the data is shuffled and sorted (a synchronisation barrier) and written to disk.
In Spark, there is no such synchronisation barrier slowing things down as in MapReduce, and the use of memory makes the execution engine really fast.
Hadoop MapReduce:
1. Hadoop MapReduce is batch processing.
2. High latency because of HDFS. Here is a full explanation of Hadoop MapReduce and Spark:
http://commandstech.com/basic-difference-between-spark-and-map-reduce-with-examples/
Spark:
1. Spark does streaming processing.
2. Low latency because of RDDs.

Can map task and reduce task be in the same node?

I am new to Hadoop. Since the data transfer between a map node and a reduce node may reduce the efficiency of MapReduce, why aren't the map task and the reduce task put together on the same node?
Actually, you can run map and reduce in the same JVM if the data is 'small' enough. This is possible in Hadoop 2.0 (aka YARN), where it is called an uber task.
From the great "Hadoop: The Definitive Guide" book:
If the job is small, the application master may choose to run the tasks in the same JVM as itself. This happens when it judges the overhead of allocating and running tasks in new containers outweighs the gain to be had in running them in parallel, compared to running them sequentially on one node. (This is different from MapReduce 1, where small jobs are never run on a single tasktracker.) Such a job is said to be uberized, or run as an uber task.
The amount of data to be processed is usually too large, which is why map and reduce are run on separate nodes. If the amount of data to be processed is small, then you can definitely run map and reduce on the same node.
Hadoop is usually used when the amount of data is very large. In that case, separate nodes are needed for the map and reduce operations for high availability and concurrency.
Hope this clears your doubt.
An uber job occurs when multiple mappers and reducers are combined and executed inside the Application Master.
So, assuming the job to be executed has at most 9 mappers and at most 1 reducer, the ResourceManager (RM) creates an Application Master, and the job is executed entirely within the Application Master, using its own JVM.
SET mapreduce.job.ubertask.enable=TRUE;
The advantage of an uberized job is that it eliminates the round-trip overhead of the Application Master requesting containers for the job from the ResourceManager (RM) and the RM allocating those containers to the Application Master.
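The thresholds above can be written down as a simple eligibility check. This is a plain-Python sketch, not Hadoop code. The limits of 9 maps and 1 reducer come from the answer. The input-size limit of 128 MB (one HDFS block) is an assumption for illustration, and the function name is hypothetical. Hadoop itself also checks that the tasks' resource needs fit inside the Application Master's container, which this sketch leaves out.

```python
# Minimal sketch of the "small enough to uberize" decision.
# max_bytes = 128 MB is an assumed block size, not taken from the answer.

def is_uber_eligible(num_maps: int,
                     num_reduces: int,
                     input_bytes: int,
                     *,
                     enabled: bool,
                     max_maps: int = 9,
                     max_reduces: int = 1,
                     max_bytes: int = 128 * 1024 * 1024) -> bool:
    """Return True if a job is small enough to run inside the Application Master."""
    return (enabled
            and num_maps <= max_maps
            and num_reduces <= max_reduces
            and input_bytes <= max_bytes)


if __name__ == "__main__":
    ten_mb = 10 * 1024 * 1024
    print(is_uber_eligible(1, 0, ten_mb, enabled=True))    # small map-only job
    print(is_uber_eligible(10, 1, ten_mb, enabled=True))   # too many mappers
    print(is_uber_eligible(1, 0, ten_mb, enabled=False))   # feature switched off
```

The enabled flag stands in for mapreduce.job.ubertask.enable: without it, even a tiny job goes through the normal container allocation path.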

Resources