Why can the YARN Application Master require more than 1GB? - hadoop

On my Hadoop cluster I have an issue where the ApplicationMaster (AM) is killed by the NodeManager because the AM tries to allocate more than the default 1 GB. The MR application the AM is in charge of is a mapper-only job (1 (!) mapper, no reducers) that downloads data from a remote source. At the moment the AM is killed, the MR job itself is fine (it uses about 70% of its RAM limit). The MR job doesn't have any custom counters, distributed caches, etc.; it just downloads data (in portions) via a custom input format.
To work around this I raised the memory limit for the AM, but I want to know why a trivial job like mine needs a whole 1 GB (!) for its AM.
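For reference, the standard knobs for the MR AM container are the ones below (property names are from mapred-default.xml; the jar, class and paths are placeholders and the values are only an illustration of raising the limit, not a recommendation; the -D generic options assume the job is launched via ToolRunner):
# placeholders: my-job.jar / com.example.DownloadJob / <input> <output>
hadoop jar my-job.jar com.example.DownloadJob \
    -D yarn.app.mapreduce.am.resource.mb=2048 \
    -D yarn.app.mapreduce.am.command-opts=-Xmx1638m \
    <input> <output>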

Related

Why is hadoop slow for a simple hello world job

I am following the tutorial on the hadoop website: https://hadoop.apache.org/docs/r3.1.2/hadoop-project-dist/hadoop-common/SingleCluster.html.
I run the following example in Pseudo-Distributed Mode.
time hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar grep input output 'dfs[a-z.]+'
It takes 1:47min to complete. When I turn off the network (wifi), it finishes in approx 50 seconds.
When I run the same command using the Local (Standalone) Mode, it finishes in approx 5 seconds (on a mac).
I understand that in Pseudo-Distributed Mode there is more overhead involved and hence it will take more time, but in this case it takes way more time. The CPU is completely idle during the run.
Do you have any idea what can cause this issue?
First, I don't have an explanation for why turning off your network would result in faster times. You'd have to dig through the Hadoop logs to figure out that problem.
This is typical behavior most people encounter running Hadoop on a single node. Effectively, you are trying to use FedEx to deliver something to your next-door neighbor. It will always be faster to walk it over, because of the inherent overhead of operating a distributed system. When you run local mode, you are only performing the map-reduce functions. When you run pseudo-distributed, it uses all the Hadoop servers (NameNode and DataNodes for data; ResourceManager and NodeManagers for compute), and what you are seeing is the latency involved in that.
When you submit your job, the ResourceManager has to schedule it. As your cluster is not busy, it will ask for resources from the NodeManager. The NodeManager will give it a container which will run your ApplicationMaster. Typically, this loop takes about 10 seconds. Once your AM is running, it will ask the ResourceManager for resources for its map and reduce tasks. That takes another 10 seconds. Also, when you submit your job there is around a 3-second wait before the job is actually submitted to the ResourceManager. So far that's 23 seconds and you haven't done any computation yet.
Once the job is running, the most likely cause of waiting is memory allocation. On smaller systems (under 32 GB of memory) the OS might take a while to allocate space. If you were to run the same thing on what is considered commodity hardware for Hadoop (16+ cores, 64+ GB) you would probably see run times closer to 25-30 seconds.
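If the goal is just to run a tiny example quickly while the pseudo-distributed daemons stay up, one option is to force the local job runner for that single run instead of going through YARN (a sketch, assuming the examples jar parses generic options via ToolRunner, which it does; delete the output directory first if it already exists):
# hdfs dfs -rm -r output   # only if a previous run left it behind
time hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar grep \
    -D mapreduce.framework.name=local \
    input output 'dfs[a-z.]+'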

How is RAM used in MapReduce processing?

I need clarification on processing and on daemons like the NameNode, DataNode, JobTracker and TaskTracker. These all lie in a cluster (in a single-node cluster they are all on one machine's hard disk).
What is the use of RAM or cache in MapReduce processing, and how is it accessed by the various processes in MapReduce?
The JobTracker and TaskTracker were used to manage cluster resources in MapReduce 1.x, and they were removed because that approach was not efficient. Since MapReduce 2.x a new mechanism called YARN is used. You can visit this link http://javacrunch.in/Yarn.jsp for an in-depth explanation of how YARN works. Hadoop daemons use RAM to optimize job execution; for example, in MapReduce, RAM is used to keep resource logs in memory when a new job is submitted, so that the ResourceManager can work out how to distribute the job across the cluster. One more important thing: Hadoop MapReduce performs disk-oriented jobs, i.e. it uses the disk while executing a job, and that is a major reason it is slower than Spark.
Hope this solves your query.
You mentioned a cluster in your question; we would not call a single server or machine a cluster.
Daemons (processes) are not distributed across hard disks; they use RAM to run.
Regarding cache, look into this answer.
RAM is used during processing of Map Reduce application.
Once the data is read through InputSplits (from HDFS blocks) into memory (RAM), the processing happens on data stored in RAM.
mapreduce.map.memory.mb = The amount of memory to request from the scheduler for each map task.
mapreduce.reduce.memory.mb = The amount of memory to request from the scheduler for each reduce task.
The default value for both of the above parameters is 1024 MB (1 GB).
Some more memory-related parameters are used in the MapReduce phases. Have a look at the mapred-default.xml documentation page for more details.
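If the defaults are too small for a particular job, they can be raised per job on the command line (a sketch, assuming the job is launched via ToolRunner so the generic -D options are parsed; jar, class and paths are placeholders; a commonly used rule of thumb sets the JVM heap to roughly 80% of the container size):
# placeholders: your-job.jar / com.example.YourJob / <input> <output>
hadoop jar your-job.jar com.example.YourJob \
    -D mapreduce.map.memory.mb=2048 \
    -D mapreduce.map.java.opts=-Xmx1638m \
    -D mapreduce.reduce.memory.mb=4096 \
    -D mapreduce.reduce.java.opts=-Xmx3276m \
    <input> <output>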
Related SE questions:
Mapreduce execution in a hadoop cluster

Spark and Map-Reduce together

What is the best approach to run Spark on a cluster that runs map reduce jobs?
The first question is about co-locality with data. When I start a Spark application, it allocates executors, right? How does it know where to allocate them so that they are on the same nodes as the data the jobs will need? (One job may want one piece of data while the job after it may need another.)
If I keep the Spark application up, the executors take slots from the machines in the cluster. Does that mean that for co-locality I need to have a Spark executor on every node?
With executors running, there are fewer resources left for my MapReduce jobs, right? I can stop and start the Spark application for every job, but then that takes away the speed advantage of having the executors up and running, correct? (And also the benefits of HotSpot for long-running processes?)
I have read that container resizing (YARN-1197) will help, but doesn't that just mean that executors will stop and start? Isn't that the same as stopping the Spark application? (In other words, if there are no live executors, what is the benefit of having the Spark application up versus shutting it down and starting it when a job requires executors?)
Data locality of executors: Spark does not deal with data locality when launching executors but when launching tasks on them. So you might need to have executors on each data node (HDFS redundancy can help you even if you don't have executors on every node).
Long-running process: whether to shut down your application or not depends on the use case. If you want to serve real-time application requests or Spark Streaming, you will not want to shut Spark down. But if you are doing batch processing, you should shut down your executors. For caching of data across jobs you should consider either HDFS cache or Tachyon. You can also consider dynamic allocation in Spark, with which you can free executors if they are not used for some time (http://spark.apache.org/docs/latest/configuration.html#dynamic-allocation).
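A minimal sketch of enabling dynamic allocation from the linked docs (the application class and jar are placeholders; the external shuffle service also has to be enabled on the NodeManagers for this to work):
spark-submit \
    --master yarn \
    --conf spark.dynamicAllocation.enabled=true \
    --conf spark.shuffle.service.enabled=true \
    --conf spark.dynamicAllocation.minExecutors=0 \
    --conf spark.dynamicAllocation.executorIdleTimeout=60s \
    --class com.example.BatchJob batch-job.jar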
YARN-1197 will help by letting containers release some of the CPUs/memory they were allocated. I am not sure, though, whether Spark supports this or not.

Spark running on YARN - What does a real life example's workflow look like?

I have been reading up on Hadoop, YARN and Spark. What makes sense to me thus far is what I have summarized below.
Hadoop MapReduce: the client chooses an input file and hands it off to Hadoop (or YARN). Hadoop takes care of splitting the file based on the user's InputFormat and stores it on as many nodes as are available and configured. The client submits a job (map-reduce) to YARN, which copies the jar to the available DataNodes and executes the job. YARN is the orchestrator that takes care of all the scheduling and running of the actual tasks.
Spark: given a job, an input and a bunch of configuration parameters, it can run your job, which could be a series of transformations, and provide you the output.
I also understand that MapReduce is a batch-based processing paradigm and Spark is more suited to micro-batch or stream-based data.
There are a lot of articles that talk about how Spark can run on YARN and how they are complementary, but none have managed to help me understand how those two come together during an actual workflow. For example, when a client has a job to submit, reading a huge file and doing a bunch of transformations, what does the workflow look like when using Spark on YARN? Let us assume that the client's input file is a 100 GB text file. Please include as much detail as possible.
Any help with this would be greatly appreciated
Thanks
Kay
Let's assume the large file is stored in HDFS. In HDFS the file is divided into blocks of some size (default 128 MB).
That means your 100 GB file will be divided into 800 blocks. Each block is replicated and can be stored on different nodes in the cluster.
When the file is read with a Hadoop InputFormat, a list of splits (with their locations) is obtained first. Then one task is created per split, so you get 800 parallel tasks that are executed by the runtime.
Basically the input process is the same for MapReduce and Spark, because both of them use Hadoop InputFormats.
Both of them process each InputSplit in a separate task. The main difference is that Spark has a richer set of transformations and can optimize the workflow if there is a chain of transformations that can be applied at once, as opposed to MapReduce, where there are always only a map and a reduce phase.
YARN stands for "Yet Another Resource Negotiator". When a new job with some resource requirements (memory, processors) is submitted, it is the responsibility of YARN to check whether the needed resources are available on the cluster. If other jobs running on the cluster are taking up too much of the resources, the new job will be made to wait until the previous jobs complete and resources become available.
YARN will allocate enough containers in the cluster for the workers and also one for the Spark driver. In each of these containers a JVM is started with the given resources. Each Spark worker can process multiple tasks in parallel (depending on the configured number of cores per executor).
e.g.
If you set 8 cores per Spark executor, YARN tries to allocate 101 containers in the cluster to run 100 Spark workers + 1 Spark master (driver). Each of the workers will process 8 tasks in parallel (because of the 8 cores).
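A hedged sketch of what such a submission could look like for the 100 GB example above (class name, jar and HDFS path are placeholders, and the memory figure is arbitrary):
spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --num-executors 100 \
    --executor-cores 8 \
    --executor-memory 4g \
    --class com.example.ProcessFile process-file.jar hdfs:///data/input-100g.txt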

Suspending hadoop nodes temporarily - background hadoop cluster

I wonder if it is possible to install a "background" Hadoop cluster. After all, Hadoop is meant to be able to deal with nodes being unavailable or slow sometimes.
So assume some university has a computer lab. Say, 100 boxes, all with upscale desktop hardware, gigabit Ethernet, probably even identical software installations. Linux is really popular here, too.
However, these 100 boxes are of course meant to be desktop systems for students. There are times where the lab will be full, but also times where the lab will be empty. User data is mostly stored on a central storage - say NFS - so the local disks are not used a lot.
Sounds like a good idea to me to use the systems as Hadoop cluster in their idle time. The simplest setup would be of course to have a cron job start the cluster at night, and shut down in the morning. However, also during the day many computers will be unused.
However, how would Hadoop react to, e.g., nodes being shut down when a user logs in? Is it possible to easily "pause" (preempt!) a node in Hadoop, moving it to swap when needed? Ideally, we would give Hadoop a chance to move the computation away before suspending the task (also to free up memory). How would one do such a setup? Is there a way to signal Hadoop that a node will be suspended?
As far as I can tell, DataNodes should not be stopped, and maybe replication needs to be increased to have more than 3 copies. With YARN there might also be the problem that by moving the task tracker to an arbitrary node, it may be the one that gets suspended at some point. But maybe it can be arranged that a small set of nodes is always on, and that those will run the task trackers.
Is it appropriate to just stop the TaskTracker or to send a SIGSTOP (and then resume with SIGCONT)? The first would probably give Hadoop the chance to react, the second would continue faster when the user logs out soon (as the job can then continue). How about YARN?
First of all, Hadoop doesn't support "preempt" the way you describe it. Hadoop simply restarts a task if it detects that the TaskTracker is dead. So in your case, when a user logs into a host, some script simply kills the TaskTracker, and the JobTracker marks all mappers/reducers that were running on the killed TaskTracker as FAILED. After that, those tasks are rescheduled on different nodes.
Of course such a scenario is not free. By design, mappers and reducers keep all intermediate data on their local hosts. Moreover, reducers fetch mapper output directly from the TaskTrackers where the mappers were executed. So when a TaskTracker is killed, all of that data is lost. In the case of mappers it is not a big problem, since a mapper usually works on a relatively small amount of data (gigabytes?), but a reducer will suffer more. A reducer runs the shuffle, which is costly in terms of network bandwidth and CPU. If a TaskTracker was running some reducer, restarting that reducer means all of its data has to be downloaded once more onto a new host.
And as I recall, the JobTracker doesn't notice immediately that a TaskTracker is dead, so the killed tasks shouldn't restart immediately.
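The "doesn't notice immediately" part is governed by a liveness timeout. As a rough pointer, the relevant properties are the ones below (names taken from yarn-default.xml and mapred-default.xml; double-check them against your Hadoop version, they are shown only to illustrate which knob controls how long it takes before tasks are rescheduled):
# YARN: how long the ResourceManager waits without heartbeats before declaring a NodeManager lost
yarn.nm.liveness-monitor.expiry-interval-ms=600000    # 10 minutes by default
# classic MR1 equivalent for the JobTracker/TaskTracker pair
mapreduce.jobtracker.expire.trackers.interval=600000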
If your workload is light, DataNodes can live forever; don't take them offline when a user logs in. A DataNode uses a small amount of memory (256 MB should be enough for a small amount of data) and, if your workload is light, doesn't use much CPU or disk I/O.
In conclusion, you can set up such a configuration, but don't rely on good and predictable job execution under moderate workloads.
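If you would rather take a node out of rotation cleanly than just kill the daemon, YARN and HDFS both support exclude files plus a refresh command. A rough sketch (the exclude-file paths and hostname are placeholders; the files must be the ones referenced by yarn.resourcemanager.nodes.exclude-path and dfs.hosts.exclude in your configs):
# mark the lab machine as excluded for YARN
echo "lab-pc-042.example.edu" >> /etc/hadoop/conf/yarn.exclude
yarn rmadmin -refreshNodes
# the HDFS analogue, if you also want to decommission the DataNode
echo "lab-pc-042.example.edu" >> /etc/hadoop/conf/dfs.exclude
hdfs dfsadmin -refreshNodes
# to bring the node back, remove it from the exclude files and refresh again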
