What does YARN_HEAPSIZE mean in yarn-env.sh? - hadoop

What does setting "YARN_HEAPSIZE=500" in yarn-env.sh mean?
What does it mean to set the heap size of HADOOP's main daemon?
For example ,
How are the above settings different from yarn.scheduler.minimum-allocation-mb, yarn.scheduler.maximum-allocation-mb, and yarn.nodemanager.resource.memory-mb in yarn-site.xml?

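They each set the heap size of a daemon JVM. YARN_HEAPSIZE=500, for instance, gives the YARN daemons (ResourceManager, NodeManager) a maximum heap of 500 MB.
The properties you've mentioned from yarn-site.xml instead govern the memory given to containers for applications that get launched within YARN.
The explanation for each is already on the Hadoop website, under the cluster installation pages.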
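
To make the distinction concrete, here is a minimal Python sketch. It is a simplified model, not Hadoop code. The first function mirrors how yarn-env.sh turns YARN_HEAPSIZE (in MB) into an -Xmx flag for the daemon JVMs. The second models, roughly, how the Capacity Scheduler treats an application's container request using the yarn-site.xml limits. The function names and the example limits (1024 and 8192 MB) are assumptions for illustration.

```python
import math


def daemon_heap_flag(yarn_heapsize_mb: int) -> str:
    """JVM max-heap flag applied to the YARN daemons (not to applications)."""
    return f"-Xmx{yarn_heapsize_mb}m"


def normalize_container_request(requested_mb: int,
                                min_alloc_mb: int = 1024,
                                max_alloc_mb: int = 8192) -> int:
    """Simplified model of a container request under the yarn-site.xml limits.

    min_alloc_mb stands for yarn.scheduler.minimum-allocation-mb and
    max_alloc_mb for yarn.scheduler.maximum-allocation-mb (assumed values).
    """
    if requested_mb > max_alloc_mb:
        # A request above the maximum is refused rather than shrunk.
        raise ValueError("request exceeds yarn.scheduler.maximum-allocation-mb")
    # Requests are rounded up to a multiple of the minimum allocation.
    steps = max(1, math.ceil(requested_mb / min_alloc_mb))
    return steps * min_alloc_mb


if __name__ == "__main__":
    print(daemon_heap_flag(500))              # heap of the daemon itself
    print(normalize_container_request(1500))  # memory granted to a container
```

yarn.nodemanager.resource.memory-mb is the third piece: it caps the total memory that all containers on one node may use, independently of the daemon's own heap.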


Switching from Capacity Scheduler to Fair Scheduler on Hortonworks Data Platform

My organization is currently using Hortonworks HDP to manage our Hadoop Cluster. The default YARN scheduler is the Capacity Scheduler. I would like to switch to a Fair Scheduler. I am completely new to HDP.
In the absence of a cluster management suite, this would be done by editing the yarn-site.xml and changing the yarn.resourcemanager.scheduler.class property to org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
and creating an extra fair-scheduler.xml file to specify the queue configurations as mentioned here and then referring YARN to that configuration by setting the yarn.scheduler.fair.allocation.file property in yarn-site.xml.
Now in Ambari, while it is possible to change the yarn.resourcemanager.scheduler.class property via the UI and to add a new custom property yarn.scheduler.fair.allocation.file, I cannot (for the love of god) find a way to have Ambari read fair-scheduler.xml instead of capacity-scheduler.xml.
So my question is: how do I go about switching to the Fair Scheduler via Ambari? There's got to be an easy way, right?
Properties in capacity-scheduler.xml
On your RM node, set yarn.scheduler.fair.allocation.file to the full path of your fair-scheduler.xml (or set it in custom yarn-site under Ambari).
After a restart of your ResourceManager, you should see in its log that it is loading your file:
tail -n 1000 /var/log/hadoop-yarn/yarn/hadoop-yarn-resourcemanager-master.log | grep "fair-scheduler.xml"
2019-02-19 15:49:26,358 INFO fair.AllocationFileLoaderService (AllocationFileLoaderService.java:reloadAllocations(230)) - Loading allocation file file:/usr/hdp/current/hadoop-client/conf/fair-scheduler.xml
This works on HDP 3.1.1, and probably on 3.0.0 too.
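
If you do not yet have a fair-scheduler.xml, the following Python sketch generates a minimal allocation file. It assumes only the basic layout of the Fair Scheduler allocation format (an allocations root containing queue elements with a weight). The queue names "etl" and "adhoc" and their weights are hypothetical. The function only returns the XML text; saving it to the path you set in yarn.scheduler.fair.allocation.file is left to you.

```python
import xml.etree.ElementTree as ET


def build_allocation_file(queue_weights: dict) -> str:
    """Return minimal Fair Scheduler allocation XML for the given queues.

    queue_weights maps a queue name to its weight (hypothetical values).
    """
    root = ET.Element("allocations")
    for name, weight in queue_weights.items():
        queue = ET.SubElement(root, "queue", name=name)
        ET.SubElement(queue, "weight").text = str(weight)
    # encoding="unicode" returns a str instead of bytes.
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical queues: "etl" gets twice the share of "adhoc".
    print(build_allocation_file({"etl": 2.0, "adhoc": 1.0}))
```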

Data node capacity is too low on the server

I am using hortonworks distribution of hadoop and have installed it with ambari.
I have it installed on the server and there aren't any space constraints as such.
Yet my datanode is given only 991 MB of space which is surprisingly low.
What could be the possible reasons? How do I fix it?
You can try df -h from a Linux terminal to confirm that the filesystem holding the DataNode directories has enough space.
Log in to the Ambari UI --> HDFS tab --> Configs, and search for the property dfs.datanode.du.reserved.
This property defines the space, in bytes per volume, that is reserved for non-HDFS use and will not be used by Hadoop.
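
As a rough cross-check of the two points above, this Python sketch estimates the capacity a DataNode can offer from one data directory: the size of the filesystem that holds it, minus the reserved space. This is an approximation under stated assumptions, not the exact HDFS calculation. The function names are hypothetical, and data_dir should be one of the paths listed in dfs.datanode.data.dir.

```python
import shutil


def estimated_dfs_capacity(volume_total_bytes: int, reserved_bytes: int) -> int:
    """Approximate usable capacity: volume size minus the reserved space."""
    return max(volume_total_bytes - reserved_bytes, 0)


def estimate_for_dir(data_dir: str, reserved_bytes: int) -> int:
    """Estimate capacity for the filesystem that holds data_dir."""
    usage = shutil.disk_usage(data_dir)  # total, used, free in bytes
    return estimated_dfs_capacity(usage.total, reserved_bytes)


if __name__ == "__main__":
    one_gib = 1024 ** 3
    # "." is a placeholder; pass a directory from dfs.datanode.data.dir.
    print(estimate_for_dir(".", reserved_bytes=one_gib) / one_gib, "GiB")
```

If the result is close to the capacity HDFS reports, the data directory probably sits on a small partition, and pointing dfs.datanode.data.dir at a larger mount is the likelier fix than changing the reserved value.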

Does Mesos Overwrite Hadoop Memory Settings?

My company runs Hadoop on Mesos, and I’m new to Mesos. The current bottleneck of the Hadoop application I’m in charge of is the speed of the reducer tasks, so I was hoping to play around with the Mesos and Hadoop memory settings to speed up the reducers.
Unfortunately, I don’t understand the relationship between hadoop memory settings and mesos memory configuration, and I suspect that mesos may be overriding some of my hadoop memory settings.
Is changing the value of mapreduce.reduce.java.opts or mapreduce.reduce.memory.mb (in /etc/hadoop/conf/mapred-site.xml) affected by mesos? Does mesos limit the amount of memory that I can allocate to the reducer?
If so, where are the config files in mesos so I can change those settings?
9/30/2015 Update:
The file at https://github.com/mesos/hadoop/blob/master/configuration.md lists parameters that you can put in your mapred-site.xml file.
I'm still not sure how those parameters affect the memory-associated hadoop configuration parameters in mapred-site.xml.
The configuration is described in the respective GitHub repo mesos/hadoop.

Mismatch in number of executors (Spark in YARN pseudo-distributed mode)

I am running Spark using YARN(Hadoop 2.6) as cluster manager. YARN is running in Pseudo distributed mode. I have started the spark shell with 6 executors and was expecting the same
spark-shell --master yarn --num-executors 6
But in the Spark Web UI, I see only 4 executors.
Any reason for this?
PS: I ran the nproc command on my Ubuntu (14.04) machine, and the result is given below. I believe this means my system has 8 cores.
mountain#mountain:~$ nproc
Did you take spark.yarn.executor.memoryOverhead into account?
It possibly creates a hidden memory requirement, so that in the end YARN cannot provide all of the requested resources.
Also, note that YARN rounds the container size up to a multiple of yarn.scheduler.increment-allocation-mb.
All details here:
This happens when there are not enough resources on your cluster to start more executors. The following things are taken into account:
A Spark executor runs inside a YARN container. The container size is determined from the value of yarn.scheduler.minimum-allocation-mb in yarn-site.xml, so check this property. If your existing containers consume all available memory, no memory is left for new containers, so no new executors will be started.
The Storage Memory column in the UI displays the amount of memory used for execution and RDD storage. By default, this equals (HEAP_SPACE - 300MB) * 75%. The rest of the memory is used for internal metadata, user data structures, and other things. Ref: (Spark on YARN: Less executor memory than set via spark-submit)
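
The arithmetic behind these points can be sketched in Python. Everything numeric here is an assumption for illustration, not a diagnosis of the cluster in the question: the default overhead of max(384 MB, 10% of executor memory) varies by Spark version, and the 512 MB increment, the 8192 MB of node memory, and the 1024 MB ApplicationMaster container are hypothetical values. The function names are also hypothetical.

```python
import math


def container_size_mb(executor_memory_mb, overhead_mb=None, increment_mb=512):
    """Memory YARN reserves for one executor container."""
    if overhead_mb is None:
        # Assumed default for spark.yarn.executor.memoryOverhead.
        overhead_mb = max(384, int(executor_memory_mb * 0.10))
    requested = executor_memory_mb + overhead_mb
    # YARN rounds the request up to a multiple of the increment.
    return math.ceil(requested / increment_mb) * increment_mb


def max_executors(node_memory_mb, executor_memory_mb, am_container_mb=0, **kwargs):
    """How many executor containers fit once the AM container is placed."""
    size = container_size_mb(executor_memory_mb, **kwargs)
    return (node_memory_mb - am_container_mb) // size


def storage_memory_mb(heap_mb, fraction=0.75, reserved_mb=300):
    """(HEAP_SPACE - 300MB) * 75%, as shown in the Storage Memory column."""
    return (heap_mb - reserved_mb) * fraction


if __name__ == "__main__":
    print(container_size_mb(1024))                          # 1536
    print(max_executors(8192, 1024, am_container_mb=1024))  # 4
    print(storage_memory_mb(1024))                          # 543.0
```

With these assumed numbers, a 1 GB executor occupies a 1536 MB container, so only 4 of the 6 requested executors fit on the single node.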
I hope this helps.

How do I configure and reboot an HDInsight cluster running on Azure?

Specifically, I want to change the maximum number of mappers and the maximum number of reducers for each node in an HDInsight cluster running on Microsoft Azure.
Using remote desktop, I logged in to the head node. I edited the mapred-site.xml file on the head node and changed the mapred.tasktracker.map.tasks.maximum and the mapred.tasktracker.reduce.tasks.maximum values. I tried rebooting the head node, but I was not able to reboot. I used the start-onebox.cmd and stop-onebox.cmd scripts to try and start/stop HDInsight.
I then ran a streaming mapreduce passing the desired number of reducers to the hadoop-streaming.jar, but the number of reducers was still limited by the previous value of mapred.tasktracker.reduce.tasks.maximum. Most of my reducers were pending execution.
Do I need to change the mapred-site.xml file on every node? Is there an easy way to change this, or do I need to remote desktop into every node? How do I reboot or restart the cluster so that my new values are used?
I know it has been a while since the question was posted, but I would like to post an answer for other users who may find it useful.
There are 2 ways you can change Hadoop configuration files (such as mapred-site.xml, hive-site.xml, etc.) on HDInsight:
Option #1:
This is the easiest - you can supply the hadoop configuration values per job, as shown in this blog
Option #2:
You can customize an HDInsight cluster with Hadoop configuration values when provisioning (installing) the cluster, as shown in this blog
Manually modifying a config file is not supported and the change will be lost when the Azure VM gets re-imaged.
