Hadoop how to allocate more memory per node - hadoop

I have a Hadoop cluster running on 2 nodes (master and slave), each of which has 126 GB of RAM and 32 CPUs.
When I run my cluster, I am only able to see 8 GB of memory per node. How do I increase this? What would be the optimal amount of memory to allocate per node, and how do I do it?

This blog post will give you a ton of help: http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/

You might have to tell Hadoop what parameters to use when launching the JVMs, or else it will use your Java implementation's default values.
In your mapred-site.xml, you can add the mapred.child.java.opts property to specify the heap size to use for the JVMs.
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx16000m</value>
</property>
Where 16000 is the number of MB you want to allocate to each JVM.
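Note that on a YARN cluster the per-task variants of this setting are what the other answers on this page use; a minimal mapred-site.xml sketch with illustrative heap sizes (taken from the tutorial values quoted further down), assuming you want different limits for mappers and reducers:
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx3072m</value> <!-- illustrative heap for each map task JVM -->
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx6144m</value> <!-- illustrative heap for each reduce task JVM -->
</property>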
I hope it helps!

Related

Difference between `yarn.scheduler.maximum-allocation-mb` and `yarn.nodemanager.resource.memory-mb`?

What is the difference between yarn.scheduler.maximum-allocation-mb and yarn.nodemanager.resource.memory-mb?
I see both of these in yarn-site.xml, and I see the explanations here.
yarn.scheduler.maximum-allocation-mb is given the following definition: "The maximum allocation for every container request at the RM, in MBs. Memory requests higher than this will throw an InvalidResourceRequestException." Does this mean memory requests ONLY on the ResourceManager are limited by this value?
And yarn.nodemanager.resource.memory-mb is given the definition "Amount of physical memory, in MB, that can be allocated for containers." Does this mean the total amount for all containers across the entire cluster, summed together?
However, I still cannot tell these apart. Those explanations make me think they are the same.
Even more confusing, their default values are exactly the same: 8192 MB. How do I tell the difference between them? Thank you.
Consider a scenario where you are setting up a cluster in which each machine has 48 GB of RAM. Some of this RAM should be reserved for the operating system and other installed applications.
yarn.nodemanager.resource.memory-mb:
Amount of physical memory, in MB, that can be allocated for containers. It is the amount of memory YARN can utilize on this node, and therefore this property
should be lower than the total memory of that machine.
<name>yarn.nodemanager.resource.memory-mb</name>
<value>40960</value> <!-- 40 GB -->
The next step is to provide YARN guidance on how to break up the total resources available into Containers. You do this by specifying the minimum unit of RAM to allocate for a Container.
In yarn-site.xml
<name>yarn.scheduler.minimum-allocation-mb</name> <!-- RAM per container -->
<value>2048</value>
yarn.scheduler.maximum-allocation-mb:
It defines the maximum memory allocation available for a container, in MB.
This means the RM can only allocate memory to containers in increments of yarn.scheduler.minimum-allocation-mb, without exceeding yarn.scheduler.maximum-allocation-mb, and it should not be more than the total memory allocated to YARN on the node.
In yarn-site.xml
<name>yarn.scheduler.maximum-allocation-mb</name> <!-- Max RAM per container -->
<value>8192</value>
For MapReduce applications, YARN processes each map or reduce task in a container, and on a single machine there can be a number of containers.
We want to allow for a maximum of 20 containers on each node, and thus need (40 GB total RAM) / (20 containers) = 2 GB minimum per container, controlled by the property yarn.scheduler.minimum-allocation-mb.
Again, we want to restrict the maximum memory utilization of a container, controlled by the property yarn.scheduler.maximum-allocation-mb.
For example, if one job asks for 2049 MB of memory per map container (mapreduce.map.memory.mb=2049 set in mapred-site.xml), the RM will give it one 4096 MB (2 * yarn.scheduler.minimum-allocation-mb) container.
If you have a huge MR job that asks for a 9999 MB map container, the request exceeds yarn.scheduler.maximum-allocation-mb and the job will fail with an error (the InvalidResourceRequestException quoted in the question).
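Putting the three values from this example together, a yarn-site.xml sketch (the 40960/2048/8192 numbers are the example's, not a recommendation):
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>40960</value> <!-- RAM on this node that YARN may hand out to containers -->
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>2048</value> <!-- smallest container the RM will allocate; also the rounding increment -->
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>8192</value> <!-- largest single container request the RM will accept -->
</property>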

Error in reducing ram size in hortonworks hadoop

I have to reduce the RAM size of my VirtualBox VM from 4 GB to 1 GB. I tried to reduce it, but it is unchangeable, so please suggest the right way to do it. I am attaching a screenshot.
The same error occurred when I tried this with Hadoop; the following settings may help.
Configuring YARN
In a Hadoop cluster, it’s vital to balance the usage of RAM, CPU and disk so that processing is not constrained by any one of these cluster resources. As a general recommendation, we’ve found that allowing for 1-2 Containers per disk and per core gives the best balance for cluster utilization. So with our example cluster node with 12 disks and 12 cores, we will allow for 20 maximum Containers to be allocated to each node.
Each machine in our cluster has 48 GB of RAM. Some of this RAM should be reserved for Operating System usage. On each node, we’ll assign 40 GB RAM for YARN to use and keep 8 GB for the Operating System. The following property sets the maximum memory YARN can utilize on the node:
In yarn-site.xml
<name>yarn.nodemanager.resource.memory-mb</name>
<value>40960</value>
The next step is to provide YARN guidance on how to break up the total resources available into Containers. You do this by specifying the minimum unit of RAM to allocate for a Container. We want to allow for a maximum of 20 Containers, and thus need (40 GB total RAM) / (20 # of Containers) = 2 GB minimum per container:
In yarn-site.xml
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>2048</value>
YARN will allocate Containers with RAM amounts no smaller than yarn.scheduler.minimum-allocation-mb.
For more information you can visit hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/

How do Hadoop YARN memory and cores work?

I have read about the Hadoop YARN memory and core parameters that are set on a cluster, but I still don't understand those parameters clearly, nor containers. Does one node have only one container, or more? And does one cluster have only one Application Master, or more?
Help me understand how containers, the Application Master, memory, and cores work in YARN.
Thanks all.
The architecture diagram on the Apache YARN web site gives a clear explanation of the inner details and functionality.
We set the parameters for YARN in yarn-site.xml under the conf directory.
The property yarn.nodemanager.resource.memory-mb can be used to set the maximum amount of RAM that YARN can use on a node.
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4096</value>
The property yarn.scheduler.minimum-allocation-mb can be used to set the minimum amount of memory for a container. For example, if a machine has 16 GB of memory and yarn.nodemanager.resource.memory-mb is 4 GB, then 12 GB remains for the OS and other processes; with a minimum of 2 GB allocated per container, YARN can run at most 2 containers (4 GB / 2 GB) on that node.
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>2048</value>
From the architecture you can see that there is one Application Master per application, so a cluster running several applications will have more than one Application Master.
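The question also asks about cores: the CPU counterpart of the memory setting above is yarn.nodemanager.resource.cpu-vcores (discussed in more detail further down this page). A sketch in yarn-site.xml, with an illustrative value:
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>8</value> <!-- illustrative: number of vcores YARN may allocate to containers on this node -->
</property>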

How to set the VCORES in hadoop mapreduce/yarn?

The following are my configuration :
**mapred-site.xml**
map-mb : 4096 opts:-Xmx3072m
reduce-mb : 8192 opts:-Xmx6144m
**yarn-site.xml**
resource memory-mb : 40GB
min allocation-mb : 1GB
The vcores displayed for my Hadoop cluster is 8, but I don't know how it is computed or where to configure it.
I hope someone can help me.
Short Answer
It most probably doesn't matter if you are just running Hadoop out of the box on your single-node cluster or even a small personal distributed cluster. You just need to worry about memory.
Long Answer
vCores are used on larger clusters in order to limit CPU for different users or applications. If you are using YARN for yourself, there is no real reason to limit your container CPU. That is why vCores are not even taken into consideration by default in Hadoop!
Try setting your available NodeManager vcores to 1. It doesn't matter! Your number of containers will still be 2 or 4 ... or whatever the value of:
yarn.nodemanager.resource.memory-mb / mapreduce.[map|reduce].memory.mb
If you really do want the number of containers to take vCores into consideration and be limited by:
yarn.nodemanager.resource.cpu-vcores / mapreduce.[map|reduce].cpu.vcores
then you need to use a different ResourceCalculator. Go to your capacity-scheduler.xml config and change DefaultResourceCalculator to DominantResourceCalculator.
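As a sketch of that change (property name as used by the Capacity Scheduler; check it against your Hadoop version), in capacity-scheduler.xml:
<property>
<name>yarn.scheduler.capacity.resource-calculator</name>
<!-- default: org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator -->
<value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>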
Do you also want to use vCores to actually limit the CPU usage of each node, in addition to using them for container allocation? Then you need to change more configurations to use the LinuxContainerExecutor instead of the DefaultContainerExecutor, because it can manage Linux cgroups, which are used to limit CPU resources. Follow this page if you want more info on this.
yarn.nodemanager.resource.cpu-vcores - Number of CPU cores that can be allocated for containers.
mapreduce.map.cpu.vcores - The number of virtual CPU cores allocated for each map task of a job
mapreduce.reduce.cpu.vcores - The number of virtual CPU cores for each reduce task of a job
I accidentally came across this question and I eventually managed to find the answers that I needed, so I will try to provide a complete answer.
Entities and their relations
For each Hadoop application/job, you have an Application Master that communicates with the ResourceManager about available resources on the cluster. The ResourceManager receives information about the available resources on each node from each NodeManager. The resources are called Containers (memory and CPU). For more information see this.
Resource declaration on the cluster
Each NodeManager provides information about its available resources. The relevant settings are yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores in $HADOOP_CONF_DIR/yarn-site.xml. They declare the memory and CPUs that can be allocated to Containers.
Ask for resources
For your jobs you can configure what resources are needed by each map/reduce task. This can be done as follows (this is for the map tasks):
conf.set("mapreduce.map.cpu.vcores", "4");
conf.set("mapreduce.map.memory.mb", "2048");
This will ask for 4 virtual cores and 2048MB of memory for each map task.
You can also configure the resources that are necessary for the Application Master the same way with the properties yarn.app.mapreduce.am.resource.mb and yarn.app.mapreduce.am.resource.cpu-vcores.
Those properties can have default values in $HADOOP_CONF_DIR/mapred-default.xml.
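If you would rather set these request sizes cluster-wide instead of per job, a hedged mapred-site.xml sketch (the 2048/4 values mirror the conf.set() example above; the Application Master value is purely illustrative):
<property>
<name>mapreduce.map.memory.mb</name>
<value>2048</value> <!-- memory requested for each map container -->
</property>
<property>
<name>mapreduce.map.cpu.vcores</name>
<value>4</value> <!-- vcores requested for each map container -->
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>2048</value> <!-- illustrative: memory requested for the MapReduce Application Master -->
</property>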
For more options and default values, I would recommend you take a look at this and this.

Container is running beyond memory limits

In Hadoop v1, I assigned each of the 7 mapper and reducer slots a size of 1 GB, and my mappers and reducers ran fine. My machine has 8 GB of memory and 8 processors.
Now with YARN, when I run the same application on the same machine, I get a container error.
By default, I have these settings:
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>8192</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>8192</value>
</property>
It gave me this error:
Container [pid=28920,containerID=container_1389136889967_0001_01_000121] is running beyond virtual memory limits. Current usage: 1.2 GB of 1 GB physical memory used; 2.2 GB of 2.1 GB virtual memory used. Killing container.
I then tried to set the memory limit in mapred-site.xml:
<property>
<name>mapreduce.map.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>4096</value>
</property>
But I was still getting the error:
Container [pid=26783,containerID=container_1389136889967_0009_01_000002] is running beyond physical memory limits. Current usage: 4.2 GB of 4 GB physical memory used; 5.2 GB of 8.4 GB virtual memory used. Killing container.
I'm confused about why the map task needs this much memory. In my understanding, 1 GB of memory is enough for my map/reduce tasks. Why is it that as I assign more memory to the container, the task uses more? Is it because each task gets more splits? I feel it would be more efficient to decrease the size of the container a little and create more containers, so that more tasks run in parallel. The problem is: how can I make sure each container won't be assigned more splits than it can handle?
You should also properly configure the maximum memory allocations for MapReduce. From this HortonWorks tutorial:
[...]
Each machine in our cluster has 48 GB of RAM. Some of this RAM should be reserved for Operating System usage. On each node, we'll assign 40 GB RAM for YARN to use and keep 8 GB for the Operating System
For our example cluster, we have the minimum RAM for a Container
(yarn.scheduler.minimum-allocation-mb) = 2 GB. We’ll thus assign 4 GB
for Map task Containers, and 8 GB for Reduce tasks Containers.
In mapred-site.xml:
mapreduce.map.memory.mb: 4096
mapreduce.reduce.memory.mb: 8192
Each Container will run JVMs for the Map and Reduce tasks. The JVM
heap size should be set to lower than the Map and Reduce memory
defined above, so that they are within the bounds of the Container
memory allocated by YARN.
In mapred-site.xml:
mapreduce.map.java.opts: -Xmx3072m
mapreduce.reduce.java.opts: -Xmx6144m
The above settings configure the upper limit of the physical RAM that
Map and Reduce tasks will use.
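In XML form, a sketch of those four mapred-site.xml settings from the tutorial:
<property>
<name>mapreduce.map.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>8192</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx3072m</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx6144m</value>
</property>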
To sum it up:
In YARN, you should use the mapreduce configs, not the mapred ones. EDIT: This comment is not applicable anymore now that you've edited your question.
What you are configuring is actually how much you want to request, not the maximum that can be allocated.
The max limits are configured with the java.opts settings listed above.
Finally, you may want to check this other SO question that describes a similar problem (and solution).
There is a check at the YARN level for the ratio of virtual to physical memory usage. The issue is not only that the VM doesn't have sufficient physical memory; it is that the virtual memory usage is higher than expected for the given physical memory.
Note: this happens on CentOS/RHEL 6 due to its aggressive allocation of virtual memory.
It can be resolved either by:
Disabling the virtual memory usage check, by setting yarn.nodemanager.vmem-check-enabled to false; or
Increasing the VM:PM ratio, by setting yarn.nodemanager.vmem-pmem-ratio to some higher value.
References :
https://issues.apache.org/jira/browse/HADOOP-11364
http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/
Add the following properties in yarn-site.xml:
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
<description>Whether virtual memory limits will be enforced for containers</description>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>4</value>
<description>Ratio between virtual memory to physical memory when setting memory limits for containers</description>
</property>
I had a really similar issue using Hive in EMR. None of the existing solutions worked for me, i.e., none of the mapreduce configurations worked for me, and neither did setting yarn.nodemanager.vmem-check-enabled to false.
However, what ended up working was setting tez.am.resource.memory.mb, for example:
hive -hiveconf tez.am.resource.memory.mb=4096
Another setting to consider tweaking is yarn.app.mapreduce.am.resource.mb
I can't comment on the accepted answer due to low reputation. However, I would like to add that this behavior is by design. The NodeManager is killing your container. It sounds like you are trying to use Hadoop Streaming, which runs as a child process of the map-reduce task. The NodeManager monitors the entire process tree of the task, and if it eats up more memory than the maximum set in mapreduce.map.memory.mb or mapreduce.reduce.memory.mb respectively, we would expect the NodeManager to kill the task; otherwise your task would be stealing memory belonging to other containers, which you don't want.
While working with Spark in EMR I was having the same problem, and setting maximizeResourceAllocation=true did the trick; I hope it helps someone. You have to set it when you create the cluster. From the EMR docs:
aws emr create-cluster --release-label emr-5.4.0 --applications Name=Spark \
--instance-type m3.xlarge --instance-count 2 --service-role EMR_DefaultRole --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole --configurations https://s3.amazonaws.com/mybucket/myfolder/myConfig.json
Where myConfig.json should say:
[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  }
]
We also faced this issue recently. If the issue is related to mapper memory, a couple of things I would suggest checking:
Check whether the combiner is enabled or not. If yes, it means the reduce logic has to be run on all the records (the output of the mapper), and this happens in memory. Based on your application, you need to check whether enabling the combiner helps or not. The trade-off is between the network transfer bytes and the time/memory/CPU taken by the reduce logic on 'X' number of records.
If you feel the combiner is not of much value, just disable it.
If you need the combiner and 'X' is a huge number (say millions of records), then consider changing your split logic (for the default input formats, use a smaller block size; normally 1 block = 1 split) to map fewer records to a single mapper.
Check the number of records getting processed in a single mapper. Remember that all these records need to be sorted in memory (the output of the mapper is sorted). Consider setting mapreduce.task.io.sort.mb (default is 200 MB) in mapred-site.xml to a higher value if needed (see the sketch below).
If none of the above helps, try running the mapper logic as a standalone application and profiling it with a profiler (like JProfiler) to see where the memory is being used. This can give you very good insights.
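For the mapreduce.task.io.sort.mb suggestion above, a sketch of the setting in mapred-site.xml (512 is an illustrative value, not a recommendation):
<property>
<name>mapreduce.task.io.sort.mb</name>
<value>512</value> <!-- illustrative: in-memory sort buffer for map output, in MB -->
</property>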
Running YARN on the Windows Subsystem for Linux with Ubuntu, I got the error "running beyond virtual memory limits, Killing container".
I resolved it by disabling the virtual memory check in yarn-site.xml:
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
I am practicing Hadoop programs (Hadoop version 3). I installed a Linux OS via VirtualBox and allocated very limited memory at the time of the Linux installation.
After setting the following memory limit properties in mapred-site.xml and restarting HDFS and YARN, my program worked.
<property>
<name>mapreduce.map.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>4096</value>
</property>
I haven't personally checked, but hadoop-yarn-container-virtual-memory-understanding-and-solving-container-is-running-beyond-virtual-memory-limits-errors sounds very reasonable
I solved the issue by changing yarn.nodemanager.vmem-pmem-ratio to a higher value, and I would agree that:
Another less recommended solution is to disable the virtual memory check by setting yarn.nodemanager.vmem-check-enabled to false.
