HADOOP exceeding memory limit - hadoop

I am receiving the following message under the unhealthy nodes section.
1/1 local-dirs usable space is below configured utilization percentage/no more usable space [ /tmp/hadoop-User/nm-local-dir : used space above threshold of 90.0% ]
1/1 log-dirs usable space is below configured utilization percentage/no more usable space [ C:/hadoop-3.3.0/logs/userlogs : used space above threshold of 90.0% ]
I saw a solution previously: either clean up the disk the Hadoop node is running on, or increase the limit in the YARN configuration file. I do not want to increase the value in the YARN file; I want to clean up the disk. Can anybody tell me what I should delete, i.e. where is the location in which Hadoop runs its jobs?
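For reference, a rough sketch of how to see what is filling those directories and clear out data left behind by finished applications. The paths are the ones from the health report above; the usercache/appcache layout is the standard NodeManager structure, so treat the exact commands and paths as assumptions for a Windows install.
# Check what is taking up space in the NodeManager local dir and the container logs.
du -sh /tmp/hadoop-User/nm-local-dir/*
du -sh C:/hadoop-3.3.0/logs/userlogs/*
# Leftovers from finished applications usually sit under usercache/<user>/appcache
# and in per-application log directories. Stop the NodeManager before deleting them.
yarn --daemon stop nodemanager
rm -rf /tmp/hadoop-User/nm-local-dir/usercache/*/appcache/*
rm -rf C:/hadoop-3.3.0/logs/userlogs/application_*
yarn --daemon start nodemanager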

Related

Does Kubernetes consider the current memory usage when scheduling pods

The Kubernetes docs on https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/ state:
The scheduler ensures that, for each resource type, the sum of the resource requests of the scheduled Containers is less than the capacity of the node.
Does Kubernetes consider the current state of the node when calculating capacity? To highlight what I mean, here is a concrete example:
Assuming I have a node with 10Gi of RAM, running 10 Pods each with 500Mi of resource requests, and no limits. Let's say they are "bursting", and each Pod is actually using 1Gi of RAM. In this case, the node is fully utilized (10 x 1Gi = 10Gi), but the resources requests are only 10 x 500Mi = 5Gi. Would Kubernetes consider scheduling another pod on this node because only 50% of the memory capacity on the node has been requested, or would it use the fact that 100% of the memory is currently being utilized, and the node is at full capacity?
By default kubernetes will use cgroups to manage and monitor the "allocatable" memory on a node for pods. It is possible to configure the kubelet to rely entirely on static reservations and pod requests from your deployments, though, so the method depends on your cluster deployment.
In either case, a node itself will track "memory pressure", which monitors the existing overall memory usage of a node. If a node is under memory pressure then no new pods will be scheduled and existing pods will be evicted.
It's best to set sensible memory requests and limits for all workloads to help the scheduler as much as possible.
If a kubernetes deployment does not configure cgroup memory monitoring, setting requests is a requirement for all workloads.
If the deployment is using cgroup memory monitoring, setting requests at least gives the scheduler extra detail as to whether the pods to be scheduled will fit on a node.
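As a minimal sketch of what that looks like, here is a pod spec with an explicit memory request and limit (the pod name and image are placeholders, not anything from the question):
# Minimal sketch: a pod with an explicit memory request and limit.
# The name "demo" and the nginx image are placeholders.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: demo
spec:
  containers:
  - name: demo
    image: nginx
    resources:
      requests:
        memory: "500Mi"   # what the scheduler reserves on the node
      limits:
        memory: "1Gi"     # the container becomes a kill candidate above this
EOF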
Capacity and Allocatable Resources
The Kubernetes Reserve Compute Resources docco has a good overview of how memory is viewed on a node.
      Node Capacity
---------------------------
|     kube-reserved       |
|-------------------------|
|     system-reserved     |
|-------------------------|
|    eviction-threshold   |
|-------------------------|
|                         |
|      allocatable        |
|   (available for pods)  |
|                         |
|                         |
---------------------------
The default scheduler checks that a node isn't under memory pressure, then looks at the allocatable memory available on the node and whether the new pod's requests will fit in it.
The allocatable memory available is the total-available-memory - kube-reserved - system-reserved - eviction-threshold - scheduled-pods.
Scheduled Pods
The value for scheduled-pods can be calculated via a dynamic cgroup, or statically via the pods resource requests.
The kubelet --cgroups-per-qos option, which defaults to true, enables cgroup tracking of scheduled pods. The pods kubernetes runs will be placed in child cgroups under a dedicated kubepods cgroup, which is how their usage is tracked dynamically.
If --cgroups-per-qos=false then the allocatable memory will only be reduced by the resource requests of the pods scheduled on a node.
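A rough way to see both views on a node (cgroup v1 paths; the exact kubepods cgroup name depends on your cgroup driver, so treat the path as an assumption):
# Dynamic view: what the kubepods cgroup is actually using.
cat /sys/fs/cgroup/memory/kubepods/memory.usage_in_bytes
# Static view: the sum of pod resource requests the scheduler accounts for on the node.
kubectl describe node node01 | grep -A 8 "Allocated resources"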
Eviction Threshold
The eviction-threshold is the level of free memory at which Kubernetes starts evicting pods. This defaults to 100MB but can be set via the kubelet command line. This setting is tied both to the allocatable value for a node and to the memory pressure state of a node described in the next section.
System Reserved
The kubelet's system-reserved value can be configured as a static value (--system-reserved=) or monitored dynamically via cgroup (--system-reserved-cgroup=).
This is for any system daemons running outside of kubernetes (sshd, systemd etc). If you configure a cgroup, the processes all need to be placed in that cgroup.
Kube Reserved
The kubelet's kube-reserved value can be configured as a static value (via --kube-reserved=) or monitored dynamically via cgroup (--kube-reserved-cgroup=).
This is for Kubernetes components running outside of pods, usually the kubelet and a container runtime.
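A sketch of the static form of these reservations as kubelet flags (the sizes are purely illustrative, not recommendations):
# Static reservations passed to the kubelet; the sizes here are only illustrative.
kubelet \
  --kube-reserved=cpu=250m,memory=500Mi \
  --system-reserved=cpu=250m,memory=500Mi \
  --eviction-hard='memory.available<100Mi' \
  --enforce-node-allocatable=pods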
Capacity and Availability on a Node
Capacity is stored in the Node object.
$ kubectl get node node01 -o json | jq '.status.capacity'
{
  "cpu": "2",
  "ephemeral-storage": "61252420Ki",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "4042284Ki",
  "pods": "110"
}
The allocatable value can be found on the Node as well; note that existing usage doesn't change this value. Only scheduling pods with resource requests will take away from the allocatable value.
$ kubectl get node node01 -o json | jq '.status.allocatable'
{
  "cpu": "2",
  "ephemeral-storage": "56450230179",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "3939884Ki",
  "pods": "110"
}
Memory Usage and Pressure
A kube node can also have a "memory pressure" event. This check is done outside of the allocatable resource checks above and is more a system level catch all. Memory pressure looks at the current root cgroup memory usage minus the inactive file cache/buffers, similar to the calculation free does to remove the file cache.
A node under memory pressure will not have pods scheduled, and will actively try and evict existing pods until the memory pressure state is resolved.
You can set the eviction threshold amount of memory kubelet will maintain available via the --eviction-hard=[memory.available<500Mi] flag. The memory requests and usage for pods can help inform the eviction process.
kubectl top node will give you the existing memory stats for each node (if you have a metrics service running).
$ kubectl top node
NAME     CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node01   141m         7%     865Mi           22%
If you are not using cgroups-per-qos and have a number of pods running without resource limits, or a number of system daemons, then the cluster is likely to have problems scheduling on a memory constrained system: allocatable will look high but the amount of memory actually available might be really low.
Memory Pressure calculation
The Kubernetes Out Of Resource Handling docco includes a script which emulates the kubelet's memory monitoring process:
# This script reproduces what the kubelet does
# to calculate memory.available relative to root cgroup.
# current memory usage
memory_capacity_in_kb=$(cat /proc/meminfo | grep MemTotal | awk '{print $2}')
memory_capacity_in_bytes=$((memory_capacity_in_kb * 1024))
memory_usage_in_bytes=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
memory_total_inactive_file=$(cat /sys/fs/cgroup/memory/memory.stat | grep total_inactive_file | awk '{print $2}')
memory_working_set=${memory_usage_in_bytes}
if [ "$memory_working_set" -lt "$memory_total_inactive_file" ];
then
memory_working_set=0
else
memory_working_set=$((memory_usage_in_bytes - memory_total_inactive_file))
fi
memory_available_in_bytes=$((memory_capacity_in_bytes - memory_working_set))
memory_available_in_kb=$((memory_available_in_bytes / 1024))
memory_available_in_mb=$((memory_available_in_kb / 1024))
echo "memory.capacity_in_bytes $memory_capacity_in_bytes"
echo "memory.usage_in_bytes $memory_usage_in_bytes"
echo "memory.total_inactive_file $memory_total_inactive_file"
echo "memory.working_set $memory_working_set"
echo "memory.available_in_bytes $memory_available_in_bytes"
echo "memory.available_in_kb $memory_available_in_kb"
echo "memory.available_in_mb $memory_available_in_mb"
Definitely YES, Kubernetes considers memory usage during the pod scheduling process.
The scheduler ensures that, for each resource type, the sum of the resource requests of the scheduled containers is less than the capacity of the node. Note that although actual memory or CPU resource usage on nodes is very low, the scheduler still refuses to place a Pod on a node if the capacity check fails. This protects against a resource shortage on a node when resource usage later increases, for example, during a daily peak in request rate.
There are two key concepts in scheduling. First, the scheduler attempts to filter the nodes that are capable of running a given pod based on resource requests and other scheduling requirements. Second, the scheduler weighs the eligible nodes based on absolute and relative resource utilization of the nodes and other factors. The highest weighted eligible node is selected for scheduling of the pod.
A good explanation of scheduling in Kubernetes can be found here: kubernetes-scheduling.
Simple example: your pod normally uses 100 Mi of RAM but you run it with a 50 Mi request. If you have a node with 75 Mi free, the scheduler may choose to run the pod there. When the pod's memory consumption later expands to 100 Mi, this puts the node under pressure, at which point the kernel may choose to kill your process. So it is important to get both memory requests and memory limits right. About memory usage, requests and limits you can read more here: memory-resource.
A container can exceed its memory request if the node has memory available. But a container is not allowed to use more than its memory limit. If a container allocates more memory than its limit, the container becomes a candidate for termination. If the container continues to consume memory beyond its limit, the container is terminated. If a terminated container can be restarted the kubelet restarts it, as with any other type of runtime failure.
I hope it helps.
Yes, Kubernetes will consider current memory usage when scheduling Pods (not just requests), so your new Pod wouldn't get scheduled on the full node. Of course, there are also a number of other factors.
(FWIW, when it comes to resources, a request helps the scheduler by declaring a baseline value, and a limit kills the Pod when resources exceed that value, which helps with capacity planning/estimation.)

How to limit a disk usage on DataNode without causing Hadoop to enter safemode?

I have a 3-node Hadoop 2.7.3 cluster which can be described as follows:
Node A: 25gb, DataNode, NameNode
Node B: 50gb, DataNode
Node C: 25gb, DataNode
The problem is that on node A there is high disk usage (about 95%). What I would like to achieve is to limit the disk usage so that it never goes above 85%.
I tried to set the dfs.namenode.resource.du.reserved property to about 3 GB, but it does not solve my problem, because as soon as the available disk space is lower than that value, my Hadoop enters safemode immediately.
I know that all required resources must be available for the NN to continue operating and that the NN will continue to operate as long as any redundant resource is available.
Also, I know about dfs.namenode.edits.dir.required property which defines required resources, but I don't think that making NN redundant instead of required is a good idea.
So my question is as in the topic. How can I say to Hadoop: "Hey, listen. This is a datanode, put anything you want here, but if the disk usage goes higher than 85% then do not panic, just stop putting anything here and continue doing your thing on the rest of the DataNodes."?
Am I missing something? Is it even possible? If not, then what would you suggest I do?
There is a process called Namenode resource checker which scans the Namenode storage volumes for available free disk space. Whenever the available free space falls below the value specified in the dfs.namenode.resource.du.reserved property (default 100MB), it forces the Namenode to enter safemode.
Setting it to 3GB means that much free space is expected on this node. But the Datanode would consume all the available free space for its data storage, without considering the disk space requirements of the Namenode.
To limit the Datanode's disk usage on this particular node, add this property to hdfs-site.xml:
<property>
  <name>dfs.datanode.du.reserved</name>
  <value>3221225472</value>
  <description>3GB of disk space reserved for non DFS usage.
  This space will be left unconsumed by the Datanode.
  </description>
</property>
Change the reserved space value as per your required threshold.
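After changing it, the Datanode has to be restarted to pick up the new value. A sketch of checking the effect, assuming the standard Hadoop 2.x scripts and that you run the commands as the HDFS user:
# Restart the Datanode so dfs.datanode.du.reserved takes effect.
$HADOOP_HOME/sbin/hadoop-daemon.sh stop datanode
$HADOOP_HOME/sbin/hadoop-daemon.sh start datanode
# Check per-datanode capacity, DFS used and remaining space.
hdfs dfsadmin -report
# If the Namenode is already stuck in safemode and has enough free space again:
hdfs dfsadmin -safemode get
hdfs dfsadmin -safemode leave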

spark + yarn cluster: how can i configure physical node to run only one executor\task each time?

I have an environment that combines 4 physical nodes with a small amount of RAM and each has 8 CPU cores.
I noticed that spark automatically decides to split the RAM across the CPU cores. The result is that memory errors occur.
I'm working with big data structures, and I want each executor to have the entire RAM of the physical node (otherwise I'll get a memory error).
I tried configuring 'yarn.nodemanager.resource.cpu-vcores' to 1 in the 'yarn-site.xml' file and 'spark.driver.cores' to 1 in spark-defaults.conf, without any success.
Try setting spark.executor.cores to 1, for example:
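A minimal sketch at submit time; the memory sizes and the application file are placeholders you would tune to your nodes' RAM, and combined with yarn.nodemanager.resource.cpu-vcores=1 this ends up with a single executor per node:
# One core per executor; with 1 vcore per NodeManager this gives one executor per node.
# Memory sizes and my_app.py are placeholders; leave headroom for YARN and the OS.
spark-submit \
  --master yarn \
  --conf spark.executor.cores=1 \
  --conf spark.executor.instances=4 \
  --executor-memory 6g \
  --conf spark.yarn.executor.memoryOverhead=1024 \
  my_app.py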

In YARN, how is the container size determined?

In YARN application, how does ApplicationMaster decide on the size of the container? I understand there are parameters controlling on the minimum memory allocation, vcores ratio etc. But how does application master understand that it needs so much amount of memory and so many CPUs for a particular job - either MapReduce / Spark?
First let me explain in one or two lines how YARN works, then we will go through the questions.
So let's assume we have 100GB of total YARN cluster memory and 1GB minimum-allocation-mb, then we have 100 max containers. If we set the minimum allocation to 4GB, then we have 25 max containers.
Each application will get the memory it asks for rounded up to the next container size. So if the minimum is 4GB and you ask for 4.5GB you will get 8GB.
If the job/task memory requirement turns out to be bigger than the allocated container size, YARN will shoot down (kill) that container.
Now coming back to your original question: how does the YARN ApplicationMaster decide how much memory and CPU is required for a particular job?
The YARN ResourceManager (RM) allocates resources to the application through logical queues, which include memory, CPU, and disk resources.
By default, the RM will allow up to 8192MB ("yarn.scheduler.maximum-allocation-mb") to an Application Master (AM) container allocation request.
The default minimum allocation is 1024MB ("yarn.scheduler.minimum-allocation-mb").
The AM can only request resources from the RM that are in increments of ("yarn.scheduler.minimum-allocation-mb") and do not exceed ("yarn.scheduler.maximum-allocation-mb").
The AM is responsible for rounding off ("mapreduce.map.memory.mb") and ("mapreduce.reduce.memory.mb") to a value divisible by the ("yarn.scheduler.minimum-allocation-mb").
The RM will deny an allocation greater than 8192MB or a value not divisible by 1024MB.
The following YARN and MapReduce parameters need to be set to change the default memory requirements:
For YARN
yarn.scheduler.minimum-allocation-mb
yarn.scheduler.maximum-allocation-mb
yarn.nodemanager.vmem-pmem-ratio
yarn.nodemanager.resource.memory-mb
For MapReduce
mapreduce.map.java.opts
mapreduce.map.memory.mb
mapreduce.reduce.java.opts
mapreduce.reduce.memory.mb
So the conclusion is that the ApplicationMaster doesn't use any logic to calculate the resource (memory/CPU) requirements for a particular job. It simply uses the values of the parameters mentioned above, as in the sketch below.
If a job doesn't fit in the given container size (including virtual memory), the NodeManager simply kills the container.
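A sketch of overriding those values for a single MapReduce job rather than cluster-wide; the jar, class and paths are placeholders, and the -D options only apply if the driver goes through ToolRunner/GenericOptionsParser:
# Per-job container sizes; heap (-Xmx) is kept below the container size to leave room for overhead.
hadoop jar my-job.jar com.example.MyDriver \
  -Dmapreduce.map.memory.mb=4096 \
  -Dmapreduce.map.java.opts=-Xmx3276m \
  -Dmapreduce.reduce.memory.mb=8192 \
  -Dmapreduce.reduce.java.opts=-Xmx6553m \
  /input /output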

Multiple volume & limit disk usage with Hadoop

I am using Hadoop to process a large set of data. I set up a Hadoop node to use multiple volumes: one of these volumes is a NAS with a 10 TB disk, and the other one is the local disk of the server with a storage capacity of 400 GB.
The problem, if I understood correctly, is that datanodes attempt to place an equal amount of data in each volume. Thus, when I run a job on a large set of data, the 400 GB disk quickly becomes full, while the 10 TB disk has plenty of space remaining. Then the map-reduce jobs produced by Hive freeze because my cluster turns on safe mode...
I tried to set the property to limit the Datanode's disk usage, but it does nothing: I still have the same problem.
Hope that someone could help me.
Well, it seems that my map-reduce program turned on safe mode because:
The ratio of reported blocks 0.0000 has not reached the threshold 0.9990.
I saw that error on the namenode web interface. I want to disable this option with the dfs.safemode.threshold.pct property, but I do not know if that is a good way to solve it.
I think you can turn to dfs.datanode.fsdataset.volume.choosing.policy for help.
<property>
  <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
  <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
</property>
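The Datanode needs a restart for the policy to take effect; you can confirm the value was picked up from hdfs-site.xml with something like:
# Confirm the configured volume choosing policy (after restarting the Datanode).
hdfs getconf -confKey dfs.datanode.fsdataset.volume.choosing.policy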
Use the dfs.datanode.du.reserved configuration setting in $HADOOP_HOME/conf/hdfs-site.xml for limiting disk usage.
Reference
<property>
  <name>dfs.datanode.du.reserved</name>
  <!-- cluster variant -->
  <value>182400</value>
  <description>Reserved space in bytes per volume. Always leave this much space free for non dfs use.
  </description>
</property>
