Generating Java heap dump in Kubernetes container - performance

I am running a Jenkins Controller in Kubernetes. I have noticed that the controller has been restarting a lot.
kgp jkmaster-0
NAME READY STATUS RESTARTS AGE
jkmaster-0 1/1 Running 8 30m
The memory allocation for the pod is as follows:
Limits:
memory: 2500M
Requests:
cpu: 300m
memory: 1G
As long as the controller is idle, I don't see any spikes. But as soon as I start spawning jobs, I notice spikes, and each spike results in an OOM error and a restart:
kgp jkmaster-0
NAME READY STATUS RESTARTS AGE
jkmaster-0 0/1 OOMKilled 3 3h8m
In order to look into this further, I would like to generate a heap dump, so what I have done is add the following
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/srv/jenkins/
to JAVA_OPTS. I am expecting that the next time the Jenkins controller hits an OOM it should generate a heap dump under /srv/jenkins/, but there is none. Any idea if there is something I have missed?
There is no file matching java_pid<pid>.hprof under /srv/jenkins/ after a restart.
The full JAVA_OPTS:
JAVA_OPTS: -Djava.awt.headless=true -XX:InitialRAMPercentage=10.0 -XX:MaxRAMPercentage=60.0 -server -XX:NativeMemoryTracking=summary -XX:+UseG1GC -XX:+ExplicitGCInvokesConcurrent -XX:+ParallelRefProcEnabled -XX:+UseStringDeduplication \
-XX:+UnlockDiagnosticVMOptions -XX:G1SummarizeRSetStatsPeriod=1 -XX:+PrintFlagsFinal -Djenkins.install.runSetupWizard=false -Dhudson.DNSMultiCast.disabled=true \
-Dhudson.slaves.NodeProvisioner.initialDelay=5000 -Dsecurerandom.source=file:/dev/urandom \
-Xlog:gc:file=/srv/jenkins/gc-%t.log -Xlog:gc*=debug -XX:+AlwaysPreTouch -XX:+DisableExplicitGC \
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/srv/jenkins/ -Dhudson.model.ParametersAction.keepUndefinedParameters=true -Dhudson.model.DownloadService.noSignatureCheck=true
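One thing worth checking (this is general JVM behaviour, not something from the original post): -XX:+HeapDumpOnOutOfMemoryError only fires when the JVM itself throws java.lang.OutOfMemoryError. If the pod is killed by the kernel OOM killer because the container hits its memory limit (status OOMKilled), the process receives SIGKILL and no dump is written. A minimal, hypothetical way to confirm the flags and the writability of /srv/jenkins/ inside the container is to run a small allocator with the same options; the class name and sizes below are assumptions for illustration:

import java.util.ArrayList;
import java.util.List;

public class OomCheck {
    public static void main(String[] args) {
        // Keep allocating until the heap is exhausted; with
        // -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/srv/jenkins/
        // the JVM should write /srv/jenkins/java_pid<pid>.hprof before exiting.
        List<byte[]> chunks = new ArrayList<>();
        while (true) {
            chunks.add(new byte[10 * 1024 * 1024]); // 10 MB per iteration
        }
    }
}

Run it with, for example: java -Xmx128m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/srv/jenkins/ OomCheck
If a dump appears here but not when Jenkins restarts, the restarts are most likely kernel OOM kills rather than JVM OutOfMemoryErrors.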

Related

HDFS NameNode JvmPauseMonitor warning caused by Cloudera Manager agent

In our production cluster, we get a lot of warnings like this:
2020-01-21 09:08:00,711 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 2328ms
No GCs detected
As it shows, no GC was detected, but the JVM still paused.
We noticed that the warning happens every minute, and when it does, the CPU usage of cmf-agent increases to about 80%. So we tried shutting down cmf-agent, and once cmf-agent was down the warnings were gone.
We have disabled Cloudera Manager jstack monitoring; what else could the Cloudera Manager agent be doing that would cause a JVM pause?
We are using CDH 5.14.4.
Edit at 2020-01-22: the NameNode command line is:
/usr/java/default/bin/java -Dproc_namenode -Xmx1000m -Dhdfs.audit.logger=INFO,RFAAUDIT -Dsecurity.audit.logger=INFO,RFAS -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/var/log/hadoop-hdfs -Dhadoop.log.file=hadoop-cmf-hdfs-NAMENODE-1350.log.out -Dhadoop.home.dir=/opt/cloudera/parcels/CDH-5.14.4-1.cdh5.14.4.p0.3/lib/hadoop -Dhadoop.id.str=hdfs -Dhadoop.root.logger=INFO,RFA -Djava.library.path=/opt/cloudera/parcels/GPLEXTRAS-5.14.4-1.cdh5.14.4.p0.3/lib/hadoop/lib/native:/opt/cloudera/parcels/CDH-5.14.4-1.cdh5.14.4.p0.3/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Xms210453397504 -Xmx210453397504 -Xmn24g -XX:SurvivorRatio=2 -XX:+UseCompressedOops -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0 -XX:+CMSParallelRemarkEnabled -XX:+DisableExplicitGC -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70 -XX:SoftRefLRUPolicyMSPerMB=0 -verbose:GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintAdaptiveSizePolicy -XX:+PrintReferenceGC -XX:+UseGCLogFileRotation -XX:+PrintClassHistogramAfterFullGC -XX:+PrintClassHistogramBeforeFullGC -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=200M -Xloggc:/var/log/hadoop-hdfs/nn.gc.log -XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh -Dhadoop.security.logger=INFO,RFAS org.apache.hadoop.hdfs.server.namenode.NameNode
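For context, the pause detection itself is simple: the monitor sleeps for a fixed interval, compares the requested sleep time with the observed elapsed time, and checks GC counters over the same window; a large gap with no GC activity points to a stall outside the JVM (scheduler pressure, swapping, a busy host agent, etc.). A simplified sketch of the idea (not the actual Hadoop source):

public class PauseSketch {
    public static void main(String[] args) throws InterruptedException {
        final long sleepMs = 500;           // intended sleep per iteration
        final long warnThresholdMs = 1000;  // report pauses longer than this
        while (true) {
            long start = System.nanoTime();
            Thread.sleep(sleepMs);
            long extraMs = (System.nanoTime() - start) / 1_000_000 - sleepMs;
            if (extraMs > warnThresholdMs) {
                // If GC logs show no collection in this window, the pause
                // came from the host, not from the JVM itself.
                System.out.println("Detected pause of approximately " + extraMs + "ms");
            }
        }
    }
}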

Container is running beyond physical memory limits

I have a MapReduce job that processes 1.4 TB of data.
While running it, I get the error below.
The number of splits is 6444.
Before starting the job I set the following:
conf.set("mapreduce.map.memory.mb", "8192");
conf.set("mapreduce.reduce.memory.mb", "8192");
conf.set("mapreduce.map.java.opts.max.heap", "8192");
conf.set("mapreduce.map.java.opts", "-Xmx8192m");
conf.set("mapreduce.reduce.java.opts", "-Xmx8192m");
conf.set("mapreduce.job.heap.memory-mb.ratio", "0.8");
conf.set("mapreduce.task.timeout", "21600000");
The error:
2018-05-18 00:50:36,595 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1524473936587_2969_m_004719_3: Container [pid=11510,containerID=container_1524473936587_2969_01_004894] is running beyond physical memory limits. Current usage: 8.1 GB of 8 GB physical memory used; 8.8 GB of 16.8 GB virtual memory used. Killing container.
Dump of the process-tree for container_1524473936587_2969_01_004894 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 11560 11510 11510 11510 (java) 14960 2833 9460879360 2133706 /usr/lib/jvm/java-7-oracle-cloudera/bin/java
-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx8192m -Djava.io.tmpdir=/sdk/7/yarn/nm/usercache/administrator/appcache/application_1524473936587_2969/container_1524473936587_2969_01_004894/tmp
-Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1524473936587_2969/container_1524473936587_2969_01_004894
-Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 10.106.79.75 41869 attempt_1524473936587_2969_m_004719_3 4894
|- 11510 11508 11510 11510 (bash) 0 0 11497472 679 /bin/bash -c /usr/lib/jvm/java-7-oracle-cloudera/bin/java
-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx8192m -Djava.io.tmpdir=/sdk/7/yarn/nm/usercache/administrator/appcache/application_1524473936587_2969/container_1524473936587_2969_01_004894/tmp
-Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1524473936587_2969/container_1524473936587_2969_01_004894
-Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 10.106.79.75 41869 attempt_1524473936587_2969_m_004719_3 4894 1>/var/log/hadoop-yarn/container/application_1524473936587_2969/container_1524473936587_2969_01_004894/stdout 2>/var/log/hadoop-yarn/container/application_1524473936587_2969/container_1524473936587_2969_01_004894/stderr
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Any help would be really appreciated!
The setting mapreduce.map.memory.mb will set the physical memory size of the container running the mapper (mapreduce.reduce.memory.mb will do the same for the reducer container).
Be sure that you adjust the heap value as well. In newer versions of YARN/MRv2, the setting mapreduce.job.heap.memory-mb.ratio can be used to have it auto-adjust. The default is .8, so 80% of whatever the container size is will be allocated as the heap. Otherwise, adjust manually using the mapreduce.map.java.opts.max.heap and mapreduce.reduce.java.opts.max.heap settings.
BTW, I believe that 1 GB is the default and it is quite low. I recommend reading the link below. It provides a good understanding of YARN and MR memory settings, how they relate, and how to set some baseline settings based on the cluster node size (disk, memory, and cores).
Reference: http://community.cloudera.com/t5/Cloudera-Manager-Installation/ERROR-is-running-beyond-physical-memory-limits/td-p/55173
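As an illustration only (the numbers are assumptions, not a tuning recommendation for this cluster), keeping the heap below the container size in the same conf.set style as the question could look like this:

// Container gets 10 GB of physical memory; the heap is capped at 8 GB,
// leaving ~2 GB of headroom for metaspace, thread stacks, and native buffers.
conf.set("mapreduce.map.memory.mb", "10240");
conf.set("mapreduce.reduce.memory.mb", "10240");
conf.set("mapreduce.map.java.opts", "-Xmx8192m");
conf.set("mapreduce.reduce.java.opts", "-Xmx8192m");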
Try setting the YARN memory allocation limits:
SET yarn.scheduler.maximum-allocation-mb=16G;
SET yarn.scheduler.minimum-allocation-mb=8G;
You can look up other YARN settings here:
https://www.ibm.com/support/knowledgecenter/STXKQY_BDA_SHR/bl1bda_tuneyarn.htm
Try with: set yarn.app.mapreduce.am.resource.mb=1000;
The explanation is here:
In Spark, spark.driver.memoryOverhead is considered when calculating the total memory required for the driver. By default it is 0.10 of the driver memory, with a minimum of 384MB. In your case that is 8192MB + max(8192MB * 0.10, 384MB) = 8192MB + 819MB = 9011MB ~= 9G.
YARN allocates memory only in increments/multiples of yarn.scheduler.minimum-allocation-mb.
When yarn.scheduler.minimum-allocation-mb=4G, it can only allocate container sizes of 4G, 8G, 12G, etc. So if something like 9G is requested, it will round up to the next multiple and allocate a 12G container for the driver.
When yarn.scheduler.minimum-allocation-mb=1G, then container sizes of 8G, 9G, 10G are possible. The nearest rounded up size of 9G will be used in this case.
https://community.cloudera.com/t5/Support-Questions/Yarn-Container-is-running-beyond-physical-memory-limits-but/m-p/199353#M161393
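The rounding itself is just "round the request up to the next multiple of yarn.scheduler.minimum-allocation-mb"; a small sketch of the arithmetic (the method name is made up for illustration):

// Round a memory request up to the next multiple of yarn.scheduler.minimum-allocation-mb.
static long allocatedMb(long requestedMb, long minAllocationMb) {
    return ((requestedMb + minAllocationMb - 1) / minAllocationMb) * minAllocationMb;
}
// allocatedMb(9011, 4096) == 12288 (12G); allocatedMb(9011, 1024) == 9216 (9G)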

RMAppMaster is running beyond physical memory limits

I am trying to troubleshoot this puzzling issue: RMAppMaster oversteps its allocated container memory and is then killed by the node manager, even though the heap size is much smaller than the container size.
NM logs:
2017-12-01 11:18:49,863 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 14191 for container-id container_1506599288376_62101_01_000001: 1.0 GB of 1 GB physical memory used; 3.1 GB of 2.1 GB virtual memory used
2017-12-01 11:18:49,863 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Process tree for container: container_1506599288376_62101_01_000001 has processes older than 1 iteration running over the configured limit. Limit=1073741824, current usage = 1076969472
2017-12-01 11:18:49,863 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=14191,containerID=container_1506599288376_62101_01_000001] is running beyond physical memory limits. Current usage: 1.0 GB of 1 GB physical memory used; 3.1 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1506599288376_62101_01_000001 :
|- 14279 14191 14191 14191 (java) 4915 235 3167825920 262632 /usr/java/default//bin/java -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1506599288376_62101/container_1506599288376_62101_01_000001 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Djava.net.preferIPv4Stack=true -Xmx512m org.apache.hadoop.mapreduce.v2.app.MRAppMaster
|- 14191 14189 14191 14191 (bash) 0 1 108650496 300 /bin/bash -c /usr/java/default//bin/java -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1506599288376_62101/container_1506599288376_62101_01_000001 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Djava.net.preferIPv4Stack=true -Xmx512m org.apache.hadoop.mapreduce.v2.app.MRAppMaster 1>/var/log/hadoop-yarn/container/application_1506599288376_62101/container_1506599288376_62101_01_000001/stdout 2>/var/log/hadoop-yarn/container/application_1506599288376_62101/container_1506599288376_62101_01_000001/stderr
You can observe that while the heap size is set to 512MB, the physical memory observed by the NM grows to 1GB.
The application is an Oozie launcher (Hive task), so it has only one mapper, which does mostly nothing, and no reducer.
What baffles me is that only this specific instance of MRAppMaster is killed, and I cannot explain the 500MB overhead between the max heap size and the physical memory measured by the NM:
Other MRAppMaster instances run fine even with the default config (yarn.app.mapreduce.am.resource.mb = 1024 and yarn.app.mapreduce.am.command-opts = -Xmx825955249).
MRAppMaster does not run any application-specific code, so why is only this one having trouble? I expect MRAppMaster memory consumption to be roughly linear in the number of tasks/attempts, and this app has only one mapper.
-Xmx has been reduced to 512MB to see if the issue still happens with ~500MB of headroom. I expect MRAppMaster to consume very little native memory; what could those extra 500MB be?
I will try to work around the issue by increasing yarn.app.mapreduce.am.resource.mb, but I would really like to understand what is going on. Any idea?
config: cdh-5.4
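A sketch of that workaround in the same property style as above (the values are assumptions: keep the 512MB heap and give the AM container extra room to cover the observed non-heap usage):
yarn.app.mapreduce.am.resource.mb = 1536
yarn.app.mapreduce.am.command-opts = -Xmx512m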

Yarn container launch failed exception and mapred-site.xml configuration

I have 7 nodes in my Hadoop cluster [8 GB RAM and 4 vCPUs per node], 1 NameNode + 6 DataNodes.
EDIT-1 #ARNON: I followed the link, made the calculations according to the hardware configuration of my nodes, and added the updated mapred-site.xml and yarn-site.xml files to my question. Still my application is crashing with the same exception.
My MapReduce application has 34 input splits with a block size of 128MB.
mapred-site.xml has the following properties:
mapreduce.framework.name = yarn
mapred.child.java.opts = -Xmx2048m
mapreduce.map.memory.mb = 4096
mapreduce.map.java.opts = -Xmx2048m
yarn-site.xml has the following properties:
yarn.resourcemanager.hostname = hadoop-master
yarn.nodemanager.aux-services = mapreduce_shuffle
yarn.nodemanager.resource.memory-mb = 6144
yarn.scheduler.minimum-allocation-mb = 2048
yarn.scheduler.maximum-allocation-mb = 6144
EDIT-2 #ARNON: Setting yarn.scheduler.minimum-allocation-mb to 4096 puts all the map tasks in a suspended state, and setting it to 3072 crashes with the following:
Exception from container-launch: ExitCodeException exitCode=134: /bin/bash: line 1: 3876 Aborted (core dumped) /usr/lib/jvm/java-7-openjdk-amd64/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx8192m -Djava.io.tmpdir=/tmp/hadoop-ubuntu/nm-local-dir/usercache/ubuntu/appcache/application_1424264025191_0002/container_1424264025191_0002_01_000011/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/home/ubuntu/hadoop/logs/userlogs/application_1424264025191_0002/container_1424264025191_0002_01_000011
-Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 192.168.0.12 50842 attempt_1424264025191_0002_m_000005_0 11 >
/home/ubuntu/hadoop/logs/userlogs/application_1424264025191_0002/container_1424264025191_0002_01_000011/stdout 2>
/home/ubuntu/hadoop/logs/userlogs/application_1424264025191_0002/container_1424264025191_0002_01_000011/stderr
How can I avoid this? Any help is appreciated.
Is there an option to restrict the number of containers on Hadoop nodes?
It seems you are allocating too much memory to your tasks (even without looking at all the configurations): 8GB of RAM per node and 8GB per map task, all of which is heap.
Try lower allocations, e.g. 2GB per container with a 1GB heap, or something like that.
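For example, a sketch of that suggestion in the same mapred-site.xml property style as the question (values are approximate, not tuned for this cluster):
mapreduce.map.memory.mb = 2048
mapreduce.map.java.opts = -Xmx1024m
mapred.child.java.opts = -Xmx1024m
With yarn.nodemanager.resource.memory-mb = 6144, this allows up to three 2GB map containers per node instead of one oversized one.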

Sonar launch error

I am facing the error below when I try to start SonarQube using a MySQL database. Do I need to modify any details in the sonar.properties file with respect to the Elasticsearch configuration?
Has anyone faced a similar error before?
2014.12.15 21:38:49 WARN sea[o.e.transport.netty] [sonar-1418659692862] exception caught on transport layer [[id: ,9001]], closing connection
java.io.StreamCorruptedException: invalid internal transport message format
at org.elasticsearch.transport.netty.SizeHeaderFrameDecoder.decode(SizeHeaderFrameDecoder.java:46) ~[elasticsearch-1.1.2.jar:na]
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:425) ~[elasticsearch-1.1.2.jar:na]
In your sonar.properties, you have:
#--------------------------------------------------------------------------------------------------
# SEARCH INDEX
# Elasticsearch is used to facilitate fast and accurate information retrieval.
# It is executed in a dedicated Java process.
# JVM options. Note that enabling the HotSpot Server VM mode (-server) is recommended.
#sonar.search.javaOpts=-Xmx256m -Xms256m -Xss256k -Djava.net.preferIPv4Stack=true \
# -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 \
# -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError \
# -Djava.awt.headless=true
# Elasticsearch port. Default is 9001. Use 0 to get a free port.
# This port must be private and must not be exposed to the Internet.
#sonar.search.port=9001
Uncomment sonar.search.port and set it to 0.
Check for port availability.
Try working with a new MySQL DB instance.
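That is, in sonar.properties, uncomment the port line and set it to 0 so Elasticsearch picks a free port:
sonar.search.port=0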
