CMS class unloading takes too much time (Java 8)

Under heavy load we noticed a large GC pause (~400 ms) in our application. During investigation it turned out that the pause happens during CMS Final Remark, and that the class unloading phase takes far longer than the other phases (10x-100x):
(CMS Final Remark)
[YG occupancy: 142247 K (294912 K)]
2019-03-13T07:38:30.656-0700: 24252.576:
[Rescan (parallel) , 0.0216770 secs]
2019-03-13T07:38:30.677-0700: 24252.598:
[weak refs processing, 0.0028353 secs]
2019-03-13T07:38:30.680-0700: 24252.601:
[class unloading, 0.3232543 secs]
2019-03-13T07:38:31.004-0700: 24252.924:
[scrub symbol table, 0.0371301 secs]
2019-03-13T07:38:31.041-0700: 24252.961:
[scrub string table, 0.0126352 secs]
[1 CMS-remark: 2062947K(4792320K)] 2205195K(5087232K), 0.3986822 secs]
[Times: user=0.63 sys=0.01, real=0.40 secs]
Total time for which application threads were stopped: 0.4156259 seconds,
Stopping threads took: 0.0014133 seconds
The pause always happens within the first second of the performance test, and its duration varies from 300 ms to 400+ ms.
Unfortunately, I have no access to the server right now (it is under maintenance) and only have logs from several test runs. I want to be prepared for further investigation once the server is available again, but I have no idea what causes this behavior.
My first thought was Linux huge pages, but we don't use them.
After spending more time with the logs, I found the following:
Heap after GC invocations=7969 (full 511):
par new generation total 294912K, used 23686K [0x0000000687800000, 0x000000069b800000, 0x000000069b800000)
eden space 262144K, 0% used [0x0000000687800000, 0x0000000687800000, 0x0000000697800000)
from space 32768K, 72% used [0x0000000699800000, 0x000000069af219b8, 0x000000069b800000)
to space 32768K, 0% used [0x0000000697800000, 0x0000000697800000, 0x0000000699800000)
concurrent mark-sweep generation total 4792320K, used 2062947K [0x000000069b800000, 0x00000007c0000000, 0x00000007c0000000)
Metaspace used 282286K, capacity 297017K, committed 309256K, reserved 1320960K
class space used 33038K, capacity 36852K, committed 38872K, reserved 1048576K
}
Heap after GC invocations=7970 (full 511):
par new generation total 294912K, used 27099K [0x0000000687800000, 0x000000069b800000, 0x000000069b800000)
eden space 262144K, 0% used [0x0000000687800000, 0x0000000687800000, 0x0000000697800000)
from space 32768K, 82% used [0x0000000697800000, 0x0000000699276df0, 0x0000000699800000)
to space 32768K, 0% used [0x0000000699800000, 0x0000000699800000, 0x000000069b800000)
concurrent mark-sweep generation total 4792320K, used 2066069K [0x000000069b800000, 0x00000007c0000000, 0x00000007c0000000)
Metaspace used 282303K, capacity 297017K, committed 309256K, reserved 1320960K
class space used 33038K, capacity 36852K, committed 38872K, reserved 1048576K
}
The GC pause under investigation happens between GC invocations 7969 and 7970, and the amount of used metaspace is almost the same (it actually increased slightly).
So it looks like the time is not spent on stale classes that are no longer used (no space was freed), and it is not a safepoint-reaching issue either, since stopping the threads took very little time (0.0014133 s).
How should I investigate such a case, and what diagnostic information should I collect so I am properly prepared?
Technical details
CentOS 5 + JDK 8 + CMS GC with the following arguments:
-XX:+CMSClassUnloadingEnabled
-XX:CMSIncrementalDutyCycleMin=10
-XX:+CMSIncrementalPacing
-XX:CMSInitiatingOccupancyFraction=50
-XX:+CMSParallelRemarkEnabled
-XX:+DisableExplicitGC
-XX:InitialHeapSize=5242880000
-XX:MaxHeapSize=5242880000
-XX:MaxNewSize=335544320
-XX:MaxTenuringThreshold=6
-XX:NewSize=335544320
-XX:OldPLABSize=16
-XX:+UseCompressedClassPointers
-XX:+UseCompressedOops
-XX:+UseConcMarkSweepGC
-XX:+UseParNewGC
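For the next run, capturing more detail around the remark pause should narrow this down. A minimal set of standard JDK 8 HotSpot flags for that (a suggestion to adapt, not a prescription; some of them are clearly already enabled, judging by the log):
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintSafepointStatistics
-XX:PrintSafepointStatisticsCount=1
-XX:+TraceClassUnloading
-XX:+TraceClassLoading
-XX:+PrintReferenceGC
TraceClassUnloading/TraceClassLoading show which classes are actually being loaded and unloaded around the slow remark (heavy proxy, lambda, or reflection class churn is a common suspect), PrintSafepointStatistics confirms whether time is spent reaching or inside the safepoint, and PrintReferenceGC breaks down the reference processing that runs in the same stop-the-world window.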

Related

Pod sizing with Actuator metrics jvm.memory.max

I am trying to size our pods using the Actuator metrics info, with the below K8s resource quota configuration:
resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "512Mi"
We are observing that jvm.memory.max returns ~1455 MB. I understand that this value includes heap and non-heap. Drilling further into the API (jvm.memory.max?tag=area:nonheap and jvm.memory.max?tag=area:heap) gives ~1325 MB and ~129 MB respectively.
Obviously, with non-heap allowed to max out at a value greater than the K8s limit, the container is bound to get killed eventually. But why is the JVM's non-heap memory not bounded by the container's memory configuration (as set in K8s)?
The above observations hold for both Java 8 and Java 11. The blog below discusses the experimental options for Java 8, where CPU and heap configuration are covered, but non-heap is not mentioned. What should be considered when sizing the pods?
-XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap
Source
Java 8 has a few flags that can help the runtime operate in a more container aware manner:
java -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -jar app.jar
Why do you get a maximum JVM heap of 129 MB when you set the container memory limit to 512 MB? The answer is that memory consumption in the JVM includes both heap and non-heap memory. The memory required for class metadata, JIT-compiled code, thread stacks, GC, and other internals is taken from non-heap memory. Therefore, based on the cgroup resource restrictions, the JVM reserves a portion of the memory for non-heap use to ensure stability.
The exact amount of non-heap memory can vary widely, but a safe bet if you're doing resource planning is that the heap is about 80% of the JVM's total memory. So if you set the maximum heap to 1000 MB, you can expect the whole JVM to need around 1250 MB.
The JVM read that the container is limited to 512 MB and created a JVM with a maximum heap size of ~129 MB - exactly 1/4 of the container memory, as described on the JDK ergonomics page.
If you dig into the JVM tuning guide, you will see the following:
Unless the initial and maximum heap sizes are specified on the command line, they're calculated based on the amount of memory on the machine. The default maximum heap size is one-fourth of the physical memory while the initial heap size is 1/64th of physical memory. The maximum amount of space allocated to the young generation is one third of the total heap size.
You can find more information about it here.
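If the goal is to keep the whole JVM under the 512Mi limit, one common approach is to stop relying on the 1/4-of-RAM ergonomic default and cap the major pools explicitly so their sum fits the container. A sketch with purely illustrative values (the numbers are assumptions to tune per application, and -XX:MaxRAMPercentage requires JDK 8u191+ or JDK 10+):
java -XX:MaxRAMPercentage=50.0 \
     -XX:MaxMetaspaceSize=96m \
     -XX:ReservedCodeCacheSize=64m \
     -XX:MaxDirectMemorySize=32m \
     -Xss512k \
     -jar app.jar
With 50% of 512Mi given to the heap (~256 MB), the remaining ~256 MB has to cover metaspace, code cache, direct buffers, GC overhead and thread stacks; the stack budget scales with thread count (e.g. 100 threads x 512k is roughly 50 MB), so it is usually the term worth estimating first. Running with -XX:NativeMemoryTracking=summary and inspecting jcmd <pid> VM.native_memory shows where the non-heap memory actually goes.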

What algorithm does the G1 GC use to resize the Eden region?

I found a GC problem in my application: young GC activity suddenly becomes 10 times what it normally is, which turns out to be the result of a small Eden region. From the GC log I found that the Eden size is usually 5000M-7000M, but it stayed at 528M when the problem occurred. I don't know what caused this.
I have read a lot of blog posts researching the algorithm that adjusts the Eden size, and most of them just say that the Eden size is changed automatically based on previous GC pause times. The only rough answer I found is in https://product.hubspot.com/blog/g1gc-fundamentals-lessons-from-taming-garbage-collection:
if (recent_STW_time < MaxGCPauseMillis)
    eden = min(100% - G1ReservePercent - Tenured, G1MaxNewSizePercent)
else
    eden = min(100% - Tenured, G1NewSizePercent)
But according to this, the Eden size should be 4000M-6500M in my application.
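Not the exact algorithm, but the knobs in that pseudocode map onto real HotSpot flags, and one way to keep Eden from collapsing to its lower bound is to narrow the range the policy may choose from. G1NewSizePercent and G1MaxNewSizePercent are experimental flags, so they need unlocking; the values below are purely illustrative:
-XX:+UnlockExperimentalVMOptions
-XX:G1NewSizePercent=20
-XX:G1MaxNewSizePercent=60
-XX:MaxGCPauseMillis=200
Note that fixing the young generation outright with -Xmn (or NewSize/MaxNewSize) disables G1's adaptive sizing and overrides the pause-time goal, so bounding the percentage range is usually the gentler option. According to the quoted heuristic, a sudden shrink like the 528M case would mean recent pauses exceeded MaxGCPauseMillis and the policy fell onto the else branch, which is bounded by the much smaller G1NewSizePercent.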

Reading Go gctrace output

I have gctrace output that looks like this:
gc 6 @48.155s 15%: 0.093+12360+0.32 ms clock, 0.18+7720/21356/3615+0.65 ms cpu, 11039->13278->6876 MB, 14183 MB goal, 8 P
I am not sure how to read the CPU times in particular. I understand that it is broken down into three phases (STW sweep termination, concurrent mark/scan, and STW mark termination), but I'm not sure what the + signs mean (i.e. 0.18+7720 and 3615+0.65). What do these + signs signify?
In your case, they look like assist and termination times:
// CPU time
0.18ms  : STW sweep termination.
7720ms  : Mark/Scan - assist time (GC performed in line with allocation).
21356ms : Mark/Scan - background GC time.
3615ms  : Mark/Scan - idle GC time.
0.65ms  : STW mark termination.
I think it changes (or may change) across Go versions; you can find more detailed info in the runtime package docs.
Currently, it is:
gc # @#s #%: #+#+# ms clock, #+#/#/#+# ms cpu, #->#-># MB, # MB goal, # P
where the fields are as follows:
gc #          the GC number, incremented at each GC
@#s           time in seconds since program start
#%            percentage of time spent in GC since program start
#+...+#       wall-clock/CPU times for the phases of the GC
#->#-># MB    heap size at GC start, at GC end, and live heap
# MB goal     goal heap size
# P           number of processors used
Example here
See also Interpreting GC trace output
gc 6 @48.155s 15%: 0.093+12360+0.32 ms clock,
0.18+7720/21356/3615+0.65 ms cpu, 11039->13278->6876 MB, 14183 MB goal, 8 P

gc 6                              the GC number, incremented at each GC
@48.155s                          time in seconds since program start
15%:                              percentage of time spent in GC since program start
0.093+12360+0.32 ms clock         stop-the-world (STW) sweep termination + concurrent mark and scan + STW mark termination
0.18+7720/21356/3615+0.65 ms cpu  STW sweep termination + mark/scan (assist time, i.e. GC performed in line with allocation / background GC time / idle GC time) + STW mark termination
11039->13278->6876 MB             heap size at GC start, at GC end, and live heap
14183 MB goal                     goal heap size
8 P                               number of processors used
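If you want to reproduce a line like this and watch the fields change, the trace is enabled through the GODEBUG environment variable. A minimal throwaway program (the allocation sizes are arbitrary) that triggers several collections:

GODEBUG=gctrace=1 go run main.go

package main

import "runtime"

// Allocate and drop memory in a loop so the runtime performs several GCs,
// each of which prints one gctrace line to stderr.
func main() {
	var keep [][]byte
	for i := 0; i < 500000; i++ {
		keep = append(keep, make([]byte, 4096))
		if len(keep) > 50000 {
			keep = nil // drop all references so the live heap shrinks again
		}
	}
	runtime.GC() // force one last collection (and one last trace line)
}

Each collection writes one line in the format quoted above, so you can correlate the clock and cpu sections with what the program is doing at that moment.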

discrepancy between htop and golang readmemstats

My program loads a lot of data at start up and then calls debug.FreeOSMemory() so that any extra space is given back immediately.
loadDataIntoMem()
debug.FreeOSMemory()
After loading the data into memory, htop shows me the following for the process:
VIRT RES SHR
11.6G 7629M 8000
But a call to runtime.ReadMemStats shows me the following
Alloc 5593336608 5.3G
BuckHashSys 1574016 1.6M
HeapAlloc 5593336610 5.3G
HeapIdle 2607980544 2.5G
HeapInuse 7062446080 6.6G
HeapReleased 2607980544 2.5G
HeapSys 9670426624 9.1G
MCacheInuse 9600 9.4K
MCacheSys 16384 16K
MSpanInuse 106776176 102M
MSpanSys 115785728 111M
OtherSys 25638523 25M
StackInuse 589824 576K
StackSys 589824 576K
Sys 10426738360 9.8G
TotalAlloc 50754542056 48G
Alloc is the amount obtained from the system and not yet freed (this is resident memory, right?), but there is a big difference between the two.
I rely on HeapIdle to kill my program, i.e. if HeapIdle is more than 2 GB, restart it. In this case it is 2.5 GB and isn't going down even after a while. Go should allocate from the idle heap when it needs more memory in the future, thus reducing HeapIdle, right?
If assumption 1 is wrong, which stat can accurately tell me what the RES value in htop will be?
What can I do to reduce the value of HeapIdle?
This was tried on Go 1.4.2, 1.5.2 and 1.6beta1.
The effective memory consumption of your program will be Sys-HeapReleased. This still won't be exactly what the OS reports, because the OS can choose to allocate memory how it sees fit based on the requests of the program.
If your program runs for any appreciable amount of time, the excess memory will be offered back to the OS so there's no need to call debug.FreeOSMemory(). It's also not the job of the garbage collector to keep memory as low as possible; the goal is to use memory as efficiently as possible. This requires some overhead, and room for future allocations.
If you're having trouble with memory usage, it would be a lot more productive to profile your program and see why you're allocating more than expected, instead of killing your process based on incorrect assumptions about memory.
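To see the figure this answer describes without guessing, you can print Sys-HeapReleased next to the individual pools. A small sketch (nothing here is specific to the asker's program):

package main

import (
	"fmt"
	"runtime"
)

func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	// Sys is what the runtime obtained from the OS; HeapReleased has been
	// returned, so the difference approximates the footprint the OS still sees.
	fmt.Printf("Sys:          %5d MB\n", m.Sys>>20)
	fmt.Printf("HeapInuse:    %5d MB\n", m.HeapInuse>>20)
	fmt.Printf("HeapIdle:     %5d MB\n", m.HeapIdle>>20)
	fmt.Printf("HeapReleased: %5d MB\n", m.HeapReleased>>20)
	fmt.Printf("Sys - HeapReleased: %d MB\n", (m.Sys-m.HeapReleased)>>20)
}

Note that HeapIdle includes HeapReleased; in the numbers above they are equal (2.5G each), so everything idle has already been given back, and Sys-HeapReleased is about 7.3G, which is close to the 7629M RES that htop reports.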

Windows program has big native heap, much larger than all allocations

We are running a mixed mode process (managed + unmanaged) on Win 7 64 bit.
Our process is using up too much memory (especially VM). Based on our analysis, the majority of the memory is used by a big native heap. Our theory is that the LFH is saving too many free blocks in committed memory, for future allocations. They sum to about 1.2 GB while our actual allocated native memory is only at most 0.6 GB.
These numbers are from a test run of the process. In production it sometimes exceeded 10 GB of VM - with maybe 6 GB unaccounted for by known allocations.
We'd like to know if this theory of excessive committed-but-free-for-allocation segments is true, and how this waste can be reduced.
Here's the details of our analysis.
First we needed to figure out what is allocated and rule out memory leaks. We ran the excellent Heap Inspector by Jelle van der Beek, ruled out a leak, and established that the known allocations are at most 0.6 GB.
We took a full memory dump and opened in WinDbg.
Ran !heap -stat
It reports a big native heap with 1.83 GB of committed memory - much more than the sum of our allocations!
_HEAP 000000001b480000
Segments 00000078
Reserved bytes 0000000072980000
Committed bytes 000000006d597000
VirtAllocBlocks 0000001e
VirtAlloc bytes 0000000eb7a60118
Then we ran !heap -stat -h 0000001b480000
heap # 000000001b480000
group-by: TOTSIZE max-display: 20
size #blocks total ( %) (percent of total busy bytes)
c0000 12 - d80000 (10.54)
b0000 d - 8f0000 (6.98)
e0000 a - 8c0000 (6.83)
...
If we add up all 20 reported items, they come to about 85 MB - much less than the 1.79 GB we're looking for.
We ran !heap -h 1b480000
...
Flags: 00001002
ForceFlags: 00000000
Granularity: 16 bytes
Segment Reserve: 72a70000
Segment Commit: 00002000
DeCommit Block Thres: 00000400
DeCommit Total Thres: 00001000
Total Free Size: 013b60f1
Max. Allocation Size: 000007fffffdefff
Lock Variable at: 000000001b480208
Next TagIndex: 0000
Maximum TagIndex: 0000
Tag Entries: 00000000
PsuedoTag Entries: 00000000
Virtual Alloc List: 1b480118
Unable to read nt!_HEAP_VIRTUAL_ALLOC_ENTRY structure at 000000002acf0000
Uncommitted ranges: 1b4800f8
FreeList[ 00 ] at 000000001b480158: 00000000be940080 . 0000000085828c60 (9451 blocks)
...
When adding up all the segment sizes in the report, we get:
Total Size = 1.83 GB
Segments Marked Busy Size = 1.50 GB
Segments Marked Busy and Internal Size = 1.37 GB
So all the committed bytes in this report do add up to the total commit size. We grouped by block size, and the heaviest allocations come from blocks of size 0x3fff0. These don't correspond to any allocations we know of. There were also mystery blocks of other sizes.
We ran !heap -p -all. This reports the LFH internal segments, but we don't understand it fully. The 0x3fff0-sized blocks from the previous report appear in the LFH report with an asterisk mark and are sometimes Busy and sometimes Free. Inside them we see many smaller free blocks.
We guess these free blocks are legitimate: they are committed VM that the LFH reserves for future allocations. But why is their total size so much greater than the sum of our actual allocations, and can this be reduced?
Well, I can sort of answer my own question.
We had been doing lots and lots of tiny allocations and deallocations in our program. There was no leak, but it seems this created fragmentation of some sort. After consolidating and eliminating most of these allocations, our software is running much better and using less peak memory. It is still a mystery why the peak committed memory was so much higher than the peak actually-used memory.
