Adobe Experience Manager (AEM), Java garbage collection tuning and memory management - performance

I am currently running Adobe Experience Manager (a Java application) for a client's site. It uses OpenJDK:
#java -version
java version "1.7.0_65"
OpenJDK Runtime Environment (rhel-2.5.1.2.el6_5-x86_64 u65-b17)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)
It is running on Rackspace with the following:
vCPU: 4
Memory: 16GB
Guest OS: Red Hat Enterprise Linux 6 (64-bit)
Since going into production, the application has been very slow. The pattern is this: I launch the app and everything is smooth, then 3 to 4 days later CPU usage spikes to 400% (about 4000 users/day hit the site). I got a few OOM exceptions (1 or 2), but mostly the site just became exceptionally slow without ever throwing an OOM exception. Since I am a novice at Java memory management, I started reading about how it works and found tools like jstat. When the system was overwhelmed the second time around, I ran:
#top
I got the PID of the java process, pressed Shift+H, and noted the PIDs of the threads with a high CPU percentage. Then I ran:
#sudo -uaem jstack <PID>
That gave me a thread dump. I converted the thread PIDs I had written down to hex and searched for those values in the dump. After all that, I found that, not surprisingly, it was the garbage collector that was flipping out for some reason.
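The decimal-to-hex lookup described above can be sketched as follows. This is a minimal Python illustration, not the exact procedure used: the thread ID and the dump excerpt are made up, and it assumes each thread line in the dump carries a `nid=0x...` field holding the native thread ID.

```python
def tid_to_nid(tid: int) -> str:
    """Convert a decimal thread ID (as shown by top with Shift+H) to nid=0x... form."""
    return "nid=" + hex(tid)


def find_threads(dump_text: str, tids):
    """Return {tid: first dump line whose nid matches} for every thread ID found."""
    hits = {}
    for tid in tids:
        needle = tid_to_nid(tid)
        for line in dump_text.splitlines():
            # Compare whole tokens so nid=0x6a5 does not also match nid=0x6a5f.
            if needle in line.split():
                hits[tid] = line
                break
    return hits


# Hypothetical two-line excerpt in the style of a HotSpot thread dump.
SAMPLE_DUMP = (
    '"GC task thread#0 (ParallelGC)" prio=10 tid=0x00007f1c2c01f000 nid=0x6a5 runnable\n'
    '"main" prio=10 tid=0x00007f1c2c009000 nid=0x6a3 waiting on condition'
)

if __name__ == "__main__":
    for tid, line in find_threads(SAMPLE_DUMP, [1701]).items():
        print(tid, "->", line)
```

If the busiest thread IDs all resolve to GC task threads, as in this made-up excerpt, the CPU is being spent on collection rather than on application code.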
I started reading a lot about Java GC tuning and restarted the application with the following options:
java
-Dcom.day.crx.persistence.tar.IndexMergeDelay=0
-Djackrabbit.maxQueuedEvents=1000000
-Djava.io.tmpdir=/srv/aem/tmp/
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/srv/aem/tmp/
-Xms8192m -Xmx8192m
-XX:PermSize=256m
-XX:MaxPermSize=1024m
-XX:+UseParallelGC
-XX:+UseParallelOldGC
-XX:ParallelGCThreads=4
-XX:NewRatio=1
-Djava.awt.headless=true
-server
-Dsling.run.modes=publish
-jar crx-quickstart/app/cq-quickstart-6.0.0-standalone.jar start
-c crx-quickstart -i launchpad -p 4503
-Dsling.properties=conf/sling.properties
And it looks like it is performing much better but I think that it probably needs more GC tuning.
When I run:
#sudo -uaem jstat -gcutil <PID>
I get this:
S0 S1 E O P YGC YGCT FGC FGCT GCT
0.00 0.00 55.97 100.00 45.09 4725 521.233 505 4179.584 4700.817
4 days after I restarted it.
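One way to read the output above is to turn the counters into per-collection averages. A minimal Python sketch, assuming the process had been up for roughly 4 days as stated (the exact uptime is not given, so the overall GC fraction is only approximate):

```python
HEADER = "S0 S1 E O P YGC YGCT FGC FGCT GCT"
SAMPLE = "0.00 0.00 55.97 100.00 45.09 4725 521.233 505 4179.584 4700.817"


def parse_gcutil(header: str, row: str) -> dict:
    """Pair each jstat column name with its numeric value."""
    return {name: float(value) for name, value in zip(header.split(), row.split())}


def summarize(stats: dict, uptime_seconds: float) -> dict:
    """Derive average pause per collection and the share of uptime spent in GC."""
    return {
        "avg_young_pause_s": stats["YGCT"] / stats["YGC"] if stats["YGC"] else 0.0,
        "avg_full_pause_s": stats["FGCT"] / stats["FGC"] if stats["FGC"] else 0.0,
        "gc_time_fraction": stats["GCT"] / uptime_seconds,
    }


if __name__ == "__main__":
    assumed_uptime = 4 * 24 * 3600  # assumption: about 4 days since restart
    for key, value in summarize(parse_gcutil(HEADER, SAMPLE), assumed_uptime).items():
        print(f"{key}: {value:.4f}")
```

With these figures the average full collection takes roughly 8 seconds and the old generation (O) is at 100%, which suggests the full GCs are the main problem rather than the young collections.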
When I run:
#sudo -uaem jstat -gccapacity <PID>
I get this:
NGCMN NGCMX NGC S0C S1C EC
4194304.0 4194304.0 4194304.0 272896.0 279040.0 3636224.0
OGCMN OGCMX OGC OC PGCMN PGCMX
4194304.0 4194304.0 4194304.0 4194304.0 262144.0 1048576.0
PGC PC YGC FGC
262144.0 262144.0 4725 509
4 days after I restarted it.
These results are much better than when I started, but I think they can get even better. I'm not really sure what to do next as I'm no GC pro, so I was wondering if you would have any tips or advice on how to get better app/GC performance, and whether anything stands out as obviously wrong, such as the ratios and sizes of the young and old generations.
How should I set the survivor and eden sizes/ratios?
Should I change the GC type, for example to CMS or G1?
How should I proceed?
Any advice would be helpful.
Best,
Nicola

The young-to-old ratio is generally around 1:3, but it can vary depending on how many short-lived versus long-lived objects the application creates. If short-lived objects dominate, the young space can be enlarged, for example to 2:3 (young:old). The reason to raise the ratio is to reduce scavenge (young generation) GC cycles: when many short-lived objects are allocated, the young space fills quickly and triggers scavenge cycles, which in turn hurt application performance. Raising the ratio above its current value makes those cycles less frequent. When the young space grows, the survivor and Eden spaces grow with it.
The CMS policy is used to reduce application pause times, while the G1 policy targets larger heaps with high throughput. The GC policy can be changed based on the needs of the application.
Recommended Use Cases for G1:
The first focus of G1 is to provide a solution for users running applications that require large heaps with limited GC latency.
This means heap sizes of around 6GB or larger, and stable and predictable pause time below 0.5 seconds.
Since you are using an 8 GB heap, you can test the G1 policy in the same environment and compare the GC performance.
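To make the young:old ratio discussion concrete, here is a small Python sketch of the arithmetic. It assumes HotSpot's meaning of -XX:NewRatio=N (old generation is N times the young generation) and uses the 8192 MB heap from the question; the function names are only illustrative.

```python
def split_by_new_ratio(heap_mb: float, new_ratio: int):
    """-XX:NewRatio=N means old:young = N:1, so young = heap / (N + 1)."""
    young = heap_mb / (new_ratio + 1)
    return young, heap_mb - young


def young_for_parts(heap_mb: float, young_parts: int, old_parts: int) -> float:
    """Young size for an arbitrary young:old split such as 2:3."""
    return heap_mb * young_parts / (young_parts + old_parts)


if __name__ == "__main__":
    print(split_by_new_ratio(8192, 1))   # the question's setting: 1:1
    print(split_by_new_ratio(8192, 3))   # a 1:3 young:old split
    print(young_for_parts(8192, 2, 3))   # a 2:3 young:old split
```

With NewRatio=1 the young generation is 4096 MB, which matches the NGC value of 4194304.0 KB in the -gccapacity output. A split such as 2:3 does not correspond to a whole-number NewRatio, so it would have to be set as an explicit young size (for example with -Xmn) instead.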


How to get better performance in ProxmoxVE + CEPH cluster

We have been running ProxmoxVE since 5.0 (now on 6.4-15) and we have noticed performance degrading whenever there is heavy reading or writing.
We have 9 nodes, 7 of them with CEPH, and 56 OSDs (8 on each node). The OSDs are hard drives (HDD), WD Gold or better (4 to 12 TB). The nodes have 64 or 128 GB of RAM and dual-Xeon mainboards (various models).
We already tried simple tests like "ceph tell osd.* bench", getting a stable 110 MB/s data transfer on each OSD with a spread of ±10 MB/s during normal operations. Apply/Commit Latency is normally below 55 ms, with a couple of OSDs reaching 100 ms and one third below 20 ms.
The front and back networks are both 1 Gbps (separated into VLANs). We are trying to move to 10 Gbps but ran into trouble that we are still trying to solve (unstable OSD disconnections).
The pool is defined as "replicated" with 3 copies (2 needed to keep running). The total disk space is now 305 TB (72% used); reweight is in use because some OSDs were getting much more data than others.
Virtual machines run on the same 9 nodes, most are not CPU intensive:
Avg. VM CPU Usage < 6%
Avg. Node CPU Usage < 4.5%
Peak VM CPU Usage 40%
Peak Node CPU Usage 30%
But I/O Wait is a different story:
Avg. Node IO Delay 11
Max. Node IO delay 38
Disk writing load is around 4 Mbytes/sec average, with peaks up to 20 Mbytes/sec.
Anyone with experience in getting better Proxmox+CEPH performance?
Thank you all in advance for taking the time to read,
Ruben.
Here are some Ceph pointers that you could follow:
Get some good NVMe drives (one or two per server, but with 8 HDDs per server 1 should be enough) and use them for DB/WAL (make sure they have power-loss protection).
The ceph tell osd.* bench is not that relevant for the real world; I suggest trying some FIO tests, see here.
Set osd_memory_target to at least 8G of RAM.
To save some writes on your HDDs (data is not replicated X times), create your RBD pool as EC (erasure-coded pool), but please do some research on that because there are tradeoffs: recovery takes some extra CPU calculations.
All in all, hyper-converged clusters are good for training, small projects, and medium projects without a big workload on them. Keep in mind that planning is gold.
Just my 2 cents,
B.
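To put the figures from the question in perspective, the network and disk bandwidth can be compared directly. This is only a rough Python sketch: it assumes the per-OSD bench figure of about 110 is in megabytes per second (typical for an HDD), that each node has a single 1 Gbps link per network, and it ignores protocol overhead.

```python
def link_mb_per_s(gbps: float) -> float:
    """Theoretical link throughput in megabytes per second (decimal units)."""
    return gbps * 1000 / 8


def node_disk_mb_per_s(osds_per_node: int, per_osd_mb_s: float) -> float:
    """Aggregate sequential throughput of all OSDs in one node."""
    return osds_per_node * per_osd_mb_s


if __name__ == "__main__":
    link = link_mb_per_s(1)             # 1 Gbps network from the question
    disks = node_disk_mb_per_s(8, 110)  # 8 OSDs per node at about 110 MB/s each
    print(f"link: {link:.0f} MB/s, disks: {disks:.0f} MB/s, ratio: {disks / link:.1f}x")
    print(f"same disks on 10 Gbps: {disks / link_mb_per_s(10):.2f}x of the link")
```

Under those assumptions a single OSD can nearly fill a 1 Gbps link on its own, and the eight OSDs in a node exceed it about sevenfold, so the network is a plausible bottleneck before the disks are.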

Improve erlang cowboy performance

We have been using Cowboy in production on our Compute Engine machines on GCP and we started benchmarking and improving the performance of our service to handle more Reqs/sec (in our case since we are in Adtech it is bids/sec).
After isolating and handling a lot of the issues separately, we came down to Cowboy optimization. These are our current findings and limitations:
Cowboy setup
We are using Cowboy 2.5 with 200 acceptors and max backlog of 1024
init(Req, _State) ->
    T1 = erlang:monotonic_time(),
    {ok, BRjson, _} = cowboy_req:read_body(Req),
    %% ---- rest of work goes here but is switched off for our test---
    erlang:send_after(60, self(), {'RSP', x, no_workers}),
    {cowboy_loop, Req, #state{t1 = T1}, hibernate}.
Erlang VM
OTP 21
VM args: -smp auto +P 134217727 +K true +A 64 -rate 1200 +stbt db +scl false +sfwi 500 +spp true +zdbbl 8092
Load
Json requests ~4KB in size. And testing is done using a separate machine on the same internal network (no SSL) using jmeter. All requests are POST with keep-alive
Servers
GCP Compute Engine 10 vcpu cores and 14GB RAM (now and tested before with the 4 vcpu)
Findings
We are able to reach ~1900 reqs/sec, but all CPU cores in htop are showing almost 80% utilization
At 1000 reqs/sec we see CPU utilization at 45-50% per core (still high, bearing in mind that no other part of our application is running)
*Note: using the 4 vcpu machine we were able to get close to 700 reqs/sec, and memory in all of our tests is barely utilized or changing with load
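The findings above can be turned into an approximate CPU cost per request. A rough Python sketch, assuming the htop utilization is spread evenly over all cores and is attributable to the Erlang VM (scheduler busy-waiting may inflate it); the function name is only illustrative.

```python
def cpu_ms_per_request(cores: int, utilization: float, reqs_per_sec: float) -> float:
    """CPU milliseconds consumed per request across all cores."""
    return cores * utilization * 1000 / reqs_per_sec


if __name__ == "__main__":
    # 10 vCPU machine, figures taken from the findings list
    print(f"{cpu_ms_per_request(10, 0.80, 1900):.2f} ms at 1900 reqs/sec")
    print(f"{cpu_ms_per_request(10, 0.45, 1000):.2f} ms at 1000 reqs/sec (low estimate)")
    print(f"{cpu_ms_per_request(10, 0.50, 1000):.2f} ms at 1000 reqs/sec (high estimate)")
```

For the 10 vCPU machine this gives roughly 4.2 ms of CPU per request at 1900 reqs/sec and 4.5 to 5 ms at 1000 reqs/sec, so the per-request cost stays in the same range as load grows rather than increasing.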
QUESTION: How to improve cowboy's performance in terms of cpu usage?
First off, thanks @Pouriya for the suggestions. Discussing this back and forth made me go back and re-check one of my comments about the right tool for the job. PS: we are on GCP, so 72 cores would be out of the question at this stage.
Cowboy is great, but it adds a bit of overhead in the critical path of each request, a feature (or issue, in my case) that we do not need.
We tested again with Elli (https://github.com/elli-lib/elli), this time with a proper testing setup, and it provided an improvement of up to 20%, exactly what we needed!
If anyone on the Cowboy/Ranch team has a way of drastically reducing the CPU overhead, we will gladly test it, since we still use Cowboy in our APIs, just not in the critical path.

Need help understanding my ElasticSearch Cluster Health

When querying my cluster, I noticed these stats for one of its nodes. I am new to Elastic and would like the community's help in understanding what they mean and whether I need to take any corrective measures.
Does the heap used look on the high side and, if so, how would I rectify it? Any comments on the System Memory Used would also be helpful; it feels like it's on the really high side as well.
These are the JVM level stats
JVM
Version OpenJDK 64-Bit Server VM (1.8.0_171)
Process ID 13735
Heap Used % 64%
Heap Used/Max 22 GB / 34.2 GB
GC Collections (Old/Young) 1 / 46,372
Threads (Peak/Max) 163 / 147
These are the OS-level stats
Operating System
System Memory Used % 90%
System Memory Used 59.4 GB / 65.8 GB
Allocated Processors 16
Available Processors 16
OS Name Linux
OS Architecture amd64
Since you say you are new to Elasticsearch, I suggest going through the cluster API as well as the cat API; you can find documentation at cluster API and cat API.
This will help you understand things in more depth.

issues with consistent speed when using lein test

Disclaimer: I am running this on a mid-2012 MacBook Air (i7-3667U, 8 GB RAM) with the 64-bit JVM.
Running the test suite for an application with lein t takes what I would consider an abnormally long time. Most of the tests involve MongoDB (creating and dropping tables/collections). Since I assumed the bottleneck was database I/O, I moved to MongoDB Enterprise, which allows running in memory.
with a mongo.conf
storage:
  engine: inMemory
  dbPath: /Users/beoliver/data/testdb
  inMemory:
    engineConfig:
      inMemorySizeGB: 1
mongo is started with the flag --conf ~/path/to/mongo.conf
I added the java flags to the project
:jvm-opts ["-XX:-OmitStackTraceInFastThrow" "-Xmx4g" "-Xms1g"]
to try and avoid extra swaps.
This appeared to fix the issue and the tests ran as:
time lein t
...
lein t 238.71s user 8.72s system 59% cpu 6:57.92 total
This is reasonable compared with the results from other team members.
But when I re-run the tests, the speed is back to the original (around the half-hour mark).
lein t 252.53s user 13.76s system 16% cpu 26:52.45 total
cpu usage peaks at about 50% but for the most part is around <5% (this includes times when it idles at <1%)
Real memory size: 1.55 GB
Virtual memory size : 8.08 GB
Shared Memory Size: 18.0 MB
Private Memory Size : 1.67 GB
Has anyone had similar experiences? Suggestions? Is there a good way of profiling, better than staring at Activity Monitor?
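As a first step before profiling, the two timing lines above can be split into CPU time and time spent off the CPU. A minimal Python sketch, assuming the lines are in the format printed by the shell's time command exactly as pasted in the question:

```python
import re

TIME_RE = re.compile(
    r"(?P<user>[\d.]+)s user (?P<sys>[\d.]+)s system \d+% cpu (?P<total>[\d:.]+) total"
)


def parse_total(text: str) -> float:
    """Convert 'm:ss.ss' or 'h:mm:ss.ss' to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def breakdown(line: str) -> dict:
    """Split a timing line into CPU seconds, wall seconds and off-CPU seconds."""
    match = TIME_RE.search(line)
    if match is None:
        raise ValueError("unrecognized timing line: " + line)
    cpu = float(match["user"]) + float(match["sys"])
    wall = parse_total(match["total"])
    return {"cpu_s": cpu, "wall_s": wall, "off_cpu_s": wall - cpu}


if __name__ == "__main__":
    fast = breakdown("lein t 238.71s user 8.72s system 59% cpu 6:57.92 total")
    slow = breakdown("lein t 252.53s user 13.76s system 16% cpu 26:52.45 total")
    for name, run in (("fast", fast), ("slow", slow)):
        print(name, {key: round(value, 2) for key, value in run.items()})
```

Both runs use about the same amount of CPU (roughly 250 to 270 seconds), so the extra 20 minutes of the slow run is time spent waiting rather than computing, for example on I/O, swapping, or the database.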

Elasticsearch High CPU When Idle

I'm fairly new to Elasticsearch and I've bumped into an issue that I'm having difficulties in even troubleshooting. My Elasticsearch (1.1.1) is currently spiking the cpu even though no searching or indexing is going on. CPU usage isn't always at 100%, but it jumps up there quite a bit and load is very high.
Previously, the indices on this node ran perfectly fine for months without any issue. This just started today and I have no idea what's causing it.
The problem persists even after I restart ES and I even restarted the server in pure desperation. No effect on the issue.
Here are some stats to help frame the issue, but I'd imagine there's more information that's needed. I'm just not sure what to provide.
Elasticsearch 1.1.1
Gentoo Linux 3.12.13
java version "1.6.0_27"
OpenJDK Runtime Environment (IcedTea6 1.12.7) (Gentoo build 1.6.0_27-b27)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)
One node, 5 shards, 0 replicas
32GB RAM on system, 16GB Dedicated to Elasticsearch
RAM does not appear to be the issue here.
Any tips on troubleshooting the issue would be appreciated.
Edit: Info from top if it's helpful at all.
top - 19:56:56 up 3:22, 2 users, load average: 10.62, 11.15, 9.37
Tasks: 123 total, 1 running, 122 sleeping, 0 stopped, 0 zombie
%Cpu(s): 98.5 us, 0.6 sy, 0.0 ni, 0.7 id, 0.2 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 32881532 total, 31714120 used, 1167412 free, 187744 buffers
KiB Swap: 4194300 total, 0 used, 4194300 free, 12615280 cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2531 elastic+ 20 0 0.385t 0.020t 3.388g S 791.9 64.9 706:00.21 java
As Andy Pryor mentioned, the background merging might have been what was causing the issue. Our index rollover had been paused and two of our current indices were over 200GB. Rolling them over appeared to have resolved the issue and we've been humming along just fine since.
Edit:
The high load when seemingly idle was determined to have been caused by merges on several very large indices that were not being rolled over on a weekly basis. This was a failure of an internal process to roll over indices on a weekly basis. After addressing this oversight the merge times were short and the high load subsided.
