Reduce P99 response time latency in a low-TPS application - JBoss application server - performance

I'm looking for a way to reduce latency (high response times) at P99. The application runs on the JBoss application server. The system is currently configured with 0.5 core and 2 GB of memory.
I suspect low TPS might be the reason for the high P99s, because the application uses 0.25 core at peak traffic and averages 0.025 core. Old-gen GC times are running at 1 s. Heap settings are -Xmx1366m -Xms512m, with metaspace at 250 MB.
Right now we use the parallel GC policy. Will the G1GC policy help?
What else should I consider?
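To make the P99 concern concrete, here is a minimal sketch of how a nearest-rank P99 reacts once more than 1% of the requests in a measurement window overlap a long pause. The 1 s pause comes from the old-gen GC time quoted above; the 20 ms baseline latency, the window of 1000 requests and the function names are assumptions made purely for illustration. It says nothing about whether G1GC would shorten the pauses.

```python
BASELINE_MS = 20    # assumed normal response time (hypothetical)
PAUSE_MS = 1000     # old-gen GC time quoted in the question


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile (integer pct) using the nearest-rank method."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 gives the 1-based rank.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def p99_with_pauses(total_requests, requests_hit_by_gc):
    """P99 of a window where some requests also wait for one full GC pause."""
    latencies = [BASELINE_MS] * (total_requests - requests_hit_by_gc)
    latencies += [BASELINE_MS + PAUSE_MS] * requests_hit_by_gc
    return percentile_nearest_rank(latencies, 99)


few = p99_with_pauses(1000, 5)    # 0.5% of requests overlap a pause
many = p99_with_pauses(1000, 15)  # 1.5% of requests overlap a pause
print(f"P99 with 0.5% affected: {few} ms, with 1.5% affected: {many} ms")
```

In this model the P99 stays at the baseline while fewer than 1% of requests are affected and jumps to the full pause length as soon as that threshold is crossed, which is why the length and frequency of old-gen collections are worth measuring from the GC logs before changing collectors.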


How to get better performance in a ProxmoxVE + CEPH cluster

We have been running ProxmoxVE since 5.0 (now on 6.4-15) and we have noticed performance degrade whenever there is heavy reading/writing.
We have 9 nodes, 7 with CEPH and 56 OSDs (8 on each node). The OSDs are hard drives (HDD), WD Gold or better (4~12 TB). The nodes have 64/128 GB of RAM and dual-Xeon mainboards (various models).
We already tried simple tests like "ceph tell osd.* bench", getting a stable 110 MB/s data transfer rate on each OSD with a spread of +-10 MB/s during normal operations. Apply/Commit latency is normally below 55 ms, with a couple of OSDs reaching 100 ms and one-third below 20 ms.
The front network and back network are both 1 Gbps (separated in VLANs). We are trying to move to 10 Gbps, but we ran into trouble that we are still trying to solve (unstable OSD disconnections).
The pool is defined as "replicated" with 3 copies (2 needed to keep running). Total disk space is now 305 TB (72% used), and reweight is in use because some OSDs were getting much more data than others.
Virtual machines run on the same 9 nodes, most are not CPU intensive:
Avg. VM CPU Usage < 6%
Avg. Node CPU Usage < 4.5%
Peak VM CPU Usage 40%
Peak Node CPU Usage 30%
But I/O Wait is a different story:
Avg. Node IO Delay 11
Max. Node IO delay 38
Disk writing load is around 4 Mbytes/sec average, with peaks up to 20 Mbytes/sec.
Anyone with experience in getting better Proxmox+CEPH performance?
Thank you all in advance for taking the time to read,
Got some Ceph pointers that you could follow...
Get some good NVMe drives (one or two per server; with 8 HDDs per server, one should be enough) and use them for DB/WAL (make sure they have power-loss protection).
The ceph tell osd.* bench result is not that relevant to real-world workloads; I suggest trying some FIO tests, see here.
Set the OSD osd_memory_target to at least 8 GB of RAM.
To save some writes on your HDDs (data is not replicated X times), create your RBD pool as EC (erasure-coded pool), but please do some research first because there are tradeoffs. Recovery takes some extra CPU calculations.
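As a rough illustration of the write-saving argument for EC, the sketch below compares how many raw bytes land on disk per logical byte for a 3-copy replicated pool and for an erasure-coded pool. The k=4, m=2 profile is only an assumed example, not a recommendation, and the calculation ignores metadata, padding of small objects and the extra CPU cost mentioned above.

```python
RAW_TB = 305  # total raw capacity quoted in the question


def replicated_overhead(copies):
    """Raw bytes written per logical byte in a replicated pool."""
    return float(copies)


def erasure_coded_overhead(k, m):
    """Raw bytes written per logical byte with k data and m coding chunks."""
    return (k + m) / k


rep = replicated_overhead(3)        # pool from the question: 3 copies
ec = erasure_coded_overhead(4, 2)   # hypothetical k=4, m=2 profile

usable_rep_tb = RAW_TB / rep
usable_ec_tb = RAW_TB / ec

print(f"replicated x3: {rep:.1f}x raw writes, ~{usable_rep_tb:.0f} TB usable")
print(f"EC 4+2:        {ec:.1f}x raw writes, ~{usable_ec_tb:.0f} TB usable")
```

Under these assumptions the EC pool writes half as many raw bytes as 3-way replication for the same logical data, at the cost of every write touching k+m OSDs, which matters on a 1 Gbps network with HDD latencies.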
All in all, hyper-converged clusters are good for training, small projects and medium projects without a big workload on them... Keep in mind that planning is gold.
Just my 2 cents,

Is a Native Quarkus Application better than one on the JVM?

I am comparing the same Quarkus application packaged in different ways: regular jar, fast-jar and native executable. To compare them, I ran the same performance test against each.
The results are the following:
Regular Jar, starts in 0.904s. Regarding performance, the result is given below:
Running 1m test @ http://localhost:8080/hello
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   361.86us    4.48ms 155.85ms   99.73%
    Req/Sec    29.86k     4.72k   37.60k    87.83%
  3565393 requests in 1.00m, 282.22MB read
Requests/sec:  59324.15
Transfer/sec:      4.70MB
Fast-Jar, starts in 0.590s. Regarding performance, the result is given below:
Running 1m test @ http://localhost:8080/hello
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   344.38us    3.89ms 142.71ms   99.74%
    Req/Sec    27.21k     6.52k   40.67k    73.48%
  3246932 requests in 1.00m, 257.01MB read
Requests/sec:  54025.50
Transfer/sec:      4.28MB
Native, starts in 0.011s. Regarding performance, the result is given below:
Running 1m test @ http://localhost:8080/hello
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   303.72us  471.86us  29.00ms   98.05%
    Req/Sec    19.03k     3.21k   30.19k    78.75%
  2272236 requests in 1.00m, 179.86MB read
Requests/sec:  37867.20
Transfer/sec:      3.00MB
The native application processed roughly 1 million fewer requests than the JVM Quarkus application. However, its startup time, average latency and latency standard deviation are better than the others'.
I was wondering why this happens and whether a native application is better than one running on the JVM.
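To put the three runs side by side, the following sketch computes each packaging's throughput relative to the regular jar. The figures are copied from the wrk output above; the dictionary layout and names are just an illustrative choice.

```python
# Figures taken from the wrk runs quoted above.
results = {
    "regular-jar": {"startup_s": 0.904, "rps": 59324.15, "max_latency_ms": 155.85},
    "fast-jar": {"startup_s": 0.590, "rps": 54025.50, "max_latency_ms": 142.71},
    "native": {"startup_s": 0.011, "rps": 37867.20, "max_latency_ms": 29.00},
}

baseline_rps = results["regular-jar"]["rps"]

# Throughput of each run as a fraction of the regular jar's throughput.
relative_rps = {name: run["rps"] / baseline_rps for name, run in results.items()}

for name, run in results.items():
    print(
        f"{name:12s} startup {run['startup_s']:.3f}s  "
        f"throughput {relative_rps[name]:.0%} of regular jar  "
        f"max latency {run['max_latency_ms']:.2f}ms"
    )
```

Read this way, the native executable delivers roughly two-thirds of the regular jar's throughput in this test while starting far faster and showing a much lower maximum latency, so which one is "better" depends on whether startup and tail latency or peak throughput matters more for the workload.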
Startup time and memory consumption will definitely be better with native Quarkus applications. This is because Quarkus extends GraalVM's native image concept.
native-image is a utility that processes all classes of an application
and their dependencies, including those from the JDK. It statically
analyzes these data to determine which classes and methods are
reachable during the application execution. Then it ahead-of-time
compiles that reachable code and data to a native executable for a
specific operating system and architecture.
As the application is processed with ahead-of-time compilation and the JVM used (aka Substrate VM) contains only the essential part, the resulting program has faster startup time and lower runtime memory overhead compared to a JVM.

JMeter - Throughput Shaping Timer does not keep the requests/sec rate

I am using the Ultimate Thread Group with a fixed count of 1020 threads for the entire test duration of 520 seconds.
I've made a throughput diagram as follows:
The load increases over 10 seconds, so the spikes shouldn't be very steep. Since the max RPS is 405 and the max response time is around 25000 ms, 1020 threads should be enough.
However, when I run the test (jmeter -t spikes-nomiss.jmx -l spikes-nomiss.csv -e -o spikes-nomiss -n) I have the following graph for hits/seconds.
The threads stop for a few seconds and then suddenly 'wake up'. I can't find a reason for it. The final minute has a much higher frequency of calls. I've set the heap size to 2 GB and resources are available: CPU usage does not exceed 50% during peaks, and memory is around 80% (4 GB of RAM on the machine). Seeking any help to fix the freezes.
Make sure to monitor JMeter's JVM using JConsole, as it might be the case that JMeter is not capable of creating the spikes due to insufficient resources. The slowdowns can be caused by excessive garbage collection.
It might be the case that 1020 threads are not enough to reach the desired throughput, as this depends mainly on your application's response time. If the response time is higher than about 2.5 seconds (1020 threads / 405 RPS), you will not be able to get 405 RPS using 1020 threads. It might be a better idea to use the Concurrency Thread Group, which can be connected to the Throughput Shaping Timer via the Schedule Feedback function.
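The relationship between thread count, response time and achievable throughput can be sketched with Little's law (concurrency = throughput x response time). This is a simplified model that assumes each thread sends its next request as soon as the previous one returns, with no think time or timers; the function names are made up for illustration.

```python
import math


def threads_needed(target_rps, response_time_s):
    """Threads required to sustain target_rps at the given response time."""
    return math.ceil(target_rps * response_time_s)


def max_response_time_s(threads, target_rps):
    """Longest response time at which `threads` can still deliver target_rps."""
    return threads / target_rps


# Numbers from the question: 405 RPS peak, 1020 threads, ~25 s worst-case response.
worst_case_threads = threads_needed(405, 25.0)
response_budget = max_response_time_s(1020, 405)

print(f"threads needed at 25 s responses: {worst_case_threads}")
print(f"response time budget with 1020 threads: {response_budget:.2f} s")
```

With the question's figures, holding 405 RPS while responses take 25 s would need on the order of ten thousand threads, so a fixed pool of 1020 can only keep the schedule while responses stay within a couple of seconds.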

IIS Performance Issue

Environment: Windows Server 2012 and IIS 8.5
I have been observing slow transactions during load tests even though CPU utilization was < 60% and memory utilization was < 50%. I tried the following settings in IIS Manager, but no improvement was observed.
Threads Per Processor Limit - Increased from 25 to 50
minBytesPerSecond - Reduced from 240 to 0
Max Worker Processes - Increased from 1 to 2 and 5
Please help me with any other IIS settings that may improve performance.

issues with consistent speed when using lein test

Disclaimer: I am running this on a mid-2012 MacBook Air (i7-3667U, 8 GB RAM) with the 64-bit JVM.
Running the test suite for an application (lein t) is abnormally slow, in my view. Most of the tests involve MongoDB (creating and dropping tables/collections). Assuming that the bottleneck was DB I/O, I moved to MongoDB Enterprise, which allows running in memory,
with a mongo.conf
engine: inMemory
dbPath: /Users/beoliver/data/testdb
inMemorySizeGB: 1
mongo is started with the flag --conf ~/path/to/mongo.conf
I added the java flags to the project
:jvm-opts ["-XX:-OmitStackTraceInFastThrow" "-Xmx4g" "-Xms1g"]
to try and avoid extra swaps.
This appeared to fix the issue and the tests ran as:
time lein t
lein t 238.71s user 8.72s system 59% cpu 6:57.92 total
This is reasonable compared with the results from other team members.
But when I re-run the tests, the speed is back to the original (around the half-hour mark).
lein t 252.53s user 13.76s system 16% cpu 26:52.45 total
CPU usage peaks at about 50% but for the most part stays below 5% (this includes times when it idles at <1%).
Real memory size: 1.55 GB
Virtual memory size : 8.08 GB
Shared Memory Size: 18.0 MB
Private Memory Size : 1.67 GB
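One quick check on the two timings above is to compare CPU time (user + system) with wall-clock time. The sketch below parses the figures printed by time; the helper names are invented for illustration, and treating wall time minus CPU time as "time spent waiting" is only an approximation, since a multi-threaded JVM can accumulate CPU time on several cores at once.

```python
def parse_wall_clock(text):
    """Convert 'M:SS.ss' or 'H:MM:SS.ss' (as printed by time) to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def cpu_summary(user_s, system_s, wall_text):
    """Return (cpu_seconds, wall_seconds, cpu_share) for one run."""
    wall = parse_wall_clock(wall_text)
    cpu = user_s + system_s
    return cpu, wall, cpu / wall


fast = cpu_summary(238.71, 8.72, "6:57.92")     # the ~7 minute run
slow = cpu_summary(252.53, 13.76, "26:52.45")   # the ~27 minute run

for label, (cpu, wall, share) in (("fast", fast), ("slow", slow)):
    print(f"{label}: cpu {cpu:.0f}s of {wall:.0f}s wall ({share:.0%}), "
          f"~{wall - cpu:.0f}s not on CPU")
```

Both runs burn a similar amount of CPU time, so the extra twenty minutes of the slow run appear to be spent off the CPU (for example blocked on I/O, the network, swapping or a lock) rather than computing, which narrows down where a profiler or thread dump should look.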
Has anyone had similar experiences? Suggestions? Is there a good way of profiling, better than staring at Activity Monitor?
