JMeter - Throughput Shaping Timer does not keep the requests/sec rate - jmeter

I am using Ultimate Thread Group and fixed 1020 threads count for entire test duration - 520 seconds.
I've made a throughput diagram as follows:
The load increses over 10 seconds so the spikes shouldn't be very steep. Since the max RPS is 405 and max response time is around 25000ms 1020 threads should be enough.
However, when I run the test (jmeter -t spikes-nomiss.jmx -l spikes-nomiss.csv -e -o spikes-nomiss -n) I have the following graph for hits/seconds.
The threads are stopped for few seconds and suddenly 'wake up'. I can't find a reason for it. The final minute has a lot higher frequency of the calls. I've set heap size to 2GBs and resources are available, the CPU usage does not extend 50% during peaks, and memory is around 80% (4Gbs of ram on the machine). Seeking any help to fix the freezes.

Make sure to monitor JMeter's JVM using JConsole as it might be the case JMeter is not capable of create spikes due to insufficient resources. The slowdowns can be caused by excessive Garbage Collection
It might be the case 1020 threads are not enough to reach the desired throughput as it depends mainly on your application response time. If your application response time is higher than 300 milliseconds - you will not be able to get 405 RPS using 1020 threads. It might be a better idea to consider using Concurrency Thread Group which can be connected to the Throughput Shaping Timer via Schedule Feedback function

Related

Reduce latency in P99 response time in low TPS application - Jboss application server

I'm looking to find way to reduce latencies / higher response time at P99. The application is running on Jboss application server. Current configuration of the system is 0.5 core and 2 GB memory.
Suspecting low TPS might be the reason for higher P99's because current usages of the application at peak traffic is 0.25 core, averaging "0.025 core". And old gen GC times are running at 1s. Heap setting -xmx1366m -xms512m, metaspace at 250mb
Right now we have parallel GC policy, will G1GC policy help?
What else should I consider?

tail latency and throughput

My understanding is that tail latency is a measure of the high percentiles (95th , 99th) of response times among a set of requests being launched into the system.
My question is that how the tail latency relates to throughput, otherwise said, say I targeted the system with a 100 req/seconds and then with a 1000 req/sec (with a uniform interarrival time), the 95th percentile at 100 req/sec varies largely in comparison to the 95th percentile at 1000 req/sec?
What tail latency value shall be reported? or otherwise tail latency is independent of throughput and shall be reported at each/several target throughput i.e. 100 and 1000 req/sec in my case?
Tail latency is related to throughput. Your problem is closely related to the QoS situation (high-priority latency-critical applications vs. low-priority best-effort applications).
Besides, it only makes sense to test tail latency when the system is at full load.

Is a Native Quarkus Application better than one on the JVM?

I am comparing the same Quarkus application in different executables, regular jar, fast-jar and native executable. To be able to compare them, I run the same performance test.
The results are the following:
Regular Jar, starts in 0.904s. Regarding performance, the result is given below:
Running 1m test # http://localhost:8080/hello
2 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 361.86us 4.48ms 155.85ms 99.73%
Req/Sec 29.86k 4.72k 37.60k 87.83%
3565393 requests in 1.00m, 282.22MB read
Requests/sec: 59324.15
Transfer/sec: 4.70MB
Fast-Jar, starts in 0.590s. Regarding performance, the result is given below:
Running 1m test # http://localhost:8080/hello
2 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 344.38us 3.89ms 142.71ms 99.74%
Req/Sec 27.21k 6.52k 40.67k 73.48%
3246932 requests in 1.00m, 257.01MB read
Requests/sec: 54025.50
Transfer/sec: 4.28MB
Native, start in 0.011s. Regarding performance, the result is given below:
Running 1m test # http://localhost:8080/hello
2 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 303.72us 471.86us 29.00ms 98.05%
Req/Sec 19.03k 3.21k 30.19k 78.75%
2272236 requests in 1.00m, 179.86MB read
Requests/sec: 37867.20
Transfer/sec: 3.00MB
The number of requests processed in a native application is roughly 1 million less than a JVM Quarkus application. However, the starting up time, Avg and Stdev in native application is better than others.
I was wondering why this happens and if a native application is better than one over the JVM.
Start up time and memory consumption will definitely be better with native quarkus applications. This is because quarkus extends graalvm's native image concept.
From https://www.graalvm.org/reference-manual/native-image/
native-image is a utility that processes all classes of an application
and their dependencies, including those from the JDK. It statically
analyzes these data to determine which classes and methods are
reachable during the application execution. Then it ahead-of-time
compiles that reachable code and data to a native executable for a
specific operating system and architecture.
As the application is processed with ahead-of-time compilation and the JVM used (aka Substrate VM) contains only the essential part, the resulting program has faster startup time and lower runtime memory overhead compared to a JVM.

Python 3 multiprocessing: optimal chunk size

How do I find the optimal chunk size for multiprocessing.Pool instances?
I used this before to create a generator of n sudoku objects:
processes = multiprocessing.cpu_count()
worker_pool = multiprocessing.Pool(processes)
sudokus = worker_pool.imap_unordered(create_sudoku, range(n), n // processes + 1)
To measure the time, I use time.time() before the snippet above, then I initialize the pool as described, then I convert the generator into a list (list(sudokus)) to trigger generating the items (only for time measurement, I know this is nonsense in the final program), then I take the time using time.time() again and output the difference.
I observed that the chunk size of n // processes + 1 results in times of around 0.425 ms per object. But I also observed that the CPU is only fully loaded the first half of the process, in the end the usage goes down to 25% (on an i3 with 2 cores and hyper-threading).
If I use a smaller chunk size of int(l // (processes**2) + 1) instead, I get times of around 0.355 ms instead and the CPU load is much better distributed. It just has some small spikes down to ca. 75%, but stays high for much longer part of the process time before it goes down to 25%.
Is there an even better formula to calculate the chunk size or a otherwise better method to use the CPU most effective? Please help me to improve this multiprocessing pool's effectiveness.
This answer provides a high level overview.
Going into detais, each worker is sent a chunk of chunksize tasks at a time for processing. Every time a worker completes that chunk, it needs to ask for more input via some type of inter-process communication (IPC), such as queue.Queue. Each IPC request requires a system call; due to the context switch it costs anywhere in the range of 1-10 μs, let's say 10 μs. Due to shared caching, a context switch may hurt (to a limited extent) all cores. So extremely pessimistically let's estimate the maximum possible cost of an IPC request at 100 μs.
You want the IPC overhead to be immaterial, let's say <1%. You can ensure that by making chunk processing time >10 ms if my numbers are right. So if each task takes say 1 μs to process, you'd want chunksize of at least 10000.
The main reason not to make chunksize arbitrarily large is that at the very end of the execution, one of the workers might still be running while everyone else has finished -- obviously unnecessarily increasing time to completion. I suppose in most cases a delay of 10 ms is a not a big deal, so my recommendation of targeting 10 ms chunk processing time seems safe.
Another reason a large chunksize might cause problems is that preparing the input may take time, wasting workers capacity in the meantime. Presumably input preparation is faster than processing (otherwise it should be parallelized as well, using something like RxPY). So again targeting the processing time of ~10 ms seems safe (assuming you don't mind startup delay of under 10 ms).
Note: the context switches happen every ~1-20 ms or so for non-real-time processes on modern Linux/Windows - unless of course the process makes a system call earlier. So the overhead of context switches is no more than ~1% without system calls. Whatever overhead you're creating due to IPC is in addition to that.
Nothing will replace the actual time measurements. I wouldn't bother with a formula and try a constant such as 1, 10, 100, 1000, 10000 instead and see what works best in your case.

Application not running at full speed?

I have the following scenario:
machine 1: receives messages from outside and processes them (via a
Java application). For processing it relies on a database (on machine
2)
machine 2: an Oracle DB
As performance metrics I usually look at the value of processed messages per time.
Now, what puzzles me: none of the 2 machines is working on "full speed". If I look at typical parameters (CPU utilization, CPU load, I/O bandwidth, etc.) both machines look as they have not enough to do.
What I expect is that one machine, or one of the performance related parameters limits the overall processing speed. Since I cannot observe this I would expect a higher message processing rate.
Any ideas what might limit the overall performance? What is the bottleneck?
Here are some key values during workload:
Machine 1:
CPU load average: 0.75
CPU Utilization: System 12%, User 13%, Wait 5%
Disk throughput: 1 MB/s (write), almost no reads
average tps (as reported by iostat): 200
network: 500 kB/s in, 300 kB/s out, 1600 packets/s in, 1600 packets/s out
Machine 2:
CPU load average: 0.25
CPU Utilization: System 3%, User 15%, Wait 17%
Disk throughput: 4.5 MB/s (write), 3.5 MB/s (read)
average tps (as reported by iostat): 190 (very short peaks to 1000-1500)
network: 250 kB/s in, 800 kB/s out, 1100 packets/s in, 1100 packets/s out
So for me, all values seem not to be at any limit.
PS: for testing of course the message queue is always full, so that both machines have enough work to do.
To find bottlenecks you typically need to measure also INSIDE the application. That means profiling the java application code and possibly what happens inside Oracle.
The good news is that you have excluded at least some possible hardware bottlenecks.

Resources