Tail latency and throughput

My understanding is that tail latency is a measure of the high percentiles (95th, 99th) of the response times among a set of requests issued to the system.
My question is how tail latency relates to throughput. Put differently, if I target the system with 100 req/sec and then with 1000 req/sec (with a uniform interarrival time), the 95th percentile at 100 req/sec can differ greatly from the 95th percentile at 1000 req/sec.
Which tail latency value should be reported? Or should tail latency not be treated as a single number at all, and instead be reported at each of several target throughputs, i.e. at both 100 and 1000 req/sec in my case?

Tail latency is related to throughput. Your problem is closely related to the QoS setting (high-priority latency-critical applications vs. low-priority best-effort applications).
Besides, it only really makes sense to measure tail latency when the system is under full load, so it should be reported together with the load level at which it was measured.
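Whichever load level you report, the percentile itself is computed the same way from the response times collected at that target rate. A minimal sketch (hypothetical sample values, nearest-rank percentile, not tied to any particular load generator):

import java.util.Arrays;

public class TailLatency {

    // Nearest-rank percentile over a sorted copy of the samples (pct in (0, 100]).
    static double percentile(double[] latenciesMs, double pct) {
        double[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        // Hypothetical response times (ms) collected while driving 100 req/sec.
        double[] at100rps = {2.1, 2.3, 2.2, 2.8, 3.0, 2.4, 15.7, 2.5, 2.6, 40.2};
        System.out.printf("p95 = %.1f ms, p99 = %.1f ms%n",
                percentile(at100rps, 95.0), percentile(at100rps, 99.0));
    }
}

The same computation repeated on samples gathered at 1000 req/sec will generally give a different (usually higher) value, which is why the percentile is only meaningful together with the load at which it was measured.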

Related

Is a Native Quarkus Application better than one on the JVM?

I am comparing the same Quarkus application packaged as different executables: a regular jar, a fast-jar, and a native executable. To be able to compare them, I run the same performance test against each.
The results are the following:
Regular Jar, starts in 0.904s. Regarding performance, the result is given below:
Running 1m test @ http://localhost:8080/hello
2 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 361.86us 4.48ms 155.85ms 99.73%
Req/Sec 29.86k 4.72k 37.60k 87.83%
3565393 requests in 1.00m, 282.22MB read
Requests/sec: 59324.15
Transfer/sec: 4.70MB
Fast-Jar, starts in 0.590s. Regarding performance, the result is given below:
Running 1m test @ http://localhost:8080/hello
2 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 344.38us 3.89ms 142.71ms 99.74%
Req/Sec 27.21k 6.52k 40.67k 73.48%
3246932 requests in 1.00m, 257.01MB read
Requests/sec: 54025.50
Transfer/sec: 4.28MB
Native, starts in 0.011s. Regarding performance, the result is given below:
Running 1m test @ http://localhost:8080/hello
2 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 303.72us 471.86us 29.00ms 98.05%
Req/Sec 19.03k 3.21k 30.19k 78.75%
2272236 requests in 1.00m, 179.86MB read
Requests/sec: 37867.20
Transfer/sec: 3.00MB
The number of requests processed by the native application is roughly 1 million less than by the JVM-based Quarkus applications. However, the startup time, average latency, and standard deviation of the native application are better than those of the others.
I was wondering why this happens and whether a native application is better than one running on the JVM.
Startup time and memory consumption will definitely be better with native Quarkus applications. This is because Quarkus builds on GraalVM's native image concept.
From https://www.graalvm.org/reference-manual/native-image/
native-image is a utility that processes all classes of an application
and their dependencies, including those from the JDK. It statically
analyzes these data to determine which classes and methods are
reachable during the application execution. Then it ahead-of-time
compiles that reachable code and data to a native executable for a
specific operating system and architecture.
As the application is ahead-of-time compiled and the runtime it embeds (Substrate VM) contains only the essential parts, the resulting program has a faster startup time and lower runtime memory overhead compared to running on a JVM. The flip side is that an ahead-of-time compiled image does not get the profile-guided optimizations that HotSpot's JIT applies to hot code paths at runtime, which is typically why its peak throughput ends up lower than on the JVM.
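For reference, the /hello endpoint being benchmarked is presumably something like the standard Quarkus getting-started resource; a minimal sketch (depending on the Quarkus version, the jakarta.ws.rs packages may be needed instead of javax.ws.rs):

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

@Path("/hello")
public class GreetingResource {

    // The exact same code is packaged as regular jar, fast-jar, and native executable,
    // so the differences above come from the runtime, not the application logic.
    @GET
    @Produces(MediaType.TEXT_PLAIN)
    public String hello() {
        return "hello";
    }
}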

JMeter - Throughput Shaping Timer does not keep the requests/sec rate

I am using the Ultimate Thread Group with a fixed thread count of 1020 for the entire test duration of 520 seconds.
I've made a throughput diagram as follows:
The load increases over 10 seconds, so the spikes shouldn't be very steep. Since the max RPS is 405 and the max response time is around 25000 ms, 1020 threads should be enough.
However, when I run the test (jmeter -t spikes-nomiss.jmx -l spikes-nomiss.csv -e -o spikes-nomiss -n) I have the following graph for hits/seconds.
The threads stop for a few seconds and then suddenly 'wake up'. I can't find a reason for it. The final minute has a much higher frequency of calls. I've set the heap size to 2 GB and resources are available: CPU usage does not exceed 50% during peaks, and memory usage is around 80% (4 GB of RAM on the machine). I'm seeking any help to fix the freezes.
Make sure to monitor JMeter's JVM using JConsole, as it might be the case that JMeter is not capable of creating the spikes due to insufficient resources. The slowdowns can be caused by excessive garbage collection.
It might also be the case that 1020 threads are not enough to reach the desired throughput, as the achievable rate depends mainly on your application's response time. If your application's response time is higher than 300 milliseconds, you will not be able to get 405 RPS using 1020 threads. It might be a better idea to consider using the Concurrency Thread Group, which can be connected to the Throughput Shaping Timer via the Schedule Feedback function.
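The feedback wiring is done by using the plugin's __tstFeedback function as the Concurrency Thread Group's "Target Concurrency" expression, something along these lines (values are hypothetical; if I recall the signature correctly, the arguments are the Throughput Shaping Timer's name, the starting concurrency, the maximum concurrency, and the spare threads):

${__tstFeedback(shaper,1,1020,10)}

With that in place the plugin grows and shrinks the number of active threads to whatever is needed to follow the shaping profile, instead of relying on a fixed pool of 1020 threads.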

What is the performance of 10 processors capable of 200 MFLOPs running code which is 10% sequential and 90% parallelizable?

A simple problem from Wilkinson and Allen's Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers. I'm working through the exercises at the end of the first chapter and want to make sure that I'm on the right track. The full question is:
1-11 A multiprocessor consists of 10 processors, each capable of a peak execution rate of 200 MFLOPs (millions of floating point operations per second). What is the performance of the system as measured in MFLOPs when 10% of the code is sequential and 90% is parallelizable?
I assume the question wants me to find the number of operations per second of a serial processor which would take the same amount of time to run the program as the multiprocessor.
I think I'm right in thinking that 10% of the program is run at 200 MFLOPs, and 90% is run at 2,000 MFLOPs, and that I can average these speeds to find the performance of the multiprocessor in MFLOPs:
1/10 * 200 + 9/10 * 2000 = 1820 MFLOPs
So when running a program which is 10% serial and 90% parallelizable the performance of the multiprocessor is 1820 MFLOPs.
Is my approach correct?
ps: I understand that this isn't exactly how this would work in reality because it's far more complex, but I would like to know if I'm grasping the concepts.
Your calculation would be fine if 90% of the time, all 10 processors were fully utilized, and 10% of the time, just 1 processor was in use. However, I don't think that is a reasonable interpretation of the problem. I think it is more reasonable to assume that if a single processor were used, 10% of its computations would be on the sequential part, and 90% of its computations would be on the parallelizable part.
One possibility is that the sequential part and parallelizable parts can be run in parallel. Then one processor could run the sequential part, and the other 9 processors could do the parallelizable part. All processors would be fully used, and the result would be 2000 MFLOPS.
Another possibility is that the sequential part needs to be run first, and then the parallelizable part. If a single processor needed 1 hour to do the first part, and 9 hours to do the second, then it would take 10 processors 1 + 0.9 = 1.9 hours total, for an average of about (1*200 + 0.9*2000)/1.9 ~ 1053 MFLOPS.
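Equivalently, this second interpretation is just Amdahl's law applied to the peak rate; a worked check, assuming a sequential fraction f = 0.1 on p = 10 processors:

speedup = 1 / (f + (1 - f)/p) = 1 / (0.1 + 0.9/10) = 1 / 0.19 ≈ 5.26
performance ≈ 5.26 × 200 MFLOPS ≈ 1053 MFLOPS

which matches the 1.9-hour calculation above.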

What can be the reason for CPU load to NOT scale linearly with the number of traffic processing workers?

We are writing a Front End that is supposed to process large volume of traffic (in our case it is Diameter traffic, but that may be irrelevant to the question). As client connects, the server socket gets assigned to one of the Worker processes that perform all the actual traffic processing. In other words, Worker does all the work, and more Workers should be added when more clients get connected.
One would expect the CPU load per message to be the same for different numbers of Workers, because Workers are totally independent and serve different sets of client connections. Yet our tests show that it takes more CPU time per message as the number of Workers grows.
To be more precise, the CPU load depends on the TPS (Transactions or Request-Responses per second) as follows.
For 1 Worker:
60K TPS - 16%, 65K TPS - 17%... i.e. ~0.26% CPU per KTPS
For 2 Workers:
80K TPS - 29%, 85K TPS - 30%... i.e. ~0.35% CPU per KTPS
For 4 Workers:
85K TPS - 33%, 90K TPS - 37%... i.e. ~0.41% CPU per KTPS
What is the explanation for this? Workers are independent processes and there is no inter-process communication between them. Also each Worker is single-threaded.
The programming language is C++.
This effect is observed on any hardware close to this one: 2 Intel Xeon CPUs, 4-6 cores, 2-3 GHz.
OS: RedHat Linux (RHEL) 5.8, 6.4
CPU load measurements are done using mpstat and top.
If either the size of the program code used by a worker or the size of the data processed by a worker (or both) is not small, the reason could be the reduced effectiveness of the various caches: the locality-over-time of how a single worker accesses its program code and/or its data is disturbed by other workers intervening.
The effect can be quite complicated to understand, because:
it depends massively on the structure of your code's computations,
modern CPUs have about three levels of cache,
each cache has a different size,
some caches are local to one core, others are not,
how often the workers intervene depends on your operating system's scheduling strategy
which gets even more complicated if there are multiple cores,
unless your programming language's run-time system also intervenes,
in which case it is more complicated still,
your network interface is a computer of its own and has a cache, too,
and probably more.
Caveat: Given the relatively coarse granularity of process scheduling, the effect of this ought not to be as large as it is, I think.
But then: Have you looked up how "percent of CPU" is even defined?
Until you reach CPU saturation on your machine you cannot be sure that the effect is actually as large as it looks. And when you do reach saturation, it may not be the CPU at all that is the bottleneck here, so are you sure you need to care about CPU load?
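To get a feeling for how strongly the working-set size alone can change the cost per memory access, here is a small standalone sketch (hypothetical and unrelated to the OP's Diameter front end; a rough microbenchmark without proper warm-up, so treat the numbers as indicative only). It chases pointers through a random cycle in arrays sized to fit roughly in L1, L2, L3, and main memory:

import java.util.Random;

public class WorkingSetDemo {

    // Builds a random single-cycle permutation (Sattolo's algorithm) and chases it,
    // returning the average nanoseconds per dependent load.
    static double nsPerAccess(int sizeInts, int accesses) {
        int[] next = new int[sizeInts];
        for (int i = 0; i < sizeInts; i++) next[i] = i;
        Random rnd = new Random(42);
        for (int i = sizeInts - 1; i > 0; i--) {
            int j = rnd.nextInt(i);                 // j in [0, i) keeps it one single cycle
            int tmp = next[i]; next[i] = next[j]; next[j] = tmp;
        }
        int idx = 0;
        long start = System.nanoTime();
        for (int i = 0; i < accesses; i++) {
            idx = next[idx];                        // each load depends on the previous one
        }
        long elapsed = System.nanoTime() - start;
        if (idx == -1) System.out.println();        // use idx so the loop is not eliminated
        return (double) elapsed / accesses;
    }

    public static void main(String[] args) {
        int accesses = 20_000_000;
        for (int kib : new int[]{32, 256, 4096, 65536}) {   // ~L1, ~L2, ~L3, RAM-sized
            int sizeInts = kib * 1024 / 4;
            System.out.printf("%6d KiB working set: %.2f ns/access%n",
                    kib, nsPerAccess(sizeInts, accesses));
        }
    }
}

When several independent workers share the same caches, each one effectively sees a smaller cache, so its per-access (and hence per-message) cost moves toward the larger-working-set end of this curve.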
I completely agree with @Lutz Prechelt. Here I just want to add a method for investigating the issue, and the answer is perf.
perf is a performance analysis tool on Linux which collects both kernel and userspace events and provides some nice metrics. It's been widely used in my team to find bottlenecks in CPU-bound applications.
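A typical invocation that adds cache-related counters to the default set is perf stat in detailed mode, for example (using the same example binary as in the output below):

perf stat -d ./cache_line_test 0 1 2 3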
The output of perf looks like this:
Performance counter stats for './cache_line_test 0 1 2 3':
1288.050638 task-clock # 3.930 CPUs utilized
185 context-switches # 0.144 K/sec
8 cpu-migrations # 0.006 K/sec
395 page-faults # 0.307 K/sec
3,182,411,312 cycles # 2.471 GHz [39.95%]
2,720,300,251 stalled-cycles-frontend # 85.48% frontend cycles idle [40.28%]
764,587,902 stalled-cycles-backend # 24.03% backend cycles idle [40.43%]
1,040,828,706 instructions # 0.33 insns per cycle
# 2.61 stalled cycles per insn [51.33%]
130,948,681 branches # 101.664 M/sec [51.48%]
20,721 branch-misses # 0.02% of all branches [50.65%]
652,263,290 L1-dcache-loads # 506.396 M/sec [51.24%]
10,055,747 L1-dcache-load-misses # 1.54% of all L1-dcache hits [51.24%]
4,846,815 LLC-loads # 3.763 M/sec [40.18%]
301 LLC-load-misses # 0.01% of all LL-cache hits [39.58%]
It outputs your cache miss rate, which makes it easy to tune your program and see the effect.
I wrote an article about cache line effects and perf, and you can read it for more details.

Application not running at full speed?

I have the following scenario:
machine 1: receives messages from outside and processes them (via a Java application). For processing it relies on a database (on machine 2)
machine 2: an Oracle DB
As a performance metric I usually look at the number of messages processed per unit of time.
Now, what puzzles me: neither of the two machines is working at "full speed". If I look at typical parameters (CPU utilization, CPU load, I/O bandwidth, etc.), both machines look as if they do not have enough to do.
What I expect is that one machine, or one of the performance-related parameters, limits the overall processing speed. Since I cannot observe this, I would expect a higher message processing rate.
Any ideas what might limit the overall performance? What is the bottleneck?
Here are some key values during workload:
Machine 1:
CPU load average: 0.75
CPU Utilization: System 12%, User 13%, Wait 5%
Disk throughput: 1 MB/s (write), almost no reads
average tps (as reported by iostat): 200
network: 500 kB/s in, 300 kB/s out, 1600 packets/s in, 1600 packets/s out
Machine 2:
CPU load average: 0.25
CPU Utilization: System 3%, User 15%, Wait 17%
Disk throughput: 4.5 MB/s (write), 3.5 MB/s (read)
average tps (as reported by iostat): 190 (very short peaks to 1000-1500)
network: 250 kB/s in, 800 kB/s out, 1100 packets/s in, 1100 packets/s out
So to me, none of the values seems to be at any limit.
PS: for testing of course the message queue is always full, so that both machines have enough work to do.
To find bottlenecks you typically also need to measure INSIDE the application. That means profiling the Java application code and possibly what happens inside Oracle.
The good news is that you have excluded at least some possible hardware bottlenecks.
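Before reaching for a full profiler, even coarse timing around the main processing stages can show where the time goes per message. A minimal sketch with hypothetical stage names (the real application presumably splits the work differently):

public class MessageTimer {

    // Hypothetical stages of processing one message; replace with the real steps.
    public void process(String message) {
        long t0 = System.nanoTime();
        parse(message);
        long t1 = System.nanoTime();
        queryDatabase(message);          // the synchronous round trip to the Oracle DB on machine 2
        long t2 = System.nanoTime();
        writeResult(message);
        long t3 = System.nanoTime();

        System.out.printf("parse=%.2fms db=%.2fms write=%.2fms total=%.2fms%n",
                (t1 - t0) / 1e6, (t2 - t1) / 1e6, (t3 - t2) / 1e6, (t3 - t0) / 1e6);
    }

    private void parse(String message) { /* ... */ }
    private void queryDatabase(String message) { /* ... */ }
    private void writeResult(String message) { /* ... */ }
}

If the per-message total turns out to be dominated by the database round trip while both machines stay mostly idle, the limit is often per-call latency combined with too little concurrency, rather than CPU or I/O capacity on either machine.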

Resources