Is a Native Quarkus Application better than one on the JVM? - quarkus

I am comparing the same Quarkus application packaged as different executables: regular JAR, fast-jar, and native executable. To be able to compare them, I run the same performance test against each.
The results are the following:
The regular JAR starts in 0.904s. Regarding performance, the result is given below:
Running 1m test @ http://localhost:8080/hello
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   361.86us    4.48ms 155.85ms   99.73%
    Req/Sec    29.86k      4.72k   37.60k    87.83%
  3565393 requests in 1.00m, 282.22MB read
Requests/sec:  59324.15
Transfer/sec:      4.70MB
The fast-jar starts in 0.590s. Regarding performance, the result is given below:
Running 1m test @ http://localhost:8080/hello
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   344.38us    3.89ms 142.71ms   99.74%
    Req/Sec    27.21k      6.52k   40.67k    73.48%
  3246932 requests in 1.00m, 257.01MB read
Requests/sec:  54025.50
Transfer/sec:      4.28MB
The native executable starts in 0.011s. Regarding performance, the result is given below:
Running 1m test @ http://localhost:8080/hello
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   303.72us  471.86us  29.00ms   98.05%
    Req/Sec    19.03k      3.21k   30.19k    78.75%
  2272236 requests in 1.00m, 179.86MB read
Requests/sec:  37867.20
Transfer/sec:      3.00MB
The native application processes roughly 1 million fewer requests over the test run than the JVM-based Quarkus application. However, its startup time and its latency average and standard deviation are better than the others.
I was wondering why this happens and whether a native application is better than one running on the JVM.
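For context, the three executables and the load test above are typically produced and run with commands along these lines (the exact commands, the packaging property, and the wrk invocation are assumptions inferred from the numbers above, not copied from the original setup):
./mvnw package                                    # regular runner JAR
./mvnw package -Dquarkus.package.type=fast-jar    # fast-jar layout
./mvnw package -Pnative                           # GraalVM native executable
wrk -t2 -c10 -d1m http://localhost:8080/hello     # 2 threads, 10 connections, 1 minute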

Startup time and memory consumption will definitely be better with native Quarkus applications. This is because Quarkus builds on GraalVM's native-image concept.
From https://www.graalvm.org/reference-manual/native-image/
native-image is a utility that processes all classes of an application and their dependencies, including those from the JDK. It statically analyzes these data to determine which classes and methods are reachable during the application execution. Then it ahead-of-time compiles that reachable code and data to a native executable for a specific operating system and architecture.
As the application is compiled ahead of time and the runtime it embeds (aka Substrate VM) contains only the essential parts, the resulting program has a faster startup time and lower runtime memory overhead compared to running on a JVM.
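As an illustration of what the quoted description refers to, the same ahead-of-time step can be driven directly on a runnable JAR with the GraalVM tool (the JAR name here is just a placeholder; Quarkus runs the equivalent step for you when you build with its native profile):
native-image -jar target/myapp-runner.jar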

Related

Reduce latency in P99 response time in low TPS application - Jboss application server

I'm looking for ways to reduce latency / high response times at P99. The application is running on the JBoss application server. The current configuration of the system is 0.5 core and 2 GB of memory.
I suspect low TPS might be the reason for the higher P99s, because current usage of the application at peak traffic is 0.25 core, averaging "0.025 core". Old-gen GC pauses are running at about 1s. Heap settings are -Xmx1366m -Xms512m, with metaspace at 250 MB.
Right now we have the Parallel GC policy; will the G1GC policy help?
What else should I consider?

JMeter - Throughput Shaping Timer does not keep the requests/sec rate

I am using the Ultimate Thread Group with a fixed count of 1020 threads for the entire test duration of 520 seconds.
I've made a throughput diagram as follows:
The load increases over 10 seconds, so the spikes shouldn't be very steep. Since the max RPS is 405 and the max response time is around 25000 ms, 1020 threads should be enough.
However, when I run the test (jmeter -t spikes-nomiss.jmx -l spikes-nomiss.csv -e -o spikes-nomiss -n) I get the following graph for hits/second.
The threads stop for a few seconds and then suddenly 'wake up'. I can't find a reason for it. The final minute has a much higher frequency of calls. I've set the heap size to 2 GB and resources are available: CPU usage does not exceed 50% during peaks, and memory is around 80% (4 GB of RAM on the machine). I'm seeking any help to fix the freezes.
Make sure to monitor JMeter's JVM using JConsole, as it might be the case that JMeter is not capable of creating the spikes due to insufficient resources. The slowdowns can be caused by excessive garbage collection.
It might also be the case that 1020 threads are not enough to reach the desired throughput, as it depends mainly on your application's response time. If your application's response time is higher than roughly 2.5 seconds, you will not be able to get 405 RPS using 1020 threads (see the worked example below). It might be a better idea to consider using the Concurrency Thread Group, which can be connected to the Throughput Shaping Timer via the Schedule Feedback Function.
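As a quick sanity check with Little's Law (concurrent threads ~= target throughput x response time), assuming each thread fires its next request as soon as the previous one completes:
405 req/s x 2.5 s  ~= 1013 threads  -> 1020 threads are only just enough while responses take ~2.5 s
405 req/s x 25 s   ~= 10125 threads -> far more than 1020 at the 25 s worst-case response time
1020 threads / 25 s ~= 41 req/s     -> the most 1020 threads can deliver while responses take 25 s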

Improve erlang cowboy performance

We have been using Cowboy in production on our Compute Engine machines on GCP, and we started benchmarking and improving the performance of our service to handle more reqs/sec (in our case, since we are in adtech, it is bids/sec).
After isolating and handling a lot of the issues separately, we came down to Cowboy optimization. These are our current findings and limitations:
Cowboy setup
We are using Cowboy 2.5 with 200 acceptors and a max backlog of 1024:
init(Req, _State) ->
    T1 = erlang:monotonic_time(),
    {ok, BRjson, _} = cowboy_req:read_body(Req),
    %% ---- rest of work goes here but is switched off for our test ---
    erlang:send_after(60, self(), {'RSP', x, no_workers}),
    %% switch to a loop handler and hibernate until the delayed 'RSP' message arrives
    {cowboy_loop, Req, #state{t1 = T1}, hibernate}.
Erlang VM
OTP 21
VM args: -smp auto +P 134217727 +K true +A 64 -rate 1200 +stbt db +scl false +sfwi 500 +spp true +zdbbl 8092
Load
JSON requests are ~4 KB in size. Testing is done using a separate machine on the same internal network (no SSL) using JMeter. All requests are POST with keep-alive.
Servers
GCP Compute Engine with 10 vCPU cores and 14 GB RAM (now; tested before with 4 vCPUs)
Findings
We are able to reach ~1900 reqs/sec, but all CPU cores in htop show almost 80% utilization.
At 1000 reqs/sec we see CPU utilization of 45-50% per core (still high, bearing in mind that no other part of our application is running).
Note: using the 4 vCPU machine we were able to get close to 700 reqs/sec, and memory in all of our tests is barely utilized or changing with load.
QUESTION: How to improve cowboy's performance in terms of cpu usage?
First off, thanks @Pouriya for the suggestions--actually, discussing this back and forth made me go back and re-check one of my comments about the right tool for the job. PS: we are on GCP, so 72 cores would be out of the question at this stage.
Cowboy is great! But it does add a bit of overhead in the critical path of each request--a feature (or issue, in my case) that is not needed.
We tested again with Elli (https://github.com/elli-lib/elli), but built a proper testing setup this time, and it provided an improvement of up to 20%, exactly what we needed!
If anyone on the Cowboy/Ranch team has a way of drastically reducing the CPU overhead, I will gladly test it, since we still use Cowboy in our APIs, just not in the critical path.
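For anyone curious what the swap looks like in practice, a minimal Elli callback roughly equivalent to the Cowboy init above could be sketched like this (module name, port and response body are illustrative, not taken from our production code):
-module(bid_handler).
-behaviour(elli_handler).
-export([handle/2, handle_event/3]).

handle(Req, _Args) ->
    _T1 = erlang:monotonic_time(),       %% same timing point as in the Cowboy handler
    _BRjson = elli_request:body(Req),    %% read the ~4 KB JSON bid request
    %% ---- rest of the bidding work would go here ---
    {200, [], <<"ok">>}.

%% Elli reports timings and errors through this callback rather than in the request path.
handle_event(_Event, _Data, _Args) ->
    ok.
It can be started with something like elli:start_link([{callback, bid_handler}, {port, 8080}]).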

jmeter performance analysis

I am running a performance test against the perf environment.
Below are the results:
CPU Utilization
Server       Apdex        Resp. time   Throughput   Error Rate   CPU usage   Memory
per001205    0.97 [0.5]   220 ms       2,670 rpm    0.0009 %     493.00%     2.2 GB
per001206    0.95 [0.5]   280 ms       2,670 rpm    0.0043 %     516.00%     2.4 GB
per011079    0.83 [0.5]   526 ms       2,670 rpm    0.0034 %     598.00%     2.5 GB
per011080    0.67 [0.5]   1,110 ms     2,670 rpm    0.0026 %     639.00%     2.6 GB
Can you comment on the average response time? Is it acceptable?
I can see that CPU usage is more than 100%; is that dangerous?
How should I improve this? I am running it with 250 users.
First of all, check out the CPU usage mismatch or usage over 100% article.
Consider another monitoring method, i.e. go to the hosts directly and check CPU usage via your operating system's built-in commands (see the example below), or use the JMeter PerfMon plugin, to either confirm the picture or get an alternative view of the CPU load. Depending on the result you have 2 options:
Either the individual servers' CPU usage is acceptable, and you can decide whether the throughput is good or not,
Or you need to fix the issue in your application code: using profiling tools for the programming language your application is written in, detect the most CPU-intensive functions and refactor them to be less processor-time-hungry.
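As an illustration of the "operating system built-in commands" route on Linux (interval and sample counts are arbitrary examples):
mpstat -P ALL 5 3       # per-core CPU utilization, three 5-second samples
top -b -n 1 | head -20  # one batch-mode snapshot of load averages and the busiest processes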

What can be the reason for CPU load to NOT scale linearly with the number of traffic processing workers?

We are writing a Front End that is supposed to process a large volume of traffic (in our case it is Diameter traffic, but that may be irrelevant to the question). As a client connects, the server socket gets assigned to one of the Worker processes that perform all the actual traffic processing. In other words, the Worker does all the work, and more Workers should be added when more clients get connected.
One would expect the CPU load per message to be the same for different numbers of Workers, because the Workers are totally independent and serve different sets of client connections. Yet our tests show that it takes more CPU time per message as the number of Workers grows.
To be more precise, the CPU load depends on the TPS (Transactions or Request-Responses per second) as follows.
For 1 Worker:
60K TPS - 16%, 65K TPS - 17%... i.e. ~0.26% CPU per KTPS
For 2 Workers:
80K TPS - 29%, 85K TPS - 30%... i.e. ~0.35% CPU per KTPS
For 4 Workers:
85K TPS - 33%, 90K TPS - 37%... i.e. ~0.41% CPU per KTPS
What is the explanation for this? Workers are independent processes and there is no inter-process communication between them. Also each Worker is single-threaded.
The programming language is C++
This effect is observed on any hardware close to this one: 2 Intel Xeon CPUs, 4-6 cores, 2-3 GHz
OS: RedHat Linux (RHEL) 5.8, 6.4
CPU load measurements are done using mpstat and top.
If either the size of the program code used by a worker or the size of the data processed by a worker (or both) is non-small, the reason could be the reduced effectiveness of the various caches: The locality-over-time of how a single worker accesses its program code and/or its data is disturbed by other workers intervening.
The effect can be quite complicated to understand, because:
it depends massively on the structure of your code's computations,
modern CPUs have about three levels of cache,
each cache has a different size,
some caches are local to one core, others are not,
how often the workers intervene depends on your operating system's scheduling strategy
which gets even more complicated if there are multiple cores,
unless your programming language's run-time system also intervenes,
in which case it is more complicated still,
your network interface is a computer of its own and has a cache, too,
and probably more.
Caveat: Given the relatively coarse granularity of process scheduling, the effect of this ought not to be as large as it is, I think.
But then: Have you looked up how "percent of CPU" is even defined?
Until you reach CPU saturation on your machine you cannot be sure that the effect is actually as large as it looks. And when you do reach saturation, it may not be the CPU at all that is the bottleneck here, so are you sure you need to care about CPU load?
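One illustrative way to probe the cache-interference explanation above is to pin each Worker to its own core and re-measure the CPU per KTPS (the binary name and core numbers are placeholders):
taskset -c 0 ./worker &   # pin the first Worker to core 0
taskset -c 1 ./worker &   # pin the second Worker to core 1
If pinning flattens the growth in CPU per KTPS, the cache/scheduling-interference explanation becomes more plausible.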
I completely agree with @Lutz Prechelt. Here I just want to add a method for investigating the issue, and the answer is perf.
perf is a performance-analysis tool on Linux which collects both kernel and userspace events and provides some nice metrics. It's been widely used in my team to find bottlenecks in CPU-bound applications.
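A typical invocation that produces counters like the ones below is perf stat; the detailed flag (-d) or an explicit event list adds the cache counters (the exact flags here are examples, and which counters exist depends on the CPU and kernel):
perf stat -d ./cache_line_test 0 1 2 3
perf stat -e cycles,instructions,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./cache_line_test 0 1 2 3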
The output of perf looks like this:
Performance counter stats for './cache_line_test 0 1 2 3':
1288.050638 task-clock # 3.930 CPUs utilized
185 context-switches # 0.144 K/sec
8 cpu-migrations # 0.006 K/sec
395 page-faults # 0.307 K/sec
3,182,411,312 cycles # 2.471 GHz [39.95%]
2,720,300,251 stalled-cycles-frontend # 85.48% frontend cycles idle [40.28%]
764,587,902 stalled-cycles-backend # 24.03% backend cycles idle [40.43%]
1,040,828,706 instructions # 0.33 insns per cycle
# 2.61 stalled cycles per insn [51.33%]
130,948,681 branches # 101.664 M/sec [51.48%]
20,721 branch-misses # 0.02% of all branches [50.65%]
652,263,290 L1-dcache-loads # 506.396 M/sec [51.24%]
10,055,747 L1-dcache-load-misses # 1.54% of all L1-dcache hits [51.24%]
4,846,815 LLC-loads # 3.763 M/sec [40.18%]
301 LLC-load-misses # 0.01% of all LL-cache hits [39.58%]
It outputs your cache-miss rate, which makes it easy to tune your program and see the effect.
I wrote an article about cache-line effects and perf; you can read it for more details.
