Quarkus: load test and memory utilisation

I created a simple service that performs 4 HTTP calls and 4 DB calls to collect some data and return it as JSON in the HTTP response.
When I start the application (native, no Docker), I see it consumes 7MB, sometimes 15MB, sometimes 30MB. Good.
Then I load test it, sending 1 request every 10 milliseconds, 100 requests in total.
I noticed the memory consumption goes to 200MB right away, and after 5-6 more tests to 400MB (as much as the Spring Boot version of it takes).
The question is: is this expected?
Shouldn't the native version try to be minimalist about memory and clean up after itself, at least after n minutes or so? Are there settings for that?

I'm no expert on GraalVM native images, but from a quick local test with a "Hello world" app, I'm seeing numbers similar to yours. It seems that if you don't set a limit on the max heap size, one is chosen automatically (I'm not sure about the criteria). You can start your app with -XX:+PrintGC to see memory activity. Then you can experiment with -Xmx (e.g. -Xmx32m to set the max heap size to 32 MB). The more memory you give the app, the less frequent garbage collection cycles will be. This article shows some more interesting GC options: https://e.printstacktrace.blog/graalvm-heap-size-of-native-image-how-to-set-it/
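To see what limit was actually picked, the application can report its own heap figures. The following is only a minimal sketch: the class name HeapReport and its helper methods are made up for illustration, while Runtime.maxMemory(), totalMemory() and freeMemory() are standard Java API. It assumes these calls behave in a native image as they do on a regular JVM, which is worth verifying on your build.

```java
public class HeapReport {
    // Converts a byte count to whole mebibytes.
    static long toMiB(long bytes) {
        return bytes / (1024 * 1024);
    }

    // Builds a one-line summary of the heap as the runtime sees it.
    static String report() {
        Runtime rt = Runtime.getRuntime();
        long max = rt.maxMemory();          // upper bound the heap may grow to
        long committed = rt.totalMemory();  // heap currently reserved by the runtime
        long used = committed - rt.freeMemory();
        return String.format("max=%dMiB committed=%dMiB used=%dMiB",
                toMiB(max), toMiB(committed), toMiB(used));
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```

Running it once without -Xmx and once with, say, -Xmx32m should show whether the option took effect, since the "max" figure should drop accordingly.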

Related

Java buildpack memory calculation

The Java buildpack memory calculator, with a Spring Boot application inside a Docker container with 1GB of memory, calculates memory as described in the documentation: it takes the entire available memory. These are the calculated JVM options:
Calculated JVM Memory Configuration: -XX:MaxDirectMemorySize=10M -Xmx747490K -XX:MaxMetaspaceSize=157725K -Xss1M (Total Memory: 1G, Thread Count: 50, Loaded Class Count: 25433, Headroom: 0%)
The question is: why does it take the entire available memory and give it to the JVM? It should leave some memory for the Java process outside of the JVM. This can lead to OOM because the JVM thinks it has 1GB for itself (747490K for heap), and in reality it has less because some of its memory is used by native memory, outside of the JVM.
Should I stop using this calculator and set the JVM configuration myself, or can I reconfigure it somehow?
"Question is why does it take the entire available memory and give it to the JVM?"
The assumption is that the only thing running in your container is your Java application, thus it assigns all of the available memory to be used.
If you do things like shell out and run other processes, or run other processes in the container, you need to tell the memory calculator so it can take that into account.
"This can lead to OOM because the JVM thinks it has 1GB for itself (747490K for heap), and in reality it has less because some of its memory is used by native memory, outside of the JVM."
The memory calculator takes into consideration the major memory regions within a Java process. Not just heap. That said, it cannot 100% guarantee that you will never go over your memory limit. That's impossible with a Java app.
There are things you can do as an application developer, like creating 10,000 threads or using JNI, that cannot be restricted and could potentially consume a whole ton of memory. If you do that, your app will go over its container memory limit and crash.
The memory calculator attempts to give you a reasonable memory configuration for the most common Java workloads: running a web app, running a microservice, running some batch jobs, etc.
If you are doing something that doesn't fit within that pattern, then you can simply tell the memory calculator and it'll adjust things accordingly.
"Should I not use this calculator and set JVM configuration by myself or I can reconfigure this somehow?"
Even if you need to customize what the calculator is doing, it can be helpful. It's additional toil to calculate these values manually, especially when it's so easy to change the memory limits. If your ops team increases the memory limit of the container, you want your application to automatically adjust to that configuration (as well as it can).
Beyond that, the memory calculator is also good at detecting problems early. If you configure the JVM manually and you mess it up, let's say you over-allocate memory, the JVM won't necessarily care until it tries to get more memory and can't. At some point down the road, you're going to have a problem but it's not clear when (probably at 3am on a Sat, lol).
With the memory calculator, the math is done when your container first starts to make sure that the memory settings are sane. If there's something off with the configuration, it'll fail and let you know.
TIPS:
You can override a memory calculator-defined value by simply setting that JVM option in the JAVA_TOOL_OPTIONS env variable. For example, if I want to allow for more direct memory, I would set JAVA_TOOL_OPTIONS='-XX:MaxDirectMemorySize=50M'. Then when you restart the container, the memory calculator will shift memory around to accommodate that.
The one thing you don't want to set is -Xmx. The memory calculator should always set this because it will set it to whatever is left after other regions have been accounted for. You can think of it like HEAP = CONTAINER_MEMORY_LIMIT - (all static memory regions).
If you were to set -Xmx, you have to get it exactly right. If it's too low then you're wasting memory. If it's too high then you could exceed the container memory limit and get crashes.
In short, if you think you want to set -Xmx, you should either increase the container memory limit or decrease one of the static memory regions.
If you run other things in the container, you need to set the headroom. This is done with the BPL_JVM_HEAD_ROOM env variable. Give it a percent of the total container memory limit. Ex: BPL_JVM_HEAD_ROOM=20 would use 80% of the container's memory limit for Java and 20% for other stuff.
Setting some headroom can be useful in other cases as well, like if you're troubleshooting a container crash and you want a little extra room, or if you don't like operating at 100% the memory limit. You can leave 5 or 10% unused to match your comfort level.
If you have an application that uses a lot of threads, you'll need to adjust the thread count as well. The default is 250 threads, which works well for many web/servlet-based applications (thread per request model). We do automatically lower it to 50 threads if you're specifically using Spring WebFlux, which does not need so many threads.
For other cases, it's up to you to configure this. For example, if you have a batch application that only needs a thread pool of 10, then you could set this to 40 or 50. 40-50 seems weird in this example, but the JVM creates a number of its own threads and you need to account for those in addition to application-specific threads. When in doubt, look at a thread dump.
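The rule of thumb from the tips above, HEAP = CONTAINER_MEMORY_LIMIT - (all static memory regions), can be checked against the numbers in the question's log line. The sketch below is illustrative arithmetic only, not the memory calculator's actual code: the class and method names are invented, the way headroom is subtracted is an assumption, and the "other" figure is simply whatever remains after the regions named in the log are accounted for (presumably code cache and similar, which the log does not itemize).

```java
public class HeapFromLimit {
    // HEAP = limit - headroom - static regions; all values in KiB.
    // Assumption: headroom is taken as a percentage of the container limit.
    static long heapKiB(long limitKiB, int headroomPercent, long staticRegionsKiB) {
        long headroomKiB = limitKiB * headroomPercent / 100;
        return limitKiB - headroomKiB - staticRegionsKiB;
    }

    public static void main(String[] args) {
        long limitKiB = 1024L * 1024;     // "Total Memory: 1G"
        long directKiB = 10L * 1024;      // -XX:MaxDirectMemorySize=10M
        long metaspaceKiB = 157_725;      // -XX:MaxMetaspaceSize=157725K
        long stacksKiB = 50L * 1024;      // -Xss1M times "Thread Count: 50"
        long reportedHeapKiB = 747_490;   // -Xmx747490K

        // Whatever the log does not itemize, inferred by subtraction.
        long otherKiB = limitKiB - reportedHeapKiB - directKiB - metaspaceKiB - stacksKiB;
        long staticKiB = directKiB + metaspaceKiB + stacksKiB + otherKiB;

        System.out.println("other regions (inferred): " + otherKiB + "K");
        System.out.println("heap, 0% headroom:  " + heapKiB(limitKiB, 0, staticKiB) + "K");
        // Hypothetical: the same container with BPL_JVM_HEAD_ROOM=20.
        System.out.println("heap, 20% headroom: " + heapKiB(limitKiB, 20, staticKiB) + "K");
        // Hypothetical: 250 threads instead of 50 adds 200 more 1M stacks.
        System.out.println("heap, 250 threads:  "
                + heapKiB(limitKiB, 0, staticKiB + 200L * 1024) + "K");
    }
}
```

The point of the exercise is that every static region you grow (threads, direct memory, headroom) comes straight out of the heap, which is why the answer recommends letting the calculator derive -Xmx rather than setting it by hand.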

Spring Boot High Heap Usage

We have a Spring Boot application that runs on 20 servers, and we have a load balancer that redirects the requests to our servers.
Since last week we have been having huge problems with CPU usage (100% on all VMs) almost simultaneously, without any noticeable increase in incoming requests.
Before that, we had 8 VMs without any issue.
In peak hours we have 3-4 thousand users with 15-20k requests per 15 minutes.
I am sure it has something to do with the heap usage since all the CPU usage comes from the GC that tries to free up some memory.
At the moment, we have isolated some requests that we thought might cause the problem to specific VMs, and shown that those VMs are stable even though there is traffic. (In the picture below you can see the OLD_gen memory is stable in those VMs)
The Heap memory looked something like this
The memory continues to increase, and there are two scenarios: either it reaches a point and after 10 minutes drops on its own to 500MB, or it stays high, causes 100% CPU usage, and remains there forever.
From the heap dumps that we have taken, it seems that most of the memory has been allocated in char[] instances.
We have used a variety of tools (VisualVM, JProfiler etc) in order to try to debug the issue without any luck.
I don't know if I am missing something obvious, or something else.
I also tried changing the GC algorithm from the default to G1 and disabling the Hibernate query plan cache, since a lot of our queries use the IN parameter for filtering.
UPDATE
We managed to reduce the number of requests to our most heavily used API call, and the old gen looks like this now. Is that normal?
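One way to put a number on the suspicion that the CPU time is going to garbage collection is to read the JVM's own GC counters from inside the application. This is a rough sketch, not part of the original post: the class GcTimeProbe and its methods are hypothetical names, while ManagementFactory.getGarbageCollectorMXBeans() and the MXBean getters are standard java.lang.management API. The reported collection time is approximate, and for collectors that work concurrently it is not the same thing as pause time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcTimeProbe {
    // Sums the approximate accumulated collection time of all collectors, in ms.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    // Rough share of JVM uptime spent collecting since startup.
    static double gcFraction() {
        long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();
        return uptimeMillis <= 0 ? 0.0 : (double) totalGcMillis() / uptimeMillis;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: count=%d time=%dms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        System.out.printf("approximate GC share of uptime: %.2f%%%n", gcFraction() * 100);
    }
}
```

Logging these counters periodically on a healthy VM and on a struggling one would show whether the old-generation collector's count and time climb sharply when the CPU hits 100%.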

Making an I/O application more CPU efficient

I have an application that takes files from one place and moves them to another place. Pretty much all this application does is check whether files are in S3 and download the ones that are not to another S3. Currently the application uses very low amounts of the provided CPU. From this post, my understanding is that this is to be expected (seeing as my app is pretty much I/O and nothing more).
My initial idea was to lower the number of CPUs provided to the app. However, providing less and less negatively impacted the speed at which my app performs its duties (which according to this article kind of makes sense: less CPU means less total clock speed). This is not an option as it needs to run somewhat fast.
I am using Kafka messages to start my app. So another idea of mine was to increase the number of partitions in the topic from which my app consumes messages (so that I can increase the number of threads that can run concurrently). That allowed me to reduce the number of CPUs that I provide to my app (while maintaining the desired processing speed), but my app still uses very low amounts of CPU.
My app runs in Kubernetes, whose cluster is deployed to EC2 instances, if that makes any difference. My app is Spring Boot Java. I tried to give it only a minimum number of CPUs while maxing out the number of concurrent threads in my app, but again I can see a lot of wasted CPU there.
My question is then as follows: is it possible to somehow make an application use all available CPU (thus making it more efficient) in this scenario? Is there a config or a method or something that does that? Or, for an app that checks that data is present and downloads data somewhere else, is this expected behavior: increasing the available resources improves the speed at which my app runs, but the downside is that there will be wasted CPU? (So am I in the classic "good comes with the bad" sort of situation here?)

Uncaught Exception java.lang.OutOfMemoryError: "unable to create new native thread" error occurring while running JMeter in non-GUI mode

My scenario:
Step 1: I have set my thread group to 1000 threads and 500 seconds.
Step 2: Configure heap space: HEAP=-Xms1024m -Xmx1024m
Step 3: Run JMeter in non-GUI mode.
In this scenario, the error "Uncaught Exception java.lang.OutOfMemoryError: unable to create new native thread" occurs on my system.
My system configuration:
Processor: Intel® Pentium(R) CPU G2010 @ 2.80GHz × 2
OS Type: 32 bit
Disk: 252.6GB
Memory: 3.4 GiB
Kindly give me a solution for this scenario.
Thanks,
Vairamuthu.
You don't have enough memory in your machine to consume 1000 threads. It is clearly visible from the error that your machine can not create 1000 threads. You should tweak your machine to resolve this situation.
You have to consider these points:
JMeter is a Java tool; it runs on the JVM. To obtain maximum capability, we need to provide maximum resources to JMeter during execution. First, we need to increase the heap size (inside the JMeter bin directory, in jmeter.bat/sh):
HEAP=-Xms512m -Xmx512m
This means the allocated heap size is a minimum of 512MB and a maximum of 512MB. Configure it according to your own PC configuration. Keep in mind that the OS also needs some amount of memory, so don't allocate all of your physical RAM.
Then, set the size of the young generation:
NEW=-XX:NewSize=128m -XX:MaxNewSize=512m
This means the young generation starts at 128MB and can grow up to 512MB. You should be careful, because if your load generation is very high at the beginning, this might need to be increased. Keep in mind that if the range is too broad, it will fragment your heap space inside the JVM, and the garbage collector then needs to work harder to clean up.
JMeter is a Java GUI application, and the GUI is very resource intensive (CPU/RAM). It also has a non-GUI mode. If we run JMeter in non-GUI mode, it consumes fewer resources and we can run more threads.
Disable all the Listeners during the Test Run. They are only for debugging and use them to design your desired script.
Listeners should be disabled during load tests. Enabling them causes additional overheads, which consume valuable resources (more memory) that are needed by more important elements of your test.
Always try to use up-to-date software. Keep your Java and JMeter updated.
Don’t forget that storing request and response headers, assertion results and response data can consume a lot of memory! So try not to store these values in JMeter unless it’s absolutely necessary.
Also, you need to monitor whether your machine's memory consumption and CPU usage stay below 80%. If they exceed 80%, consider the results of those tests unreliable.
After all of this, if you still can't generate 1000 threads from your machine, then you should try Distributed Load Testing.
Here is a document for JMeter Distributed Testing Step-by-step.
For a better and more detailed understanding, these two blogs should help: How many users JMeter can support? and 9 Easy Solutions for a JMeter Load Test “Out of Memory” Failure.
I have also found this article very helpful to understand and how to handle them.
The error is due to lack of free RAM.
Looking into your hardware, it doesn't seem you will be able to produce the load of 1k users so I would recommend reconsidering your approach.
For example, you anticipate 1000 simultaneous users working with your application. However, that doesn't necessarily mean 1000 concurrent threads hammering it at once, as:
real users don't hammer application non-stop, they need some time to "think" between operations, this "think time" differs depending on application nature, but you should keep it as close to reality as possible
application response time should be added to think time
So given you have 1000 users, each of whom "thinks" 10 seconds between operations, and an application response time of 2 seconds, each user will be able to send 5 requests per minute (60 / (10 + 2)).
Assuming the above scenario, 1000 users will send 5000 requests per minute, which gives us ~83 requests per second, and that seems achievable with your current hardware.
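The arithmetic above is easy to redo for other think times and response times. The helper below is just an illustrative sketch with made-up names (ThinkTimeLoad and its methods); it encodes the same formula, 60 / (think time + response time) requests per minute per user.

```java
public class ThinkTimeLoad {
    // Requests one user can issue per minute, given think and response time in seconds.
    static double requestsPerMinutePerUser(double thinkSeconds, double responseSeconds) {
        return 60.0 / (thinkSeconds + responseSeconds);
    }

    // Aggregate throughput the whole user population generates, in requests per second.
    static double requestsPerSecond(int users, double thinkSeconds, double responseSeconds) {
        return users * requestsPerMinutePerUser(thinkSeconds, responseSeconds) / 60.0;
    }

    public static void main(String[] args) {
        // Values from the example: 1000 users, 10 s think time, 2 s response time.
        System.out.printf("per user: %.1f requests/minute%n",
                requestsPerMinutePerUser(10, 2));
        System.out.printf("total:    %.1f requests/second%n",
                requestsPerSecond(1000, 10, 2));
    }
}
```

The resulting requests-per-second figure is the kind of target you would feed into a throughput-oriented timer instead of running one thread per simulated user.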
So if you are not in a position to get more powerful hardware, or more similar machines to use JMeter in distributed mode, the options are:
Add "think times" between operations using e.g. Constant Timer or Uniform Random Timer
Change your test scenario logic to simulate "requests per second" rather than "concurrent users". You can do it using Constant Throughput Timer or Throughput Shaping Timer.
Your issue is due to using a 32-bit OS. In this mode you are limited both in what you can allocate as heap (depending on the OS you will not be able to exceed 1.6 to 2.1 GB) and in native thread creation.
I'd suggest switching to a 64-bit OS + 64-bit JDK.
But if you don't have any other option try setting in jmeter.sh in JVM_ARGS:
-Xss128k
Or if too low:
-Xss256k
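To see why a smaller -Xss helps, it is enough to multiply the thread count by the per-thread stack size. The sketch below is back-of-the-envelope arithmetic only, with hypothetical names (ThreadStackBudget, stackReservationMiB); it ignores the JVM's own threads and every other native region, and the 512k row is included purely for comparison, not as a claim about any platform's default.

```java
public class ThreadStackBudget {
    // Address space reserved for thread stacks, in MiB, for a given -Xss value in KiB.
    static long stackReservationMiB(int threads, long stackKiB) {
        return threads * stackKiB / 1024;
    }

    public static void main(String[] args) {
        int threads = 1000;   // thread group size from the question
        long heapMiB = 1024;  // -Xms1024m -Xmx1024m from the question
        long[] stackSizesKiB = {128, 256, 512};

        for (long stackKiB : stackSizesKiB) {
            long stacksMiB = stackReservationMiB(threads, stackKiB);
            System.out.printf("-Xss%dk: stacks=%dMiB, heap+stacks=%dMiB%n",
                    stackKiB, stacksMiB, heapMiB + stacksMiB);
        }
    }
}
```

Thread stacks live outside the Java heap, so on a 32-bit process, which has at most 4 GiB of address space and usually less that is usable, a large heap and a thousand stacks compete for the same limited room.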

How to gain control of a 5GB heap in Haskell?

Currently I'm experimenting with a little Haskell web server written in Snap that loads a lot of data and makes it available to the client. And I have a very, very hard time gaining control over the server process. At random moments the process uses a lot of CPU for seconds to minutes and becomes unresponsive to client requests. Sometimes memory usage spikes (and sometimes drops) by hundreds of megabytes within seconds.
Hopefully someone has more experience with long running Haskell processes that use lots of memory and can give me some pointers to make the thing more stable. I've been debugging the thing for days now and I'm starting to get a bit desperate here.
A little overview of my setup:
On server startup I read about 5 gigabytes of data into a big (nested) Data.Map-like structure in memory. The nested map is value strict and all values inside the map are of datatypes with all their fields made strict as well. I've put a lot of time into ensuring no unevaluated thunks are left. The import (depending on my system load) takes around 5-30 minutes. The strange thing is that the fluctuation between consecutive runs is way bigger than I would expect, but that's a different problem.
The big data structure lives inside a 'TVar' that is shared by all client threads spawned by the Snap server. Clients can request arbitrary parts of the data using a small query language. The amount of data requested is usually small (up to 300kb or so) and only touches a small part of the data structure. All read-only requests are done using a 'readTVarIO', so they don't require any STM transactions.
The server is started with the following flags: +RTS -N -I0 -qg -qb. This starts the server in multi-threaded mode and disables idle-time and parallel GC. This seems to speed up the process a lot.
The server mostly runs without any problem. However, every now and then a client request times out and the CPU spikes to 100% (or even over 100%) and keeps doing this for a long while. Meanwhile the server does not respond to requests anymore.
There are few reasons I can think of that might cause the CPU usage:
The request just takes a lot of time because there is a lot of work to be done. This is somewhat unlikely because sometimes it happens for requests that have proven to be very fast in previous runs (with fast I mean 20-80ms or so).
There are still some unevaluated thunks that need to be computed before the data can be processed and sent to the client. This is also unlikely, for the same reason as the previous point.
Somehow garbage collection kicks in and starts scanning my entire 5GB heap. I can imagine this can take up a lot of time.
The problem is that I have no clue how to figure out what is going on exactly and what to do about it. Because the import process takes such a long time, profiling results don't show me anything useful. There seems to be no way to conditionally turn the profiler on and off from within code.
I personally suspect the GC is the problem here. I'm using GHC 7, which seems to have a lot of options to tweak how GC works.
What GC settings do you recommend when using large heaps with generally very stable data?
Large memory usage and occasional CPU spikes is almost certainly the GC kicking in. You can see if this is indeed the case by using RTS options like -B, which causes GHC to beep whenever there is a major collection, -t which will tell you statistics after the fact (in particular, see if the GC times are really long) or -Dg, which turns on debugging info for GC calls (though you need to compile with -debug).
There are several things you can do to alleviate this problem:
On the initial import of the data, GHC is wasting a lot of time growing the heap. You can tell it to grab all of the memory you need at once by specifying a large -H.
A large heap with stable data will get promoted to an old generation. If you increase the number of generations with -G, you may be able to get the stable data to be in the oldest, very rarely GC'd generation, whereas you have the more traditional young and old heaps above it.
Depending on the memory usage of the rest of the application, you can use -F to tweak how much GHC will let the old generation grow before collecting it again. You may be able to tweak this parameter so that the old generation is effectively never garbage collected.
If there are no writes, and you have a well-defined interface, it may be worthwhile making this memory un-managed by GHC (use the C FFI) so that there is no chance of a super-GC ever.
These are all speculation, so please test with your particular application.
I had a very similar issue with a 1.5GB heap of nested Maps. With the idle GC on by default I would get 3-4 secs of freeze on every GC, and with the idle GC off (+RTS -I0), I would get 17 secs of freeze after a few hundred queries, causing a client time-out.
My "solution" was first to increase the client time-out and ask people to tolerate that, while 98% of queries took about 500ms, about 2% of the queries would be dead slow. However, wanting a better solution, I ended up running two load-balanced servers and taking them offline from the cluster for performGC every 200 queries, then putting them back in action.
Adding insult to injury, this was a rewrite of an original Python program, which never had such problems. In fairness, we did get about 40% performance increase, dead-easy parallelization and a more stable codebase. But this pesky GC problem...
